Setting Up Tika's Extracting Request Handler
Some of this is covered in the Solr set-up guide linked below.
Sometimes indexing prepared text files (such as XML, CSV, or JSON) is not enough. There are numerous situations where you need to extract data from binary files: for example, indexing the actual contents of PDF files. To do that we can use Apache Tika, which comes built in with Apache Solr, via its ExtractingRequestHandler.
Preparation
You should have worked through the set-up for Solr prior to this point; the guide can be found at:
http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html
If you wish to have a fully functioning file or web crawler using Nutch that indexes to Solr, then follow the next steps of the guide at:
http://amac4.blogspot.co.uk/2013/07/configuring-nutch-to-crawl-urls.html
http://amac4.blogspot.co.uk/2013/07/setting-up-nutch-to-crawl-filesystem.html
http://amac4.blogspot.co.uk/2013/07/web-service-to-query-solr-rest.html
Set-Up Guide
- In the $SOLR_HOME/collection1/conf/solrconfig.xml file there is a section with the heading "Solr Cell Update Request Handler". The code there should be updated or replaced to read:
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>
- Create an "extract" folder anywhere in the system; one option is to put it in the solr_home folder. Then place the solr-cell-4.3.0.jar file from $SOLR/dist in it. Then copy the contents of the $SOLR/contrib/extraction/lib/ folder into your extract folder.
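The copy steps above can be sketched in Python; the function and the test paths are placeholders, so point the arguments at your own $SOLR/dist, $SOLR/contrib/extraction/lib/, and extract locations:

```python
import shutil
from pathlib import Path

def populate_extract_dir(solr_dist, contrib_lib, extract_dir):
    """Copy solr-cell and the Tika dependency jars into the extract folder.

    solr_dist    -- path to $SOLR/dist (contains solr-cell-4.3.0.jar)
    contrib_lib  -- path to $SOLR/contrib/extraction/lib/
    extract_dir  -- the extract folder to create and populate
    """
    extract = Path(extract_dir)
    extract.mkdir(parents=True, exist_ok=True)
    # The solr-cell jar from the dist folder.
    shutil.copy2(Path(solr_dist) / "solr-cell-4.3.0.jar", extract)
    # Every jar from the extraction contrib lib folder.
    for jar in Path(contrib_lib).glob("*.jar"):
        shutil.copy2(jar, extract)
    return sorted(p.name for p in extract.glob("*.jar"))
```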
- In the solrconfig.xml file, add code pointing to the directory you have chosen:
<lib dir="$SOLR_HOME/extract" regex=".*\.jar" />
- In the schema.xml file, the definition of the text field needs to be edited to read:
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
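Because the uprefix parameter maps unknown Tika fields to names starting with attr_, the schema also needs a matching dynamic field. The stock Solr 4.3 example schema already contains one; if yours does not, a definition along these lines (assuming the text_general type exists in your schema) can be added:

```xml
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
```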
- To test that it works, open a command prompt, navigate to any directory containing a PDF file, and execute the following command, replacing FILENAME with the file to be used:
curl "http://localhost:8080/solr/update/extract?literal.id=1&commit=true" -F "myfile=@FILENAME.pdf"
- If all has worked correctly, then output like the following should be displayed (the QTime value will vary):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">578</int>
</lst>
</response>
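The same request can also be issued from code. A minimal Python sketch, assuming Solr is running at http://localhost:8080/solr as in the set-up guide (the actual send step is shown only as a comment, since it needs a running server):

```python
from urllib.parse import urlencode

# Assumed Solr location from the set-up guide; adjust for your install.
SOLR_BASE = "http://localhost:8080/solr"

def extract_url(doc_id, commit=True):
    """Build the /update/extract URL with the literal.id and commit parameters."""
    params = {"literal.id": doc_id, "commit": str(commit).lower()}
    return f"{SOLR_BASE}/update/extract?{urlencode(params)}"

# To actually send a file, post it as multipart form data, e.g. with the
# third-party requests library:
#   requests.post(extract_url(1), files={"myfile": open("FILENAME.pdf", "rb")})
```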
Next Steps
You now have Solr configured properly and ready to use Tika to extract the data that you need. The next step is to configure Nutch, an open source web crawler that will crawl the web to find pages to index: http://amac4.blogspot.co.uk/2013/07/configuring-nutch-to-crawl-urls.html
How It Works
Binary file parsing is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents, not only binary files but also HTML and XML files. To add a handler that uses Apache Tika, we add a handler based on the solr.extraction.ExtractingRequestHandler class to our solrconfig.xml file, as shown in the example. In addition to the handler definition, we need to tell Solr where to look for the additional libraries we placed in the extract directory that we created. The dir attribute of the lib tag should point to the path of that directory, and the regex attribute is a regular expression telling Solr which files to load.

Let's now discuss the default configuration parameters. The fmap.content parameter tells Solr which field the content of the parsed document should be mapped to; in our case, the parsed content will go to the field named text. The next parameter, lowernames, is set to true; this tells Solr to lowercase all field names that come from Tika. The next parameter, uprefix, is very important. It tells Solr how to handle fields that are not defined in the schema.xml file: the field name returned from Tika will be prefixed with the value of this parameter and sent to Solr. For example, if Tika returned a field named creator, and we don't have such a field in our index, then Solr would try to index it under a field named attr_creator, which is a dynamic field. The last parameter, captureAttr, tells Solr to index Tika XHTML elements into separate fields named after those elements.

Next we have the command that sends a PDF file to Solr. We are sending the file to the /update/extract handler with two parameters. First we define a unique identifier. It's useful to be able to do that while sending the document, because most binary documents won't have an identifier in their contents. To pass the identifier we use the literal.id parameter. The second parameter we send to Solr tells it to perform a commit right after document processing.
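The lowernames and uprefix behaviour described above can be illustrated with a small Python sketch. This only mimics the mapping for illustration; it is not Solr's actual code, and the schema field set is an assumption:

```python
# Fields assumed to be defined in schema.xml for this illustration.
SCHEMA_FIELDS = {"id", "text"}

def map_tika_field(name, lowernames=True, uprefix="attr_"):
    """Mimic how the handler maps a Tika metadata name onto a Solr field."""
    if lowernames:
        name = name.lower()        # lowernames=true lowercases the name
    if name not in SCHEMA_FIELDS:
        name = uprefix + name      # unknown fields get the uprefix,
                                   # landing in the attr_* dynamic field
    return name
```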
Source Code
If you are unsure of anything, then pop me an email and I can send you a sample schema.xml and solrconfig.xml for you to use.