Setting Up Tika's Extracting Request Handler
Some of this is covered in the Solr set-up guide linked below.
Sometimes indexing prepared text files (such as XML, CSV, or JSON) is not enough. There are numerous situations where you need to extract data from binary files: for example, indexing the actual contents of PDF files. To do that we can use Apache Tika, which comes built in with Apache Solr, via its ExtractingRequestHandler.
Preparation
You should have worked through the set-up for Solr prior to this point; the guide can be found at:
http://amac4.blogspot.co.uk/2013/07/setting-up-solr-with-apache-tomcat-be.html
If you wish to have a fully functioning file or web crawler using Nutch that indexes to Solr, then follow the next steps of the guide at:
http://amac4.blogspot.co.uk/2013/07/configuring-nutch-to-crawl-urls.html
http://amac4.blogspot.co.uk/2013/07/setting-up-nutch-to-crawl-filesystem.html
http://amac4.blogspot.co.uk/2013/07/web-service-to-query-solr-rest.html
Set-Up Guide
- In the $SOLR_HOME/collection1/conf/solrconfig.xml file there is a section with the heading "Solr Cell Update Request Handler". The code there should be updated or replaced to read:
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>
- Create an "extract" folder anywhere in the system; one option is to put it in the solr_home folder. Then place the solr-cell-4.3.0.jar file from $SOLR/dist in it. Then copy the contents of the $SOLR/contrib/extraction/lib/ folder into your extract folder.
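The copy steps above can be sketched in Python; the function and the test paths are placeholders, so point the arguments at your own $SOLR/dist, $SOLR/contrib/extraction/lib/, and extract locations:

```python
import shutil
from pathlib import Path

def populate_extract_dir(solr_dist, contrib_lib, extract_dir):
    """Copy solr-cell and the Tika dependency jars into the extract folder.

    solr_dist    -- path to $SOLR/dist (contains solr-cell-4.3.0.jar)
    contrib_lib  -- path to $SOLR/contrib/extraction/lib/
    extract_dir  -- the extract folder to create and populate
    """
    extract = Path(extract_dir)
    extract.mkdir(parents=True, exist_ok=True)
    # The solr-cell jar from the dist folder.
    shutil.copy2(Path(solr_dist) / "solr-cell-4.3.0.jar", extract)
    # Every jar from the extraction contrib lib folder.
    for jar in Path(contrib_lib).glob("*.jar"):
        shutil.copy2(jar, extract)
    return sorted(p.name for p in extract.glob("*.jar"))
```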
- In the solrconfig.xml file, add code pointing to the directory you have chosen:
<lib dir="$SOLR_HOME/extract" regex=".*\.jar" />
- In the schema.xml file, the definition of the text field needs to be edited to read:
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
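Because the uprefix parameter maps unknown Tika fields to names starting with attr_, the schema also needs a matching dynamic field. The stock Solr 4.3 example schema already contains one; if yours does not, a definition along these lines (assuming the text_general type exists in your schema) can be added:

```xml
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
```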
- To test that it works, open a command prompt, navigate to any directory containing a PDF file, and execute the following command, replacing FILENAME with the file to be used:
curl "http://localhost:8080/solr/update/extract?literal.id=1&commit=true" -F "myfile=@FILENAME.pdf"
- If all has worked correctly, then output like the following should be displayed (the QTime value will vary):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">578</int>
</lst>
</response>
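The same request can also be issued from code. A minimal Python sketch, assuming Solr is running at http://localhost:8080/solr as in the set-up guide (the actual send step is shown only as a comment, since it needs a running server):

```python
from urllib.parse import urlencode

# Assumed Solr location from the set-up guide; adjust for your install.
SOLR_BASE = "http://localhost:8080/solr"

def extract_url(doc_id, commit=True):
    """Build the /update/extract URL with the literal.id and commit parameters."""
    params = {"literal.id": doc_id, "commit": str(commit).lower()}
    return f"{SOLR_BASE}/update/extract?{urlencode(params)}"

# To actually send a file, post it as multipart form data, e.g. with the
# third-party requests library:
#   requests.post(extract_url(1), files={"myfile": open("FILENAME.pdf", "rb")})
```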
Next Steps
You now have Solr configured properly and ready to use Tika to extract the data that you need. The next step is to configure Nutch, an open source web crawler that will crawl the web to find pages to index: http://amac4.blogspot.co.uk/2013/07/configuring-nutch-to-crawl-urls.html
How It Works
Binary file parsing is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents, not only binary files but also HTML and XML files. To add a handler that uses Apache Tika, we add a handler based on the solr.extraction.ExtractingRequestHandler class to our solrconfig.xml file, as shown in the example. In addition to the handler definition, we need to tell Solr where to look for the additional libraries we placed in the extract directory that we created. The dir attribute of the lib tag should point to the path of that directory, and the regex attribute is a regular expression telling Solr which files to load.

Let's now discuss the default configuration parameters. The fmap.content parameter tells Solr which field the content of the parsed document should be mapped to; in our case, the parsed content will go to the field named text. The next parameter, lowernames, is set to true; this tells Solr to lowercase all field names that come from Tika. The next parameter, uprefix, is very important. It tells Solr how to handle fields that are not defined in the schema.xml file: the field name returned from Tika will be prefixed with the value of this parameter and sent to Solr. For example, if Tika returned a field named creator, and we don't have such a field in our index, then Solr would try to index it under a field named attr_creator, which is a dynamic field. The last parameter, captureAttr, tells Solr to index Tika XHTML elements into separate fields named after those elements.

Next we have the command that sends a PDF file to Solr. We are sending the file to the /update/extract handler with two parameters. First we define a unique identifier. It's useful to be able to do that while sending the document, because most binary documents won't have an identifier in their contents. To pass the identifier we use the literal.id parameter. The second parameter we send to Solr tells it to perform a commit right after document processing.
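The lowernames and uprefix behaviour described above can be illustrated with a small Python sketch. This only mimics the mapping for illustration; it is not Solr's actual code, and the schema field set is an assumption:

```python
# Fields assumed to be defined in schema.xml for this illustration.
SCHEMA_FIELDS = {"id", "text"}

def map_tika_field(name, lowernames=True, uprefix="attr_"):
    """Mimic how the handler maps a Tika metadata name onto a Solr field."""
    if lowernames:
        name = name.lower()        # lowernames=true lowercases the name
    if name not in SCHEMA_FIELDS:
        name = uprefix + name      # unknown fields get the uprefix,
                                   # landing in the attr_* dynamic field
    return name
```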
Source Code
If you are unsure of anything, then pop me an email and I can send you a sample schema.xml and solrconfig.xml for you to use.