Right now we are working on a new project using Apache Nutch 2.x, Apache Hadoop, Apache Solr 4 and a lot of other cool tools, modules and APIs. After following the instructions found on http://nlp.solutions.asia/?p=180, I’ve successfully connected Apache Nutch, MySQL and Apache Solr.

In summary:

  • Create a database to hold your data
  • Use SQLDataStore and add configuration for your MySQL server
  • Update Apache Nutch configuration
  • Update Solr schema

Now our Apache Nutch uses MySQL as its data store (the place where it keeps the results of the crawling process, such as URLs, text content, metadata, and so on). That’s grand, but there is one part missing from the Solr schema provided in the blog post.

After following the tutorial and replacing the schema, we could no longer delete the whole index. The cause is SOLR-3432: a delete-by-query is silently ignored when the update log is enabled but the schema has no _version_ field. Following the instructions in the bug comments, we added the entry below to schema.xml and deletes worked again.

<field name="_version_" type="long" indexed="true" stored="true"/>

Restart Apache Solr, then run the following command to reset your index.

curl "http://localhost:8983/solr/collection1/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>*:*</query></delete>"
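
If you want to double-check that the delete really went through, something like the sketch below should do. It assumes the same host, port and core name (collection1) as the command above, and the SOLR_URL and QUERY_URL variable names are just ones I made up for the example. It asks Solr to match every document but return no rows, so the numFound value in the JSON response tells you how many documents are left (it should be 0 after a successful reset).

```shell
#!/bin/sh
# Assumption: same Solr host, port and core as the delete command; adjust as needed.
SOLR_URL="http://localhost:8983/solr/collection1"

# Match all documents (q=*:*), return no rows (rows=0), ask for JSON (wt=json).
QUERY_URL="$SOLR_URL/select?q=*:*&rows=0&wt=json"

# Print the response; look for "numFound" in the output.
# If Solr is not running, print a short hint instead of failing silently.
curl -s --max-time 5 "$QUERY_URL" || echo "Could not reach Solr at $SOLR_URL"
```

The URL is quoted so the shell does not treat the ? and & characters specially.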

Hope it helps if you are creating a similar setup. In the next post we will explain how to set up the Apache Nutch 2.x branch in Eclipse, which is very helpful for writing and debugging plug-ins.

Laters! -B