Blog Taxonomy

Posts tagged with 'web crawling'

Integrating Nutch 2.x, MySQL and Solr

kinow @ Sep 14, 2012 00:02:31

Right now we are working on a new project using Apache Nutch 2.x, Apache Hadoop, Apache Solr 4 and a lot of other cool tools, modules and APIs. After following the instructions found in a blog post, I’ve successfully connected Apache Nutch, MySQL and Apache Solr.


In summary:

  • Create a database to hold your data
  • Use SQLDataStore and add configuration for your MySQL server
  • Update Apache Nutch configuration
  • Update Solr schema

Now our Apache Nutch installation uses MySQL as its data store (the place where it keeps the results of the crawling process, such as URLs, text content and metadata). That’s grand, but there is one part missing from the Solr schema provided in the blog post.

Due to SOLR-3432, after following the tutorial and replacing the schema, we could no longer delete the whole index. After following the instructions in the bug comments and adding the following entry to schema.xml, it worked again.

( Read more ... )
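
For readers unfamiliar with what "deleting the whole index" involves, here is a minimal sketch of the kind of request typically used for it: a delete-by-query matching every document, posted to Solr's XML update handler. This is not the schema.xml entry mentioned above, and it is not taken from the original post. The function name `build_delete_all_request` and the base URL in the usage comment are assumptions for illustration only.

```python
from urllib import request


def build_delete_all_request(solr_url):
    """Build (but do not send) a request that deletes every document.

    solr_url is the base URL of the Solr core, an assumption here;
    adjust it to match your own installation.
    """
    # The query *:* matches all documents in the index.
    body = b"<delete><query>*:*</query></delete>"
    return request.Request(
        url=solr_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


# Usage (requires a running Solr instance, so it is left commented out):
# req = build_delete_all_request("http://localhost:8983/solr")
# with request.urlopen(req) as resp:
#     print(resp.status)
```

Building the request separately from sending it makes it easy to inspect the URL and payload before touching a live index.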