2007-04-10

Nutch experience

I've experimented a bit with Nutch, and the project is both ambitious and potentially important for the open source/Java space. I'm using Nutch to index the file server at work, where most of our documents are stored. Our intranet, which is a Confluence site, uses the Nutch-generated index to show related documents. For example, if the page being viewed in Confluence describes a software architecture document, the Confluence plugin shows all related documents from the Nutch index. I'm also planning to create a simple search dialog for explicit searches against the index.

I think the project has great potential, but it still needs some work. My biggest complaint about Nutch is configuration: both how you configure Nutch and how the configuration code is structured.

Nutch distinguishes between default values and site-specific (per-setup) values. The defaults live in nutch-default.xml, and anything you need to override for your setup (and you always have to override something) goes into nutch-site.xml. I think Nutch should have "good defaults" and move the whole nutch-default.xml file into the jar file. That way the user doesn't have to know about it, but is still able to view and alter it by extracting it from the jar file.
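To make the default/site layering concrete, here is a minimal sketch in plain Java. It does not use Nutch's or Hadoop's own configuration classes; it only illustrates the idea with java.util.Properties, and the class name LayeredConfig, the method layer, and the property values are my own assumptions for illustration.

```java
import java.util.Properties;

// Hypothetical helper, not part of Nutch: shows defaults being
// overridden by site-specific values.
class LayeredConfig {

    // Returns a Properties object that holds the site values and
    // falls back to the defaults for any key the site does not set.
    static Properties layer(Properties defaults, Properties site) {
        Properties merged = new Properties(defaults);
        merged.putAll(site);
        return merged;
    }

    public static void main(String[] args) {
        // Stand-in for what nutch-default.xml would provide (values are illustrative).
        Properties defaults = new Properties();
        defaults.setProperty("http.agent.name", "");
        defaults.setProperty("fetcher.threads.fetch", "10");

        // Stand-in for the overrides in nutch-site.xml.
        Properties site = new Properties();
        site.setProperty("http.agent.name", "intranet-crawler");

        Properties conf = layer(defaults, site);
        System.out.println(conf.getProperty("http.agent.name"));       // overridden by site
        System.out.println(conf.getProperty("fetcher.threads.fetch")); // taken from defaults
    }
}
```

If the defaults were bundled inside the jar, only the second set of values would be something the user ever has to touch.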

The other part that I'm not too happy about is the number of configuration files. There's the Hadoop configuration, regex-urlfilter.txt, crawl-urlfilter.txt and a lot of others. These should be consolidated so that the user doesn't have to know about all of these files. Personally I would like to have ONE file to configure, and I don't see any reason why this shouldn't be possible. One way to reduce the problem with these configuration settings is to provide a GUI client which generates the appropriate configuration files.

The other issue is how the configuration code itself is written. First off, I would really like to have an interface for configuration, because the current implementation makes too many assumptions about where to find the configuration files for Hadoop. They are expected to be available on the classpath, through the classloader. This might be an issue if you have complex classloader hierarchies. Personally I think that the project should consider Commons Configuration, but it should at least start using interfaces so anyone can provide an alternative implementation.
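As a rough sketch of what I mean by an interface for configuration: the names below (CrawlConfig, MapCrawlConfig, ConfigDemo) are hypothetical and do not exist in Nutch or Hadoop. The point is only that the calling code depends on an interface, so where the values come from (classpath, file system, database, a map in a test) becomes an implementation detail.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface, not an existing Nutch or Hadoop type.
interface CrawlConfig {
    // Returns the value for key, or defaultValue if the key is not set.
    String get(String key, String defaultValue);
}

// One possible implementation backed by an in-memory map.
// Another implementation could read files from an explicit path
// instead of relying on the classloader.
class MapCrawlConfig implements CrawlConfig {
    private final Map<String, String> values;

    MapCrawlConfig(Map<String, String> values) {
        this.values = new HashMap<>(values); // defensive copy
    }

    @Override
    public String get(String key, String defaultValue) {
        return values.getOrDefault(key, defaultValue);
    }
}

class ConfigDemo {
    // Client code only sees the interface, never the source of the values.
    static int fetcherThreads(CrawlConfig conf) {
        return Integer.parseInt(conf.get("fetcher.threads.fetch", "10"));
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("fetcher.threads.fetch", "4");
        System.out.println(fetcherThreads(new MapCrawlConfig(values)));
        System.out.println(fetcherThreads(new MapCrawlConfig(new HashMap<>())));
    }
}
```

With something like this in place, anyone with an unusual classloader setup could supply their own implementation without patching the project.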
