Nutch experience

I've experimented a bit with Nutch, and the project is both ambitious and potentially important for the open source/Java space. I'm using Nutch to index the file server at work where most of our documents are stored. The index is used by our intranet, a Confluence site, to show related documents: if the page being viewed in Confluence describes a software architecture document, the Confluence plugin shows all related documents from the Nutch index. I'm also planning to create a simple search dialog for explicit searches against the index.

I think the project has great potential, but it still needs some work. My biggest complaint about Nutch concerns configuration: both how you configure Nutch and how the configuration code is structured.

Nutch has a concept of default values and site/setup-specific values. If you need to override something for your setup (and you always do), you edit nutch-site.xml; the defaults live in nutch-default.xml. I think Nutch should ship with good defaults and move the whole nutch-default.xml file into the jar file. That way the user doesn't have to know about it, but can still view and alter it by extracting it from the jar.
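To show what I mean by overriding a default: a nutch-site.xml entry follows the usual Hadoop-style configuration format, something like this (http.agent.name is a standard Nutch property; the value is just a placeholder):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the value from nutch-default.xml for this installation -->
  <property>
    <name>http.agent.name</name>
    <value>my-intranet-crawler</value>
  </property>
</configuration>
```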

The other part that I'm not too happy about is the number of configuration files. There's the Hadoop configuration, regex-urlfilter.txt, crawl-urlfilter.txt and a lot of others. These should be consolidated so that the user doesn't have to know about all of them. Personally I would like ONE file to configure, and I don't see any reason why that shouldn't be possible. One way to reduce the pain of all these settings would be a GUI client that generates the appropriate configuration files.

The last part is the code behind configuration. First off, I would really like an interface for configuration, because the current implementation makes too many assumptions about where to find the Hadoop configuration files: they are expected to be available on the classpath/classloader, which can be a problem if you have complex classloader hierarchies. Personally I think the project should consider Commons Configuration, but at least start using interfaces so anyone can provide an alternative implementation.
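A minimal sketch of what I mean by "start using interfaces". None of these names exist in Nutch; they only illustrate how an interface would let callers supply settings from anywhere, instead of Nutch scanning the classpath for XML files:

```java
import java.util.Properties;

// Hypothetical configuration abstraction: callers ask for a value by key
// and supply a fallback, without knowing where the settings came from.
interface NutchConfig {
    String get(String key, String defaultValue);
}

// One possible implementation backed by java.util.Properties, e.g. loaded
// from a single user-supplied file rather than nutch-default.xml plus
// nutch-site.xml plus the Hadoop files.
class PropertiesNutchConfig implements NutchConfig {
    private final Properties props;

    PropertiesNutchConfig(Properties props) {
        this.props = props;
    }

    public String get(String key, String defaultValue) {
        return props.getProperty(key, defaultValue);
    }
}
```

With something like this in place, anyone with a complex classloader setup could plug in their own implementation.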


My new goodie of the month: AbstractTransactionalDataSourceSpringContextTests

I bought the SpringOne DVD this autumn, but didn't have time to watch it before starting on a new project. I'm finally using the Spring Framework, a really good IDE and an application server that doesn't take forever to start up.

This little project I'm working on doesn't have too much logic in the business tier, so what I really wanted was to test the SQL statements. I've tried the mock object approach, and as everybody else says: it's not the way to go when testing this kind of code.

Listening to Rod's talk about testing, I had to agree with his point about in-memory databases: when testing SQL against a database, I want to run against the same database engine as the production environment, because SQL dialects differ.

So this really cool class with the long name, AbstractTransactionalDataSourceSpringContextTests, does a fantastic job of solving my testing needs. By running each test inside a database transaction, you can issue update and insert statements and get real error messages from the database if something goes wrong; if everything went fine, it all gets rolled back, so my tests are repeatable. Brilliant - just brilliant.
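A test along these lines gives the flavour (the context file name, table and DAO are made up for illustration; the inherited jdbcTemplate field, countRowsInTable() and getConfigLocations() are part of Spring's test support):

```java
import org.springframework.test.AbstractTransactionalDataSourceSpringContextTests;

// Sketch of a DAO test. Spring starts a transaction before each test
// method and rolls it back afterwards, so the insert never survives
// the test and the database stays clean between runs.
public class PersonDaoTest extends AbstractTransactionalDataSourceSpringContextTests {

    protected String[] getConfigLocations() {
        // Placeholder: point this at your own test application context.
        return new String[] { "applicationContext-test.xml" };
    }

    public void testInsertPerson() {
        int before = countRowsInTable("person");
        jdbcTemplate.update(
            "insert into person (id, name) values (?, ?)",
            new Object[] { Integer.valueOf(1), "Alice" });
        assertEquals(before + 1, countRowsInTable("person"));
    }
}
```

Because the statements run against the real database, a broken SQL statement fails with the real error message from the real dialect, which is exactly what I wanted.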

I think the Spring guys should, from time to time, borrow the slogan from JetBrains: "Develop with pleasure".....