Creating a language detection API in 30 minutes

Friday 24th October, 2008

This week I needed to test out the performance of the n-gram technique for statistical language detection, and only had about half an hour to do it, so I brought in the experts…

Lucene provides a huge number of text analysis features, but currently doesn’t provide out-of-the-box language identification.

Nutch, on the other hand, does: language identification is kindly provided as a Nutch plugin. Testing that out, I discovered a dependency on Hadoop's Configuration class, so I went and dug out that JAR too.

So, libraries in hand, I knocked up a quick proof-of-concept, full of messy dependencies and ugly string manipulation.


import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;

public class LanguageServlet extends HttpServlet
{
	private static LanguageIdentifier _identifier;

	public void init() throws ServletException
	{
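		// build Nutch's n-gram based language identifier once, when the servlet starts up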
		_identifier = new LanguageIdentifier(new Configuration());
	}

	protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
		this.execute(request, response);
	}

	protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
		this.execute(request, response);
	}

	protected void execute(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
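		// always reply with a small UTF-8 XML document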
		response.setCharacterEncoding("utf-8");
		response.setHeader("Content-Type","text/xml");

		String query = request.getParameter("text");
		String language = "unknown";

		if ( query != null && query.length() > 0 )
		{
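			// ask the Nutch identifier for its best guess at the language of the text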
			language = _identifier.identify(query);
		}

		StringBuffer b = new StringBuffer();
		b.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
		b.append("<languagequery guess=\""+language+"\">\n");
		b.append("</languagequery>\n");

		response.getWriter().write(b.toString());
	}
}
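
Deploy that behind a servlet mapping of your choice and pass it some text, for example a GET to something like http://localhost:8080/language?text=the+quick+brown+fox (the /language mapping is just an illustration), and you should get back a response along the lines of:

<?xml version="1.0" encoding="UTF-8"?>
<languagequery guess="en">
</languagequery>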

As you can see, the code is basic, but the actual method call is very simple indeed. One thing that's missing is the ability to see what level of certainty the identifier assigns to its language guess, but you can add that yourself once you're comfortable enough with the technique to hunt down the Nutch source, or build your own.
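
If you just want to kick the tyres on the technique without a servlet container, the identification call itself boils down to a couple of lines. Here's a minimal sketch using the same two classes as above (the class name and sample sentences are just illustrations):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;

public class LanguageDemo
{
	public static void main(String[] args)
	{
		LanguageIdentifier identifier = new LanguageIdentifier(new Configuration());

		// identify() returns a language code such as "en" or "fr"
		System.out.println(identifier.identify("The sun rose slowly over the quiet harbour town."));
		System.out.println(identifier.identify("Le soleil se levait lentement sur la petite ville portuaire."));
	}
}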

Hadoop User Group – UK

Wednesday 20th August, 2008

Yesterday I attended the Hadoop User Group in Clerkenwell, kindly organised by Johan Oskarsson of Last FM. Below are some of my notes. I've paid more attention to topics that affect me directly, so please don't expect these notes to be comprehensive.

Hadoop Overview – Doug Cutting

Doug took us through the Hadoop timeline, starting with the scaling problems faced by the Nutch team and the Google MapReduce paper. The Hadoop project enabled Nutch to overcome those limitations and reach what was, at the time, web scale. Yahoo have now broken the Terasort record using a Hadoop cluster.

Hadoop on Amazon S3/EC2 – Tom White

Tom highlighted implementation issues from his experience deploying Hadoop clusters to AWS.

Predictably, he's found that positioning job data closer to the Hadoop cluster improves performance: storing data in S3, for example, gives roughly 50% of the performance of storing it locally on the EC2 cluster.

In Tom's experience the node failure rate is about 2%. That's a nuisance if it's a regular node, but a serious problem if your name node fails… so as things stand you have to be prepared to deal with job failures. In light of that failure rate, Tom highlighted a potential gap in the Hadoop toolset: a way of reporting failures to subsequent dependent jobs.

SmartFrog – Steve Loughran

Steve presented a system called SmartFrog, which provides configuration management for distributed systems. Hadoop comes into its own when you're dealing with more than a handful of nodes, and even at those modest scales a sensible nerd will need automated configuration management.

Hadoop usage at Last FM – Martin, Elias and Johan

From the looks of things, everyone at Last FM is hip-deep in Hadoop. That makes a lot of sense, since one of the most important features of Last FM is the ability to recommend music… a difficult and woolly job at best. No wonder they churn through a lot of data. The main point I took away from this was that there are lots of people actively using Hadoop and thinking about ways to improve it.

Using Hadoop and Nutch for NLP – Miles Osborne

Miles teaches Natural Language Processing at the University of Edinburgh. With the Mahout sub-project, he's been using Hadoop to process blogs (amongst other things) and build models of the structures used in natural language.

PostgreSQL to HBase replication – Tim Sell

Tim set up a replication system so that the team can run heavy queries on their data without endlessly harassing their PostgreSQL database. Due to the lack of triggers in HBase, it's unlikely that the reverse of this process will be possible in the short term, but given the nature of the two systems it's less likely to be required.

Distributed Lucene – Mark Butler

One of the failings of the current open source stack is the lack of a solid choice for a distributed search index. Mark took Doug Cutting's proposal for a distributed index based on Lucene and implemented an alpha version of a working system.

This system features:

  • A name node
  • Heartbeats to detect failures
  • Transactional index updates, via versioning and committing across the cluster
  • Sharding handled by the client API rather than by the name node
  • Replication handled via data node leases

Dumbo – Klaas Bosteels

Klaas has implemented Dumbo, a system that lets you write disposable Hadoop streaming programs in Python. The aim was to reduce the amount of work involved in writing one-off jobs. This seems like it could become part of the Hadoop toolset, as it's certain to be useful to a lot of people.