Creating a language detection API in 30 minutes

Friday 24th October, 2008

This week I needed to test out the performance of the n-gram technique for statistical language detection, and only had about half an hour to do it, so I brought in the experts…

Lucene provides a huge number of text analysis features, but currently doesn’t provide out-of-the-box language identification.

Nutch, on the other hand, does. It’s kindly provided as a Nutch plugin. Testing that out, I discovered a dependency on a Hadoop Configuration class, so I went and dug out that JAR too.

So, libraries in hand, I knocked up a quick proof-of-concept, full of messy dependencies and ugly string manipulation.


import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;

public class LanguageServlet extends HttpServlet
{
	private static LanguageIdentifier _identifier;

	public void init() throws ServletException
	{
		_identifier = new LanguageIdentifier(new Configuration());
	}

	protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
		this.execute(request, response);
	}

	protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
		this.execute(request, response);
	}

	protected void execute(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
		response.setCharacterEncoding("utf-8");
		response.setHeader("Content-Type","text/xml");

		String query = request.getParameter("text");
		String language = "unknown";

		if ( query != null && query.length() > 0 )
		{
			language = _identifier.identify(query);
		}

		StringBuilder b = new StringBuilder();
		b.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
		b.append("<languagequery guess=\""+language+"\">\n");
		b.append("</languagequery>\n");

		response.getWriter().write(b.toString());
	}
}

As you can see, the code is basic, and the actual method call is very simple indeed. One thing that’s missing is the ability to see what level of certainty the identifier assigns to its language guess, but you can add that yourself once you’re comfortable enough with the technique to hunt down the Nutch source, or build your own.
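
If you do go down the build-your-own route, here is a minimal sketch of the n-gram idea with a score attached to each guess. It is not the Nutch implementation: the class and method names (NGramGuesser, train, identify, Guess) are made up for illustration, and scoring by cosine similarity between character trigram profiles is just one reasonable choice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone guesser; not part of Nutch or Lucene.
final class NGramGuesser
{
	private static final int N = 3;
	private final Map<String, Map<String, Double>> profiles = new HashMap<>();

	// A guess plus a similarity score between 0 (nothing in common) and 1 (identical profile).
	public static final class Guess
	{
		public final String language;
		public final double score;
		Guess(String language, double score) { this.language = language; this.score = score; }
	}

	// Build a profile: relative frequency of each character trigram in the text.
	static Map<String, Double> profile(String text)
	{
		// Lower-case, collapse anything that isn't a letter into "_", and pad the ends.
		String padded = "_" + text.toLowerCase().replaceAll("[^\\p{L}]+", "_") + "_";
		Map<String, Double> counts = new HashMap<>();
		int total = 0;
		for (int i = 0; i + N <= padded.length(); i++)
		{
			counts.merge(padded.substring(i, i + N), 1.0, Double::sum);
			total++;
		}
		for (Map.Entry<String, Double> e : counts.entrySet())
		{
			e.setValue(e.getValue() / total);
		}
		return counts;
	}

	// Cosine similarity between two trigram profiles.
	static double cosine(Map<String, Double> a, Map<String, Double> b)
	{
		double dot = 0, na = 0, nb = 0;
		for (Map.Entry<String, Double> e : a.entrySet())
		{
			na += e.getValue() * e.getValue();
			Double other = b.get(e.getKey());
			if (other != null) dot += e.getValue() * other;
		}
		for (double v : b.values()) nb += v * v;
		return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
	}

	// Learn a language from a sample of text (a real profile needs far more text than one sentence).
	public void train(String language, String sample)
	{
		profiles.put(language, profile(sample));
	}

	// Return the closest trained language and its score, or "unknown" with 0.0 if nothing matches.
	public Guess identify(String text)
	{
		Map<String, Double> query = profile(text);
		Guess best = new Guess("unknown", 0.0);
		for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet())
		{
			double score = cosine(query, e.getValue());
			if (score > best.score) best = new Guess(e.getKey(), score);
		}
		return best;
	}
}
```

The score could go into the XML response as an extra attribute alongside the guess. A low score, or a small gap between the best and second-best language, is a hint that the input was too short to call.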


Who should control my data?

Monday 20th October, 2008

ReadWriteWeb today asks who will control my data in the web 3.0 world. To answer their question: I will. I created it, I own it. I will grant rights to those that want it, and can offer something to me in return.

Data is one of the natural resources of your own private nation. For someone else to control it is at best a lost opportunity, and at worst a form of theft.


Logging in JavaScript

Wednesday 15th October, 2008

Today’s Ajaxian article about the Blackbird JavaScript logging library has prompted a stream of comments asking what advantage this gives over console.log() in Firebug.

I think both the commenters and the Blackbird author have missed a very important aspect of logging in JavaScript, namely the ability to record errors outside your own session, i.e. server-side.

This is all that’s missing (in roughly jQuery syntax for brevity):

window.onerror = function(message, url, line)
{
    // the browser passes the error message, script URL and line number
    // save error and UI activity for context
    $.post( log_url,
        {
            error:String(message),
            url:url,
            line:line,
            lines:logger.getRecentLines(20)
        },
        function(){
            // do callback if necessary
        });
};
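
For the receiving end, here is a rough sketch of what log_url might point at, using the HTTP server that ships with the JDK. Everything in it is an assumption for illustration: the ClientErrorLog class, the /log path, the log line format, and the idea that logger.getRecentLines() sends a single string rather than an array.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical receiver for the errors posted by the window.onerror handler.
final class ClientErrorLog
{
	// Decode an application/x-www-form-urlencoded body, which is what $.post sends by default.
	static Map<String, String> parseForm(String body)
	{
		Map<String, String> fields = new LinkedHashMap<>();
		for (String pair : body.split("&"))
		{
			if (pair.isEmpty()) continue;
			int eq = pair.indexOf('=');
			String key = eq < 0 ? pair : pair.substring(0, eq);
			String value = eq < 0 ? "" : pair.substring(eq + 1);
			fields.put(URLDecoder.decode(key, StandardCharsets.UTF_8),
				URLDecoder.decode(value, StandardCharsets.UTF_8));
		}
		return fields;
	}

	// Collapse line breaks so one client report is always one log line.
	static String oneLine(String s)
	{
		return s.replaceAll("[\\r\\n]+", " ");
	}

	// Turn the posted fields into a single log entry.
	static String formatEntry(Map<String, String> fields)
	{
		String error = fields.getOrDefault("error", "unknown");
		String lines = fields.getOrDefault("lines", "");
		return "client-error: " + oneLine(error) + " | recent: " + oneLine(lines);
	}

	// Start a tiny server that writes each report to stderr and replies 204 No Content.
	static HttpServer start(int port) throws IOException
	{
		HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
		server.createContext("/log", exchange ->
		{
			try (InputStream in = exchange.getRequestBody())
			{
				String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
				System.err.println(formatEntry(parseForm(body)));
			}
			exchange.sendResponseHeaders(204, -1);
			exchange.close();
		});
		server.start();
		return server;
	}
}
```

In a real deployment you would want to rate-limit this endpoint and cap the body size, since anyone can post to it.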

Pixsta job specs

Wednesday 15th October, 2008

I’ve just noticed that we now have job specs for the positions we’re looking to fill.

Positions include:

The office is on Portobello Road in west London. We’re playing with some cutting-edge technology, in a positive and enthusiastic environment, so if you’re looking or have any questions then get in touch.