Creating a language detection API in 30 minutes

This week I needed to test out the performance of the n-gram technique for statistical language detection, and only had about half an hour to do it, so I brought in the experts…

Lucene provides a huge number of text analysis features, but currently doesn’t provide out-of-the-box language identification.

Nutch, on the other hand, does: language identification is kindly provided as a Nutch plugin. Testing that out, I discovered a dependency on Hadoop’s Configuration class, so I went and dug out that JAR too.

So, libraries in hand, I knocked up a quick proof-of-concept, full of messy dependencies and ugly string manipulation.


import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;

public class LanguageServlet extends HttpServlet
{
	private static LanguageIdentifier _identifier;

	public void init() throws ServletException
	{
		_identifier = new LanguageIdentifier(new Configuration());
	}

	protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
		this.execute(request, response);
	}

	protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
		this.execute(request, response);
	}

	protected void execute(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
		// Decode POSTed form data as UTF-8; must be set before reading any parameter.
		request.setCharacterEncoding("UTF-8");

		response.setCharacterEncoding("UTF-8");
		response.setContentType("text/xml");

		// The text to classify arrives in the "text" parameter.
		String query = request.getParameter("text");
		String language = "unknown";

		if ( query != null && query.length() > 0 )
		{
			language = _identifier.identify(query);
		}

		// Build the tiny XML response by hand; the guess goes in an attribute.
		StringBuilder b = new StringBuilder();
		b.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
		b.append("<languagequery guess=\""+language+"\">\n");
		b.append("</languagequery>\n");

		response.getWriter().write(b.toString());
	}
}
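
For completeness, here is a rough sketch of how a client might read the servlet’s response. It only assumes the XML shape produced above (a root languagequery element with a guess attribute); the class name LanguageResponseParser and the sample string are made up for illustration, and fetching the response over HTTP is left out because the servlet’s URL depends on your own deployment.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical client-side helper, not part of Nutch or the servlet itself.
public class LanguageResponseParser
{
	// Returns the "guess" attribute of the root element of the response XML.
	public static String parseGuess(String xml)
	{
		try
		{
			DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
			Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
			return doc.getDocumentElement().getAttribute("guess");
		}
		catch (Exception e)
		{
			// Malformed XML or parser setup problems end up here.
			throw new IllegalStateException("Could not parse language response", e);
		}
	}

	public static void main(String[] args)
	{
		// Sample response in the same shape the servlet writes.
		String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
			+ "<languagequery guess=\"en\">\n"
			+ "</languagequery>\n";
		System.out.println(parseGuess(sample));
	}
}
```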

As you can see, the servlet code is basic, and the actual call that identifies the language is very simple indeed. One thing that’s missing is the ability to see what level of certainty the identifier assigns to its language guess, but you can add that yourself once you get comfortable enough with the technique to hunt down the Nutch source, or build your own.
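
If you do want to build your own, the sketch below shows roughly what an n-gram guesser with a score might look like. To be clear, this is not how the Nutch plugin works internally: the class NGramGuesser, the one-sentence training samples and the choice of cosine similarity over character trigram counts are all my own assumptions, picked to keep the example short. A real detector would need far larger training texts per language.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy n-gram language guesser; a sketch only, not the Nutch implementation.
public class NGramGuesser
{
	// One trigram-count profile per language, in training order.
	private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

	// Counts overlapping character trigrams in lower-cased, space-padded text.
	static Map<String, Integer> trigrams(String text)
	{
		Map<String, Integer> counts = new HashMap<>();
		String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
		for (int i = 0; i + 3 <= s.length(); i++)
		{
			counts.merge(s.substring(i, i + 3), 1, Integer::sum);
		}
		return counts;
	}

	// Cosine similarity between two trigram-count vectors (0.0 if either is empty).
	static double cosine(Map<String, Integer> a, Map<String, Integer> b)
	{
		double dot = 0, normA = 0, normB = 0;
		for (Map.Entry<String, Integer> e : a.entrySet())
		{
			normA += e.getValue() * e.getValue();
			Integer other = b.get(e.getKey());
			if (other != null) dot += e.getValue() * other;
		}
		for (int v : b.values()) normB += v * v;
		if (normA == 0 || normB == 0) return 0.0;
		return dot / Math.sqrt(normA * normB);
	}

	public void train(String language, String sampleText)
	{
		profiles.put(language, trigrams(sampleText));
	}

	// Similarity of the text to every trained profile; higher means more alike.
	public Map<String, Double> scores(String text)
	{
		Map<String, Integer> input = trigrams(text);
		Map<String, Double> result = new LinkedHashMap<>();
		for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet())
		{
			result.put(p.getKey(), cosine(input, p.getValue()));
		}
		return result;
	}

	// Best-scoring language, or "unknown" when nothing overlaps at all.
	public String guess(String text)
	{
		String best = "unknown";
		double bestScore = 0.0;
		for (Map.Entry<String, Double> e : scores(text).entrySet())
		{
			if (e.getValue() > bestScore)
			{
				best = e.getKey();
				bestScore = e.getValue();
			}
		}
		return best;
	}

	public static void main(String[] args)
	{
		NGramGuesser g = new NGramGuesser();
		g.train("en", "the quick brown fox jumps over the lazy dog and then the cat sat on the mat");
		g.train("es", "el gato come pescado y el perro duerme en la casa de la familia");
		System.out.println(g.guess("the dog and the cat"));
		System.out.println(g.scores("the dog and the cat"));
	}
}
```

The per-language scores are what the servlet above doesn’t expose: you could add the winning score, or the gap between the top two, as an extra attribute on the response to give callers a rough idea of certainty.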

2 Responses to Creating a language detection API in 30 minutes

  1. Gustavo Arjones says:

    Thanks! You saved me a lot of time 🙂
    Complementing your post, the classpath must contain these libraries:
    * commons-logging-1.0.4.jar
    * hadoop-0.20.2-core.jar
    * language-identifier.jar
    * nutch-1.1.jar
