New Home for the London Search Social

Wednesday 16th December, 2009

To avoid the somewhat annoying (and hopefully temporary) problem that not everyone in the world reads my blog, I’ve created a new home for our search social meet-ups over on

Sign up on the London Search Social page to get notifications of events.

Open Source Search Social

Thursday 5th November, 2009

It’s been a little while since the last Open Source Search Social, so we’re getting really imaginative and holding another one, this time on Wednesday the 18th of November. As usual the event is in the Pelican pub just off London’s face-bleedingly trendy Portobello Road.

The format is staying roughly the same. No agenda, no attitude, just some geeks talking about search and related topics in the presence of intoxicating substances.

Please come along if you can. Just get in touch or sign up on the Upcoming page.

Open Source Search Social

Thursday 28th May, 2009

Following on from the undeniably interesting Search/Lucene social in London last month we’re organising another one… this time broadening the scope a little to other OS search projects and related geekery… Solr, Hadoop, Mahout, etc.

We’re meeting up on Monday the 15th of June, at The Pelican pub (nearest tube Westbourne Park).

If you’re working in the search field and fancy an informal chat then come along. Please sign up on Upcoming or drop me a line so I know you’re coming.

Update 28th May, 16:19:- Added Upcoming link

Semantic Vectors – semantic indexing on top of Lucene

Thursday 23rd April, 2009

For anyone interested in adding semantic structures on top of their unstructured or semi-structured data I recently came across Dominic Widdows’ Semantic Vectors project.

It’s not a big enough project to survive the ‘contributor departure’ test, but it’s in active development and reading the code didn’t make my eyes bleed, so may be worth a look if that’s your bag.

Search / Lucene social meet-up

Monday 6th April, 2009

Having just finished our product launch (apologies for the gratuitous plug) I’ve now got time to worry about more important things, i.e. organising beers.

We’ll be in The Pelican pub just near the Pixsta offices in Notting Hill from 7pm on the 27th of April. If you’re keen to come along and talk about Lucene, or search in general, then please do. There may also be talk of machine learning, computer vision, distributed systems, etc.

All I ask is that you sign up on the Yahoo event page so that I’ve got an idea about numbers (need to book tables, blah blah blah).

Creating a language detection API in 30 minutes

Friday 24th October, 2008

This week I needed to test out the performance of the n-gram technique for statistical language detection, and only had about half an hour to do it, so I brought in the experts…

Lucene provides a huge number of text analysis features, but currently doesn’t provide out-of-the-box language identification.

Nutch, on the other hand, does. It’s kindly provided as a Nutch plugin. Testing that out, I discovered a dependency on a Hadoop Configuration class, so went and dug out that JAR too.

So, libraries in hand, I knocked up a quick proof-of-concept, full of messy dependencies and ugly string manipulation.

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;

public class LanguageServlet extends HttpServlet {
	// Created once at start-up and shared by all requests
	private static LanguageIdentifier _identifier;

	public void init() throws ServletException {
		_identifier = new LanguageIdentifier(new Configuration());
	}

	protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		this.execute(request, response);
	}

	protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
		this.execute(request, response);
	}

	protected void execute(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {

		String query = request.getParameter("text");
		String language = "unknown";

		// Only ask Nutch for a guess when there is some text to look at
		if ( query != null && query.length() > 0 ) {
			language = _identifier.identify(query);
		}

		StringBuffer b = new StringBuffer();
		b.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
		b.append("<languagequery guess=\""+language+"\">\n");
		// Assumed minimal completion: close the element and send the XML back
		b.append("</languagequery>\n");

		response.setContentType("text/xml; charset=UTF-8");
		response.getWriter().write(b.toString());
	}
}


As you can see, the code is basic and the actual method call is very simple indeed. One thing that’s missing is the ability to see what level of certainty the identifier assigns to its language guess, but you can add that yourself once you get comfortable enough with the technique to hunt down the Nutch source, or build your own.
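
If you do fancy building your own, here is a rough sketch of the n-gram idea with a score attached to each guess. To be clear, this is not the Nutch implementation: the class name, the choice of trigrams and the use of cosine similarity are all my own assumptions, and a real detector would need far larger training samples than a single sentence per language.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch or Lucene.
final class NGramLanguageGuesser {
    private static final int N = 3; // character trigrams (an assumption)

    // language code -> relative trigram frequencies learned from a sample
    private final Map<String, Map<String, Double>> profiles = new HashMap<>();

    /** A guess plus a similarity score between 0 (no overlap) and 1. */
    static final class Guess {
        final String language;
        final double score;

        Guess(String language, double score) {
            this.language = language;
            this.score = score;
        }
    }

    /** Builds a relative-frequency trigram profile for a piece of text. */
    static Map<String, Double> profile(String text) {
        String padded = " " + text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim() + " ";
        Map<String, Double> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + N <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + N), 1.0, Double::sum);
            total++;
        }
        for (Map.Entry<String, Double> e : counts.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return counts;
    }

    /** Cosine similarity between two trigram profiles. */
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Learns a profile for one language from sample text. */
    void train(String language, String sample) {
        profiles.put(language, profile(sample));
    }

    /** Returns the best-matching language, or "unknown" with score 0. */
    Guess guess(String text) {
        Map<String, Double> query = profile(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double score = similarity(query, e.getValue());
            if (score > bestScore) {
                best = e.getKey();
                bestScore = score;
            }
        }
        return new Guess(best, bestScore);
    }
}
```

Call train once per language with some sample text, then guess on the incoming query. The score could be written into the XML response as an extra attribute alongside the guess.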

Pixsta job specs

Wednesday 15th October, 2008

I’ve just noticed that we now have job specs for the positions we’re looking to fill.

Positions include:

The office is on Portobello Road in west London. We’re playing with some cutting-edge technology, in a positive and enthusiastic environment, so if you’re looking or have any questions then get in touch.

Hadoop User Group – UK

Wednesday 20th August, 2008

Yesterday I attended the Hadoop User Group in Clerkenwell, kindly organised by Johan Oskarsson of Last FM. Below are some of my notes. I’ve paid more attention to topics that affect me directly so please don’t expect these notes to be comprehensive.

Hadoop Overview – Doug Cutting

Doug took us through the Hadoop timeline, starting with the problems faced by the Nutch team and the Google MapReduce paper. The Hadoop project enabled Nutch to overcome scaling limitations to reach what at the time was web scale. Yahoo have now broken the Terasort record using a Hadoop cluster.

Hadoop on Amazon S3/EC2 – Tom White

Tom highlighted a few implementation issues from his experience deploying Hadoop clusters to AWS.

Predictably, he’s found that positioning job data closer to the Hadoop cluster improves performance. For example, storing data in S3 results in approx 50% of the performance of storing the data locally in the EC2 cluster.

Tom’s experience of node failure is about 2%. This is a nuisance if it’s a regular node, but a serious problem if your name node fails… so as things stand you have to be prepared to deal with job failures. In light of that failure rate, Tom highlighted a potential gap in the Hadoop toolset, namely a way of reporting failures to subsequent dependent jobs.
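
To make that gap a bit more concrete, here is a minimal sketch of what failure reporting between dependent jobs might look like. This is plain Java rather than anything from the Hadoop API: JobChain, its Status values and the job names used with it are all hypothetical, and a real version would need to launch and monitor actual Hadoop jobs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical illustration only; not part of the Hadoop toolset.
final class JobChain {
    enum Status { SUCCEEDED, FAILED, SKIPPED }

    private static final class Job {
        final Callable<Boolean> work;
        final List<String> dependsOn;

        Job(Callable<Boolean> work, List<String> dependsOn) {
            this.work = work;
            this.dependsOn = dependsOn;
        }
    }

    // Insertion order doubles as execution order
    private final Map<String, Job> jobs = new LinkedHashMap<>();

    /** Registers a job; add each job after the jobs it depends on. */
    void add(String name, Callable<Boolean> work, String... dependsOn) {
        jobs.put(name, new Job(work, Arrays.asList(dependsOn)));
    }

    /** Runs jobs in order, skipping any whose upstream jobs did not succeed. */
    Map<String, Status> runAll() {
        Map<String, Status> results = new LinkedHashMap<>();
        for (Map.Entry<String, Job> entry : jobs.entrySet()) {
            boolean upstreamOk = true;
            for (String dependency : entry.getValue().dependsOn) {
                if (results.get(dependency) != Status.SUCCEEDED) {
                    upstreamOk = false;
                }
            }
            if (!upstreamOk) {
                results.put(entry.getKey(), Status.SKIPPED);
                continue;
            }
            boolean ok;
            try {
                ok = entry.getValue().work.call();
            } catch (Exception e) {
                ok = false; // treat any exception as a job failure
            }
            results.put(entry.getKey(), ok ? Status.SUCCEEDED : Status.FAILED);
        }
        return results;
    }
}
```

The point of the sketch is that a downstream job never runs against the output of a job that failed or was itself skipped, and the returned map tells you which was which.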

Smartfrog – Steve Loughran

Steve presented a system called Smartfrog which enables configuration management for distributed systems. Hadoop comes into its own when you’re dealing with more than a handful of nodes, and even at those small scales a sensible nerd will need automated configuration management.

Hadoop usage at Last FM – Martin, Elias and Johan

From the looks of things everyone at Last FM is hip-deep in Hadoop. That makes a lot of sense since one of the most important features of Last FM is the ability to recommend music… a difficult and woolly job at best. No wonder they churn a lot of data. The main point I took away from this was that there are lots of people actively using Hadoop and thinking about ways to improve it.

Using Hadoop and Nutch for NLP – Miles Osborne

Miles teaches Natural Language Processing at the University of Edinburgh. Using the Mahout sub-project he’s been using Hadoop to process blogs (amongst other things) to form models of the structures used in natural language.

PostgreSQL to HBase replication – Tim Sell

Tim set up a replication system so that the team can run heavy queries on their data without endlessly harassing their PostgreSQL database. Due to the lack of triggers in HBase it’s unlikely that the reverse of this process will be possible in the short term, but due to the nature of the two systems it’s less likely to be required.

Distributed Lucene – Mark Butler

One of the failings of the current open source stack is the lack of a solid choice for a distributed search index. Mark took Doug Cutting’s proposal for a distributed index based on Lucene and implemented an alpha version of a working system.

The system features:

  • Name node
  • Heartbeats to detect failures
  • Transactional index updates, versioned and committed across the cluster
  • Sharding handled via the client API rather than by the name node
  • Replication handled via data node leases
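
As a rough illustration of what client-side sharding can look like, here is a minimal sketch that picks a shard by hashing the document id. I don’t know how Mark’s client API actually routes documents, so treat ShardRouter, the shard names and the CRC32-modulo scheme as assumptions of mine.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical illustration only; not Mark's client API.
final class ShardRouter {
    private final List<String> shards;

    ShardRouter(List<String> shards) {
        if (shards.isEmpty()) {
            throw new IllegalArgumentException("at least one shard is required");
        }
        this.shards = new ArrayList<>(shards);
    }

    /** Picks a shard for a document; the same id always maps to the same shard. */
    String shardFor(String documentId) {
        CRC32 crc = new CRC32();
        crc.update(documentId.getBytes(StandardCharsets.UTF_8));
        // CRC32 values are non-negative, so the modulo is a valid list index
        return shards.get((int) (crc.getValue() % shards.size()));
    }
}
```

Because the client computes the shard itself, the name node never needs to sit in the path of every update. The trade-off with a simple modulo scheme is that changing the number of shards remaps most documents.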

Dumbo – Klaas Bosteels

Klaas has implemented Dumbo, a system that allows you to write disposable Hadoop streaming programs in Python. The aim was to reduce the amount of work involved in writing one-off jobs. This seems like it could become part of the Hadoop toolset as it’s certain to be useful to a lot of people.