New Home for the London Search Social

Wednesday 16th December, 2009

To avoid the somewhat annoying (and hopefully temporary) problem that not everyone in the world reads my blog, I’ve created a new home for our search social meet-ups over on Meetup.com.

Sign up on the London Search Social page to get notifications of events.


Open Source Search Social

Thursday 5th November, 2009

It’s been a little while since the last Open Source Search Social, so we’re getting really imaginative and holding another one, this time on Wednesday the 18th of November. As usual the event is in the Pelican pub just off London’s face-bleedingly trendy Portobello Road.

The format is staying roughly the same. No agenda, no attitude, just some geeks talking about search and related topics in the presence of intoxicating substances.

Please come along if you can. Just get in touch or sign up on the Upcoming page.


Open Source Search Social

Thursday 28th May, 2009

Following on from the undeniably interesting Search/Lucene social in London last month we’re organising another one… this time broadening the scope a little to other OS search projects and related geekery… Solr, Hadoop, Mahout, etc.

We’re meeting up on Monday the 15th of June, at The Pelican pub (nearest tube Westbourne Park).

If you’re working in the search field and fancy an informal chat then come along. Please sign up on Upcoming or drop me a line to let me know you’re coming.

Update 28th May, 16:19:- Added Upcoming link


Semantic Vectors – semantic indexing on top of Lucene

Thursday 23rd April, 2009

For anyone interested in adding semantic structures on top of their unstructured or semi-structured data, I recently came across Dominic Widdows’ Semantic Vectors project, which may be of use.

It’s not a big enough project to survive the ‘contributor departure’ test, but it’s in active development and reading the code didn’t make my eyes bleed, so it may be worth a look if that’s your bag.


Search / Lucene social meet-up

Monday 6th April, 2009

Having just finished our product launch (apologies for the gratuitous plug) I’ve now got time to worry about more important things, i.e. organising beers.

We’ll be in The Pelican pub just near the Pixsta offices in Notting Hill from 7pm on the 27th of April. If you’re keen to come along and talk about Lucene, or search in general, then please do. There may also be talk of machine learning, computer vision, distributed systems, etc.

All I ask is that you sign up on the Yahoo event page so that I’ve got an idea about numbers (need to book tables, blah blah blah).


Creating a language detection API in 30 minutes

Friday 24th October, 2008

This week I needed to test out the performance of the n-gram technique for statistical language detection, and only had about half an hour to do it, so I brought in the experts…

Lucene provides a huge number of text analysis features, but currently doesn’t provide out-of-the-box language identification.

Nutch, on the other hand, does. It’s kindly provided as a Nutch plugin. Testing that out, I discovered a dependency on a Hadoop Configuration class, so I went and dug out that JAR too.

So, libraries in hand, I knocked up a quick proof-of-concept, full of messy dependencies and ugly string manipulation.


import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.analysis.lang.LanguageIdentifier;

public class LanguageServlet extends HttpServlet
{
	private static LanguageIdentifier _identifier;

	public void init() throws ServletException
	{
		_identifier = new LanguageIdentifier(new Configuration());
	}

	protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
		this.execute(request, response);
	}

	protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
		this.execute(request, response);
	}

	// Shared handler for GET and POST: guesses the language of the "text" parameter.
	protected void execute(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException
	{
		response.setCharacterEncoding("utf-8");
		response.setContentType("text/xml");

		String query = request.getParameter("text");
		String language = "unknown";

		// Only call the identifier when there is some text to look at.
		if ( query != null && query.length() > 0 )
		{
			language = _identifier.identify(query);
		}

		// Build a tiny XML response carrying the guess as an attribute.
		StringBuilder b = new StringBuilder();
		b.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
		b.append("<languagequery guess=\""+language+"\">\n");
		b.append("</languagequery>\n");

		response.getWriter().write(b.toString());
	}
}

As you can see, the code is basic, but the actual method call is very simple indeed. One thing that’s missing is the ability to see what level of certainty the identifier assigns to its language guess, but you can add that yourself once you get comfortable enough with the technique to hunt down the Nutch source, or build your own.
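If you do fancy building your own, here is a rough sketch of the general n-gram idea with a score attached to each guess. To be clear, this is not the Nutch implementation: the class name NGramLanguageGuesser, the choice of character trigrams, and cosine similarity as the "certainty" measure are all my own assumptions for illustration, and the training samples you feed it would need to be far larger than a sentence or two to be any use in practice.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of n-gram language guessing; NOT the Nutch plugin's code.
class NGramLanguageGuesser {
    private static final int N = 3; // character trigrams (an assumed choice)

    // One normalised trigram profile per trained language.
    private final Map<String, Map<String, Double>> profiles = new HashMap<>();

    // A guess plus a similarity score between 0 (nothing shared) and 1 (identical profile).
    static final class Guess {
        final String language;
        final double score;
        Guess(String language, double score) {
            this.language = language;
            this.score = score;
        }
    }

    // Count character trigrams and scale the counts to a unit-length vector.
    static Map<String, Double> profile(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        String padded = " " + cleaned + " ";
        Map<String, Double> counts = new HashMap<>();
        for (int i = 0; i + N <= padded.length(); i++) {
            counts.merge(padded.substring(i, i + N), 1.0, Double::sum);
        }
        double sumOfSquares = 0.0;
        for (double c : counts.values()) {
            sumOfSquares += c * c;
        }
        final double norm = Math.sqrt(sumOfSquares);
        if (norm > 0) {
            counts.replaceAll((gram, c) -> c / norm);
        }
        return counts;
    }

    // Dot product of two unit-length profiles, i.e. their cosine similarity.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        return dot;
    }

    // Register (or replace) the profile for a language from a sample text.
    void train(String language, String sample) {
        profiles.put(language, profile(sample));
    }

    // Return the best-matching language and its score, or "unknown" with 0.0.
    Guess guess(String text) {
        Map<String, Double> query = profile(text);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double score = cosine(query, e.getValue());
            if (score > bestScore) {
                best = e.getKey();
                bestScore = score;
            }
        }
        return new Guess(best, bestScore);
    }

    public static void main(String[] args) {
        NGramLanguageGuesser guesser = new NGramLanguageGuesser();
        guesser.train("en", "the quick brown fox jumps over the lazy dog and the cat sat on the mat");
        guesser.train("fr", "le renard brun rapide saute par dessus le chien paresseux et le chat dort");
        Guess g = guesser.guess("the dog and the cat");
        System.out.println(g.language + " " + g.score);
    }
}
```

In the servlet above you could then write the score out as a second attribute alongside the guess, and treat anything below some threshold as "unknown". What that threshold should be depends entirely on your training data, so I won’t pretend to know it.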


Pixsta job specs

Wednesday 15th October, 2008

I’ve just noticed that we now have job specs for the positions we’re looking to fill.

Positions include:

The office is on Portobello Road in west London. We’re playing with some cutting-edge technology, in a positive and enthusiastic environment, so if you’re looking or have any questions then get in touch.