Wednesday 16th December, 2009
To avoid the somewhat annoying (and hopefully temporary) problem that not everyone in the world reads my blog, I’ve created a new home for our search social meet-ups over on Meetup.com.
Sign up on the London Search Social page to get notifications of events.
Thursday 23rd April, 2009
For anyone interested in adding semantic structures on top of their unstructured or semi-structured data, I recently came across Dominic Widdows’ Semantic Vectors project.
It’s not a big enough project to survive the ‘contributor departure’ test, but it’s in active development and reading the code didn’t make my eyes bleed, so it may be worth a look if that’s your bag.
Friday 24th October, 2008
This week I needed to test out the performance of the n-gram technique for statistical language detection, and only had about half an hour to do it, so I brought in the experts…
Lucene provides a huge number of text analysis features, but currently doesn’t provide out-of-the-box language identification.
Nutch, on the other hand, does. It’s kindly provided as a Nutch plugin. Testing that out, I discovered a dependency on a Hadoop Configuration class, so I went and dug out that JAR too.
So, libraries in hand, I knocked up a quick proof-of-concept, full of messy dependencies and ugly string manipulation.
public class LanguageServlet extends HttpServlet {
    // Shared identifier, created once when the servlet starts up
    private static LanguageIdentifier _identifier;

    public void init() throws ServletException {
        _identifier = new LanguageIdentifier(new Configuration());
    }

    protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        execute(request, response);
    }

    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        execute(request, response);
    }

    protected void execute(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        String query = request.getParameter("text");
        String language = "unknown";
        if ( query != null && query.length() > 0 )
            language = _identifier.identify(query);
        StringBuffer b = new StringBuffer();
        b.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        // the rest of the method (building and writing the XML response) is cut off in the original listing
    }
}
As you can see, the code is basic, and the actual method call is very simple indeed. One thing that’s missing is the ability to see what level of certainty the identifier assigns to its language guess, but you can add that yourself once you get comfortable enough with the technique to hunt down the Nutch source, or build your own.
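If you fancy the "build your own" route, here is a rough sketch of what a toy n-gram guesser with a certainty score might look like. To be clear, this is not how the Nutch plugin works internally: the class name NGramLanguageGuesser, the Guess holder, the train method and the choice of cosine similarity over character trigram counts are all my own assumptions for illustration, and a real identifier would need far larger training samples.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a toy character-trigram language guesser (hypothetical, not the Nutch code). */
class NGramLanguageGuesser {

    /** A language guess plus a rough certainty between 0 and 1. */
    static final class Guess {
        final String language;
        final double certainty;

        Guess(String language, double certainty) {
            this.language = language;
            this.certainty = certainty;
        }
    }

    // One trigram-count profile per trained language
    private final Map<String, Map<String, Double>> profiles = new HashMap<>();

    /** Stores a trigram profile built from a sample of text in the given language. */
    void train(String language, String sample) {
        profiles.put(language, profile(sample));
    }

    /** Returns the best-matching language, or "unknown" if nothing matches at all. */
    Guess identify(String text) {
        Map<String, Double> target = profile(text);
        String best = "unknown";
        double bestScore = 0.0;
        double total = 0.0;
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            double score = cosine(target, e.getValue());
            total += score;
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        // Certainty is the winner's share of all similarity scores (an assumed, simplistic measure)
        double certainty = total > 0.0 ? bestScore / total : 0.0;
        return new Guess(best, certainty);
    }

    /** Counts character trigrams in lower-cased, whitespace-normalised text padded with spaces. */
    static Map<String, Double> profile(String text) {
        String s = " " + text.toLowerCase().replaceAll("\\s+", " ").trim() + " ";
        Map<String, Double> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1.0, Double::sum);
        }
        return counts;
    }

    /** Cosine similarity between two sparse trigram-count vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }
}
```

To use it, call train once per language with some sample text, then call identify on the query and read both the language and certainty fields of the result. Very short queries share few trigrams with any profile, so treat a low certainty as a hint to fall back to "unknown" rather than trusting the guess.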
Wednesday 15th October, 2008
I’ve just noticed that we now have job specs for the positions we’re looking to fill.
The office is on Portobello Road in west London. We’re playing with some cutting-edge technology, in a positive and enthusiastic environment, so if you’re looking or have any questions then get in touch.