Iggins: Conceived, not yet born

Tuesday 4th November, 2008

So I’ve started my first open source project. Aiming small, I’ve chosen to implement a language identification tool on Appspot. It’s named Iggins, after ’Enry ’Iggins from Pygmalion.

Let’s see if I can make the time to get it running.

Language detection using tri-grams

Tuesday 30th September, 2008

I recently came across this 2004 Python recipe by Douglas Bagnall that demonstrates a technique for statistical language detection using tri-grams.

Tri-grams (a subset of n-grams) are basically three-character sequences. The idea is that, given a selection of documents in known languages, you can work out how often each three-character sequence occurs in each language. Once you’ve got a frequency distribution for each language, and an idea of which tri-grams regularly follow other tri-grams, you can then estimate the probability that a body of text in an unknown language is written in any specific language.
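
To make the idea concrete, here is a minimal sketch of my own in Python. It is not Douglas Bagnall’s recipe: it only counts how often each tri-gram occurs per language and ignores which tri-grams follow which. The function names (trigrams, train, score, detect) and the add-one smoothing are assumptions made purely for illustration.

```python
import math
from collections import Counter


def trigrams(text):
    """Return every overlapping three-character sequence in text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(samples):
    """Build one tri-gram frequency table per language.

    samples maps a language name to a training text in that language.
    """
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def score(text, counts, vocab_size):
    """Log-probability of text under one language's tri-gram counts.

    Add-one smoothing (an assumption of this sketch) stops a single
    unseen tri-gram from driving the probability to zero.
    """
    total = sum(counts.values())
    return sum(
        math.log((counts[t] + 1) / (total + vocab_size))
        for t in trigrams(text)
    )


def detect(text, profiles):
    """Return the language whose profile gives text the highest score."""
    vocab = set()
    for counts in profiles.values():
        vocab.update(counts)
    return max(
        profiles,
        key=lambda lang: score(text, profiles[lang], len(vocab)),
    )
```

Calling train() with a dictionary of sample texts keyed by language name gives the profiles, and detect() then picks the best-scoring language for a new piece of text. With only a sentence or two of training text per language the guesses will be shaky; real use would need much larger samples.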

The beauty of this system is that you don’t need to maintain large dictionaries for each language, just a single number for each tri-gram / language combination. You can see a similar system in action on Google’s AJAX Language API.

Enigmatic Erlang

Monday 18th August, 2008

At the weekend I started learning Erlang.

As someone who didn’t study Computer Science at university, I’m aware of a lot of languages which I ought to know about (Haskell, Lisp, etc.) but don’t. I often hear the names of these languages thrown around and make mental notes to go and look them up when I get time, but never do.

This time is different, mostly thanks to RabbitMQ and this Joe Armstrong interview.