Language detection using tri-grams

I recently came across this 2004 Python recipe by Douglas Bagnall that demonstrates a technique for statistical language detection using tri-grams.

Tri-grams (the special case of n-grams where n is 3) are simply sequences of three characters. The idea is that, given a selection of documents in known languages, you can work out how often each three-character sequence occurs in each language. Once you’ve got a frequency distribution for each language, and an idea of which tri-grams regularly follow other tri-grams, you can estimate the probability that a body of text in an unknown language is written in any specific language.
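
To make the idea concrete, here is a minimal sketch of the frequency-counting part in Python. It is not Douglas Bagnall's recipe: it scores text using plain tri-gram frequencies with add-one smoothing and ignores which tri-grams follow which. The function names (`trigram_counts`, `train`, `score`, `detect`) and the tiny training samples in the usage note are my own assumptions for illustration.

```python
import math
from collections import Counter


def trigram_counts(text):
    """Count every overlapping three-character sequence in text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def train(samples):
    """Build one tri-gram frequency table per language.

    samples is a dict mapping a language name to training text
    in that language (hypothetical input format).
    """
    return {lang: trigram_counts(text) for lang, text in samples.items()}


def score(text, counts, vocab_size):
    """Log-probability of text under one language's tri-gram frequencies."""
    total = sum(counts.values())
    logp = 0.0
    for tri, n in trigram_counts(text).items():
        # Add-one smoothing, so an unseen tri-gram lowers the score
        # without reducing the probability to zero.
        p = (counts[tri] + 1) / (total + vocab_size)
        logp += n * math.log(p)
    return logp


def detect(text, models):
    """Return the language whose tri-gram table gives text the best score."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    return max(models, key=lambda lang: score(text, models[lang], len(vocab)))
```

You would call it along the lines of `models = train({"english": english_text, "french": french_text})` followed by `detect("some unknown text", models)`. With training text as small as a sentence or two the result is only a rough guess; real use needs a decent-sized sample for each language.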

The beauty of this system is that you don’t need to maintain large dictionaries for each language, just a single number for each tri-gram / language combination. You can see a similar system in action on Google’s AJAX Language API.
