I recently came across this 2004 Python recipe by Douglas Bagnall that demonstrates a technique for statistical language detection using tri-grams.
Tri-grams (n-grams with n = 3) are simply three-character sequences. The idea is that, given a selection of documents in known languages, you can work out the frequency of each three-character sequence in each language. Once you’ve got a frequency distribution for each language, and an idea of which tri-grams regularly follow other tri-grams, you can then estimate the probability that a body of text in an unknown language is written in any specific language.
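To make the frequency part of that idea concrete, here is a minimal sketch in Python. It is not the code from the recipe: the function names (`trigrams`, `train`, `score`, `detect`) are my own, it scores text by plain tri-gram frequency with add-one smoothing, and it leaves out the "which tri-gram follows which" part entirely.

```python
import math
from collections import Counter


def trigrams(text):
    """Return every overlapping three-character sequence in text."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(samples):
    """Build one tri-gram count table per language from {language: sample_text}."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def score(text, counts, vocab_size):
    """Log-probability of text under one language's counts (add-one smoothing)."""
    total = sum(counts.values())
    return sum(
        math.log((counts[gram] + 1) / (total + vocab_size))
        for gram in trigrams(text)
    )


def detect(text, models):
    """Return the language whose tri-gram counts give text the highest score."""
    # Smoothing needs the number of distinct tri-grams seen across all languages.
    vocab_size = len(set().union(*models.values()))
    return max(models, key=lambda lang: score(text, models[lang], vocab_size))
```

You would call `train` once with a dictionary of sample text per language, then pass the result to `detect` along with the unknown text. With samples as tiny as a sentence or two the guesses are fragile, so treat this as an illustration of the mechanism and not as a usable detector.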
The beauty of this system is that you don’t need to maintain large dictionaries for each language, just a single number for each tri-gram / language combination. You can see a similar system in action on Google’s AJAX Language API.