Language detection using tri-grams

Tuesday 30th September, 2008

I recently came across this 2004 Python recipe by Douglas Bagnall that demonstrates a technique for statistical language detection using tri-grams.

Tri-grams (a special case of n-grams) are sequences of three characters. The idea is that, given a selection of documents in known languages, you can work out the frequency of each three-character sequence for each language. Once you’ve got a frequency distribution for each language, and an idea of which tri-grams regularly follow other tri-grams, you can then assess the probability that a body of text in an unknown language is written in any specific language.

The beauty of this system is that you don’t need to maintain large dictionaries for each language, just a single number for each tri-gram / language combination. You can see a similar system in action on Google’s AJAX Language API.
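
To make the idea concrete, here is a minimal sketch in Python. It is not Douglas Bagnall’s recipe: as a simplifying assumption it only counts tri-gram frequencies and compares them with cosine similarity, and it ignores which tri-grams follow which. The function names (trigram_profile, cosine_similarity, detect_language) and the two tiny training sentences are made up for illustration.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count every overlapping three-character sequence in the text."""
    # Lower-case and collapse whitespace so layout does not affect the counts.
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine_similarity(a, b):
    """Cosine of the angle between two tri-gram count vectors (0.0 to 1.0)."""
    dot = sum(count * b[tri] for tri, count in a.items() if tri in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_language(text, profiles):
    """Return the label of the reference profile most similar to the text."""
    unknown = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(unknown, profiles[lang]))


# Toy training samples (hypothetical); a real profile needs far more text.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "french": "le renard brun saute par dessus le chien paresseux et puis le chien dort au soleil",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}

print(detect_language("the dog and the fox", PROFILES))
print(detect_language("le chien et le renard", PROFILES))
```

Each language is reduced to a table of tri-gram counts, which is the “single number for each tri-gram / language combination” mentioned above. With samples this small the result is only reliable for text that closely resembles the training sentences.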


Pixsta hiring developers

Tuesday 30th September, 2008

Just a quick note to say that Pixsta is hiring developers in London. We’re looking for a range of developers, and ideally a QA engineer. I’d like to hear from software superheroes whatever their experience.

The focus of the work is likely to be search and image processing, mostly in Java, so that’d have to be something that you’d be able to run with. You also have to be eligible to work in the UK.

As for us, we’ve got a solid team, in a sector that’s growing quickly. We’ve got loads of exciting projects, which is why we need more people, and have lots of opportunity for creative talent to spread their wings. Get in touch.

Contact me at richard (at) pixsta (dot) com. No agencies please.


Visual search indexing using parallel GPUs

Tuesday 23rd September, 2008

Building an image search index takes quite a lot of processing power. Apart from all the usual mucking about that building a regular search index entails, you also have to download, resize, and analyse all the images that you want in your index. That analysis itself will consist of many different tasks, usually including the use of visual features to analyse colour, texture, shape, etc. and the use of classifiers to recognise specific objects.

Canadian visual search outfit Incogna have taken an interesting approach to image processing: from what I can tell, they are building their image search indexes using massively parallel GPUs. Having asked around the team, I gather that the technique has anecdotally produced some very successful tests, so I’ll be keeping an eye on these guys in future.


If you want service-oriented architecture, don’t use RPC

Friday 19th September, 2008

While maintaining some Java RMI code recently, I’ve been getting a feeling that something is wrong. I had the same feeling last year when I was working with WCF on Justgiving’s SOA setup.

As a programmer I find that I make technical decisions based on varying mixtures of gut feeling, anecdotal evidence, experience and rationality. All of those motivations are interesting in their own way (if you’re interested in the study of decision making I’d recommend Gary Klein’s book Sources of Power), but the one I like the most is gut feeling.

It’s sometimes hard to explain why I get a gut feeling, but in the case of RMI I know exactly why. It’s because it ties both sides of an interface to the same technology.

The whole point of an interface is that each side doesn’t need to know anything about the other side, other than what directly affects the transaction between them. Adding constraints beyond that transaction seems inherently wrong to me.

When I worked at Moreover, one of the things that I liked most about their architecture was that they could replace any component with one built using a completely different technology to the rest of the system. They had a loosely-coupled architecture with clear, open interfaces and minimal dependencies. Conventional RPC technologies such as RMI and WCF just can’t do that.

An interesting addendum to that point is Thrift (thanks Dimi) which is a cross-language RPC system looked after by Apache. I’ve got a gut feeling about that too, but that’s another post.


Spore: most pirated game ever

Sunday 14th September, 2008

Apparently (despite their DRM efforts) Spore is now the most pirated game ever. It’s a shame they didn’t pay attention to Cory Doctorow’s clearly argued explanation of how DRM is a waste of time and money for everyone but DRM vendors.


Innovation where tech meets design

Friday 12th September, 2008

I’ve recently discovered the work of Jonathan Feinberg, father, drummer, and IBM Researcher. His projects include Wordle and the Alphabet Synthesis Machine.

Wordle is an app that takes a passage of text and creates a word cloud based on your design sensibilities. The Alphabet Synthesis Machine takes a similar approach to designing characters for a fictional alphabet, taking a seed drawing and evolving that according to your constraints.

A word cloud created from the text of this post

It’s really comforting to know that there are people out there who can create interesting technology that’s easy to just pick up and play with. My over-arching experience using these tools is one of discovery, and enjoyment.

Even though I have no practical use for his creations, I’ll go back and play with them because they’re beautiful.

I think there’s a lesson in there for those of us who are involved in the creation of user interfaces in the commercial world.


Spore ruined by DRM?

Tuesday 9th September, 2008

I’ve been waiting to play Spore for years, literally. I even paid £5 a few months back for the creature creator demo, even though charging for a demo feels a little weird to me. But now it’s been released, I’m not going to buy it.

Why? Because I’d rather vote with my feet and make a point to computer game distributors that I don’t want DRM. I’d rather sacrifice a little bit of fun by spending my money on something else (let’s face it, it might not even be a sacrifice). It seems like I’m not alone: the Amazon review page for Spore (via ZDNet) is filled with complaints about the DRM they’ve bundled with it, pushing the rating down to a rather weak ‘one star’.

A few years ago, when Half Life 2 was released, I bought it straight away, then spent ages waiting for the game to phone home every time I wanted to play it because Steam’s DRM servers were under strain. I haven’t tried playing it recently, but if the DRM provider has switched off its servers for any reason then I’ll be unable to play my own game.

I don’t want games to phone home whenever I play them. It’s creepy, it’s a potential point of failure, and it’s downright rude.

The PC game market (at least the grumpy older gamer segment) is pissed off. Let’s see if the industry is listening.

