Terminology, similarity search, and other animals

Thursday 30th April, 2009

In the walkway-level study room of my old Physics department there’s a desk, where I once found this timeless conversation etched into the surface like a prehistoric wooden version of Twitter:

Protagonist: – “You’re a mook”

Antagonist: – “What’s a mook?”

Protagonist: – “Only a mook would say that”

Aside from any revelations about the emotional maturity of undergrad physicists, I think the lesson here is that it speeds up communication if both parties use the same terminology and know what it means.

My area of the CBIR industry has a terminology problem. I’d like to have a vocabulary of terms to describe the apps that are emerging weekly.

Visual Search, Image Search, or Visual Image Search

We’re working on image search, of a sort, although the image isn’t necessarily the object of the search, nor does image search describe only CBIR-enabled apps. We’re searching using visual attributes of images, but “visual search” as a term has already been marked out by companies that visualise text search.

Similarity search

This one seems to hit the consumer-facing nail on the head, for some apps at least. Technologically I’d include audio search and image fingerprinting apps like Shazam and SnapTell in my term, but for consumers there may be no obvious connection so perhaps this is a runner.

Media As Search Term (MAST)

Media As Search Term describes for me the group of apps that use a media object such as an image or an audio clip as a search query to generate results, either of similar objects or of instances of the same object. I think MAST sums up what I’d describe as my software peer group (media similarity and media fingerprinting apps), although it doesn’t seem as snappy as AJAX. Ah well.

Wolfram Alpha – poor user experience

Saturday 25th April, 2009

I have an apology to make. The title of this post leaves room for you to infer that I’ve actually used Wolfram Alpha… I haven’t.

What I *have* done (along with many others I’m sure) is sign up for access to their closed preview. Did I get access? Not yet. All I have so far is a series of unfulfilled promises.

Either this was an honest oversubscription which they’ve handled badly, or it was a deliberate trick to create hype and acquire a mailing list.

Regardless of which is closer to the mark, I refer Stephen Wolfram and Hector Zenil to Seth Godin.

I suspect that in 18 months’ time Wolfram Alpha will be languishing in the Hall of Forgotten Hype alongside the equally world-changing Project Ginger.

The legacy of old habits

Friday 24th April, 2009

I used to have a favourite bag with a mobile phone pocket on one side. I used it every day going to and from work, and I always put my phone in that pocket to keep it out of my jeans pockets and away from my knackers. I know the radiation is probably not dangerous but it’s sometimes easier to keep your irrational gut feeling pacified by doing what it wants.

Because that pocket was on the left of the bag I always slung it to the right, so that the pocket was near my hand, easy for me to get to. That’s why I always carry bags like that: bag on my right side, strap crossing over to my left shoulder.

So I now have this habit, even though I no longer have a bag with a phone pocket. It doesn’t feel right carrying a bag any other way.

Do you have any redundant habits?

Is the ‘Big Rewrite’ important to bond a new team?

Thursday 23rd April, 2009

We’ve just finished the initial rebuild of our web layer required to get Empora up and running (one day I’m going to stop shamelessly linking to our new project but today is not that day), and our team is now working much more effectively together.

It’s prompted us to think, would we have started working so well as a team if we hadn’t tackled the notorious “big rewrite” together?

Semantic Vectors – semantic indexing on top of Lucene

Thursday 23rd April, 2009

For anyone interested in adding semantic structures on top of their unstructured or semi-structured data I recently came across Dominic Widdows’ Semantic Vectors project.

It’s not a big enough project to survive the ‘contributor departure’ test, but it’s in active development and reading the code didn’t make my eyes bleed, so may be worth a look if that’s your bag.

Google Image Similarity first impressions

Tuesday 21st April, 2009

Right in line with my too-obvious-to-be-worth-anything prediction, Google have just released a Labs image similarity feature for Google Images. Others have commented on this already, but obviously this is hugely interesting for me because of my current work on Empora’s exploratory visual search, so I’m going to throw my tuppence into the ring as well.

Below are my first impressions.

Product impact

Google Similar Images (GSI) offers just one piece of functionality: the ability to find images that are similar to your selected image. You may only select images from their chosen set; there’s no dynamic image search capacity yet. Similar images are displayed either as a conventional result set when you click on “similar images”, or as a list of thumbnails in the header when you click through to see the original source.

The aims of this work will be (broadly):

  1. Keeping up with the Joneses. The other major search engines are working on similar functionality and Google can’t be seen to fall behind.
  2. User engagement. The more time you spend exploring on Google, the more their brand is burned into your subconscious.
  3. Later expansion of search monetisation. Adsense and Adwords get a better CTR than untargeted advertising because they adapt to the context of your search. If context can also be established visually there seems like strong potential for revenue.

Getting results

The quality of results for a project like this is always going to be variable, as the compromises between precision, recall, performance, and cost are going to continue to be sketched out in crayon until more mature vocabularies and toolsets are available. That said, Google need to keep users impressed, and they’ve done pretty well.

A few good examples:

A few bad examples:

Under the hood

Once the “qtype=similar” parameter is set in the URL, the only parameter that affects the set of similar images is the “tbnid”, which identifies the query image. The text query parameter does not seem to change the result set, only the accompanying UI. While this doesn’t allow us to draw any dramatic conclusions, it would allow them to pre-compute the results for each image.
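To make the pre-computation point concrete, here is a minimal sketch of how a result cache could be keyed if only “tbnid” matters. The host name, the “q” parameter name, and the function name are assumptions made up for illustration; only “qtype=similar” and “tbnid” come from the URLs I looked at, and I have no idea how Google actually does this.

```python
from urllib.parse import urlsplit, parse_qs

def similar_images_cache_key(url):
    """Return the tbnid for a similar-images URL, or None for any other URL.

    Hypothetical helper: if the result set depends only on tbnid, then
    tbnid alone is a sufficient key for a table of pre-computed results.
    """
    params = parse_qs(urlsplit(url).query)
    if params.get("qtype", [""])[0] != "similar":
        return None
    tbnids = params.get("tbnid")
    return tbnids[0] if tbnids else None

# Two made-up URLs that differ only in the text query (assumed to be "q").
url_a = "http://images.example.com/images?q=paris&qtype=similar&tbnid=abc123"
url_b = "http://images.example.com/images?q=eiffel+tower&qtype=similar&tbnid=abc123"

# Both map to the same key, so both could be served from the same stored results.
print(similar_images_cache_key(url_a) == similar_images_cache_key(url_b))
```

If the text query did influence the results, the key would need to include it and the table would be far larger, which is why the observation is suggestive even if it isn't conclusive.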

The first clear conclusion is metadata. Google have obviously been leveraging their formidable text index, and why not. The image similarity behaviour indicates that the textual metadata associated with images is being used to affect the results.  One of the clearest indicators is that they’re capable of recognising the same individual’s face as long as that person’s name is mentioned. Unnamed models don’t benefit from the same functionality.

My second insight is that they’re almost certainly using a structural technique such as Wavelet Decomposition to detect shapes within images. The dead give-away here is that search results are strongly biased towards photographs taken from the same angle.
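As a toy illustration of why a wavelet-style signature would favour photographs taken from the same angle, here is a one-dimensional Haar decomposition. This is a sketch of the general technique only, working on a single row of four made-up brightness values; it is not a claim about what Google actually runs.

```python
def haar_1d(values):
    """Full Haar decomposition of a list whose length is a power of two.

    Each pass replaces the current prefix with pairwise averages followed
    by pairwise half-differences, then repeats on the averages.
    """
    out = [float(v) for v in values]
    n = len(out)
    while n > 1:
        half = n // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = averages + details
        n = half
    return out

row = [9, 7, 3, 5]            # made-up brightness values
mirrored = list(reversed(row))

print(haar_1d(row))       # [6.0, 2.0, 1.0, -1.0]
print(haar_1d(mirrored))  # [6.0, -2.0, 1.0, -1.0]
```

The first coefficient (the overall average) survives mirroring, but the coarse detail coefficient flips sign. A similarity measure built on those detail coefficients would therefore treat the same scene shot from the opposite side as a different image, which fits the bias I saw in the results.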

I suspect that they’re not yet using a visual fingerprinting technique (such as FAST) to recognise photographs of the same object. If they were doing this already I suspect that they’d have used this method to remove duplicate images. This may well come later.


All in all my impression is that they’ve implemented this stuff well, but that there’s a lot more yet to come. Namely:

  • Handling of duplicates, i.e. separation between searching for similar images and searching for instances of the same image
  • A revenue stream

Site log-in, HTTPS or HTTP?

Saturday 18th April, 2009

Four months on from Monster‘s big security breach they’re still using plain-text HTTP for logging in, and for changing your password.

While that’s fairly common for lower-risk web apps from cash-strapped start-ups and solo developers, for someone like Monster it seems inappropriate. Monster run sites all over the world, have a clear revenue stream, and they store an awful lot of personal information. Exactly the kind of information that’d be useful to identity thieves.

A few SSL certificates won’t break their bank account unless it was breaking anyway.

It’s got me wondering though. What proportion of sites actually bother with SSL? Sadly the only stats I’ve found on SSL adoption are some vague hints at data from Netcraft (scroll to the bottom). These stats seem to indicate that only 60 out of the “top 1000” sites use SSL, but I’m not sure exactly what criteria they’re using to gather those numbers.

Has anyone got any idea what proportion of sites use HTTPS for login?
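If anyone fancies gathering the numbers themselves, the core check is small. The sketch below assumes you have already fetched a login page and pulled out the form’s action attribute (that part is left out); the function name and the example URLs are invented for illustration.

```python
from urllib.parse import urljoin, urlsplit

def login_posts_over_https(page_url, form_action):
    """Return True if the login form submits its credentials over HTTPS.

    form_action may be relative, so it is resolved against the page URL
    in the same way a browser would resolve it.
    """
    target = urljoin(page_url, form_action)
    return urlsplit(target).scheme == "https"

# Relative action on a plain HTTP page: credentials go out unencrypted.
print(login_posts_over_https("http://www.example.com/login", "/session"))

# Absolute HTTPS action: the submission itself is encrypted.
print(login_posts_over_https("http://www.example.com/login",
                             "https://secure.example.com/session"))
```

Note that a form served over plain HTTP can still be tampered with in transit even when it posts to HTTPS, so a stricter survey would also require the page itself to be served over HTTPS.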

ZK awarded the RSA’s Albert Medal

Friday 17th April, 2009

Just noticed the story over on the Justgiving blog that my old boss Zarine Kharas has been awarded the RSA‘s Albert Medal.

Since I’m so fashionably cynical I’m going to blatantly ignore all kings and queens in the list of previous winners. Even still, there are some fairly mammoth shoes to fill amongst the tabulated notables:

Good for you Zarine. What’s next?

Empora walk-through

Wednesday 8th April, 2009

The first flight is always a little wobbly, and true to form there was a slight hiccup for Empora over the weekend. Still, it’s been live for a week now and is holding up well. Considering how

So now all the excitement of the launch has settled down and we’re back into routine I think it’s time for a quick walk through the functionality (which won’t take that long since we haven’t put that much live yet; there’s a lot of interesting functionality left to come).

Hunting vs. gathering

Plenty of people go into a shop armed with a plan. They know what they want, or at least what specific need they have to fill. Others like to browse, look at what there is, what other people are doing, and generally wait for inspiration or recommendation. We’ve tried to fulfil both of those patterns using the standard “search vs. browse” split, but have tried to improve both.


When you view an item, for example this orange Ghibli bag, we obviously show a picture, description, etc. and link to the retailer. All standard stuff for a shopping aggregator. What we’ve added is that we also show the most visually similar items in our collection, according to three different sets of criteria:

  1. We show the most similar bags by shape, so that anyone who’s interested in a particular style or type of bag can see them straight away.
  2. We show bags in the most similar colours, so anyone who was drawn to that bag because of its colour can see lots of other bags that they may also be interested in.
  3. We show products from other categories in the same colour, in case users want to colour-coordinate.
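The two colour-based lists above can be sketched in a few lines. Everything here is an assumption for illustration: the catalogue entries are invented, each item is reduced to a single average RGB value, and plain Euclidean distance in RGB stands in for whatever measure a production system would use (RGB distance is not perceptually uniform). Shape similarity is left out entirely.

```python
import math

# Invented catalogue: one average colour per item.
CATALOGUE = [
    {"name": "orange tote", "category": "bags", "rgb": (230, 120, 30)},
    {"name": "red clutch", "category": "bags", "rgb": (200, 30, 40)},
    {"name": "navy satchel", "category": "bags", "rgb": (20, 30, 90)},
    {"name": "orange scarf", "category": "scarves", "rgb": (235, 130, 40)},
    {"name": "black boots", "category": "shoes", "rgb": (15, 15, 15)},
]

def similar_by_colour(query, catalogue, same_category, limit=2):
    """Rank other items by colour distance to the query item.

    same_category=True gives "bags in similar colours";
    same_category=False gives "other products in the same colour".
    """
    candidates = [
        item for item in catalogue
        if item is not query
        and (item["category"] == query["category"]) == same_category
    ]
    candidates.sort(key=lambda item: math.dist(query["rgb"], item["rgb"]))
    return [item["name"] for item in candidates[:limit]]

query = CATALOGUE[0]
print(similar_by_colour(query, CATALOGUE, same_category=True))
print(similar_by_colour(query, CATALOGUE, same_category=False))
```

For the orange tote this ranks the red clutch ahead of the navy satchel among bags, and the orange scarf ahead of the black boots among everything else. Note that math.dist needs Python 3.8 or later.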


In addition to the regular search options you’d expect (category, keywords, etc.) we also allow people to search by the overall colour of the item (from the top right corner of any page). Now in terms of technology I’m not particularly happy with this functionality yet, but I’m a perfectionist. It already performs a lot better visually than the Amazon equivalent*, and I know that we’ve got big improvements in the pipeline.

* To be fair to Amazon their results are better than they look. The products they show are available in the query colour, they just choose to show only the first image, so their results look broken by visual inspection.

Back to the physical shop metaphor

What we’re trying to do is help the searchers search by enabling them to search using visual data, effectively the equivalent of training all the staff in a shop to be able to answer questions like “have you got anything that goes with these shoes?”.

At the same time we’re trying to help the browsers by sorting each department by type and colour, so they always know where they’re going.

Obviously this is fairly fresh territory so there’ll always be wrinkles that need ironing out, but on the whole I think the trend towards smarter indexing is inevitable, and the indexing of visual information is part of that (that’s a whole other post).

Search / Lucene social meet-up

Monday 6th April, 2009

Having just finished our product launch (apologies for the gratuitous plug) I’ve now got time to worry about more important things, i.e. organising beers.

We’ll be in The Pelican pub just near the Pixsta offices in Notting Hill from 7pm on the 27th of April. If you’re keen to come along and talk about Lucene, or search in general, then please do. There may also be talk of machine learning, computer vision, distributed systems, etc.

All I ask is that you sign up on the Yahoo event page so that I’ve got an idea about numbers (need to book tables, blah blah blah).