Terminology, similarity search, and other animals

Thursday 30th April, 2009

In the walkway-level study room of my old Physics department there’s a desk, where I once found this timeless conversation etched into the surface like a prehistoric wooden version of Twitter:

Protagonist: – “You’re a mook”

Antagonist: – “What’s a mook?”

Protagonist: – “Only a mook would say that”

Aside from any revelations about the emotional maturity of undergrad physicists, I think the lesson here is that it speeds up communication if both parties use the same terminology and know what it means.

My area of the CBIR industry has a terminology problem. I’d like to have a vocabulary of terms to describe the apps that are emerging weekly.

Visual Search, Image Search, or Visual Image Search

We’re working on image search, of a sort, although the image isn’t necessarily the object of the search, nor does image search describe only CBIR-enabled apps. We’re searching using visual attributes of images, but “visual search” as a term has already been marked out by companies that visualise text search.

Similarity search

This one seems to hit the consumer-facing nail on the head, for some apps at least. Technologically I’d include audio search and image fingerprinting apps like Shazam and SnapTell in my term, but for consumers there may be no obvious connection so perhaps this is a runner.

Media As Search Term (MAST)

Media As Search Term describes for me the group of apps that use a media object such as an image or an audio clip as a search query to generate results, either of similar objects or of instances of the same object. I think MAST sums up what I’d describe as my software peer group (media similarity and media fingerprinting apps), although it doesn’t seem as snappy as AJAX. Ah well.

Wolfram Alpha – poor user experience

Saturday 25th April, 2009

I have an apology to make. The title of this post leaves room for you to infer that I’ve actually used Wolfram Alpha… I haven’t.

What I *have* done (along with many others, I’m sure) is sign up for access to their closed preview. Did I get access? Not yet. All I have so far is a series of unfulfilled promises.

Either this was an honest oversubscription which they’ve handled badly, or it was a deliberate trick to create hype and acquire a mailing list.

Regardless of which is closer to the mark, I refer Stephen Wolfram and Hector Zenil to Seth Godin.

I suspect that in 18 months’ time Wolfram will be languishing in the Hall of Forgotten Hype alongside the equally world-changing Project Ginger.

The legacy of old habits

Friday 24th April, 2009

I used to have a favourite bag with a mobile phone pocket on one side. I used it every day going to and from work, and I always put my phone in that pocket to keep it out of my jeans pockets and away from my knackers. I know the radiation is probably not dangerous but it’s sometimes easier to keep your irrational gut feeling pacified by doing what it wants.

Because that pocket was on the left of the bag I always slung it to the right, so that the pocket was near my hand, easy for me to get to. That’s why I always carry bags like that: bag on my right side, strap crossing over to my left shoulder.

So I now have this habit, even though I no longer have a bag with a phone pocket. It doesn’t feel right carrying a bag any other way.

Do you have any redundant habits?

Is the ‘Big Rewrite’ important to bond a new team?

Thursday 23rd April, 2009

We’ve just finished the initial rebuild of our web layer required to get Empora up and running (one day I’m going to stop shamelessly linking to our new project but today is not that day), and our team is now working much more effectively together.

It’s prompted us to think, would we have started working so well as a team if we hadn’t tackled the notorious “big rewrite” together?

Semantic Vectors – semantic indexing on top of Lucene

Thursday 23rd April, 2009

For anyone interested in adding semantic structures on top of their unstructured or semi-structured data I recently came across Dominic Widdows’ Semantic Vectors project.

It’s not a big enough project to survive the ‘contributor departure’ test, but it’s in active development and reading the code didn’t make my eyes bleed, so it may be worth a look if that’s your bag.

Google Image Similarity first impressions

Tuesday 21st April, 2009

Right in line with my too-obvious-to-be-worth-anything prediction, Google have just released a Labs image similarity feature for Google Images. Others have commented on this already, but obviously this is hugely interesting for me because of my current work on Empora’s exploratory visual search, so I’m going to throw my tuppence into the ring as well.

Below are my first impressions.

Product impact

Google Similar Images (GSI) offers just one piece of functionality: the ability to find images that are similar to your selected image. You may only select images from their chosen set; there’s no dynamic image search capacity yet. Similar images are displayed either as a conventional result set when you click on “similar images”, or as a list of thumbnails in the header when you click through to see the original source.

The aims of this work will be (broadly):

  1. Keeping up with the Joneses. The other major search engines are working on similar functionality and Google can’t be seen to fall behind.
  2. User engagement. The more time you spend exploring on Google, the more their brand is burned into your subconscious.
  3. Later expansion of search monetisation. Adsense and Adwords get a better CTR than untargeted advertising because they adapt to the context of your search. If context can also be established visually there seems like strong potential for revenue.

Getting results

The quality of results for a project like this is always going to be variable, as the compromises between precision, recall, performance, and cost are going to continue to be sketched out in crayon until more mature vocabularies and toolsets are available. That said, Google need to keep users impressed, and they’ve done pretty well.

A few good examples:

A few bad examples:

Under the hood

Once the “qtype=similar” parameter is set in the URL, the only parameter that affects the set of similar images is the “tbnid” which identifies the query image. The text query parameter does not seem to change the result set, only changing the accompanying UI. While this doesn’t allow us to draw any dramatic conclusions it would allow them to pre-compute the results for each image.
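To make the pre-computation point concrete, here is a minimal sketch of how a cache key could be derived if only “tbnid” matters. The function name, host, path and tbnid value are all made up for illustration; this is my reading of the URL behaviour, not anything Google have published.

```python
from urllib.parse import urlparse, parse_qs

def similar_cache_key(url):
    """Return the tbnid for a similar-images request, or None otherwise.

    Assumption: the text query ("q") is ignored, so the tbnid alone
    identifies a result set that could be computed ahead of time.
    """
    params = parse_qs(urlparse(url).query)
    if params.get("qtype", [""])[0] != "similar":
        return None
    return params.get("tbnid", [""])[0] or None

# Hypothetical URLs: same query image, different text queries.
a = "http://images.example.com/images?q=red+shoes&qtype=similar&tbnid=abc123"
b = "http://images.example.com/images?q=paris&qtype=similar&tbnid=abc123"

# Both requests map to the same key, so one stored result set serves both.
print(similar_cache_key(a) == similar_cache_key(b))
```

If the text query did influence the results, the key would have to include it and the number of result sets to store would grow with every distinct query.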

The first clear conclusion is that metadata is involved. Google have obviously been leveraging their formidable text index, and why not. The image similarity behaviour indicates that the textual metadata associated with images is being used to affect the results. One of the clearest indicators is that they’re capable of recognising the same individual’s face as long as that person’s name is mentioned. Unnamed models don’t benefit from the same functionality.

My second insight is that they’re almost certainly using a structural technique such as Wavelet Decomposition to detect shapes within images. The dead give-away here is that search results are strongly biased towards photographs taken from the same angle.
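For anyone who hasn’t met wavelets, here is a toy sketch of one level of a Haar decomposition (the simplest wavelet) on a tiny greyscale grid. It is only meant to show what a structural technique captures; the unnormalised averaging variant, the function names and the 4x4 image are my own choices, and I have no evidence that this is what Google actually run.

```python
def haar_1d(values):
    """One Haar level: pairwise averages followed by pairwise differences.

    Assumes an even number of values.
    """
    pairs = range(0, len(values), 2)
    averages = [(values[i] + values[i + 1]) / 2 for i in pairs]
    details = [(values[i] - values[i + 1]) / 2 for i in pairs]
    return averages + details

def haar_2d(image):
    """One level of a 2D Haar decomposition: transform rows, then columns."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to rows

# Made-up image: bright left half, dark right half.
image = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]

coeffs = haar_2d(image)
for row in coeffs:
    print(row)
```

The top-left quadrant of the output is a half-resolution copy of the image and the other quadrants hold the edge detail. Two photos with the same coarse layout end up with similar top-left coefficients, which fits the bias towards pictures taken from the same angle.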

I suspect that they’re not yet using a visual fingerprinting technique (such as FAST) to recognise photographs of the same object. If they were doing this already I suspect that they’d have used this method to remove duplicate images. This may well come later.
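As a rough illustration of the duplicate-removal half of that idea, here is a sketch using an “average hash”. To be clear, this is a much simpler technique than FAST and I have swapped it in purely because it fits in a few lines; the pixel grids are invented, and it only catches near-identical copies of an image, not different photographs of the same object.

```python
def average_hash(pixels):
    """One bit per pixel: 1 if brighter than the image mean, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(a, b):
    """Number of positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))

# Made-up 4x4 greyscale thumbnails.
original = [[200, 190, 30, 20], [210, 180, 25, 35],
            [40, 30, 220, 200], [20, 45, 190, 210]]
# The same picture after some lossy recompression noise.
recompressed = [[198, 192, 33, 18], [207, 183, 22, 38],
                [43, 28, 217, 203], [23, 42, 193, 207]]
# A different picture with the light and dark areas swapped.
different = [[20, 30, 200, 210], [25, 35, 190, 220],
             [200, 190, 30, 20], [210, 180, 40, 45]]

h = average_hash(original)
print(hamming(h, average_hash(recompressed)))  # small distance: likely duplicate
print(hamming(h, average_hash(different)))     # large distance: different image
```

A real system would shrink each image to a fixed small size first and treat anything under some distance threshold as a duplicate; picking that threshold is the fiddly part.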


All in all my impression is that they’ve implemented this stuff well, but that there’s a lot more yet to come. Namely:

  • Handling of duplicates, i.e. separation between searching for similar images and searching for instances of the same image
  • A revenue stream

Site log-in, HTTPS or HTTP?

Saturday 18th April, 2009

Four months on from Monster’s big security breach they’re still using plain-text HTTP for logging in, and for changing your password.

While that’s fairly common for lower-risk web apps from cash-strapped start-ups and solo developers, for someone like Monster it seems inappropriate. Monster run sites all over the world, have a clear revenue stream, and they store an awful lot of personal information. Exactly the kind of information that’d be useful to identity thieves.

A few SSL certificates won’t break their bank account unless it was breaking anyway.

It’s got me wondering though. What proportion of sites actually bother with SSL? Sadly the only stats I’ve found on SSL adoption are some vague hints at data from Netcraft (scroll to the bottom). These stats seem to indicate that only 60 out of the “top 1000” sites use SSL, but I’m not sure exactly what criteria they’re using to gather those numbers.

Has anyone got any idea what proportion of sites use HTTPS for login?
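In the absence of published numbers, one rough way to measure it yourself would be to fetch each site’s login page and check where the password form submits to. Below is a minimal sketch of that check using only the Python standard library. The class and function names are mine, the HTML is a made-up example, and it ignores forms built by JavaScript, so treat it as a starting point and not a survey tool.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LoginFormFinder(HTMLParser):
    """Collect the action of every form that contains a password field."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.current_action = ""
        self.login_actions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
            self.current_action = attrs.get("action") or ""
        elif tag == "input" and self.in_form:
            if (attrs.get("type") or "").lower() == "password":
                self.login_actions.append(self.current_action)

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def login_uses_https(page_url, html):
    """True/False if password forms post to https; None if none are found."""
    finder = LoginFormFinder()
    finder.feed(html)
    if not finder.login_actions:
        return None
    # Relative (or empty) actions resolve against the page's own URL.
    targets = [urljoin(page_url, action) for action in finder.login_actions]
    return all(urlparse(t).scheme == "https" for t in targets)

# Made-up login page with a relative form action.
page = ('<form action="/login" method="post">'
        '<input type="text" name="user">'
        '<input type="password" name="pass"></form>')

print(login_uses_https("http://www.example.com/", page))
print(login_uses_https("https://www.example.com/", page))
```

Note that a form served over HTTP but posting to HTTPS would pass this check, even though the page itself could still be tampered with in transit, so a stricter survey would also require the login page to be served over HTTPS.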