Guest post – Similarity search: The Two Shoe Problem

Thursday 30th July, 2009

Today I’m introducing my first ever guest post, written by Pixsta‘s own Rohit Patange about some great work he’s been doing with the guidance of Tuncer Aysal. You’ll be able to see the results of their work shortly on our consumer-facing site Empora. – RM

We at Pixsta are interested in understanding what is in an image (recognise and extract) and do so in an automated way that involves a minimum amount of human input.

Our raw data (images and associated textual information) come from a variety of retailers with considerable variation in terms of data formats and quality. Some retailer images are squeaky clean with white backgrounds and a clear product depiction while others have multiple views of the product, very noisy backgrounds, models, mannequins and other such distracting objects. Since we only care about the product, an essential processing step involves identification of all image parts and the isolation of individual products, if several are present in the retailer image.

The n-shoe case:

Let’s take the case of retailer images with multiple product views. This is most commonly encountered in shoe images.  Let us call each of the product views a ‘sub-image’.

When we talk about similar shoes we talk about a shoe being similar to the other (note the singular). We have to disregard how the shoe is presented in the image, the position of the sub-images, the orientation and other noise. If we do not do so, image matching technology tends to pick out images with similar presentation rather than similar shoes. Typically a retailer image (a shoe they are trying to sell) will have a pair of sub-images of shoes in different viewing angles. Pictorially with standard image matching we get the following results for a query image on the left:

Visual similarity query showing product presentation affecting results

Even though the image database contains images like:

Two shoes pointing to the right

These are not in the result set despite them being much closer matches, because of the presentation and varying number of sub-images.  To overcome this drawback, we have to extract the sub-image which best represents the product for each of the images and then compare these sub-images. For the sub-image to be extracted, the image will need to go through the following processing steps:

  • Determine which of the sub-images is the best represents the shoe.
  • Extract that sub-image.
  • Determine the shoe orientation in that sub-image.
  • Standardise the image by rotation, flipping and scaling.

All the product images (shoes in this case) go through this process of standardisation, resulting in a uniform set of images. Pictorially the input and the output image of the standardisation process are:

Shoes segmented and standardised to point right

Let’s look at the procedure in more detail assuming that the image has been segmented into background and foreground.

  • The first step is to identify all the sub-images on the foreground. The foreground pixels of the images are labelled in such a way that different sub-images have different label to mark them as distinct.
  • After the first iteration of labelling there is a high possibility that a sub-image is marked with 2 or more labels. Therefore all connected labels have to be merged.
    Segmented shoe images
  • The third step is to determine which of the sub-images is of interest; that is picking the right label.
    Choosing an image segment
  • Once the right sub-image is extracted the orientation of this sub-image is corrected to match a predefined standard to remove the differences in the terms of size of the product image, orientation (the direction the shoe is pointing towards) and the position of the shoe (sub-image) within the image.
    Single shoe pointing to the right

All product images (shoes in this case) go through this process before the representative information from the image is extracted for comparison. Now the results for the query image will look like:

Resulting query showing standardised similarity

Generally there are two shoes in an image. But the method can be extended to ‘n’ shoes.

Terminology, similarity search, and other animals

Thursday 30th April, 2009

In the walkway-level study room of my old Physics department there’s a desk, where I once found this timeless conversation etched into the surface like a prehistoric wooden version of Twitter:

Protagonist: – “You’re a mook”

Antagonist: – “What’s a mook?”

Protagonist: – “Only a mook would say that”

Aside from any revelations about the emotional maturity of undergrad physicists, I think the lesson here is that it speeds up comminucation if both parties use the same terminology and know what it means.

My area of the CBIR industry has a terminology problem. I’d like to have a vocabulary of terms to describe the apps that are emerging weekly.

Visual Search, Image Search, or Visual Image Search

We’re working on image search, of a sort, although the image isn’t necessarily the object of the search, nor does image search describe only CBIR-enabled apps. We’re searching using visual attributes of images, but “visual search” as a term has already been marked out by companies that visualise text search.

Similarity search

This one seems to hit the consumer-facing nail on the head, for some apps at least. Technologically I’d include audio search and image fingerprinting apps like Shazam and SnapTell in my term, but for consumers there may be no obvious connection so perhaps this is a runner.

Media As Search Term (MAST)

Media As SearchTerm describes for me the group of apps that use a media object such as an image or an audio clip as a search query to generate results, either of similar objects or of instances of  the same object. I think MAST sums up what I’d describe as my software peer group (media similarity and media fingerprinting apps), although it doesn’t seem as snappy as AJAX. Ah well.