Clicks aren’t a tip jar

Friday 22nd August, 2008

I can’t help thinking there’s something missing from Seth Godin’s post “Ads are the new online tip jar”.

If people click through ads to thank the content provider rather than because they’re interested in the product, it seems certain that conversion rates will decline. The value of those clicks will then go down, possibly as far as the level they were at before. Worse still, content providers who didn’t encourage those extra clicks could be worse off because of that decreasing click value.

Perhaps a safer alternative is to ask your readers to look at the ads, and only click through if they’re actually interested.

Hadoop User Group – UK

Wednesday 20th August, 2008

Yesterday I attended the Hadoop User Group in Clerkenwell, kindly organised by Johan Oskarsson of Last.fm. Below are some of my notes. I’ve paid more attention to the topics that affect me directly, so please don’t expect these notes to be comprehensive.

Hadoop Overview – Doug Cutting

Doug took us through the Hadoop timeline, starting with the problems faced by the Nutch team and the Google MapReduce paper. The Hadoop project enabled Nutch to overcome its scaling limitations and reach what at the time was web scale. Yahoo have now broken the TeraSort record using a Hadoop cluster.

Hadoop on Amazon S3/EC2 – Tom White

Tom highlighted some implementation issues from his experience deploying Hadoop clusters to AWS.

Predictably, he’s found that positioning job data closer to the Hadoop cluster improves performance: storing data in S3 gives roughly half the performance of storing it locally on the EC2 cluster.

In Tom’s experience the node failure rate is about 2%. This is a nuisance if it’s a regular node, but a serious problem if your name node fails… so as things stand you have to be prepared to deal with job failures. In light of that failure rate, Tom highlighted a potential gap in the Hadoop toolset: a way of reporting failures to subsequent dependent jobs.

SmartFrog – Steve Loughran

Steve presented SmartFrog, a system that provides configuration management for distributed systems. Hadoop comes into its own when you’re dealing with more than a handful of nodes, and even at those small scales a sensible nerd will want automated configuration management.

Hadoop usage at Last.fm – Martin, Elias and Johan

From the looks of things everyone at Last.fm is hip-deep in Hadoop. That makes a lot of sense, since one of Last.fm’s most important features is music recommendation… a difficult and woolly job at best. No wonder they churn through a lot of data. The main point I took away was that there are lots of people actively using Hadoop and thinking about ways to improve it.

Using Hadoop and Nutch for NLP – Miles Osborne

Miles teaches Natural Language Processing at the University of Edinburgh. With the Mahout sub-project, he’s been using Hadoop to process blogs (amongst other things) and build models of the structures used in natural language.

PostgreSQL to HBase replication – Tim Sell

Tim set up a replication system so that the team can run heavy queries on their data without endlessly harassing their PostgreSQL database. Because HBase lacks triggers, the reverse of this process is unlikely to be possible in the short term, but given the nature of the two systems it’s also less likely to be needed.
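The talk didn’t go into implementation details, but the usual shape of this kind of system is: a trigger in PostgreSQL appends each insert, update, or delete to a change-log table, and a poller replays that log as HBase puts and deletes. Here’s a minimal sketch of the replay step — the change-log fields (`pk`, `op`, `cols`) and the single `d` column family are my own invention, not Tim’s schema:

```python
def changes_to_mutations(changes):
    """Turn rows from a hypothetical change-log table (populated by an
    AFTER INSERT/UPDATE/DELETE trigger in PostgreSQL) into HBase-style
    mutations, one per change, keyed by the source row's primary key.

    Each source column becomes a qualifier in a single 'd' column family;
    a delete carries no cell payload.
    """
    mutations = []
    for change in changes:
        row_key = str(change["pk"])
        if change["op"] == "DELETE":
            mutations.append(("delete", row_key, None))
        else:  # INSERT and UPDATE both become puts; HBase puts overwrite
            cells = {f"d:{col}": str(val) for col, val in change["cols"].items()}
            mutations.append(("put", row_key, cells))
    return mutations
```

A real poller would read the log table in transaction order, apply each mutation through an HBase client, and record how far it had got so that replays are safe after a crash.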

Distributed Lucene – Mark Butler

One gap in the current open source stack is a solid choice of distributed search index. Mark took Doug Cutting’s proposal for a distributed index based on Lucene and implemented an alpha version of a working system.

This system features:

  • A name node
  • Heartbeats to detect failures
  • Transactional index updates, via versioning and committing across the cluster
  • Sharding handled via the client API rather than by the name node
  • Replication handled via data node leases

Dumbo – Klaas Bosteels

Klaas has implemented Dumbo, a system that lets you write disposable Hadoop streaming programs in Python. The aim is to reduce the amount of work involved in writing one-off jobs. This seems like it could become part of the standard Hadoop toolset, as it’s certain to be useful to a lot of people.
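For context, here’s the raw Hadoop streaming pattern that Dumbo wraps — a minimal word-count sketch in plain Python, not Dumbo’s actual API:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit a (word, 1) pair for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Sum the counts for each word. Hadoop's shuffle phase delivers
    pairs grouped by key; we sort here so the sketch also works locally."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)
```

In a real streaming job, the mapper and reducer run as separate scripts reading stdin and writing tab-separated key/value lines to stdout, wired together by the streaming jar’s -mapper and -reducer options; Dumbo’s value is hiding exactly that plumbing.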

Enigmatic Erlang

Monday 18th August, 2008

At the weekend I started learning Erlang.

As someone who didn’t study Computer Science at university, there are a lot of languages I ought to know about (Haskell, Lisp, etc.) but don’t. I often hear their names thrown around and make mental notes to go and look them up when I get time, but never do.

This time is different, mostly thanks to RabbitMQ and this Joe Armstrong interview.

More Outlook Insanity

Thursday 7th August, 2008

A real gem from Outlook this morning. It’s been sending automated emails (SVN commit notices if you’re interested) from people in the office to my junk mail folder.

Marking those emails as “not junk” seemed to have no effect whatsoever, other than popping up a dialogue box which promised to move the email to my inbox (it didn’t). Subsequent emails still went to my junk mail folder.

Naively I thought I’d add the addresses of my colleagues to my “safe senders list” so that they didn’t go to the junk folder.

You’d think that’d just add the sender to my safe senders list, right? Wrong :o)

Apparently the sender is from within my organisation, so I can’t add them to my safe senders list. Instead I’ll have to spend more of my life working around Outlook’s bad design. Today it made three mistakes that got in the way of my productivity:

  1. Sending useful email (that I’ve set up rules to filter) to the junk folder
  2. Failing to mark them as not junk despite explicit commands
  3. Putting up a comedy alert message instead of adding those users to my safe senders list

I don’t mean to pick on Outlook, it’s just that I have to use it all the time, and it gets in the way so often and for such daft reasons that I feel compelled to verbalise it. I’m sure the Outlook team are nice guys.

The State of Image Search

Wednesday 6th August, 2008

There’s currently a lack of direction in the image search products offered by the leaders in the field. Each offering is quite different, and none have fully realised revenue streams. This is a quick summary of the current state of play.

Text search by any other name

Some image search engines learn about images solely by leveraging image meta-data and nearby text in parent documents. It’s a little like identifying a photograph by the name on the album cover and the writing on the back of the photo. This was an ideal solution for text search engines like Google and Yahoo, who could leverage their existing data and infrastructure.
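Stripped to its essentials, this metadata approach is just an inverted index over whatever text surrounds each image. A toy sketch (the field names and AND-matching are my own simplification, not any particular engine’s design):

```python
def build_index(images):
    """Map each token in an image's surrounding text (alt text, filename,
    nearby page text) to the set of image URLs it appears with."""
    index = {}
    for url, text in images.items():
        for token in text.lower().split():
            index.setdefault(token, set()).add(url)
    return index

def search(index, query):
    """Return the images matching every query term (AND semantics)."""
    hits = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*hits) if hits else set()
```

The point of the sketch is what’s missing: nothing here ever looks at the pixels, which is exactly why these engines can reuse their existing text-search infrastructure.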

Getting smarter

Microsoft’s Live Search has recently started pushing the mainstream approach further by analysing the images themselves. For example, the Live Search team have added the ability for their system to recognise faces.

Playing the name game

The big players in search get revenue from serving up relevant advertising, but so far none of them have successfully monetised image search. Currently image search serves as a loss-leader that exists to support their search brands, a visible sign that they’ve still got chips in the big game.

That doesn’t mean they’re sitting on their hands. Both Microsoft and Google employ researchers in the area of image comparison and classification so expect big developments from them in 2009.

Pure image search start-ups

There are a few start-ups with an eye on the prize of being the first to monetise image search. Being smaller and more manoeuvrable than the big players they’ve got off the ground faster, but have yet to build up significant user numbers. Start-ups to keep an eye on include Picitup (find similar images, celebrity face comparison), Riya/Like (text-driven image search and product search), and the Toronto-based Idée Inc (copyright monitoring, colour-based search).

These guys are hungry for revenue, so I expect to have fresh news in Q4 this year.

The home team

I work for Pixsta, another image search start-up. We’ve pulled together the basis of a decent team, and should start taking over the world shortly. As for what we’re working on, I’ll write more when I know what’s safe to write about outside NDA :o)