Yesterday I attended the Hadoop User Group in Clerkenwell, kindly organised by Johan Oskarsson of Last FM. Below are some of my notes. I’ve paid more attention to topics that affect me directly so please don’t expect these notes to be comprehensive.
Hadoop Overview – Doug Cutting
Doug took us through the Hadoop timeline, starting with the problems faced by the Nutch team and the Google MapReduce paper. The Hadoop project enabled Nutch to overcome its scaling limitations and reach what was, at the time, web scale. Yahoo have now broken the Terasort record using a Hadoop cluster.
Hadoop on Amazon S3/EC2 – Tom White
Tom highlighted a few implementation issues from his experience of deploying Hadoop clusters to AWS.
Predictably, he’s found that positioning job data closer to the Hadoop cluster improves performance. For example, storing data in S3 gives approximately 50% of the performance of storing the data locally in the EC2 cluster.
Tom’s experience of node failure is about 2%. This is a nuisance if it’s a regular node, but a serious problem if your name node fails… so as things stand you have to be prepared to deal with job failures. In light of that failure rate, Tom highlighted a potential gap in the Hadoop toolset, namely a way of reporting failures to subsequent dependent jobs.
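To make that gap concrete, here is a minimal Python sketch of what failure reporting between dependent jobs could look like. It is not part of Hadoop and was not shown in the talk. The `run_pipeline` helper, the job names and the simulated failure are all hypothetical, and the jobs are plain callables standing in for real job submissions.

```python
def run_pipeline(jobs, deps):
    """Run jobs in order, skipping any job whose upstream job did not succeed.

    jobs: dict of job name -> zero-argument callable (a hypothetical stand-in
          for submitting a Hadoop job), listed in a valid execution order.
    deps: dict of job name -> list of job names it depends on.
    """
    status = {}
    for name, job in jobs.items():
        # A dependency that failed, was skipped, or has not run blocks this job.
        blocked = [d for d in deps.get(name, []) if status.get(d) != "ok"]
        if blocked:
            status[name] = "skipped"
            continue
        try:
            job()
            status[name] = "ok"
        except Exception:
            status[name] = "failed"
    return status


def failing_job():
    # Simulated failure, e.g. a lost node part-way through the job.
    raise RuntimeError("node lost")


example_status = run_pipeline(
    {"extract": lambda: None, "aggregate": failing_job, "report": lambda: None},
    {"aggregate": ["extract"], "report": ["aggregate"]},
)
```

In this sketch the failure of `aggregate` is recorded and the downstream `report` job is skipped rather than run against incomplete output.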
Smartfrog – Steve Loughran
Steve presented a system called Smartfrog which enables configuration management for distributed systems. Hadoop comes into its own when you’re dealing with more than a handful of nodes, and even at those small scales a sensible nerd will need automated configuration management.
Hadoop usage at Last FM – Martin, Elias and Johan
From the looks of things everyone at Last FM is hip-deep in Hadoop. That makes a lot of sense since one of the most important features of Last FM is the ability to recommend music… a difficult and woolly job at best. No wonder they churn through a lot of data. The main point I took away from this was that there are lots of people actively using Hadoop and thinking about ways to improve it.
Using Hadoop and Nutch for NLP – Miles Osborne
Miles teaches Natural Language Processing at the University of Edinburgh. With the Mahout sub-project, he’s been using Hadoop to process blogs (amongst other things) and build models of the structures used in natural language.
PostgreSQL to HBase replication – Tim Sell
Tim set up a replication system so that the team can run heavy queries on their data without endlessly harassing their PostgreSQL database. Because HBase lacks triggers, it’s unlikely that the reverse of this process will be possible in the short term, but given the nature of the two systems it’s also less likely to be required.
Distributed Lucene – Mark Butler
One of the failings of the current open source stack is the lack of a solid choice for a distributed search index. Mark took Doug Cutting’s proposal for a distributed index based on Lucene and implemented an alpha version of a working system.
This system features:
- Name node
- Heartbeats to detect failures
- Updates indexes transactionally by versioning and committing across the cluster
- Sharding is handled via the client API rather than by the name node
- Replication is handled via data node leases
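As a rough illustration of two of the features above, here is a small Python sketch of heartbeat-based failure detection and client-side sharding. It is not taken from Mark’s implementation. The function names, the 30 second timeout, the shard names and the hash-based placement are all assumptions made for the example.

```python
import hashlib


def dead_nodes(last_heartbeat, now, timeout=30.0):
    """Return nodes whose last heartbeat is older than `timeout` seconds.

    last_heartbeat: dict of node name -> timestamp of its latest heartbeat.
    The timeout value is an arbitrary choice for this sketch.
    """
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)


def shard_for(doc_id, shards):
    """Pick a shard on the client using a stable hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


shards = ["shard-0", "shard-1", "shard-2"]  # hypothetical shard names
heartbeats = {"node-a": 100.0, "node-b": 60.0}  # hypothetical timestamps
failed = dead_nodes(heartbeats, now=105.0)  # node-b has been silent for 45s
target = shard_for("doc-42", shards)  # the same id always maps to the same shard
```

Doing the placement in the client means the name node does not have to be consulted to route each document, which fits the description of sharding being handled via the client API.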
Dumbo – Klaas Bosteels
Klaas has implemented Dumbo, a system that allows you to write disposable Hadoop streaming programs in Python. The aim was to reduce the amount of work involved in writing one-off jobs. This seems like it could become part of the Hadoop toolset as it’s certain to be useful to a lot of people.
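To give a feel for how small such a one-off job can be, here is a word count written as a mapper and a reducer in plain Python. This is a sketch of the general shape of a streaming job, not Dumbo’s actual API. The `run_locally` driver is a hypothetical stand-in for the sort and shuffle that Hadoop performs between the two phases.

```python
from collections import defaultdict


def mapper(key, value):
    # Emit (word, 1) for every word in the input line.
    for word in value.split():
        yield word.lower(), 1


def reducer(key, values):
    # Sum the counts collected for one word.
    yield key, sum(values)


def run_locally(lines):
    """Simulate map, shuffle and reduce in-process (hypothetical helper)."""
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for k, v in mapper(offset, line):
            grouped[k].append(v)
    result = {}
    for k in sorted(grouped):
        for out_key, out_value in reducer(k, grouped[k]):
            result[out_key] = out_value
    return result


counts = run_locally(["Hadoop streaming", "hadoop in Python"])
```

The two functions hold all of the job-specific logic, and everything else is plumbing that a tool like Dumbo aims to take off your hands.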