Comprehensive notes from my three hour Redis tutorial

Last week I presented two talks at the inaugural NoSQL Europe conference in London. The first was presented with Matthew Wall and covered the ways in which we have been exploring NoSQL at the Guardian. The second was a three hour workshop on Redis, my favourite piece of software to have the NoSQL label applied to it.

I’ve written about Redis here before, and it has since earned a place next to MySQL/PostgreSQL and memcached as part of my default web application stack. Redis makes write-heavy features such as real-time statistics feasible for small applications, while effortlessly scaling up to handle larger projects as well. If you haven’t tried it out yet, you’re sorely missing out.

For the workshop, I tried to give an overview of each individual Redis feature along with detailed examples of real-world problems that the feature can help solve. I spent the past day annotating each slide with detailed notes, and I think the end result makes a pretty good stand-alone tutorial:

Redis tutorial slides and notes

In unrelated news, Nat and I both completed the first ever Brighton Marathon last weekend, in my case taking 4 hours, 55 minutes and 17 seconds. Sincere thanks to everyone who came out to support us - until the race I had never appreciated how important the support of the spectators is to keep going to the end. We raised £757 for the Have a Heart children’s charity. Thanks in particular to Clearleft who kindly offered to match every donation.

WildlifeNearYou talk at £5 app, and being Wired (not Tired)

Two quick updates about WildlifeNearYou. First up, I gave a talk about the site at £5 app, my favourite Brighton evening event which celebrates side projects and the joy of Making Stuff. I talked about the site’s genesis on a fort, crowdsourcing photo ratings, how we use Freebase and DBpedia and how integrating with Flickr’s machine tags gave us a powerful location API for free. Here’s the video of the talk, courtesy of Ian Oszvald:

£5 App #22 WildLifeNearYou by Simon Willison and Natalie Downe from IanProCastsCoUk on Vimeo.

Secondly, I’m excited to note that WildlifeNearYou spin-off OwlsNearYou.com is featured in UK Wired magazine’s Wired / Tired / Expired column… and we’re Wired!

Wired / Tired / Expired column from May 2010 Wired UK

Some questions about the “blocking” of HTML5

  1. When people say that the publication of HTML5 was “blocked” by Larry Masinter’s “formal objection”, what exactly do they mean?
  2. Why does the private w3c-archive mailing list exist? Why can’t anyone reveal what happens on there? What are the consequences for doing so? Who gets to be on that list in the first place?
  3. Can anyone raise a “formal objection”?
  4. Is anyone calling for the HTML Working Group to be “rechartered”? If so, what does that involve?
  5. If there are concerns about the inclusion of Canvas 2D in the specification, why were these not resolved earlier?

Some background reading. I was planning to fill in answers as they arrived, but I screwed up the moderation of the comments and got flooded with detailed responses - I strongly recommend reading the comments.

WildlifeNearYou: It began on a fort…

Back in October 2008, myself and 11 others set out on the first /dev/fort expedition. The idea was simple: gather a dozen geeks, rent a fort, take food and laptops and see what we could build in a week.

Fort Clonque

The fort was Fort Clonque on Alderney in the Channel Islands, managed by the Landmark Trust. We spent an incredibly entertaining week there exploring Nazi bunkers, cooking, eating and coding up a storm. It ended up taking slightly longer than a week to finish, but 14 months later the result of our combined efforts can finally be revealed: WildlifeNearYou.com!

WildlifeNearYou is a site for people who like to see animals. Have you ever wanted to know where your nearest Llama is? Search for “llamas near brighton” and you’ll see that there’s one 18 miles away at Ashdown Forest Llama Farm. Or you can see all the places we know about in France, or all the trips I’ve been on, or everywhere you can see a Red Panda.

The data comes from user contributions: you can use WildlifeNearYou to track your trips to wildlife places and list the animals that you see there. We can only tell you about animals that someone else has already spotted.

Once you’ve added some trips, you can import your Flickr photos and match them up with trips and species. We’ll be adding a feature in the future that will push machine tags and other metadata back to Flickr for you, if you so choose.

You can read more about WildlifeNearYou on the site’s about page and FAQ. Please don’t hesitate to send us feedback!

What took so long?

So why did it take so long to finally launch it? A whole bunch of reasons. Week-long marathon hacking sessions are an amazing way to generate a ton of interesting ideas and build a whole bunch of functionality, but it’s very hard to end up with a single cohesive whole at the end of it. Tying up the loose ends is a pretty big job, and it is severely hampered by the fort residents returning to their real lives, where hacking for 5 hours straight on a cool Easter egg suddenly doesn’t seem quite so appealing. We also got stuck in a cycle of “just one more thing”. On the fort we didn’t have internet access, so internet-dependent features like Freebase integration, Google Maps, Flickr imports and OpenID had to be left until later (“they’ll only take a few hours” no longer works once you’re off /dev/fort time).

The biggest problem though was perfectionism. The longer a side-project drags on for, the more important it feels to make it “just perfect” before releasing it to the world. Finally, on New Year’s Day, Nat and I decided we had had enough. Our resolution was to “ship the thing within a week, no matter what state it’s in”. We’re a few days late, but it’s finally live.

WildlifeNearYou is by far the most fun website I’ve ever worked on. To all twelve of my intrepid fort companions: congratulations - we made a thing!

Group photo at the Fort

Crowdsourced document analysis and MP expenses

As you may have heard, the UK government released a fresh batch of MP expenses documents a week ago on Thursday. I spent that week working with a small team at Guardian HQ to prepare for the release. Here’s what we built:

http://mps-expenses2.guardian.co.uk/

It’s a crowdsourcing application that asks the public to help us dig through and categorise the enormous stack of documents - around 30,000 pages of claim forms, receipts and hand-written letters, all scanned and published as PDFs.

This is the second time we’ve tried this - the first was back in June, and can be seen at mps-expenses.guardian.co.uk. Last week’s attempt was an opportunity to apply the lessons we learnt the first time round.

Writing crowdsourcing applications in a newspaper environment is a fascinating challenge. Projects have very little notice - I heard about the new document release the Thursday before, giving us less than a week to put everything together. In addition to the fast turnaround for the application itself, the 48 hours following the release are crucial. The news cycle moves fast, so if the application launches but we don’t manage to get useful data out of it quickly, the story will move on before we can impact it.

ScaleCamp on the Friday meant that development work didn’t properly kick off until Monday morning. The bulk of the work was performed by two server-side developers, one client-side developer, one designer and one QA on Monday, Tuesday and Wednesday. The Guardian operations team deftly handled our EC2 configuration and deployment, and we had some extra help on the day from other members of the technology department. After launch we also had a number of journalists helping highlight discoveries and dig through submissions.

The system was written using Django, MySQL (InnoDB), Redis and memcached.

Asking the right question

The biggest mistake we made the first time round was that we asked the wrong question. We tried to get our audience to categorise documents as either “claims” or “receipts” and to rank them as “not interesting”, “a bit interesting”, “interesting but already known” and “someone should investigate this”. We also asked users to optionally enter any numbers they saw on the page as categorised “line items”, with the intention of adding these up later.

The line items, with hindsight, were a mistake. 400,000 documents makes for a huge amount of data entry and for the figures to be useful we would need to confirm their accuracy. This would mean yet more rounds of crowdsourcing, and the job was so large that the chance of getting even one person to enter line items for each page rapidly diminished as the news story grew less prominent.

The categorisations worked reasonably well but weren’t particularly interesting - knowing if a document is a claim or receipt is useful only if you’re going to collect line items. The “investigate this” button worked very well though.

We completely changed our approach for the new system. We dropped the line item task and instead asked our users to categorise each page by applying one or more tags, from a small set that our editors could control. This gave us a lot more flexibility - we changed the tags shortly before launch based on the characteristics of the documents - and had the potential to be a lot more fun as well. I’m particularly fond of the “hand-written” tag, which has highlighted some lovely examples of correspondence between MPs and the expenses office.

Sticking to an editorially assigned set of tags provided a powerful tool for directing people’s investigations, and also ensured our users didn’t start creating potentially libellous tags of their own.

Breaking it up in to assignments

For the first project, everyone worked together on the same task to review all of the documents. This worked fine while the document set was small, but once we had loaded in 400,000+ pages the progress bar became quite depressing.

This time round, we added a new concept of “assignments”. Each assignment consisted of the set of pages belonging to a specified list of MPs, documents or political parties. Assignments had a threshold, so we could specify that a page must be reviewed by at least X people before it was considered reviewed. An editorial tool let us feature one “main” assignment and several alternative assignments right on the homepage.
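
To illustrate the threshold idea, here’s a rough Django-style sketch. The model and field names are my own illustration for this post, not our actual schema:

# Hypothetical sketch of assignments with a per-page review threshold.
from django.db import models

class Document(models.Model):
    title = models.CharField(max_length=100)

class Page(models.Model):
    document = models.ForeignKey(Document)

class Review(models.Model):
    page = models.ForeignKey(Page, related_name='reviews')
    created = models.DateTimeField(auto_now_add=True)

class Assignment(models.Model):
    title = models.CharField(max_length=100)
    threshold = models.IntegerField(default=1)  # reviews needed per page
    pages = models.ManyToManyField(Page, related_name='assignments')

    def is_complete(self):
        # Done once every page has at least `threshold` reviews against it.
        return not self.pages.annotate(
            num_reviews=models.Count('reviews')
        ).filter(num_reviews__lt=self.threshold).exists()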

Clicking “start reviewing” on an assignment sets a cookie for that assignment, and adds the assignment’s progress bar to the top of the review interface. New pages are selected at random from the set of unreviewed pages in that assignment.

The assignments system proved extremely effective. We could use it to direct people to the highest value documents (our top hit list of interesting MPs, or members of the shadow cabinet) while still allowing people with specific interests to pick an alternative task.

Get the button right!

Having run two crowdsourcing projects I can tell you this: the single most important piece of code you will write is the code that gives someone something new to review. Both of our projects had big “start reviewing” buttons. Both were broken in different ways.

The first time round, the mistakes were around scalability. I used a SQL “ORDER BY RAND()” statement to return the next page to review. I knew this was an inefficient operation, but I assumed that it wouldn’t matter since the button would only be clicked occasionally.

Something like 90% of our database load turned out to be caused by that one SQL statement, and it only got worse as we loaded more pages in to the system. This caused multiple site slowdowns and crashes until we threw together a cron job that pushed 1,000 unreviewed page IDs in to memcached and made the button pick one of those at random.
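
The work-around boiled down to something like the following sketch - the model, key and function names here are illustrative rather than the production code:

# Illustrative sketch of the memcached work-around.
import random
from django.core.cache import cache
from expenses.models import Page  # hypothetical app / model

def refresh_unreviewed_pool():
    # Run from cron every few minutes: stash up to 1,000 unreviewed page IDs.
    ids = list(Page.objects.filter(reviews__isnull=True)
                           .values_list('id', flat=True)[:1000])
    cache.set('unreviewed-page-ids', ids, 60 * 10)

def random_unreviewed_page_id():
    # The "start reviewing" button picks from the cached pool at random,
    # instead of hitting MySQL with ORDER BY RAND().
    ids = cache.get('unreviewed-page-ids') or []
    return random.choice(ids) if ids else None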

This solved the performance problem, but meant that our user activity wasn’t nearly as well targeted. For optimum efficiency you really want everyone to be looking at a different page - and a random distribution is almost certainly the easiest way to achieve that.

The second time round I turned to my new favourite in-memory data structure server, redis, and its SRANDMEMBER command (a feature I requested a while ago with this exact kind of project in mind). The system maintains a redis set of all page IDs that need to be reviewed for an assignment to be complete, and a separate set of IDs of pages that have already been reviewed. It then uses redis set difference (the SDIFFSTORE command) to create a set of unreviewed pages for the current assignment, and SRANDMEMBER to pick one of those pages.
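
In Python terms the trick looks roughly like this, using the redis-py client; the key names are made up for the example:

# Sketch of the SDIFFSTORE / SRANDMEMBER trick - key names are illustrative.
import redis

r = redis.Redis()

def random_unreviewed_page(assignment_id):
    all_pages = 'assignment:%s:pages' % assignment_id        # every page ID in the assignment
    reviewed = 'assignment:%s:reviewed' % assignment_id      # page IDs already reviewed enough
    unreviewed = 'assignment:%s:unreviewed' % assignment_id
    # SDIFFSTORE: pages in the assignment minus pages already reviewed.
    r.sdiffstore(unreviewed, [all_pages, reviewed])
    # SRANDMEMBER: pick one of the remaining pages at random.
    return r.srandmember(unreviewed)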

This is where the bug crept in. Redis was just being used as an optimisation - MySQL remained the single point of truth for whether a page had been reviewed or not. I wrote a couple of Django management commands to repopulate the denormalised Redis sets should we need to manually modify the database. Unfortunately I missed some - the sets that tracked which pages were available in each document. The assignment generation code used an intersection of these sets to create the overall set of documents for that assignment. When we deleted some pages that had accidentally been imported twice, I failed to update those sets.

This meant the “next page” button would occasionally turn up a page that didn’t exist. I had some very poorly considered fallback logic for that - if the random page didn’t exist, the system would return the first page in that assignment instead. Unfortunately, this meant that when the assignment was down to the last four non-existent pages every single user was directed to the same page - which subsequently attracted well over a thousand individual reviews.

Next time, I’m going to try to make the “next” button completely bulletproof! I’m also going to maintain a “denormalisation dictionary” documenting every denormalisation in the system in detail - such a thing would have saved me several hours of confused debugging.

Exposing the results

The biggest mistake I made last time was not getting the data back out again fast enough for our reporters to effectively use it. It took 24 hours from the launch of the application to the moment the first reporting feature was added - mainly because we spent much of the intervening time figuring out the scaling issues.

This time we handled this a lot better. We provided private pages exposing all recent activity on the site. We also provided public pages for each of the tags, as well as combination pages for party + tag, MP + tag, document + tag, assignment + tag and user + tag. Most of these pages were ordered by most-tagged, with the hope that the most interesting pages would quickly bubble to the top.

This worked pretty well, but we made one key mistake. The way we were ordering pages meant that it was almost impossible to paginate through them and be sure that you had seen everything under a specific tag. If you’re trying to keep track of everything going on in the site, reliable pagination is essential. The only way to get reliable pagination on a fast moving site is to order by the date something was first added to a set in ascending order. That way you can work through all of the pages, wait a bit, hit “refresh” and be able to continue paginating where you left off. Any other order results in the content of each page changing as new content comes in.

We eventually added an undocumented /in-order/ URL prefix to address this issue. Next time I’ll pay a lot more attention to getting the pagination options right from the start.
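
The idea behind those in-order views is easy to sketch. The model and field names below are hypothetical - the point is simply the oldest-first ordering:

# Hypothetical sketch of date-added-ascending pagination.
from django.core.paginator import Paginator
from expenses.models import TaggedPage  # hypothetical model

def in_order_pages(tag, per_page=50):
    # Oldest-first by the time the tag was applied: new activity only ever
    # appends to the end, so a reader can page through, hit refresh and
    # carry on from where they left off without anything shifting.
    qs = TaggedPage.objects.filter(tag=tag).order_by('added')
    return Paginator(qs, per_page)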

Rewarding our contributors

The reviewing experience the first time round was actually quite lonely. We deliberately avoided showing people how others had marked each page because we didn’t want to bias the results. Unfortunately this meant the site felt like a bit of a ghost town, even when hundreds of other people were actively reviewing things at the same time.

For the new version, we tried to provide a much better feeling of activity around the site. We added “top reviewer” tables to every assignment, MP and political party as well as a “most active reviewers in the past 48 hours” table on the homepage (this feature was added to the first project several days too late). User profile pages got a lot more attention, with more of a feel that users were collecting their favourite pages in to tag buckets within their profile.

Most importantly, we added a concept of discoveries - editorially highlighted pages that were shown on the homepage and credited to the user that had first highlighted them. These discoveries also added valuable editorial interest to the site, showing up on the homepage and also the index pages for political parties and individual MPs.

Light-weight registration

For both projects, we implemented an extremely light-weight form of registration. Users can start reviewing pages without going through any signup mechanism, and instead are assigned a cookie and an anon-454 style username the first time they review a document. They are then encouraged to assign themselves a proper username and password so they can log in later and take credit for their discoveries.
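
A rough sketch of that flow, with illustrative names and collision handling omitted:

# Light-weight registration sketch: mint an anon-NNN user plus a cookie the
# first time someone reviews a page. Names here are illustrative.
import random
from django.contrib.auth.models import User

ANON_COOKIE = 'reviewer_id'

def get_or_create_reviewer(request, response):
    user_id = request.COOKIES.get(ANON_COOKIE)
    if user_id:
        try:
            return User.objects.get(pk=user_id)
        except User.DoesNotExist:
            pass
    # First review: create an account with no signup form or password.
    # The user can claim it with a real username and password later.
    user = User.objects.create(username='anon-%d' % random.randint(1, 99999))
    response.set_cookie(ANON_COOKIE, user.pk, max_age=365 * 24 * 3600)
    return user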

It’s difficult to tell how effective this approach really is. I have a strong hunch that it dramatically increases the number of people who review at least one document, but without a formal A/B test it’s hard to tell how true that is. The UI for this process in the first project was quite confusing - we gave it a solid makeover the second time round, which seems to have resulted in a higher number of conversions.

Overall lessons

News-based crowdsourcing projects of this nature are both challenging and an enormous amount of fun. For the best chances of success, be sure to ask the right question, ensure user contributions are rewarded, expose as much data as possible and make the “next thing to review” behaviour rock solid. I’m looking forward to the next opportunity to apply these lessons, although at this point I really hope it involves something other than MPs’ expenses.

Node.js is genuinely exciting

I gave a talk on Friday at Full Frontal, a new one day JavaScript conference in my home town of Brighton. I ended up throwing away my intended topic (JSONP, APIs and cross-domain security) three days before the event in favour of a technology which first crossed my radar less than two weeks ago.

That technology is Ryan Dahl’s Node. It’s the most exciting new project I’ve come across in quite a while.

At first glance, Node looks like yet another take on the idea of server-side JavaScript, but it’s a lot more interesting than that. It builds on JavaScript’s excellent support for event-based programming and uses it to create something that truly plays to the strengths of the language.

Node describes itself as “evented I/O for V8 javascript”. It’s a toolkit for writing extremely high performance, non-blocking, event driven network servers in JavaScript. Think Twisted or EventMachine, but for JavaScript instead of Python or Ruby.

Evented I/O?

As I discussed in my talk, event driven servers are a powerful alternative to the threading / blocking mechanism used by most popular server-side programming frameworks. Typical frameworks can only handle a small number of requests simultaneously, dictated by the number of server threads or processes available. Long-running operations can tie up one of those threads - enough long running operations at once and the server runs out of available threads and becomes unresponsive. For large amounts of traffic, each request must be handled as quickly as possible to free the thread up to deal with the next in line.

This makes certain functionality extremely difficult to support. Examples include handling large file uploads, combining resources from multiple backend web APIs (which themselves can take an unpredictable amount of time to respond) or providing comet functionality by holding open the connection until a new event becomes available.

Event driven programming takes advantage of the fact that network servers spend most of their time waiting for I/O operations to complete. Operations against in-memory data are incredibly fast, but anything that involves talking to the filesystem or over a network inevitably involves waiting around for a response.

With Twisted, EventMachine and Node, the solution lies in specifying I/O operations in conjunction with callbacks. A single event loop rapidly switches between a list of tasks, firing off I/O operations and then moving on to service the next request. When the I/O returns, execution of that particular request is picked up again.

(In the talk, I attempted to illustrate this with a questionable metaphor involving hamsters, bunnies and a hyperactive squid).
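
Twisted is the Python member of that family, so purely to illustrate the callback style in a few lines (a toy example, not code from the talk):

# Toy illustration of evented I/O with callbacks, using Twisted.
from twisted.internet import reactor
from twisted.web.client import getPage

def on_response(body):
    # Runs later, once the HTTP response has arrived.
    print('fetched %d bytes' % len(body))
    reactor.stop()

d = getPage('http://www.example.com/')  # kicks off the I/O, returns immediately
d.addCallback(on_response)              # what to do when it completes
reactor.run()                           # the single event loop services everything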

What makes Node exciting?

If systems like this already exist, what’s so exciting about Node? Quite a few things:

  • JavaScript is extremely well suited to programming with callbacks. Its anonymous function syntax and closure support are perfect for defining inline callbacks, and client-side development in general uses event-based programming as a matter of course: run this function when the user clicks here / when the Ajax response returns / when the page loads. JavaScript programmers already understand how to build software in this way.
  • Node represents a clean slate. Twisted and EventMachine are hampered by the existence of a large number of blocking libraries for their respective languages. Part of the difficulty in learning those technologies is understanding which Python or Ruby libraries you can use and which ones you have to avoid. Node creator Ryan Dahl has a stated aim for Node to never provide a blocking API - even filesystem access and DNS lookups are catered for with non-blocking callback based APIs. This makes it much, much harder to screw things up.
  • Node is small. I read through the API documentation in around half an hour and felt like I had a pretty comprehensive idea of what Node does and how I would achieve things with it.
  • Node is fast. V8 is fast and keeps getting faster. Node’s event loop uses Marc Lehmann’s highly regarded libev and libeio libraries. Ryan Dahl is himself something of a speed demon - he just replaced Node’s HTTP parser implementation (already pretty speedy due to its Ragel / Mongrel heritage) with a hand-tuned C implementation with some impressive characteristics.
  • Easy to get started. Node ships with all of its dependencies, and compiles cleanly on Snow Leopard out of the box.

With both my JavaScript and server-side hats on, Node just feels right. The APIs make sense, it fits a clear niche and despite its youth (the project started in February) everything feels solid and well constructed. The rapidly growing community is further indication that Ryan is on to something great here.

What does Node look like?

Here’s how to get Hello World running in Node in 7 easy steps:

  1. git clone git://github.com/ry/node.git (or download and extract a tarball)
  2. ./configure
  3. make (takes a while, it needs to compile V8 as well)
  4. sudo make install
  5. Save the below code as helloworld.js
  6. node helloworld.js
  7. Visit http://localhost:8080/ in your browser

Here’s helloworld.js:

var sys = require('sys'), 
  http = require('http');

http.createServer(function(req, res) {
  res.sendHeader(200, {'Content-Type': 'text/html'});
  res.sendBody('<h1>Hello World</h1>');
  res.finish();
}).listen(8080);

sys.puts('Server running at http://127.0.0.1:8080/');

If you have Apache Bench installed, try running ab -n 1000 -c 100 'http://127.0.0.1:8080/' to test it with 1000 requests using 100 concurrent connections. On my MacBook Pro I get 3374 requests a second.

So Node is fast - but where it really shines is concurrency with long running requests. Alter the helloworld.js server definition to look like this:

http.createServer(function(req, res) {
  setTimeout(function() {
    res.sendHeader(200, {'Content-Type': 'text/html'});
    res.sendBody('<h1>Hello World</h1>');
    res.finish();
  }, 2000);
}).listen(8080);

We’re using setTimeout to introduce an artificial two second delay to each request. Run the benchmark again - I get 49.68 requests a second, with every single request taking between 2012 and 2022 ms. With a two second delay, the best possible throughput for 1000 requests taken 100 at a time is 1000 requests / ((1000 / 100) * 2 seconds) = 50 requests a second. Node hits it pretty much bang on the nose.

The most important line in the above examples is res.finish(). This is the mechanism Node provides for explicitly signalling that a request has been fully processed and should be returned to the browser. By making it explicit, Node makes it easy to implement comet patterns like long polling and streaming responses - stuff that is decidedly non-trivial in most server-side frameworks.

djangode

Node’s core APIs are pretty low level - it has HTTP client and server libraries, DNS handling, asynchronous file I/O etc, but it doesn’t give you much in the way of high level web framework APIs. Unsurprisingly, this has led to a Cambrian explosion of lightweight web frameworks built on top of Node - the projects using node page lists a bunch of them. Rolling your own framework is a great way of learning a low-level API, so I’ve thrown together my own - djangode - which brings Django’s regex-based URL handling to Node along with a few handy utility functions. Here’s a simple djangode application:

var dj = require('./djangode');

var app = dj.makeApp([
  ['^/$', function(req, res) {
    dj.respond(res, 'Homepage');
  }],
  ['^/other$', function(req, res) {
    dj.respond(res, 'Other page');
  }],
  ['^/page/(\\d+)$', function(req, res, page) {
    dj.respond(res, 'Page ' + page);
  }]
]);
dj.serve(app, 8008);

djangode is currently a throwaway prototype, but I’ll probably be extending it with extra functionality as I explore more Node related ideas.

nodecast

My main demo in the Full Frontal talk was nodecast, an extremely simple broadcast-oriented comet application. Broadcast is my favourite “hello world” example for comet because it’s both simpler than chat and more realistic - I’ve been involved in plenty of projects that could benefit from being able to broadcast events to their audience, but few that needed an interactive chat room.

The source code for the version I demoed can be found on GitHub in the no-redis branch. It’s a very simple application - the client-side JavaScript simply uses jQuery’s getJSON method to perform long-polling against a simple URL endpoint:

function fetchLatest() {
  $.getJSON('/wait?id=' + last_seen, function(d) {
    $.each(d, function() {
      last_seen = parseInt(this.id, 10) + 1;
      ul.prepend($('<li></li>').text(this.text));
    });
    fetchLatest();
  });
}

Doing this recursively is probably a bad idea since it will eventually blow the browser’s JavaScript stack, but it works OK for the demo.

The more interesting part is the server-side /wait URL which is being polled. Here’s the relevant Node/djangode code:

var message_queue = new process.EventEmitter();

var app = dj.makeApp([
  // ...
  ['^/wait$', function(req, res) {
    var id = req.uri.params.id || 0;
    var messages = getMessagesSince(id);
    if (messages.length) {
      dj.respond(res, JSON.stringify(messages), 'text/plain');
    } else {
      // Wait for the next message
      // Keep a direct reference to the callback so it can be removed later
      var listener = function() {
        dj.respond(res, 
          JSON.stringify(getMessagesSince(id)), 'text/plain'
        );
        message_queue.removeListener('message', listener);
        clearTimeout(timeout);
      };
      message_queue.addListener('message', listener);
      var timeout = setTimeout(function() {
        message_queue.removeListener('message', listener);
        dj.respond(res, JSON.stringify([]), 'text/plain');
      }, 10000);
    }
  }]
  // ...
]);

The wait endpoint checks for new messages and, if any exist, returns immediately. If there are no new messages it does two things: it hooks up a listener on the message_queue EventEmitter (Node’s equivalent of jQuery/YUI/Prototype’s custom events) which will respond and end the request when a new message becomes available, and also sets a timeout that will cancel the listener and end the request after 10 seconds. This ensures that long polls don’t go on too long and potentially cause problems - as far as the browser is concerned it’s just talking to a JSON resource which takes up to ten seconds to load.

When a message does become available, calling message_queue.emit('message') will cause all waiting requests to respond with the latest set of messages.

Talking to databases

nodecast keeps track of messages using an in-memory JavaScript array, which works fine until you restart the server and lose everything. How do you implement persistent storage?

For the moment, the easiest answer lies with the NoSQL ecosystem. Node’s focus on non-blocking I/O makes it hard (but not impossible) to hook it up to regular database client libraries. Instead, it strongly favours databases that speak simple protocols over a TCP/IP socket - or even better, databases that communicate over HTTP. So far I’ve tried using CouchDB (with node-couch) and redis (with redis-node-client), and both worked extremely well. nodecast trunk now uses redis to store the message queue, and provides a nice example of working with a callback-based non-blocking database interface:

var db = redis.create_client();
var REDIS_KEY = 'nodecast-queue';

function addMessage(msg, callback) {
  db.llen(REDIS_KEY, function(i) {
    msg.id = i; // ID is set to the queue length
    db.rpush(REDIS_KEY, JSON.stringify(msg), function() {
      message_queue.emit('message', msg);
      callback(msg);
    });
  });
}

Relational databases are coming to Node. Ryan has a PostgreSQL adapter in the works, thanks to that database already featuring a mature non-blocking client library. MySQL will be a bit tougher - Node will need to grow a separate thread pool to integrate with the official client libs - but you can talk to MySQL right now by dropping in DBSlayer from the NY Times which provides an HTTP interface to a pool of MySQL servers.

Mixed environments

I don’t see myself switching all of my server-side development over to JavaScript, but Node has definitely earned a place in my toolbox. It shouldn’t be at all hard to mix Node in to an existing server-side environment - either by running both behind a single HTTP proxy (being event-based itself, nginx would be an obvious fit) or by putting Node applications on a separate subdomain. Node is a tempting option for anything involving comet, file uploads or even just mashing together potentially slow loading web APIs. Expect to hear a lot more about it in the future.

Why I like Redis

I’ve been getting a lot of useful work done with Redis recently.

Redis is typically categorised as yet another of those new-fangled NoSQL key/value stores, but if you look closer it actually has some pretty unique characteristics. It makes more sense to describe it as a “data structure server” - it provides a network service that exposes persistent storage and operations over dictionaries, lists, sets and string values. Think memcached but with list and set operations and persistence-to-disk.

It’s also incredibly easy to set up, ridiculously fast (30,000 reads or writes a second on my laptop with the default configuration) and has an interesting approach to persistence. Redis runs in memory, but syncs to disk every Y seconds or after every X operations. Sounds risky, but it supports replication out of the box, so if you’re worried about losing data should a server fail you can always ensure you have a replicated copy to hand. I wouldn’t trust my only copy of critical data to it, but there are plenty of other cases for which it is really well suited.

I’m currently not using it for data storage at all - instead, I use it as a tool for processing data using the interactive Python interpreter.

I’m a huge fan of REPLs. When programming Python, I spend most of my time in an IPython prompt. With JavaScript, I use the Firebug console. I experiment with APIs, get something working and paste it over in to a text editor. For some one-off data transformation problems I never save any code at all - I run a couple of list comprehensions, dump the results out as JSON or CSV and leave it at that.

Redis is an excellent complement to this kind of programming. I can run a long running batch job in one Python interpreter (say loading a few million lines of CSV in to a Redis key/value lookup table) and run another interpreter to play with the data that’s already been collected, even as the first process is streaming data in. I can quit and restart my interpreters without losing any data. And because Redis semantics map closely to Python native data types, I don’t have to think for more than a few seconds about how I’m going to represent my data.

Here’s a 30 second guide to getting started with Redis:

$ wget http://redis.googlecode.com/files/redis-1.01.tar.gz
$ tar -xzf redis-1.01.tar.gz
$ cd redis-1.01
$ make
$ ./redis-server

And that’s it - you now have a Redis server running on port 6379. No need even for a ./configure or make install. You can run ./redis-benchmark in that directory to exercise it a bit.

Let’s try it out from Python. In a separate terminal:

$ cd redis-1.01/client-libraries/python/
$ python
>>> import redis
>>> r = redis.Redis()
>>> r.info()
{u'total_connections_received': 1, ... }
>>> r.keys('*') # Show all keys in the database
[]
>>> r.set('key-1', 'Value 1')
'OK'
>>> r.keys('*')
[u'key-1']
>>> r.get('key-1')
u'Value 1'

Now let’s try something a bit more interesting:

>>> r.push('log', 'Log message 1', tail=True)
>>> r.push('log', 'Log message 2', tail=True)
>>> r.push('log', 'Log message 3', tail=True)
>>> r.lrange('log', 0, 100)
[u'Log message 3', u'Log message 2', u'Log message 1']
>>> r.push('log', 'Log message 4', tail=True)
>>> r.push('log', 'Log message 5', tail=True)
>>> r.push('log', 'Log message 6', tail=True)
>>> r.ltrim('log', 0, 2)
>>> r.lrange('log', 0, 100)
[u'Log message 6', u'Log message 5', u'Log message 4']

That’s a simple capped log implementation (similar to a MongoDB capped collection) - push items on to the tail of a ‘log’ key and use ltrim to only retain the last X items. You could use this to keep track of what a system is doing right now without having to worry about storing ever increasing amounts of logging information.

See the documentation for a full list of Redis commands. I’m particularly excited about the RANDOMKEY and new SRANDMEMBER commands (git trunk only at the moment), which help address the common challenge of picking a random item without ORDER BY RAND() clobbering your relational database. In a beautiful example of open source support in action, I requested SRANDMEMBER on Twitter yesterday and antirez committed just 12 hours later.

I used Redis this week to help create heat maps of the BNP’s membership list for the Guardian. I had the leaked spreadsheet of the BNP member details and a (licensed) CSV file mapping 1.6 million postcodes to their corresponding parliamentary constituencies. I loaded the CSV file in to Redis, then looped through the 12,000 postcodes from the membership and looked them up in turn, accumulating counts for each constituency. It took a couple of minutes to load the constituency data and a few seconds to run and accumulate the postcode counts. In the end, it probably involved less than 20 lines of actual Python code.
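
The code for that kind of job is short enough to sketch here. The CSV column layouts and key names below are assumptions for the sake of the example, not the actual files:

# Rough sketch of the postcode-counting workflow described above.
import csv
from collections import defaultdict

import redis

r = redis.Redis()

# Interpreter 1: stream the 1.6 million postcode -> constituency rows in.
for postcode, constituency in csv.reader(open('postcodes.csv')):
    r.set('postcode:%s' % postcode.replace(' ', '').upper(), constituency)

# Interpreter 2 (can run while the above is still loading): accumulate
# counts per constituency for the 12,000 member postcodes.
counts = defaultdict(int)
for row in csv.reader(open('members.csv')):
    postcode = row[3].replace(' ', '').upper()   # column index is a guess
    constituency = r.get('postcode:%s' % postcode)
    if constituency:
        counts[constituency] += 1

for constituency, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print('%s,%d' % (constituency, n))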

A much more interesting example of an application built on Redis is Hurl, a tool for debugging HTTP requests built in 48 hours by Leah Culver and Chris Wanstrath. The code is now open source, and Chris talks a bit more about the implementation (in particular their use of sort in Redis) on his blog. Redis also gets a mention in Tom Preston-Werner’s epic writeup of the new scalable architecture behind GitHub.