Sunday, June 13, 2010

TwitterFS - A HackCamp 2010 Production

I’ve just spent the weekend at the Google offices in London taking part in HackCamp, a replacement event for BarCamp, which was cancelled due to problems with the venue.

I went not knowing what I’d be doing or what I’d be doing it with. After a presentation by @themattharris on Twitter annotations at the start of the day, @JHollingworth and I decided that, in the absence of good ideas, it would be fun to abuse a new feature (annotations) by doing something pointless and absolutely useless to anybody, even ourselves.

Hence the idea of TwitterFS was born.

The setup

  • With Twitter annotations, we now have the ability to store 512 bytes of arbitrary data against each tweet
  • Each tweet has a unique identifier assigned to it on save
  • Each tweet can have 140 additional characters stored in the content of the tweet itself
  • Tweets cannot be modified once written

With Twitter annotations, it became obvious that what we needed to do was create a low-availability, low-consistency and low-performance file system on top of Twitter.

Twitter could then be used as an “in the cloud” store of arbitrary data which could be synched between machines à la Dropbox. (TwitterBox?)

So what do we have?

Effectively, in our file system each tweet is an inode, and an inode contains its own data plus a link to the next inode (if the data is too large to fit in a single inode).

  • Directories can be implemented as a sequence of inodes which contain a list of ids for other inodes
  • Files can be implemented as a sequence of inodes which contain the data for that file
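
A minimal sketch of that chain structure, assuming an in-memory hash stands in for Twitter. The names `Inode`, `STORE` and `read_chain` are hypothetical and are not taken from the TwitterFS code:

```ruby
# Hypothetical inode: a payload plus the id of the next inode
# (nil at the end of the chain).
Inode = Struct.new(:id, :data, :next_id)

# Stand-in for Twitter: tweet id => inode. Not the real TwitterFS persister.
STORE = {
  1 => Inode.new(1, "world", nil),
  2 => Inode.new(2, "hello ", 1)
}

# Follow next_id links from the head inode and concatenate the payloads.
def read_chain(store, head_id)
  data = +""
  id = head_id
  while id
    inode = store.fetch(id)
    data << inode.data
    id = inode.next_id
  end
  data
end

puts read_chain(STORE, 2)
```

A directory would look the same, except that the payload is a list of ids for other inodes rather than file content.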

In an ordinary file system, we have a finite amount of space and when files are deleted or modified, any freed up inodes need re-allocating so they can be written to again when more data is added.

This is not the case with Twitter: we cannot re-allocate tweets, so we can instead treat Twitter as an append-only, infinitely sized hard drive.

  • Adding a file means
    • Breaking up the file into separate inodes
    • Writing them all to Twitter backwards (last chunk first, so each inode can link to the one after it)
    • Writing the id of the last-written inode (the head of the file) into the directory the file belongs to
    • Re-writing any directories in the tree for that file (up to and including the “root” directory)
  • Deleting a file means
    • Removing the reference from the directory it belongs to
    • Re-writing any directories in the tree for that directory (up to and including the “root” directory)
  • Editing a file means
    • Deleting the file
    • Adding the file

The same goes for directories.
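
The "write it backwards" step can be sketched roughly as follows. This assumes a hypothetical `AppendOnlyStore` standing in for Twitter (ids assigned on save, entries never modified); `write_file` and `read_file` are illustrative names, not the actual TwitterFS API:

```ruby
# Hypothetical append-only store standing in for Twitter: ids are assigned
# on save and existing entries are never modified.
class AppendOnlyStore
  def initialize
    @entries = {}
    @next_id = 1
  end

  # Saves a payload with a link to the next inode and returns the new id.
  def append(data, next_id)
    id = @next_id
    @next_id += 1
    @entries[id] = { data: data, next_id: next_id }.freeze
    id
  end

  def fetch(id)
    @entries.fetch(id)
  end
end

# Splits data into chunk_size-byte pieces and writes them last-to-first, so
# each inode already knows the id of the one that follows it.
# Returns the id of the head inode (the last one written).
def write_file(store, data, chunk_size)
  chunks = data.bytes.each_slice(chunk_size).map { |bytes| bytes.pack("C*") }
  next_id = nil
  chunks.reverse_each { |chunk| next_id = store.append(chunk, next_id) }
  next_id
end

# Reads a file back by following the links from the head inode.
def read_file(store, head_id)
  out = "".b
  id = head_id
  while id
    entry = store.fetch(id)
    out << entry[:data]
    id = entry[:next_id]
  end
  out
end

store = AppendOnlyStore.new
head = write_file(store, "Some lovely data (c)", 8)
puts read_file(store, head)
```

The returned head id is what would be recorded in the parent directory. Because stored entries cannot change, recording it means writing a new copy of that directory, which in turn gets a new id, and so on up to the root.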

On top of this, we wrote a file system watcher which detects changes on the hard drive, adds/removes files and directories in the in-memory store, and flushes the changes to Twitter as they are made. (It also detects changes on Twitter and performs the reverse operation.)

Obviously, loading the entire tweet stream would defeat the point of storing the data on Twitter, so we implemented a look-ahead/caching algorithm that pulls back 200 nodes when 1 is requested, keeping our requests to a minimum.
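
A rough sketch of the look-ahead idea, not the code we actually wrote. `LookAheadCache` and the `fetch_batch` callable are hypothetical, and fetching ids forward from the requested one is an assumption made purely for illustration:

```ruby
# Hypothetical read-through cache: on a miss it fetches a whole batch of
# inodes in one request and remembers all of them.
class LookAheadCache
  BATCH_SIZE = 200 # batch size mentioned in the post

  attr_reader :requests

  # fetch_batch is an assumed callable: (start_id, count) => { id => data }
  def initialize(fetch_batch)
    @fetch_batch = fetch_batch
    @cache = {}
    @requests = 0
  end

  def get(id)
    unless @cache.key?(id)
      @requests += 1
      @cache.merge!(@fetch_batch.call(id, BATCH_SIZE))
    end
    @cache[id]
  end
end

# Fake backend standing in for the Twitter API: ids 1..1000 map to "node-<id>".
backend = lambda do |start_id, count|
  ids = (start_id...(start_id + count)).select { |i| i.between?(1, 1000) }
  ids.to_h { |i| [i, "node-#{i}"] }
end

cache = LookAheadCache.new(backend)
cache.get(1)   # miss: one request pulls back ids 1..200
cache.get(150) # hit: already cached
cache.get(201) # miss: a second request
puts cache.requests
```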

What the code looks like

This bit of code adds two files to the root directory, then a directory containing another file, so you can see that dealing with Twitter is not the application’s concern. In one of our tests, data was a byte array loaded from an image on the hard drive, and that worked fine too.

    # The persister is the only part that talks to Twitter
    persister = TwitterPersister.new

    fs = FileSystem.new persister, :isnew => true
    root = fs.root

    # Two documents in the root directory
    documenta = Document.new(fs, :title => "Document A", :data => "Some Data (a)")
    documentb = Document.new(fs, :title => "Document B", :data => "Some other data (b)")

    root.add_documents([documenta, documentb])

    # A directory holding a third document
    dir = Directory.new(fs, nil)
    documentc = Document.new(fs, :title => "Document C", :data => "Some lovely data (c)")
    dir.add_document(documentc)

    root.add_directory(dir)

    # Flush the pending changes out through the persister
    fs.flush()

How did it go?

Implementing a file system against an unreliable persistence store was always going to cause problems. When it came to the presentation, we still didn’t have access to the annotations feature because Twitter had been falling over all weekend, so we had to start cramming the file data into the tweet itself (70 bytes per inode – ouch!). Then Twitter itself went down anyway.

I doubt we’ve won anything (although we got some laughter), but overall I’m quite impressed with how far we got (all of the above has been written and tested).
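
To put the 70-byte fallback in perspective, here is a rough calculation. It assumes every inode carries a full payload and ignores directory inodes; `inodes_needed` is a hypothetical helper and the 10 KB file size is just an example:

```ruby
# Number of inodes (tweets) needed to hold size_bytes at a given payload size.
def inodes_needed(size_bytes, payload_bytes)
  (size_bytes + payload_bytes - 1) / payload_bytes # integer ceiling division
end

size = 10 * 1024 # an example 10 KB file
puts inodes_needed(size, 70)  # payload squeezed into the tweet text
puts inodes_needed(size, 512) # payload stored in an annotation
```

Under those assumptions, the same file needs 147 tweets at 70 bytes per inode versus 20 with a 512-byte annotation.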

(Edit, okay – we won “Stupidest hack most likely to win a million dollars of VC funding”, sweet!)

  • The slides for our 1-minute presentation can be found here
  • The code for TwitterFS and the commit history for the weekend can be found here
  • The account we used for testing can be found here

We ended up writing the whole thing in Ruby: we’re C# developers by day, and as we were doing something ultimately pointless, we tried to give it a bit of a point by working in something unfamiliar for educational purposes.

My opinion on that? I really liked working in another dynamic language. Our tests were a bit crap because we were unfamiliar with the frameworks, and it was a pain to get everything working properly on my Windows laptop (I ended up using Cygwin for *everything*), but it’s given me some enthusiasm for giving Rails a second glance.

Anyway, normal service will resume tomorrow and I’ll be pushing out another RavenDB/CouchDB entry or two this week, just in case you were wondering where they had gotten to :)

posted @ Sunday, June 13, 2010 4:11 PM

Copyright © Rob Ashton
