RavenDB & CouchDB – Map and Reduce

A recurring feature of the popular document databases is the use of map/reduce functions as the primary way to create views over the stored data.

Map Reduce

At this point I could go into a long description of what map/reduce actually is, but that kind of thing is only a quick Google search away.

The short of it is that you map some data from each document into a structure to be queried on, and then run (and re-run) a reduce function over the mapped data in order to group it by some key.
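To make that concrete, here is a minimal sketch of the map, group and reduce steps in plain JavaScript, run over an in-memory array rather than inside either database. The documents and field names are made up for illustration, and neither RavenDB nor CouchDB necessarily works this way internally.

```javascript
// Hypothetical in-memory documents; the field names are assumptions for illustration.
const docs = [
  { author: 'alice', comments: ['a', 'b'] },
  { author: 'bob', comments: ['c'] },
  { author: 'alice', comments: ['d', 'e', 'f'] }
];

// Map: pull out of each document only the data we want to query on.
const mapped = docs.map(doc => ({ key: doc.author, value: doc.comments.length }));

// Group the mapped rows by key.
const groups = new Map();
for (const { key, value } of mapped) {
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(value);
}

// Reduce: collapse each group's values into a single result.
const totals = {};
for (const [key, values] of groups) {
  totals[key] = values.reduce((a, b) => a + b, 0);
}

console.log(JSON.stringify(totals)); // {"alice":5,"bob":1}
```

A real database keeps the mapped rows around and only re-runs the reduce for the groups that changed, which is where the "re-run" part comes in.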

Now, these map functions can get quite complicated, but the concept remains the same from the most basic versions up to the most elaborate reports on the data.

Let’s look at the standard example of getting the number of comments across all the blog entries by a certain author.

Here is the structure of our example document:

    {
        author: 'robashton',
        category: 'tech',
        content: 'blah blah blah',
        comments: [
            { author: 'anon', content: 'blah' },
            { author: 'anon', content: 'more blah' }
        ]
    }
 
 

The important data here are ‘author’ and the length of the comments array, so naturally these are what we map in our map function.

In RavenDB

    from doc in docs
    select new
    {
         author = doc.author,
         count = doc.comments.Length
    }

In CouchDB

    function(doc) {
      emit(doc.author, doc.comments.length);
    }

There isn’t much to say about these. The RavenDB map function just returns the mapped data, and the CouchDB function emits the mapped field(s) as a key alongside the value(s) associated with that key.
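If you want to see what the CouchDB map function actually produces, you can run it outside the database. The sketch below does that with a stand-in for emit that simply collects rows into an array; that stand-in is my own assumption for illustration, not CouchDB's implementation, and the sample post is assumed to carry a top-level author field.

```javascript
// Rows collected by our stand-in for CouchDB's emit (an assumption for this sketch).
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// The CouchDB map function from the post, unchanged.
const map = function (doc) {
  emit(doc.author, doc.comments.length);
};

// A sample post; the top-level author field is assumed.
const post = {
  author: 'robashton',
  category: 'tech',
  content: 'blah blah blah',
  comments: [
    { author: 'anon', content: 'blah' },
    { author: 'anon', content: 'more blah' }
  ]
};

map(post);
console.log(JSON.stringify(rows)); // [{"key":"robashton","value":2}]
```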

The reduction will therefore take place on ‘author’ (the key), and we would sum all the comment counts (the value) for that author in the reduce function.

In RavenDB

    from result in results
    group result by result.author into g
    select new
    {
         author = g.Key,
         count = g.Sum(x => x.count)
    }

 

In CouchDB

    function (keys, values, rereduce) {
       // Summing works on the first pass and on rereduce, as the values are numbers either way.
       return sum(values);
    }

 

The structure of these two functions immediately strikes us as being very different, and that raises some questions.

  • In RavenDB the reduce function is handed a selection of mapped values, and its job is to group them by some key and return a new set of mapped values.
  • In CouchDB the map function emits the key to reduce on, along with a separate value to be combined by the reduce function. By the time reduce is called with a collection of those values, they have already been grouped by key.
  • In RavenDB the reduce function consumes the output of the map function and must emit results in that same shape, because its own output can be fed back in as input on a later pass. In practice this means the map and reduce functions must return the same result shape.
  • In CouchDB the result shapes of the map and reduce functions don’t have to bear any relation to each other, and emit can be called multiple times per document. This is more flexible, but it adds complexity to the reduce function, because its input differs depending on the context in which it is called.
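The "same result shape" rule for RavenDB is easier to see with something runnable. Below is a plain-JavaScript stand-in for the RavenDB reduce (group by author, sum count); it is only a sketch of the idea, not how RavenDB executes an index, and the batches are hypothetical. Because the output rows look exactly like the input rows, reducing in stages gives the same answer as reducing everything at once.

```javascript
// Plain-JavaScript stand-in for the RavenDB reduce: group rows by author, sum count.
// Input rows and output rows share the shape { author, count }.
function reduce(results) {
  const byAuthor = new Map();
  for (const { author, count } of results) {
    byAuthor.set(author, (byAuthor.get(author) || 0) + count);
  }
  return [...byAuthor].map(([author, count]) => ({ author, count }));
}

// Hypothetical map output, split into two batches as a database might process it.
const batch1 = [{ author: 'alice', count: 2 }, { author: 'bob', count: 1 }];
const batch2 = [{ author: 'alice', count: 3 }];

// Reduce each batch, then reduce the combined partial results.
const inStages = reduce([...reduce(batch1), ...reduce(batch2)]);
// Reduce everything in one go.
const allAtOnce = reduce([...batch1, ...batch2]);

console.log(JSON.stringify(inStages)); // [{"author":"alice","count":5},{"author":"bob","count":1}]
```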

The reduce function can be called more than once for a set of documents (this holds true for both RavenDB and CouchDB; that’s the whole point of map/reduce). In CouchDB an extra parameter called “rereduce” specifies whether this is the first pass or a subsequent pass. If the result shapes differ between the map function and the reduce function, the reduce function has to check this flag so that different logic can be performed for each case.
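Here is a sketch of what that check looks like when the shapes really do differ. The reduce below is hypothetical and not part of the example above: the map output is still a plain comment count per post, but the reduce returns an object holding both the number of posts and the number of comments, so it has to branch on rereduce. The calls at the bottom simulate by hand what CouchDB would do, passing [key, document id] pairs on the first pass and null keys on the rereduce pass; the document ids are made up.

```javascript
// Hypothetical reduce whose output shape ({ posts, comments }) differs from the
// map output (a plain comment count per post), so it must branch on rereduce.
function reduce(keys, values, rereduce) {
  if (rereduce) {
    // Subsequent pass: values are objects returned by earlier reduce calls.
    return values.reduce(
      (acc, v) => ({ posts: acc.posts + v.posts, comments: acc.comments + v.comments }),
      { posts: 0, comments: 0 }
    );
  }
  // First pass: values are the numbers emitted by the map function.
  return { posts: values.length, comments: values.reduce((a, b) => a + b, 0) };
}

// Simulate two first-pass calls followed by a rereduce over their results.
const partA = reduce([['robashton', 'post1'], ['robashton', 'post2']], [2, 3], false);
const partB = reduce([['robashton', 'post3']], [1], false);
const total = reduce(null, [partA, partB], true);

console.log(JSON.stringify(total)); // {"posts":3,"comments":6}
```

With the simple sum from earlier none of this is needed, because numbers go in and a number comes out on every pass.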

For those of you who skipped that big body of text, the important thing to take away is that in RavenDB the responsibility of deciding what to group the documents on falls to the Reduce function, and in CouchDB the responsibility falls to the Map function.

So, these are two rather different ways of thinking about MapReduce and this is definitely something to be aware of when trying to jump between the two.

This was quite a long entry with a really short summary, so in the next entry, I’ll be listing and explaining some of the actual functionality differences between CouchDB and RavenDB.



   


Posted on Sunday, June 06, 2010 9:15 PM


Copyright © Rob Ashton
