RavenDB - StackOverflow style voting with Live Projections

Published on 2010-12-16

Going all StackOverflow for a moment, let’s pretend I have a class structure that looks something like this:

public class User
 {
     public string Id { get; set; }
     public string DisplayName { get; set; }
 }
 
 public class Question
 { 
     public string Id { get; set; }
     public string UserId { get; set; }
     public string Title { get; set; }
     public string Content{ get; set; }
 }
 
 public class QuestionVote
 {
     public string QuestionId { get; set; }
     public int Delta { get; set; }
 }
 
 public class Answer
 {
     public string UserId { get; set; }
     public string QuestionId { get; set; }
     public string Content { get; set; }
 }
 
 public class AnswerVote 
 {
     public string QuestionId { get; set; }
     public string AnswerId { get; set; }
     public int Delta { get; set; }
 }

I’ve removed all the “other” stuff that we might have on those classes to keep the relationships simple, but essentially we have a user, and users can add answers to questions, and those questions have  votes added to them.

Note that these are all separate documents, we don’t add votes to the answer (there are potentially thousands of them), and same goes for answers to questions.

When viewing a page in StackOverflow, you are given the question (and its score), with all the relevant data from perhaps the user who asked that question and other such data from other documents.

You are also given a list of comments, ordered by how many positive votes they have, again with relevant information from the user as well as information in that comment.

In a relational store you probably don’t want to be doing these aggregations every page load, and you don’t want to be doing all those joins either -  I haven’t actually got any insight into how the StackOverflow database is put together, but I’m going to guess that they probably have some de-normalisation going on here, or they have a separate read store or caching system. (Or they have some seriously crap-hot database servers)

There are two queries here on this page, one of them is a query to get the question and its vote total, and the other is a query to get a page of comments against that  question with their vote totals.

These present themselves as two projections:

public class QuestionView
{
    public string QuestionId { get; set; }
    public string UserDisplayName { get; set; }
    public string QuestionTitle { get; set; }
    public string QuestionContent { get; set; }
    public int VoteTotal { get; set; }
}
 
public class AnswerViewItem
{
    public string QuestionId { get; set; }
    public string AnswerId { get; set; }
    public string Content { get; set; }
    public string UserId { get; set; }
    public string UserDisplayName { get; set; }
    public int VoteTotal { get; set; }
}

Note that we’re pulling in information from across multiple documents here, and we’re doing an aggregation in both of these views (We’re counting the total deltas of the votes that have been made for a specific question/comment – we’re doing a group by on QuestionId/CommentId), and then pulling in data from other documents.

Now, aggregations in a document database like Raven or Couch mean doing a map/reduce, which mean defining an index – but what are we defining that index against?

What we actually want is information pertaining to a question or comment, but what we want to do an aggregation on are the votes, so we define the index against the votes.

Let’s start with an index to get all the information for a question

public class QuestionWithVoteTotalIndex : AbstractIndexCreationTask<QuestionVote, QuestionView>
{
    public QuestionWithVoteTotalIndex()
    {
        Map = docs => from doc in docs
                      select new
                      {
                          QuestionId = doc.QuestionId,
                          VoteTotal = doc.Delta
                      };
    }
}

We define an index against QuestionVote, because that’s what we’re aggregating – and we’re creating a QuestionView from that index (So,  AbstractIndexCreationTask<QuestionVote, QuestionView> )

We can then do a reduce on QuestionId and work out whether it has minus/positive/etc vote totals

Reduce = mapped => from map in mapped
                   group map by map.QuestionId into g
                   select new
                   {
                       QuestionId = g.Key,
                       VoteTotal = g.Sum(x=>x.VoteTotal)
                   };

And then finally, we can use those reduced results and pull in all the other information (like the question we’re actually looking at) with a live projection:

TransformResults = (database, results) =>
    from result in results
    let question = database.Load<Question>(result.QuestionId)
    let user = database.Load<User>(question.UserId)
    select new
    {
        QuestionId = result.QuestionId,
        UserDisplayName = user.DisplayName,
        QuestionTitle = question.Title,
        QuestionContent = question.Content,
        VoteTotal = result.VoteTotal
    };

Querying this is then just a matter of asking the session for the view by QuestionId

var questionInfo = session.Query<QuestionView, QuestionWithVoteTotalIndex>()
                    .Where(x => x.QuestionId == questionId)
                    .SingleOrDefault();

Getting a page-able list of comments alongside this, complete with their totals and information about who posted it would then look something like this:

public class Answers_ByQuestion : AbstractIndexCreationTask<AnswerVote, AnswerViewItem>
{
    public Answers_ByQuestion()
    {
        Map = docs => from doc in docs
                      select new
                      {
                          AnswerId = doc.AnswerId,
                          QuestionId = doc.QuestionId,
                          VoteTotal = doc.Delta
                      };
        Reduce = mapped => from map in mapped
                           group map by new
                           {
                               map.QuestionId,
                               map.AnswerId
                           } into g
                           select new
                           {
                               AnswerId = g.Key.AnswerId,
                               QuestionId = g.Key.QuestionId,
                               VoteTotal = g.Sum(x => x.VoteTotal)
                           };
        TransformResults = (database, results) =>
            from result in results
            let answer = database.Load<Answer>(result.AnswerId)
            let user = database.Load<User>(answer.UserId)
            select new
            {
                QuestionId = result.QuestionId,
                AnswerId = result.AnswerId,
                Content = answer.Content,
                UserId = answer.UserId,
                UserDisplayName = user.DisplayName,
                VoteTotal = result.VoteTotal
            };
        this.SortOptions.Add(x => x.VoteTotal, Raven.Database.Indexing.SortOptions.Int);
    }
}

With a query of

var questionAnswers = session.Query<AnswerViewItem, Answers_ByQuestion>()
    .Where(x => x.QuestionId == questionId)
    .OrderByDescending(x => x.VoteTotal)
    .Take(10)
    .ToArray();

Note: In the index we told RavenDB how we’d like to treat VoteTotal when sorting, and that also then gave us the functionality of sorting the returned answers by popularity and page through them.

So what does this give us?

This is actually a remarkably simple way to generate views for web pages directly from our primary data store, without having to introduce the complexity of our own caching layers, our own de-normalised views or any other architectural clever-bits just in order to get a site that _works_.

All of these indexes are pre-computed, we’re not doing the calculations at read-time, our indexes are our read store, our documents are our write store.

This technique will work for nearly all situations where we want to display some sort of  aggregation (total, average, etc) next to the document we actually wish to display – it won’t of course work for multiple aggregations, but because queries are so cheap in Raven, it isn’t going to harm you to do a couple of extra look-ups for that data.

Hurrah!

2015 © Rob Ashton. ALL Rights Reserved.