RavenDB - The HiLo what how and why

Published on 2010-5-16

One of the issues I touched on when covering basic interaction with RavenDB was the awkwardness of having to call SaveChanges in order to get the ids of entities that had been saved across the unit of work. This is not a problem new to the document db space, nor is it new to any system where the domain has been mapped to an id-based data store (ORMs/RDBMS/etc).

I was going to cook a home brew solution specifically for my use within my projects and blog about it in order that other people could use it, but after posting my intentions in the RavenDB mailing list to create something like this, Oren suggested that making it the default behaviour and moving id generation to the Store would be a welcome move.

After posting on Twitter about this now being default, I got asked quite a few questions on what HiLo was, what the advantages were, and why it was a good thing that in the .NET client for RavenDB this was now going to be the default.

The gist

The algorithm

The basic premise is that the server still controls id generation, but effectively hands out a range of ids to each client. The client can then assign those ids to objects as they are created, and when it runs out it simply requests more.

Obviously, requesting a heap of ids all at the same time would be expensive, so the idea is that the server provides a single id, a “Hi” value, which controls the creation of the range on the client (which provides the “Lo” value).

There are a number of ways this can be implemented, but the one I chose was probably the simplest, and credit goes to Tuna Toksoz for the blog entry which gave me the means to implement it myself.

    (currentHi - 1)*capacity + (++currentLo)
  • When currentLo reaches capacity, a new Hi is requested and the cycle starts over again

In the actual implementation, there is some locking going on around this algorithm in order to make the client generator available across threads (web requests) and avoid having to create a new generator per session (defeating the point of having one if you only create a single object in a session).
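To make the formula and the locking concrete, here is a minimal sketch in Python rather than the .NET client's actual code. The class name `HiLoGenerator`, the `request_hi` function and the in-process counter standing in for the server are all assumptions made for illustration.

```python
import itertools
import threading


class HiLoGenerator:
    """Hands out ids from the range reserved by a single Hi value."""

    def __init__(self, next_hi, capacity):
        self._next_hi = next_hi       # hypothetical callable that asks the server for a new Hi
        self._capacity = capacity
        self._lock = threading.Lock()
        self._current_hi = 0
        self._current_lo = capacity   # forces a Hi request on first use

    def next_id(self):
        # The lock lets one generator be shared across threads (web requests).
        with self._lock:
            if self._current_lo >= self._capacity:
                self._current_hi = self._next_hi()
                self._current_lo = 0
            self._current_lo += 1
            return (self._current_hi - 1) * self._capacity + self._current_lo


# Stand-in for the server: an in-process counter starting at 1 (an assumption).
_server_counter = itertools.count(1)
_server_lock = threading.Lock()


def request_hi():
    with _server_lock:
        return next(_server_counter)


generator = HiLoGenerator(request_hi, capacity=3)
ids = [generator.next_id() for _ in range(6)]
print(ids)  # [1, 2, 3, 4, 5, 6]
```

Only one call in every `capacity` calls reaches the server stand-in; the rest are answered from local state.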

Let’s look at a sample run through, with a small capacity of “3”, to keep the sample small!

Description   currentLoBefore   currentHi   Created Id   currentLoAfter
Hi Request    0                 1           1            1
              1                 1           2            2
              2                 1           3            3 (capacity)
Hi Request    0                 2           4            1
              1                 2           5            2
              2                 2           6            3 (capacity)

As we can see, if all the clients are using the same capacity, and they are given different “Hi” values, then they can’t generate duplicate keys, but by and large they’ll be sequential in nature.
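A small simulation shows the same thing with two clients. This is only a sketch: `server_hi` is an in-process counter standing in for the server, and `client_ids` is a hypothetical helper, not part of any RavenDB client.

```python
import itertools

CAPACITY = 3
server_hi = itertools.count(1)  # stand-in for the server handing out Hi values


def client_ids(capacity):
    """Yield ids for one client, requesting a new Hi whenever Lo reaches capacity."""
    while True:
        current_hi = next(server_hi)
        for current_lo in range(1, capacity + 1):
            yield (current_hi - 1) * capacity + current_lo


client_a = client_ids(CAPACITY)
client_b = client_ids(CAPACITY)

a_ids, b_ids = [], []
for _ in range(4):
    # Interleave the two clients, as concurrent applications would.
    a_ids.append(next(client_a))
    b_ids.append(next(client_b))

print(a_ids)  # [1, 2, 3, 7]
print(b_ids)  # [4, 5, 6, 10]
```

Each client's ids are sequential within a range and jump only when a new Hi is requested, and the two lists never share a value.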

The implementation in RavenDB

In RavenDB, the default function configured against the DocumentConvention is now HiLo, which means if a new document is saved against the session with its Id set to NULL, it will have an Id generated on the spot which contains the name of the document and the incremented Id. Obviously this can be overridden by changing the convention to leave the created id at some default value of your application’s choosing.

My original implementation was a bit poor, generating quite a bit of noise in the document database (it was inserting documents to get the ids), and the incremented Ids were being shared amongst objects – which meant if you created say, blogentry/1, saving a new user would mean having newuser/2.

Oren changed this to store a single object directly in RavenDB for the generator, and to create a generator per type, which means a lot less noise and more sensible ids being generated for each document.
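The per-type idea can be sketched as follows. This is an illustration in Python, not the client's real code: `PerTypeKeyGenerator`, the lower-cased type name used as the key prefix and the per-type counters standing in for the server are all assumptions, and locking is left out for brevity.

```python
from collections import defaultdict
import itertools


class PerTypeKeyGenerator:
    """Keeps an independent Hi/Lo pair for each document type name."""

    def __init__(self, capacity):
        self._capacity = capacity
        # Stand-in for the server: one Hi counter per type name (an assumption).
        self._server_hi = defaultdict(lambda: itertools.count(1))
        self._state = {}  # type name -> [current_hi, current_lo]

    def generate_key(self, document):
        type_name = type(document).__name__.lower()
        state = self._state.get(type_name)
        if state is None or state[1] >= self._capacity:
            # First use of this type, or its range is exhausted: request a new Hi.
            state = [next(self._server_hi[type_name]), 0]
            self._state[type_name] = state
        state[1] += 1
        number = (state[0] - 1) * self._capacity + state[1]
        return f"{type_name}/{number}"


class BlogEntry:
    pass


class User:
    pass


keys = PerTypeKeyGenerator(capacity=3)
generated = [keys.generate_key(d) for d in (BlogEntry(), User(), BlogEntry(), User())]
print(generated)  # ['blogentry/1', 'user/1', 'blogentry/2', 'user/2']
```

Because each type has its own counter, creating a blog entry no longer advances the numbering for users.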

What it means

What this essentially means is that if you’re using RavenDB out of the box without changing any of the conventions, documents will have a generated Id as soon as Store is called for that document. SaveChanges therefore does not have to be called until right at the very end of the Unit of Work, so all changes can be efficiently batched in a single request. As a result, applications should be easier to write and performance should be easier to maintain.
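The effect on a unit of work can be shown with a toy model. `ToySession` below is a hypothetical Python class written for this post, not the RavenDB client API; it only mimics the idea that storing a document assigns an id locally while saving sends everything in one batch.

```python
import itertools


class ToySession:
    """Toy unit of work: ids are assigned on store, documents are sent on save_changes."""

    def __init__(self, next_id, send_batch):
        self._next_id = next_id        # hypothetical local id source (e.g. a HiLo generator)
        self._send_batch = send_batch  # hypothetical function performing one server request
        self._pending = []

    def store(self, document):
        if document.get("Id") is None:
            document["Id"] = self._next_id()  # assigned locally, no round trip
        self._pending.append(document)

    def save_changes(self):
        self._send_batch(list(self._pending))  # everything goes out in a single request
        self._pending.clear()


requests = []                  # records each simulated request to the server
counter = itertools.count(1)   # stand-in for a HiLo generator
session = ToySession(lambda: f"doc/{next(counter)}", requests.append)

entry = {"Id": None, "Title": "HiLo"}
session.store(entry)
# The id is usable straight away, before anything has been saved.
comment = {"Id": None, "EntryId": entry["Id"]}
session.store(comment)
session.save_changes()

print(comment["EntryId"])  # doc/1
print(len(requests))       # 1
```

The second document can reference the first by id before any save, and both still travel to the server together.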

This is a .NET client specific feature and nothing was changed in the database itself to make this work.

What this does mean is that if multiple clients from different platforms are going to be connecting to RavenDB and manipulating data, and you’re using the default HiLo implementation, then a similar algorithm will need implementing for those other platforms, using the same capacity, in order to prevent clients generating the same ids. This is not necessarily a downside, but it is worth noting if you are going to have this sort of setup.
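To see why the capacity has to match, here is a short sketch using the same formula as above. The helper `ids_for_hi` and the chosen capacities are made up for the example.

```python
def ids_for_hi(hi, capacity):
    """All ids a client would hand out from one Hi value at the given capacity."""
    return {(hi - 1) * capacity + lo for lo in range(1, capacity + 1)}


# Two clients that were given different Hi values but disagree on capacity.
first_client = ids_for_hi(hi=2, capacity=3)    # {4, 5, 6}
second_client = ids_for_hi(hi=1, capacity=5)   # {1, 2, 3, 4, 5}
overlap = first_client & second_client
print(sorted(overlap))  # [4, 5]

# With the same capacity, different Hi values give disjoint ranges.
same_capacity_overlap = ids_for_hi(hi=2, capacity=3) & ids_for_hi(hi=1, capacity=3)
print(same_capacity_overlap)  # set()
```

The different Hi values only guarantee unique ids when every client maps a Hi to the same size of range.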

What I learned

While I might contribute the odd bug fix to open source projects now and then, the idea of going in and changing the fundamental way the .NET RavenDB client worked was a bit daunting, not from a technical perspective but from a taste perspective, as I wasn’t sure how Oren wanted things done. As he later said, he’d prefer that code which then has to change be submitted than no code at all. I’d like to pass that on to anybody who wants to contribute to this project: if you’ve got a good idea then hit the mailing list, suggest it and maybe implement it. There is nothing to be lost if it’s something people want to use.

In the end, my implementation is barely visible in there, but I'm still pleased that this is in there, as it makes *my* life easier :)


Sean


When using hilo myself, I found it worked better if I implemented the hi in increments of e.g. 1000 and then got a new id with (currentHi) + (++currentLo). That lets you change the capacity client side without problems. It also simplified getting a new id inside sql server stored procs.

robashton


Surely that means the client still needs to be aware of the capacity (because they'd still be able to go past capacity if it was changed on the server). Either way it's brittle if you're not in control of the whole system - but if you're writing a proper system then only one client will access the database (ideally), and everything else will go through that client so it doesn't matter.

Ken Egozi


@rob: proper system => only one client? With large, complex systems, you can easily find a mix of technologies. Like RoR front-end, erlang based chat server, .NET logic engine, and Java based batch processing. One of the things that can make RavenDB appealing would be a consistent set of clients for major environments. For example, I like that with MongoDB you get official server build for every possible type of host (irrelevant with RavenDB as it is dependent on ESENT afaik), and a large list of consistent client APIs.

robashton


Sure - but most architects would balk at the idea of letting all those things go directly to the database. Instead, the main application would most likely expose services which these would go to. Even if you weren't to do that (because let's face it, RavenDB exposes the ability to load logic directly into it *and* exposes REST services so why *not* go directly to it), the fact remains that with any client-id assignation system you're going to have to standardise how and when those clients generate those ids and it doesn't matter which variant of the HiLo algorithm you use :)