This consultant shows you one weird trick to take down a managed database

Published on 2013-10-29

Sysadmins hate him...

I was visiting a client last week who have been having trouble with their RavenDB instance for a few months, and who were understandably getting a bit frustrated as time went on.

The scene

I arrived, drank some coffee, and we hit a room with a projector in it to bring up the graphs of resource usage on the server running RavenDB - they've been pretty handy with Splunk and have quite a few graphs! (Their usage of Splunk was awesome actually; I can highly recommend looking at it.)

Memory usage looks something like this through the day

That 4am block is the result of an automated process to kill their RavenDB instance every day, because if they left it running it would bring down the server when people were actually using the system - not so good! (It starts spiking around 9am because that's when it comes under quite a reasonable load.)

My line of questioning on seeing this soon got to what was actually inside those documents: it turned out they were storing PDFs as byte arrays on the documents themselves, some of them several megabytes in size.

Ah.

The thing is, RavenDB can deal with large documents. Internally it does quite a few things to avoid objects ending up on the Large Object Heap or being promoted to the 2nd generation.

If you were to create documents with lots of fields that together reached that kind of size, in all likelihood RavenDB's practices around this kind of thing would result in happy developers, happy ops and happy sales teams.

But byte arrays big enough to be automatically put on the Large Object Heap (anything of 85,000 bytes or more)? There is little Raven can do about these: when documents are internally de-serialized into tokens, the smallest token it can make from a byte array is however large the byte array is!
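To make that concrete, here's a minimal sketch - the ScannedInvoice shape is a hypothetical stand-in for the client's documents - showing the CLR putting large arrays straight onto the LOH:

```csharp
using System;

// Hypothetical document shape standing in for the client's data: a little
// metadata plus a byte[] holding an entire PDF.
public class ScannedInvoice
{
    public string CustomerName { get; set; }
    public byte[] Content { get; set; }
}

class Demo
{
    static void Main()
    {
        // The CLR allocates arrays of 85,000 bytes or more directly on the
        // Large Object Heap, which it reports as generation 2.
        var invoice = new ScannedInvoice { Content = new byte[5 * 1024 * 1024] };

        Console.WriteLine(GC.GetGeneration(new byte[80000])); // 0 - small object heap
        Console.WriteLine(GC.GetGeneration(invoice.Content)); // 2 - straight onto the LOH
    }
}
```

And because the LOH is only collected alongside generation 2 (and, on this era of .NET, never compacted), churning through allocations like this fragments memory and drives usage steadily upwards.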

Under what circumstances does RavenDB load these fields?

Imagine now that you create a new index on the server and it has to load and de-serialize every document in the database in order to build it - byte arrays and all, each one landing straight on the Large Object Heap.
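As a sketch of what that looks like (using the 2013-era client API, and the hypothetical ScannedInvoice shape from above):

```csharp
using System.Linq;
using Raven.Client.Indexes;

// Hypothetical document shape again, so the example stands alone.
public class ScannedInvoice
{
    public string CustomerName { get; set; }
    public byte[] Content { get; set; }
}

public class Invoices_ByCustomer : AbstractIndexCreationTask<ScannedInvoice>
{
    public Invoices_ByCustomer()
    {
        // The map only touches CustomerName, but to evaluate it the server
        // still de-serializes every stored document in full - Content included.
        Map = invoices => from invoice in invoices
                          select new { invoice.CustomerName };
    }
}
```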

This is just typical .NET behaviour, and to make things worse, when the issues were first noticed the first port of call was to open Raven Studio and start inspecting the server (performing queries), thus adding to the problem and causing even more hilarious memory spikes.
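Queries have the same character - here's a sketch, re-using the hypothetical ScannedInvoice from above, of how even a small Studio-style page of results drags whole documents into memory:

```csharp
using System;
using System.Linq;
using Raven.Client.Document;

// Every query result comes back as a full document - Content bytes and all -
// on both the server and the client.
using (var store = new DocumentStore { Url = "http://localhost:8080" }.Initialize())
using (var session = store.OpenSession())
{
    var page = session.Query<ScannedInvoice>()
                      .Take(25) // 25 results = 25 whole PDFs pulled into memory
                      .ToList();

    Console.WriteLine(page.Count);
}
```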

To give an indication, when we opened up the performance counters for the server, the kind of thing we were seeing looked like this:

Yes indeed, that's nearly all the memory on the server being allocated to the LOH as a result of excessive large objects of varying sizes being aggressively loaded through the indexing and querying processes.
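If you want to watch the same counter programmatically rather than in perfmon, a minimal sketch (the "Raven.Server" instance name is an assumption - use whatever the server process is called on your box):

```csharp
using System;
using System.Diagnostics;

// Read the LOH size counter for a given process instance.
var loh = new PerformanceCounter(
    ".NET CLR Memory",        // category
    "Large Object Heap size", // the counter that was pinned to the ceiling
    "Raven.Server");          // hypothetical process instance name

Console.WriteLine("LOH bytes: {0:N0}", loh.NextValue());
```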

The solution?

Much like with every other database out there, storing binary blobs in a store which is built for querying/transactions isn't ideal - but there are two options available here: keep the binary data outside the database entirely and store a reference to it on the document, or move it into RavenDB's attachments.

The latter isn't encouraged as it's just a convenience - but to prove a point I generated 1.5 million documents of varying sizes with byte arrays on their fields to reproduce the problem on my laptop (that's actually the screenshot above), then migrated them into attachments to show what a difference this would make, as attachments are never loaded fully into memory.
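For the curious, the migration amounts to something like this - a minimal sketch assuming the 2.x-era PutAttachment signature, with hypothetical paths and keys:

```csharp
using System.IO;
using Raven.Client.Document;
using Raven.Json.Linq;

// Stream the PDF in as an attachment (never materialised as one big byte[])
// and keep only the attachment key on the document itself.
using (var store = new DocumentStore { Url = "http://localhost:8080" }.Initialize())
using (var pdf = File.OpenRead(@"C:\scans\invoice-1234.pdf")) // hypothetical path
{
    var metadata = new RavenJObject();
    metadata.Add("Content-Type", new RavenJValue("application/pdf"));

    store.DatabaseCommands.PutAttachment(
        "invoices/1234/pdf", // the key a document would carry as a reference
        null,                // etag - null skips the concurrency check
        pdf,
        metadata);
}
```

The document then carries only the attachment key, so indexing and querying never touch the PDF bytes at all.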

What a difference choosing an appropriate store makes! In that second set of numbers the "PDFs" are still being stored in RavenDB, just not in the primary document store.

When I left the client, their server was sitting flat at 4GB consumption (with the database still full of PDFs, but with instructions on how to avoid causing issues until they had been purged).

The summary

I'm currently writing my own database on a different managed platform, and I'm strongly considering sticking indexing into its own process to avoid this sort of long-term build-up of issues. That said, the JVM doesn't do per-process GC, so that might not help that much.

Either way it's interesting, and it points to one of the limitations of writing a database or any high-throughput system in a managed environment if you're going to be expecting big chunks of data that can't be broken up somehow. (Okay, this is quite specific, and will rarely catch anybody out.)
