Sunday, June 13, 2010
I’ve just spent the weekend at the Google offices in London taking part in HackCamp, a replacement event for BarCamp, which was cancelled due to problems with the venue.
I went not knowing what I’d be doing or what I’d be doing it with, and after a presentation by @themattharris on Twitter annotations at the start of the day, @JHollingworth and I decided that with an absence of good ideas it would be fun to abuse a new feature (annotations) by doing something pointless and absolutely useless to anybody, even ourselves.
Hence the idea of TwitterFS was born.
The setup
- With twitter annotations, we now have the ability to store 512 bytes of arbitrary data against each tweet
- Each tweet has a unique identifier assigned to it on save
- Each tweet can have 140 additional characters stored in the content of the tweet itself
- Tweets cannot be modified once written
With Twitter annotations, it became obvious that what we needed to do was create a low-availability, low-consistency, low-performance file system on top of Twitter.
Twitter could then be used as an “in the cloud” store of arbitrary data which could be synced between machines à la Dropbox. (TwitterBox?)
So what do we have?
Effectively, in our file system each tweet is an inode, and an inode contains the data for that inode and a link to the next inode (if the data is too large to be stored on a single inode).
- Directories can be implemented as a sequence of inodes which contain a list of ids for other inodes
- Files can be implemented as a sequence of inodes which contain the data for that file
In an ordinary file system, we have a finite amount of space and when files are deleted or modified, any freed up inodes need re-allocating so they can be written to again when more data is added.
This is not the case with Twitter as we cannot re-allocate tweets, but we can therefore treat Twitter as an append-only infinite sized hard drive.
- Adding a file means
  - breaking up the file into separate inodes
  - writing them all to Twitter (backwards)
  - writing the last written inode into the directory the file belongs to
  - re-writing any directories in the tree for that file (up to and including the “root” directory)
- Deleting a file means
  - removing the reference from the directory it belongs to
  - re-writing any directories in the tree for that directory (up to and including the “root” directory)
- Editing a file means
  - deleting the file
  - adding the file
The same goes for directories.
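The "adding a file" steps above can be sketched roughly as follows. This is a minimal in-memory illustration in JavaScript, not the TwitterFS code itself (which was Ruby): `AppendOnlyStore`, `writeFile` and `readFile` are hypothetical names, and an array stands in for Twitter. It shows why the inodes get written backwards: each inode needs the id of the one that follows it, and ids are only known after saving.

```javascript
// Hypothetical stand-in for Twitter: an append-only store that assigns an id on save.
class AppendOnlyStore {
  constructor() {
    this.items = [];
  }
  // Appends a record and returns its newly assigned id (ids start at 1).
  save(record) {
    this.items.push(record);
    return this.items.length;
  }
  load(id) {
    return this.items[id - 1];
  }
}

// Splits data into chunks and writes them last-to-first, so each inode
// can store the id of the inode that follows it. Returns the head inode id.
function writeFile(store, data, chunkSize) {
  const chunks = [];
  for (let i = 0; i < data.length; i += chunkSize) {
    chunks.push(data.slice(i, i + chunkSize));
  }
  let nextId = null;
  for (let i = chunks.length - 1; i >= 0; i--) {
    nextId = store.save({ data: chunks[i], next: nextId });
  }
  return nextId;
}

// Follows the chain of "next" links from the head inode and rebuilds the data.
function readFile(store, headId) {
  let result = "";
  for (let id = headId; id !== null; id = store.load(id).next) {
    result += store.load(id).data;
  }
  return result;
}

const store = new AppendOnlyStore();
const head = writeFile(store, "Some lovely data (c)", 8);
console.log(readFile(store, head)); // prints "Some lovely data (c)"
```

The head id returned by `writeFile` is what a directory inode would record for the file.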
On top of this, a file system watcher was written which would detect changes on the hard drive, add/remove files/directories to the in-memory store and flush the changes to twitter when they were made. (And detect changes to twitter and perform the reverse operation).
Obviously loading the entire tweet stream would defeat the point of storing the data on twitter, so a look-ahead/caching algorithm was implemented, pulling back 200 nodes when 1 was requested and keeping our requests to a minimum.
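A look-ahead cache of this kind might look something like the sketch below. It is only an illustration under assumptions: `LookAheadCache` and the `fetchBatch` callback are made-up names, and the fake remote store uses consecutive ids, which real tweet ids are not.

```javascript
// Hypothetical cache: on a miss, fetch a whole batch of nodes in one request.
class LookAheadCache {
  constructor(fetchBatch, batchSize) {
    this.fetchBatch = fetchBatch; // (startId, count) => array of { id, data }
    this.batchSize = batchSize;
    this.nodes = new Map();
    this.requests = 0; // number of remote requests made so far
  }
  get(id) {
    if (!this.nodes.has(id)) {
      this.requests++;
      for (const node of this.fetchBatch(id, this.batchSize)) {
        this.nodes.set(node.id, node);
      }
    }
    return this.nodes.get(id);
  }
}

// Fake remote store holding nodes 1..1000 (assumption: ids are consecutive).
const remote = (startId, count) => {
  const batch = [];
  for (let id = startId; id < startId + count && id <= 1000; id++) {
    batch.push({ id, data: "node " + id });
  }
  return batch;
};

const cache = new LookAheadCache(remote, 200);
cache.get(1);   // miss: fetches nodes 1..200
cache.get(150); // served from the cache, no extra request
cache.get(201); // outside the first batch, triggers a second request
console.log(cache.requests); // prints 2
```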
What the code looks like
This is a bit of code that loads two files into the root directory, a directory and another file so you can see that the act of dealing with Twitter is not the concern of the application. In one of our tests, data was a byte array loaded from an image on the hard drive and that worked fine too.
    persister = TwitterPersister.new

    fs = FileSystem.new persister, :isnew => true
    root = fs.root

    # Two documents straight into the root directory
    documenta = Document.new(fs, :title => "Document A", :data => "Some Data (a)")
    documentb = Document.new(fs, :title => "Document B", :data => "Some other data (b)")

    root.add_documents([documenta, documentb])

    # A sub-directory containing a third document
    dir = Directory.new(fs, nil)
    documentc = Document.new(fs, :title => "Document C", :data => "Some lovely data (c)")
    dir.add_document(documentc)

    root.add_directory(dir)

    # Push everything out to Twitter
    fs.flush()
How did it go?
Implementing a file system against an unreliable persistence store was always going to cause problems. When it came to the presentation we still didn’t have access to the annotations feature because Twitter had been falling over all weekend, so we had to start cramming the file data into the tweet itself (70 bytes per inode, ouch!), and then Twitter itself went down anyway.
I doubt we’ve won anything (although we got some laughter), but I’m overall quite impressed with how far we got (all the above has been written and tested)
(Edit, okay – we won “Stupidest hack most likely to win a million dollars of VC funding”, sweet!)
- The slides for our 1-minute presentation can be found here
- The code for TwitterFS and the commit history for the weekend can be found here
- The account we used for testing can be found here
We ended up writing the whole thing in Ruby because we’re C# developers by day and as we were doing something ultimately pointless we tried to give it a bit of a point by working in something we were unfamiliar with for education purposes.
My opinion on that? I really liked working within another dynamic language, our tests were a bit crap because we were unfamiliar with the frameworks and it was a pain to get working properly on my windows laptop (I ended up using cygwin for *everything*), but it’s given me some enthusiasm for going and giving Rails a second glance.
Anyway, normal service will resume tomorrow and I’ll be pushing out another RavenDB/CouchDB entry or two this week, just in case you were wondering where they had gotten to :)
Monday, June 07, 2010
Taking a brief interlude from my RavenDB series, I was doing some work on an internal project tonight with the build scripts and test-runner and I finally got bored of having to deal with un-managed SQLite dependencies with a project which other than that was platform agnostic.
The problem with having un-managed dependencies in a managed project is that Visual Studio quite frankly sucks at it. You can set up certain projects (in this case the tests) to be x86 only and remove their Any CPU configuration, but as soon as you add a new project to the solution it decides to re-add the old configuration and potentially break things again.
This doesn't really cause any problems until you write a build script and things start falling over as your test runner tries to run as an x64 process and load the x86 dependency, or any number of combinations where this kind of thing can blow up. If it can happen, it will happen, and it’s just something I’d rather not deal with.
So I had a look at Sqlite-Csharp, the code is atrocious as far as natively written C# libraries go (that’s not the point though, it’s a *port*), but it looks to be a superb direct-port of a C project (Sqlite) and passes most of the tests that it needs to in order for it to be viable for use in at least our in-memory tests.
Anyway, you can’t download binaries, so you have to build it – but no changes are required so just do it.
I’m not going to cover the process of setting up in-memory databases for testing with SQLite as that’s an easily Google-able topic, but there are a few differences between doing it with the unmanaged libraries and with the pure managed libraries.
This is what my FluentNHibernate configuration looks like:
    Fluently.Configure()
        .Database(
            SQLiteConfiguration.Standard.ConnectionString(
                x => x.Is(mConnectionString)).Driver<SqliteDriver>());
I’ve had to create a driver to make this work properly as there isn’t one provided as stock in NHibernate, the code for this is as simple as this:
    public class SqliteDriver : ReflectionBasedDriver
    {
        /// <summary>
        /// Initializes a new instance of <see cref="SqliteDriver"/>.
        /// </summary>
        /// <exception cref="HibernateException">
        /// Thrown when the <c>Community.CsharpSqlite.SQLiteClient</c> assembly can not be loaded.
        /// </exception>
        public SqliteDriver()
            : base(
                "Community.CsharpSqlite.SQLiteClient",
                "Community.CsharpSqlite.SQLiteClient.SqliteConnection",
                "Community.CsharpSqlite.SQLiteClient.SqliteCommand")
        {
        }

        public override bool UseNamedPrefixInSql
        {
            get { return true; }
        }

        public override bool UseNamedPrefixInParameter
        {
            get { return true; }
        }

        public override string NamedPrefix
        {
            get { return "@"; }
        }

        public override bool SupportsMultipleOpenReaders
        {
            get { return false; }
        }

        public override bool SupportsMultipleQueries
        {
            get { return true; }
        }
    }
Yeah, not terribly exciting – just add a reference to Community.CsharpSqlite.SQLiteClient and this will work.
The other major difference is that the delimiter between connection string components is a comma, and the method of selecting an in-memory database looks different. This is my connection string:
    "uri=file://:memory:,Version=3";
And this is the code I use to create the connection:
    private SqliteConnection GetConnection()
    {
        // Re-use a single open connection so the in-memory database survives between sessions
        if (mConnection == null) {
            mConnection = new SqliteConnection(mConnectionString);
            mConnection.Open();
        }
        return mConnection;
    }
And this is therefore the code I use to open a session from the session factory:
    mFactory.OpenSession(GetConnection());
A word of warning
Mileage may vary. I had 11 tests out of about 300 fail, mostly due to unrecognised types, null values and exceptions that were different in this version of Sqlite. I’m submitting some code fixes for the unrecognised types and null values, and modifying my tests to take the new exception types into account.
Also, I can’t guarantee I’ve done it right, so let me know if I’ve done something stupid.
All of my tests are now Any CPU and my build process is suddenly a lot simpler, I’ll take the hit of having to submit and change a bit of code in order to get that.
Sunday, June 06, 2010
Previous entries in the series
One of the recurring features present in the popular document databases is the use of map-reduce functions as the primary way to create views on the stored data.
Map Reduce
At this point, I could go into a long description of what map/reduce actually is but that kind of thing is available via the use of a convenient google search.
The short of it is that you map some data from each document into a structure to be queried on, and then run (and re-run) a reduce function over the mapped data in order to group it by some key.
Now, these map functions can get quite complicated, but the concept remains the same from the most basic versions up to the more complicated reports on the data.
Let’s look at the standard example of getting the number of comments across all the blog entries by a certain author.
Here is the structure of our example document:
    {
        author: 'robashton',
        category: 'tech',
        content: 'blah blah blah',
        comments: [
            { author: 'anon', content: 'blah' },
            { author: 'anon', content: 'more blah' }
        ]
    }
The important data here is ‘author’ and the length of the comments array, so naturally we would map these in our map function.
In RavenDB
    from doc in docs
    select new
    {
        author = doc.author,
        count = doc.comments.Length
    }
In CouchDB
    function(doc) {
        emit(doc.author, doc.comments.length);
    }
There isn’t anything much to say about these, the RavenDB map function just returns the mapped data, and the CouchDB function emits the mapped field(s) as a key alongside the value(s) associated with that key.
The reduction will therefore take place on ‘author’ (the key), and we would sum all the comment counts (the value) for that author in the reduce function.
In RavenDB
    from result in results
    group result by result.author into g
    select new
    {
        author = g.Key,
        count = g.Sum(x => x.count)
    }
In CouchDB
    function (key, values, rereduce) {
        return sum(values);
    }
The structure of these two functions immediately strikes us as being very different, and that raises some questions.
- In RavenDB the reduce function is handed a selection of mapped values and its job is to group them by some key and return a new set of mapped values.
- In CouchDB, the map function emits the key to reduce on, and a separate value to be combined by the reduce method. This means that when the reduce method is called, it might have a collection of those values, but they’re already grouped by key.
- In RavenDB, the reduce function must consume the output of the map function and then output results in the same shape as its own input, so that it can be re-run over its own results. What this roughly equates to is that the reduce function and map function must return the same result shape.
- In CouchDB, the result shapes of the map and reduce functions don’t have to bear any relation to each other, and of course emit can be called multiple times per document. This can be more flexible, but it also leads to greater complexity in the reduce function because the input can be different depending on the context in which it is called.
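To make the contrast in the points above concrete, here is a rough plain-JavaScript simulation of both styles over the same documents. Neither database is involved: `ravenMap`, `ravenReduce`, `couchMap`, `couchReduce` and `runCouchView` are hypothetical names, and `emit` is passed in as a parameter here, whereas a real CouchDB map function calls a global `emit`.

```javascript
const docs = [
  { author: "robashton", comments: [{}, {}] },
  { author: "robashton", comments: [{}] },
  { author: "anon", comments: [{}, {}, {}] },
];

// RavenDB style: map returns a plain projection; reduce does the grouping
// and returns items of the same shape as its input.
const ravenMap = (doc) => ({ author: doc.author, count: doc.comments.length });
function ravenReduce(results) {
  const totals = new Map();
  for (const r of results) {
    totals.set(r.author, (totals.get(r.author) || 0) + r.count);
  }
  return [...totals].map(([author, count]) => ({ author, count }));
}

// CouchDB style: map emits (key, value); the engine groups by key and
// reduce only combines the values for one key.
const couchMap = (doc, emit) => emit(doc.author, doc.comments.length);
const couchReduce = (key, values) => values.reduce((a, b) => a + b, 0);

// Simulated engine: runs the map, groups emitted values by key, then reduces each group.
function runCouchView(allDocs) {
  const grouped = new Map();
  for (const doc of allDocs) {
    couchMap(doc, (key, value) => {
      if (!grouped.has(key)) grouped.set(key, []);
      grouped.get(key).push(value);
    });
  }
  return [...grouped].map(([key, values]) => ({ key, value: couchReduce(key, values) }));
}

const ravenResult = ravenReduce(docs.map(ravenMap));
const couchResult = runCouchView(docs);
console.log(ravenResult); // [{ author: "robashton", count: 3 }, { author: "anon", count: 3 }]
console.log(couchResult); // [{ key: "robashton", value: 3 }, { key: "anon", value: 3 }]
```

Because `ravenReduce` returns the same shape it accepts, feeding its output back in gives the same totals, which is what lets it be re-run.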
The reduce function can be called more than once for a set of documents (and this holds true for both Raven + Couch, that’s the whole point of map/reduce), and in CouchDB an extra parameter is present called “rereduce”, which specifies whether this is the first pass or a subsequent pass. If the result shapes differ between the map function and reduce function, a check is required so different logic can be performed based on this.
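As a small illustration of why that check matters, here is a sketch of a CouchDB-style reduce that counts emitted values rather than summing them. The function follows the `(keys, values, rereduce)` signature shown earlier; the "engine" calls at the bottom are simulated by hand, and the chunk contents are made up.

```javascript
// Counts how many values were emitted (a CouchDB-style reduce).
// First pass: values are raw map outputs, so count them.
// Rereduce pass: values are counts from earlier reduce calls, so add them up.
function countReduce(keys, values, rereduce) {
  if (rereduce) {
    return values.reduce((a, b) => a + b, 0);
  }
  return values.length;
}

// Simulate the engine reducing two chunks of map output, then combining them.
const chunkA = countReduce(null, [2, 5, 1], false); // 3 values
const chunkB = countReduce(null, [7, 3], false);    // 2 values
const total = countReduce(null, [chunkA, chunkB], true);
console.log(total); // prints 5
```

Without the `rereduce` branch the final call would return 2 (the number of partial results) instead of 5.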
For those of you who skipped that big body of text, the important thing to take away is that in RavenDB the responsibility of deciding what to group the documents on falls to the Reduce function, and in CouchDB the responsibility falls to the Map function.
So, these are two rather different ways of thinking about MapReduce and this is definitely something to be aware of when trying to jump between the two.
This was quite a long entry with a really short summary, so in the next entry, I’ll be listing and explaining some of the actual functionality differences between CouchDB and RavenDB.
Wednesday, June 02, 2010
Previous entries in the series
Once you have a number of documents in the database, you soon want to do more complex operations than simply retrieving a list of them.
Consider therefore the following and rather over-used example document:
    {
        title: 'Another blog entry',
        content: 'blah blah blah',
        category: 'code',
        author: 'robashton'
    }
Our example query would be to get all of the documents from the database that were written by a particular author AND in a certain category.
Obviously querying all the blogs written by a single author, or all the blogs in a certain category would be fairly expected queries too.
Indexes in RavenDB
In order to perform any queries whatsoever in RavenDB, we first need to create an index.
    from doc in docs
    select new {
        doc.author,
        doc.category
    };
This is effectively a map function written as a LINQ query which returns a single value, an object that is a map of the values to be indexed.
Get all the documents by author and category
indexes/entriesByAuthorAndCategory?query=category:tech AND author:robashton
Get all the documents by category
indexes/entriesByAuthorAndCategory?query=category:tech
Get all the documents by author
indexes/entriesByAuthorAndCategory?query=author:robashton
Those queries will return a list of whole documents which match the queries passed in.
Indexes in CouchDB
The same goes for CouchDB, only map functions in CouchDB have two outputs, and are written in JavaScript.
    function(doc) {
        emit([doc.category, doc.author], doc);
    }
Return values are specified by calling emit, and emit can be called more than once for each document, so multiple keys can be created for each document with a single map function. The first parameter to emit is the “key” to be searched on, and the second parameter is the data associated with that key (in this case, the document).
Get all the documents by author and category
blogs/_view/byAuthorAndCategory?key=["tech","robashton"]
Get all the documents by category
blogs/_view/byAuthorAndCategory?startkey=["tech"]&endkey=["tech",{}]
Get all the documents by author
Ah. This is suddenly a bit more complicated. I’ve not actually managed to come to a convenient solution. As far as I can understand from the docs, if you want to query specific fields within the key, you have to submit a POST request containing a JSON document with the fields you wish to search.
So it’s either that or create specific indexes for the queries you wish to perform. Performance-wise this is probably optimal, but I don’t actually know for sure.
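One way to see why the author-only query is awkward is to look at how rows keyed on [category, author] end up ordered. The sketch below sorts a few made-up rows in plain JavaScript; `compareKeys` is a simplified, hypothetical stand-in for CouchDB's real key collation and only handles arrays of strings.

```javascript
// Compares two array keys element by element (strings only, for brevity).
function compareKeys(a, b) {
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length;
}

const rows = [
  { key: ["tech", "robashton"], id: "1" },
  { key: ["code", "robashton"], id: "2" },
  { key: ["tech", "anon"], id: "3" },
  { key: ["code", "anon"], id: "4" },
].sort((x, y) => compareKeys(x.key, y.key));

// Rows sharing a leading key element are adjacent, so one key range covers them.
const techIds = rows.filter((r) => r.key[0] === "tech").map((r) => r.id);

// Rows sharing only the second element are scattered through the index.
const robPositions = rows
  .map((r, i) => (r.key[1] === "robashton" ? i : -1))
  .filter((i) => i >= 0);

console.log(rows.map((r) => r.key.join("/"))); // code/anon, code/robashton, tech/anon, tech/robashton
console.log(techIds);      // [ "3", "1" ]
console.log(robPositions); // [ 1, 3 ]
```

The category rows sit next to each other, while the rows for one author are interleaved with everyone else's, so no single start/end key range can select them.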
Paging in RavenDB
Paging in RavenDB is as simple as appending a start + pageSize to the query string
indexes/entriesByAuthorAndCategory?query=category:tech&start=10&pageSize=10
This will perform the query across the entire index and only retrieve the documents requested, which makes it an operation of trivial expense.
Paging in CouchDB
In CouchDB, a similar query string can be used, using “skip” and “count” parameters, but these are considered expensive, and instead to perform paging you should:
- Get the first collection of documents, limiting by count(+1)
- Get the next collection of documents, starting at the last document in the first collection, limiting by count (+1)
- Etc
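The steps above can be sketched against an in-memory sorted array. This is only a model of the technique, not CouchDB client code: `getPage` is a hypothetical helper, and it assumes the rows are already sorted by key and that keys are unique.

```javascript
// Returns one page of rows starting at startKey, plus the key the next page starts from.
// Assumes sortedRows is sorted by key and keys are unique.
function getPage(sortedRows, startKey, pageSize) {
  const from = startKey === null ? 0 : sortedRows.findIndex((r) => r.key >= startKey);
  if (from === -1) return { rows: [], nextStartKey: null };
  const fetched = sortedRows.slice(from, from + pageSize + 1); // ask for one extra row
  const hasMore = fetched.length > pageSize;
  return {
    rows: fetched.slice(0, pageSize),
    nextStartKey: hasMore ? fetched[pageSize].key : null, // the extra row starts the next page
  };
}

const allRows = ["a", "b", "c", "d", "e"].map((key) => ({ key }));
const first = getPage(allRows, null, 2);                  // a, b (next starts at c)
const second = getPage(allRows, first.nextStartKey, 2);   // c, d (next starts at e)
const third = getPage(allRows, second.nextStartKey, 2);   // e (no more pages)
console.log(first, second, third);
```

The extra row is never shown to the user; it only tells you where the next page begins and whether there is one.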
Summary
This really is just a whistle-stop tour of some basic functionality in these two systems, although it does highlight some fairly major differences between them.
Next up, some more advanced functionality will be covered, going over the differences between writing reduce functions in the two.
Monday, May 31, 2010
What are we comparing against?
One of the most oft-asked questions on Twitter, the RavenDB mailing list and other such methods of communication, is what are the differences between RavenDB and <insert currently preferred NoSql solution>.
The two main contenders are probably CouchDB and MongoDB. MongoDB in particular has been gathering a lot of momentum in the .NET space recently thanks to efforts such as NoRM.
Personally, I think that comparisons against MongoDB should stop after one question: “Is your application read or write heavy?” Comparing overall performance and functionality is completely redundant because the two pieces of software are, in my mind at least, geared for completely different scenarios. True, you can create indexes to make read operations light, but then you need to make sure that your indexes fit in memory, and so on.
I’ll stop there unless anyone complains notably, because trying to describe the differences between Couch and Mongo is a big enough task in itself, in fact you can read more about this here, and note for yourself that they really are entirely different animals.
CouchDB and RavenDB on the other hand are both geared very much towards read-heavy applications, providing up front map/reduce indexes (or materialised views) on data, scaling horizontally via replication (although Raven supports this and sharding), and they both use REST as their method of access to the data store.
There are other document databases out there, and they all do things differently, but these are the two getting the most traction from the developer eco-system that I am familiar with. Thus, I shall be comparing RavenDB against CouchDB in this series of blog entries.
So, they’re the same right?
No, not really, although to be honest I thought this was the case, aside from the major differences RavenDB brings such as transactions, sharding, LINQ-based indexes, full-text search and extensibility!
Okay, that’s a fairly big list of differences already, but the features that are additions over CouchDB are primarily candy for developer use.
Disclaimer: Until last week, I hadn’t properly touched CouchDB, and it was only at a workshop run by DevTank in London that I realised quite how different the two databases were in functionality, use and design. However, while I’m about to launch into a series of posts about this topic, I might well be wrong about a few things (due to lack of exposure), and I’ll be happy to update my blog posts and admit that I am wrong when it is pointed out to me :).
Where are we going with this?
Obviously I have next to me a massive list of differences between CouchDB and RavenDB, and it really is a big list. I’ll be splitting it out over the coming days into a series of blog entries exploring these differences in detail and giving my opinions on them.
Tuesday, May 18, 2010
The problem
When a query is executed against an index in RavenDB, one of the key aspects of that query is checking the task queue to see if any tasks are currently pending against that index. It is this call that dictates whether IsStale is set as a flag on the return result from that query.
When a call to WaitForNonStaleResults is made in the .NET client, the client simply makes multiple requests against the query until IsStale is found to be false, or until the WaitForNonStaleResults call times out. Thus, the client can wait until there are no more tasks waiting to be executed against the index.
But wait, I hear you cry, what if new tasks are added against those indexes in the meantime? Surely this means that the results will always be stale on busy servers?
The solution
Thankfully, support is baked into RavenDB to allow for this scenario, so a request can be made to retrieve up to date results as of a specified time called the “cut off”.
This is exposed in the .NET Client as alternatives to the WaitForNonStaleResults call.
    BlogEntry[] entries = documentSession.Query<BlogEntry>("BlogEntryByCategory")
        .WaitForNonStaleResultsAsOfNow(TimeSpan.FromSeconds(30))
        .Where("Category:RavenDb")
        .ToArray();
This particular version of the call will wait at the very most 30 seconds for non-stale data to be available as of the time the method was invoked. Thus, any data added after the method is invoked will not count towards whether the results count as being stale or not.
    BlogEntry[] entries = documentSession.Query<BlogEntry>("BlogEntryByCategory")
        .WaitForNonStaleResultsAsOf(DateTime.Now.Subtract(TimeSpan.FromMinutes(10)), TimeSpan.FromSeconds(30))
        .Where("Category:RavenDb")
        .ToArray();
A similar strategy has been used here, only we don’t care about anything added after about 10 minutes ago.
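The idea of a cut-off can be modelled in a few lines. This is a conceptual sketch only, and an assumption about the logic rather than RavenDB's actual implementation: `isStaleAsOf` is a made-up function, and timestamps are plain numbers.

```javascript
// An index counts as stale as of a cut-off if any task queued at or before
// the cut-off is still pending. Tasks queued after the cut-off are ignored.
function isStaleAsOf(pendingTasks, cutOff) {
  return pendingTasks.some((task) => task.queuedAt <= cutOff);
}

const pending = [{ queuedAt: 105 }, { queuedAt: 120 }];
console.log(isStaleAsOf(pending, 100)); // false: both tasks arrived after the cut-off
console.log(isStaleAsOf(pending, 110)); // true: the task queued at 105 is still pending
```

With a fixed cut-off, newly arriving tasks can no longer keep the results stale forever, which is the busy-server problem described above.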
Summary
You probably still wouldn’t use this when requesting views of your data for displaying on the front page of a website, but this can be used for processes which do care about non-stale data and are willing to wait for it.
Sunday, May 16, 2010
One of the issues I touched on with the basic interaction with RavenDB was the awkwardness of having to call SaveChanges in order to get the ids of entities that had been saved across the unit of work. This is not a problem new to the document db space, nor is it a problem new to any system where the domain has been mapped to any id-based data store (ORMs/RDBMS/etc).
I was going to cook a home brew solution specifically for my use within my projects and blog about it in order that other people could use it, but after posting my intentions in the RavenDB mailing list to create something like this, Oren suggested that making it the default behaviour and moving id generation to the Store would be a welcome move.
After posting on Twitter about this now being default, I got asked quite a few questions on what HiLo was, what the advantages were, and why it was a good thing that in the .NET client for RavenDB this was now going to be the default.
The gist
- Waiting until SaveChanges to get ids for saved entities makes writing logic against those entities troublesome
- Calling SaveChanges every time a new entity is created makes transactions troublesome
- Calling SaveChanges to get the entity id means a call across the wire just to get an entity id, which is expensive
- Simply assigning a Guid to the Id makes accessing documents via REST an unpleasant experience
- You can’t just assign a random integer, because you’d just get collisions as other clients did the same and tried to save their entities
- HiLo provides a method of creating *incremental* integer based ids for entities in a fashion that is safe in concurrent environments
The algorithm
The basic premise is that the server still controls the id generation, but effectively hands out a range of ids to each client, which the client can then hand out to objects as they are created. When the client runs out of ids, it simply requests more.
Obviously, requesting a heap of ids all at the same time would be expensive, so the idea is that the server provides a single id, a “Hi” value, which controls the creation of the range on the client (which provides the “Lo” value).
There are a number of ways this can be implemented, but the one I chose was probably the simplest, and credit goes to Tuna Toksoz for the blog entry which provided the means to implement it myself.
- The data store needs only store the latest “Hi” value, which starts at 1, and increases by 1 every time a new “Hi” value is requested by a client
- The clients all use the same number for a “Capacity”, that is – the range of numbers that each “Hi” value represents. For example 1000
- Each client requests a “Hi” value and resets their “Lo” value to 0
- Every time a new Id is requested from the generator, the Id is generated by combining the Hi and Lo numbers together:
    (currentHi - 1) * capacity + (++currentLo)
In the actual implementation, there is some locking going on around this algorithm in order to make the client generator available across threads (web requests) and avoid having to create a new generator per session (defeating the point of having one if you only create a single object in a session).
Let’s look at a sample run through, with a small capacity of “3”, to keep the sample small!
| Description | currentLoBefore | currentHi | Created Id | currentLoAfter |
|---|---|---|---|---|
| Hi Request | 0 | 1 | 1 | 1 |
| | 1 | 1 | 2 | 2 |
| | 2 | 1 | 3 | 3 (capacity) |
| Hi Request | 0 | 2 | 4 | 1 |
| | 1 | 2 | 5 | 2 |
| | 2 | 2 | 6 | 3 (capacity) |
As we can see, if all the clients are using the same capacity, and they are given different “Hi” values, then they can’t generate duplicate keys, but by and large they’ll be sequential in nature.
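The run-through above can be reproduced with a small sketch. This is an illustration in JavaScript rather than the .NET client code: `HiServer` and `HiLoGenerator` are hypothetical names, the "server" is just an in-memory counter, and there is no locking because the example is single-threaded.

```javascript
// Hypothetical server: only remembers the latest Hi value handed out.
class HiServer {
  constructor() {
    this.lastHi = 0;
  }
  // The first Hi handed out is 1, and it increases by 1 per request.
  nextHi() {
    return ++this.lastHi;
  }
}

class HiLoGenerator {
  constructor(server, capacity) {
    this.server = server;
    this.capacity = capacity;
    this.currentHi = 0;
    this.currentLo = capacity; // forces a Hi request on first use
  }
  nextId() {
    // Range exhausted: ask the server for a new Hi and reset Lo.
    if (this.currentLo >= this.capacity) {
      this.currentHi = this.server.nextHi();
      this.currentLo = 0;
    }
    return (this.currentHi - 1) * this.capacity + ++this.currentLo;
  }
}

const server = new HiServer();
const clientA = new HiLoGenerator(server, 3);
const clientB = new HiLoGenerator(server, 3);

const idsA = [clientA.nextId(), clientA.nextId()];
const idsB = [clientB.nextId(), clientB.nextId(), clientB.nextId(), clientB.nextId()];
idsA.push(clientA.nextId(), clientA.nextId());

console.log(idsA); // [ 1, 2, 3, 10 ]
console.log(idsB); // [ 4, 5, 6, 7 ]
```

Client A uses up Hi 1 (ids 1 to 3) and is later handed Hi 4 (starting at 10), while client B takes Hi 2 and Hi 3 in between, so the two never collide even though they interleave.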
The implementation in RavenDB
In RavenDB, the default function configured against the DocumentConvention is now HiLo, which means if a new document is saved against the session with its Id set to NULL, it will have an Id generated on the spot which contains the name of the document and the incremented Id. Obviously this can be overridden by changing the convention to leave the created id at some default value of your application’s choosing.
My original implementation was a bit poor, generating quite a bit of noise in the document database (it was inserting documents to get the ids), and the incremented Ids were being shared amongst objects – which meant if you created say, blogentry/1, saving a new user would mean having newuser/2.
Oren changed this to directly store a single object in the RavenDB for the generator, and to create a generator per-type – which means a lot less noise and more sensible ids being generated for each document.
What it means
What this essentially means, is if you’re using RavenDB out of the box without changing any of the conventions, documents will have a generated Id as soon as Store is called for that document. This means that SaveChanges does not have to be called until right at the very end of the Unit of Work, which means all changes can be efficiently batched in a single request and as a result applications should be easier to write and performance should be easier to maintain.
This is a .NET client specific feature and nothing was changed in the database itself to make this work.
What this does mean, is that if multiple clients from different platforms are going to be connecting to RavenDB and manipulating data, if you’re using the default HiLo implementation then a similar algorithm will need implementing for those other platforms, using the same capacity in order to prevent concurrency issues. This is not necessarily a downside, but is worth making a note of if you are going to be having this sort of set up.
What I learned
While I might contribute the odd bug fix to open source projects now and then, the idea of going in and changing the fundamental way the .NET RavenDB client worked was a bit daunting, not from a technical perspective, but from a taste perspective, as I wasn’t sure how Oren wanted things done. As he later said, he’d prefer that code which then has to change be submitted than no code at all be submitted. I’d like to raise that with anybody who wants to contribute to this project: if you’ve got a good idea then hit the mailing list and suggest it and maybe implement it. There is nothing to be lost if it’s something people want to use.
In the end, my implementation is barely visible in there, but I'm still pleased that this is in there, it makes *my* life easier :)
Wednesday, May 12, 2010
Note: The interfaces have been updated since this entry was written, and there is now Linq query support built into the .NET client, I’ve updated these posts to use the LuceneQuery syntax but that’s probably not the preferred way of doing things
There will be plenty more of these to talk about as I carry on developing this application against RavenDB, but there are a few immediate concepts that I thought would be worth writing about to do with the basic manner in which you interact with RavenDB.
DocumentSession vs DocumentStore
This is the most basic consideration:
- When do you create a DocumentStore
- When do you create a DocumentSession
The simple answer is that you create a DocumentStore on application start-up, and you create a DocumentSession for every unit of work following that.
In an ASP.NET MVC web application, this would be:
- Create a DocumentStore in Application_Start
- Create a DocumentSession on BeginRequest
- Destroy the DocumentSession on EndRequest
Creating Indices
Because every example written as a tutorial of how to use RavenDB will no doubt include index creation as a part of it, the temptation will be there to get into the habit of invoking the code to create indexes every time your application is run (Or simply forget that you started off this way and leave the code in there).
    documentStore.DatabaseCommands.PutIndex(
        "BookByTitle",
        new IndexDefinition<Book, Book>()
        {
            Map = docs => from doc in docs
                          where doc.Title != null
                          select new
                          {
                              Title = doc.Title
                          },
            Stores = { { x => x.Title, FieldStorage.Yes } }
        });
As data is added to the system or modified, RavenDB will (in its own time) run that dirty data across those indexes, and the application will use those indexes to pull the data out for display and manipulation purposes.
If an index is re-created, all of that indexed data becomes obsolete, and thus RavenDB must re-run *all* of the data in the system against that index. If your application is re-creating indexes or simply creating indexes on the fly as a regular action then performance will suffer.
The best practise is to treat these indices as a management function, something that is done once when the document database is first created – and then updated as part of maintenance/upgrades – like database changes in a traditional system (only somewhat easier!).
I have a simple script to create all the indexes in a blank, freshly created RavenDB instance so while I’m developing against the application I can start from scratch again anytime. The important thing of note is that I don’t run this every time I start the application up – just when I’ve made changes to those indexes.
I might talk about this in a future blog post as I’ve ended up with a nice structure that involves disposing of the magic strings that form the names of the indexes in RavenDB and that can’t be a bad thing.
Saving new objects
This actually goes for most operations, such as deletions and updates to objects, but saving objects is probably the most complete example of the bunch. None of this is too dissimilar to the considerations we’d apply when working against a traditional RDBMS and an ORM, but it’s worth re-iterating for those who are unfamiliar with the concepts.
Consider a simple repository for entities in our system whose interface looks something like this.
public interface IBookRepository
{
    Book Get(string id);
    void Save(Book book);
}
A sample implementation of this repository might look like this:
public class BookRepository : IBookRepository
{
    private IDocumentSession mDocumentSession;

    public BookRepository(IDocumentSession documentSession)
    {
        mDocumentSession = documentSession;
    }

    public Book Get(string id)
    {
        return mDocumentSession.Load<Book>(id);
    }

    public void Save(Book book)
    {
        mDocumentSession.Store(book);
    }
}
Ignoring the rest of the repository, there are decisions to be made at this point about what the Save method should actually do.
Consider a basic use of the repository like so:
public void PublishBook(Book book)
{
    mRepository.Save(book);
    mEventInvoker.RaiseEvent(new BookPublishedEvent(book.Id));
}
Ignoring the obvious (like this publish method isn’t actually publishing a book!), our problem here is that the created book does not yet have an Id because we haven’t called SaveChanges yet, and yet we’re attempting to use this Id as the argument for another action in our application.
The proposed fix? Change the repository so we call SaveChanges of course!
public void Save(Book book)
{
    mDocumentSession.Store(book);
    mDocumentSession.SaveChanges();
}
That appears to have fixed the problem, but in actual fact if we were using IDocumentSession to control our unit of work, calling SaveChanges just broke that because all the changes (including others made across the rest of the system) were just flushed across to the server.
We can fix that by wrapping our whole unit of work inside of a TransactionScope (which RavenDB respects), but we’ve still got one problem we need to be aware of:
foreach (Book book in booksToCreate)
{
    mRepository.Save(book);
}
Now we’re saving a collection of books, let’s say there are 100 of them – that’s 100 calls to SaveChanges, which is 100 calls across the wire, and 100 calls to ‘whatever RavenDB does when you push an object to RavenDB’ (It’s expensive okay?).
That’s not to say you can’t use this hammer to solve the problem, but you should think about it and do what makes sense in your application.
- You could still add more interfaces/methods specifically for batch operations, and still call SaveChanges at that level
- You could use your own client-side key generation code (RavenDB allows this) – and perhaps adopt something like HiLo against the Type of the document – thus negating the need to call SaveChanges at all until everything has been done that needs doing
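The first option in the list above could be sketched roughly as follows. To keep the example self-contained it uses a hypothetical `FakeSession` that only counts round trips; it is a stand-in for illustration and not the real `IDocumentSession`, and `BatchingRepository.SaveAll` is likewise a made-up name.

```csharp
using System.Collections.Generic;

// Stand-in for a document session that only counts round trips.
// This is NOT the RavenDB IDocumentSession, just a hypothetical fake.
public class FakeSession
{
    private readonly List<object> mPending = new List<object>();

    public int RoundTrips { get; private set; }
    public int Saved { get; private set; }

    public void Store(object entity)
    {
        mPending.Add(entity); // Nothing leaves the client yet.
    }

    public void SaveChanges()
    {
        RoundTrips++;             // One trip across the wire per call.
        Saved += mPending.Count;  // Everything pending goes in that one trip.
        mPending.Clear();
    }
}

public class BatchingRepository
{
    private readonly FakeSession mSession;

    public BatchingRepository(FakeSession session)
    {
        mSession = session;
    }

    // Batch-level method: store everything first, then flush once.
    public void SaveAll(IEnumerable<object> entities)
    {
        foreach (object entity in entities)
        {
            mSession.Store(entity);
        }
        mSession.SaveChanges();
    }
}
```

Saving 100 books through `SaveAll` costs a single call to `SaveChanges` rather than 100, at the price of a second method on the repository for callers to choose between.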
I’m probably going to experiment with the second option and write a blog entry once I’ve worked out what it is I want to achieve.
Update: I have since written a HiLo generator, and Oren has integrated it so that HiLo is now the default key generator for RavenDB. This means a call to SaveChanges is no longer needed in order to get the id for an item, so this point is now almost irrelevant unless you override this behaviour to use keys generated by the server.
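For anybody unfamiliar with HiLo, here is a minimal sketch of the general technique. It is not the generator that went into RavenDB: `HiLoKeyGenerator`, the `fetchNextHi` delegate (standing in for whatever hands out the next "hi" value, normally something stored on the server) and the key format are all assumptions made for illustration.

```csharp
using System;

// A minimal HiLo sketch with hypothetical names; not RavenDB's implementation.
public class HiLoKeyGenerator
{
    private readonly object mLock = new object();
    private readonly Func<long> mFetchNextHi;
    private readonly long mCapacity;
    private long mCurrentHi;
    private long mCurrentLo;

    public HiLoKeyGenerator(Func<long> fetchNextHi, long capacity)
    {
        mFetchNextHi = fetchNextHi;
        mCapacity = capacity;
        mCurrentLo = capacity; // Forces a fetch of the first "hi" on first use.
    }

    public long NextId()
    {
        lock (mLock)
        {
            if (mCurrentLo >= mCapacity)
            {
                // The only step that needs to talk to the shared source.
                mCurrentHi = mFetchNextHi();
                mCurrentLo = 0;
            }

            // Each "hi" value owns a block of mCapacity consecutive ids.
            long id = (mCurrentHi - 1) * mCapacity + mCurrentLo + 1;
            mCurrentLo++;
            return id;
        }
    }

    // Builds a key such as "books/5" from a prefix and the next id.
    public string NextKey(string prefix)
    {
        return prefix + "/" + NextId();
    }
}
```

The point of the design is that ids are handed out locally, and the shared source is only consulted once per block of `capacity` ids, so an entity can be given its key before anything is flushed to the server.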
Stale Data
Let’s say we have a top level page on our website which displays the top 20 books by popularity in a certain category. The following query is executed
Book[] categoryBooks = documentSession.LuceneQuery<Book>("BookByCategory")
    .WaitForNonStaleResults()
    .Where(String.Format("Category:{0}", category))
    .Take(20)
    .OrderBy("Popularity").ToArray();
The temptation is there to always use that call to WaitForNonStaleResults because most demo code will do this as a matter of course (because invoking this will deterministically say “give me back the results I expect for this demo”).
The problem is, WaitForNonStaleResults will do exactly what it says, it will wait until the results coming back are no longer stale – which means your page request will hang, which means you won’t have a responsive application – and the whole point of using a database like RavenDB is that you want the application to be responsive!
There is a good reason that WaitForNonStaleResults is not the default – consider when you start writing it what it is you actually want. In this example, it really doesn’t matter if the data being displayed on this high traffic top level page is a bit out of date, and the call simply is not needed.
Paging
Let’s say there are 100,000 books in the document store and we invoke the following code:
Book[] books = documentSession.LuceneQuery<Book>()
    .ToArray();
How many books do you expect there to be in that collection? 100,000? If 100,000 objects were returned into that collection, how long would it take? What would you be doing to those 100,000 objects? How much memory would it take to hold them all at once like that? Yeah, it’s unlikely that you’d ever write the above code in your production application, because bringing back all the objects is rarely what the developer actually intends.
Thankfully RavenDB safeguards against this kind of sloppy code and automatically limits the number of results returned back. Both the .NET client and server have this behaviour built into them and this means you’ll only get (at the moment), 128 objects coming back for the above query. This is equally true for all types of queries, including queries against indices with where clauses and orderings and everything else you might want to put in a query.
Currently the server itself will only let you page 1024 objects at one time, so you can’t be lazy and make a call to Take(100000) because it won’t let you. I’ve actually got an extension method which *does* bring back *all* the objects for testing purposes, but I’ll leave that one out of this blog entry for fear of people actually using it!
Just be aware that paging is there to help you and don’t be surprised when you don’t get all the documents back when doing a blanket query. Use paging properly!
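As a sketch of what using paging properly might look like, the helper below walks a result set one page at a time instead of asking for everything at once. It is deliberately not tied to the RavenDB client: `fetchPage(skip, take)` is a hypothetical hook standing in for a paged query, and `Paging.ReadPages` is a made-up name.

```csharp
using System;
using System.Collections.Generic;

public static class Paging
{
    // fetchPage(skip, take) is a hypothetical hook standing in for a paged
    // query against the store. It is assumed to return fewer than "take"
    // items once the end of the results has been reached.
    public static IEnumerable<IList<T>> ReadPages<T>(
        Func<int, int, IList<T>> fetchPage, int pageSize)
    {
        int skip = 0;
        while (true)
        {
            IList<T> page = fetchPage(skip, pageSize);
            if (page.Count > 0)
            {
                yield return page; // Hand one page at a time to the caller.
            }
            if (page.Count < pageSize)
            {
                yield break; // Short page: nothing left to read.
            }
            skip += page.Count;
        }
    }
}
```

Because the pages are produced lazily, the caller only ever holds one page in memory, and it can stop enumerating early without fetching the rest.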
Sunday, May 09, 2010
Note: The interfaces have been updated since this entry was written, and there is now Linq query support built into the .NET client, I’ve updated these posts to use the LuceneQuery syntax but that’s probably not the preferred way of doing things
As I mentioned in a brief entry a couple of days ago, I've been playing with RavenDB for about a week now, and mapping across an old project of mine which never got off the ground due to work and time constraints.
I spent a lot of time trying to get that project to play ball inside a relational database, and while I reached some satisfactory conclusions, it rather felt like I was trying to play ball with an anchor.
I had always resolved to come back to the project when I had more time, and perhaps to write some of the more complicated reporting aspects of the project against something more appropriate (think Solr or Lucene), and with the announcement of RavenDB I was motivated to at least play around with it again.
I have a few posts lined up on some of the more complicated usages of RavenDB I've encountered thus far, but first I want to go over the basic structure of 'how to use RavenDB' from the perspective of somebody using the .NET Client API (Ignoring the underlying HTTP requests for now)
Getting Started
The first thing you need to do is grab the source and build the binaries, as far as I can see you can't get hold of any pre-built ones yet. This is probably a good thing because if you're writing code against RavenDB at this stage you'll want to be updating constantly.
Edit: Builds are now available from the build server found here
Anyway, get over to Github and pull from there using your preferred tool (or just download as a zip!)
http://github.com/ravendb/ravendb
RavenDB is a VS2010 project, which means unless you have VS2010 you're not going to be able to just open it up and build it in Visual Studio - happily there are some build scripts but I'm not going to go into detail on how to build RavenDB, there are plenty of instructions out there for such things elsewhere.
Once you've built RavenDB, the important binaries to look at are:
* Raven.Client: This is what your application will be referencing to talk to RavenDB
* Raven.Server: This is what you can run to create a standalone RavenDB server
For now, grab the contents of the built Raven.Client bin directory, create a console application and reference the lot of them.
Now you have a choice, you can launch the Server and get a nice web interface for managing your indices and viewing your data, or you can just run it embedded in your application. Choosing between the two is the difference between the following two lines of code:
using(var documentStore = new DocumentStore() { Url = "http://localhost:8080"}) {
or
using (var documentStore = new DocumentStore() { DataDirectory = "Data" }) {
If you opt for running the server, then you need to go to Raven.Server and run it (you might need to run it elevated, as for me it falls over if I don't).
Now, my basic program looks something like this:
class Program
{
    static void Main(string[] args)
    {
        using (var documentStore = new DocumentStore() { Url = "http://localhost:8080" })
        {
            documentStore.Initialise();
            using (var documentSession = documentStore.OpenSession())
            {
            }
        }
    }
}
Just to add some context to this, in a web application you'd create the document store on application start up, and then per request/unit of work you'd request a document session and keep that around for the lifetime of that request/unit of work.
The session controls unit of work, and controls some important tasks such as
1) Tracking loaded entities + Changes to those entities
2) Exposing methods to query/load/save to the document store
Saving Documents to the Store
No set up is required to store anything to RavenDB (it *is* a document database). By default, however, the conventions will look for an "Id" property on any object you try to store, so this is worth bearing in mind.
Here is a basic entity:
public class BasicEntity
{
    public string Id { get; set; }
    public string SomeData { get; set; }
    public string SomeOtherData { get; set; }
}
This can be dumped to the store with a simple call to documentSession.Store
BasicEntity entity = new BasicEntity()
{
    SomeData = "Hello World",
    SomeOtherData = "This is just another property",
};
documentSession.Store(entity);
However, a few things of note are
1) This has not actually gone to the server yet
2) The document still hasn't got an Id, don't try to do anything with that property yet
RavenDB will batch up changes to the store until SaveChanges is called, and only then will documents be given ids and be saved to the server. A call to SaveChanges is atomic, and this is one of the ways RavenDB gives us some basic transaction support.
documentSession.SaveChanges();
Retrieving + Modifying a Document
I mentioned that the document session was transactional and could keep track of loaded documents and changes to those documents. This is a feature that makes the .NET client library a pleasure to work with because you can do the following:
// Load the entity by id
BasicEntity loadedEntity = documentSession.Load<BasicEntity>("SomeId");
// Modify the entity
loadedEntity.SomeData = "Greetings from Ohio";
// Flush any changes made to any entities to the store
documentSession.SaveChanges();
What this means in essence, is that if you've got a nice structured application, your documents can be modified by the application without having to worry about how they are persisted. This is something we're used to with NHibernate and it's good to see some of these concepts appearing in a document database client.
Querying for Documents
An important feature of anything we store data in, is the ability to query the store for 'views' or indeed the actual entities themselves.
With NHibernate and other ORMs we've gotten used to simply executing ad-hoc queries against the database, and while you can do that with some document databases, that's not really what RavenDB is designed for.
In order to query documents in RavenDB it is necessary to create an index across the properties of the documents you wish to query. This is done up front and exists in the database. A few things of note:
1) Documents are processed against those indexes when they are added (eventually)
2) Queries taking place against those indexes are therefore cheap (relative to say, doing an ad-hoc query)
In my application, I create all of my indexes up front as part of my 'database creation script' (actually, they're the only part of my database creation script, because there is so little setup involved). There is nothing to stop you doing it at any point when the application is running though.
Indexes exist as Linq queries against the documents in the store, and are either defined as the strings that will be sent up to the server and stored as they are written, or defined as linq queries that will be converted *into* strings and stored on the server.
The downside to using the actual linq queries, is that the indexes on the server won't look exactly as you wrote them, but the upside is you get type safety and intellisense. I'm currently choosing to use the strongly typed linq queries because I don't mind how they look on the server, just so long as they work. I write tests for all of my indexes and queries so I know they're cool.
The recommended practice is still currently to define your indexes separately to the application, as strings in the Web UI.
There are two major components of each index, the "Map" query and the "Reduce" query. I'll not go into detail on what this means, because once again you can get this information across the internet, and Oren has written a very good visual explanation of what Map/Reduce looks like in Raven here:
In order to do a query, we need at the very least to create a Map telling Raven which fields we want indexing for our queries. This means we only index the fields that are relevant to our query and keep things small and (hopefully) more efficient.
Indexes are created against the document store (not the session), and the syntax for that looks something like this:
documentStore.DatabaseCommands.PutIndex(
    "BasicEntityBySomeData",
    new IndexDefinition<BasicEntity, BasicEntity>()
    {
        Map = docs => from doc in docs
                      where doc.SomeData != null
                      select new
                      {
                          SomeData = doc.SomeData
                      },
    });
* "BasicEntityBySomeData" is the unique identifier of the index we have created
* The linq query is run against "docs" which is (effectively) a collection of all the documents in the database (not just BasicEntities).
* SomeData is now a field that is being indexed with the name SomeData
There is nothing to stop you from indexing every field of the document in a single index and just using that index across all of your queries, but that would probably be unwise. The whole point of creating a map is you are limiting the data you are indexing to just the data you want to search on.
You can write *almost* any code you want in the linq statement, as on the server it will be converted into a proper linq query and executed as a function across the documents.
To use this index, we invoke the LuceneQuery method on the document session, specifying the name of the index we wish to use and a Where clause (which is effectively a Lucene query) against that index.
BasicEntity[] documents = documentSession.LuceneQuery<BasicEntity>("BasicEntityBySomeData")
    .Where("SomeData:Hello~")
    .WaitForNonStaleResults()
    .ToArray();
This will return a collection of documents where "SomeData" contains some text that looks like "Hello". Clearly there is some scope here for yet more strongly typed usefulness, but once again I have tests for all of my queries so it's not presenting a problem in this area.
The WaitForNonStaleResults call means that the query will wait a (default) amount of time for the documents to finish indexing before returning data (or timing out). The use of this kind of call should be thought about carefully, as the whole point of the document database is that it's "eventually consistent", and you don't always *need* the most up to date result possible. (For example: Displaying a list of documents on the front page of your website)
Retrieving only the data you need
This is all very well and good, but you're storing entire documents and sometimes you only want small portions of those documents. This is of course possible too.
I've defined a simple projection of the BasicEntity containing a single property like so:
public class SomeDataProjection
{
    public string SomeData { get; set; }
}
In order to get the value from the index rather than fetching the entire document from the store, we need to modify the index slightly:
documentStore.DatabaseCommands.PutIndex(
    "BasicEntityBySomeData",
    new IndexDefinition<BasicEntity, BasicEntity>()
    {
        Map = docs => from doc in docs
                      where doc.SomeData != null
                      select new
                      {
                          SomeData = doc.SomeData
                      },
        Stores = { { x => x.SomeData, FieldStorage.Yes } }
    });
The Stores entry tells RavenDB to store the value in the index so it can be easily retrieved using the following query.
SomeDataProjection[] projections = documentSession.LuceneQuery<BasicEntity>("BasicEntityBySomeData")
    .WaitForNonStaleResults()
    .SelectFields<SomeDataProjection>("SomeData")
    .ToArray();
This will mean only the data you want is transmitted across the wire, which should make the query much more performant. Note: You can only pull back fields that have been stored using the Stores facility on the index.
Reporting on your data
So that's round tripping to and from the data store, but in the real world you soon need to be able to perform more complex queries across your data.
This is a contrived example, but how about summing up the total lengths of all the strings stored in all the SomeData properties across the document store?
Let's add a new property to the Entity called "Category" so we can get all the lengths by category:
public class BasicEntity
{
    public string Id { get; set; }
    public string Category { get; set; }
    public string SomeData { get; set; }
}
And let's add a load of entities to the document store like so:
documentSession.Store(new BasicEntity()
{
    Id = "Document1",
    Category = "One",
    SomeData = "Text"
});
documentSession.Store(new BasicEntity()
{
    Id = "Document2",
    Category = "Two",
    SomeData = "More text"
});
documentSession.Store(new BasicEntity()
{
    Id = "Document3",
    Category = "One",
    SomeData = "And more"
});
documentSession.SaveChanges();
What we want to do, is index the *length* of the strings stored in the document, and index the category, before reducing the query across category to get the total lengths. If you don't understand what I mean by that, then go and read about Map/Reduce on Oren's blog linked above!
documentStore.DatabaseCommands.PutIndex(
    "BasicEntityCountSomeDataLengthByCategory",
    new IndexDefinition<BasicEntity, CategoryDataCountResult>()
    {
        Map = docs => from doc in docs
                      where doc.SomeData != null
                      select new
                      {
                          Category = doc.Category,
                          SomeDataLength = doc.SomeData.Length
                      },
        Reduce = results => from result in results
                            group result by result.Category into g
                            select new
                            {
                                Category = g.Key,
                                SomeDataLength = g.Sum(x => x.SomeDataLength)
                            }
    });
It's as simple as that, this now means I can execute the query:
CategoryDataCountResult[] counts = documentSession.LuceneQuery<BasicEntity>("BasicEntityCountSomeDataLengthByCategory")
    .WaitForNonStaleResults()
    .SelectFields<CategoryDataCountResult>("SomeDataLength", "Category")
    .ToArray();
And that will give me the results as expected:
One: 12
Two: 9
The beautiful thing about this is that the result was pretty much calculated when the documents were added, so reading the data out was a really cheap operation - think about the cost of doing this in T-SQL :)
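To sanity check the numbers without a server, the same map and reduce steps can be run in memory with plain LINQ to Objects. This is only a sketch: it says nothing about how RavenDB executes the index, `MapReduceSketch` is a made-up name, and the shape of `CategoryDataCountResult` is an assumption (its definition isn't shown above) that simply mirrors the fields produced by the reduce.

```csharp
using System.Collections.Generic;
using System.Linq;

public class BasicEntity
{
    public string Id { get; set; }
    public string Category { get; set; }
    public string SomeData { get; set; }
}

// Assumed shape of the result type, mirroring the output of the reduce.
public class CategoryDataCountResult
{
    public string Category { get; set; }
    public int SomeDataLength { get; set; }
}

public static class MapReduceSketch
{
    public static List<CategoryDataCountResult> Run(IEnumerable<BasicEntity> docs)
    {
        // Map: one entry per document holding its category and string length.
        var mapped = from doc in docs
                     where doc.SomeData != null
                     select new CategoryDataCountResult
                     {
                         Category = doc.Category,
                         SomeDataLength = doc.SomeData.Length
                     };

        // Reduce: group the mapped entries by category and sum the lengths.
        var reduced = from result in mapped
                      group result by result.Category into g
                      orderby g.Key
                      select new CategoryDataCountResult
                      {
                          Category = g.Key,
                          SomeDataLength = g.Sum(x => x.SomeDataLength)
                      };

        return reduced.ToList();
    }
}
```

Feeding in the three documents stored earlier gives 12 for category "One" ("Text" is 4 characters and "And more" is 8) and 9 for category "Two" ("More text"), which matches the results above.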
Summary
This was quite a lengthy blog post to cover some of the simple features of RavenDB. I'll start getting into more detail in my next post, where I'll cover some more complicated reporting queries/indexes and talk about the excellent Web interface that Oren has created as part of this project.
My two cents
My experiences with the project so far have been mostly positive. I've found a few issues, but Oren has been *very* fast to fix them and publish fixes to Github (and these have, as he says, been "edge cases" which most people won't come into contact with when playing with RavenDB).
It's definitely worth a gander, I see this project taking off in the .NET space as it matures.
Saturday, May 08, 2010
This week I decided to pick up Ayende's latest project - RavenDB and have a go at building an application against it.
I haven't really had chance to play with any of the latest batch of document databases, and I figured I'd find this 'newer' project more interesting than any of the well established crowd.
I'm going to do a few blog posts on the subject as I go through, but as all series need an introduction I thought I'd share some of my initial thoughts on my first steps onto this project.
- RavenDB is very new, and there are features that you can see *will* be there, but you have to work around them for now if you want to use RavenDB! That said, my first post to the mailing list was met with a response of "I've done that and it will be in tomorrow". I get the feeling Ayende is working overtime on this project.
- The Web interface for managing RavenDB is just *amazing*, very smooth work
- RavenDB has a few things that I wasn't expecting from a document database:
- Transactions (both Unit of Work and across multiple requests)
- Unit of Work, and the .NET client tracks loaded entities and changes to those entities
To the people who are complaining that Ayende is wasting his time on "yet another DocDB", from just the top two things alone I'd say "wait and see", I get the feeling that this project has a lot of potential.
One thing I really like so far is the ability to write Map/Reduce functions as Linq queries attached to an Index. The only oddity here is that (currently) you have to write these as strings in the .NET client (because they'll be sent across the wire to the server). They are then compiled into actual Linq queries and executed against the objects to create indexes on the relevant parts of your documents.
You can't use all the code you'd like to inside those linq queries. I've already run into problems trying to nest lambda expressions inside of them because of the way they're built on the server (there is a load of expression parsing going on, as well as code generation against the queries).
Creating all those indexes up front feels a lot like writing stored procedures against a traditional RDBMS, only with the benefit that all the hard work will be done on write, and reading will be cheap. I don't actually mind being up front about it, but it has meant I've had to write a few scripts to "initialize" the RavenDB on creation (for integration testing as well as deployment).
I'm actually using RavenDB as my primary data store in my test project, but I probably wouldn't if I was building a big application. NoSQL doesn't mean not using SQL, it means not only using SQL, and there are still a few things that I'd prefer to have stuck behind NHibernate and in a traditional database.
I really like that I can just host RavenDB inside my project without running any external server, and changing my code so that RavenDB runs on a proper external server is a trivial task. So nifty.
My first blog post on the subject will probably deal with the process of creating a first project against RavenDB along with some of the current gotchas that will probably cease to be as the project becomes more mature. I'll then move onto some more complex map/reduce scenarios and talk a bit about how I'm exposing the data store to my application.