I was recently called in to do an emergency consult at a new client because their RavenDB instance was in a bit of a pickle. Anybody they might have wanted to call in was already busy, so I got the call ;-)
Ruh oh!
My immediate assumption on hearing that their secondary was full of documents that weren't on the primary was that they were actually running a primary/primary set-up by accident, but in actual fact it was slightly more involved than that.
RavenDB replication is set up by telling a server that it has a replication destination; that is, a primary is told about the secondary and instructed to push documents over there when it gets the chance. It uses etags both to determine which documents need to go over and to detect conflicts, creating multiple versions of a conflicted document.
The difference between primary/primary and primary/secondary is simply whether you set up both servers with a replication destination or just one of them.
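To make the "replication destination" idea concrete, here is a minimal toy model in Python. It is not RavenDB's implementation or API: the `Server` class, its integer etags and the `replicate` method are all hypothetical names invented for this sketch. It only shows that a server pushes documents newer than the last etag it sent, and that a server with no destination never pushes anything back.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy server (hypothetical): stores documents and pushes them to destinations."""
    name: str
    docs: dict = field(default_factory=dict)          # key -> (etag, value)
    destinations: list = field(default_factory=list)  # servers this one pushes to
    next_etag: int = 1                                # local, ever-increasing counter
    sent_up_to: dict = field(default_factory=dict)    # destination name -> last etag pushed

    def put(self, key, value):
        # Every local write is stamped with a fresh, higher etag.
        self.docs[key] = (self.next_etag, value)
        self.next_etag += 1

    def replicate(self):
        # Push every document whose etag is newer than the last one sent.
        for dest in self.destinations:
            last = self.sent_up_to.get(dest.name, 0)
            pending = sorted(
                (etag, key, value)
                for key, (etag, value) in self.docs.items()
                if etag > last
            )
            for etag, key, value in pending:
                dest.put(key, value)
                self.sent_up_to[dest.name] = etag


# Primary/secondary: only the primary is given a destination.
primary = Server("primary")
secondary = Server("secondary")
primary.destinations.append(secondary)

primary.put("users/1", "alice")
primary.replicate()              # users/1 now exists on the secondary

secondary.put("users/2", "bob")  # a write that lands on the secondary
secondary.replicate()            # no destinations, so nothing goes back
```

In this sketch `users/2` stays stranded on the secondary, which is the shape of the problem described above. Giving the secondary a destination as well would make it primary/primary, at which point the toy would also need conflict handling.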
So far so good
This is quite a sensible set-up, and a very reasonable one: you don't know what caused that failure, you don't want the failure to repeat on the secondary, and diminished functionality is still better than no functionality at all.
A primary/secondary set-up is conceptually a lot easier to deal with than a primary/primary or cluster set-up because you never have to deal with conflicts. Conflicts in a lot of state-based systems are painful to deal with, so if we can avoid dealing with them then we should.
RavenDB also has the option to allow writes to secondary and this is where the fun begins.
What we're actually saying here is that if you are writing to the secondary on failure, what you really have is a primary/primary with the wrong name.
It's left in userland to determine what to do here, we could:
The essence of this, though, is that if we're going to allow writes to the secondary during failure, then we need some form of conflict resolution set up, because these servers don't really have a primary/secondary relationship.
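A rough sketch of why conflicts appear once both servers accept writes, again as a toy Python model rather than RavenDB's real mechanism. Here each version of a document carries a hypothetical `history` list of `(server, etag)` pairs; if neither version's history contains the other's, the two were edited independently and both have to be kept.

```python
def is_ancestor(older, newer):
    """True if the `older` history is a prefix of `newer`, i.e. newer has seen older."""
    return newer[:len(older)] == older


def merge(local, incoming):
    """Return the versions to keep when an incoming replicated version arrives."""
    if is_ancestor(local["history"], incoming["history"]):
        return [incoming]       # incoming builds on local: plain update
    if is_ancestor(incoming["history"], local["history"]):
        return [local]          # local already includes incoming: ignore it
    return [local, incoming]    # diverged: a conflict, keep both versions


base = {"value": "v1", "history": [("primary", 1)]}

# The primary keeps taking writes, then fails; the client writes to the secondary.
on_primary = {"value": "v2", "history": [("primary", 1), ("primary", 2)]}
on_secondary = {"value": "v3", "history": [("primary", 1), ("secondary", 1)]}

update = merge(base, on_primary)            # a normal update, one version kept
conflict = merge(on_primary, on_secondary)  # diverged, two versions kept
```

The second call returns two versions, and something (a person or a resolver) has to pick between them.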
We have two choices, we can
Their internal consumer always wants to be able to write, so we opted for the latter, and the easiest approach was to write a "last write wins" conflict resolver. That's not always advisable, but in this case there were few side effects from adopting such a position.
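The idea behind "last write wins" fits in a few lines. This is a minimal Python sketch of the policy only, not the resolver that was written for the client and not RavenDB's plugin API; the `last_modified` field and the `last_write_wins` function are hypothetical names.

```python
from datetime import datetime, timezone


def last_write_wins(versions):
    """Pick the conflicting version with the most recent last-modified time."""
    if not versions:
        raise ValueError("no versions to resolve")
    # On an exact tie, max() keeps the first version it encountered.
    return max(versions, key=lambda v: v["last_modified"])


conflicted = [
    {"value": "written to primary",
     "last_modified": datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)},
    {"value": "written to secondary",
     "last_modified": datetime(2020, 1, 1, 12, 5, tzinfo=timezone.utc)},
]

winner = last_write_wins(conflicted)
```

Two things to keep in mind with this policy: the losing write is silently discarded, and the comparison is only as good as the clocks on the two servers.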
You should think about your topology and what you actually want to support when setting this up. This is usually a business decision, as it revolves around what level of availability the business needs to do its job.
2020 © Rob Ashton. ALL Rights Reserved.