One of the technical ideas I’ve been noodling around with recently has to do with language support for high availability. There are generally two approaches to HA nowadays:

  • Replicate everything, including computation, so that if one resource fails its partner resource can continue practically unaware. This has the obvious disadvantages of increasing hardware cost and/or reducing performance. It also creates an “impedance mismatch” between the replicated parts of the system and the non-replicated parts (or external systems), which will usually receive each request/response twice but might receive it only once (or perhaps more than twice) if a failure occurs.
  • Replicate just enough state so that, when a failure occurs, someone else can look at that state and figure out how to pick up where the failed component left off.

Most of the systems I’ve worked on have used the second approach. The problem is that deciding what constitutes “just enough state” is difficult. Even worse, it’s easy to change how the software works in the normal case without changing what state gets replicated, and you might not know that you’ve broken fault recovery until you actually try to recover from a fault and realize that you don’t have the state you need to do so properly. One way to make this less likely is to let programmers simply declare what needs to be replicated and have the system take care of the rest, instead of requiring several manual steps every time a single variable has to change in an “HA-safe” way.

The basic idea of what I’m proposing is to add a “replicated” keyword to a programming language. It could be just about any programming language, or it could be done as a naming convention instead of a keyword for ease of implementation. This keyword could apply to any global-scope variable or function, or to a struct/union/class type. Using a tool somewhat similar to my stack ripper, every modification of a replicated variable/member, and every call of a replicated function/method, is rewritten to also append an update to a replication list. Periodically, the application must call some sort of “commit” function to ensure that the replication list is flushed to another node, which will take over in case of failure. Some process on that other node generally sits in a semi-idle mode, collecting replicated changes, until a failure occurs.
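Here’s a rough sketch of how that might look, using Python attribute interception as a stand-in for the source-rewriting tool. Everything in it is an assumption made up for illustration: the names `ReplicationLog`, `Replicated`, `Backup` and `Session` are hypothetical, and the “other node” is just an in-process object rather than something on the far end of a network.

```python
class ReplicationLog:
    """Hypothetical replication list: collects updates until commit()."""

    def __init__(self, send):
        self._pending = []
        self._send = send  # callable that delivers a batch to the backup node

    def record(self, object_id, field, value):
        self._pending.append((object_id, field, value))

    def commit(self):
        # Flush everything recorded since the last commit as one batch.
        batch, self._pending = self._pending, []
        if batch:
            self._send(batch)


class Replicated:
    """Stand-in for marking a class with the proposed 'replicated' keyword."""

    def __init__(self, object_id, log):
        # Bypass __setattr__ so the bookkeeping fields are not replicated.
        object.__setattr__(self, "_object_id", object_id)
        object.__setattr__(self, "_log", log)

    def __setattr__(self, name, value):
        # Do the normal assignment, then note it on the replication list.
        object.__setattr__(self, name, value)
        self._log.record(self._object_id, name, value)


class Backup:
    """Semi-idle process on the other node that collects replicated changes."""

    def __init__(self):
        self.state = {}

    def receive(self, batch):
        for object_id, field, value in batch:
            self.state.setdefault(object_id, {})[field] = value


class Session(Replicated):
    def __init__(self, object_id, log, user):
        super().__init__(object_id, log)
        self.user = user
        self.requests_served = 0


backup = Backup()
log = ReplicationLog(backup.receive)
session = Session("session-1", log, "alice")
session.requests_served += 1

before_commit = dict(backup.state)  # nothing has reached the backup yet
log.commit()                        # now the backup sees the updates
```

Note that `Session` contains no replication code of its own. Adding another member to it would be picked up automatically, which is the whole point of the proposal.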

The rationale for including functions is that some functions represent hidden state that wouldn’t get replicated just by tagging variables. For example, it might be desirable to have the backup node replicate the active node’s database connections. Each such connection represents state not only on the active node, but also on the database server. Therefore, the correct approach would be not to replicate the connection handle, but to replicate the function that opens the connection.
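To make the function case concrete, here’s a minimal sketch in which the call is what gets logged and the backup replays it to build its own connection. The `replicated_call` decorator, the `Node` class and the `FakeDatabaseServer` are all hypothetical names invented for this example. A real implementation would also ship the log across the network instead of handing it over directly.

```python
import functools
import itertools


class FakeDatabaseServer:
    """Toy server that hands out a fresh handle for every connection."""

    def __init__(self):
        self._next_handle = itertools.count(100)
        self.open_handles = {}

    def connect(self, client, database):
        handle = next(self._next_handle)
        self.open_handles[handle] = (client, database)
        return handle


def replicated_call(func):
    """Stand-in for the 'replicated' keyword applied to a function."""

    @functools.wraps(func)
    def wrapper(node, *args):
        # Log the call itself, not its result; skip logging during replay.
        if not node.replaying:
            node.call_log.append((func.__name__, args))
        return func(node, *args)

    return wrapper


class Node:
    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.handles = {}      # database name -> this node's own handle
        self.call_log = []     # replicated calls made on this node
        self.replaying = False

    @replicated_call
    def open_connection(self, database):
        self.handles[database] = self.server.connect(self.name, database)

    def replay(self, calls):
        # The backup re-executes each logged call to create its own state.
        self.replaying = True
        try:
            for method_name, args in calls:
                getattr(self, method_name)(*args)
        finally:
            self.replaying = False


server = FakeDatabaseServer()
active = Node("active", server)
backup = Node("backup", server)

active.open_connection("orders")
backup.replay(active.call_log)
```

After the replay both nodes hold a connection to `orders`, and the server knows about both, but the handles themselves differ from node to node.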

The database-connection example also highlights another subtlety. The active and backup nodes are likely to get different handles back from the server for their database connections. Therefore, replication needs to occur at a semantic level; the replication list must use (internally maintained) IDs for replicated objects and variables, not their addresses or values. This requirement also imposes certain limits on the replicated parts of a program. It would be extremely dangerous, for example, to rely on pointers between replicated and non-replicated objects remaining valid after a failover. Object-ID management could get pretty complicated if it has to identify statically as well as dynamically allocated replicated objects. Also, a lesser version of the “impedance mismatch” mentioned above would still exist. Requests/replies might get repeated after a failure, but that won’t be the common case; since such repetition only has to be handled correctly, not efficiently, the problem is much easier to solve than in the “replicate everything” approach.
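The ID-based part of this can be sketched as follows. Each node keeps its own table from replication IDs to local objects, and updates on the replication list name only the ID. The `ObjectRegistry` class, the `Account` class and the `(object_id, field, value)` update format are assumptions chosen for illustration, not a worked-out design.

```python
import itertools


class ObjectRegistry:
    """Maps stable replication IDs to whatever local object a node holds."""

    def __init__(self):
        self._by_id = {}
        self._ids = itertools.count(1)

    def register(self, obj, object_id=None):
        # The active node allocates IDs; the backup reuses the IDs it is told.
        if object_id is None:
            object_id = next(self._ids)
        self._by_id[object_id] = obj
        return object_id

    def lookup(self, object_id):
        return self._by_id[object_id]


class Account:
    def __init__(self, balance=0):
        self.balance = balance


def apply_update(registry, update):
    # Resolve the ID locally; no address from the other node is ever used.
    object_id, field, value = update
    setattr(registry.lookup(object_id), field, value)


# Active node: create an object, give it an ID, and change it.
active_registry = ObjectRegistry()
active_account = Account()
account_id = active_registry.register(active_account)
active_account.balance = 250
update = (account_id, "balance", 250)  # what goes on the replication list

# Backup node: a different object at a different address, same ID.
backup_registry = ObjectRegistry()
backup_account = Account()
backup_registry.register(backup_account, object_id=account_id)
apply_update(backup_registry, update)
```

The two `Account` objects are never the same object and never share an address, yet the update lands on the right one because both sides agree on the ID.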

Despite all of these caveats and limitations, I think such a facility would be very nice to have. If I add a new reference to a member (or even a whole new member) in a class already marked as replicated, I wouldn’t have to do anything to maintain HA functionality, whereas without this I would have to add explicit replication code for both the active and backup nodes. That can add up to a lot of code devoted entirely to recovery, plus extra replication clutter everywhere else. Any time a tool can help me make my code both cleaner and more robust, I’ll jump at the opportunity.