Several people on Twitter pointed to Kenn Ejima’s Thoughts on Redis, calling it a “good read” and so on. I’m going to take a contrarian view. The article does contain some insight, but also a whole lot of misunderstanding and vitriol. The first few paragraphs, at least, read like something Michael Stonebraker or Dennis Forbes might have written. OK, maybe that was a bit too harsh, but “anti-SQL fanboys” and “nothing really inherent in the NoSQL technologies” are clearly cries for attention so I’ll give Kenn some. I’m going to use the email/Usenet second-person style just to avoid jarring inconsistency between “Kenn says” in some places and “[generic] you should” in others, but I hope people will realize that it’s still meant to be an open conversation.
the relational algebra has a lot more mathematical implications in practice than the CAP theorem
Perhaps so, but does that make the CAP theorem invalid or inapplicable? Of course not. It’s not either/or. You don’t have to understand just one or another. If you’re building a modern database system, you need to understand both. Dismissing either one out of hand is the mark of an immature developer.
I first looked at Cassandra and MongoDB, and got an impression that they were already over-featured enough to obfuscate the true focus on what kind of problems they were trying to solve.
Cassandra, MongoDB, and Redis? That’s a very small sample (though a very fashionable one), and there’s a danger in extrapolating from small samples. That’s probably where some of the later ridiculous statements about “most NoSQL databases” come from. If the only two alternatives you looked at seemed too complex, then first, that’s because you don’t even understand the problems they’re trying to solve, and second, you need to enlarge your sample. If CouchDB and Riak are also too complex for you to understand, try looking at Voldemort or Kumofs or Membase, which have simpler data models.
First, scaling horizontally has little to do with the database engine itself – creating a transparent, consistent hash function is the easiest part. The hard part is choosing a good namespace for keys – you still need to organize keys in some ways, and from time to time, you need to “migrate” keys on refactoring. And when you are set with a solid naming structure, you are able to choose whatever database, including RDBMS, by redirecting (or proxying) requests based on the same partitioning logic.
Consistent hashing is indeed the easy part. The hard part is managing replicas, ensuring that they remain available even during (inevitable) partitions and reach a consistent state after, and that the whole is done efficiently. That’s what the “cache invalidation” Hard Thing is really about, but it extends beyond transient caches to permanent replicas as well. Automatic rebalancing and proxying and such are part and parcel of what those “too complicated” options do for you, because doing those things in ad hoc ways becomes completely untenable for large long-lived systems.
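To make concrete just how small a piece of the puzzle the hash function is, here is a minimal sketch of a consistent-hash ring in Python. The class and method names are my own illustration, not from any particular system; everything hard – replica placement, rebalancing, failure handling – sits outside this code, which is exactly the point.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so every client maps keys the same way.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Bare-bones consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node gets `vnodes` points on the ring so load stays
        # roughly even and adding a node remaps only a small slice of keys.
        self._ring = sorted(
            (_hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise of the key's hash (wrapping at the end).
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]
```

Twenty-odd lines, and that’s the “easiest part” done. Everything the “too complicated” systems spend their code on starts after this.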
Second, by scaling horizontally, you only get performance gain in O(N), at the cost of decreased MTBF. If you want to double the memory, you need to double the machines in the cluster. In the computer science terminology, an O(N) algorithm is considered “naive”, and in the computer security terminology, it even has a name – “brute force”.
An algorithm is only “naive” or “brute force” if the problem itself is known to have lower complexity than the solution, but that’s not even the worst thing about your appeal to computational complexity. Big-O notation is normally used to plot number of operations vs. number of entities, either to characterize latency (if the operations are serial) or throughput (if they’re parallel). O(N) behavior is only bad in that context – linear speedup is actually a very good result when looking at aggregate performance – and none of the systems you’re criticizing have that kind of O(N) behavior.
As for decreased MTBF, that’s also misleading. Yes, the mean time between component failures does decrease as you add machines, but thanks to replication the mean time to system failure (i.e. data loss) in these systems is much longer than for any single system. If you’re worried about failures, you need to worry about both durability and replication, and systems that have both shouldn’t be dismissed as “over-featured” or “obfuscated” compared to a system that lacks the second. That complexity exists for a reason, even if you don’t understand it.
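A back-of-envelope calculation shows why component failures and system failures move in opposite directions. The numbers here are hypothetical and assume independent failures, but the shape of the result is what matters:

```python
# Hypothetical, independent-failure arithmetic: more nodes mean more frequent
# *component* failures, but losing data requires every replica of an item to
# fail within the same repair window, which replication makes vanishingly rare.

p = 0.001   # chance a given node fails during one repair window (assumed)
N = 100     # nodes in the cluster
R = 3       # replicas per item

expected_component_failures = N * p   # 0.1 per window: 100x a single node
p_item_loss = p ** R                  # all 3 replicas down at once: ~1e-9
```

You see failures a hundred times as often, and lose data roughly a million times less often. That trade is the whole reason the replication machinery exists.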
I don’t think it’s reasonable to always expect exponential increase in the capital fund – I think adding 1000 servers every week should be considered as a necessary evil, not as a goal.
If you need extra capacity, you need to get it somehow. You can scale up or scale out. Having implemented both kinds of systems, I can say that they’re based on much of the same technology, except that in the scale-up case the connections and algorithms are often proprietary so there’s a huge price premium. Where is the system that can even scale up to 1000x the power of a modern commodity-based 1U server? They exist, I’ve actually worked on them, but the pricing makes them economically feasible only for special purposes which require a completely different CPU/memory/interconnect and performance/reliability balance than exists outside national labs and large universities. For everyone else, scale-out gets you exactly the same or superior (for your purposes) functionality far more economically.
It’s not horizontal scalability at all that interested me in NoSQL.
Here, after several paragraphs of pointless bashing, we get to the real nub of the matter: you don’t care about scalability, thus you’ve never invested the time necessary to understand it, and so you play “sour grapes” by dismissing it as a goal.
By saying “write to disk” it really means: fork a child process for background processing, serialize all data on memory, write it to a temporary file in the background and rename the file atomically to the real one upon finish. Even though the overhead of fork is zero in theory when the OS supports copy-on-write, you still need to turn on the overcommit_memory setting if you want to deal with a dataset more than 1/2 of RAM, which contradicts our habit to battle against OOM killer as a database administrator.
It’s a good habit, and one that should not be broken in this case. If all you’re using this system for is “global variables on steroids” then half of RAM should be more than enough. Also, you’ll probably run out of capacity to handle the requests to all of those global variables before you run out of RAM, and you damn well shouldn’t have a single non-replicated non-sharded system as a SPOF for any application that big. The half-of-RAM thing is only an implementation artifact anyway. Later on you point out that Redis snapshot files are only about a tenth the size of the data in memory. If you’re willing to sacrifice some disk-space efficiency for the sake of memory efficiency, you could write out the snapshot incrementally and only compress/dedup within a chunk instead of globally. With a different snapshot implementation you should be able to use 90% or more of your memory for live data, and basing a design critique on such implementation artifacts is invalid.
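For readers who haven’t seen it spelled out, the fork/temp-file/rename dance quoted above looks roughly like this. This is a simplified POSIX-only sketch with made-up function names, not Redis’s actual implementation (which serializes its own binary format, not JSON):

```python
import json
import os
import tempfile

def snapshot(data: dict, path: str) -> None:
    """Fork-and-rename snapshot in the style described above (POSIX only).

    The child inherits a copy-on-write view of `data`, serializes it to a
    temp file in the same directory, then atomically renames it over the old
    snapshot. A real server would not waitpid() here; it would keep serving
    requests while the child runs.
    """
    pid = os.fork()
    if pid != 0:                       # parent: just reap the child
        os.waitpid(pid, 0)
        return
    try:
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())       # bytes must hit disk before the rename
        os.rename(tmp, path)           # atomic on POSIX: readers see old or new
    finally:
        os._exit(0)                    # never fall back into the parent's code
```

The copy-on-write pages are what drive the overcommit requirement: in the worst case every page is dirtied while the child runs, so the kernel has to be willing to promise two copies of the whole dataset.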
First, with AOF, Redis has introduced the possible bottleneck that RDBMS have been suffering from.
Yes, and that bottleneck will always be there for any non-distributed store, no matter how well it’s implemented. That, along with the need for redundancy to ensure availability of all that data, is why those “too complicated” alternatives exist. Are you starting to see, by having these needs placed into the context of your own comments, why I consider your dismissal of them premature?
Finally, I’m going to formulate the 1:10 problem – that is, the ability or inability to handle 10GB of dataset with 1GB of memory.
Even 10GB of memory is nothing nowadays, when even commodity systems have ten times as much and terabyte systems are likely within a year. That might seem like an irrelevant point, that you can just bump the numbers up and have the ensuing argument remain valid, except for what I said above about how big your “global variables on steroids” system needs to be. Even the very largest application shouldn’t need 100GB worth of high-throughput high-semantic-level shared global variables. That’s bad not only from a system-design standpoint, but also from a basic-software-engineering “don’t rely on shared data so damn much” standpoint as well. You need a backing store for durability, sure, but if you need it for capacity in this scenario then You’re Doing It Wrong and you’ll fail anyway because you’ll run out of access to that capacity before you run out of capacity itself.
Usually it is suggested to avoid reinventing yet another virtual memory layer at the application level, but Redis has already crossed the line.
It’s a good line. Reinventing virtual memory is uniformly a bad idea, leading to leaky abstractions and unmanaged contention for the resources that both the I/O and VM systems need. Been there, done that, again. Step back from that line, already.
For people who don’t really grok what’s been said in this post (maybe because it was just too long to read), my recommended setup is: “Use Redis for small datasets that don’t grow fast (stay far less than 1GB). Have at least 2x memory than the dataset. Use default snapshotting and disable AOF.”
OK, enough. You get the point. I’d rephrase the above as “Use Redis for small datasets (less than 50GB this year) that don’t need to be highly available, have memory at least 2x your actual dataset (until the snapshot implementation improves), use frequent snapshotting or AOF (depending on your need for performance vs. durability – not both) and always avoid overcommit.” I also have nothing against Redis; it’s a fine tool for what it does, but I think its durability story is a bit confused and its reinvented VM can only serve a need that it’s not good for anyway. As always, the real answer is to use multiple data stores to serve multiple needs, with careful consideration of the tradeoffs each represents.