NoSQL Live Trip Report

Here’s an edited version of the trip report I sent out. More commentary below.

There’s no better way to feel old than to attend a conference on a technology that’s currently at the top of its hype cycle. There might have been a handful of others there – out of almost 300 – who were my age or older, but there were more who were half that. Overall interest/excitement level seemed high, but probably not as high as some of the previous NoSQL conferences. In one of the early presentations, the presenter claimed that “NoSQL” as a brand was probably close to the peak of the Gartner hype cycle; based on the number of NoSQL-backlash articles I’ve seen I would have said we’re already heading for the “trough of disillusionment” and what I saw at the conference confirmed that. People are beginning to think about how they’ll use NoSQL in actual live production, and seeing some weaknesses. More about that in a bit, but first some quick notes.

  • “No-sequel” is much more common than “no-S-Q-L” but that battle will probably continue forever.
  • There was surprisingly little talk about whether NoSQL means NO SQL or Not Only SQL, whether it’s more to do with SQL or ACID, etc. I guess people have (finally) decided those matters are settled.
  • The Azure Storage guy wasn’t getting a lot of love. Even MySQL seemed to be more favorably viewed, with more than one person saying if it works for you then use it.
  • The Gear6/memcached guy wasn’t getting much love either. Most people seemed to think memcache doesn’t qualify as NoSQL (and they’re right IMO). A surprising number mentioned using it in combination with NoSQL to absorb reads, particularly for a more write-optimized store like Cassandra, but none liked it.
  • Numbers: Twitter over 9B rows, StumbleUpon over 12B (36 hours to load).
  • “Object Relational Mappers are the Viet Nam of computer science” — Jonathan Ellis
  • “The relationship between SQL and document databases is the same as between static typing and duck typing” — Paul Davis

One recurring theme was raising the semantic level of NoSQL stores. Secondary indices and complex range/boolean queries came up again and
again, with vendors who support them (e.g. Couch or Mongo) quick to mention it and others (e.g. Cassandra or Riak) eager to talk about roadmaps. Everyone seems to be catching the meme at the same time; just a couple of weeks ago I had occasion to write about some of these issues, and during my research I found that hardly anyone on most of the projects seemed to be taking them very seriously. Nobody likes transactions, nobody likes joins, but there seems to be growing acceptance that something beyond get/set/delete would be nice. Even Voldemort, which generally has the most bare-bones API, is going to add atomic queue operations.

The “NoSQL in the Cloud” panel was particularly interesting. Jonathan Ellis, of Rackspace and Cassandra, made the controversial statement that cloud and NoSQL do *not* go well together, because if you’re small enough to be running your apps in the cloud then you’re not big enough to need the scale of NoSQL. This was closely related to the issue of multi-tenancy. There were three people there who have actually run hosted database services – Rackspace and Heroku with MySQL, Cloudant with CouchDB. They agreed that it was very difficult, especially with respect to performance isolation and preventing one careless customer from effectively destroying others’ performance. Without solving that problem, which none of the projects seem interested in doing any time soon, deployment as a shared service is likely to be far less common than deployment within users’ own VMs. Also relevant is Ellis’s comment that a cloud colocated with managed servers has significant appeal to many users.

For me, the best part was just getting to meet in person a bunch of folks I’d only met online before, to listen to all the conversations and hear where people’s thinking was wrt various issues, etc. There’s still a lot to be learned from things like tone and body language, for example watching everybody (or some important subset) nod or cringe when certain statements are made or questions asked, etc. That stuff just doesn’t come across in IRC/email/blogs/Twitter. It’s a high-bandwidth blast of information, both verbal and otherwise, coming from multiple directions at once. It’s a shame that I had to miss the Jillian’s party, but dinner at Coda was a blast. Throughout the day I learned a lot about who’s doing what, about the internals of various codebases, and even about the Large Hadron Collider. Thanks to the folks at 10gen/Cloudant/Hashrocket for setting this up.

NoSQL and Data Models

I’ve been thinking lately about putting a version of this blog into a system that uses a NoSQL database as a backing store instead of MySQL. Why? Certainly not because I expect to need the scalability. As of right now I have 1559 posts (including drafts) and 4708 comments (including spam). I get less than 4000 page views even on a busy day, and there’s no concurrency in the system to speak of. Mostly my reason is the same as it was for CassFS: I don’t think it’s possible to understand a technology until you’ve actually tried to use it, and understanding NoSQL data stores is important to me. Neither exercise makes me an expert by any means, but even if my level of expertise is 3 compared to someone else’s 100 it’s still infinitely more than the zero of any noxious pundit who hasn’t even tried to think about how they’d use NoSQL in a real project. Nothing forces a balanced evaluation of pros and cons like actual use. Read on to see where this mini-odyssey led.

CAP and Leases

One of the few good things to come out of Jim Starkey’s ill-fated attempt to brush aside Brewer’s CAP Theorem with some made-up definitions and MVCC hand-waves was that it afforded an opportunity for people to think and talk about issues of availability, partition tolerance, etc. In his latest message (note the new title – apparently anything that doesn’t draw positive attention becomes boring) Jim constructs a strawman that I’m sure is very familiar to folks who’ve been involved in CAP discussions before.

That slop isn’t good enough for a database system. If somebody in Tokyo
and somebody in New York can each withdraw the last $1M from an account,
there’s going to be an unhappy banker who’s going to think that getting
the right answer trumps a funny definition of availability.

Yawn. There’s nothing about CAP that says all systems need to be AP, and only a very few misguided folks even remotely suggest it. If strong consistency is a requirement, you still get to choose between CA and CP. Do you make the system completely unavailable for some, or allow some subset of requests to hang for anyone who makes them? Most people who’ve worked on high-availability systems really hate having an ill-defined and unpredictable subset of requests hang, so they tend to choose some form of quorum rule. That’s CP according to these definitions, which is confusing but hardly matters; the question and the tradeoffs remain the same.
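To make that quorum tradeoff concrete, here’s a minimal sketch (hypothetical helpers, not from any particular system): with N replicas, choosing read quorum R and write quorum W such that R + W > N guarantees that any read overlaps the most recent successful write, and a partition that leaves fewer than W replicas reachable makes writes fail fast instead of hanging.

```python
def quorum_write(reachable_replicas, w_quorum, value):
    """Write to a quorum of replicas, or fail fast under partition."""
    if len(reachable_replicas) < w_quorum:
        raise RuntimeError("partition: write quorum unavailable")
    for replica in reachable_replicas[:w_quorum]:
        replica["value"] = value

def quorum_read(reachable_replicas, r_quorum):
    """Read from a quorum of replicas; caller resolves the newest value."""
    if len(reachable_replicas) < r_quorum:
        raise RuntimeError("partition: read quorum unavailable")
    return [replica["value"] for replica in reachable_replicas[:r_quorum]]

# Usage: N = 5, W = 3, R = 3 (3 + 3 > 5, so reads always see the write).
replicas = [{"value": None} for _ in range(5)]
quorum_write(replicas, 3, "x=1")          # succeeds
try:
    quorum_write(replicas[:2], 3, "x=2")  # only 2 reachable: fails fast
except RuntimeError as err:
    print(err)
```

The point of the fail-fast behavior is exactly the one above: a bounded, well-defined failure for everyone rather than an unpredictable hang for an unlucky few.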

That brings us to leases. Because they expire in bounded time, leases turn a hang (which breaks partition tolerance) into a finite delay (which preserves it). Combined with MVCC, it’s easy to imagine a system which appears to preserve consistency as well without having to sacrifice availability by forcing any nodes down. Wait, isn’t that the Holy Grail that CAP tells us is impossible? (This is where I half-expect Starkey to say that’s what he was proposing, even though he never got that far in his reasoning. I’ve worked with many DEC alumni of approximately the same age who had the same obsessive credit-stealing behavior; it must have been hell for the truly creative people who worked there at the time.) To see why this idea doesn’t quite work, let’s flesh it out a little. To make a change, you start a transaction. You take out leases on anything you plan to change, and because it’s MVCC all changes you make are private until you commit. When you commit, you check all of your leases; if any have expired, then you abort and retry. That’s all pretty obvious to anyone skilled in the art, right?
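The scheme just described can be sketched in a few lines (a toy illustration with invented names, not real database code): writes go into a private buffer until commit (the MVCC part), each touched key acquires a lease, and commit succeeds only if every lease is still live; otherwise the transaction aborts and the caller retries.

```python
import time

class Lease:
    """A lease that expires after a fixed duration."""
    def __init__(self, duration):
        self.expiry = time.monotonic() + duration

    def expired(self):
        return time.monotonic() > self.expiry

class Transaction:
    """Lease-plus-MVCC sketch: changes stay private until commit,
    and commit aborts if any lease expired in the meantime."""
    def __init__(self, store, lease_duration=5.0):
        self.store = store
        self.private = {}   # MVCC: writes are invisible until commit
        self.leases = {}
        self.lease_duration = lease_duration

    def write(self, key, value):
        self.leases.setdefault(key, Lease(self.lease_duration))
        self.private[key] = value

    def commit(self):
        # Check every lease; if any expired, abort so the caller can retry.
        if any(lease.expired() for lease in self.leases.values()):
            return False
        self.store.update(self.private)  # publish the private versions
        return True
```

Note what this sketch does *not* solve: it says nothing about how anyone else learns that the commit happened, which is exactly the hole discussed next.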

The first problem that occurs to me is the lack of forward-progress guarantees and the potential for deadlock, but the real problem is the same one I pointed out to Jim in the original thread: how does a transaction terminate? How does everyone agree that it has terminated, so that they can start a dependent transaction without introducing inconsistency? Leases ensure that everyone capable of accurate timekeeping (including you) will know that your lease expired, but how will they know whether it did so before or after you committed? What if you committed and then immediately became isolated? Where’s the newly committed data? If you’re the only one who has it, then someone might hang waiting for the data instead of the lock and we’re not partition-tolerant any more. If one or more copies of the data must be somewhere else, then you need to provide an unambiguous and simultaneous signal to everyone involved that the new version you wrote is now the committed version. You can go around and around on acknowledgements and counter-acknowledgements and commit protocols until you realize that you need something like Paxos.

The problem with Paxos is that it’s very heavyweight for a widely distributed system. The time for such an algorithm to run is effectively proportional not to the normal message latency but to the maximum time that a node can go to sleep and then continue with its old state intact. In a local environment you can have many redundant connections per node and mostly need to worry about such delays (e.g. caused by an OS glitch) on the order of a few seconds, even for a very sick system. In a wide-area environment – which is where CAP is meant to apply – nodes are likely to have few connections and phenomena such as routing instability can leave them stranded for minutes. A multi-round protocol potentially involving multiple minutes per round just doesn’t preserve partition tolerance in any meaningful sense. Consensus itself is a conceptual sibling of consistency, so you might as well say that the C in CAP stands for Consensus. If you have to wait for consensus, then you have to sacrifice one of the other properties and we’re back to having the same problem we started with. (Hmmm. Global + Agreement + Timely = GAT Theorem? Or TAG?)

I’m not saying that you can’t use leases to create a “good enough” compromise between C and A and P, leading to some extremely useful systems. I love leases. They were the subject of my first patent, which is how I ran into some of their limitations head-on and can write about them now. You can also make that “good enough” compromise using many other mechanisms, without refuting CAP. In the end, as many already knew, it’s about tradeoffs. It’s about knowing they’re necessary, and thinking about them, and choosing the right ones for your needs.

Mommy, Where Do Standards Come From?

Today I got drawn into a discussion on Sutter’s Mill about standards. The topic at hand was the ISO C++ standard and whether it qualifies as open, but it gave me a chance to read and think a bit more about other issues as well. Let’s start with a little background on how ISO works. Here are some of the more significant details.

ISO standards are developed by technical committees (subcommittees or project committees) comprising experts from the industrial, technical and business sectors which have asked for the standards, and which subsequently put them to use. These experts may be joined by representatives of government agencies, testing laboratories, consumer associations, non-governmental organizations and academic circles.

Proposals to establish new technical committees are submitted to all ISO national member bodies

Experts participate as national delegations, chosen by the ISO national member body for the country concerned

In other words, there is no true individual involvement in the ISO standards process. A standard is initiated by a national standards body – such as ANSI in the US – which also decides who can be a delegate. Oh yeah, and they pay for it too. According to ISO in figures, member bodies pay 140 million Swiss francs per year to run the committees that do the hands-on work of developing standards, vs. 35M CHF for the Central Secretariat itself. 55% of that 35M – not of the entire 175M – also comes from the member organizations. The other 45%, or a mere 9% of the total, comes from publications and other services. This is why I said, in Herb’s thread, that the claim of ISO needing to charge $30 for a copy of the C++ standard is utter crap. They don’t need that fee to support their operations any more than IETF – which publishes their standards completely for free – does. They charge that fee for completely other reasons. This brings us back to the process.
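Here’s the arithmetic behind that claim, using the figures quoted above from ISO in figures:

```python
# Funding figures (millions of Swiss francs per year) as quoted above.
member_committee_funding = 140.0  # member bodies fund the technical work
central_secretariat = 35.0        # Central Secretariat operating budget
total = member_committee_funding + central_secretariat  # 175 MCHF

from_members = 0.55 * central_secretariat  # members' share of the 35M
from_sales = 0.45 * central_secretariat    # publications and other services
share_of_total = from_sales / total        # sales income vs. the whole operation

print(round(from_sales, 2), round(100 * share_of_total, 1))  # 15.75 9.0
```

So publication sales cover about 15.75M CHF, roughly 9% of the total 175M: hardly the financial backbone of the operation.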

ISO refers to theirs as an open process, and ANSI follows suit, but in fact it’s a very political process. You can’t walk in as an individual, even the most knowledgeable individual in the world regarding a particular subject, and participate. You have to know or work for someone, who knows someone, and so on. Corporations feed people and proposals to industry or trade associations, which feed them into the national standards bodies, which feed them into ISO. There’s also politics at a higher level, as proposed ISO standards are often national standards somewhere already. Sometimes there are multiple competing national standards involved. Sometimes there are multiple technical bodies within ISO, or other bodies outside ISO (such as ITU), wrangling over who controls a standard. In short, it’s a mess. In many cases, users’ needs would be just as well satisfied by allowing standards to remain within their originating national body (as with ANSI C) or industry/trade association (as with Java). Often the only people driving demand for an ISO standard are the very people who get paid to develop the ISO standard, or who hope to profit from it. What a racket.

This brings us back to the question of whether ISO standards are open. Participation is political, because of the strategic value standards have to the various parties at all levels, and thus in effect limited even though it’s nominally open. The output is only available for a fee that is totally unjustified by the cost structure that’s involved, and then subject to copyright. “Open” is the wrong word for that. On the other hand, the fact that the standards are published at all is an improvement over proprietary “de facto standards” and the freedom from licensing fees is an improvement over that. “Available” and “unencumbered” cover those properties, but “open” to me still means something more like the IETF model. A standard is only open if participation does not require sponsorship, and if its text is available for only nominal cost.

Getting Silly About NoSQL

(This is a cut-and-paste of a comment I made on Dennis Forbes’s Getting Real about NoSQL and the SQL-Isn’t-Scalable Lie – found via MyNoSQL – in which Dennis responds to others’ pro-NoSQL silliness by engaging in some silliness of his own.)

I was going to try to be polite, until I saw your slam against All Things Distributed as a NoSQL advocacy site. What rubbish. Vogels and company know more about scalability than just about anyone, and more about using the right tool for the right job – which is why they provide RDS as well as SimpleDB. Even if that weren’t the case, what you’re engaging in is mere ad hominem. Vogels’s definition of scalability is right or wrong on the merits, not based on who he is or what other opinions you attribute to him. Might as well dismiss all of *your* definitions and claims based on your being an RDBMS advocate. As it happens, I was making Oracle scale before there was a NoSQL, before there was even RAC (it was OPS back then), and from then until now I’ve always used a very similar definition of scalability: maintaining a ratio of work done to resources used. For you to offer a different definition *is* the same sort of self-serving wordplay which you criticize in others.

Second point: equating “highly interrelated” with “relational” doesn’t do justice to either, and characterizing social-media workloads as “largely unrelated islands of data” is just laughable. Friend relationships and recommendations and such create a much *higher* level of data linkage at social-media sites than at banks which are your other example.

Third point: your notions of I/O performance are way off. At the low end of the scale, you claim only 30MB/s for some of the larger instance types. I’ve personally measured 3x that on smaller instance types, and also observed that EC2 is notably bad in terms of disk I/O relative to other kinds of performance. I’ve written about that, as have others such as Randy Bias at Cloudscaling. If you want to get a real feel for I/O performance, you’d be better off measuring some of the more powerful instances at Rackspace or GoGrid, or even better for current purposes would be direct measurement of RDS at Amazon.

At the high end of the scale, you’re also off. Single-machine I/O capabilities might be enormous by your standards, but not by mine or by those of anyone who has worked with modern storage systems. 700MB/s for an ultra-pricey FusionIO card? Try 40GB/s. I’ve personally developed code to do that, and it wasn’t on a single system even at a million-dollar price point. It was using exactly the kind of horizontal scaling techniques that you seem to think are unnecessary.

Fourth point: saying that RDBMSes are scalable, because you can (a) throw a ton of money at Oracle/Sybase or (b) do most of the work sharding and replicating something like MySQL, is a bit disingenuous. By the same reasoning, SMP is also scalable because you can pay SGI/3Leaf a ton of money or write your own SDSM layer. It doesn’t really work that way. The better NoSQL solutions – certainly not all of them – are inherently scalable in the sense that they scale in human terms as well as machine terms. In other words, they allow you to add capacity without spending tons of money *or* chewing up months of developer/administrator time. Administering a hundred servers is very nearly as easy as administering five. That’s a definite advantage over any RDBMS clustering I’ve ever seen.

Fifth and last point: defining scalability in terms of “highest realistic level of usage” and “maintaining acceptable service level” is also a big pile of weasel words. Just because most users won’t need the scale of a Facebook or a LinkedIn doesn’t mean they shouldn’t choose solutions that scale well. Such solutions are also likely to be more cost-effective at smaller scale, and some small percentage of users will eventually scale up to the point where the RDBMS cost/benefit curve levels off or falls. “Plan for success” is a perfectly valid business principle. More importantly, one of the whole points of the NoSQL “movement” is that people should think hard about what “acceptable service level” means to them in terms of performance, in terms of CAP, and so on. A solution that far exceeds the necessary service level in some terms (e.g. consistency) but exacts a cost for it is not an ideal solution.

You’re absolutely right that many NoSQL advocates have made immoderate and even ridiculous statements. Of those you link to, I’ve taken a few to task myself. Every time a new technology catches on, no matter how legitimately, it will attract a few loud know-nothings. I invite you to take them on and correct some of their misstatements, but becoming their mirror image only tends to validate their extreme opposition to RDBMS traditionalism. Both types of systems, plus more that we haven’t even discussed, have their advantages and drawbacks.

NOTE at 4:49pm: it appears that Dennis is being very heavy-handed about moderating comments. Considering how controversial the topic is, and all of the links I’ve seen, I don’t think I’m going out on a limb too much by saying that there have probably been more than three comments since 9am this morning. That’s all that have passed moderation, though, including two that were posted after mine (so he’s clearly not just working through a long queue). I’m not sure whether that’s better or worse than lobbing bombs from a blog that doesn’t even pretend to allow comments, but it’s right down there on the low end of the scale either way.