I’ve been thinking lately about putting a version of this blog into a system that uses a NoSQL database as a backing store instead of MySQL. Why? Certainly not because I expect to need the scalability. As of right now I have 1559 posts (including drafts) and 4708 comments (including spam). I get fewer than 4,000 page views even on a busy day, and there’s no concurrency in the system to speak of. Mostly my reason is the same as it was for CassFS: I don’t think it’s possible to understand a technology until you’ve actually tried to use it, and understanding NoSQL data stores is important to me. Neither exercise makes me an expert by any means, but even if my level of expertise is 3 compared to someone else’s 100, it’s still infinitely more than the zero of any noxious pundit who hasn’t even tried to think about how they’d use NoSQL in a real project. Nothing forces a balanced evaluation of pros and cons like actual use. Read on to see where this mini-odyssey led.

Before committing, even in my own mind, to such a project, I decided to take a look at various blog software with an eye toward how difficult it would be. I believe in doing my homework, but I’m no masochist. The software had to have a fairly robust feature set, including: comments (internal, not via Disqus or similar); categories; an RSS feed; import from WordPress; and an admin panel for composing new entries and so on. I looked at a few Python and Ruby options, but pickings seemed slim, so I turned to Open Source CMS for ideas even though it’s pretty much limited to PHP stuff. I took several of the better-rated and/or better-looking packages, and started by counting how many SQL statements I’d have to rewrite. In some cases it was several hundred. No thanks. I eventually sort of settled on Dotclear, which looks pretty decent and had an SQL-statement count of just over a hundred. I’m not wild about using Wiki markup instead of HTML, but it’s survivable. Then I started thinking about how I would actually do the translation from SQL to NoSQL (probably Cassandra). To begin, I considered some of the standard queries, especially the one to retrieve the most recent N posts for the blog’s main page. It would generally look something like this:

SELECT * FROM posts WHERE status = 'posted' ORDER BY date DESC LIMIT 10;
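Against a store that offers only get/put/delete plus some kind of scan, that one statement turns into a full pass over the data, with the WHERE, ORDER BY, and LIMIT all done client-side. Here’s a rough Python sketch of what I mean; the store object and its scan() method are hypothetical stand-ins, not any particular product’s API.

from operator import itemgetter

def recent_posts(store, n=10):
    # Emulate: SELECT * FROM posts WHERE status = 'posted'
    #          ORDER BY date DESC LIMIT n
    matches = []
    for key, row in store.scan('posts'):    # reads EVERY row in the column family
        if row.get('status') == 'posted':   # the WHERE clause, by hand
            matches.append(row)
    matches.sort(key=itemgetter('date'), reverse=True)  # ORDER BY, by hand
    return matches[:n]                      # LIMIT, by hand

Reading every post ever written just to display ten of them is exactly the cost an index is supposed to eliminate.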

A few of the more RDBMS-like NoSQL stores will handle the WHERE part for you, though in general you’ll have to do the filtering yourself. Similarly, some options will make the ORDER BY part easier, but with most you’re on your own. The crux of the matter is that I think you should be able to do both of those, plus the LIMIT, with something short of a full table scan, and the obvious way to do that is with an index. That’s where my head was when I read Royans Tharakan’s Cassandra: inverted index. (Royans has a lot of good things to say about this stuff, and certainly draws much prettier pictures than I do, but bring a grain of salt; some of his ideas, e.g. about the operative definitions of CAP, seem distinctly off-base to me.) In any case, he points out what I’d consider the first problem with maintaining an index in a NoSQL data store.

Every time you insert one row of data, you would also have to insert multiple rows to build the inverted index. You would also have to update the inverted index yourself if any of the column values are updated or deleted. Cassandra 0.5.0 which was recently released has been benchmarked to insert about 10000 rows per second on a 4 core server with 2GB of RAM. If you have an average of 5 columns per row, that is about 1.5k actual row inserts per second (that includes 5 rows of inserts/updates required for an inverted index).
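In code, the write amplification he’s describing looks roughly like this. Again, this is a hypothetical sketch against the same bare store interface as above, with the column names and index layout invented for illustration.

def add_post(store, post_id, post):
    # The one insert we actually wanted...
    store.put('posts', post_id, post)
    # ...plus one index row per indexed column, all maintained by hand.
    for column in ('status', 'author', 'date'):
        index_key = '%s=%s' % (column, post[column])
        ids = store.get('posts_index', index_key) or []  # read...
        ids.append(post_id)                              # ...extend...
        store.put('posts_index', index_key, ids)         # ...write back
    # Updates and deletes also have to find and remove the stale entries.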

Everything he says is true but, as I pointed out in a comment, it can get even worse than that.

The problem of updating an inverted index is much worse than merely an extra update per new cell. The first extra bit of pain is dealing with concurrent updates; a simple read-extend-write of the index now falls prey to the classic problem of N-1/N concurrent updates being lost. The next bit of pain is dealing with a really big index. If you have a few thousand rows, let alone a billion, updates of a simple index are going to be extremely inefficient. Now you’ll have to maintain your index as a B-tree or some such using multiple rows. Now you get to the most painful part of all: concurrent multi-row updates.
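To make the first of those problems concrete: the read-extend-write in the sketch above is exactly where concurrent updates get lost, and closing that window takes some kind of conditional write. In the following, check_and_put is a hypothetical compare-and-swap primitive – not something Cassandra 0.5 actually offers – that succeeds only if the row is still in the state you last read.

def index_append(store, index_key, post_id):
    # Two writers can both read the same list here; whichever one
    # writes second silently erases the other's entry.
    ids = store.get('posts_index', index_key) or []
    store.put('posts_index', index_key, ids + [post_id])

def index_append_safe(store, index_key, post_id):
    # Optimistic retry: the write succeeds only if nobody has changed
    # the row since we read it; otherwise reread and try again.
    while True:
        old = store.get('posts_index', index_key) or []
        if store.check_and_put('posts_index', index_key, old, old + [post_id]):
            return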

For my own current use case, with few rows and no concurrency, these issues don’t matter. In the more general case, though, they can become very important. I spent a while thinking about how to deal with them, but the real point here is that one should not try to facilitate such queries so much as avoid them. In some cases avoidance is not feasible, and I do think the problem of supporting wait-free concurrent billion-row indices in a non-ACID, non-SQL environment is well worth solving, but I already have a job for this year. ;) For now, the question is how to support the kinds of operations that blog software needs, without an index but also without queries that result in full table (column family, whatever) scans.

The answer, to put it simply, is to change the data model. Instead of trying to keep the data in a highly generic form and use queries to turn it into what we want, we anticipate the kinds of results we’ll need and update those results when we write instead of when we read, even if that means introducing some duplication and thus some potential for inconsistency. The technical term for this is denormalization. In the blog context, what this means is that we should actually store a “last N posts” item and update it whenever we add a post. Now the process of displaying the last N posts involves not a single very expensive query but N+1 simple gets – the first to retrieve the last-N list, and the remainder to fetch the posts themselves individually. This involves slightly more code, but much less work. It also makes updates more complex – a read-modify-write instead of a simple insert. This is where data stores that allow complex operations – such as Redis – really shine compared to their contemporaries that offer only get/put/delete, especially when concurrency is an issue (which it basically isn’t in my case).
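To make that concrete in the Redis case, the whole “last N posts” structure maps onto one native list, with the write path maintaining it and the read path doing the N+1 gets. Here’s a minimal sketch using redis-py; the key names and the JSON encoding are my own choices, not anything Redis mandates.

import json
import redis

r = redis.Redis(decode_responses=True)
N = 10

def add_post(post_id, post):
    # Write path: store the post itself, then maintain the
    # denormalized last-N list at the same time.
    r.set('post:%s' % post_id, json.dumps(post))
    r.lpush('recent:posts', post_id)   # newest id goes on the front
    r.ltrim('recent:posts', 0, N - 1)  # keep only the newest N ids

def recent_posts():
    # Read path: one get for the id list, then one get per post.
    ids = r.lrange('recent:posts', 0, N - 1)
    return [json.loads(r.get('post:%s' % pid)) for pid in ids]

Because LPUSH and LTRIM each execute atomically on the server, concurrent writers never hit the read-extend-write race from the index examples above; on a store offering only get/put/delete, maintaining the same list would put you right back in retry-loop territory.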

The upshot for the NoSQL-blog project is that it won’t be sufficient just to rewrite the SQL into directly equivalent operations on a NoSQL store. That’s simply the wrong way to think about this kind of transition, and while I could get away with it for the simple case of a small blog, the Right Thing To Do is still to change how the blog code works at a slightly higher semantic level. For a larger-scale system that’s even more true, and the semantic level at which changes should be made is likely to be even higher. As I’ve mentioned before, when comparing alternatives in this space it’s important to consider not just performance and scalability using a least-common-denominator interface but also aspects of each candidate’s data model and operation set as they apply to your own application, because those can make a huge difference in both the actual system load and the higher-level code complexity. That, in a very small nutshell, is the sort of lesson I was hoping to learn by actually trying to apply a NoSQL approach instead of just considering it from a distance, and then to pass on to my readers. At this point I’ll probably continue with this project, even though it’s likely to be a little more involved than I thought and my free time is still very scarce, and I’m sure I’ll find more lessons like these.