Archive for August, 2012

glupy – Writing GlusterFS Translators in Python

Over the weekend – and, obviously, a little bit today – I’ve been working on one of those projects that has been baking in the back of my mind for a long time. I always have a bunch of these queued up. Sometimes I use them as warmups or breaks when I’m feeling a bit stuck, much like a novelist in one genre might write a poem or a short story in a different genre to overcome writer’s block. Other times I use them to learn or refresh skills that I don’t otherwise get to exercise as part of my regular duties.

Anyway, in this case the project was spurred by my recent efforts to add a Python interface to Avati’s new “gfapi” GlusterFS library. Why stop there? Why not go all the way and provide glue to write actual translators in Python? Thus was born glupy (pronounced “gloopy” in my head because I find it amusing). With that in mind, I read the Python ctypes documentation more carefully, and the embedding documentation as well. Python extension (letting Python code call C) is pretty familiar territory to me, but I had never tried embedding (letting C call Python) before, and I get the impression that my experience mirrors that of the community in general, so this was a learning experience.

With all of that information semi-digested, I started hacking. After overcoming several of the issues typical of this kind of glue programming, I got to the point where I had something that basically worked, and decided to implement a version of my negative-lookup-caching translator using the new Python infrastructure. Having done that – results below – I now feel comfortable that glupy is “for real” enough to write about. I’m going to save the “how” for later, because it turns out that I might get the chance to write about that at greater length elsewhere, but let’s get some of the “why” out of the way.
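(A quick aside before the “why”, for anyone who hasn’t seen the embedding direction in action: the essence of it is that C code ends up holding a pointer to a Python callable and invoking it. The snippet below is not glupy code at all, just the standard ctypes callback pattern, with libc’s qsort standing in for the C caller and the “libc.so.6” name assuming Linux.)

    import ctypes

    # Load the C library.  "libc.so.6" assumes Linux; use
    # ctypes.util.find_library("c") on other platforms.
    libc = ctypes.CDLL("libc.so.6")

    # qsort expects a comparator: int (*compar)(const void *, const void *)
    CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                               ctypes.POINTER(ctypes.c_int),
                               ctypes.POINTER(ctypes.c_int))

    def py_cmp(a, b):
        # Ordinary Python code, invoked from inside C's qsort -- the "embedding" direction.
        return a[0] - b[0]

    values = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
    libc.qsort(values, len(values), ctypes.sizeof(ctypes.c_int), CMPFUNC(py_cmp))
    print(list(values))   # [1, 2, 3, 4, 5]

Glupy does something similar in spirit, except that the function pointers are translator fops and the C caller is GlusterFS itself; the details will have to wait for that longer write-up.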

The reasoning behind glupy is mostly the same as the reasoning behind FUSE itself, or the Python bindings for FUSE. I’m a firm believer that X functionality should be implemented in the X subsystem, where X in this case is storage. I’m frankly a bit tired of seeing people implement storage functionality as layer after uncoordinated layer on top of the storage subsystem, just because writing code within the storage system is too hard for them, so anything I can do to make it easier seems worthwhile. The simple fact is that higher-level languages reduce barriers to entry. Having access to sophisticated code and data structures with automatic memory management makes code easier to write. This effect tends to compound itself, as the higher-level-language libraries for any given task also tend to have more coherent and generally pleasant interfaces than their C counterparts, so the higher up the stack you go the more benefit you get. I know this approach works, because I’ve personally worked on a project (C3D at EMC) where just the conversion of a prototype implementation from Python to C took longer than getting the prototype working in the first place. If I’d had to debug the protocol and the language-specific implementation at the same time, in the less convenient language, I’m quite sure the overall project time would have tripled. Sometimes the storage subsystem is the right place to implement functionality but C is the wrong language.

The secondary questions have to do with my choice of higher-level language. Why Python instead of Ruby or Lua? Why CPython instead of PyPy? In both cases, my own familiarity was a factor. I learned Python back in the 1.5 days and, having learned it, never felt the others were different enough to justify an extended effort to learn them properly. Furthermore, I have experience integrating Python with C, so this probably took me half as long as the other integrations would have. Also, Python is the alternative people actually ask for: CPython in particular is the scripting language most likely to be installed out of the box on GlusterFS users’ systems, it’s the only scripting language I’ve heard users request, UFO is written in Python, and many parts of HekaFS were written in Python. Maybe I’ll stretch a little more and do one of the others some day, but I already have my work cut out for me, so don’t hold your breath.

OK, enough justification. How about those performance results? What I expected was that the same performance benefit I showed in my Red Hat Summit slides would still exist, because – and I can’t stress this often or strongly enough – when you’re dealing with performance in a distributed system, the first thing you should seek to minimize is network round trips and synchronization delays. Only then should you even worry about disk performance, let alone CPU overhead. The use of a higher-level language just shouldn’t matter for the case that negative lookup caching is meant to address. So, without further ado, here are the results for my “PHP simulation”, which measures the average time to do a thousand include-file lookups across ten directories (a power-law distribution, with 80% of requests going to 10% of the files); a rough sketch of that kind of workload follows the numbers.

  • Vanilla configuration: 5.8ms
  • Add Python-based negative lookup caching: 1.5ms
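
For anyone who wants to picture the workload, here is a rough, hypothetical sketch of a generator along those lines. The mount point, file counts, and the 80/20 split as an approximation of the power-law skew are my assumptions from the description above, not the actual benchmark script.

    import os
    import random
    import time

    MOUNT = "/mnt/glusterfs"                        # assumed mount point
    DIRS = ["dir%02d" % i for i in range(10)]       # ten include directories
    FILES = ["inc%03d.php" % i for i in range(100)] # candidate include names
    HOT = FILES[:10]                                # the "hot" 10% of names

    def one_lookup():
        # 80% of requests go to the hot 10% of names.
        name = random.choice(HOT) if random.random() < 0.8 else random.choice(FILES)
        path = os.path.join(MOUNT, random.choice(DIRS), name)
        # Like a PHP include_path search, most of these lookups fail with ENOENT,
        # which is exactly the case negative-lookup caching targets.
        os.path.exists(path)

    N = 1000
    start = time.time()
    for _ in range(N):
        one_lookup()
    elapsed = time.time() - start
    print("total: %.1f ms, average per lookup: %.3f ms"
          % (elapsed * 1000.0, elapsed * 1000.0 / N))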

That improvement (nearly 4x) is actually better than the 3x I saw with the C implementation. I wouldn’t obsess over the precise numbers too much because this is just one run of a fairly small-scale synthetic benchmark, but it’s certainly enough to support my theory that language overhead doesn’t matter in this case. Also, the Python code (for this specific translator, not the infrastructure) is approximately half as long as the equivalent parts in C, and I’d say it’s a lot more understandable as well.
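
To give a feel for what that kind of Python looks like, here’s a schematic sketch of a negative-lookup cache. The class name, method signatures, and the pass_through/reply hooks are invented for illustration; they are not glupy’s real interface, and the real translator also has to handle invalidation cases (mkdir, rename, changes from other clients) that this sketch ignores.

    from errno import ENOENT

    class NegativeLookupCache(object):
        """Schematic only: remember names that failed lookup, and short-circuit
        repeat lookups with ENOENT instead of sending them over the network."""

        def __init__(self):
            # Map of (parent_gfid, basename) -> True for names known not to exist.
            self.misses = {}

        def lookup(self, parent_gfid, name, pass_through, reply):
            # Fast path: we already know this name doesn't exist.
            if (parent_gfid, name) in self.misses:
                reply(op_ret=-1, op_errno=ENOENT)
                return
            # Slow path: send the lookup downward and remember a miss.
            def on_lookup_done(op_ret, op_errno):
                if op_ret < 0 and op_errno == ENOENT:
                    self.misses[(parent_gfid, name)] = True
                reply(op_ret=op_ret, op_errno=op_errno)
            pass_through(on_lookup_done)

        def create(self, parent_gfid, name, pass_through, reply):
            # Dropping the cached miss before forwarding the create keeps the
            # cache conservative: at worst we do one extra real lookup later.
            self.misses.pop((parent_gfid, name), None)
            pass_through(reply)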

It’s still early days for glupy. There are literally dozens more functions to implement beyond the two I needed for this test, and then all sorts of other infrastructure to add before the Python environment is as complete as the one for C, but it’s a very auspicious start.


Making Choices

One of the key points I tried to make in my Red Hat Summit talk about GlusterFS last month is that GlusterFS quite deliberately does not trade away data safety or consistency for performance. That’s a painful choice, because everyone always wants to be the speed king and they’ll be sharply critical of anyone they feel is not running as fast as they can. However, one thing I’ve learned about the storage marketplace is that the recognition of priorities other than speed is what separates the pros you can trust from the amateurs you can’t. Harsh, perhaps, but true. Sure, speed matters, but so do robustness and ease of use and cost and features and community and compatibility and a whole bunch of other things. This is especially true when the system is designed to be scalable so that you can address performance issues by adding hardware at linear rather than exponential cost per increment. If you can buy more performance but you can’t buy more of those other things, you’d be a fool to buy the system that’s built for speed and speed alone.

This issue came up again recently when – for what seems like the thousandth time – someone commented on GlusterFS’s poor small-write performance. Well, yeah, because when we say a write is done it’s done. It’s on as many remote servers as you asked for, not just buffered locally like our highest-profile competitor. That’s an example of refusing to sacrifice data safety for the sake of better performance. Similarly, when you list a directory we actually check whether new files have appeared or old ones have been deleted, instead of just returning cached and possibly stale information, so directory listings and general “many small file” workloads tend to perform poorly on GlusterFS compared to systems that take those shortcuts. That’s an example of refusing to sacrifice consistency/correctness for the sake of performance. Sure, we could buffer writes more and cache reads more, and most users would probably not even notice except for the improved performance, but some users would experience failures and even data loss because their expectations (however unrealistic those might be) were not met. Safety and correctness are the defaults, and I shouldn’t even need to defend that position. Where we haven’t done as well is in allowing those defaults to be changed.

The fundamental problem here is a three-way tradeoff: performance vs. consistency vs. simplicity. (If this seems a lot like CAP, or even more like PACELC, that should come as no surprise. Probably because messages are so expensive and you can only do so much with each one, “triangles” like this seem particularly common when dealing with distributed systems.) In this case, performance and consistency are pretty self-explanatory. Simplicity is a bit harder. Far from being a mere matter of aesthetics, simplicity is also a matter of the effort required to make progress in other directions. When a system becomes too complex, it becomes incredibly hard to deal with all of the cases that arise during fault recovery, let alone those that result in slowdowns without a fault. New features are harder to add, new developers are harder to attract, monitoring and ease of use suffer, and so on. In distributed systems, simplicity is not a virtue – it’s a necessity. Thus, if you choose to give up simplicity, a lot of people might not care about your performance and consistency because your system either won’t do what they need or can’t be trusted to keep on doing it. That said, let’s look at what choices various systems have made:

  • Sacrifice performance. Obviously I’d put GlusterFS in this category. Some highly available NFS implementations have made the same choice, as have relational databases in general.
  • Sacrifice consistency. This is exactly what all of the NoSQL data stores are known for, along with HDFS and “blob stores” like Amazon’s S3 or OpenStack’s Swift. Non-HA implementations of NFS also tend this way, though it could well be argued that they give up very little consistency to very good effect.
  • Sacrifice simplicity. AFS (and its descendants) pretty much went this route, which is why most people have barely heard of them. Some might put Lustre and/or NFSv4.x in this category as well.

The point is not to say one set of tradeoffs is better than the others, even though – as with CAP – I feel that one choice leads to a distinctly less useful system than the other two. What’s more important is to realize that a tradeoff is being made, and to understand its nature. I for one would like to see GlusterFS offer more of an option to sacrifice consistency when and how the user chooses to do so, even if that’s not the default behavior. That’s why I’ve worked on asynchronous replication, replication bypass, negative-lookup caching, and a whole bunch of other things that all weaken consistency in a controlled (and modular) way. I spent years trying to implement distributed invalidation-based consistency, and I was one of the most stubborn holdouts for that approach, but even I’ve come to believe that it’s firmly in the “too complex” category, especially when fault handling is considered. Nonetheless, I still hope to add some sort of TTL-based on-disk client caching some day. The performance “sweet spot” for GlusterFS will continue to grow, including more and more workloads and use cases, but never by trading away characteristics that are even more important.