Archive for October, 2011

The Future of HekaFS

A lot of people have asked me about what the acquisition of Gluster by Red Hat means for HekaFS. In the interests of transparency I’m going to share a bit of my thinking on that, but I should be extra-careful to note that none of this represents the official position or plans of Red Hat. These are just my own personal predictions, or even hopes. I can influence the official direction to some extent, but I’m like the tenth guy down the totem pole when it comes to making actual decisions.

The main thing is that HekaFS as a separate project is likely to go away, in favor of having its functionality rolled piece by piece into GlusterFS. This is actually a very good thing for the HekaFS “vision” in much the same way that the acquisition itself was a very good thing for the Gluster vision. When Gluster was a partner, I personally went to great pains to make sure there was some differentiation between GlusterFS and HekaFS, to preserve their identity and business model. This was consistent with Red Hat’s corporate mission to support other open-source communities instead of crushing them (contrast to some of our partners/competitors). Now that we’re all part of the same family, that differentiation is no longer necessary and can only sow confusion. Having HekaFS become part of GlusterFS also brings several more concrete advantages.

  • The HekaFS functionality can be more tightly integrated with the GlusterFS management infrastructure. Having our separate CLI and GUI is a bit of a pain for both developers and users. Having tenant and certificate/key management be part of the GlusterFS UI will be much nicer for everyone. This is still a work in progress, regardless of whether HekaFS maintains a separate identity, but when we’re done it will be a great improvement.
  • HekaFS can now “piggyback” on GlusterFS’s greater access to the user-experience, documentation, QA, support and other infrastructure at Red Hat. It might seem odd that HekaFS didn’t already have that access, being at Red Hat and everything, but that’s the way it is. HekaFS was an “outbound” project, enhancing and contributing to an external upstream project. GlusterFS’s status as an “inbound” project contributing to Fedora or to Red Hat subscription-based products was completely separate from that. It’s the inbound status that brings access to all that other goodness. That status is strengthened by Gluster becoming Red Hat Storage, and now HekaFS becomes an inbound project as well.
  • This might get me out of the project leader/evangelist role, and back to design/coding. I’m not going to comment on squishy feelings about that; let’s focus on the practical effect. There are people associated with Gluster and GlusterFS who are much better at all that other stuff than I am. Anand Babu Periasamy is a better visionary than I am, John Mark Walker is a better community organizer/evangelist, Hitesh and Vidya and Dave and others can do other pieces of that far better than I ever will. I hope I’ll be able to spend more time with Avati and Vijay and Amar and everyone else on the technical side, and together we’re going to kick some ass.

More resources and more focus for planning and community management, more resources and more focus for development, more resources and more focus for just about every other part of making a product . . . what’s not to like? You know all those features I’ve always had to put on the slides at the end, because they’re so far in the future? Like these? Or this? Yeah, getting off the HekaFS dune buggy and onto the Red Hat Storage rocket brings all of those closer. Stay tuned for the official version.


Taming the BEAST

By now, a lot of people have heard of BEAST, an attack against the CBC-mode encryption (including AES-CBC) used in SSL/TLS. Some people might also have noticed that the HekaFS git sources include “aes” and “cbc” branches which represent two different implementations of a new at-rest encryption method to replace the weak AES-CTR version that we’re using as a placeholder, and those people might wonder whether we share the BEAST vulnerability. Short answer: we don’t. While Edward’s “aes” branch might implement real CBC, my “cbc” branch does not. Yeah, I know that’s confusing. Simply put, I use some of the “xxx_cbc” entry points for convenience, but only for one cipherblock at a time so there’s no actual chaining involved. One correspondent has already pointed out – correctly – that “cbc” is a misnomer for what’s really tweaked ECB. Our scheme is actually pretty similar to LRW, but it uses a hash and a unique (per file) salt instead of Galois-field multiplication. It was designed to defeat a completely different attack (modification in one ciphertext block leading to a predictable change in the next plaintext block), but it also avoids the guessable-IV flaw that is the basis of BEAST.
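
To make that concrete, here’s a simplified sketch of the general shape of such a scheme: a per-block tweak derived by hashing a per-file salt together with the block index, applied XEX-style around a single-block ECB operation. This is an illustration of the construction, not the actual HekaFS code; SHA-256, the function names, and the toy stand-in cipher are all placeholders.

```python
import hashlib

BLOCK = 16  # AES block size in bytes

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def block_tweak(salt, index):
    # Per-block tweak from the per-file salt and block index.
    # (Assumption: SHA-256 truncated to one cipherblock; the real
    # scheme may use a different hash or construction.)
    return hashlib.sha256(salt + index.to_bytes(8, "big")).digest()[:BLOCK]

def encrypt_block(ecb_encrypt, salt, index, plaintext):
    # XEX-style: whiten before and after a single-block ECB encryption,
    # so there is no chaining across blocks and no guessable IV.
    t = block_tweak(salt, index)
    return xor(ecb_encrypt(xor(plaintext, t)), t)

def decrypt_block(ecb_decrypt, salt, index, ciphertext):
    t = block_tweak(salt, index)
    return xor(ecb_decrypt(xor(ciphertext, t)), t)

# Toy stand-in for AES-ECB so the sketch runs without a crypto
# library; a real implementation would use AES here.
def toy_encrypt(block):
    return bytes((b + 1) % 256 for b in block)

def toy_decrypt(block):
    return bytes((b - 1) % 256 for b in block)
```

Because the tweak differs per block and per file, identical plaintext blocks at different offsets get whitened differently, which is what defeats the cut-and-paste and pattern-matching attacks that plain ECB allows.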

There are no absolutes in this world, but I believe that our yet-to-be-named scheme is as secure as anything else out there with respect to the known attacks – including BEAST – it was designed to thwart. More importantly, people who know a ton more about this stuff than I do seem to agree. The main knock against it is performance. Calculating a hash per cipherblock is more expensive than a simple XOR. It also precludes using AES-capable hardware to its fullest. (Sad fact: commodity crypto hardware will practically always implement the crypto that’s near if not beyond the end of its useful life.) On the other hand, there are many hashes with a good balance between cryptographic strength and computational difficulty, and our approach allows us to use any of them. Hashing is in any case considerably less expensive than a full extra round of encryption would be, and that approach is commonly used (check out XEX and XTS in the Wikipedia article) without too many people complaining.

In the interest of full disclosure, I will point out that there is still one issue with the encryption we’re using. It’s one shared with practically all forms of storage encryption: the IV for a given block, though it has many good cryptographic properties, is still constant unless the key/IV/salt is changed periodically. This is a very expensive process, which essentially involves both decrypting and re-encrypting the entire file. It also introduces significant key-management complexity. We might implement this some day, or at least some way to do it manually per file without having to make an actual copy (which wouldn’t be as transparent to someone using the file). Bear in mind, though, that attacks against the storage itself would have to be executed either by your storage provider or someone who had already compromised that provider, and from then on would be equivalent to attacks against the disk in your on-premise system. Other users of the same service would not be able to execute such an attack. It’s not perfect, but it’s at least as secure as your local storage and it’s as good as anything I know about except maybe Tahoe-LAFS.

P.S. One of these days I’d like to do a comparison between HekaFS and Tahoe-LAFS, and maybe some thoughts on when you might want to use each. Zooko, would you be interested in collaborating on that?


All That . . . And a Pony!

Last night I gave a talk at BBLISA about GlusterFS, HekaFS, and cloud filesystems generally. It was a great time with a great group, and I thank everyone involved. One thing that did come up during the talk had to do with this slide about what I consider to be HekaFS’s most important future feature.
HekaFS Global Replication
Someone in the audience asked, “Can I have a pony with that?” Everyone laughed, including me. I mentioned that I had actually proposed “PonyFS” as the name for CloudFS, even before the HekaFS debacle. Everyone laughed again. Here’s the thing, though: as ambitious as that agenda might seem, it’s entirely within the realm of possibility. The first three features are common across all of the Dynamo-derived data stores such as Riak, Voldemort, and Cassandra. In the filesystem world, the best known examples of doing something like this are AFS and its descendants such as Coda and Intermezzo. Bayou and Ficus are less well known, and that’s kind of sad. If you’re at all interested in this area, you must read up on those; if you can’t see how Bayou inspired half of the “new” ideas in Dynamo (and thus in all of its derivatives) then read again until you get it. Even the old-school database dullards have done this multiple times. Really, the techniques for doing that part are pretty well known.

The claim about caching as a special case of replication might be a bit more controversial, but I don’t think it’s hard to explain. With full “push mode” optimistic replication, one side assumes that the other wants some data, and won’t throw away that data once received. Both of those assumptions can be weakened. If you don’t know that a peer wants that data, don’t send it. They might have an out-of-date copy or they might not have it all, but either way that’s explicitly what they want. If you don’t know that a peer will keep the data, don’t assume they’ve done so for purposes of ensuring your target replica count. Pulling an object into your cache thus becomes a simple matter of expressing an interest in it, which automatically triggers transmission of its current state. When you want to shut off the flow of updates, whether you’re keeping a copy or not, you tell your peers you’re no longer interested. It’s easy enough to “mix and match” full replicas with caches in this model, with only full replicas considered for data-protection purposes but updates also pushed to caches when that’s appropriate. Once you’re using a common framework, it’s even possible to do fancier things like express interest only if the last transmitted version is more than N seconds old.
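
That interest-based model is easy to sketch. In this toy in-memory version (all names are invented for illustration, not HekaFS APIs), expressing interest in a key triggers transmission of its current state, retracting interest shuts off the flow of updates, and only full replicas count toward the data-protection target:

```python
class Peer:
    def __init__(self, name, full_replica=False):
        self.name = name
        self.full_replica = full_replica  # counts toward replica target
        self.store = {}                   # key -> (version, value)

    def receive(self, key, version, value):
        # keep only if newer than what we already have
        cur = self.store.get(key)
        if cur is None or version > cur[0]:
            self.store[key] = (version, value)

class ReplicationGroup:
    def __init__(self, replica_target):
        self.replica_target = replica_target
        self.peers = {}      # name -> Peer
        self.interest = {}   # key -> set of interested peer names
        self.latest = {}     # key -> (version, value)

    def add_peer(self, peer):
        self.peers[peer.name] = peer

    def express_interest(self, name, key):
        # pulling into a cache == expressing interest:
        # the current state is sent automatically
        self.interest.setdefault(key, set()).add(name)
        if key in self.latest:
            self.peers[name].receive(key, *self.latest[key])

    def retract_interest(self, name, key):
        self.interest.get(key, set()).discard(name)

    def write(self, key, version, value):
        self.latest[key] = (version, value)
        # push to full replicas always, to caches only while interested
        for peer in self.peers.values():
            if peer.full_replica or peer.name in self.interest.get(key, set()):
                peer.receive(key, version, value)
        # only full replicas count for data protection
        copies = sum(1 for p in self.peers.values()
                     if p.full_replica and key in p.store)
        return copies >= self.replica_target
```

A cache that retracts interest simply stops seeing updates and goes stale, exactly the “mix and match” behavior described above; the fancier variants (e.g. interest only if the copy is older than N seconds) would hang off the same interest check.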

If it were all just this simple, though, people wouldn’t have laughed. In fact, I wouldn’t have been showing this slide because it wouldn’t be new functionality. The reason it’s not common already is that asynchronous write-anywhere replication doesn’t come for free. Some tradeoff has to be made, and anyone who has seen or heard me go on about the CAP Theorem has probably already guessed what that is: consistency. A system like this has lots of nice features, but strong or even predictable consistency is not one of them. Sites will be out of date with respect to one another, perhaps only for milliseconds or perhaps for days if there’s a major network disruption. For some applications that’s unacceptable; for others it’s well worth it because of the performance and availability advantages of doing things this way. Some people in the second group don’t even care if the replication maintains any semblance of the original operation order, but I’d say that the most common need is for replication that’s asynchronous but still ordered to some degree.

There’s a whole continuum here. At one extreme you have total ordering across an entire dataset (e.g. filesystem volume). This is almost as strong a guarantee as full consistency. It’s also almost as difficult or expensive to implement. At the other extreme is ordering only within objects (e.g. files). In between you have various ways of grouping objects into multiple ordered replication streams, either “naturally” (e.g. an entire directory hierarchy becomes one stream) or by explicit individual assignment of objects to streams.

In HekaFS, the highly tentative plan is to support multiple named streams, with everything assigned by default to one stream per volume. Namespace operations would always remain within that default stream; if you want a completely separate stream including namespace operations, create a separate volume and mount it on top of the first. File operations can be directed into a different stream by tagging the file, or by inheriting a tag from an ancestor directory. This supports practically all of the useful ordering models from very tight to very loose, though I wouldn’t promise that the implementation will really scale up to millions of files with an explicit separate stream for each.
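
Since the plan is highly tentative, take this as a rough illustration rather than a design: stream selection for an operation could amount to a walk up the directory tree, with the nearest tag winning and namespace operations pinned to the default stream. All names here are invented.

```python
import posixpath

DEFAULT_STREAM = "volume-default"

def stream_for(path, tags, is_namespace_op=False):
    """Pick the replication stream for an operation on `path`.

    `tags` maps paths (files or directories) to stream names.
    Namespace operations (create, rename, unlink, ...) always stay
    in the default stream so the namespace itself stays ordered.
    """
    if is_namespace_op:
        return DEFAULT_STREAM
    # walk from the file up through its ancestors; nearest tag wins
    p = path
    while True:
        if p in tags:
            return tags[p]
        parent = posixpath.dirname(p)
        if parent == p:          # reached the root with no tag
            return DEFAULT_STREAM
        p = parent
```

Tagging a directory thus “naturally” puts its whole hierarchy into one ordered stream, while tagging an individual file carves it out into its own, which covers both ends of the grouping spectrum described above.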

The one thing I haven’t addressed yet is how conflicts are resolved, when clients in two places really do update the same file either simultaneously or during a single partition event. That’s a complex enough topic that it’s best left to a separate post. What I hope I’ve been able to do here is explain some of why I believe the feature set represented by that slide is both useful and achievable, and what tradeoffs or pitfalls people need to be aware of to take advantage of it (once it actually exists).