In case people are wondering why I haven’t been posting here, it’s partly being busy at work, partly being busy with Christmas stuff, partly being sick with what has come to be known in the office as the Creeping Death, and partly because I’ve been writing a lot over at CloudFS.org. If you’re interested in the why and what of CloudFS, go check it out.
I try to keep blogging about $dayjob to an absolute minimum, but this is kind of a big deal. Today, I pushed the first bits of CloudFS out to a public git repository. A week or so ago I registered cloudfs.org to hold non-code content, but it’s not really up yet, so for now you can clone that repository and look at the README.md therein, or at the Fedora 15 feature page. Here are a few nuggets to get you started.
- CloudFS is basically GlusterFS enhanced to allow deployment by a provider as a permanent shared service, rather than as a private thing that users can run within their own compute instances.
- The enhancements necessary fall into several broad categories: authentication, encryption, isolation (each tenant gets their own namespace and UID/GID space), quota/billing, and some necessary enhancements to existing GlusterFS functionality.
- This is a very pragmatic and unambitious release, explicitly not including the improved-DHT and multi-site-replication functionality that I think will make CloudFS really cool. Think of it as a warm-up to the main attraction.
- The code is nowhere near complete. The three translators I’ve written are complete enough to do useful things – and more importantly to be worth reviewing – but all need to be improved in various ways and there are other bits (mostly around configuration and management) that don’t even exist yet. To put it another way, I think the code represents that point on a journey where you’ve climbed the highest hill and can see the destination, but there are many miles yet to be walked.
Once I get cloudfs.org set up the rest of the way, I’ll probably start posting more info there. Stay tuned.
I’m a card-carrying ACLU member – literally – and proud of it. I truly support and applaud the work they do, even on issues where I don’t quite share their position. After all, being able to support others’ right to do things that make you uncomfortable is kind of the whole point. However, the number of issues on which I think they’ve lost their way is significant and seems to keep growing, and I’m getting very tired of them as an organization. The far-too-frequent calls asking for more money (I’m already a monthly donor) are one example; the broken member-login web page, which keeps me from changing my donation preferences in protest, is another. In a world where the ACLU was the only organization doing what they do I’d just suck it up and keep supporting them, but they’re not, so I’m looking for alternatives. For example, does anyone know anything about the Center for Constitutional Rights, or the Bill of Rights Defense Committee? Can a left-leaning civil libertarian like me support one of these instead, in good conscience? Are there others I might not have heard of?
We will return to our regularly scheduled technical program shortly. ;)
Since this is a subject that has been much on my mind lately, I’m going to take another shot at discussing and clarifying the differences between caches and replicas. I’ve written about this before, but I’m going to take a different perspective this time and propose a different distinction.
A replica is supposed to be complete and authoritative. A cache can be incomplete and/or non-authoritative.
I’m using “suppose” here in the almost old-fashioned sense of assume or believe, not demand or require. The assumption or belief might not actually reflect reality. A cache might in fact be complete, while a replica might be incomplete – and probably will be, when factors such as propagation delays and conflict resolution are considered. The important thing is how these two contrary suppositions guide behavior when a client requests data. This distinction is most important in the negative case: if you can’t find a datum in a replica then you proceed as if it doesn’t exist anywhere, but if you can’t find a datum in a cache then you look somewhere else. Here are several other possible distinctions that I think do not work as well.
- Read/write vs. read-only. CPU caches and NoSQL replicas are usually writable, while web caches and secondary replicas used for Disaster Recovery across a WAN are more often read-only.
- Hierarchy vs. peer to peer. CPU caches are usually hierarchical with respect to (more authoritative) memory, but treat each other as peers in an SMP system – and don’t forget COMA either. On the flip side, the same NoSQL and DR examples show that both arrangements apply to replicas as well.
- Increasing performance vs. increasing availability. This is the distinction I tried to make in my previous article, and in practice that distinction often applies to the domain where I made it (see the comments). I’m not backing away from the claim that a cache which fails to improve performance or a replica which fails to improve availability has failed, but either a cache or a replica can serve both purposes so I don’t think it’s the most essential distinction between the two.
- Pull vs. push. Caches usually pull data from a more authoritative source, while replicas are usually kept up to date by pushing from where data was modified, but counterexamples do exist. Many caches do allow push-based warming, and in the case of a content delivery network there’s probably more pushing than pulling. I’m having trouble thinking of a pull-based replica that’s not really more of a proxy (possibly with a cache attached) but the existence of push-based caches already makes this distinction less essential. One thing that does remain true is that a cache can and will fall back to pulling data for a negative result, while a replica either can’t or won’t.
- Independent operation. Again, replicas are usually set up so that they can operate independently (i.e. without connection to any other replica) while caches usually are not, but this is not necessarily the case. For example, Coda allowed users to run with a disconnected cache.
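The core distinction (a replica treats a local miss as authoritative, while a cache falls back to pulling from somewhere more authoritative) can be sketched roughly like this. This is just an illustrative toy, and all of the names in it are hypothetical:

```python
class Replica:
    """Supposed to be complete: a local miss is treated as authoritative."""
    def __init__(self, data):
        self.data = dict(data)

    def lookup(self, key):
        # Missing locally => assumed missing everywhere.
        return self.data.get(key)  # None means "does not exist"


class Cache:
    """Supposed to be incomplete: a local miss triggers a pull upstream."""
    def __init__(self, upstream, capacity=2):
        self.upstream = upstream   # more authoritative source
        self.capacity = capacity
        self.data = {}

    def lookup(self, key):
        if key in self.data:
            return self.data[key]
        value = self.upstream.lookup(key)  # fall back to pulling
        if value is not None:
            if len(self.data) >= self.capacity:
                # A cache is free to drop data at any time.
                self.data.pop(next(iter(self.data)))
            self.data[key] = value
        return value


primary = Replica({"a": 1, "b": 2})
cache = Cache(primary)
print(cache.lookup("a"))    # 1, pulled from the primary on a miss
print(primary.lookup("z"))  # None: a replica miss means "nonexistent"
```

Note that the negative case is where the two classes really diverge: the replica returns its own answer for a missing key, while the cache always asks upstream before giving up.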
Now that I’ve said that some of these differences are inessential, I’m going to incorporate them into two pedagogical examples for further discussion. Let’s say that you’re trying to extend your existing primary data store with either a push-based peer-to-peer replica or a pull-based hierarchical cache at a second site. What are the practical reasons for choosing one or the other? The first issue is going to be capacity. A replica requires capacity roughly equal to that of any other replica; a cache may be similarly sized, but most often will be smaller. While storage is cheap, it does add up, and for very large data sets a (nearly) complete replica at the second site might not be an economically feasible option. The second issue is going to be at the intersection of network bandwidth/latency/reliability and consistency. Consider the following two scenarios:
- For a naively implemented push-based replica, each data write is going to generate cross-site traffic. If the update rate is high, then this is going to force a tradeoff between high costs for high bandwidth vs. low consistency at low bandwidth. You can reduce the cross-site bandwidth by delaying and possibly coalescing or skipping updates, but now you’re moving toward even lower consistency.
- For a pull-based cache, your tradeoff will be high cost for a large cache vs. high latency for a small one.
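As a rough back-of-the-envelope sketch of the first scenario, with entirely made-up numbers, here is how coalescing pushed updates trades cross-site bandwidth against staleness:

```python
# Hypothetical workload: a push-based replica at a second site.
update_rate = 1000        # writes per second at the primary site
update_size = 4096        # bytes per pushed update
overwrite_fraction = 0.5  # fraction of writes that overwrite a recent key

def replica_bandwidth(coalesce_window):
    """Very crude model of cross-site bytes/sec when updates are
    batched over coalesce_window seconds; overwrites of the same
    key within a window collapse into one pushed update."""
    if coalesce_window == 0:
        # Naive push: every write crosses the WAN immediately.
        return update_rate * update_size
    effective_rate = update_rate * (1 - overwrite_fraction)
    return effective_rate * update_size

for window in (0, 10):
    bw = replica_bandwidth(window)
    print(f"window={window:>2}s  bandwidth={bw/1e6:.2f} MB/s  "
          f"worst-case staleness={window}s")
```

The model is deliberately simplistic, but it shows the shape of the tradeoff: every byte of bandwidth you save by delaying or coalescing pushes is paid for in seconds of potential staleness at the second site.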
Yes, folks, we’re in another classic triangle – Price vs. Latency vs. Consistency. Acronyms and further analysis in those terms are left to the reader, except that I’ll state a preference for PLaC over CLaP. The point I’d rather make here is that the choice doesn’t have to be one or another for all of your data. Caches and replicas both have a scope. It’s perfectly reasonable to use caching for one scope and replication for another, even between two sites. The main practical consequences of being a cache instead of a replica are as follows.
- You can drop data whenever you want, without having to worry about whether that was the only copy.
- You can ignore (or shut off) updates that are pushed to you.
- You can fall back to pulling data that’s not available locally.
If you have a data set that’s subdivided somehow, and a primary site that’s permanently authoritative for all subsets, then it’s easy to imagine dynamically converting caches into replicas and vice versa based on usage. All you need to do is add or remove these behaviors on a per-subset basis. In fact, you can even change how the full set is divided into subsets dynamically. Demoting from a replica to a cache is easy (so long as you’ve ensured the existence of an authoritative copy elsewhere); you just give yourself permission to start dropping data and updates. Promoting from a cache to a replica is trickier, since you have to remain in data-pulling cache mode until you’ve allocated space and completed filling the new replica, but it’s still quite possible. With this kind of flexibility, you could have a distributed storage system capable of adapting to all sorts of PLaC requirements as those change over time, instead of representing one fixed point where only the available capacity/bandwidth can change. Maybe some day I’ll implement such a system. ;)
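A minimal sketch of that per-subset conversion, assuming a permanently authoritative primary site and with all names hypothetical:

```python
class Site:
    """A second site that tracks, per data subset, whether it acts as a
    cache (may drop data, pulls on miss) or a replica (holds everything)."""
    def __init__(self, primary):
        self.primary = primary  # permanently authoritative store
        self.mode = {}          # subset -> "cache" or "replica"
        self.store = {}         # subset -> {key: value}

    def demote(self, subset):
        # Replica -> cache: easy; just allow dropping data from now on.
        self.mode[subset] = "cache"

    def promote(self, subset):
        # Cache -> replica: stay in pull mode until the subset is fully
        # copied from the authoritative site, then flip the mode.
        self.store[subset] = dict(self.primary[subset])
        self.mode[subset] = "replica"

    def lookup(self, subset, key):
        local = self.store.setdefault(subset, {})
        if key in local:
            return local[key]
        if self.mode.get(subset, "cache") == "replica":
            return None  # authoritative miss: datum does not exist
        value = self.primary[subset].get(key)  # cache mode: pull through
        if value is not None:
            local[key] = value
        return value


primary = {"hot": {"x": 1}, "cold": {"y": 2}}
site = Site(primary)
site.promote("hot")              # heavily used subset becomes a replica
print(site.lookup("hot", "x"))   # 1, served locally and authoritatively
print(site.lookup("cold", "y"))  # 2, pulled on demand as a cache
```

The asymmetry described above shows up directly in the code: demotion is a one-line mode flip, while promotion has to finish copying the subset before it can claim authority over misses.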
For those (few) who follow me but not there, here are my top ten from November.
- OH at museum: Now that we’ve appreciated all the diversity, can we please move on? (November 07)
- If I have publicly and violently clashed with the founders, please pardon my raucous laughter when you try to recruit me. (November 09)
- Using Inconsolata font for code editing. Quite nice. (November 11)
- My take on the “your argument is invalid” meme, inspired by driftx on #cassandra. http://imgur.com/WtiFL (November 12)
- Tea Party yard work: borrow a neighbor’s leaf blower, then blow all your leaves onto his yard. (November 14)
- http://www.cs.virginia.edu/~weimer/ shows a *negative* correlation between some popular coding-style preferences and actual readability. (November 15)
- If you work in distributed systems but haven’t read Saito and Shapiro then fix that. (November 16)
- How many applications have you used today? How many are you personally willing to rewrite to use an “alternative data store”? (November 29)
- I am the king of . . . just a minute. Where was I? Oh yeah, the king of . . . just a sec . . . multitasking. (November 29)
- If hand-waving built muscle, I’d know some very buff architects. (November 30)
Yes, you read that right – Zettar, not Zetta. Robin Harris mentioned them, so I had a quick look. At first I was going to reply as a comment on Robin’s site, but the reply got long and this site has been idle for a while so I figured I’d post it here instead. I also think there are some serious issues with how Zettar are positioning themselves, and I feel a bit more free expressing my thoughts on those issues here. So, here are my reactions.
First, if I were at Zetta, I’d be livid about a company in a very similar space using a very similar name like this. That’s discourteous at best, even if you don’t think it’s unethical/illegal. It’s annoying when the older company has to spend time in every interaction dispelling any confusion between the two, and that’s if they even get the chance before someone writes them off entirely because they looked at the wrong website. The fact that Robin felt it necessary to add a disambiguating note in his post indicates that it’s a real problem.
Second, Amazon etc. might not sell you their S3 software, but there are other implementations – Project Hail’s tabled, Eucalyptus’s Walrus, ParkPlace. Then there’s OpenStack Storage (Swift) which is not quite the same API but similar enough that anyone who can use one can pretty much use the other. People can and do run all of these in their private clouds just fine.
Third, there are several packages that provide a filesystem interface on top of an S3-like store – s3fs, s3fuse, JungleDisk, even my own VoldFS (despite the name). Are any of these production-ready? Perhaps not, but several are older than Zettar and several are open source. The difference in deployment risk for Zettar vs. alternatives is extremely small.
Lastly, Zettar’s benchmarks are a joke. The comparison between local Zettar and remote S3 is obviously useless, but even the comparison with S3 from within EC2 is deceptive. Let’s just look at some of the things they did wrong:
- They compare domU-to-domU for themselves vs. EC2 instances which are likely to be physically separate.
- They fail to disclose what physical hardware their own tests were run on. It’s all very well to say that you have a domU with an xxx GHz VCPU and yyy GB of memory, but physical hardware matters as well.
- Disk would have mattered if their tests weren’t using laughably small file and total sizes (less than memory).
- 32-bit? EC2 m1.small? Come on.
- They describe one set of results, failing to account for the fact that EC2 performance is well known to vary over time. How many runs did they really do before they picked these results to disclose?
- To measure efficiency, they ran a separate set of tests using KVM on a notebook. WTF? Of course their monitoring showed negligible load, because their test presents negligible load.
There’s nothing useful about these results at all. Even Phoronix can do better, and that’s the very lowest rung of the storage-benchmarking ladder.
This looks like a distinctly non-enterprise product measured in distinctly non-enterprise ways. They even refer to “cloud storage 2.0,” I guess to set themselves apart from other cloud-storage players. They’ve set themselves apart, all right. Beneath.
Disclaimer: I work directly with the Project Hail developers, less so with the OpenStack Storage folks. My own CloudFS (work) and VoldFS (personal) projects could also be considered to be in the same space. This is all my own personal opinion, nothing to do with Red Hat, etc.