Breaking Things Up

After quite a few years of relative quiet, there’s a lot of activity and interest in the filesystem world. Many interesting local filesystems are newly available or well into development. Besides allowing the creation of bigger and faster filesystems, features like improved data integrity and built-in snapshot/clone support are being added to the mix. There’s another group of new filesystems oriented toward providing better support for solid-state disks. There’s activity in the parallel-filesystem world too, with both old and new entrants battling for and sometimes gaining acceptance beyond their traditional niches. A lot of old assumptions are being revisited, but maybe there are still others that need to be.

In the old days, a filesystem’s size was determined when it was created, and could never be changed. If you wanted more capacity, you created a new filesystem on a new set of disks or partitions. If you wanted to free up space, perhaps to use it for a new filesystem mounted elsewhere, you were pretty much out of luck. Nowadays, of course, things are a bit better. Filesystems and volume managers often work together to support increasing (and, less often, decreasing) the size of a filesystem. Throughout all of this, however, the filesystem has remained monolithic in the sense that all files within the filesystem have essentially the same characteristics. If the underlying volume is RAID-1, then all files are replicated to the same degree. If it’s RAID-5, then all files are striped to the same degree. If you want files in a particular subdirectory to have different characteristics, such as being more widely striped on faster disks, then you pretty much have to create a whole separate filesystem on a separate volume, and mount it on that subdirectory. In some parallel filesystems, you get a little more flexibility; you can define how many object servers, and perhaps even which ones, a certain file or directory should be striped across. That still doesn’t give you much control over replication, though, and the parameters you specify generally apply only to newly created files (i.e. existing files aren’t re-striped if you change the parameters).

What’s going on here is that a filesystem boundary is not only a capacity boundary but also a layout-policy boundary, a consistency boundary, and an everything-else boundary. Many people realize that moving a multi-gigabyte file within a filesystem is likely to be quick and atomic, while moving it across filesystems is really a copy-then-delete sequence that will require massive data movement. Somewhat fewer realize that you can force ordering of writes within a filesystem, but that you have no such guarantees when you write across more than one. Very few indeed seem to think about whether these boundaries all need to be the same. Valerie Aurora’s chunkfs is one example of breaking up a filesystem into multiple units that can be checked separately. It’s not hard to imagine a local filesystem that offers the same kind of control over striping as parallel filesystems do, or even one that adds re-striping and control over replication as well. In a distributed filesystem, consistency or ordering domains smaller than the whole filesystem could be quite beneficial. In some other environments, implementing quota or security policy across smaller domains would be helpful too.
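
That first boundary is easy to demonstrate. Below is a minimal Python sketch – the move() helper is invented for illustration, but the underlying behavior is real: rename() within one filesystem is a quick, atomic metadata update, while across filesystems it fails with EXDEV and the only recourse is copy-then-delete.

import errno
import os
import shutil

def move(src, dst):
    """Move src to dst, illustrating the filesystem-boundary cost."""
    try:
        os.rename(src, dst)          # atomic, fast regardless of file size
    except OSError as e:
        if e.errno != errno.EXDEV:   # EXDEV: src and dst are on different filesystems
            raise
        shutil.copy2(src, dst)       # massive data movement for a large file
        os.unlink(src)               # then delete the original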

The key in all of these cases is that the concept of a “policy point” – the top of a hierarchy within which a certain layout or consistency or other policy applies – needs to be decoupled from the concept of a mount point. One fairly obvious way to do this would be to treat every directory as a policy point, inheriting from its parent if need be, but that might result in having so many policy points that managing them becomes a problem. It’s probably sufficient to label only specific directories – e.g. a user’s home directory, an application’s working directory – as policy points, however many directory levels exist below each. A large filesystem might therefore have some dozens to hundreds, but not thousands to millions, of policy points within it. It’s a slightly more complex model than what we have now, but I think it’s also one that maps better to users’ needs. I expect that I’ll be applying the concept to any filesystems I work on, but maybe that’s getting ahead of myself.
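
For the curious, here’s how little machinery nearest-ancestor policy lookup would need. Everything below – the table, the policy fields, the function – is hypothetical, invented purely for illustration; the point is that with only labeled directories acting as policy points, resolution is a short walk up the tree against a small table.

import os

# Hypothetical policy table: only specifically labeled directories are
# policy points, and everything below each one inherits from it.
POLICY_POINTS = {
    "/": {"stripe_width": 2, "replicas": 2},            # filesystem-wide default
    "/home/alice": {"stripe_width": 8, "replicas": 3},  # a user's home directory
    "/scratch/render": {"stripe_width": 16, "replicas": 1},
}

def resolve_policy(path):
    """Walk up from path to the nearest labeled policy point."""
    p = os.path.abspath(path)
    while True:
        if p in POLICY_POINTS:
            return POLICY_POINTS[p]
        if p == "/":
            raise LookupError("no root policy defined")
        p = os.path.dirname(p)

print(resolve_policy("/home/alice/projects/data.bin"))  # inherits /home/alice's policy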

A Future Without Books

Cindy: If books disappear, what are people going to do when they need a heavy flat object to press leaves or something?

me: Use last year’s computer.

Storage Classification

During my day-job quest to define what “cloud filesystem” might most usefully mean, I keep coming back to one statement: cloud users need somewhere to store their data. Yes, I know that seems obvious. It’s so obvious that one can scarcely imagine anybody thinking otherwise . . . and yet such thinking (or perhaps lack of thinking) seems all too common among people involved in cloud computing. Picture a pie chart of how a lot of cloud folks seem to allocate their time spent thinking about the technical issues: compute gets a big slice, networking a respectable one, and storage barely a sliver.

As a storage guy, I find this not only annoying but stupid as well. It’s not just cloud storage that gets this treatment, by the way; storage has long been the red-headed stepchild of computing. Kids come out of college knowing a lot about processors and compilers and AI and all sorts of other computation-oriented stuff. Nowadays they also know a lot about networking, though that knowledge is often specific to IP networking and doesn’t cover enough of the other kinds of communication that occur within and between modern computer systems (e.g. internal or cluster interconnects). They might even learn something about security, but storage? That’s just the black box where you put stuff when you’re done. Never mind that a modern high-end storage system is likely to be more powerful and sophisticated in every dimension than anything connected to it . . . but I digress. The real point here is that cloud people have put off talking (much) about storage for a long time, but they finally seem ready to talk, so let’s talk. First, let’s look at a distinction that’s already often made between different cloud offerings, and see how it applies to cloud storage.

  • “Infrastructure as a Service” means providing familiar but low-level functionality, close to the nuts and bolts of how clouds are built. Emphasis here is on letting users build their own application “stacks” almost the same way they do outside of the cloud, but on demand and on somebody else’s hardware. Amazon’s EC2 is the obvious example, with Rackspace and GoGrid providing the best known alternatives. In storage, “familiar” means block or filesystem access; examples might include Amazon’s EBS or various vendors’ provision of SAN (especially iSCSI) or NAS facilities to cloud users.
  • “Platform as a Service” is more abstracted from hardware and operating systems, providing its own “stack” which defines (some might say dictates) more of the applications’ structure. Google’s AppEngine, Microsoft Azure, or the entire J2EE ecosystem are examples here. Familiarity is less of an issue here, so a plethora of options have appeared – traditional databases, schema-less and/or “NoSQL” databases, generic key/value stores, in-memory or persistent data grids (especially in the Java world), and so on.
  • I’d also put Amazon’s S3 in the “platform” category. Some might say it’s not quite the same as the other platform options I’ve mentioned, and they’re right – which brings up the second distinction that I think matters. People in the storage world have been thinking about operational vs. archival storage for a long time, but the concepts and terms are finally entering the cloud conversation. In that context I would also add a third category, as follows.

  • Operational storage is what an application uses while it’s running, perhaps only while a single request is being processed. Emphasis is on low latency and high transaction rates. In traditional storage, this translates into small random I/O, and is often best served by fast disks or SSDs. In cloud storage this would also encompass in-memory caching systems, data grids, and key/value stores.
  • Archival storage is the opposite of operational storage in most regards. Emphasis is on data permanence, often with retention/deletion guarantees and/or rich metadata. Examples in traditional storage include virtual tape libraries or content-addressed storage systems using larger/slower disks or even non-disk media. In the cloud space this is where I’d put Amazon’s S3, EMC’s Atmos, and anything based on the Simple Cloud API.
  • Batch storage is my third category, and a bit of a hybrid. Performance is again a focus, but in this case it’s more about high bandwidth for large sequential I/O. Permanence matters more than for operational storage but less than for archival. Device speed matters less in this case than number of devices coupled with fast controllers and interconnects. In traditional storage, many parallel filesystems used in HPC or video (both processing and distribution) address this need. Some of them do so intentionally, never intending to support operational-storage access patterns in the first place, while others end up this way because they’re unintentionally lousy for those access patterns. In the cloud world, this is where I’d put GoogleFS and HDFS.

Cross infrastructure/platform with operational/archival/batch and you get six categories; cross again with traditional/cloud and you get twelve. Where do I sit? Mostly at the intersection of the infrastructure level with the operational type, with a side order of cloud. As infrastructure, a cloud filesystem has to use a familiar interface. As operational storage, it has to provide good performance, especially for small/random I/O patterns. As a cloud component, it has to be shared, distributed, dynamically scalable, and multi-tenant. That combination is a bit of a gap right now. I think it’s something users might want, but even those who are thinking about storage in the cloud tend to be heading in other directions. Probably the closest you can get right now is a clustered NAS such as those provided by Isilon, BlueArc, or Exanet. The money that has been poured into these companies, and the money they make in return, validates the need and the interest, but they all cost $$$ for proprietary hardware and software. I think there’s a place and a possibility for something more cost effective, which also more directly addresses cloud needs such as distribution and multi-tenancy.
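
For concreteness, here’s a toy Python sketch that places the examples named above onto the two axes. The placements marked “my reading” are inferences from the discussion, not anything stated outright.

# Toy classification grid; (level, type) placements are approximate.
GRID = {
    "Amazon EBS":       ("infrastructure", "operational"),  # my reading: block access for running apps
    "SAN/NAS to cloud": ("infrastructure", "operational"),
    "key/value stores": ("platform",       "operational"),
    "Amazon S3":        ("platform",       "archival"),
    "EMC Atmos":        ("platform",       "archival"),
    "GoogleFS / HDFS":  ("platform",       "batch"),        # my reading of the level
}

def examples_for(level, storage_type):
    return [name for name, cell in GRID.items() if cell == (level, storage_type)]

print(examples_for("platform", "archival"))  # ['Amazon S3', 'EMC Atmos']

The infrastructure-plus-operational cell, crossed with cloud, is exactly the gap described above.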

That’s One Angry Cat

Apparently there’s a big debate about UTIs in Snow Leopard. No, not that kind of UTI, and not that kind of snow leopard. It just goes to show what an alternate universe we geeks live in sometimes. That’s why trying to Google for information about Malus domestica, or transparent wall openings, will often lead you to something else entirely.

Fixing Linux Bloat

Apparently, Linus himself has come out and said that Linux is getting bloated and huge.

“We’re getting bloated and huge. Yes, it’s a problem,” said Torvalds.

Asked what the community is doing to solve this, he balked. “Uh, I’d love to say we have a plan,” Torvalds replied to applause and chuckles from the audience. “I mean, sometimes it’s a bit sad that we are definitely not the streamlined, small, hyper-efficient kernel that I envisioned 15 years ago…The kernel is huge and bloated, and our icache footprint is scary. I mean, there is no question about that. And whenever we add a new feature, it only gets worse.”

I can’t help but wonder how much of the reason for this bloat is the general aversion among many Linux kernel hackers to stable kernel interfaces, with getting code into the mainline Linux tree preferred instead. Greg Kroah-Hartman has written eloquently on the subject, but I stand by my own dissent from almost two years ago. In addition to the objections I raised before, I believe that the “only one download” attitude is also part of the reason the kernel is bloated. Everyone pays not only to download and configure/build code for platforms and devices they’ll never see, but also to run core-kernel code that’s only there to support environments they’ll likewise never see. For example, a lot of core changes have been made to support various kinds of virtualization. Virtualization is a very valuable feature on which I myself often rely, but is it really fair to make everyone carry the baggage for a feature they don’t use? Might that be an example of an anti-patch philosophy contributing to the bloat Linus mentioned?

The problem is that, if you can’t do something as a completely separate module (and BTW it’s pretty amazing what you can do that way), then you have two choices: maintain your own patches forever, or get them into the mainline kernel where they’ll affect users in every environment all over the world. Both approaches are unpleasant. Maintaining your own patches across other people’s random kernel-interface changes is a pain. Dealing with all the LKML politics to get your patches accepted can be a pain too. What if there were a middle ground? What if the community were more patch-friendly, so that functionality requiring patches weren’t treated quite so shabbily? That would mean more stable kernel interfaces – not infinitely stable, but not subject to unilateral change every time one of the senior folks learns a new trick. It would also mean better ways of distributing patches alongside, rather than in, the kernel, such as a well-known area on kernel.org for real-time and virtualization and NUMA and similar major-feature patches. I recognize the problem of maintaining every possible combination, but shouldn’t users at least have a choice between well-tested “plain” and “everything” kernels? Wouldn’t that help address the bloat, with a minimum of pain for all involved? Shouldn’t we at least discuss alternatives to the model that led to things being so bloated that even Linus has commented on it?

Use The Right Word

From what is actually a very interesting paper:

To simplify our presentation, we assume that read and write operations always refer to entire chunks. We also assume that the size of a file grows monotonously

Hey, somebody might find that new data interesting!

What Does “Multi Tenant” Mean?

One of the terms that often comes up in conversations about cloud computing is multi-tenancy, but it’s generally left even less defined than “cloud” itself. This recently came up on the cloud mailing list, so it seems like a good time to take a stab at explaining what I think multi-tenancy means. First, just to get it out of the way, let’s say that a multi-tenant system is much like a multi-user system. In fact, a multi-tenant system is a multi-user system if “user” is defined the right way, but that’s where the differences start to creep in. A cloud provider actually interacts with two classes of users. For the sake of terminological clarity, I’ll refer to them thus:

  • A “tenant” is a person with whom the cloud provider has a contractual relationship, to whom they send bills, etc. Someone must be authenticated and authorized as a tenant to allocate or free cloud resources.
  • An “end user” is someone (or some program) outside the cloud, doing things that generate requests within the cloud. Tenants act as end users’ proxies facilitating access to the cloud, so some sort of relationship must exist, but it need not be a legal or financial one.

For the most part, the cloud provider is concerned with tenants. End users, and conflicts between end users, are the tenant’s problem. One of a tenant’s end users could do something that denies service to all of that tenant’s other end users, and the cloud provider need not care unless it starts to affect other tenants.

That brings us to the big difference between multi-user and multi-tenant. Tenants have Service Level Agreements. Users don’t. In multi-user systems, users routinely contend for resources, affect each other’s performance, etc. One example of this, and one closely related to cloud computing, is web hosting services. If you’re on a shared host, as I am, you’re in a multi-user world. Your account represents one user on a shared bit of hardware, with some access control between you and other users, but there’s practically no fault or performance isolation between you and them. You will be affected by their activity. Sure, most hosts have something in their TOS about not hogging resources, but enforcement of such terms is pretty random. The only effect I’ve ever seen was when a previous host had underprovisioned their database servers, and started pointing the “resource hog” finger at random customers whenever those servers faltered under the load. Anyone with any sense knows that to get any kind of real isolation you have to go from a multi-user system to a multi-tenant one – a Virtual Private Server. How is a VPS different from a virtual machine in a cloud? At a technical level, there’s hardly any difference. The main difference is that one is created by a user and billed by the hour while the other is created by an administrator and billed by the month. Either way you have a multi-tenant system, which is to say one that provides strong enough security/fault/performance isolation to support an SLA.

Does multi-tenancy require virtualization? Not necessarily. You could, in theory, enforce multi-tenant isolation in a non-virtual environment. You could start with a fair-share scheduler and filesystem quotas, providing isolation for two kinds of resources. You’d need to add similar isolation for memory, swap (space and activity), network and storage bandwidth, and so on. If you then provide each tenant with their own filesystem view, UID/PID space, and the rest, then you would have reinvented OS-level virtualization. If you just put each tenant in a group, and give them some way to allocate user IDs within the group, then I suppose you’d have a multi-tenant but non-virtual environment . . . but it seems like a lot of work to get something that you could have had in minutes with virtualization, and you still wouldn’t have the same level of fault isolation that virtualization gives you. Of course, you could also support multi-tenancy without virtualization by provisioning whole physical machines instead of virtual ones, but I have my doubts about whether such an approach is economically competitive with those that can allocate at a finer granularity. In some cases a virtual instance equivalent to a single 2GHz processor with 256MB of memory and 10GB of storage is all one needs, and therefore all one should pay for; paying more because physical machines don’t come that small any more won’t be very appealing.
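
For what it’s worth, the per-resource half of that by-hand approach is less work on a modern Linux host than it once was. Here’s a minimal sketch assuming a host with the cgroup v2 unified hierarchy mounted at /sys/fs/cgroup and the cpu, memory, and pids controllers enabled; the function name and the specific limits are invented for illustration, and you’d still need storage, network, and namespace isolation before approaching what virtualization gives you.

import os

CGROUP_ROOT = "/sys/fs/cgroup"  # assumes the cgroup v2 unified hierarchy

def carve_out_tenant(name, cpu_pct=25, mem_bytes=256 * 2**20, max_pids=200):
    """Give one tenant a bounded slice of CPU, memory, and process count.
    Must run as root; real systems would add I/O and network limits too."""
    path = os.path.join(CGROUP_ROOT, "tenant-" + name)
    os.makedirs(path, exist_ok=True)
    period = 100_000  # scheduling period in microseconds
    with open(os.path.join(path, "cpu.max"), "w") as f:
        f.write(f"{cpu_pct * period // 100} {period}")  # e.g. "25000 100000"
    with open(os.path.join(path, "memory.max"), "w") as f:
        f.write(str(mem_bytes))
    with open(os.path.join(path, "pids.max"), "w") as f:
        f.write(str(max_pids))

# carve_out_tenant("acme")  # then write each tenant PID into tenant-acme/cgroup.procs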

In the end, I think it’s all about isolation. A multi-tenant system is like individual apartments, providing each tenant with a certain level of isolation and control over their environment within a defined space for a defined period, enforced by a lease. By contrast, a multi-user system is more like a hostel. You have your own bunk, but you get to hear your neighbor snore and in the morning you eat whatever was cooked in the shared kitchen. They’re similar in some ways, notably that neither involves actual ownership of the property involved, but they’re also quite different in other ways.

Brevity is Not Power

In the context of responding to a post about C++, I realized that part of what I was addressing was the fairly common attitude that brevity in a programming language is indicative of the language’s power or expressiveness. This attitude is common in many communities, especially among perl programmers, but probably its best-known expression comes from Paul Graham.

making programs short is what high level languages are for. It may not be 100% accurate to say the power of a programming language is in inverse proportion to the length of programs written in it, but it’s damned close. Imagine how preposterous it would sound if someone said “The program is 10 lines of code in your language and 50 in my language, but my language is more powerful.” You’d be thinking: what does he mean by power, then?

Like many high-profile programmers, Paul tends to assume that if he can’t think of an answer then nobody can. He almost certainly meant the above as a rhetorical question with no good answer, but in fact it’s not hard to answer at all. A diesel-powered truck is likely to be more powerful than a Prius. It might take more to start it up, and it might be the wrong vehicle for a daily commute or a trip to the grocery store, but once it gets going it can do things that a Prius never could. In other words, power has to do with ultimate capability, not initial cost. What if modifying the 10-line program in Paul’s example to run across many processors required increasing its size by an order of magnitude, but modifying the 50-line program required not one extra line, because half of what the original fifty lines did was set up the infrastructure for exactly that? Which language is more powerful now? The same argument has recently played out at the server-framework level in discussions of Twisted vs. Tornado. Twisted is more complex and it’s harder to write simple apps in it, but few would last long arguing that it’s not also more powerful. (I’m not actually a big Twisted fan, BTW, but it does illustrate this particular point well.) Writing a shorter “hello world” program is not interesting. Writing a shorter complete application that does something real, in a real world where performance and scalability and robust handling of unusual conditions all matter, is much closer to the true measure of a language’s (or framework’s) power.

I say “much closer” because brevity does not truly equal power even in the context of a serious program. Part of my initial point about C++ is that so much of its brevity is bad brevity. If you have deep class hierarchies with complex constructors, and you use lots of operator overloading, then a single line of your C++ code might translate into many thousands of instructions. That same line of C++ code might, under other circumstances, result in only a few instructions. The problem is largely that the person writing that line might not know – might not even be able to find out without trying – what the results will be with respect to instructions and cache misses and page faults and stack depth and all the other things that it might be important to know. I would modify Graham’s claim by saying that recognizable and predictable brevity is an indicator of programming-language power. Any programmer in a given language will immediately recognize that certain constructs might cause lots of code to be emitted by a compiler or executed by an interpreter, most often by explicitly redirecting the flow of execution – loops, function calls, throwing exceptions, etc. Decent programmers in just about any language know that operating on large data structures, even to assign them, might incur a large cost. They don’t need to know how someone else wrote their part of the program to know which operations are likely to be expensive; they just need to know the language itself. Contrast this with C++, where a simple declaration or assignment of a simple object might or might not – through the “magic” of constructors and destructors, templates, smart pointers, and so on – take whole milliseconds to execute.
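
The trap isn’t unique to C++, either; any language with operator overloading can hide arbitrary cost behind an innocent-looking line. Here’s a small Python sketch of the phenomenon – the Matrix class is a toy stand-in invented for illustration, but the shape of the problem is the same: nothing at the call site hints at what “a + b” costs.

import time

class Matrix:
    def __init__(self, n):
        self.rows = [[0.0] * n for _ in range(n)]

    def __add__(self, other):
        # Overloading makes "a + b" look cheap at the call site while
        # hiding an O(n^2) allocation and copy -- the same trap as an
        # overloaded operator+ in C++.
        n = len(self.rows)
        out = Matrix(n)
        for i in range(n):
            for j in range(n):
                out.rows[i][j] = self.rows[i][j] + other.rows[i][j]
        return out

a, b = Matrix(1000), Matrix(1000)
start = time.time()
c = a + b  # one short, innocent-looking line...
print(f"{time.time() - start:.2f}s")  # ...hiding a million additions and a big allocation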

If a language allows you to express a complicated operation simply, with no ambiguity and with the execution complexity clearly recognizable by any competent programmer in that language, then that might be a legitimate marker of the language’s power and expressiveness. Paul Graham’s Arc language might in fact be considered powerful by that standard. On the other hand, if understanding a single simple line of code’s likely execution complexity requires careful study of code elsewhere, then that language might not be more powerful or expressive. It might just be more obfuscated, or even broken. C++ completely fails this test, though it’s worth noting that close cousins on either side – C and Java – do much better. Even perl does better, despite being terrible in other ways. The real point of brevity is not fewer lines or characters, but more efficient communication of intent and effects to another programmer. If your three lines communicate those things clearly, then congratulations. If they just leave the reader confused or needing to look elsewhere, then you have abused whatever language you’re writing in. If that language makes it inevitable for abuse to crowd out good use, then it is a bad language and programmers interested in writing good code should avoid it.

File Transfer Fun

I’ve written before about the extreme usefulness of sshfs for accessing files remotely without having to install server software. It continues to be an important part of my toolbox, as does its cousin CurlFtpFS which – unlike sshfs – I can even use to mount a directory here on my web host. Of course, either becomes even more useful when combined with some easy method of synchronization. You probably have rsync already. You can also use Unison if you need bidirectional synchronization – which you generally will if you’re trying to use a single directory somewhere as a “drop box” to share between multiple machines.

I just found another sshfs trick today. If the connection between your two machines is already secure – e.g. on the same private network or connected via a secure VPN/tunnel – you might want to avoid an extra round of encryption and decryption by using the “-o directport” option to sshfs. This causes sshfs to bypass all of the ssh stuff and just create a simple TCP connection . . . but what should you run at the other end? Here’s where socat (another extra-useful tool) comes in handy. On the server end (assuming a pretty standard Linux setup), you can just run this:

socat TCP4-LISTEN:7777 EXEC:/usr/lib/sftp-server

Then, on the client:

sshfs -o directport=7777 remote:/dir /local/dir

Now you have a completely insecure TCP connection between the two machines, so you’d better not do this directly over the internet without some other way of securing things at a lower level. It’s still pretty handy, though, and I couldn’t find this trick mentioned anywhere else, so there it is.

Shifting Alliances

Apparently, 3PAR is partnering with Exanet now, using Exanet’s clustered NAS in front of 3PAR’s storage. One can’t help but wonder whether this is a reaction to 3PAR’s previous clustered-NAS partner ONStor getting bought out by 3PAR competitor LSI. Such is life in the computer industry. Each company has its own web of partnerships, and occasionally has to scramble a bit because of changes somewhere in that web. Did a partner get bought out? Did a supplier enter a strategic relationship with a competitor? Did a partner’s partner do something that might force you to change? It’s particularly hard on the little guys, but it affects the big guys too. I’m sure some of Sun’s partners have had to make strategic shifts because of the Oracle buy-out, just as they might have had to make different shifts if the buyer had been IBM instead.