Pomegranate First Thoughts

Pomegranate is a new distributed filesystem, apparently oriented toward serving many small files efficiently (thanks to @al3xandru for the link). Here are some fairly disconnected thoughts/impressions.

  • The HS article says that “Pomegranate should be the first file system that is built over tabular storage” but that’s not really accurate. For one thing, Pomegranate is only partially based on tabular storage for metadata, and relies on another distributed filesystem – Lustre is mentioned several times – for bulk data access. I’d say Ceph is more truly based on tabular storage (RADOS) and it’s far more mature than Pomegranate. I also feel a need to mention my own CassFS and VoldFS, and Artur Bergman’s RiakFuse, as filesystems that are completely based on tabular storage. They’re not fully mature production-ready systems, but they are counterexamples to the original claim.
  • One way of looking at Pomegranate is that they’ve essentially replaced the metadata layer from Lustre/PVFS/Ceph/pNFS with their own while continuing to rely on the underlying DFS for data. Perhaps this makes Pomegranate more of a meta-filesystem or filesystem sharding/caching layer than a full filesystem in and of itself, but there’s nothing wrong with that just as there’s nothing wrong with similar sharding/caching layers for databases. Compared to Lustre, this is a significant step forward since Pomegranate’s metadata is fully distributed. Compared to Ceph, though, it’s not so clearly innovative. Ceph already has a distributed metadata layer, based on sophisticated algorithms to distribute load dynamically. Pomegranate’s use of ring-based consistent hashing suits my own preference a little better than Ceph’s tree-based approach (CRUSH), but there are many kinds of ring-based hashing and it looks like Pomegranate won’t really catch up to Ceph in this regard until their scheme is tweaked a few times.
  • I’m really not wild about the whole “in-memory architecture” thing. If your update didn’t make it to disk because it was at the end of the in-memory queue and hadn’t been flushed yet, that’s no better for reliability than if you just left it in memory forever (though it does improve capacity), and if you acknowledged the write as complete then you lied to the user. Prompted by some of the hyper-critical and hypocritical comments I’ve seen lately bashing one project for lack of durability, I have another blog post I’m working on about how the critics’ own toys can lose or corrupt data, and how claiming superior durability while using “unsafe” settings for benchmarks is dishonest, so I’ll defer most of that conversation for now. Suffice it to say that if I were to deploy Pomegranate in production one of the first things I’d do would be to force the cache to be properly write-through instead of write-back.
  • I can see how the Pomegranate scheme efficiently supports looking up a single file among billions, even in one directory (though the actual efficacy of the approach seems unproven). What’s less clear is how well it handles listing all those files, which is kind of a separate problem similar to range queries in a distributed K/V store. This is something I spent a lot of time pondering for VoldFS, and I’m rather proud of the solution I came up with. I think that solution might be applicable to Pomegranate as well, but need to investigate further. Can Ma, if you read this, I’d love to brainstorm further on this.
  • Another thing I wonder about is the scalability of Pomegranate’s approach to complex operations like rename. There’s some mention of a “reliable multisite update service” but without details it’s hard to reason further. This is a very important issue because this is exactly where several efforts to distribute metadata in other projects – notably Lustre – have foundered. It’s a very very hard problem, so if one’s goal is to create something “worthy for [the] file system community” then this would be a great area to explore further.
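To make the ring-based hashing point concrete, here’s a minimal consistent-hashing sketch in Python. To be clear, this is my own illustration, not Pomegranate’s actual scheme; the node names, vnode count, and hash choice are all placeholders. Keys hash to positions on a ring, each metadata server owns the arc behind its points, and virtual nodes smooth out the load.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable 64-bit ring position derived from MD5 (fine for a sketch).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    """Toy consistent-hashing ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        ring = []
        for node in nodes:
            for i in range(vnodes):           # vnodes smooth the key distribution
                ring.append((_hash(f"{node}#{i}"), node))
        ring.sort()
        self._points = [p for p, _ in ring]   # sorted ring positions
        self._nodes = [n for _, n in ring]    # owner at each position

    def lookup(self, key: str) -> str:
        # The first ring point clockwise from the key's position owns it.
        i = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._nodes[i]
```

The appeal over naive modulo placement is that growing the cluster from N to N+1 servers remaps only about 1/(N+1) of the keys instead of nearly all of them; most of the tweaking such schemes need in practice is around vnode counts and rebalancing policy.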

Some of those points might seem like criticism, but they’re not intended that way – or at least they’re intended as constructive criticism. They’re things I’m curious about, because I know they’re both difficult and under-appreciated by those outside the filesystem community, and they’re questions I couldn’t answer from a cursory examination of the available material. I hope to examine and discuss these issues further, because Pomegranate really does look like an interesting and welcome addition to this space.

The Oracle/Google Patents

A lot of people are commenting on the Oracle/Google suit without having looked at the patents involved. That’s a bad idea, guaranteed to yield incorrect conclusions. For reference, here are the ones actually mentioned in the formal complaint . . . and yes, I did enjoy looking these up on Google.

  • 6125447: Protection domains to provide security in a computer system
  • 6192476: Controlling access to a resource
  • 5966702: Method and apparatus for pre-processing and packaging class files
  • 7426720: System and method for dynamic preloading of classes through memory space cloning of a master runtime system process
  • RE38,104: Method and apparatus for resolving data references in generated code
  • 6910205: Interpreting functions utilizing a hybrid of virtual and native machine
  • 6061520: Method and system for performing static initialization

First thing to remember is that this is a patent suit, not a copyright suit. That means it’s not about “Java” at all. It’s about certain ways of implementing a dynamic runtime, regardless of what name or input language is used. In that context, 5966702 is probably the most specific to Oracle’s actual Java-runtime technology, and that’s all about class files. The others are pretty general ideas, even if the Java runtime was the first embodiment used in the patent descriptions. For purposes of determining infringement, it’s mostly the claims – not the description – that matter. It’s probably quite premature for anybody who hasn’t looked at the Dalvik code to say whether it infringes most of these patents or not, or whether Google could avoid infringing on these claims without fundamentally changing how Dalvik works.

NoSQL and Cloud Security

By now, most people interested in NoSQL and cloud storage and so on have probably seen the story of go-derper, which demonstrates two things.

  1. Memcached has no security of its own.
  2. Many people deploy memcached to be generally accessible.

Obviously, this is a recipe for disaster. Less obviously, the problem is hardly limited to memcached. Most NoSQL stores have no concept of security. They’ll let anyone connect and fetch or overwrite any object. One of the best known doesn’t even check that input is well formed, so “cat /dev/urandom | nc $host $port” from anywhere would crash it quickly. Among all of the other differences between SQL and NoSQL systems – ACID, joins, normalization and referential integrity, scalability and partition tolerance, etc. – the near-total abandonment of security in NoSQL is rarely mentioned.

Lest it seem that I’m throwing stones from some other garden, I’d have to say many filesystems hardly fare any better. For example, I generally like GlusterFS but it provides only the most basic kind of protection against information leakage or tampering. As a POSIX filesystem it at least has a notion of authorization between users, but it does practically nothing to authenticate those users and authorization without authentication is meaningless. The system-level authorization to connect is trivially crackable, and once I’ve done that I can easily spoof any user ID I want – including root.

I’ve had to make the point over and over again in presentations that cloud storage in general – regardless of type – is usually only suitable for deployment within a single user’s instances, protected by those instances’ firewalls and sharing a common UID space. For most such stores, if a cloud provider wants to offer it as a public, shared, permanent service separate from compute instances, a lot more work needs to be done.

What kind of work? Mostly it falls into two categories: encryption and authentication/authorization (collectively “auth”). For encryption, there’s a further distinction to be made between on-the-wire and at-rest encryption. A lot of cloud-storage vendors make all sorts of noise about their on-the-wire encryption, but they stay quiet or vague about at-rest encryption, and that’s actually more important.

The biggest threat to your data is insiders, not outsiders. The insiders aren’t even going on the wire, so all of that AES-256 encryption there doesn’t matter a bit. Insiders should also be assumed to have access to any keys you’ve given the provider, so the only way you can really be sure nobody’s messing with your data is if you never give them unencrypted data or keys for that data. Your data must remain encrypted from the moment it leaves your control until the moment it returns again, using keys that only you possess. I know how much of a pain that is, believe me. I’ve had to work through the details of how to reconcile this level of security with multi-level caching and byte addressability in CloudFS, but it’s the only way to be secure. Vendors’ descriptions of what they’re doing in this area tend to be vague, as I said, but Nasuni is the only one who visibly seems to be on the right track. It sure would be nice if people could get that functionality through open source, instead of paying both a software and a storage provider to get it. Cue appearance by Zooko to plug Tahoe-LAFS in 5, 4, 3, …
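To illustrate the “keys that only you possess” idea with nothing but the standard library: even before you get to full encryption, a client-side HMAC lets you detect insider tampering, because the provider stores only the object and its tag and never sees the key. This sketch shows only the integrity half of at-rest protection (real deployments must encrypt the bytes too), and all the names here are mine, not any vendor’s API.

```python
import hashlib
import hmac
import os

def seal(key: bytes, data: bytes) -> bytes:
    # Tag computed on the client; unforgeable without the key.
    return hmac.new(key, data, hashlib.sha256).digest()

def verify(key: bytes, data: bytes, tag: bytes) -> bool:
    # Constant-time comparison to avoid timing leaks.
    return hmac.compare_digest(seal(key, data), tag)

key = os.urandom(32)      # never leaves the client
blob = b"customer record"
tag = seal(key, blob)     # stored alongside the object at the provider

assert verify(key, blob, tag)
assert not verify(key, b"tampered by an insider", tag)
```

The point is structural: the provider holds (blob, tag) but can neither read a properly encrypted blob nor forge a tag, so “trust us” drops out of the threat model.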

The other area where work needs to be done is handling user identities, which covers both auth and identity mapping. For starters, the storage system must internally enforce permissions between users, which of course means it must have a notion of there even being multiple users. For systems which can assume that a single connection belongs to a single user, you can then authenticate using SASL or similar and be well on your way to a full solution. For systems that can’t make such an assumption, which includes things like filesystems, that’s not sufficient. You need to identify and authenticate not just the system making a request, but the user as well. I’m not a security extremist, so I can accept the argument that if you can fully authenticate a system and communicate with it through a secure channel then you can trust it to identify users correctly. The alternative is something like GSSAPI, which requires less trust in the remote system but can be a pretty major pain to implement.

The last issue is identity mapping. Even if you can ensure that a remote system is providing the correct user IDs, those IDs are still only correct in their context. If you’re a cloud service provider, you really can’t assume that tenant A’s user X is the same as tenant B’s user X. Therefore, you need to map A:X and B:X to some global users P and Q. Because you might need to store these IDs and then return them later (e.g. on a stat() call if you’re a filesystem) you need to be able to do the reverse mapping back to A:X and B:X as well. Lastly, because cloud tenants can and will create new users willy-nilly, you can’t require pre-registration; you need to create new mappings on the fly, whenever you see a new ID. This ends up becoming pretty entangled with the authentication problem because authentication information needs to be looked up based on the global (not per-tenant) ID, so this can all be a big pain but – again – it’s the only way to be secure.
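A toy version of that mapping layer might look like the following. All the names are hypothetical, and a real service would persist the tables and serialize concurrent ID allocation; this just shows the forward map, the reverse map needed for stat()-style calls, and on-the-fly creation.

```python
import itertools

class IdentityMap:
    """Sketch of on-the-fly tenant-to-global user ID mapping."""

    def __init__(self):
        self._fwd = {}                        # (tenant, local_uid) -> global_uid
        self._rev = {}                        # global_uid -> (tenant, local_uid)
        self._next = itertools.count(10000)   # arbitrary global UID range

    def to_global(self, tenant: str, uid: int) -> int:
        key = (tenant, uid)
        if key not in self._fwd:              # no pre-registration: map on first sight
            g = next(self._next)
            self._fwd[key] = g
            self._rev[g] = key
        return self._fwd[key]

    def to_local(self, global_uid: int):
        # Reverse mapping, e.g. to report ownership back to the right tenant.
        return self._rev[global_uid]
```

Note that tenant A’s user 500 and tenant B’s user 500 get distinct global IDs, which is exactly the property that keeps one tenant’s permissions from bleeding into another’s.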

To sum up, the lesson of go-derper is not that memcached is uniquely bad. Lots of systems are equally bad, and making them less bad is going to be hard, but it needs to be done before the other promises made by those systems can be realized. For a great many people, systems that are so totally insecure are useless, no matter what other wonderful functionality they might provide.

Spread of a Meme

I find it fascinating how links get distributed over time. Here’s an example involving the amazing pencil-tip sculptures by Dalton Ghetti, and the times I’ve been presented with the same link in Google Reader.

I predict that it will show up on at least one more website I follow. In maybe a week or so, I’ll see it in print media for the first time. Another week or two after that, I’ll see it in the Boston Globe. That’s the usual pattern, anyway.

Language Specific Package Managers

As many people I’ve talked to IRL probably know, I really hate language-specific package managers. Java has several; Python, Ruby, Erlang, and the rest each have their own. I totally understand the temptation. I know it’s not all about NIH Syndrome (though some is); some of it’s about Getting Stuff Done as well. Consider the following example. I tried to install Tornado using yum.

[root@fserver-1 repo]# yum install python-tornado
Loaded plugins: presto
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package python-tornado.noarch 0:1.0-2.fc15 set to be updated

(hundreds of lines of dependency stuff)

Transaction Summary
Install      13 Package(s)
Upgrade     161 Package(s)

Total download size: 156 M
Is this ok [y/N]: n

Is this OK? Are you kidding? Of course it’s not OK, especially when I can see that the list includes things like gcc, vim, and yum itself. I know how systems get broken, and that’s exactly how. By way of contrast, let’s see how it goes with easy_install.

[root@fserver-1 repo]# easy_install tornado
Searching for tornado
Reading http://pypi.python.org/simple/tornado/
Reading http://www.tornadoweb.org/
Best match: tornado 1.0
Downloading http://github.com/downloads/facebook/tornado/tornado-1.0.tar.gz
Processing tornado-1.0.tar.gz
Running tornado-1.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-6Wcauv/tornado-1.0/egg-dist-tmp-NEPqMm
warning: no files found matching '*.png' under directory 'demos'
zip_safe flag not set; analyzing archive contents...
tornado.autoreload: module references __file__
Adding tornado 1.0 to easy-install.pth file

Installed /usr/lib/python2.6/site-packages/tornado-1.0-py2.6.egg
Processing dependencies for tornado
Finished processing dependencies for tornado

Yeah, I see the appeal. On one hand, hours spent either rebuilding a broken system or debugging the problems that are inevitable when 161 packages get updated. On the other hand, Getting Stuff Done in about a minute. Yes, I tested, and the result does work fine with the packages/versions I already had. Still, though, having to do things this way is awful. It’s bad enough that there are still separate package managers for different Linux distros, but now programmers need to have several different package managers on one system just to install the libraries and utilities they need. Worse still, most of these language-specific package managers suck. None of them handle licensing, and few of them handle dependency resolution in any kind of sane way. One of the most popular Java package managers doesn’t even ask before downloading half the internet with no version or authenticity checking to speak of. Good-bye, repeatable builds. Hello, Trojan horses. I can see (above) the problems of having One Package Manager To Rule Them All, or of having dependency resolution be too strict, but there has to be a better way.

What if the system package manager could delegate to a language-specific package manager when appropriate (e.g. yum delegating to easy_install in my example)? Then the system package manager could save itself a lot of work in such cases, and also avoid violating the Principle of Least Surprise when installing in the “standard way” for the system yields different results than installing in the “standard way” for the language. There’d still be difficult cases when dependencies cross language barriers, but those are cases that the system package manager already has to deal with. I know there are a lot of details to work out (especially wrt a common format for communicating what’s wanted and what actually happened), possibly there’s even some fatal flaw in this approach, but my first guess is that a federation/delegation model is likely to be better than an everyone-conflicting model.
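As a very rough sketch of what that delegation might look like, the system package manager could keep a registry of language-specific backends and hand off namespaced packages to them. The backend commands, the prefix convention, and the function names below are all made up for illustration; the real work would be in the common format for reporting what each backend actually did.

```python
# Registry mapping a language prefix to the backend command that
# would actually perform the install (illustrative, not exhaustive).
BACKENDS = {
    "python": ["easy_install"],
    "ruby":   ["gem", "install"],
    "perl":   ["cpan"],
}

def delegate(package: str):
    """Return the backend command for a namespaced package like
    'python-tornado', or None if the system manager should handle it
    itself (e.g. 'gcc')."""
    lang, sep, name = package.partition("-")
    if sep and lang in BACKENDS:
        return BACKENDS[lang] + [name]
    return None
```

So delegate("python-tornado") yields ["easy_install", "tornado"], while delegate("gcc") returns None and stays with yum. The hard part this glosses over is exactly the one mentioned above: dependencies that cross the language barrier in either direction.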

Server Design Revisited

About eight years ago, I wrote a series of posts about server design, which I then combined into one post. That was also a time when debates were raging about multi-threaded vs. event-based programming models, about the benefits and drawbacks of TCP, etc. For a long time my posts on those subjects, with that server-design article as their centerpiece, constituted my main claim to fame in the tech-blogging community, until more recent posts on startup failures and the CAP theorem and language wars started reaching an even broader audience. Now some of those old debates have been revived, and Matt Welsh has written a SEDA retrospective, so maybe it’s a good time for me to follow suit and see what I and the rest of the community have learned since then.

Before I start talking about the Four Horsemen of Poor Performance, it’s worth establishing a bit of context. Processors have actually not gotten a lot faster in terms of raw clock speed since 2002 – Intel was introducing a 2.8GHz Pentium 4 then – but they’ve all gone multi-core with bigger caches and faster buses and such. Memory and disk sizes have gotten much bigger; speeds have increased less, but still significantly. Gigabit Ethernet was at the same stage back then that 10GbE is at today. Java has gone from being the cool new kid on the block to being the grumpy old man the new cool kids make fun of, with nary a moment spent in between. Virtualization and cloud have become commonplace. Technologies like map/reduce and NoSQL have offered new solutions to data problems, and created new needs as well. All of the tradeoffs have changed, and of course we’ve learned a bit as well. Has any of that changed how the Four Horsemen ride?