Pomegranate is a new distributed filesystem, apparently oriented toward serving many small files efficiently (thanks to @al3xandru for the link). Here are some fairly disconnected thoughts/impressions.
- The HS article says that “Pomegranate should be the first file system that is built over tabular storage” but that’s not really accurate. For one thing, Pomegranate is only partially based on tabular storage for metadata, and relies on another distributed filesystem – Lustre is mentioned several times – for bulk data access. I’d say Ceph is more truly based on tabular storage (RADOS) and it’s far more mature than Pomegranate. I also feel a need to mention my own CassFS and VoldFS, and Artur Bergman’s RiakFuse, as filesystems that are completely based on tabular storage. They’re not fully mature production-ready systems, but they are counterexamples to the original claim.
- One way of looking at Pomegranate is that they’ve essentially replaced the metadata layer from Lustre/PVFS/Ceph/pNFS with their own while continuing to rely on the underlying DFS for data. Perhaps this makes Pomegranate more of a meta-filesystem or filesystem sharding/caching layer than a full filesystem in and of itself, but there’s nothing wrong with that just as there’s nothing wrong with similar sharding/caching layers for databases. Compared to Lustre, this is a significant step forward since Pomegranate’s metadata is fully distributed. Compared to Ceph, though, it’s not so clearly innovative. Ceph already has a distributed metadata layer, based on advanced distribution algorithms to distribute load etc. Pomegranate’s use of ring-based consistent hashing suits my own preference a little better than Ceph’s tree-based approach (CRUSH), but there are many kinds of ring-based hashing and it looks like Pomegranate won’t really catch up to Ceph in this regard until their scheme is tweaked a few times.
- I’m really not wild about the whole “in-memory architecture” thing. If your update didn’t make it to disk because it was at the end of the in-memory queue and hadn’t been flushed yet, that’s no better for reliability than if you just left it in memory for ever (though it does improve capacity) and if you acknowledged the write as complete then you lied to the user. Prompted by some of the hyper-critical and hypocritical comments I’ve seen lately bashing one project for lack of durability, I have another blog post I’m working on about how the critics’ own toys can lose or corrupt data, and how claiming superior durability while using “unsafe” settings for benchmarks is dishonest, so I’ll defer most of that conversation for now. Suffice it to say that if I were to deploy Pomegranate in production one of the first things I’d do would be to force the cache to be properly write-through instead of write-back.
- I can see how the Pomegranate scheme efficiently supports looking up a single file among billions, even in one directory (though the actual efficacy of the approach seems unproven). What’s less clear is how well it handles listing all those files, which is kind of a separate problem similar to range queries in a distributed K/V store. This is something I spent a lot of time pondering for VoldFS, and I’m rather proud of the solution I came up with. I think that solution might be applicable to Pomegranate as well, but need to investigate further. Can Ma, if you read this, I’d love to brainstorm further on this.
- Another thing I wonder about is the scalability of Pomegranate’s approach to complex operations like rename. There’s some mention of a “reliable multisite update service” but without details it’s hard to reason further. This is a very important issue because this is exactly where several efforts to distribute metadata in other projects – notably Lustre – have foundered. It’s a very very hard problem, so if one’s goal is to create something “worthy for [the] file system community” then this would be a great area to explore further.
Some of those points might seem like criticism, but they’re not intended that way – or at least they’re intended as constructive criticism. They’re things I’m curious about, because I know they’re both difficult and under-appreciated by those outside the filesystem community, and they’re questions I couldn’t answer from a cursory examination of the available material. I hope to examine and discuss these issues further, because Pomegranate really does look like an interesting and welcome addition to this space.
Very good post. Can you elaborate on how VoldFS get around directory listing, directory/file renaming?
Thanks, juri. To answer your questions, I think I first need to describe some of the “principles of operation” for VoldFS. Here’s the basic flow, for both file and directory operations.
(1) Convert the inode number to a key in the underlying data store.
(2) Fetch the inode and current version.
(3) Do all updates to data and indirect blocks (the whole thing is B-tree-ish) via COW using new keys.
(4) Do a conditional update – Voldemort maybe_put or memcache “cas” – using the previously saved version.
(5) If a conflict is detected, go back to (2).
I try to preserve as much prior work as possible on a retry, but there’s probably more I can do. There are also issues with ensuring forward progress, reaping superseded blocks, etc. that I still have to deal with. The important thing is that this scheme allows a single operation to update arbitrarily many keys/blocks atomically without ever leaving the filesystem in an inconsistent state and without locking.
OK, now on to directories. For the most part I use exactly the same data structures and algorithms for directory entries as I do for data blocks. They’re different sizes, of course, so I pack them differently into keys in the underlying store. The other major difference is that data blocks are inserted into the tree by block number, whereas directory entries are inserted using a hash of the filename. Thus, listing a directory of any size is a pretty straightforward tree walk. The tree could theoretically become unbalanced, but since I’m using a hash instead of the actual filename then any semi-decent hash function should prevent the problem from being too severe.
Lastly, about renames. I haven’t actually implemented renames in VoldFS yet, but the basic plan is two-fold. First, it’s only a rename if it’s in the same directory; anything else is a move and might fail if it’s across mountpoints. There’s a lot of machinery already to handle EXDEV by falling back to copy-and-unlink, which nobody expects to be atomic, so I just punt and let that machinery take over. That leaves only the case of renames within a directory. The easy thing to do would be link-new followed by unlink-old, but that wouldn’t be atomic. If anybody ever convinces me that there’s an environment where atomic rename would be useful and using VoldFS isn’t completely insane for other reasons, then I don’t think it would be too horrible to do both directory rewrites simultaneously using the same atomic-update mechanism described above.
Jeff,
Thanks for the explanation, but what is wrong with the naive approach: Use the directory path as the key, use the this key to find meta data information of all the files in that directory:
foobar — Key
{
a.inc 100 01-02-99
b.c 200 02-03-99 } — Value
Of course, it could grow really large when dealing lots of files. But even with B+ tree, you could only have one effective index, that doesn’t really solve the range query problem, does it?
The problem occurs when you’re dealing with lots of files – millions, say. If you’re using filenames to insert into the B*-tree then (a) you have to do string compares instead of numeric and (b) you have to actively keep the tree balanced in case of pathological insertion patterns such as files that are already in lexical order – which is actually pretty common BTW. By using hashes VoldFS avoids both problems: comparisons are simple numeric compares and the hashes should be distributed evenly enough to avoid any imbalance worth worrying about.
As for solving the general range-query problem, that was not my intent. I said the problem is *similar* to the range-query problem, and it is, but you’re right that there’s only one index. Worse still, that index is on the file *hash* which supports the necessary “M-entries-starting-at-N” VFS/FUSE interface but is pretty useless for any other purpose. Anybody who cares about lexical order will have to do their own sorting, but that’s kind of the price one pays for the other benefits. There ain’t no such thing as a free lunch. ;)
Thanks Jeff. Good questions!
1. We originally claim “Pomegranate should be the first file system that is built over tabular storage” for the following reasons:
We have implement a truely tabular storage, named xTable, underlying Pomegranate. It supports query like to find a table cell. Thus, it is different from Ceph’s RADOS both in API and the internal implementation.
However, we do not notice your CassFS and VoldFS, and Artur Beergman’s PiakFuse. Therefor, we withdraw our claim to be the first one. Thanks for your reference:)
2. Actually, the small files’ data are not stored on underlying DFS, they are stored on our underlying tabular storage. We just relay large files to underlying DFS. Pomegranate manages both metadata and small files’ data.
3. For posix write, we do not guarantee the data hit disk on the return of write syscall. However, if we see a fsync, we trigger our caching layer to snapshot the memory image and commit the modifications to disk. We can also keep our promise that in 5 or other customized seconds later, the modified files will be on
disk. Thus, this 5 seconds promise can catch up with ext3. The small files’s data could be not cached in caching layer. In our test, we set file data write-through instead of write-back. However, in our test we also use our dynamic snapshot interval adjustment to delay the write-backs to absorb more metadata modifications.
4. Pomegranate supports POSIX readdir() operation. First, let me describe distributed extendible hash that we used in Pomegranate.
1) Tranditional extendible hash(EH) can extend by explot the two level indexes: the directory and buckets. On inserting, if the bucket is overflow, new bucket are generated to absort outflows. If bucket slots are full, the directory is extended exponentially. This approach is efficient both for small numbers of entries and for large numbers of entries.
2) We use GIGA+ like distributed extendible hash for locating a table slice (bucket in EH). The directory in EH is represented as a distributed bitmap. Each bits means a bucket exists or not.
On list a directory, the client firstly request to get the most accurate bitmap of the directory. Then, it use this bitmap to iterate and retrieve the existing table slices. Each table slice contains some table rows(fs’s directory entries).
5. We have optimize complex operations such as mkdir, rmdir, hard link. While, rename is not optimized yet. It is slow for now. Yep, the reliable update service should be the qualitative change from tabular storage to file system. In general, the update service accepts delta update requests and execute as follows:
1) log the delta updates to disk in source site’s snapshot;
2) transfer the delta updates to sink site’s memory;
3) update sink site’s memory;
4) commit to disk in sink site’s snapshot;
5) sink site acknowledge the request to source site;
6) mark in the log that the update completed.
There are so many stages in the execution, however, the node failures can be recovered from the logs, memory state, and on-disk state. It is truely very complex, and implementation and validation are on-going.
Thanks for your constructive questions! We are glad to receive your valuable feedbacks:)