As someone who was once hired to work on a “cloud file system” I was quite intrigued by this tweet from Kir Kolyshkin.
@kpisman This is our very own distributed fs, somewhat similar to Gluster or CEPH but (of course) better. http://t.co/DlysXvve
Trying to ignore the fact that what the link describes is explicitly not a real filesystem, I immediately responded that the file/block discussion seemed misguided, and – more importantly – the code seemed to be MIA. The link is not to an implementation, in either source or binary form. It’s not even to an architecture or design. It’s just a discussion of high-level requirements, similar to what I did for HekaFS before I even considered writing the first line of code. Naturally, Kir challenged me to elaborate, so I will. Let’s start with what he has to say about scalability.
It’s interesting to note that a 64-node rack cluster with a fast Ethernet switch supporting fabric switching technology can, using nothing more than 1Gb network cards and fairly run-of-the-mill SATA devices, deliver an aggregate storage bandwidth of around 50GB/s
I’ve actually seen a storage system deliver 50GB/s. I doubt that Kir has, because it’s not that common and if he had I’m pretty sure it would be mentioned somewhere in the document. Even if we assume dual Gb/s full-duplex NICs per node, that’s only 250MB/s/node or 16GB/s total. At 64 nodes per rack I don’t think you’re going to be cramming in more NICs, plus switches, so basically he’s just off by 3x. I work on the same kind of distributed “scale-out” storage he’s talking about, so I’m well aware of how claims like that should and do set off alarm bells for anybody who’s serious about this kind of thing. Let’s move on to the point I originally addressed.
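For anyone who wants to check my arithmetic, here is a quick back-of-the-envelope sketch. The node count and the claimed 50GB/s come from the quote; the two 1Gb/s NICs per node are my own generous assumption, and I count only one direction of the full-duplex links and ignore protocol overhead entirely.

```python
# Back-of-the-envelope bandwidth check. Figures are assumptions from the text,
# not measurements: 64 nodes, two 1Gb/s NICs each, one direction counted.
NODES = 64
NICS_PER_NODE = 2        # assumed; the quoted document only says "1Gb network cards"
NIC_GBIT_PER_S = 1
CLAIMED_GB_PER_S = 50    # the figure from the quoted document


def aggregate_gb_per_s(nodes, nics_per_node, gbit_per_nic):
    """Best-case aggregate network bandwidth in GB/s (decimal units)."""
    mb_per_node = nics_per_node * gbit_per_nic * 1000 / 8  # 8 bits per byte
    return nodes * mb_per_node / 1000


total = aggregate_gb_per_s(NODES, NICS_PER_NODE, NIC_GBIT_PER_S)
ratio = CLAIMED_GB_PER_S / total
print(f"best case: {total:.0f} GB/s, claim is {ratio:.1f}x higher")
```

Even this best case leaves out replication traffic and disk limits, so the real gap is likely wider.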
each VE root contains a large number of small files, and aggregating them in a file environment causes the file server to see a massively growing number of objects. As a result, metadata operations will run into bottlenecks. To explain this problem further: if each root has N objects and there are M roots, tracking the combined objects will require an N times M scaling of effort.
How does this require “N times M” effort any more for M>1 servers than for M=1? The only explanation I can think of is that Kir is thinking of each client needing to have a full map of all objects, but that’s simply not the case. Clients can cache the locations of objects they care about and look up any locations not already in cache. With techniques such as consistent hashing, even those rare lookups won’t be terribly expensive. Servers only care about their own objects, so “N times M” isn’t true for any entity in the system. This is not entirely a solved problem, but both GlusterFS and Ceph (among many others) have been doing things this way for years so anybody claiming to have innovated in this space should exhibit awareness of the possibility. Let’s move on.
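To make the point concrete, here is a minimal sketch of client-side lookup with consistent hashing plus a location cache. This is generic textbook consistent hashing, not the actual placement algorithm of GlusterFS or Ceph, and every name in it (`HashRing`, `Client`, the server names, the `vnodes` count) is made up for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes per server."""

    def __init__(self, servers, vnodes=64):
        self._points = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._points]

    def locate(self, path: str) -> str:
        # First ring point clockwise from the object's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, _hash(path)) % len(self._keys)
        return self._points[idx][1]


class Client:
    """Caches locations only for the objects this client actually touches."""

    def __init__(self, ring: HashRing):
        self.ring = ring
        self.cache = {}

    def server_for(self, path: str) -> str:
        if path not in self.cache:  # rare miss: computed locally, no global map
            self.cache[path] = self.ring.locate(path)
        return self.cache[path]


client = Client(HashRing(["server-a", "server-b", "server-c"]))
print(client.server_for("/ve/101/etc/passwd"), len(client.cache))
```

The client's state grows with the number of objects it uses, not with N times M, and each server only ever tracks the objects that hash to it.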
use of sparse objects typically is not of interest to hosting providers because they already generally have more storage than they need.
O RLY? My customers – who I’d guess are probably more “enterprise-y” than Kir’s – certainly don’t seem to be wallowing in idle storage. On the contrary, they seem to be buying lots of new storage all the time and are very sensitive to its cost. That’s why one of the most frequently requested features for GlusterFS is “network RAID” or erasure coding instead of full-out replication, and deduplication/compression are close behind. They’re all geared toward wringing the most out of the storage people already have so that they don’t need to buy more. That hardly sounds like “more than they need”, does it?
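A rough sketch of why people ask for erasure coding: compare how much raw disk each scheme burns per byte of user data. The specific parameters here (3-way replication and a 4+2 code) are illustrative assumptions of mine, not anything GlusterFS ships or any customer's actual configuration.

```python
def raw_per_usable_replication(copies: int) -> float:
    """Raw bytes stored per byte of user data with N full copies."""
    return float(copies)


def raw_per_usable_erasure(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per byte of user data with a k+m erasure code."""
    return (data_fragments + parity_fragments) / data_fragments


# Both example schemes survive the loss of any two copies/fragments.
replication = raw_per_usable_replication(3)   # three full copies
erasure = raw_per_usable_erasure(4, 2)        # 4 data + 2 parity fragments
print(f"replication: {replication}x raw, erasure coding: {erasure}x raw")
```

Under those assumptions the erasure-coded layout needs half the raw capacity for the same failure tolerance, which is exactly the kind of saving cost-sensitive buyers care about.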
Because of these misunderstandings, I don’t think Parallels “cloud storage” is really comparable to GlusterFS, so I’m not sure why he mentioned it or why I’d care. It seems a lot more like RBD or Sheepdog, leaving open the question of why Parallels didn’t use one of those. Maybe they specifically wanted something that was closed source (or open source but you’re not supposed to know you’re paying them for something free).
What’s really striking is what Kir never even mentions. For example, there’s no mention at all of security, privacy, or multi-tenancy. Surely, if this is supposed to be cloud storage, some mention should be made of accounts and authentication etc. There’s also no mention of management. If this is supposed to be all cloudy, shouldn’t there be something about how easy it is to add capacity or provision user storage from that pooled capacity? Without so much as an architectural overview it’s impossible to tell how well the result meets either the requirements Kir mentions or those he omits, and with such a start it’s hard to be optimistic.