Some software projects are obviously hard. Nobody thinks writing a compiler or an operating system, especially one comparable to existing production-grade examples, will be easy. Other projects seem to be easy until you get into them. Unfortunately, distributed storage is one of those. It’s really not that hard to put together some basic server/connection management and some consistent hashing, add a simple REST interface, and have a pretty useful type of distributed object store. The problem is that some people do that and then claim to have reinvented Swift, or to have invented something that’s even better than GlusterFS because it’s simpler. Um, no. A real distributed storage system has to go well beyond those simple requirements, and I’m not even talking about all of the complexity imposed by POSIX. I’ve beaten the NoPOSIX drum a few times, calling for simplification of semantics for distributed filesystems. I have nothing against object stores like Swift, which simplify even further. However, NoPOSIX and NoFS and NoSQL don’t mean giving up all requirements and expectations altogether. A minimum standard still includes things like basic security, handling disk-full or node-failure errors gracefully, automating the process of adding/removing/rearranging servers (preferably without downtime), and so on. That complexity is there for a reason. Comparing something that has these features to something that doesn’t isn’t just incorrect. It’s dangerous. Users are even less capable than fellow developers of evaluating claims about such systems. Overstating your capabilities increases the risk that users will choose systems which aren’t really ready for serious use, and might even cost them their valuable data.
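To give a sense of how little code the "easy part" takes, here is a minimal sketch of consistent hashing with virtual nodes. It is an illustration only, not how Swift, GlusterFS, or Weed-FS place data; the `HashRing` class, the server names, and the choice of MD5 and 64 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring; each server owns several ring points."""

    def __init__(self, servers, vnodes=64):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        # Place `vnodes` points for this server around the ring.
        for i in range(self._vnodes):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server):
        # Drop every point owned by this server; other points are untouched.
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, key):
        # A key belongs to the first ring point clockwise from its hash.
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

Adding a server only moves the keys that land on the new server's points, which is the whole appeal. Note what the sketch leaves out: actually migrating the data, surviving a failure mid-rebalance, replication, and security. Those are exactly the parts that make real systems complex.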

What brought all of this to mind was some recent Quora spamming by the author of Weed-FS, claiming that it would perform better than existing systems because of its great simplicity. In some ways it’s unfair to pick on Weed-FS specifically, but it represents a general category of “existing data stores are too complex” and “we invented a better data store” BS that I’ve been seeing entirely too much of lately. Also, I kind of promised/threatened to run some performance tests myself if the author was too lazy or scared, so here we go.

STRIKE 1: no real packaging

I fired up a Rackspace cloud server, and went to see if I could install Weed-FS on it. No such luck. The only build packages are for Windows/Darwin/Linux amd64, but that’s a relative rarity on cloud services, so I cloned the source tree and tried to build from that. Too bad there are no makefiles. Apparently the author builds using Eclipse, and didn’t bother including all of the information from that in the source tree. Nonetheless, it only took me a few minutes to figure out the correct build order and build the single executable using gccgo.

STRIKE 2: barely usable interface

Unlike most object-store interfaces, Weed-FS has no buckets/containers/directories and insists on assigning its own keys to objects. Therefore you can’t use keys that are meaningful to you; you have to use theirs and store the mapping in some other kind of database. There also seems to be no enumeration function (I guess we don’t need to bother measuring that kind of performance) so if you ever lose the mapping between your own key and theirs then you’ll never find your data again. Similarly, there are no functions to get/set metadata on objects, so there’s pretty much no way to use Weed-FS except by pairing it with a database and wrapping a library around the whole thing. Oh, and there’s no delete either. Too bad if you ever want to reclaim any space.
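To make the "pair it with a database and wrap a library around it" point concrete, here is a rough sketch of what such a wrapper might look like. Everything in it is hypothetical: `AssignedKeyStore` is an in-memory stand-in for any store that picks its own keys (it is not the Weed-FS API), and `NamedStore` with its SQLite table is just one way the name mapping could be kept.

```python
import itertools
import sqlite3


class AssignedKeyStore:
    """Hypothetical stand-in for a store that assigns its own keys."""

    def __init__(self):
        self._blobs = {}
        self._ids = itertools.count(1)

    def put(self, data: bytes) -> str:
        key = f"{next(self._ids):08x}"  # the store, not the caller, picks the key
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class NamedStore:
    """Hypothetical wrapper that maps caller-chosen names to store-assigned keys."""

    def __init__(self, backend, db_path=":memory:"):
        self._backend = backend
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS names "
            "(name TEXT PRIMARY KEY, store_key TEXT NOT NULL)"
        )

    def put(self, name: str, data: bytes) -> None:
        store_key = self._backend.put(data)
        with self._db:  # commit the mapping, or roll back on error
            # Overwriting a name orphans the old blob: the backend has no delete.
            self._db.execute(
                "INSERT OR REPLACE INTO names (name, store_key) VALUES (?, ?)",
                (name, store_key),
            )

    def get(self, name: str) -> bytes:
        row = self._db.execute(
            "SELECT store_key FROM names WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            raise KeyError(name)
        return self._backend.get(row[0])

    def list(self, prefix: str = ""):
        # Enumeration has to come from the database; the backend offers none.
        rows = self._db.execute("SELECT name FROM names ORDER BY name")
        return [name for (name,) in rows if name.startswith(prefix)]
```

Even this toy version shows the catch: the database becomes a second system to secure, back up, and keep consistent with the object store, and losing it means losing the only route back to the data.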

STRIKE 3: poor performance

Despite all the above, I set about writing some scripts to test performance. Before I could read a million files I had to create a million files. Just to be on the safe side, I decided to try creating 100K files first to make sure it wasn’t going to take forever, which would have made this exercise very tedious and cost me money in extra instance hours. It didn’t take forever, but it did take over 14 minutes. That’s over 8ms per object create, or over two hours just to set up for a real test. It’s particularly egregious since there doesn’t seem to be any evidence of using O_SYNC/fsync, so it’s not even clear that the index file is sure to have been updated. I tried speeding it up by running five client threads in parallel, but one by one they hung waiting for a response from one of the volume servers, probably related to the “unexpected EOF” errors that the volume server would spit out periodically. I guess concurrency isn’t a strong suit, and neither is error reporting, since I had already noticed that the servers would return an HTTP 200 response even when requests failed. Just for comparison, GlusterFS completed the same setup in about seven minutes. That’s with full data durability, plus a result that has real user-friendly file names (plus extended attributes) and directory listings and actual security. At this point I decided it wasn’t even worth moving on to the test I’d meant to do. I’d seen enough.
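For anyone who wants to check the arithmetic behind the create times above, here is a small sketch. The helper names are made up for the example, and since the Weed-FS run took "over" 14 minutes and the GlusterFS run "about" seven, the inputs are round figures rather than exact measurements.

```python
def per_create_ms(elapsed_minutes: float, objects: int) -> float:
    """Average time per object create, in milliseconds."""
    return elapsed_minutes * 60 * 1000 / objects


def projected_hours(ms_per_create: float, objects: int) -> float:
    """Time to create `objects` objects at a constant per-create cost, in hours."""
    return ms_per_create * objects / 1000 / 3600


# Round figures from the 100K-object runs described above.
weedfs_ms = per_create_ms(14, 100_000)
glusterfs_ms = per_create_ms(7, 100_000)

# Projected setup time for the million-object test that never happened.
weedfs_hours = projected_hours(weedfs_ms, 1_000_000)
glusterfs_hours = projected_hours(glusterfs_ms, 1_000_000)

print(f"Weed-FS:   {weedfs_ms:.1f} ms/create, {weedfs_hours:.2f} h per million")
print(f"GlusterFS: {glusterfs_ms:.1f} ms/create, {glusterfs_hours:.2f} h per million")
```

The projection assumes the per-create cost stays flat as the store grows, which is generous to any system whose index gets slower with size.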

Some people might think I’m being too harsh here, but I disagree. As I said in the first paragraph, systems like Weed-FS can be quite useful. As I also said, representing them as more than they are is not only incorrect but dangerous. The author as developer has done some interesting stuff and deserves encouragement. This feedback might not be what he wants to hear, but it’s a kind of feedback that developers need to hear, plus I’ve already done some testing and scripting work that might be useful. On the other hand, the author as social-media marketer deserves nothing but contempt. This is not a system that one can trust with real data, or that is in any significant way comparable to systems that already existed, and yet it was blithely presented as something actually superior to alternatives. That’s not acceptable. Building real storage is hard and often tedious work. The people who do it – including my competitors – don’t deserve to have their efforts trivialized by comparison to half-baked spare-time projects. They deserve better, and users deserve better, and anybody who doesn’t respect that deserves a few harsh words.