Archive for August, 2011

HekaFS Development Workflow

Pete Zaitcev was having trouble building HekaFS the way he (apparently) builds other things, which led to an email discussion about how those of us already working on HekaFS currently build. With Pete’s permission – at his urging, in fact – here’s the scoop.


I can’t speak for Kaleb (or Edward) but my work flow is generally
something like this:

* Make changes on my desktop where I have a nice cscope setup etc.

* Rsync to build/test machine.

* Go into …/cloudfs/packaging.

* “make fedora” or “make gluster” as appropriate.

* Go into …/rpmbuild.

* “rpmbuild --rebuild SRPMS/hekafs…”

* “yum reinstall --nogpgcheck RPMS/x86_64/hekafs…”

It’s a tiny package, so the whole cycle still takes practically no
time and it avoids most forms of inconsistency or rpm breakage. It
also allows me to deal with the differences between the Gluster and
Fedora packaging easily, which is important because I have to switch
between the two almost daily.
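Strung together, the cycle above looks roughly like the sketch below. This is a hypothetical recap, not my actual script — “buildhost”, the source path, and the package globs are all placeholders:

```shell
#!/bin/sh
# Hypothetical sketch of the edit/build/install cycle described above.
# "buildhost", the source path, and the package globs are placeholders.
set -e

# 1. Edit locally, then push the tree to the build/test machine.
rsync -a ~/src/cloudfs/ buildhost:cloudfs/

# 2. Build the SRPM, rebuild it, and reinstall the result.
ssh buildhost '
    cd cloudfs/packaging &&
    make fedora &&                # or "make gluster", as appropriate
    cd ~/rpmbuild &&
    rpmbuild --rebuild SRPMS/hekafs-*.src.rpm &&
    sudo yum reinstall --nogpgcheck -y RPMS/x86_64/hekafs-*.rpm
'
```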

> If you write it somewhere, it would be appreciated. In particular,
> suppose you want to change something. Do you make changes to the git
> tree, or to the unpackage (rpmbuild -bp) tree? If the first, do you
> run the whole gauntlet in order to test the results? If the second,
> how do you scatter the changes back into the git (the pathnames are
> all different and not just having a different prefix)?

Sometimes while I’m debugging I make changes in …/cloudfs on the
build/test machine and rsync back, but I’ve been burned too many times
by making and then losing changes in …/rpmbuild/BUILD to do that any
more.


CloudFS is now HekaFS

In yet another illustration of how broken the US patent and trademark system can be, we’ve had to change our name to avoid conflict with someone else’s trademark. I think that’s utterly absurd, and I can give clear reasons why none of the other possible claimants you’ll find in a Google search should be considered to have a superior claim according to what I know of the law, but the folks at Red Hat who do actually know the law gave “CloudFS” a big thumbs-down and that’s really all that matters. That led to a scramble for new names. Any term remotely related to “cloud” or wind or water vapor is already gone by this time. “Elastic” is almost as bad, as are several other terms we might use. Still, we came up with several options and HekaFS is the only one that got a legal thumbs-up.

So, where does this name come from? First and foremost, Heka is an Egyptian god of magic. That’s why Kaleb suggested it, and it even has its own hieroglyph which could potentially serve as a logo. Personally, I prefer the Northern California slang interpretation. Mostly I’d just like to forget about how marketers, lawyers, and wannabes (like VMware and Oxygen Cloud) get to ruin everything for actual innovators. The best response to people who make the world worse is to make the world better, so it’s HekaFS and it’s time to make whatever it’s called as cool as it can be.

Bitter? Yeah.


Quick Look at XtreemFS

I like checking out other projects in my space, so yesterday I had a quick look at XtreemFS. This is one of the more interesting projects out there IMO, with a focus on distribution/replication beyond a single data center. The quick blurb says:

Clients and servers can be distributed world-wide. XtreemFS allows you to mount and access your files via the internet from anywhere.

With XtreemFS you can easily replicate your files across data-centers to reduce network consumption, latency and increase data availability.

XtreemFS is fully open source and licensed under the GPLv2.

OK, sounds good. Since I was working from home, I decided to run some simple tests between Lexington and Westford, which might not seem like a long distance except that I was going through a VPN server in Phoenix. Installation was fairly straightforward using their RHEL6 repository, and initial setup followed their Quick Start instructions. I was able to create a volume, mount it, and read/write data from both of my machines in practically no time. Well done, guys.
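For reference, the Quick Start boils down to something like the following. The host name and volume name are placeholders, and I’m going from memory on the 1.2-era tool names, so check the XtreemFS docs rather than trusting me:

```shell
# Hypothetical recap of the Quick Start steps; "myhost" and
# "myVolume" are placeholders, and the tool names are from memory.
xtfs_mkvol myhost/myVolume             # create a volume via the MRC
mkdir -p ~/xtreemfs
xtfs_mount myhost/myVolume ~/xtreemfs  # FUSE-mount it on a client
echo hello > ~/xtreemfs/test.txt       # then read/write from anywhere
```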

The next step was to do some actual replication, and that’s where things started to get a bit ragged around the edges. First, I just have to say that replication only of explicitly read-only files impresses me as little as I expected. Also, the process to perform the actual replication seems both cumbersome and error-prone. The instructions for this require at least two steps:

xtfs_repl --set_readonly ~/xtreemfs/movie.avi
xtfs_repl --add_auto --full ~/xtreemfs/movie.avi

Then those instructions didn’t even seem to work. It turned out that the problem was my own fault (insufficient iptables magic on my two machines) but the way the error presented itself was problematic. The actual commands just paused for a long time and then threw a generic I/O error. The logs had big Java tracebacks ending with “Set Replication Strategy not known” messages. This led me down a big blind alley trying to set the strategy on the second xtfs_repl command before I figured out the real problem; I suspect many users who haven’t been thinking about replication strategies for years might have felt even more lost.
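For anyone else who hits this: the “iptables magic” amounts to opening the XtreemFS service ports on every machine. I believe the defaults are 32636 (MRC), 32638 (DIR), and 32640 (OSD), but verify against your own config files before copying this:

```shell
# Open the (believed-default) XtreemFS ports on RHEL6-style iptables.
# 32636 = MRC, 32638 = DIR, 32640 = OSD -- check your configs first.
for port in 32636 32638 32640; do
    iptables -I INPUT -p tcp --dport "$port" -j ACCEPT
done
service iptables save    # persist across reboots on RHEL6
```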

I ran into the other problem this morning. My machine at home is no longer accessible, but the DIR and MRC processes plus one OSD are running here at work, so I thought I should be able to operate normally except for not replicating across sites. Wrong. When I tried to build in the iozone tree I had unpacked yesterday, I again saw long pauses followed by the thoroughly misleading “Set Replication Strategy not known” message in the OSD log. Further investigation suggests that the real problem is the iozone build process trying to modify old files that are marked read-only, but that should yield a pretty obvious EPERM/EROFS sort of error. Creating a separate volume and unpacking/building there seemed to work, though.

This did make me wonder about how well availability across sites really works. The site says that DIR and MRC replication are supposed to be features in version 1.3 (scheduled for Q1/10), but I don’t see any signs of 1.3 having been released yet. I looked around a bit for instructions on how to set up a redundant DIR/MRC with manual failover, but didn’t find any. As far as I can tell, XtreemFS still requires that remote sites be able to contact a primary DIR/MRC site even though their data might reside locally. That’s OK considering that most other distributed filesystems are exactly the same way, but since distribution across sites was (in my mind) XtreemFS’s main distinguishing feature, it was a bit of a disappointment. If the situation is actually better than what I’ve presented here, then I hope one of the XtreemFS developers (with whom I’ve corresponded in the past) will stop by and point me in the right direction.

I know all of that seems like a bit of a downer, but I’d like to end on a high note. Once I had fixed my own configuration issues, and as long as I stayed within the limitations I’ve mentioned, XtreemFS was the only distributed filesystem besides GlusterFS that could get through my “smoke test” without crashes, hangs, or data corruption. That might not seem like a very high standard considering that the test is just iozone reading and writing files sequentially, but four out of six distributed filesystems that I’ve tested (or tried to test) couldn’t even get that far. I wasn’t testing on systems where performance results would be really meaningful except to say that I test GlusterFS this way all the time and XtreemFS performance didn’t seem radically different. The fact that XtreemFS can handle even that much, along with the relative ease of installation and setup, already puts it at #2 on my list. I expect that when 1.3 does come out it will address at least some of the issues I’ve mentioned and offer a worthwhile choice for those who are interested in its unique feature set. I highly recommend that anyone interested in this area give it a look.
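For the curious, the “smoke test” is nothing fancier than a sequential iozone pass over the mounted filesystem, along these lines (the sizes and mount point are made up for illustration):

```shell
# Sequential write (-i 0) and read (-i 1) over the mounted filesystem,
# with flush (-e) and close (-c) timings included; sizes are arbitrary.
iozone -i 0 -i 1 -r 64k -s 1g -e -c -f /mnt/xtreemfs/iozone.tmp
```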


POSIX Limitations in FUSE

Because GlusterFS (and thus CloudFS) is based on FUSE, people often bring up the issue of FUSE limitations. Most often their concerns are about performance, and I think those concerns are a bit misplaced. Modern versions of FUSE do quite well, and Sage Weil even points out that some FUSE filesystems such as PLFS significantly outperform their native cousins for many important workloads. FUSE performance is fine, especially for a horizontally scalable distributed filesystem. As Sage also points out, though, there are some functional issues with FUSE. I think he overestimates the importance of integrating tightly with the kernel for memory management and cache coherency, but the problems are there and so I think it’s worthwhile to understand where the “rough edges” are. This post is my attempt to put together a list of ways in which FUSE filesystems might violate either POSIX standards or people’s expectations based on those standards. Here are the first few things that come to mind. Many thanks to Anand Avati @ Gluster for re-explaining some of these to me, providing updated information about others, and helping to fill out the list.

  • Shared writable mmap
    This is the best known FUSE limitation, because it really does bring the coherency and memory-management issues to the fore. Nonetheless, versions of FUSE since Linux kernel 2.6.27 do support it. I happen to believe shared writable mmap is something you shouldn’t be doing on a distributed filesystem because it will never perform well and introduces some extremely nasty fault-handling problems, but for some people it’s still a real issue and not solved in the versions of FUSE that they have.
  • Atomic rename
    This is more of a distributed-filesystem issue generally, but there is a FUSE component to it as well. In a nutshell, the problem is that POSIX requires that if a file exists at a certain path before a rename, then users must be able to open that file at any point even if it’s the subject of a rename. Unfortunately, since FUSE uses a handle-based interface, the open is actually in two parts – first a lookup to get the handle, then the open itself. The file could be renamed away in between, causing a POSIX-violating failure on the second part. This is really hard to address without speculatively locking the entire directory, which is just nasty in a whole bunch of ways.
  • Forgetting inodes
    Kernels have the right to evict inodes from their caches under pressure, but this can introduce a problem if the inode evicted on a server is still in use on a client. The result is a spurious ENOENT error on the client. Again, FUSE has actually addressed this – long before the mmap fixes, in fact – with some callbacks to notify user space, but these callbacks are not widely used and GlusterFS specifically doesn’t have those hooks yet. NFS doesn’t always handle these cases well either, by the way.
  • O_DIRECT
    OK, this one’s not POSIX, but still. FUSE actively filters out O_DIRECT flags, for reasons that escape me at the moment. Gluster has a FUSE patch that will turn O_DIRECT into something else that FUSE does support and that’s nearly equivalent, and just yesterday Anand Avati submitted a second patch that is even more fully integrated with the rest of how FUSE works, so maybe soon people won’t have to choose between stock FUSE and FUSE that supports O_DIRECT.
Now, I know this list is incomplete. Are there any other areas people can think of where FUSE filesystems can’t do things that in-kernel filesystems can? Please let me know in the comments so we can have a comprehensive list and point people to it when they ask.
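The rename race in particular is easy to provoke from a shell. Even on a local filesystem there is an instant between the two renames when the name doesn’t exist, and the FUSE lookup/open split just widens that window. A throwaway demo (everything here is illustrative, not part of any real test suite):

```shell
#!/bin/sh
# Throwaway demo of the rename-vs-open race described above: one loop
# renames a file back and forth while another keeps opening it by its
# original name. Any open that lands mid-rename fails with ENOENT.
dir=$(mktemp -d); cd "$dir" || exit 1
echo hello > f

# Renamer: 2000 back-and-forth rename pairs in the background.
( i=0; while [ "$i" -lt 2000 ]; do mv f g; mv g f; i=$((i+1)); done ) &

# Opener: count how many opens of "f" fail while the renamer runs.
fails=0; i=0
while [ "$i" -lt 2000 ]; do
    cat f >/dev/null 2>&1 || fails=$((fails+1))
    i=$((i+1))
done
wait
echo "open failures: $fails"
```

On a FUSE mount the failure count tends to be much higher than on ext4, because the kernel’s lookup and the subsequent open are separate round-trips to user space.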


New GlusterFS Translator-API Guide

I wrote up a new version both because we need it internally for new CloudFS developers and because others have asked. It’s about 50% longer than the old version, and would probably be twice as long if I hadn’t deleted the section describing the details of the call_frame_t structure (which just doesn’t seem that useful). Topics covered include:

  • Introduction
  • Dispatch Tables and Default Functions
  • Per Request Context
  • Inode and File Descriptor Context
  • Dictionaries and Translator Options
  • Logging
  • Child Enumeration and Fan Out

Let me know if there are any significant subject areas I missed, or if there are errors.