HekaFS (formerly known as CloudFS)
http://pl.atyp.us/hekafs.org

Performance Variation in the Cloud
http://pl.atyp.us/hekafs.org/index.php/2013/05/performance-variation-in-the-cloud/
Jeff Darcy, Fri, 17 May 2013 14:50:37 +0000

During my talks, I often try to make the point that horizontally scalable systems are often necessary not only to achieve high aggregate performance but also to overcome performance variation. This is true in general, but especially in a public cloud. In that environment, performance variation both between nodes and over time is much greater than in a typical bare-metal environment, and there’s very little you can do about it at the single-system level, so you pretty much have to deal with it at the distributed-system level. I’ve been using a graph that kind of illustrates that point, but it has a few deficiencies – it’s second-hand observation, it’s kind of old, and it’s for network performance whereas I usually care more about disk performance. To illustrate my point even more clearly, I took some measurements of my own recently and created some new graphs. Along the way, I found several other things that might be of interest to my readers.

The methodology here was deliberately simple. I’d get on a node, do whatever disk/volume setup was necessary, and then run a very simple iozone test over and over – eight threads, each doing random 4KB synchronous writes. I then repeated this exercise across three providers. It’s worth noting that each test is on a single machine at a single time. The variation across a broader sample is likely to be even greater, but these samples are already more than sufficient to make my point. Let’s look at the first graph, for a High I/O (hi1.4xlarge) instance.

Amazon I/O variation

Ouch. From peaks over 16K down to barely 6K, with barely any correlation between successive samples. That’s ugly. To be fair, Amazon’s result was the worst of the three, and that’s fairly typical. I also tested a Rackspace 30GB instance, and the measly little VR1G instance (yes, that’s 1GB) that runs this website at Host Virtual. The results were pretty amusing. To see how amusing, let’s look at the same figures in a different way.

IOPS distribution

This time, we’re looking at the number of samples that were “X or better” for any given performance level. This is a left/right mirror image of the more common “X or worse” kind of graph, which might seem a bit strange to some people. I did it this way deliberately so that “high to the right” is better, which I think is more intuitive. Too bad I don’t have comments so you can complain. :-P The way to interpret this graph is to keep in mind that the line always falls. The question is how far and how fast it falls. Let’s consider the three lines from lowest (overall) to highest.

  • The Rackspace line is low, but it’s very flat. That’s good. 97% of the samples are in a range from just under 6000 to a bit more under 4000. That’s pretty easy to plan for, as we’ll discuss in a moment.
  • The Amazon line is awful. It has the highest peak on the left, but drops off continuously and sits below the HV line most of the time. As we’ve already noted, the range is also quite large. A flat line across a large range is exactly the opposite of a flat line across a small range; it’s very hard to plan around.
  • The Host Virtual line is the most interesting. 70% of the time it’s very nice and flat, from 13.5K down to 12K, but then it falls off dramatically. Is this a good or bad result? It requires a bit more complex mental model than a flat line, but once you’re used to the model it’s actually better for planning purposes.
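For readers who want to build the same kind of curve from their own measurements, one point on an “X or better” line is just the fraction of samples at or above a given level. Here is a minimal sketch in C; the function name and the idea of holding the iozone results in a plain array are assumptions made for illustration, not the tooling used for the graphs above.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of samples that achieved at least `level` IOPS: one point on
 * an "X or better" curve.  (Hypothetical helper, for illustration only.) */
double fraction_at_least(const double *samples, size_t n_samples, double level)
{
    assert(samples != NULL && n_samples > 0);
    size_t count = 0;
    for (size_t i = 0; i < n_samples; i++) {
        if (samples[i] >= level)
            count++;
    }
    return (double)count / (double)n_samples;
}
```

Evaluating this at increasing levels gives a line that can only stay flat or fall, which is why the interesting questions are how far and how fast it falls.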

Before I describe how to use this information for planning a deployment, let’s talk a bit about prices. That VR1G costs $20 a month. The Rackspace instance would cost $878 and the Amazon instance would cost $2562 (less with spot/reserved pricing). Pricing isn’t really my point here, but a 128x difference does give one pause. When the effect of variation on deployment size is considered, those numbers only get worse. Even when one considers the benefits of Amazon’s network (some day I’ll write about that because it’s so much better than everyone else’s that I think it’s the real reason to go there) and services and so on, any serious user would have to consider which workloads should be placed where. But I digress. On with the show.

Let’s look now at how to use this information to provision an entire system. Say that we want to get to 100K aggregate IOPS. How many instances would it take to get there assuming the absolute best case, and how many would it take to achieve a 99% probability based on these distributions?

Provider      Best Case  99% Chance  Ratio
Amazon                7          13   1.86
Rackspace            14          28   2.00
Host Virtual          8          11   1.38

Here we see something very interesting – the key point of this entire article, in my opinion. Even though Amazon is potentially capable of satisfying our 100K IOPS requirement with fewer instances than Host Virtual, once we take variation into account it requires more to get an actual guarantee. Instead of provisioning 38% more than the minimum, we need to provision 86% extra. As Jeff Dean points out in his excellent Tail At Scale article, variation in latency (or in our case throughput) is a critical factor in real-world systems; driving it down should be a goal for systems and systems-software implementors.

Before closing, I should explain a bit about how I arrived at these figures. Such figures can only be approximations of one sort or another, because the number of possibilities that must be considered to arrive at a precise answer is samples^nodes. Even at only 100 samples and 10 nodes, we’d be dealing with 10^20 possibilities. Monte Carlo would be one way to arrive at an estimate. Another way would be to divide the sorted samples into buckets, collapse the numbers within each bucket to a single number (e.g. average or minimum), then treat the results as a smaller number of samples. You can even use enumeration within a bucket as well as between buckets, and even do so recursively (which is in fact what I did). When there’s a nice “knee” in the curve, you can do something even simpler. Just eyeball a number above the knee and a number below, then work out the possibilities using those numbers and a probability equal to the percentile at which the knee occurs. Whichever approach you use, you can do more work to get more accurate results but (except for the Monte Carlo option) the numbers tend to converge very quickly so you’d probably be overthinking it.
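The Monte Carlo option can be sketched in a few lines of C. This is only an illustration, not the recursive bucket enumeration described above: the function names, the small xorshift generator, and the assumption that each node independently draws one of the measured samples are all choices made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift generator so results are repeatable for a given seed. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fraction of trials in which `nodes` instances, each drawing one measured
 * sample at random, reach `target` aggregate IOPS. */
double prob_meets_target(const double *samples, size_t n_samples, int nodes,
                         double target, int trials, uint32_t seed)
{
    assert(samples != NULL && n_samples > 0 && trials > 0);
    uint32_t state = seed ? seed : 1u;  /* xorshift state must be nonzero */
    int hits = 0;
    for (int t = 0; t < trials; t++) {
        double total = 0.0;
        for (int n = 0; n < nodes; n++)
            total += samples[next_rand(&state) % n_samples];
        if (total >= target)
            hits++;
    }
    return (double)hits / trials;
}

/* Smallest node count whose estimated probability reaches `confidence`,
 * or -1 if even max_nodes is not enough. */
int nodes_for_confidence(const double *samples, size_t n_samples,
                         double target, double confidence,
                         int trials, int max_nodes, uint32_t seed)
{
    for (int nodes = 1; nodes <= max_nodes; nodes++) {
        if (prob_meets_target(samples, n_samples, nodes, target,
                              trials, seed) >= confidence)
            return nodes;
    }
    return -1;
}
```

Calling nodes_for_confidence with one provider’s samples, a target of 100000 and a confidence of 0.99 yields the kind of number that appears in the “99% Chance” column, subject to sampling noise. More trials tighten the estimate, which is the extra work the other approaches mostly let you avoid.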

OK, so what have we learned here? First, we’ve learned that I/O performance in the cloud is highly variable. Second, we’ve learned a couple of ways to visualize that variation and see the different patterns that it takes for each provider. Third, we’ve learned that consistency might actually matter more than raw performance if you’re trying to provision for a specific performance level. Fourth and last, we’ve learned a few ways to reason about that variation, and use repeated performance measurements to make a provisioning estimate that’s more accurate than if we just used an average or median. I hope this shows why average, median, or even 99th percentile is just not a good way to think about performance. You need to look at the whole curve to get the real story.

Object Mania
http://pl.atyp.us/hekafs.org/index.php/2013/05/object-mania/
Jeff Darcy, Tue, 14 May 2013 22:10:24 +0000

Apparently, at RICON East today, Seagate’s James Hughes said something like this.

Any distributed filesystem like GlusterFS or Ceph that tries to preserve the POSIX API will go the way of the dodo bird.

I don’t actually know the exact quote. The above is from a tweet by Basho’s Seth Thomas, and is admittedly a paraphrase. It led to a brief exchange on Twitter, but it’s a common enough meme that I think a fuller discussion is warranted.

The problem here is not the implication that there are other APIs better than POSIX. I’m quite likely to agree with that, and a discussion about ideal APIs could be quite fruitful. Rather, the problem is the implication that supporting POSIX is inherently bad. Here’s a news flash: POSIX is not the only API that either GlusterFS or Ceph supports. Both also support object APIs at least as well as Riak (also a latecomer to that space) does. Here’s another news flash: the world is full of data associated with POSIX applications. Those applications can run just fine on top of a POSIX filesystem, but the cost of converting them and/or their data to use some other storage interface might be extremely high (especially if they’re proprietary). A storage system that can speak POSIX plus SomethingElse is inherently a lot more useful than a storage system that can speak SomethingElse alone, for any value of SomethingElse.

A storage system that only supported POSIX might be problematic, but neither system that James mentions is so limited and that’s what makes his statement misleading. The only way such a statement could be more than sour grapes from a vendor who can’t do POSIX would be if there’s something about supporting POSIX that inherently precludes supporting other interfaces as well, or incurs an unacceptable performance penalty when doing so. That’s not the case. Layering object semantics on top of files, as GlusterFS does, is pretty trivial and works well. Layering the other way, as Ceph does, is a little bit harder because of the need for a metadata-management layer, but also works. What really sucks is sticking a fundamentally different set of database semantics in the middle. I’ve done it a couple of times, and the impedance-mismatch issues are even worse than in the Ceph approach.

As I’ve said over and over again in my presentations, there is no one “best” data/operation/consistency model for storage. Polyglot storage is the way to go, and POSIX is an important glot. I’ve probably used S3 for longer than anyone else reading this, and I was setting up Swift the very day it was open-sourced. I totally understand the model’s appeal. POSIX itself might eventually go the way of the dodo, but not for a very long time. Meanwhile, people and systems that try to wish it away instead of dealing with it are likely to go the way of the unicorn – always an ideal, never very useful for getting real work done.

Mounting GlusterFS as an Unprivileged User
http://pl.atyp.us/hekafs.org/index.php/2013/03/mounting-glusterfs-as-an-unprivileged-user/
Jeff Darcy, Sat, 16 Mar 2013 14:48:10 +0000

Somebody asked on Twitter whether it was possible, so I tried it. I was able to make it work, but only with some code changes and other very nasty hacks. For the record, here’s what I had to do.

  • Remove the explicit check in fuse_mount_fusermount that causes mounts to fail with “Mounting via helper utility (unprivileged mounting) is supported only if glusterfs is compiled with --enable-fusermount”. I could probably get the same effect by building my RPMs with that option, but I find the build-time requirement noxious. IMO this should be enabled (in the code) by default.
  • Make both fusermount and glusterfsd set-uid. This is a stunningly bad idea in general, but for experimentation it’s OK. Don’t do this unless you’re sure that only trusted users can run these programs, and have reconciled yourself to the idea that you now have set-uid programs that haven’t been through a security audit appropriate to that usage.
  • Mount using “glusterfs --volfile …” to use a local volfile instead of fetching one from glusterd. It looks like glusterd isn’t processing the rpc-auth-allow-insecure option properly; if not for that, mounting normally should work.
  • Have the untrusted user work only on files in a directory owned by that user. The brick directory is still owned by root and should probably remain that way, but you can create a per-user subdirectory.

In short, making this work for everyone would require both code/packaging changes and site changes that are questionable in terms of security. I’m not sure it would be wise to do this, but it is possible.

GlusterFS, cscope, and vim (oh my!)
http://pl.atyp.us/hekafs.org/index.php/2013/03/glusterfs-cscope-and-vim-oh-my/
Jeff Darcy, Fri, 01 Mar 2013 14:22:01 +0000

I’m generally pretty old-school when it comes to programming tools. Many IDE features either leave me cold (auto-complete) or seem actively harmful to understanding the code as it really is (“project” hierarchies). I do like syntax highlighting, though I’d probably like it just as much if the only thing it did was show comments in a distinct color. I suppose I wouldn’t mind some support e.g. for renaming a variable and having all references change without also changing references to a same-named variable somewhere else . . . but I’m not willing to put up with all the other BS just for that.

The one feature that I do find indispensable is cross-referencing. In a codebase as large as GlusterFS (over 00K lines and still growing quite rapidly) being able to jump to a function/structure definition or references is a pretty major productivity boost. Just do the math. When I’m exploring code for a review or in preparation for some change I plan to make, I often traverse chains of three to five calls at a rate of one such chain per minute. Every second counts. If the old grep/open/goto-line dance takes me only five seconds, that’s still a third of my total time plus additional cognitive disruption. If it takes me only a second each time, that’s pretty huge.

For a while I used Source Navigator but got tired of its insane multi-window and history behavior. For a longer while I used kscope, but then it died with KDE4. I tried exploring alternatives for a while, until Larry Stewart suggested I use cscope with the vim bindings. I tried it, and have never looked back. Here’s a really quick overview of how it works for me and will hopefully work for you.

To install, you need to do two things. First, you need to install cscope. That’s usually just a yum or apt-get install away, so no big deal there. Then you need to put cscope_maps.vim in your ~/.vim/plugin directory. That’s it.

To prepare cscope+vim for use in a particular directory, you also need to do two things – generate a list of files, then generate a list of tags (bookmarks) from that. I use the following two commands.

$ find . -name '*.[ch]' > cscope.files
$ cscope -b

This is such a common operation that I even created a git checkout hook to do it automatically. Similarly, refreshing tags in the middle of an editing session is so common that you might want to consider binding this combination to a simple key sequence within vim.

:!cscope -b
:cs reset

The first of those calls out to regenerate the cscope.out file that contains all of your tags. It probably only works if you always start your editing session at the top of the source tree (so it can find cscope.files properly). The second command tells the vim part of the combo to re-read the new tags file.

Using cscope from this point on is extremely simple. The most common commands are prefixed with C-\ (control backslash) and implicitly take the identifier under the cursor as an argument. Some others start with :cs instead (like reset above). The most common ones I use are as follows.

  • “C-\ g” jumps to the definition of a function or structure.
  • “C-\ c” shows a list of callers for a function, and you can jump to any one of those from the list.
  • “C-\ s” shows a list of references for an identifier – usually a function but could also be a structure or variable. This is often necessary in GlusterFS code to find functions which aren’t called but are passed to STACK_WIND as callbacks.
  • “C-t” undoes the last C-\ command, leaving you where you were before.
  • “:cs f e <string>” does a regex search through all of the files in cscope.files for <string>. There’s a bit of glitchiness in quoting/escaping/parsing the string, so I generally find it’s best to enclose any possibly-special characters in [] to have them treated as character sets. For example, instead of foo->bar I’ll usually use foo[-][>]bar.
  • “:cs f f <string>” will show you a list of files in cscope.files matching <string>. Actually I just noticed it while composing this article, and should use it more myself.
  • “:cs help” will show you some of the other options.

This setup generally works extremely well for me, but there are a few things particular to the GlusterFS code that don’t work quite as well as they could. For one thing, some symbols just don’t seem to get tagged. I suspect that it has something to do with macros – we overuse those terribly, and the problem always seems to be in files where the abuse is particularly bad – but I’ve never quite nailed it down. Sometimes you just have to do a string/regex search instead. Also, jumping to a type definition with “C-\ g” will often take you to a typedef, and then you have to do it again to get to the real structure definition. I wish there was a way to tell cscope that it should jump through typedefs automatically, but I don’t know of one and I doubt I’ll ever have enough time to patch it myself.

If you find yourself spending too much time on the mechanics of navigating GlusterFS code (or for that matter anything else), please give this technique a try. I’m sure there’s an emacs equivalent, but I don’t know the details off the top of my head; maybe some kind soul will provide some hints as a comment. Happy hacking.

Gdb Macros For GlusterFS
http://pl.atyp.us/hekafs.org/index.php/2013/02/gdb-macros-for-glusterfs/
Jeff Darcy, Tue, 26 Feb 2013 20:34:47 +0000

In previous jobs, especially at Revivio, I’ve spent a pretty fair amount of time creating gdb macros to make the inevitable debugging sessions a bit more productive. I’ve generally tried to stay away from that on GlusterFS, partly because there are usually better ways to debug the sorts of problems I have to deal with and partly because gdb macros are one of those things that will make you ill if you know anything about real scripting languages. For example, you can define recursive macros, but convenience variables are always global so you basically can’t use those. Instead, you have to take advantage of the fact that macro arguments are local and rely exclusively on those instead. What you end up with is this grossly inefficient and unreadable tail-recursive mess, just to work around the macro language’s deficiencies. You’ll see what I mean in a minute, but let’s start with something simple – printing out the contents of a dictionary.

define pdict
	set $curr = $arg0->members_list
	while $curr
		printf "%s = %p %s\n", $curr->key, $curr->value, $curr->value->data
		set $curr = $curr->next
	end
end

That’s not too bad. Now let’s look at one to print out some essential information about a translator.

define pxlator
	printf "--- xlator %s type %s\n", $arg0->name, $arg0->type
	set $d = $arg0->options->members_list
	while $d
		printf "    option %s = %s\n", $d->key, $d->value->data
		set $d = $d->next
	end
	set $x = $arg0->children
	while $x
		printf "    subvolume %s\n", $x->xlator->name
		set $x = $x->next
	end
end

Now things get a bit hairier. What if we wanted to print out a translator and all of its descendants? This is where that global vs. local issue comes back to bite us, because any convenience variable we use to traverse our own descendant list will also be used in each of them to traverse their own descendant lists, and finding our parent’s next sibling when we’ve finished traversing such a list is really ugly. Instead, we end up with this.

define ptrav
	pxlator $arg0->xlator
	if $arg0->xlator->children
		ptrav $arg0->xlator->children
	end
	if $arg0->next
		ptrav $arg0->next
	end
end
define pgraph
	pxlator $arg0
	if $arg0->children
		ptrav $arg0->children
	end
end

As you can see, ptrav has that ugly tail-recursive structure we talked about. The same thing happens when we try to print out a DHT layout structure.

define playout_ent
	if $arg1 < $arg2
		set $ent = $arg0[$arg1]
		printf "  err=%d, start=0x%x, stop=0x%x, xlator=%s\n", \
			$ent.err, $ent.start, $ent.stop, $ent.xlator->name
		playout_ent $arg0 $arg1+1 $arg2
	end
end
define playout
	printf "spread_cnt=%d\n", $arg0->spread_cnt
	printf "cnt=%d\n", $arg0->cnt
	printf "preset=%d\n", $arg0->preset
	printf "gen=%d\n", $arg0->gen
	printf "type=%d\n", $arg0->type
	printf "search_unhashed=%d\n", $arg0->search_unhashed
	playout_ent $arg0->list 0 $arg0->cnt
end

I’ve really just started defining these, so if you have some suggestions please let me know. Otherwise, you can use them by just copying and pasting into your .gdbinit or (better yet) into a separate file that you can “source” only when you’re debugging GlusterFS. Share and enjoy. ;)

FAST’13
http://pl.atyp.us/hekafs.org/index.php/2013/02/fast13/
Jeff Darcy, Sun, 17 Feb 2013 18:22:01 +0000

I got back from the USENIX File And Storage Technologies conference yesterday. It actually ended on Friday, but I don’t much care for flight schedules that get me to Boston well after midnight so I stayed in San Jose for the rest of the day. As always, it was an awesome conference. The mix of familiar old-timers and enthusiastic newcomers, all focused on making data storage better, is very energizing.

My own cloud-storage tutorial on Tuesday went well. We had a bit of trouble with the projection video at the beginning, but the USENIX folks are real pros so we got through it. I also had a moment of terror near the beginning when I got a persistent tickle in my throat (it is still flu season after all) and started to wonder if my voice would hold out for the full three hours, but I was able to get things under control. Thanks to the feedback from last year I think it was a much stronger presentation this time, and the general audience engagement seemed much higher. We spent the entire half-hour break continuing discussions in the middle of the room, and throughout the conference – even on the last day – people were still coming up to me to talk about the material. I’m grateful to everyone who made it possible, and everyone who made it awesome.

With that, it was on with the real show – the refereed papers, plus work-in-progress talks, birds-of-a-feather sessions, and posters. To be quite honest, there seemed to be a lower percentage of papers this year that were interesting to me personally, but that’s more a reflection of my interests than of the papers themselves, which were all excellent. Here’s the full list.

FAST ’13 Technical Sessions
Posters and Work-in-Progress Reports (WiPs)

From my own perspective as a filesystem developer, here are some of my own favorites.

  • A Study of Linux File System Evolution (Lu et al, Best Paper [full length])
    Maybe not as much fun as some of the algorithmic stuff in other papers, but I’m very excited by the idea of making an empirical, quantitative study of how filesystems evolve. I’m sure there will be many followups to this.
  • Unioning of the Buffer Cache and Journaling Layers with Non-volatile Memory (Lee et al, Best Paper [short]).
    Devices that combine memory-like performance and byte addressability with persistence are almost here, and figuring out how to use them most effectively is going to be very important over the next few years. This paper’s observations about how to avoid double buffering between an NVM-based cache and an on-disk journal are worth looking into.
  • Radio+Tuner: A Tunable Distributed Object Store (Perkins et al, poster and WiP)
    It might be about object stores, but the core idea – dynamically selecting algorithms within a storage system based on simulation of expected results for a given workload – could be applied to filesystems as well.
  • Gecko: Contention-Oblivious Disk Arrays for Cloud Storage (Shin et al).
    This is my own personal favorite, and why I’m glad I stayed for the very last session. They present a very simple but powerful way to avoid the “segment cleaning” problem of log-structured filesystems by using multiple disks. Then, as if that’s not enough, they use SSDs in a very intelligent way to boost read performance even further without affecting writes.

I can definitely see myself trying to apply some of the ideas from Tuner and Gecko to my own work. Who knows? Maybe I’ll even come up with something interesting enough to submit my own paper next year. There’s a lot more good stuff there, too, so check out the full proceedings.

Two Kinds of Open Source
http://pl.atyp.us/hekafs.org/index.php/2013/02/two-kinds-of-open-source/
Jeff Darcy, Fri, 08 Feb 2013 15:40:12 +0000

There are two schools of thought about when you should release open-source code. One school says you should release it as early as it has any chance whatsoever of being useful or informative to other people. The other school says that if you can’t commit to doing it right – proper source control, packaging, bug tracker, etc. – then don’t bother. I’ve come to the conclusion that both views are right. The release-early approach is essential to maximizing collaboration, while the do-it-right approach maximizes user-friendliness. One is developer-focused while the other is user-focused, so the key is to be prepared to do both at the right times during the project’s lifetime. In fact, I’ll suggest a very specific point where you should switch.

  • Individuals contributing their own time (or sponsored to do stuff on some other not-for-immediate-profit basis) should use the release-early approach.
  • Companies actually selling stuff for real money (even if a free version is available) should use the do-it-right approach.

To illustrate the reason for this “mode switch” consider the two ways people get this wrong. On the one hand, you have developers who play stuff too close to the vest, who won’t release their code until it’s a perfectly polished gem. This is actually an urge I have to fight myself, even for code that’s only “released” to my own team. I don’t want to put code out there that might make people think I’m stupid or careless, so I keep it to myself until I’m at least sure that it’s sane. The problem with this approach is that it relies too much on the individual programmer’s motivation and ability to make progress. Lose that ability, and what might have been a useful idea sinks without a trace. Getting stuff out early at least makes it possible that someone else will get it over that immediate roadblock so that progress can continue.

The other way to screw it up is to form a company around an idea and still treat it as a throwaway. This is kind of what jwz was complaining about recently, and that’s what inspired this post. Examples are legion of companies that just toss their crap up on github not just with a “hey, this might be interesting” but with “you should be using this instead of that other thing” and then fail to stand behind it in any meaningful way. That’s getting a bit too close to fraud for my tastes. The reason companies get and accept funding is supposed to be (largely) to develop that infrastructure and hire the people to do all the less-fun parts of programming, because that’s the only way to make the company sustainable. If that money is instead spent on buying favorable TechCrunch/GigaOm editorials or sending “social media experts” to every conference everywhere, then both users and investors are being misled. It’s supposed to be a business plan, not a “cash out before the bubble pops” plan.

Plugin Initialization
http://pl.atyp.us/hekafs.org/index.php/2013/02/plugin-initialization/
Jeff Darcy, Fri, 01 Feb 2013 14:41:49 +0000

Like many programs, GlusterFS has a plugin interface. We call them translators, but the idea is the same – use dlopen to load modules according to configuration, use dlsym to find functions with certain “well known” names, then call those. The problem is that in our case init does the initialization for a single translator object, of which there might be multiple associated with a single shared library, and sometimes you want to do some initialization exactly once for the library as a whole. It’s a common problem, and there are multiple solutions. Let’s go through a few of them.

If all you’re worried about is multiple sequential initialization calls, you have a very easy, robust, and portable solution: a simple static variable.

    static int is_inited = 0;
    if (!is_inited) {
        is_inited = 1;
        do_one_time_stuff();
    }

In fact I think this would even work for the GlusterFS translator case, because of the way that translators’ init functions get called. The problem is that you’ll often need to deal with multiple concurrent initialization calls as well. For that, you need to do something slightly more complicated.

    static int is_inited = 0;
    static pthread_mutex_t lock_mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&lock_mutex);
    if (!is_inited) {
        is_inited = 1;
        do_one_time_stuff();
    }
    pthread_mutex_unlock(&lock_mutex);

So far, so good. As it turns out, this is such a common need that pthreads even includes a simpler method.

    static pthread_once_t init_once = PTHREAD_ONCE_INIT;
    pthread_once(&init_once, do_one_time_stuff);

This is simple, and it should work even for multiple concurrent calls on any platform that has pthreads. A less obvious benefit is that you can decide first whether you want to do this particular initialization at all. For example, the thing that got me looking at this was the SSL code in GlusterFS. There, we only want to do our OpenSSL initialization if and when we encounter the first connection that uses SSL, and usually there are none. All of the solutions we’ve looked at so far can handle this, so why are we even talking about it? Let’s consider a method that doesn’t handle this as well.

    void __attribute__((constructor))
    do_one_time_stuff (void)
    { /* one-time initialization for the whole library goes here */ }

Yes, the double parens are necessary. Also, it might look like a C++ thing but it works for C as well. This will cause do_one_time_stuff to be called automatically when the library is loaded. You can even wrap up the __attribute__ weirdness in a macro, so it’s even simpler than the pthread_once method. What’s not to like? Portability is one concern. I’d be a bit surprised if LLVM doesn’t support this feature, but it wouldn’t be a total shocker. With other compilers it would even seem quite likely. On the other hand, if you’re using some kind of non-pthreads threading that doesn’t have a pthread_once equivalent, but your compiler does support an __attribute__ equivalent (perhaps with a different syntax) then this might be just the thing for you. Remember, though, that you’ll be giving up the ability to do certain types of initialization conditionally based on run-time state. In those cases you’d be better off with our fifth and last method, which has to be implemented in your main program rather than the plugin itself.

    if (ssl_is_needed()) {
        pi_init = (ssl_init_func_t *)dlsym(dl_handle,"plugin_ssl_init");
        if (pi_init) {

That’s simple (for the plugin author), robust, and portable to any platform that can run your main program. The snippet above doesn’t handle threading issues, which might also be a concern for you. The other disadvantage is that it’s very special-purpose. Instead of plugin authors adding initialization functions as they need to, the core author has to add a special hook each time. That’s not exactly a technical issue, but the whole nature of a plugin interface is that it allows third-party development so the logistical issue is likely to be quite significant.
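
For completeness, here is a sketch of what the whole hook might look like on the main-program side. The signature of ssl_init_func_t (no arguments, returns 0 on success), the function name maybe_init_plugin_ssl, and the ssl_needed parameter (standing in for the ssl_is_needed() check) are all assumptions made for this example. Like the snippet above, it does nothing about threading.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Assumed hook signature: no arguments, returns 0 on success. */
typedef int ssl_init_func_t (void);

/*
 * Look up and call the plugin's optional SSL init hook.
 * Returns 1 if the hook ran successfully, 0 if it was skipped (SSL not
 * needed, or the plugin does not export the hook), -1 if the hook failed.
 */
int
maybe_init_plugin_ssl (void *dl_handle, int ssl_needed)
{
    ssl_init_func_t *pi_init;

    assert(dl_handle != NULL);
    if (!ssl_needed) {
        return 0;
    }
    pi_init = (ssl_init_func_t *)dlsym(dl_handle, "plugin_ssl_init");
    if (!pi_init) {
        return 0;               /* plugin has no SSL hook; nothing to do */
    }
    return (pi_init() == 0) ? 1 : -1;
}
```

The main program would call this once with the handle it got from dlopen, right after deciding that the connection needs SSL. On older glibc versions you may need to link with -ldl.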

Finally, let’s look at a solution that doesn’t work, because if I don’t mention it I’m sure someone will present it as the “obvious” answer in the comments.

    _init (void)

This has all the drawbacks of the previous approach, plus one huge show-stopper. The first time you try it, your compiler will probably complain about a conflict with the _init that’s already defined in the standard library. You can work around that in gcc with -nostdlib but then you run into another problem: automake and friends don’t always add or maintain that flag properly for shared libraries. You might be OK or you might not, and anything that might leave people fighting with autobreak has to go on the not-recommended list.

So there you have it – four solutions that work with varying tradeoffs, one that’s redundant (pthread_mutex_init vs. pthread_once), and one that’s clearly inferior (_init vs. __attribute__). Choose carefully.

Update on Ceph vs. GlusterFS
http://pl.atyp.us/hekafs.org/index.php/2013/01/update-on-ceph-vs-glusterfs/
Thu, 17 Jan 2013 01:50:46 +0000, Jeff Darcy

Since my last post has generated a bit of attention, I want to make sure the most important parts are not lost on anyone. First, let me reiterate: I love Ceph. I value Sage as a colleague and as an ally in the real fight. It would sadden me greatly if my comments had an adverse effect on that relationship. Partly I was writing out of frustration at being constantly compared to a vision. I can’t test a vision. The promise is there, but I can only test the reality. Partly I was also writing out of disappointment, because I know Ceph can do better. I have total faith in the quality of their architecture, and in the talent of their team. If there are glitches, they can be fixed. Maybe I’m trying to light a fire under them, but I don’t intend for anyone to get burned.

Second, how about that real fight? As I said, Ceph and GlusterFS are really on the same side here. The real fight is against proprietary storage, non-scalable storage, and functionally deficient storage. Users deserve better. As I’ve said in person many times, we have to win that battle before we squabble over spoils. The Bad Guys will laugh themselves silly if we tear each other apart. Ceph folks, if I have given offense I apologize. That was not my intent. I want us both to win, but to win we have to be honest with users about where we’re strong and where we still need to improve. So let’s measure, and improve, and kick some storage-industry ass together.

GlusterFS vs. Ceph
http://pl.atyp.us/hekafs.org/index.php/2013/01/ceph-notes/
Tue, 15 Jan 2013 02:58:07 +0000, Jeff Darcy

Everywhere I go, people ask me about Ceph. That’s hardly surprising, since we’re clearly rivals – which by definition means we’re not enemies. In fact I love Ceph and the people who work on it. The enemy is expensive proprietary Big Storage. The other enemy is things like HDFS that were built for one thing and are only good for one thing but get hyped relentlessly as alternatives to real storage. Ceph and GlusterFS, by contrast, have a lot in common. Both are open source, run on commodity hardware, do internal replication, scale via algorithmic file placement, and so on. Sure, GlusterFS uses ring-based consistent hashing while Ceph uses CRUSH, GlusterFS has one kind of server in the file I/O path while Ceph has two, but they’re different twists on the same idea rather than two different ideas – and I’ll gladly give Sage Weil credit for having done much to popularize that idea.

It should be no surprise, then, that I’m interested in how the two compare in the real world. I ran Ceph on my test machines a while ago, and the results were very disappointing, but I wasn’t interested in bashing Ceph for not being ready so I didn’t write anything then. Lately I’ve been hearing a lot more about how it’s “nearly awesome” so I decided to give it another try. At first I tried to get it running on the same machines as before, but the build process seems very RHEL-unfriendly. Actually I don’t see how duplicate include-file names and such are distro-specific, but the makefile/specfile mismatches and hard dependency on Oracle Java seem to be. I finally managed to get enough running to try the FUSE client, at least, only to find that it inappropriately ignores O_SYNC so those results were meaningless. Since the FUSE client was only slightly interesting and building the kernel client seemed like a lost cause, I abandoned that effort and turned to the cloud.
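
To make it clear why O_SYNC matters so much here: in a synchronous-write test, each write is supposed to return only after the data has reached stable storage, so a client that ignores the flag is really measuring its cache. The following is a minimal single-threaded sketch of that kind of measurement. It is not the tool used to produce the numbers in this post, and the function name, fixed seed, and parameters are all illustrative.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE 4096

/*
 * Issue nwrites synchronous 4KB writes at random block-aligned offsets
 * within the first file_blocks blocks of path.
 * Returns writes per second, or -1.0 on error.
 */
double
sync_write_iops (const char *path, int nwrites, int file_blocks)
{
    char buf[BLOCK_SIZE];
    struct timespec t0, t1;
    double secs;
    int fd;

    assert(nwrites > 0 && file_blocks > 0);
    memset(buf, 0xA5, sizeof(buf));

    /* O_SYNC: each write should not return until the data is stable. */
    fd = open(path, O_CREAT | O_WRONLY | O_SYNC, 0644);
    if (fd < 0) {
        return -1.0;
    }

    srand(12345);               /* fixed seed so runs are comparable */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nwrites; i++) {
        off_t off = (off_t)(rand() % file_blocks) * BLOCK_SIZE;
        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf)) {
            close(fd);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    secs = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (secs > 0.0) ? nwrites / secs : -1.0;
}
```

Running the same function with O_SYNC removed from the open call gives a buffered baseline. If the two results are about the same on a given mount, the flag is probably being ignored.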

For these tests I used a pair of 8GB cloud servers that I’ve clocked at around 5000 synchronous 4KB IOPS (2400 buffered 64KB IOPS) before, plus a similar client. The very first thing I did was verify that local performance still matched what I’d measured before. Oddly, one of the servers was right in that ballpark, but the other was consistently about 30% slower. That’s something to consider in the numbers that follow. In any case, I installed Ceph “Argonaut” and GlusterFS 3.2 because those were the versions that were already packaged. Both projects have improved since then, which is another thing to consider. Let’s look at the boring number first – buffered sequential 64KB IOPS.

Async 64KB graph

No clear winner here. The averages are quite similar, but of course you can see that the GlusterFS numbers are much more consistent. Let’s look at the graph that will surprise people – synchronous random 4KB IOPS.

Sync 4KB graph

Oh, my. This is a test that one would expect Ceph to dominate, what with that kernel client to reduce latency and all. I swear, I double- and triple-checked to make sure I hadn’t reversed the numbers. My best guess at this point is that the FUSE overhead unique to GlusterFS is overwhelmed by some other kind of overhead unique to Ceph. Maybe it’s the fact that Ceph has to contact two servers at the filesystem and object (RADOS) layers for some operations, while GlusterFS only has a single round trip. That’s just a guess, though. The important thing here is that a lot of people assume Ceph will outperform GlusterFS because of what’s written in a paper, but what’s written in the code tells a different story.

Just for fun, I ran one more set of tests to see if the assumptions about FUSE overhead at least held true for metadata operations – specifically directory listings. I created 10K files, did both warm and cold listings, and removed them. Here are the results in seconds.

    Operation       Ceph (s)   GlusterFS (s)
    create          109.320    184.241
    cold listing      0.889      9.844
    warm listing      0.682      8.523
    delete           93.748     77.334

Not too surprisingly, Ceph beat GlusterFS in most of these tests – more than 10x for directory listings. We really do need to get those readdirp patches in so that directory listings through FUSE aren’t quite so awful. Maybe we’ll need something else too; I have a couple of ideas in that area, but nothing I should be talking about yet. The real surprise was the last test, where GlusterFS beat Ceph on deletions. I noticed during the test that Ceph was totally hammering the servers – over 200% CPU utilization for the Ceph server processes, vs. less than a tenth of that for GlusterFS. Also, the numbers at 1K files weren’t nearly as bad. I’m guessing again, but it makes me wonder whether something in Ceph’s delete path has O(n²) behavior.
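
For anyone who wants to try a similar metadata test on their own mount, here is a rough sketch of the create/list/delete sequence described above. It is not the script I used; the function names and the scratch-directory naming are made up. A true cold listing also requires dropping caches or remounting between steps, which this sketch does not attempt, so it just lists twice.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double
now_sec (void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Count entries in dir, skipping "." and ".."; returns -1 on error. */
static int
list_files (const char *dir)
{
    struct dirent *de;
    int count = 0;
    DIR *d = opendir(dir);

    if (!d) {
        return -1;
    }
    while ((de = readdir(d)) != NULL) {
        if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0) {
            count++;
        }
    }
    closedir(d);
    return count;
}

/*
 * Create, list twice, and delete nfiles empty files in a fresh scratch
 * directory under parent. Fills secs[] with the create, first-listing,
 * second-listing, and delete times. Returns the number of files the
 * listing saw, or -1 on error.
 */
int
run_metadata_test (const char *parent, int nfiles, double secs[4])
{
    char dir[1024], path[1200];
    double t;
    int i, fd, seen;

    assert(nfiles > 0);
    if (snprintf(dir, sizeof(dir), "%s/mdtestXXXXXX", parent) >= (int)sizeof(dir)) {
        return -1;
    }
    if (mkdtemp(dir) == NULL) {
        return -1;
    }

    t = now_sec();
    for (i = 0; i < nfiles; i++) {
        snprintf(path, sizeof(path), "%s/f%d", dir, i);
        fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0) {
            return -1;
        }
        close(fd);
    }
    secs[0] = now_sec() - t;

    t = now_sec();
    seen = list_files(dir);
    secs[1] = now_sec() - t;

    t = now_sec();
    if (list_files(dir) != seen) {
        return -1;
    }
    secs[2] = now_sec() - t;

    t = now_sec();
    for (i = 0; i < nfiles; i++) {
        snprintf(path, sizeof(path), "%s/f%d", dir, i);
        if (unlink(path) != 0) {
            return -1;
        }
    }
    secs[3] = now_sec() - t;

    rmdir(dir);
    return seen;
}
```

Point parent at a directory on the filesystem under test and pass 10000 for nfiles to match the test above. Running it at a few different sizes is a cheap way to see whether deletion time grows faster than linearly.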

So, what can we conclude from all of this? Not much, really. These were really quick and dirty tests, so they don’t prove much. It’s more interesting what they fail to prove, i.e. that Ceph’s current code is capable of realizing any supposed advantage due to its architecture. Either those advantages aren’t real, or the current implementation isn’t mature enough to demonstrate them. It’s also worth noting that these results are pretty consistent with both Ceph’s own Argonaut vs. Bobtail performance preview and my own previous measurements of a block-storage system I’ve been told is based on Ceph. I’ve seen lots of claims and theories about how GlusterFS is going to be left in the dust, but as yet the evidence seems to point (weakly) the other way. Maybe we should wait until the race has begun before we start predicting the result.
