Archive for February, 2013

Gdb Macros For GlusterFS

In previous jobs, especially at Revivio, I’ve spent a pretty fair amount of time creating gdb macros to make the inevitable debugging sessions a bit more productive. I’ve generally tried to stay away from that on GlusterFS, partly because there are usually better ways to debug the sorts of problems I have to deal with and partly because gdb macros are one of those things that will make you ill if you know anything about real scripting languages. For example, you can define recursive macros, but convenience variables are always global so you basically can’t use those. Instead, you have to take advantage of the fact that macro arguments are local and rely exclusively on those instead. What you end up with is this grossly inefficient and unreadable tail-recursive mess, just to work around the macro language’s deficiencies. You’ll see what I mean in a minute, but let’s start with something simple – printing out the contents of a dictionary.

define pdict
	set $curr = $arg0->members_list
	while $curr
		printf "%s = %p %s\n", $curr->key, $curr->value, $curr->value->data
		set $curr = $curr->next
	end
end

That’s not too bad. Now let’s look at one to print out some essential information about a translator.

define pxlator
	printf "--- xlator %s type %s\n", $arg0->name, $arg0->type
	set $d = $arg0->options->members_list
	while $d
		printf "    option %s = %s\n", $d->key, $d->value->data
		set $d = $d->next
	end
	set $x = $arg0->children
	while $x
		printf "    subvolume %s\n", $x->xlator->name
		set $x = $x->next
	end
end

Now things get a bit hairier. What if we wanted to print out a translator and all of its descendants? This is where that global vs. local issue comes back to bite us, because any convenience variable we use to traverse our own descendant list will also be used in each of them to traverse their own descendant lists, and finding our parent’s next sibling when we’ve finished traversing such a list is really ugly. Instead, we end up with this.

define ptrav
	pxlator $arg0->xlator
	if $arg0->xlator->children
		ptrav $arg0->xlator->children
	end
	if $arg0->next
		ptrav $arg0->next
	end
end
define pgraph
	pxlator $arg0
	if $arg0->children
		ptrav $arg0->children
	end
end

As you can see, ptrav has that ugly tail-recursive structure we talked about. The same thing happens when we try to print out a DHT layout structure.

define playout_ent
	if $arg1 < $arg2
		set $ent = $arg0[$arg1]
		printf "  err=%d, start=0x%x, stop=0x%x, xlator=%s\n", \
			$ent.err, $ent.start, $ent.stop, $ent.xlator->name
		playout_ent $arg0 $arg1+1 $arg2
	end
end
define playout
	printf "spread_cnt=%d\n", $arg0->spread_cnt
	printf "cnt=%d\n", $arg0->cnt
	printf "preset=%d\n", $arg0->preset
	printf "gen=%d\n", $arg0->gen
	printf "type=%d\n", $arg0->type
	printf "search_unhashed=%d\n", $arg0->search_unhashed
	playout_ent $arg0->list 0 $arg0->cnt
end
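
For contrast, here’s a rough sketch of the same graph walk in C, where the loop variable is local to each call, so plain recursion over the child list just works. The structures are hypothetical, pared-down stand-ins containing only the fields the macros above touch (the real GlusterFS definitions have many more), and visit and walk_graph are names made up for this example. visit appends to a buffer instead of printing, purely so the visiting order is easy to check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: only the fields used by pxlator/ptrav/pgraph. */
struct xlator;

struct xlator_list {
    struct xlator      *xlator;
    struct xlator_list *next;
};

struct xlator {
    const char         *name;
    struct xlator_list *children;
};

/* Stands in for pxlator: appends "name " to out, truncating if full. */
static void visit(const struct xlator *x, char *out, size_t outlen)
{
    strncat(out, x->name, outlen - strlen(out) - 1);
    strncat(out, " ", outlen - strlen(out) - 1);
}

/* Visit a translator, then each child in order. Because c is local to
 * this call, the recursive calls cannot clobber our position in the
 * child list, which is exactly what a gdb convenience variable can't
 * promise. */
static void walk_graph(const struct xlator *x, char *out, size_t outlen)
{
    visit(x, out, outlen);
    for (const struct xlator_list *c = x->children; c != NULL; c = c->next)
        walk_graph(c->xlator, out, outlen);
}
```

Given a root with children a and b, where a has a single child c, this visits root, a, c, b, which is the same order pgraph and ptrav produce by passing the list position along as a macro argument.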

I’ve really just started defining these, so if you have some suggestions please let me know. Otherwise, you can use them by just copying and pasting into your .gdbinit or (better yet) into a separate file that you can “source” only when you’re debugging GlusterFS. Share and enjoy. ;)



I got back from the USENIX File And Storage Technologies conference yesterday. It actually ended on Friday, but I don’t much care for flight schedules that get me to Boston well after midnight so I stayed in San Jose for the rest of the day. As always, it was an awesome conference. The mix of familiar old-timers and enthusiastic newcomers, all focused on making data storage better, is very energizing.

My own cloud-storage tutorial on Tuesday went well. We had a bit of trouble with the projection video at the beginning, but the USENIX folks are real pros so we got through it. I also had a moment of terror near the beginning when I got a persistent tickle in my throat (it is still flu season after all) and started to wonder if my voice would hold out for the full three hours, but I was able to get things under control. Thanks to the feedback from last year I think it was a much stronger presentation this time, and the general audience engagement seemed much higher. We spent the entire half-hour break continuing discussions in the middle of the room, and throughout the conference – even on the last day – people were still coming up to me to talk about the material. I’m grateful to everyone who made it possible, and everyone who made it awesome.

With that, it was on with the real show – the refereed papers, plus work-in-progress talks, birds-of-a-feather sessions, and posters. To be quite honest, there seemed to be a lower percentage of papers this year that were interesting to me personally, but that’s more a reflection of my interests than of the papers themselves, which were all excellent. Here’s the full list.

FAST ’13 Technical Sessions
Posters and Work-in-Progress Reports (WiPs)

From my own perspective as a filesystem developer, here are some of my own favorites.

  • A Study of Linux File System Evolution (Lu et al, Best Paper [full length])
    Maybe not as much fun as some of the algorithmic stuff in other papers, but I’m very excited by the idea of making an empirical, quantitative study of how filesystems evolve. I’m sure there will be many followups to this.
  • Unioning of the Buffer Cache and Journaling Layers with Non-volatile Memory (Lee et al, Best Paper [short]).
    Devices that combine memory-like performance and byte addressability with persistence are almost here, and figuring out how to use them most effectively is going to be very important over the next few years. This paper’s observations about how to avoid double buffering between an NVM-based cache and an on-disk journal are worth looking into.
  • Radio+Tuner: A Tunable Distributed Object Store (Perkins et al, poster and WiP)
    It might be about object stores, but the core idea – dynamically selecting algorithms within a storage system based on simulation of expected results for a given workload – could be applied to filesystems as well.
  • Gecko: Contention-Oblivious Disk Arrays for Cloud Storage (Shin et al).
    This is my own personal favorite, and why I’m glad I stayed for the very last session. They present a very simple but powerful way to avoid the “segment cleaning” problem of log-structured filesystems by using multiple disks. Then, as if that’s not enough, they use SSDs in a very intelligent way to boost read performance even further without affecting writes.

I can definitely see myself trying to apply some of the ideas from Tuner and Gecko to my own work. Who knows? Maybe I’ll even come up with something interesting enough to submit my own paper next year. There’s a lot more good stuff there, too, so check out the full proceedings.


Two Kinds of Open Source

There are two schools of thought about when you should release open-source code. One school says you should release it as early as it has any chance whatsoever of being useful or informative to other people. The other school says that if you can’t commit to doing it right – proper source control, packaging, bug tracker, etc. – then don’t bother. I’ve come to the conclusion that both views are right. The release-early approach is essential to maximizing collaboration, while the do-it-right approach maximizes user-friendliness. One is developer-focused while the other is user-focused, so the key is to be prepared to do both at the right times during the project’s lifetime. In fact, I’ll suggest a very specific point where you should switch.

  • Individuals contributing their own time (or sponsored to do stuff on some other not-for-immediate-profit basis) should use the release-early approach.
  • Companies actually selling stuff for real money (even if a free version is available) should use the do-it-right approach.

To illustrate the reason for this “mode switch” consider the two ways people get this wrong. On the one hand, you have developers who play stuff too close to the vest, who won’t release their code until it’s a perfectly polished gem. This is actually an urge I have to fight myself, even for code that’s only “released” to my own team. I don’t want to put code out there that might make people think I’m stupid or careless, so I keep it to myself until I’m at least sure that it’s sane. The problem with this approach is that it relies too much on the individual programmer’s motivation and ability to make progress. Lose that ability, and what might have been a useful idea sinks without a trace. Getting stuff out early at least makes it possible that someone else will get it over that immediate roadblock so that progress can continue.

The other way to screw it up is to form a company around an idea and still treat it as a throwaway. This is kind of what jwz was complaining about recently, and that’s what inspired this post. Examples are legion of companies that just toss their crap up on github not just with a “hey, this might be interesting” but with “you should be using this instead of that other thing” and then fail to stand behind it in any meaningful way. That’s getting a bit too close to fraud for my tastes. The reason companies get and accept funding is supposed to be (largely) to develop that infrastructure and hire the people to do all the less-fun parts of programming, because that’s the only way to make the company sustainable. If that money is instead spent on buying favorable TechCrunch/GigaOm editorials or sending “social media experts” to every conference everywhere, then both users and investors are being misled. It’s supposed to be a business plan, not a “cash out before the bubble pops” plan.


Plugin Initialization

Like many programs, GlusterFS has a plugin interface. We call them translators, but the idea is the same – use dlopen to load modules according to configuration, use dlsym to find functions with certain “well known” names, then call those. The problem is that in our case init does the initialization for a single translator object, of which there might be multiple associated with a single shared library, and sometimes you want to do some initialization exactly once for the library as a whole. It’s a common problem, and there are multiple solutions. Let’s go through a few of them.

If all you’re worried about is multiple sequential initialization calls, you have a very easy, robust, and portable solution: a simple static variable.

    static int is_inited = 0;
    if (!is_inited) {
        is_inited = 1;
        do_one_time_stuff();
    }

In fact I think this would even work for the GlusterFS translator case, because of the way that translators’ init functions get called. The problem is that you’ll often need to deal with multiple concurrent initialization calls as well. For that, you need to do something slightly more complicated.

    static int is_inited = 0;
    static pthread_mutex_t lock_mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&lock_mutex);
    if (!is_inited) {
        is_inited = 1;
        do_one_time_stuff();
    }
    pthread_mutex_unlock(&lock_mutex);

So far, so good. As it turns out, this is such a common need that pthreads even includes a simpler method.

    static pthread_once_t init_once = PTHREAD_ONCE_INIT;
    pthread_once(&init_once, do_one_time_stuff);
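
Spelled out a little more fully, that might look something like the sketch below. Everything here other than the pthreads calls is made up for illustration: do_one_time_stuff stands in for the library-wide setup, xlator_init for a per-translator init function, and init_calls and run_inits exist only so we can check how many times the setup actually ran. The uses_ssl check is there to show that the pthread_once call can sit behind a run-time condition.

```c
#include <assert.h>
#include <pthread.h>

static pthread_once_t init_once = PTHREAD_ONCE_INIT;
static int init_calls = 0;   /* bookkeeping for the example only */

/* Hypothetical stand-in for library-wide setup (e.g. OpenSSL init). */
static void do_one_time_stuff(void)
{
    init_calls++;
}

/* Hypothetical per-translator init; many may run concurrently. */
static void *xlator_init(void *arg)
{
    int uses_ssl = *(int *)arg;

    if (uses_ssl)
        pthread_once(&init_once, do_one_time_stuff);
    return NULL;
}

/* Run up to 16 concurrent init calls, wait for them all, and return
 * how many times the one-time setup has run so far. */
static int run_inits(int n, int uses_ssl)
{
    pthread_t threads[16];
    int created = 0;

    if (n > 16)
        n = 16;
    while (created < n &&
           pthread_create(&threads[created], NULL, xlator_init, &uses_ssl) == 0)
        created++;
    for (int i = 0; i < created; i++)
        pthread_join(threads[i], NULL);
    return init_calls;
}
```

With eight threads and uses_ssl set to zero, the setup never runs. With uses_ssl non-zero it runs exactly once, no matter how many threads race to get there or how many times you repeat the exercise.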

This is simple, and it should work even for multiple concurrent calls on any platform that has pthreads. A less obvious benefit is that you can decide first whether you want to do this particular initialization at all. For example, the thing that got me looking at this was the SSL code in GlusterFS. There, we only want to do our OpenSSL initialization if and when we encounter the first connection that uses SSL, and usually there are none. All of the solutions we’ve looked at so far can handle this, so why are we even talking about it? Let’s consider a method that doesn’t handle this as well.

    void __attribute__((constructor))
    do_one_time_stuff (void)
    {   /* one-time initialization goes here */
    }

Yes, the double parens are necessary. Also, it might look like a C++ thing but it works for C as well. This will cause do_one_time_stuff to be called automatically when the library is loaded. You can even wrap up the __attribute__ weirdness in a macro, so it’s even simpler than the pthread_once method. What’s not to like? Portability is one concern. I’d be a bit surprised if LLVM doesn’t support this feature, but it wouldn’t be a total shocker. With other compilers it would even seem quite likely. On the other hand, if you’re using some kind of non-pthreads threading that doesn’t have a pthread_once equivalent, but your compiler does support an __attribute__ equivalent (perhaps with a different syntax) then this might be just the thing for you. Remember, though, that you’ll be giving up the ability to do certain types of initialization conditionally based on run-time state. In those cases you’d be better off with our fifth and last method, which has to be implemented in your main program rather than the plugin itself.

    if (ssl_is_needed()) {
        pi_init = (ssl_init_func_t *)dlsym(dl_handle,"plugin_ssl_init");
        if (pi_init) {
            pi_init();
        }
    }

That’s simple (for the plugin author), robust, and portable to any platform that can run your main program. The snippet above doesn’t handle threading issues, which might also be a concern for you. The other disadvantage is that it’s very special-purpose. Instead of plugin authors adding initialization functions as they need to, the core author has to add a special hook each time. That’s not exactly a technical issue, but the whole nature of a plugin interface is that it allows third-party development so the logistical issue is likely to be quite significant.

Finally, let’s look at a solution that doesn’t work, because if I don’t mention it I’m sure someone will present it as the “obvious” answer in the comments.

    void _init (void) { do_one_time_stuff(); }

This has all the drawbacks of the previous approach, plus one huge show-stopper. The first time you try it, your compiler will probably complain about a conflict with the _init that’s already defined in the standard library. You can work around that in gcc with -nostdlib but then you run into another problem: automake and friends don’t always add or maintain that flag properly for shared libraries. You might be OK or you might not, and anything that might leave people fighting with autobreak has to go on the not-recommended list.

So there you have it – four solutions that work with varying tradeoffs, one that’s redundant (pthread_mutex_init vs. pthread_once), and one that’s clearly inferior (_init vs. __attribute__). Choose carefully.