Server Design in Serbo-Croatian

Ten and a half years ago, I wrote an article on server design. Considering that I probably worked harder on that than on anything I’ve posted since, I’m pleased that it has continued to be one of the most popular articles on the site despite its increasing age. It’s even more gratifying to know that some people are including it in their academic course materials – maybe half a dozen instances that I know of. Now, thanks to Anja Skrba, it has been translated into Serbo-Croatian. I’m totally unqualified to comment on the accuracy of the translation, but it’s a commendable contribution to the community nonetheless. Feedback on either the content or the translation would surely be appreciated by both of us. Thank you, Anja!

How (Not) To Collaborate

Collaboration is one of the most essential human skills, not just in work but in life generally, and yet it’s poorly taught (if at all) and a lot of people are bad at it. Programmers are especially bad at it, for a whole variety of reasons, and this last week has been like a crash course in just how bad. Collaboration means exchanging ideas. Here’s how I have seen people fail to participate in such exchanges recently.

  • Passive ignoring. No response at all.
  • Active ignoring. Nod, smile, put it on a list to die. This is what a lot of people do when they’ve been told they need to work on their collaboration skills, and want to create an appearance of collaboration without actually working at it.
  • Rejection. All variants of “no” and “what a terrible idea” and “my idea’s better” fall into this category.
  • “It’s my idea now.” The obvious version is just presenting the idea unchanged, as one’s own. The sneakier alternative is to tweak it a little, or re-implement it, so it’s not obvious it’s the same, but still present derivative work without credit to the original.
  • “It’s your problem now.” This is probably the most insidious of all. It presents an appearance of accession, but in fact no exchange of ideas has occurred. Just as importantly, the person doing this has presumed unilateral authority to decide whose problem it is, creating an unequal relationship.

The key to real collaboration is not only to accept a single idea itself, but to facilitate further exchange. Here are some ways to make that work.

  • Accept the context. Respect the priority that the other person gives to the idea along with the idea itself. Assume some responsibility for facilitating it. Don’t force people to remind, re-submit or nag before you’ll really consider what they’re suggesting. Both active and passive ignoring are wrong because they violate this principle.
  • Don’t attach strings. Don’t make people jump through unnecessary hoops, or demand that they assume responsibility for more than the subject of their idea, just to have their idea considered. Obviously, “your problem now” and its cousin “you touch it you own it” violate this rule. I’ve left more jobs because of this tendency, which leaves people shackled to responsibilities they never asked for, than for any other reason. I don’t think I’m the only one.
  • Be a teacher, not a judge. Every opportunity for rejection is also an opportunity for teaching. If there’s something truly wrong with an idea, you should be able to explain the problem in such a way that everyone benefits. You owe it to your team or your community or even your friends and family to develop this skill.
  • Give credit. It will come back to you. People rarely give freely to notorious thieves and hoarders.

Note that I’m not making any appeals to morality here. I’m not saying it’s right to make collaboration easier. I’m saying it’s practical. When you make collaboration with you easy and pleasant, people want to do it more. That frees you to work on the problems that most interest you, and share credit for a successful project instead of getting no credit at all for a failed or stagnant one. When people try to do you a favor, try to accept graciously.

Is Eventual Consistency Useful?

Every once in a while, somebody comes up with the “new” idea that eventually consistent systems (or AP in CAP terminology) are useless. Of course, it’s not really new at all; the SQL RDBMS neanderthals have been making this claim-without-proof ever since NoSQL databases brought other models back into the spotlight. In the usual formulation, banks must have immediate consistency and would never rely on resolving conflicts after the fact . . . except that they do and have for centuries.

Most recently but least notably, this same line of non-reasoning has been regurgitated by Emin Gün Sirer in The NoSQL Partition Tolerance Myth and You Might Be A Data Radical. I’m not sure you can be a radical by repeating a decades-old meme, but in amongst the anti-NoSQL trolling there’s just enough of a nugget of truth for me to use as a launchpad for some related thoughts.

The first thought has to do with the idea of “partition oblivious” systems. EGS defines “partition tolerance” as “a system’s overall ability to live up to its specification in the presence of network partitions” but then assumes one strongly-consistent specification for the remainder. That’s a bit of assuming the conclusion there; if you assume strong consistency is an absolute requirement, then of course you reach the conclusion that weakly consistent systems are all failures. However, what he euphemistically refers to as “graceful degradation” (really refusing writes in the presence of a true partition) is anything but graceful to many people. In a comment on Alex Popescu’s thread about this, I used the example of sensor networks, but there are other examples as well. Sometimes consistency is preferable and sometimes availability is. That’s the whole essence of what Brewer was getting at all those years ago.

Truly partition-oblivious systems do exist, as a subset of what EGS refers to that way. I think it’s a reasonable description of any system that not only allows inconsistency but has a weak method of resolving conflicts. “Last writer wins” and “latest timestamp” both fall into this category. However, even those have been useful to many people over the years. From early distributed filesystems to very current file-synchronization services like Dropbox, “last writer wins” has proven quite adequate for many people’s needs. Beyond that there is a whole family of systems that are not oblivious to partitions so much as they respond differently to them. Any system that uses vector clocks or version vectors, for example, is far from oblivious. The partition was very much recognized, and very conscious decisions were made to deal with it. In some systems – Coda, Lotus Notes, Couchbase – this even includes user-specified conflict resolution that can accommodate practically any non-immediate consistency need. Most truly partition-oblivious systems – the ones that don’t even attempt conflict resolution but instead just return possibly inconsistent data from whichever copy is closest – never get beyond a single developer’s sandbox, so they’re a bit of a strawman.
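
To make the distinction concrete, here is a minimal sketch of how a version-vector comparison can tell "one copy is simply newer" apart from "these copies diverged during a partition", which is exactly the information that "last writer wins" throws away. It assumes bash 4 or later (for associative arrays). The "node=count,node=count" encoding and the vv_compare name are made up for illustration and are not taken from any of the systems named above.

```shell
#!/usr/bin/env bash
# Compare two version vectors written as "node=count,node=count".
# Prints one of: equal, before, after, concurrent.
# Missing nodes count as zero.  Encoding and names are illustrative only.
vv_compare() {
	local -A left=() right=() nodes=()
	local pair node l r le=1 ge=1
	local IFS=,
	for pair in $1; do
		left[${pair%%=*}]=${pair#*=}
		nodes[${pair%%=*}]=1
	done
	for pair in $2; do
		right[${pair%%=*}]=${pair#*=}
		nodes[${pair%%=*}]=1
	done
	for node in "${!nodes[@]}"; do
		l=${left[$node]:-0}
		r=${right[$node]:-0}
		# le stays 1 only if left <= right for every node; ge likewise.
		if (( l > r )); then le=0; fi
		if (( l < r )); then ge=0; fi
	done
	if (( le && ge )); then
		echo equal
	elif (( le )); then
		echo before
	elif (( ge )); then
		echo after
	else
		# Neither side dominates: both were updated independently.
		echo concurrent
	fi
}
```

For example, vv_compare "a=2,b=1" "a=2,b=2" reports "before", so the first copy can safely be replaced, while vv_compare "a=3,b=1" "a=2,b=2" reports "concurrent", which is the cue to invoke whatever conflict resolution the system or the user has defined.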

Speaking of developers’ sandboxes, I think distributed version control is an excellent example of where eventual consistency does indeed provide great value to users. From RCS and SCCS through CVS and Subversion, version control was a very transactional, synchronous process – lock something by checking it out, work on it, release the lock by checking in. Like every developer I had to deal with transaction failures by manually breaking these locks many times. As teams scaled up in terms of both number of developers and distribution across timezones/schedules, this “can’t make changes unless you can ensure consistency” model broke down badly. Along came a whole generation of distributed systems – git, hg, bzr, and many others – to address the need. These systems are, at their core, eventually consistent databases. They allow developers to make changes independently, and have robust (though admittedly domain-specific) conflict resolution mechanisms. In fact, they solve the divergence problem so well that they treat partitions as a normal case rather than an exception. Clearly, EGS’s characterization of such behavior as “lobotomized” (technically incorrect even in a medical sense BTW since the operation he’s clearly referring to is actually a corpus callosotomy) is off base since a lot of people at least as smart as he is derive significant value from it.

That example probably only resonates with programmers, though. Let’s find some others. How about the process of scientific knowledge exchange via journals and conferences? Researchers generate new data and results independently, then “commit” them to a common store. There’s even a conflict-resolution procedure, domain-specific just like the DVCS example but nonetheless demonstrably useful. This is definitely better than requiring that all people working on the same problem or dataset remain in constant communication or “degrade gracefully” by stopping work. That has never worked, and could never work, to facilitate scientific progress. An even more prosaic example might be the way police share information about a fleeing suspect’s location, or military units share similar information about targets and threats. Would you rather have possibly inconsistent/outdated information, or no information at all? Once you start thinking about how the real world works, eventual consistency pops up everywhere. It’s not some inferior cousin of strong consistency, some easy way out chosen only by lazy developers. It’s the way many important things work, and must work if they’re to work at all. It’s really strong/immediate consistency that’s an anomaly, existing only in a world where problems can be constrained to fit simplistic solutions. The lazy developers just throw locks around things, over-serialize, over-synchronize, and throw their hands in the air when there’s a partition.

Is non-eventual consistency useful? That might well be the more interesting question.

Be a Better Raindrop

no single raindrop believes it is to blame for the flood

The computing industry is already awash in condescension and negativity, and it’s getting worse. Yes, I know it’s not a new phenomenon, but I’ve been around long enough to be sure of the trend. I’ve been blogging for over a decade, I was on Usenet even longer than that before, and I was on other forums even before that. I know all about operating-system wars, language wars, editor wars, license wars, and their ilk. I’ve fought many of those wars myself. Still, things seem to be getting worse. Practically no technical news nowadays comes unaccompanied by a chorus of hatred from those who prefer alternatives. Half the time I find out about something that’s really pretty cool only because I see the bitching about it. How sad is that?

The thing is, it really doesn’t matter why people act this way. Yes, some people are just basically spiteful or insecure. Others might think they’re acting from more noble motives, such as bursting a hype bubble or squashing an idea they believe is truly dangerous. Half of the articles on this site are based in such motivations, so I’m by no means claiming innocence. The problem is that even the best-motivated snark still contributes to the generally unpleasant atmosphere. Contrary to popular belief, we techies are social animals. We have our own equivalent of the Overton Window. Every Linus eruption or similar event from a perceived leader shifts that window toward a higher spleen-to-brain ratio. Others emulate that example, and the phenomenon reinforces itself. Those of us who are older, who are leaders, who find ourselves quoted often, owe it to the community not to keep shifting that window in the wrong direction. That’s not being “honest” or “blunt” or “clear” either, if your honesty/bluntness/clarity is only apparent when your comments are negative. Real life is not one-sided. If your commentary is, then you’re not being any of those things. You’re just being part of the problem.

[Image: Linus not helping]

No one of us caused this and no one of us can fix it. However, we can each try to do better. That’s my New Year’s resolution: to start taking the high road and giving people the benefit of the doubt just a bit more often. Sure, some people might get besotted with a particular idea or technology that I think is inferior, but that doesn’t make them stupid or bad. Some people might get carried away with their praise for a company or its products/people, but that doesn’t make them fanbois or shills. Some people are all of those things, and I’m sure I’ll still let slip the dogs of war from time to time when the occasion warrants it, but I’ll at least try to adopt a doctrine of no first strikes and proportional response instead of the ever escalating verbal violence that is now commonplace. Would anyone else like to give it a try?

Limiting Bash Script Run Time

Another self-explanatory bash hack. This one was developed to limit the total run time of a test script, where one of the commands was hanging but I was trying to chase down a different bug.

# Run code with a time limit.  This is trickier than you'd think, because the alarm
# signal (or any other) won't be delivered to the parent until any foreground task
# completes.  That kind of defeats the purpose here, since a hung task will also
# block the signal we're using to un-hang it.  Fortunately, a directed "wait" gives
# us a way to work around this issue.  We start both the alarm task and the task
# that does real work in the background, then either way we get into exit_handler
# and kill whichever one's still running.  It's a little bit inconvenient that
# everything has to be wrapped in a "main" function to work, but there's a lot about
# bash that's unfortunate.
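function exit_handler {
	# Guard against running the cleanup twice.
	if [ -n "$ALREADY_EXITING" ]; then
		return
	fi
	ALREADY_EXITING=1
	if [ -n "$WATCHER" ]; then
		echo "killing watcher"
		kill $WATCHER 2>/dev/null
	fi
	if [ -n "$WORKER" ]; then
		echo "killing worker"
		kill $WORKER 2>/dev/null
	fi
	echo "time to die"
}
trap exit_handler EXIT
function alrm_handler {
	echo "alarm went off"
	# The watcher has already finished, so there's nothing left to kill.
	unset WATCHER
	exit 1
}
trap alrm_handler ALRM
# Assumed default of 5 seconds if the caller hasn't set TIME_LIMIT.
: ${TIME_LIMIT:=5}
export PARENT_PID=$$
(sleep $TIME_LIMIT; echo "ring ring"; kill -s SIGALRM $PARENT_PID) &
export WATCHER=$!
# Example function to demonstrate different completion sequences.
function main {
	if [ "$1" != 0 ]; then
		echo "sleeping"
		sleep $1
		echo "waking up"
	fi
}
# Run the real work in the background so that "wait" can be interrupted.
main "$@" &
export WORKER=$!
wait $WORKER
unset WORKER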
# Test with shorter sleep times to see the worker finish normally, with longer
# sleep times to see the watcher cut things short.
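
If the thing being limited is an external command rather than a shell function, a shorter route may be available. This is only a sketch, and it assumes GNU coreutils' "timeout" utility is installed (it is not part of bash itself); run_limited is a hypothetical helper name. Unlike the approach above, it can't wrap a "main" function defined in the same script.

```shell
#!/usr/bin/env bash
# Assumes GNU coreutils "timeout" is on the PATH.
# run_limited SECONDS COMMAND [ARGS...]
run_limited() {
	local limit=$1
	shift
	local status=0
	timeout "$limit" "$@" || status=$?
	# GNU timeout exits with status 124 when the time limit was hit.
	if [ "$status" -eq 124 ]; then
		echo "time limit of ${limit}s exceeded" >&2
	fi
	return "$status"
}
```

Calling run_limited 1 sleep 5 returns after about a second with status 124, while run_limited 5 true returns 0 immediately.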

This Is Competition?

As I’m sure you’ve all noticed by now, I’ve become a bit sensitive about people bashing GlusterFS performance. It’s really hard to make even common workloads run well when everything has to go over a network. It’s impossible to make all workloads run well in that environment, and when people blame me for the speed of light I get a bit grouchy. There are a couple of alternatives that I’ve gotten particularly tired of hearing about, not because I fear losing in a fair fight but because I feel that their reality doesn’t match their hype. Either they’re doing things that I think a storage system shouldn’t do, or they don’t actually perform all that well, or both. When I found out that I could get my hands on some systems with distributed block storage based on one of these alternatives, it didn’t take me long to give it a try.

The first thing I did was check out the basic performance of the systems, without even touching the new distributed block storage. I was rather pleased to see that my standard torture test (random synchronous 4KB writes) would ramp very smoothly and consistently up to 25K IOPS. That’s more than respectable. That’s darn good – better IOPS/$ than I saw from any of the alternatives I mentioned last time. So I spun up some of the distributed stuff and ran my tests with high hopes.

[Figure: synchronous IOPS]

Ouch. That’s not totally awful, but it’s not particularly good and it’s not particularly consistent. Certainly not something I’d position as high-performance storage. At the higher thread counts it gets into a range that I wouldn’t be too terribly ashamed of for a distributed filesystem, but remember that this is block storage. There’s a local filesystem at each end, but the communication over the wire is all about blocks. It’s also directly integrated into the virtualization code, which should minimize context switches and copies. Thinking that the infrastructure just might not be handling the hard cases well, I tried throwing an easier test at it – sequential buffered 64KB writes.

[Figure: buffered IOPS]

WTF? That’s even worse than the synchronous result! You can’t see it at this scale, but some of those lower numbers are single digit IOPS. I did the test three times, because I couldn’t believe my eyes, then went back and did the same for the synchronous test. I’m not sure if the consistent parts (such as the nose-dive from 16 to 18 threads all three times) or the inconsistent parts bother me more. That’s beyond disappointing, it’s beyond bad, it’s into shameful territory for everyone involved. Remember 25K IOPS for this hardware using local disks? Now even the one decent sample can’t reach a tenth of that, and that one sample stands out quite a bit from all the rest. Who would pay one penny more for so much less?

Yes, I feel better now. The next time someone mentions this particular alternative and says we should be more like them, I’ll show them how the lab darling fared in the real world. That’s a load off my mind.

The “Gather, Prepare, Cook, Eat” Design Pattern

This post is actually about an anti-pattern, which I’ll call the “grazing” pattern. Code wanders around, consuming new bits of information here and there, occasionally excreting new bits likewise. This works well “in the small” because all of the connections between inputs and outputs are easy to keep in your head. In code that has to make complex and important choices, such as what response to a failure will preserve users’ data instead of destroying it, such a casual approach quickly turns into a disaster. You repeatedly find yourself in some later stage of the code, responsible for initiating some kind of action, and you realize that you might or might not have some piece of information you need based on what code path you took to get there. So you recalculate it, or worse you re-fetch it from its origin. Or it’s not quite the information you need so you add a new field that refers to the old ones, but that gets unwieldy so you make a near copy of it instead (plus code to maintain the two slightly different versions). Sound familiar? That’s how (one kind of) technical debt accumulates.

Yes, of course I have a particular example in mind – GlusterFS’s AFR (Automatic File Replication) translator. There, we have dozens of multi-stage operations, which rely on up to maybe twenty pieces of information – booleans, status codes, arrays of booleans or status codes, arrays of indices into the other arrays, and so on. That’s somewhere between one and ten thousand “is this data current” questions the developer might need to ask before making a change. There’s an awful lot of recalculation and duplication going on, leading to something that is (some coughing, something that might rhyme with “butter frappe”) hard to maintain. This is not a phenomenon unique to this code. It’s how all old code seems to grow without frequent weeding, and I’ve seen the pattern elsewhere more times than I can count. How do we avoid this? That’s where the title comes in.

  • Gather
    Get all of the “raw” information that will be relevant to your decision, in any code path.
  • Prepare
    Slice and dice all the information you got from the real world, converting it into what you need for your decision-making process.
  • Cook
    This is where all the thinking goes. Decide what exactly you’re going to do, then record the decision separately from the data that led to it.
  • Eat
    Execute your decision, using only the information from the previous stage.

The key here is to never go back. Time is an arrow, not a wheel. The developer should iterate on decisions; the code should not. If you’re in the Cook or Eat phase and you feel a need to revisit the Gather or Prepare stages, it means that you didn’t do the earlier stages properly. If you’re worried that gathering all data for all code paths means always gathering some information that the actual code path won’t need, it probably means that you’re not structuring that data in the way that best supports your actual decision process. There are exceptions, and I’m not going to pretend that a structure like this will solve all of your complex-decision problems for you, but what this pattern does is make all of those dependencies obvious enough to deal with. Having dozens of fields that are “private” to particular code flows and ignored by others is how we got into this mess. (Notice how OOP tends to make this worse instead of better?) Having those fields associated with stages is how we get out of it, because the progression between stages is much more obvious than the (future) choice of which code path to follow. All of those artifacts that lead to “do we have that” and “not quite what I wanted” sit right there and stare you in the face instead of lurking in the shadows, so they get fixed.
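
Here is a toy sketch of the four stages, just to show the shape. Everything in it is made up for illustration (the brick names, the version numbers, the function and variable names); it is not GlusterFS code, and a real decision would involve far more state. The point is only that each stage reads what the previous stage produced and nothing flows backward.

```shell
#!/usr/bin/env bash
# Illustrative only: names and data are hypothetical, not GlusterFS code.

# Gather: collect all raw inputs up front, for every code path.
# Arguments look like "brick1:7 brick2:5", meaning name:version.
gather() {
	RAW_VERSIONS="$*"
}

# Prepare: convert the raw data into what the decision needs,
# here each replica's lag behind the newest version.
prepare() {
	local entry newest=0
	for entry in $RAW_VERSIONS; do
		if (( ${entry#*:} > newest )); then newest=${entry#*:}; fi
	done
	LAGS=""
	for entry in $RAW_VERSIONS; do
		LAGS+="${entry%%:*}:$(( newest - ${entry#*:} )) "
	done
}

# Cook: make the decision and record it separately from the data.
cook() {
	local entry
	DECISION=""
	for entry in $LAGS; do
		if (( ${entry#*:} > 0 )); then DECISION+="${entry%%:*} "; fi
	done
	DECISION=${DECISION% }
}

# Eat: execute the decision using only what Cook recorded.
eat() {
	local target
	for target in $DECISION; do
		echo "healing $target"
	done
}
```

Running gather brick1:7 brick2:5 brick3:7, then prepare, cook and eat in that order prints "healing brick2". Note that eat never looks at RAW_VERSIONS or LAGS; if it needed to, that would be the sign that an earlier stage was incomplete.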

Use Big Data For Good

There seems to be a growing awareness that there’s something odd about the recent election. “How did Obama win the presidential race but Republicans get control of the House?” seems to be a common question. People who have never said “gerrymandering” are saying it now. What even I hadn’t realized was this (emphasis mine).

Although the Republicans won 55 percent of the House seats, they received less than half of the votes for members of the House of Representatives.
 – Geoffrey R. Stone

What does this have to do with Big Data? This is not a technical problem. Mostly I think it’s a problem that needs to be addressed at the state level, for example by passing ballot measures requiring that district boundaries be set by an independent directly-elected commission. Maybe those members could even be elected via Approval Voting or Single Transferable Vote – systems which IMO should actually be used to elect the congresscritters themselves, but that’s not feasible without establishing voter familiarity in a different context.

Here’s the technical part. Most of the Big Data “success stories” seem to involve the rich (who can afford to buy/run big clusters) getting richer by exploiting consumers and invading their privacy. Very rarely do I hear about good uses, such as tracking drug interactions or disease spread. Where are the “data scientists” doing real science? Here’s an opportunity, while the election and its consequences are fresh in everybody’s minds, for those tools to do some good. How about if we use Big Data tools and Machine Learning techniques to crunch through demographic data and at least come up with congressional-district proposals that meet some rationally debatable definition of fairness? Obviously the results themselves can’t just be used as is, nor can the algorithms or data sets be enshrined into law, but maybe at least the operative definitions and the results they produce can provide decent starting points for a commission or the people themselves to consider. It seems like a lot better goal than targeting ads, anyway.

Stackable Exit Hooks for Bash

I’m just going to leave this here and then quietly back away before the flames start.

# Stackable atexit functionality for bash.
# Bash's "trap ... EXIT" is somewhat similar to libc's "atexit" with the
# limitation that such functions don't stack.  If you use this construct twice,
# the cleanup code in the second invocation *replaces* that in the first, so
# the first actually doesn't happen.  Oops.  This snippet shows a way to get
# stackable behavior by editing the current trap function to incorporate a new
# one, either at the beginning or the end.  That's a really cheesy thing to do,
# but it works.
function atexit_func {
	# A function body can't be empty, so start with a no-op (the ':'
	# builtin would also work).
	echo -n
}
trap "atexit_func" EXIT
# Call this function to have your cleanup called *before* others.
function atexit_prepend {
	tmpfile=$(mktemp atexit.XXXXXX)
	typeset -f atexit_func > $tmpfile
	# "typeset -f" prints the name on line 1 and "{" on line 2, so "2a"
	# adds the new call at the top of the body.
	echo -en "2a\n$1\n.\nw\nq\n" | ed - $tmpfile
	. $tmpfile
	rm $tmpfile
}
# Call this function to have your cleanup called *after* others.
function atexit_append {
	tmpfile=$(mktemp atexit.XXXXXX)
	typeset -f atexit_func > $tmpfile
	# "$i" inserts just before the closing "}" on the last line.
	echo -en "\$i\n$1\n.\nw\nq\n" | ed - $tmpfile
	. $tmpfile
	rm $tmpfile
}
function first_atexit {
	echo "first atexit function"
}
atexit_append first_atexit
function second_atexit {
	echo "second atexit function"
}
atexit_append second_atexit
function third_atexit {
	echo "third atexit function"
}
atexit_prepend third_atexit
# Should see third/first/second here.
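
For anyone who would rather not depend on ed or temporary files, here is a sketch of an alternative that keeps the hooks in an array and walks it from a single EXIT trap. This is a different technique from the function-editing trick above, not a drop-in copy of it, and ATEXIT_HOOKS, atexit_run, atexit_append_hook and atexit_prepend_hook are all hypothetical names.

```shell
#!/usr/bin/env bash
# Stackable exit hooks kept in an array instead of edited function text.
# All names here are illustrative.
ATEXIT_HOOKS=()

# Run every registered hook, in order.
atexit_run() {
	local hook
	for hook in "${ATEXIT_HOOKS[@]}"; do
		"$hook"
	done
}
trap atexit_run EXIT

# Register a function to be called *after* the ones already registered.
atexit_append_hook() {
	ATEXIT_HOOKS+=("$1")
}

# Register a function to be called *before* the ones already registered.
atexit_prepend_hook() {
	ATEXIT_HOOKS=("$1" "${ATEXIT_HOOKS[@]}")
}
```

Appending first_atexit and second_atexit and then prepending third_atexit gives the same third/first/second order at exit. The tradeoff is that each hook has to be a function or command name, whereas the ed version will splice in any line of shell you hand it.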

Rackspace Block Storage

A while ago, Rackspace announced their own block storage. I hesitate to say it’s equivalent to Amazon’s EBS, them being competitors and all, but that’s the quickest way to explain what it is/does. I thought the feature itself was long overdue, and the performance looked pretty good, so I said so on Twitter. I also resolved to give it a try, which I was finally able to do last night. Here are some observations.

  • Block storage is only available through their “next generation” (OpenStack based) cloud, and it’s clearly a young product. Attaching block devices to a server often took a disturbingly long time, during which the web interface would often show stale state. Detaching was even worse, and in one case took a support ticket and several hours before a developer could get it unstuck. If I didn’t already have experience with Rackspace’s excellent support folks, this might have been enough to make me wander off.
  • Still before I actually got to the block storage, I was pretty impressed with the I/O performance of the next-gen servers themselves. In my standard random-sync-write test, I was seeing over 8000 4KB IOPS. That’s a kind of weird number, clearly well beyond the typical handful of local disks but pretty low for SSD. In any case, it’s not bad for instance storage.
  • After seeing how well the instance storage did, I was pretty disappointed by the block storage I’d come to see. With that, I was barely able to get beyond 5000 IOPS, and it didn’t seem to make any difference at all if I was using SATA- or SSD-backed block storage. Those are still respectable numbers at $15/month for a minimum 100GB volume. Just for comparison, at Amazon’s prices that would get you a 25-IOPS EBS volume of the same size. Twenty-five, no typo. With the Rackspace version you also get a volume that you can reattach to a different server, while in the Amazon model the only way to get this kind of performance is with storage that’s permanently part of one instance (ditto for Storm on Demand).
  • Just for fun, I ran GlusterFS on these systems too. I used a replicated setup for comparison to previous results, getting up to 2400 IOPS vs. over 4000 for Amazon and over 5000 for Storm on Demand. To be honest, I think these numbers mostly reflect the providers’ networks rather than their storage. Three years ago when I was testing NoSQL systems, I noticed that Amazon’s network seemed much better than their competitors’ and that more than made up for a relative deficit in disk I/O. It seems like little has changed.

The bottom line is that Rackspace’s block storage is interesting, but perhaps not enough to displace others in this segment. Let’s take a look at IOPS per dollar for a two-node replicated GlusterFS configuration.

  • Amazon EBS: 1000 IOPS (provisioned) for $225/month or 4.4 IOPS/$ (server not included)
  • Amazon SSD: 4300 IOPS for $4464/month or 1.0 IOPS/$ (that’s pathetic)
  • Storm on Demand SSD: 5500 IOPS for $590/month or 9.3 IOPS/$
  • Rackspace instance storage: 3400 IOPS for $692/month (8GB instances) or 4.9 IOPS/$
  • Rackspace with 4x block storage per server: 9600 IOPS for $811/month or 11.8 IOPS/$ (hypothetical, assuming CPU or network don’t become bottlenecks)
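
The IOPS/$ figures above are just measured IOPS divided by monthly cost. A tiny helper makes it easy to recheck them or plug in other providers; iops_per_dollar is a hypothetical name, and awk is used only because bash has no floating-point arithmetic.

```shell
#!/usr/bin/env bash
# iops_per_dollar IOPS MONTHLY_COST
# Prints IOPS per dollar, rounded to one decimal place.
iops_per_dollar() {
	# LC_ALL=C keeps the decimal separator a period in any locale.
	LC_ALL=C awk -v iops="$1" -v cost="$2" \
		'BEGIN { printf "%.1f\n", iops / cost }'
}
```

For example, iops_per_dollar 5500 590 prints 9.3, matching the Storm on Demand line, and iops_per_dollar 9600 811 prints 11.8 for the hypothetical Rackspace configuration.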

Sometime I’ll have to go back and actually test that last configuration, because I seriously doubt that the results would really be anywhere near that good and I suspect Storm would still remain on top. Maybe if the SSD volumes were really faster than the SATA volumes, which just didn’t seem to be the case when I tried them, things would be different. I should also test some other less-known providers such as CloudSigma or CleverKite, which also offer SSD instances at what seem to be competitive prices (though after Storm I’m wary of providers who do monthly billing with “credits” for unused time instead of true hourly billing).