Good Programmers

When you hear that someone’s a good programmer, what do you think that means?

  • They write lots of code.
  • They write good code (e.g. clear logic, good error checking, etc.)
  • They write difficult code, the kind that’s hard to get right or to debug when it goes wrong.
  • They write innovative code, representing new algorithms or techniques.

All of the above, you say? Sorry, you don’t get all of the above. An average programmer can do two of these things, but not at the same time. That means nearly half can’t even do that well. A pretty good programmer can do three, two at a time. A very good programmer can do all four, perhaps even three at a time. There are programmers who can do all four at once, but they’re like people who can juggle eight balls at once – you’ll probably never meet one.

My point here is not to say programmers suck, or that one combination of abilities is better than another. The real point is that programmers can be good in different ways, even if none of them are perfect, and if you want to get the most from a group of programmers it can really help to recognize these differences. Programmers with one set of strengths love to dump all over programmers with a different set, but specialization is the cornerstone of our profession (if not of the entire economy that supports it) and specialization cuts across styles as well as technical areas. If you can get your specialists to complement and reinforce one another, instead of undermining and alienating one another, you can do great things that no band of generalists could match. Don’t celebrate diversity for its own stupid sake; this ain’t no hippie commune. Exploit diversity to get things done. It happens to be true that people are happier when they’re working to their strengths instead of conforming to some arbitrary “jack of all trades and master of none” standard, but that’s just icing on the cake. Being successful makes people happy too.

Chess Ratings

The Elo system was originally developed for ranking chess players, but has since been used to rank performance in many other contexts as well. According to the Wikipedia article, this even includes the Bowl Championship Series in college football, which was a surprise to me. The article also mentions several common criticisms of Elo, some of which are addressed by existing alternatives, and more recently there has been a competition to improve the state of the art still further (warning: site is being Slashdotted as I write this).

I should probably enter, since I’ve done a lot of thinking about this exact problem over the years, but I’m oversubscribed already. Before this site even became a blog (which BTW was a little over ten years ago) I implemented a Yahoo/ICC rating script based on the principle of examining not just a person’s opponents but also their opponents’ opponents, and so on to whatever level one wants. It often found non-trivial anomalies in the ratings of players who were getting better or worse, or more often of those who had lately changed their preferred time limits or criteria for selecting opponents. I used to run it to cut through the ratings manipulation then (and probably still) rampant on both sites, to find opponents who were truly at my own level for a satisfying game.

Another idea I’ve developed since then is to represent a person’s rating not just as a strength but as a strength plus an internally calculated “style” measured along one or more axes. This has nothing to do with actual playing style, but it can account for the often persistent and reproducible anomalies in which one player just seems to beat another more often than their ongoing records against other opponents would ever indicate. For example, as an endgame-oriented Caro-Slav type of player, I can genuinely expect to win more than I lose against easily frustrated tacticians even if they’re rated 100-200 points above me. There are other players rated below me against whom I’ll fare poorly, but I’m not about to tell you those secrets.

A third idea, which I haven’t explored as much, is to use a “coupled oscillator” approach similar to that used in time-synchronization protocols, to account for the fact that an opponent’s rating was probably changing even as you played them, and that the resulting inaccuracy can be detected and corrected in retrospect. In other words, subsequent analysis could show that a player you thought was at 1500 when you played them was probably closer to 1600, their rating just hadn’t caught up yet, and then you could account for that when using the result to modify your own rating. The process could even be repeated until the results stabilized, but most of the bang for the buck would probably come in the first couple of iterations.
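
For the curious, here’s the standard Elo update that all of these ideas build on, plus a rough Python sketch of the replay-it-in-retrospect notion from the last paragraph. This is only an illustration of the concept – the data shapes and function names are made up, and it’s not the script described above.

    def expected_score(r_a, r_b):
        """Standard Elo expectation: 1 / (1 + 10^((Rb - Ra) / 400))."""
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a, r_b, score_a, k=32.0):
        """One-game Elo update for player A; score_a is 1, 0.5, or 0."""
        return r_a + k * (score_a - expected_score(r_a, r_b))

    def retrospective_ratings(games, initial=1500.0, k=32.0, passes=3):
        """Replay the game history several times.  Each pass scores every game
        against the opponent's rating as estimated by the *previous* pass, so
        results against someone whose rating hadn't caught up yet get credited
        against a better guess of where they really were.  'games' is a
        chronological list of (player_a, player_b, score_a) tuples."""
        players = {p for a, b, _ in games for p in (a, b)}
        prev = {p: initial for p in players}
        for _ in range(passes):
            current = {p: initial for p in players}
            for a, b, score_a in games:
                current[a] = elo_update(current[a], prev[b], score_a, k)
                current[b] = elo_update(current[b], prev[a], 1.0 - score_a, k)
            prev = current
        return prev

A real version would damp the later passes and weight recent games more heavily, but even this crude loop shows where a retrospective correction would plug in.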

All of these approaches can be combined, of course, and I’m pretty sure that with sufficient tweaking of code I could come up with something that would outperform Elo on the competition’s sample. They’re all a bit out of line with the competition’s goal of using/maintaining roughly the same amount of data as Elo, though. On the other hand, they seem to be allowing Glicko, so there’s clearly room for a couple more values per player. Perhaps some simplified version of the last idea would work. In any case, it’ll be interesting to see what comes out of this.

Friday Feature: Latest Tweets

As always, follow @Obdurodon for more.

  • It’s an antipattern to use “it’s an antipattern” as the worst insult you can think of for something that’s not a pattern.
  • OH: “I’d run Eclipse to take a look, but I want my machine to have some resources left for real work.”
  • I’m going to stop talking about cloud storage. Clouds are ephemeral. OceanStore had the right naming idea.
  • Every time a programmer uses goto, an architect dies a little. Die, you bastard, die!
  • If you want authority, take responsibility.

Programmer Humor

I’ve never found a programmer-oriented comic as good as Angst Technology . . . until, perhaps, now. Yeah, sure, there’s XKCD but it’s more generically nerdy. Geek & Poke is good, but my absolute favorite right now has to be Not Invented Here. Since I’ve written a few times about rewriting code, the current series is particularly funny for me – and the dig at pattern-mania is welcome too.

[Embedded comic: “597 bugs”]

Go check it out. Do you know of any other good ones?

Project Naming

On my project at work, we’ve repeatedly had to grapple with the issue of what to call the various major components. Mostly we’ve been trying to make the names descriptive and unambiguous. For example, my component was originally called a “repo,” but other components already use that term for package-management repositories, so mine became a “warehouse” instead. Now another team member is thinking about issues like domain-name availability and search-engine visibility, and that’s making things even more difficult. This all brings to mind Zooko’s Triangle, but with a different twist when it comes to project names. Basically, it seems that a project name can have only two out of three properties. It can be:

  • Descriptive.
  • Memorable.
  • Unique.

Examples of various combinations abound. For example, it’s usually not hard to come up with an acronym that’s both descriptive and unique, so many go that route, but the results are often distinctly forgettable. As the acronym space gets fuller, ensuring uniqueness usually means ever-longer acronyms, so the problem only gets worse. Microsoft gives us many examples of descriptive and memorable names, such as Windows and Word, but even with their level of market presence they’ve had to fight hard to make those names seem unique to their products and not other window-related or word-related products. For memorable and unique, look no further than some of the projects I write about here – Cassandra, Voldemort, Riak. Would you know just from the name what any of them were about? I wouldn’t. There are also projects that don’t even achieve two of these properties, of course. Anybody remember the fun when Mozilla named a project “Firebird”? It was never intended to be descriptive, it definitely wasn’t unique, and it was only slightly memorable. They really screwed up that time.

One approach is to take a name that’s either descriptive or memorable or some combination of both, then use a prefix/suffix to make it more unique. Microsoft has WinEverything, Apple has iEverything, Lin-this, K-that, and so on. I’ve used “plat-” or “platy-” myself this way a few times. In a way, though, this just pushes the problem from the whole name to the prefix/suffix. Maybe iPhone has (or had) all three desirable properties, but “i-” as a prefix only has the necessary properties because it’s Apple. I couldn’t realistically start a project named pStore and expect the “p” to make it more memorable or unique. I can’t even expect that with most longer prefixes; dolphinStore wouldn’t be much better.

I don’t have any general solution. That’s why I’m suggesting that it’s a triangle. For now, it looks like we’ll go with an acronym that’s reasonably unique and descriptive but certainly not memorable (meaning among other things that when it shows up in a process listing nobody will know what it is). Like everyone else, I wish there were a better way.

August Tweets

Some of my (and others’) favorites from August.

  • Based on a mispronunciation by my daughter, thinking of renaming my blog to CyberVinegar.
  • The programmer equivalent of age/sex/location is OS/language/editor.
  • Laws should affect human behavior, not router behavior.
  • Using virtualization on the desktop means that crappy desktop software can force you to reboot an entire cluster.
  • Never say “I’m game” at a hunt club.
  • “Wisdom of crowds” is a misnomer. Crowds aggregate *knowledge*; wisdom is usually excluded or distorted.

Don’t want to miss my next pithy comment? Here I am.

I Love You, But…

I love programming. I love thinking about algorithms and data structures. I love writing code, rearranging code, talking about code. I even love testing and debugging and documenting code. (This is not to say I do all of these things as consistently as I could. There are still only 24 hours in a day, and so one must prioritize.) Sometimes I think of getting out of this field, though, because so much of working as a programmer nowadays has nothing to do with any of the things I love, and it seems to be getting worse. Nobody loves meetings and bureaucracy and such, but that’s not what I’m talking about.

I hate spending half my time dealing with build systems, source-control systems, package managers, and such. There are too many out there, they all suck, everybody has their favorite one and their favorite way of using it, and they’re not at all shy about ramming their preferences down your throat . . . which brings me to my real point. I hate programmers. Hot damn, but we are a noxious breed, aren’t we? I’m tired of the backstabbing, the trashing of each other’s work, the holier-than-thou attitude from the GNU types, the rampant sexism, the bike-shedding, the endless effort to do and re-do all the fun stuff while dumping as much work as possible onto one’s peers, and on and on and on. I know I’ve exemplified many of these sins myself, I don’t need anyone else to tell me that, but if I made it my life’s goal to be as much of a jerk as possible I’d still find myself outdone just about every day by people who aren’t even at their worst.

Of course, I don’t know what else I’d do that pays the bills, so you all are stuck with me, but come on, people. Let’s stop sucking all the fun out of this profession.

How *NOT* to Lose Data

In my last post, I described several common data-loss scenarios and took people to task for what I feel is a very unbalanced view of the problem space. It would be entirely fair for someone to say that it would be even more constructive for me to explain some ways to avoid those problems, so here goes.

One of the most popular approaches to ensuring data protection is immutable and/or append-only files, using ideas that often go back to Seltzer et al.’s log-structured filesystem paper from 1993. One key justification for that seminal project was the observation that operating-system buffer/page caches absorb most reads, so the access pattern as it hits the filesystem is write-dominated, and that’s the case for which the filesystem should be optimized. We’ll get back to that point in a moment. In such a log-oriented approach, writes are handled as simple appends to the latest in a series of logs. Usually, the size of a single log file is capped, and when one log file fills up another is started. When there are enough log files, old ones are combined or retired based on whether they contain updates that are still considered relevant – a process called compaction in several current projects, but also known by other names in other contexts. Reads are handled by searching through the accumulated logs for updates which overlap with what the user requested. Done naively, this could take linear time in the number of log entries present, so in practice the read path is heavily optimized using Bloom filters and other techniques, and it can actually be quite efficient. This leads me to a couple of tangential observations about how such solutions are neither as novel nor as complete as some of their more strident champions would have you believe.

  • The general outline described above is pretty much exactly what Steven LeBrun and I came up with in 2003/2004, to handle “timeline” data in Revivio’s continuous data protection system. This predates the publication of details about Dynamo in 2007, and therefore all of Dynamo’s currently-popular descendants as well.
  • Some people seem to act as though immutable files are always and everywhere superior to update-in-place solutions (including soft updates or COW), apparently unaware that they’re just making the complexity of update-in-place Somebody Else’s Problem. When you’re creating and deleting all those immutable files within a finite pool of mutable disk blocks, somebody else – i.e. the filesystem – has to handle all of the space reclamation/reuse issues for you, and they do so with update-in-place.

Despite those caveats, the log-oriented approach can be totally awesome, and designers should generally consider it first, especially when lookups are by a single key in a flat namespace. You could theoretically handle multiple keys by creating separate sets of Bloom filters etc. for each key, but that can quickly become unwieldy. It also makes writes less efficient, and – as noted previously – write efficiency is one of the key justifications for this approach in the first place. At some point, or for some situations, a different solution might be called for.
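
To make that shape concrete, here’s a minimal in-memory Python sketch of the log-oriented approach – invented names, no real persistence, and a toy Bloom filter – showing appends to the newest segment, Bloom-filtered reads from newest to oldest, and a crude compaction pass. A real system does all of this against files on disk, with far more care.

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: a few hash positions per key over a fixed bit
        array.  False positives are possible, false negatives are not."""
        def __init__(self, bits=8192, hashes=4):
            self.nbits = bits
            self.hashes = hashes
            self.bits = bytearray(bits // 8)

        def _positions(self, key):
            for i in range(self.hashes):
                digest = hashlib.sha256(("%d:%s" % (i, key)).encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.nbits

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    class LogSegment:
        """One capped, append-only log plus its Bloom filter."""
        def __init__(self):
            self.entries = []            # (key, value) in arrival order
            self.filter = BloomFilter()

        def append(self, key, value):
            self.entries.append((key, value))
            self.filter.add(key)

        def get(self, key):
            if not self.filter.might_contain(key):
                return None              # definite miss: skip the scan entirely
            for k, v in reversed(self.entries):   # newest update wins
                if k == key:
                    return v
            return None

    class LogStore:
        """Writes append to the newest segment; reads walk segments from
        newest to oldest, letting the Bloom filters skip most of them."""
        def __init__(self, max_entries=1024):
            self.max_entries = max_entries
            self.segments = [LogSegment()]

        def put(self, key, value):
            if len(self.segments[-1].entries) >= self.max_entries:
                self.segments.append(LogSegment())   # roll over to a new log
            self.segments[-1].append(key, value)

        def get(self, key):
            for seg in reversed(self.segments):
                value = seg.get(key)
                if value is not None:
                    return value
            return None

        def compact(self):
            """Fold the closed segments into one, keeping only the newest
            update for each key (the 'compaction' mentioned above)."""
            live = {}
            for seg in self.segments[:-1]:
                for k, v in seg.entries:
                    live[k] = v          # later entries overwrite earlier ones
            merged = LogSegment()
            for k, v in live.items():
                merged.append(k, v)
            self.segments = [merged, self.segments[-1]]

The read path only scans a segment when its filter says the key might be there, which is the whole trick that keeps lookups from being linear in the number of log entries.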

The other common approach to data protection is copy-on-write or COW (as represented by WAFL, ZFS, or btrfs) or its close cousin soft updates. In these approaches, updates go to individual blocks rather than to a log, with very careful attention paid to where and/or when each block update actually hits disk. Most commonly, all blocks are either explicitly or implicitly related as parts of a tree. Updates occur from leaves to root, copying old blocks into newly allocated space and then modifying the new copies. Ultimately all of this new space is spliced into the filesystem with an atomic update at the root – the superblock in a filesystem. It’s contention either at the root or on the way up to it that accounts for much of the complexity in such systems, and for many of the differences between them. The soft-update approach diverges from this model by doing more updates in place instead of into newly allocated space, avoiding the issue of contention at the root but requiring even more careful attention to write ordering. Here are a few more notes.

  • When writes are into newly allocated space, and the allocator generally allocates sequential blocks, the at-disk access pattern can be strongly sequential, just as with the more explicitly log-oriented approach.
  • The COW approach lends itself to very efficient snapshots, because each successive version of the superblock (or equivalent) represents a whole state of the filesystem at some point in time. Garbage collection becomes quite complicated as a result, but the complexity seems well worth it.
  • There’s a very important optimization that can be made sometimes when a write is wholly contained within a single already-allocated block. In this case, that one block can simply be updated in place and you can skip a lot of the toward-the-root rigamarole. I should apply this technique to VoldFS. Unfortunately, it doesn’t apply if you have to update mtime or if you’re at a level where “torn writes” (something I forgot to mention in my “how to lose data” post) are a concern.
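
To illustrate the leaves-to-root copying and the atomic splice at the root, here’s a small Python sketch – an in-memory binary tree standing in for the block tree, with invented names and none of the allocation, contention, or torn-write concerns a real filesystem faces.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)              # nodes, like COW blocks, are never modified
    class Node:
        key: int
        value: bytes
        left: Optional["Node"] = None
        right: Optional["Node"] = None

    def cow_put(node, key, value):
        """Copy-on-write update: allocate new copies of just the nodes on the
        path from the root down to the affected leaf; everything off that
        path is shared, untouched, with the old tree."""
        if node is None:
            return Node(key, value)
        if key < node.key:
            return Node(node.key, node.value, cow_put(node.left, key, value), node.right)
        if key > node.key:
            return Node(node.key, node.value, node.left, cow_put(node.right, key, value))
        return Node(key, value, node.left, node.right)

    class CowTree:
        """The 'superblock' analogue: one mutable reference to the current
        root.  Publishing a new root is the single atomic step; every old
        root remains a complete, consistent snapshot until it's collected."""
        def __init__(self):
            self.root = None

        def put(self, key, value):
            new_root = cow_put(self.root, key, value)   # build the new path first...
            self.root = new_root                        # ...then splice it in atomically

        def snapshot(self):
            return self.root             # cheap: just hang on to the old root

The point to notice is that nothing old is ever modified; the only mutable thing is the root reference, which is why old roots double as free snapshots and why garbage collection becomes the hard part.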

It’s worth noting also that, especially in a distributed environment, these approaches can be combined. For example, VoldFS itself uses a COW approach but most of the actual or candidate data stores from which it allocates its blocks are themselves more log-oriented. As always it’s horses for courses, and different systems – or even different parts of the same system – might be best served by different approaches. That’s why I thought it was worth describing multiple alternatives and the tradeoffs between them.

How To Lose Data

As I mentioned in my last post, I’ve been getting increasingly annoyed at a lot of the flak that has been directed toward MongoDB over data-protection issues. I’m certainly no big fan of systems that treat memory as primary storage (with or without periodic flushes to disk) instead of a cache or buffer for the real thing. I’ve written enough here to back that up, but I’ve also written plenty about something that bugs me even more: FUD. Merely raising an issue isn’t FUD, but the volume and tone and repetition of the criticism are all totally out of proportion when there are so many other data-protection issues we should also worry about. Here are just a few ways to lose data.

  • Don’t provide full redundancy at all levels of your system. It’s amazing how many “distributed” systems out there aren’t really distributed at all, leaving users entirely vulnerable to loss or extended unreachability of a single node, without one peep of protest from the people who are so quick to point the finger at systems which can at least survive that most-common failure mode.
  • Be careless about non-battery-backed disk caches. If data gets stranded in a disk cache when the power goes out, it’s no different than if it were stranded in memory, and yet many projects do absolutely nothing to detect, let alone correct, obvious problems in this area.
  • Be careless about data ordering in the kernel. My colleagues who work on local filesystems and pieces of the block-device subsystem in Linux (and others working on other OSes) have done a great deal of too-little-appreciated work to provide the very highest levels of data safety that they can without sacrificing any more performance than necessary. Then folks who preach the virtues of append-only files without knowing anything at all about how they work turn around and subvert all that effort by giving mount-command and fstab-line examples that explicitly put filesystems into async mode, turn off barriers, etc.
  • A special case of the previous point is when people actually do seem to know the options that assure data protection, but forgo those options for the sake of getting better benchmark numbers. That’s simply dishonest. You can’t claim great performance and great data protection if users can only really get one or the other depending on which options they choose. Pick one, and shut up about the other.
  • Be careless about your own data ordering. A single I/O operation can require several block-level updates. Many overlapping operations can create a huge bucket of such updates, conflicting in complex ways and requiring very careful attention to the order in which the updates actually occur. If you screw it up just once – and it takes a special brand of arrogance to believe that could never happen to you – then you corrupt data. If you corrupt metadata, you might well lose the user data it points to. If you corrupt user data, that can be even worse than losing it, because there are security implications as well. It’s not nice when some of your confidential data becomes part of somebody else’s file/document/whatever. At least with mmap-based approaches, it’s fairly straightforward to use msync and fork and hypervisor/filesystem/LVM snapshots to at least guarantee that the state on disk remains consistent even if it’s not absolutely current (see the ordering sketch after this list).
  • Don’t provide any reasonable way to take a backup, which would protect against the nightmare scenario where data is lost not because of a hardware failure but because of a bug or user error that makes your internal redundancy irrelevant.
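
Since write ordering keeps coming up, here’s a minimal sketch of the discipline the data-ordering items above are getting at – Python on Linux, with an invented two-file layout: make the new data durable first, then publish a pointer to it, so a crash in between leaves the old version intact and still reachable.

    import os

    def fsync_dir(dirpath):
        """Persist directory entries (new files, renames); Linux-specific idiom."""
        fd = os.open(dirpath, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)

    def publish_version(dirpath, version, payload):
        """Write the new data, make it durable, and only then publish a pointer
        to it.  A crash at any point leaves 'current' naming an older but
        intact version rather than a half-written new one."""
        data_name = "data.%d" % version
        data_path = os.path.join(dirpath, data_name)

        # Step 1: write the new data and push it (and its directory entry) to disk.
        with open(data_path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        fsync_dir(dirpath)

        # Step 2: only now publish it, via an atomically renamed pointer file.
        tmp_ptr = os.path.join(dirpath, "current.tmp")
        with open(tmp_ptr, "w") as f:
            f.write(data_name)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp_ptr, os.path.join(dirpath, "current"))
        fsync_dir(dirpath)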

Of course, some of these issues won’t apply to Your Favorite Data Store, e.g. if it doesn’t have a hierarchical data model or a concept of multiple users. Then again, the list is also incomplete because the real point I’m making is that there are plenty of data-protection pitfalls and plenty of people falling into them. Some of the loudest complainers already had to suspend their FUD campaign to deal with their own data-corruption fiasco. Others are vulnerable to having the same thing happen – I can tell by looking at their designs or code – but those particular chickens haven’t come home to roost yet.

Look, I laughed at the “redundant coffee mug” joke too. It was funny at the time, but that was a while ago. Since then it’s been looking more and more like junior-high-school cliquishness, poking fun at a common target as a way to fit in with the herd. It’s not helping users, it’s not advancing the state of the art, and it’s actively harming the community. As one of the worst offenders once had the gall to tell me, be part of the solution. Find and fix new data-protection issues in whichever projects have them, instead of going on and on about the one everybody already recognizes.