“Anonymous” is Bad For Anonymity

I’ll probably get in trouble for writing this, but somebody has to. Feeling full of themselves after the Wikileaks affair, Anonymous has started going after other worthy targets. The problem is, they’re doing it in a way that almost guarantees a bad outcome. For example, look at their letter to Westboro Baptist Church.

We, the collective super-consciousness known as ANONYMOUS

Might as well stop there. This introduction, plus the hyperbole and contorted sentence structure throughout, makes me think of nothing so much as James T. Kirk’s painfully melodramatic speeches in old Star Trek episodes. This is not the image you want to project when you’re fighting for a cause. For an even worse example, consider the antics at last week’s RSA conference.

owned by anonymous. niiiice.

Again, stop there. Already we have text/internet shorthand, no caps, general swagger, etc. It looks like a child drunk on power, not an adult making a serious policy point. “In it 4 the lulz” indeed . . . and that’s the problem. I don’t object to Anonymous’s choice of targets here. Westboro Baptist definitely deserves some karmic payback, and the evidence suggests the same of HBGary Federal. I don’t even object to their tactics, though some might. The problem is the kind of attention this will get them, and how that attention might turn into policy changes that adversely affect all of us. Anonymous clearly wields great power. Power can be used by heroes, and it can be used by bullies. The difference often lies in two things.

  • Identifying yourself. There is no way to tell who’s really Anonymous and who’s just some totally unrelated internet cretin using the name and cause as an excuse for random acts of vandalism. This is kind of ironic, since the real members of Anonymous are clearly experts in technologies such as secure anonymous publishing that would allow them to take or deny credit for any particular act without having to reveal their identities. Anonymous is really pseudonymous, not anonymous, and should take care to preserve the distinction.
  • Defining yourself. Real freedom fighters have identifiable goals and methods. We might not approve of either, but without any explanations (beyond generic “freedom of information” blather that could mean anything) or apparent limits nobody will see the nobility of the cause. Why is Anonymous more prominently taking on Westboro and HBGary, or even Visa and Mastercard, instead of Qaddafi? To extend that thought a little, how are their methods really distinct from Qaddafi’s? Without Anonymous taking a clear stand, “in it 4 the lulz” effectively becomes their credo.

I’m not suggesting that Anonymous should behave differently to satisfy my or anyone else’s comfort level. I’m suggesting they should do so for the sake of the very goals they (vaguely) claim to value. When people see a group with far more power than self-control, which fails to distinguish itself from any other band of bullies, then the Powers That Be will start to see bands of bullies on the internet as a Real Problem. Those who are already looking for any excuse to require ID before connecting to the internet, or to give security agencies more power to invade our privacy in the name of tracking down the Bad People, will be all over that. Policy makers aren’t listening to us, the people. They’re listening to the people with money – like HBGary Federal or worse – who stand to make even more money in such a world. They’re also listening to people like the RIAA/MPAA who would also dearly love a more controlled internet. A very likely outcome of all this is much less privacy and potential for anonymity on the internet. Thanks a lot, Anonymous.

Crypto For Kids

I mentioned on Twitter that I’d been teaching Amy (now six) about simple ciphers, and @lucasjosh asked how I went about it. The answer is way too long for Twitter, and probably of more general interest, so I’ll try to explain.

For a while now, Amy has been fascinated by two ideas: other languages (counting to twelve in Spanish as fast as she can) and writing things instead of speaking (writing notes for me and Cindy). Both have to do with alternate means of communication, so I think it’s a natural and common kind of curiosity. I don’t remember exactly where she first encountered the idea of simple letter-to-number substitution (A=1, B=2, …, Z=26), but it was a while ago. It might have come up again in an issue of Highlights that I picked up at the library for her, or maybe even on a kids’ placemat at Friendly’s. In any case, she was the one who suggested that we write some coded messages.

After doing some simple ones at first – “I love mommy” and “daddy is a geek” – I let on that there were many more kinds of codes possible. First I used a simple code based on going around the “circle” of letters in steps of seven (chosen because seven is coprime to 26, so every letter gets used): 1=A, 2=H, 3=O, and so on. I didn’t actually explain how the table was generated; to her it was just a simple lookup. At this point Cindy made the point that breaking a code is harder when you only have a little bit of text, so we gave her the table. Amy seemed to enjoy that, so I decided to take it a step further and explained that codes could involve reordering as well as substitution. I did this with a simple 4×4 square, with the message written across.

d a d d
y (space) i s
(space) a (space) g
e e k !

Then I showed her how to read down the columns instead of across the rows, so the result is:

dy ea aedi kdsg!

So far, so good. Finally, I showed her how you could reverse the process to decode, and just for extra fun how you could repeat the process and end up where you started (giving her an early exposure to matrix operations as well). She thought that was great, and gave the message to Cindy who decoded it quite quickly. For the second one, I used a 6×2 matrix, and challenged Cindy (who had heard the original message) to figure out what size matrix I’d used. I don’t think the idea of the matrix configuration effectively being the key really sunk in, but I think I’ll be able to demonstrate that pretty well when I show her the closely related Caesar cipher.
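For anyone who wants to play along at home, both codes are easy to sketch in a few lines of Python. This is just an illustrative toy, with the step size and grid width as parameters to match the examples above:

```python
import string

def substitution_table(step=7):
    """Build the 'circle of letters' table: position i maps to the letter
    step*i around the 26-letter circle, giving 1=A, 2=H, 3=O, and so on.
    Because gcd(step, 26) == 1, every letter appears exactly once."""
    letters = string.ascii_uppercase
    return {i + 1: letters[(i * step) % 26] for i in range(26)}

def transpose(message, width=4):
    """Write the message across the rows of a width-wide grid,
    then read it back down the columns."""
    rows = [message[i:i + width] for i in range(0, len(message), width)]
    return "".join(row[col] for col in range(width)
                   for row in rows if col < len(row))

table = substitution_table()
assert table[1] == "A" and table[2] == "H" and table[3] == "O"

coded = transpose("daddy is a geek!")
# For a square grid, applying the transposition twice restores the original.
assert transpose(coded) == "daddy is a geek!"
```

Note that the “do it twice and get back where you started” trick only works when the grid is square; with the 6×2 matrix you have to swap the dimensions to decode, which is another way of seeing that the matrix shape is the key.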

At that point it was bed time, so Amy and I headed upstairs. The coolest part of the whole thing, though, was that Amy insisted I get her up at 7am sharp (usually Cindy does that while I get ready for work) so we could do some more codes. Awesome.

Fun With Cartograms

After reading this Paul Krugman post about unemployment differences between states, I thought it would be interesting to see what the results looked like in cartogram form. For those who don’t know, a cartogram is a map that has been redrawn so that the size of each part is proportional to some metric other than geographic area – most often population, but GDP or electoral count or other metrics work too. I searched for cartograms of the US by population, with state boundaries, and quickly found one for the 2004 presidential election. Then I spent a few minutes re-coloring the states to match Krugman’s chart, and the diagram below is the result. I tried to remove the worst red/blue state-border color artifacts from the original, and also corrected the mistaken inclusion of Michigan’s upper peninsula as part of Wisconsin.

US unemployment cartogram

I’m not trying to make any particular point with this; others have offered plenty of commentary already. I just thought other people might find the result interesting.

What a Disappointment

Almost a month ago – a month ago! – I ordered a new computer. It was going to be great. Quad core, lots of memory, SSD, the whole lot. When it finally arrived today, I couldn’t wait to begin setting it up. I opened the case, and everything sure looked professionally assembled, with all of the cables neatly tied away and so on. The only problem was that it wouldn’t boot. Things spun up, but I never even got any POST beeps. Upon investigating further, I found that the gigantic heatsink had not been installed properly, and the CPU had been totally wrecked. There were bent pins everywhere, a couple of them broken, as though somebody hadn’t even bothered to seat it properly before slathering on a too-thick coating of low-quality thermal goop and then using power tools to cram the heatsink on top. I won’t name the vendor yet, until I see how they handle this, but to say I’m disappointed would be a huge understatement. I’m going to demand the name of the person they fire, or else I won’t believe that they’ve responded appropriately.

UPDATE 2011-02-11: The new processor arrived last night, and with that the machine has worked fine. It’s pretty sweet, actually.

Introduction to Distributed Filesystems

When Alex Popescu wrote about scalable storage solutions and I said that the omission of distributed filesystems made me cry, he suggested that I could write an introduction. OK. Here it is.

All filesystems – even local ones – have a similar data model and API model. The data model consists of files inside directories, where both have user-assigned names. In most modern filesystems directories can be nested, file contents are byte-addressable, and names are free-form character sequences. The API model, commonly referred to as POSIX after the standard of the same name, includes two broad categories of calls – those that operate on files, and those that operate on the “namespace” of files within directories. Examples of the first category include open, close, read, write, and fsync. Examples of the second category include opendir, readdir, link, unlink, and rename. People who actually develop filesystems, especially in a Linux context, often talk in terms of a different three-way distinction (file/inode/dirent operations) but that has more to do with filesystem internals than with the API users see.

The other thing about filesystems is that they’re integrated into the operating system. Any program should be able to use any filesystem without using special libraries. That makes real filesystems a bit harder to implement, but it also makes them more generally useful than impostors that just have “FS” in the name to imply functionality they don’t have.

There are many ways to categorize filesystems – according to how they’re accessed, how they’re implemented, what they’re good for, and so on. In the context of scalable storage solutions, though, the most important groupings are these.
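The file-call versus namespace-call split is easy to see from any scripting language, precisely because filesystems are integrated into the OS. Here’s a minimal sketch using Python’s thin wrappers over the POSIX calls (the file names are made up for illustration):

```python
import os
import tempfile

d = tempfile.mkdtemp()

# Namespace operation: open with O_CREAT creates a new name in a directory.
path = os.path.join(d, "hello.txt")
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)

# File operations: manipulate the bytes behind an open file descriptor.
os.write(fd, b"hello, posix\n")
os.fsync(fd)   # force the contents to stable storage
os.close(fd)

# More namespace operations: rename, readdir (listdir), unlink.
os.rename(path, os.path.join(d, "renamed.txt"))
print(os.listdir(d))
os.unlink(os.path.join(d, "renamed.txt"))
```

No special library is involved; the same few calls work identically whether the directory lives on a local, network, or distributed filesystem, which is exactly the point.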

  • A local filesystem is local to a single machine, in the sense that only a process on the same machine can make POSIX calls to it. That process might in fact be a server for some “higher level” kind of filesystem, and in fact local filesystems are an essential building block for most others, but for this to work the server must make a new local-filesystem call which is not quite the same as continuing the client’s call.
  • A network filesystem is one that can be shared, but where each client communicates with a single server. NFS (versions 2 and 3) and CIFS (a.k.a. SMB which is what gives Samba its name) are the best known examples of this type. Servers can of course be clustered and made highly available and so on, but this must be done transparently – almost behind the clients’ backs or under their noses. This approach fundamentally only allows vertical scaling, and the trickery necessary to scale horizontally within a single-server basic model can become quite burdensome.
  • A distributed filesystem is one in which clients actually know about and directly communicate with multiple servers (of one or more types). Lustre, PVFS2, GlusterFS, and Ceph all fit into this category despite their other differences. Unfortunately, the term “distributed filesystem” makes no distinction between filesystems distributed across a fast and lossless LAN and those distributed across a WAN with exactly opposite characteristics. I sometimes use “near-distributed” and “far-distributed” to make this distinction, but as far as I know there are no concise, commonly accepted terms. AFS is the best known example of a far-distributed filesystem, and one of the longest-lived filesystems in any category (still in active large-scale use at several places I know of).
  • A parallel filesystem is a distributed filesystem in which a single file, or even a single I/O request, can be striped across multiple servers. This is primarily beneficial in terms of performance, but can also help to distribute capacity more evenly than if every file lives on exactly one server. I’ve often used the term to refer to near-distributed filesystems as distinct from their far-distributed cousins, because there’s a high degree of overlap, but it’s not technically correct. There are near-distributed filesystems that aren’t parallel filesystems (GlusterFS is usually configured this way) and parallel filesystems that are not near-distributed (Tahoe-LAFS and other crypto-oriented filesystems might fit this description).
  • A cluster or shared-storage filesystem is one in which clients are directly attached to shared block storage. GFS2 and OCFS2 are the best known examples of this category, which also includes MPFS. Once touted as a performance- or scale-oriented solution, these are now positioned mainly as availability solutions with a secondary emphasis on strong data consistency (compared to the weak consistency offered by many other network and distributed filesystems). Due to this focus and the general characteristics of shared block storage, the distribution in this case is always near.
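The “parallel” case in the list above is mostly simple arithmetic: with round-robin striping, any byte offset in a file maps deterministically to one server and a local offset on that server. A toy sketch (the stripe size and server count here are arbitrary, not taken from any particular filesystem):

```python
def locate(offset, stripe_size=65536, num_servers=4):
    """Map a file offset to (server index, offset within that server's
    local file) under simple round-robin striping."""
    stripe = offset // stripe_size        # which stripe unit overall
    server = stripe % num_servers         # which server holds that unit
    local_stripe = stripe // num_servers  # which unit within that server
    return server, local_stripe * stripe_size + offset % stripe_size

# The first four 64KB units land on servers 0..3, then wrap around.
assert locate(0) == (0, 0)
assert locate(65536) == (1, 0)
assert locate(4 * 65536) == (0, 65536)
```

A large sequential read thus naturally fans out across all four servers at once, which is where the impressive streaming numbers come from; a single small-file access still touches only one server, which is where they don’t.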

This set of distinctions is certainly neither comprehensive nor ideal, as illustrated by pNFS which allows multiple “layout” types. With a file layout, pNFS would be a distributed filesystem by these definitions. With a block layout, it would be a cluster filesystem. With an object layout, a case could be made for either, and yet all three are really the same filesystem with (mostly) the same protocol and (definitely) the same warts.

One of the most important distinctions among network/distributed/cluster filesystems, from a scalability point of view, is whether it’s just data that’s being distributed or metadata as well. One of the issues I have with Lustre, for example, is that it relies on a single metadata server (MDS). The Lustre folks would surely argue that having a single metadata server is not a problem, and point out that Lustre is in fact used at some of the most I/O-intensive sites in the world without issue. I would point out that I have actually watched the MDS melt down many times when confronted with any but the most embarrassingly metadata-light workloads, and also ask why they’ve expended such enormous engineering effort – on at least two separate occasions – trying to make the MDS distributed if it’s OK for it not to be. Similarly, with pNFS you get distributed data but only some pieces of the protocol (and none in any non-proprietary implementation) to distribute metadata as well. Anybody who wants a filesystem that’s scalable in the same way that non-filesystem data stores such as Cassandra/Riak/Voldemort are scalable should be very skeptical of claims made by advocates of a distributed filesystem with non-distributed metadata.
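One way to see why distributing metadata is genuinely hard: the obvious approach is to hash each entry onto one of several metadata servers, but then you have to decide *what* to hash. A toy sketch of that idea – purely illustrative, and not how Lustre, pNFS, or any real system actually does it:

```python
import hashlib

NUM_MDS = 4

def mds_for(path):
    """Pick a metadata server by hashing the parent directory, so that
    all entries in one directory live together and a readdir touches
    only one server."""
    parent = path.rsplit("/", 1)[0] or "/"
    digest = hashlib.sha1(parent.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_MDS

# Entries in the same directory map to the same server...
assert mds_for("/home/amy/a.txt") == mds_for("/home/amy/b.txt")
```

The catch is visible even in the toy: hashing by parent keeps readdir cheap but lets one huge directory hotspot a single server, while hashing by full path balances load but turns readdir and rename into multi-server operations. Picking between those poisons (or escaping them) is the engineering effort referred to above.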

A related issue here is performance. While near-distributed parallel filesystems can often show amazing megabytes-per-second numbers on large-block large-file sequential workloads, as a group they’re notoriously poor for random or many-small-file workloads. To a certain extent this is the nature of the beast. If files live on dozens of servers, you might have to contact dozens of servers to list a large directory, or the coordination among those servers to maintain consistency (even if it’s just metadata consistency) can become overwhelming. It’s harder to do things this way than to blast bits through a simple pipe between one client and one server without any need for further coordination. Can Ma’s Pomegranate project deserves special mention here as an effort to overcome this habitual weakness of distributed filesystems, but in general it’s one of the reasons many have sought alternative solutions for this sort of data.

So, getting back to Alex’s original article and my response to it, when should one consider using a distributed filesystem instead of an oh-so-fashionable key/value or table/document store for one’s scalable data needs? First, when the data and API models fit. Filesystems are good at hierarchical naming and at manipulating data within large objects (beyond the whole-object GET and PUT of S3-like systems), but they’re not so good for small objects and don’t offer the indices or querying of databases (SQL or otherwise). Second, it’s necessary to consider the performance/cost curve of a particular workload on a distributed filesystem vs. that on some other type of system. If there’s a fit for data model and API and performance, though, I’d say a distributed filesystem should often be preferred to other options. The advantage of having something that’s accessible from every scripting language and command-line tool in the world, without needing special libraries, shouldn’t be taken lightly. Getting data in and out, or massaging it in any of half a million ways, is a real problem that isn’t well addressed by any storage system with a “unique” API (including REST-based ones) no matter how cool that storage system might be otherwise.