Data Integrity

One of the possible CloudFS 2.0 features that I forgot to mention in my post yesterday, and which was subsequently brought to my attention, is the addition of data-integrity protection using checksums or similar. Some people think this is just part of encryption, but it’s really more than that. This same kind of protection exists at the block level in the form of T10 DIF and DIX, or at the filesystem level inside ZFS or btrfs, and none of those were developed with encryption in mind. The basic idea behind all of them is that for each data block stored there is also a checksum that’s stored and associated with it. Then, when the block is read back, the checksum is also read back and verified as correct before the data can be admitted to the system. The only way encryption enters the picture is that the integrity check for a given data block has to be unforgeable, which implies the use of an HMAC instead of a simple checksum. This protects not only against random corruption, but also against a targeted attack which modifies both the data and the associated HMAC (which might be equally accessible to an attacker in naive designs, though our design is likely to avoid this flaw). Thus, we need three things.

  • A translator (probably client-side) to handle the storage of checksums/HMACs received from its caller during a write, and also their retrieval – but not checking – during a read.
  • A translator that generates and checks simple checksums for the non-encrypted case.
  • Enhancements to the encryption translator to generate and check HMACs instead.
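As a rough illustration of the difference between the two verification modes (this is not the translator code, just a Python sketch; the block size, the choice of SHA-256, and the function names are made up for the example):

```python
import hashlib
import hmac
import os

BLOCK_SIZE = 4096  # assumed block size, for illustration only

def plain_tag(block: bytes) -> bytes:
    # Simple checksum for the non-encrypted case: detects random
    # corruption, but anyone who can rewrite the block can also
    # rewrite this tag to match.
    return hashlib.sha256(block).digest()

def keyed_tag(block: bytes, key: bytes) -> bytes:
    # HMAC for the encrypted case: without the key, an attacker
    # cannot produce a matching tag for a modified block.
    return hmac.new(key, block, hashlib.sha256).digest()

key = os.urandom(32)
block = b"x" * BLOCK_SIZE

tag = keyed_tag(block, key)
assert hmac.compare_digest(tag, keyed_tag(block, key))         # verifies
tampered = b"y" + block[1:]
assert not hmac.compare_digest(tag, keyed_tag(tampered, key))  # caught
```

The translator split above would store and retrieve these tags generically, while the choice between `plain_tag`-style and `keyed_tag`-style verification lives in the checksum and encryption translators respectively.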

Another possible implementation would be to implement all of the code inside the encryption translator, and simply have an option to turn actual encryption off. Such a “monolithic” approach is unappealing both for technical reasons and also because it would preclude offering the data-integrity feature as part of GlusterFS while keeping encryption as part of CloudFS’s separate value proposition.

If you think this kind of data-integrity protection would be important enough for you to justify making it part of CloudFS 2.0, please let me know. It’s much easier to make a case that Red Hat (or Gluster) resources should be devoted to it if there’s a clear demand from people besides the project developers.


6 Responses


  1. Matt Harris says:

    This would make CloudFS a killer app for research data storage. I don’t know if I can stress enough how important this feature could be.

  2. mother says:

    Data integrity is key!

    ZFS really got it right.

    If we’re going to be storing so much data, and 50% more every year, then we can’t just put it into a fs and hope that what comes out later is the same. We have to *know* that it’s the same.

    I work in digital preservation, where data integrity is the #1 most important “feature”, and I am disappointed in how few storage vendors pay attention to it.

  3. And what’s more, well-designed error-checking codes aren’t merely checksums. Checksums in the narrow, literal sense just add up the bytes (N-byte words, actually) of the file. Good error-checking codes are more sophisticated and will find different kinds of errors; they can be optimized for specific models of expected errors. I don’t know if this really matters in the case we’re talking about, but I had to take a whole course in linear algebra at MIT of which this was probably the only practical application (the course was otherwise interesting but not too useful). So I could not help throwing in this probably-useless comment. :)

  4. Jeff Layton says:

    I think there needs to be some education around checksums and ECC (that’s not targeted at anyone here :) ). Checksums can tell us whether the data associated with them has changed, but if the data is corrupt, a checksum can’t help you recover it. You need to recover the data some other way, perhaps via a copy that is known to be good (maybe itself checked via checksums) or through parity data associated with the data.

    A great example of storing parity is Panasas’ file system (PanFS). It’s object-based, but they store parity at the block level, so they can check whether the data is correct and, if it isn’t, reconstruct the block via parity (provided the parity isn’t corrupt).

    ZFS by itself only does checksumming; it doesn’t do data repair on its own. You have to use RAID or replication within ZFS to get the data recovered. I think this is a common misconception, but I see it stated in many places.

    If you really want to drive yourself bonkers: if you compute the checksum of a file so you can check for data corruption, how do you ensure the checksum itself is stored safely and not subject to corruption as well? For example, people have proposed computing a checksum for each file in a namespace and storing them in a database. But now you have to make sure the database doesn’t get corrupted. So do you checksum the database as well, and make a copy in case the original database goes bad? But how do you make sure the checksum of the database is correct? Of course, at some point you have to just say uncle, quit doing checksums of checksums, and make multiple copies and compare them in some fashion.

    That brings up my last comment. Having just a single copy of the checksums may not be enough. What happens if one copy has one checksum and another has a different one? Which is correct? You need three or more copies and some sort of quorum approach to determine which one is correct.
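    A minimal sketch of that quorum idea in Python (the function name and the strict-majority rule here are illustrative assumptions, not anyone’s actual implementation):

```python
from collections import Counter

def quorum_checksum(copies):
    # Given three or more independently stored copies of a checksum,
    # trust the value that a strict majority agrees on; with no
    # majority, report that no quorum exists.
    value, votes = Counter(copies).most_common(1)[0]
    if votes > len(copies) // 2:
        return value
    return None

assert quorum_checksum(["abc", "abc", "xyz"]) == "abc"  # corrupt copy outvoted
assert quorum_checksum(["abc", "xyz"]) is None          # two copies can only disagree
```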

    Lots of aspects to this problem – more than meets the eye (I never even mentioned computing checksums in the datapath which can have a big impact on performance). I’ve been playing with ideas and implementations that are file system independent – just prototyping and experiments. But fun to work on since I get to learn much more than I did before.


    P.S. Since I’m in HPC I thought I would mention that walking a file system and checking files for corruption is really an HPC job! (Henry Newman wrote about this at Enterprise Storage Forum).

    BTW – great blog series!

  5. Jeff Darcy says:

    Thanks for stopping by, Jeff. Good points about checksums and their limitations. One of the nice things about GlusterFS, and thus about HekaFS, is that translators can be stacked in any order. That allows data integrity to be checked below the level where replication happens, so that any discrepancy between a block and its checksum – both coming from the same “brick” – can be treated as a failed read. The replication layer, seeing that failed read, can do repair using the same mechanisms as for any other failed read. In the worst case, if a single brick’s checksum database became corrupted, then all reads from that brick would fail and it would be very much like the case where a disk had failed.
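    A sketch of that failed-read behavior in illustrative Python (not HekaFS code; the brick layout and names are invented for the example):

```python
import hashlib

def read_block(bricks, offset):
    # Try each replica in turn. A checksum mismatch on one brick is
    # treated exactly like a failed read, so the replication layer
    # falls through to the next copy (and could then repair the bad
    # one by the usual self-heal path).
    for brick in bricks:
        data, stored_sum = brick[offset]
        if hashlib.sha256(data).digest() == stored_sum:
            return data
    raise IOError("no replica passed its integrity check")

good = b"payload"
ok_sum = hashlib.sha256(good).digest()
bricks = [
    {0: (b"corrupt", ok_sum)},  # bad brick: data no longer matches checksum
    {0: (good, ok_sum)},        # good brick
]
assert read_block(bricks, 0) == good
```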

    Some day I’d still like to do something better than replication, such as erasure codes or AONT-RS, which would offer much better storage utilization. That’s a long way off, though.

  6. Mysidia says:

    “Which one is correct? You need three or more copies and develop some sort of quorum approach in determining which one is correct”

    Not really… logically, the data itself is a copy of the checksum, and you already access that copy in the process of verifying that the data matches the checksum. If you have multiple copies of a checksummed data block, and the checksum attached to one of those copies is corrupt, the corrupt checksum won’t match the data, while the correct checksum will properly validate against the data it’s attached to.
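    This self-validation argument can be shown in a small Python sketch (the names and layout are hypothetical): with a checksum stored alongside each copy of the data, each copy validates or fails on its own, so no quorum is needed as long as at least one copy is internally consistent.

```python
import hashlib

def pick_valid_copy(copies):
    # Each copy carries both the data and its checksum. A corrupted
    # checksum simply fails to validate against the data it is
    # attached to, so the internally consistent copy wins outright.
    for data, stored_sum in copies:
        if hashlib.sha256(data).digest() == stored_sum:
            return data
    return None  # every copy is internally inconsistent

data = b"record"
csum = hashlib.sha256(data).digest()
copies = [(data, b"\x00" * 32), (data, csum)]  # first copy's checksum is corrupt
assert pick_valid_copy(copies) == data
```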

    The real problem comes if you have multiple copies of the data and the checksum, and _all_ copies of the block contain an error in the data or the checksum.