At my last job, I had to work with InfiniBand. Believe me, this did not lead to an enduring love of IB. Before InfiniBand, I had worked with Fibre Channel and seen how overburdened it was with every vendor’s favorite feature or format or protocol variation, often with some little bit hidden somewhere to tell you which of several possible (and mutually incompatible) behaviors you were expected to exhibit in response. Compared to IB, FC is a model of streamlined simplicity. How’s that for scary? Nonetheless, now that all those thousands of person-hours have been poured into it, IB does actually manage to deliver somewhat on its original promise of high bandwidth and low latency at low cost.
So along comes 10-gigabit Ethernet (10GbE), which is so many levels removed even from the thing that people called Ethernet after original Ethernet had been dead and buried that nothing remains but the brand name. It seems that some folks are sure it’s going to displace IB as a cluster interconnect Any Day Now. Hitching themselves to that belief, they’ve started flinging FUD about IB’s “misleading” bandwidth numbers. Here’s one of the more egregious examples.
We will tear this black cable bandit down to size one claim at a time. First they assert that it’s 20Gbps, how about 12Gbps on it’s best day with all the electrons flowing in the same direction. Infiniband employs what is know as 8b/10b encoding to put the bits on the wire. For every 10 signal bits there are 8 useful data bits. Ethernet uses the same method, the difference is that Ethernet for the past 30 years has advertised the actual data rate while Infiniband promotes the 25% larger and useless signal rate. Using Infiniband math Ethernet would then be 12.5Gbps instead of the 10Gbps it actually is. So using Ethernet math Infiniband’s Double Data Rate (DDR) is actually only 16Gbps and not the 20Gbps they claim.
Apparently, according to “10GbE math” 16Gb/s is less than 10Gb/s. Spare me. DDR IB is at approximate price parity with 10GbE, and still 60% faster than 10GbE – with QDR products already available. How does that make 10GbE the superior choice, again? Wait, you say. Those are only nominal bandwidths, right? True enough, and just as true for 10GbE as for IB. It would be a little disingenuous to point out that IB doesn’t really achieve 16Gb/s except “on it’s best day with all the electrons flowing in the same direction” without also pointing out that 10GbE is subject to the same effects (and the vast majority of cards according to 10GbE.net’s own price lists aren’t even physically capable of more than 13Gb/s across two ports).
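For anyone who wants to check the arithmetic, here is a minimal Python sketch of the 8b/10b conversion discussed above. It assumes, as the quoted passage does, that both links use 8b/10b encoding, and it takes the 20Gb/s DDR signal rate and the 10Gb/s Ethernet data rate as given. The function names are made up for illustration. It covers nominal rates only, not measured throughput.

```python
from fractions import Fraction

# 8b/10b encoding: every 10 signal bits on the wire carry 8 data bits.
ENCODING_EFFICIENCY = Fraction(8, 10)


def data_rate(signal_rate_gbps):
    """Nominal data rate (Gb/s) for a given 8b/10b signal rate (Gb/s)."""
    return Fraction(signal_rate_gbps) * ENCODING_EFFICIENCY


def signal_rate(data_rate_gbps):
    """Nominal signal rate (Gb/s) needed to carry a given data rate (Gb/s)."""
    return Fraction(data_rate_gbps) / ENCODING_EFFICIENCY


def percent_faster(a_gbps, b_gbps):
    """How much faster a is than b, as a percentage of b."""
    return (Fraction(a_gbps) / Fraction(b_gbps) - 1) * 100


# Assumed inputs, taken from the figures quoted in the post.
IB_DDR_SIGNAL_GBPS = 20   # advertised DDR InfiniBand signal rate
TENGBE_DATA_GBPS = 10     # advertised 10GbE data rate

ib_ddr_data = data_rate(IB_DDR_SIGNAL_GBPS)        # 16 Gb/s of data
tengbe_signal = signal_rate(TENGBE_DATA_GBPS)      # 12.5 Gb/s of signal
advantage = percent_faster(ib_ddr_data, TENGBE_DATA_GBPS)  # 60 percent

print(f"DDR IB data rate under 'Ethernet math': {float(ib_ddr_data)} Gb/s")
print(f"10GbE signal rate under 'InfiniBand math': {float(tengbe_signal)} Gb/s")
print(f"DDR IB nominal advantage over 10GbE: {float(advantage)}%")
```

Whichever convention is used, the comparison comes out the same as long as both sides are measured the same way: 16 against 10 in data rate, or 20 against 12.5 in signal rate.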
The writing style on 10GbE.net is strikingly similar to that of a certain Cisco employee. Instead of launching all this FUD from behind a screen of anonymity, would it be too much to ask that the author be a little more honest about his associations? When he can show repeatable, verifiable results indicating that DDR IB doesn’t still trounce 10GbE at the same price point, then we can have a real discussion about cluster interconnects.
Bandwidth isn’t the problem. Price parity is, and now that 10GbE prices are dropping, people are seeing more synergy with larger Ethernet networks. I wrote more on my blog.
Also, why the passive aggressiveness? If you want to claim that whoever wrote that worked for Cisco or Juniper or whoever else, why not just call them out to their face?
I’m not claiming it’s a particular person because I don’t know it’s a particular person. I’m just noting a suspicion, and perhaps warning the author that he’s not doing a very good job of hiding his identity. If you want to complain about “passive aggressive” behavior (which this isn’t BTW) then maybe you should direct those comments toward the anonymity-abusing author of 10GbE.net. The same failure to put a specific name behind a specific claim, for which you’re so quick to criticize me, is his entire stock in trade.
Of course, since your blog only identifies you as “a (currently) anonymous network engineer” it’s easy to see where your sympathies lie on this issue. Before you criticize anyone for not taking a stand, fix that.
Jeff, you hate me more than anyone, but you know this stuff and I have a question. (As well as the sales guy shit you think I’m filled with, everyone else who knows me will attest to the fact that I’m also filled with questions.)
What do you think of scale-out storage architectures? I ask as I think NetApp are going to use iWARP over 10GbE as the node interconnect for their scale-out architecture. Is there value in dumping the idea of a storage array and going for a unified namespace with nodes all over the place?
Is there value to such an idea? Yes. Is there risk to such an idea? Yes again. Big monolithic storage systems tend to hit the same wall that big monolithic compute systems hit a while ago – at a certain point the complexity to gain the next increment in capacity or performance starts to exceed the value of that increment to customers. At that point it makes more sense to have many smaller units, but if you want to do that within a single namespace then you have to deal with the issues of keeping that single namespace consistent. (Even if you don’t keep actual file data consistent, you have to keep most metadata consistent or people will find your product too annoying to use.) You also get to deal with a whole bunch of extra failure and partial-availability scenarios. That’s the kind of stuff I had to deal with when I was working on HighRoad/MPFS, and that I deal with now using Lustre or PVFS. It can be very rewarding, and it can also be very frustrating, for both developers and users. Overall I think it’s worth it, but there’s still a heckuva lot of work to be done making it as reliable and convenient as more centralized solutions already are.
This is where “scale-out architecture” becomes a little ambiguous, and it becomes a bit harder to answer your question. It’s possible to design a scale-out architecture where the protocols and interconnects used on the “front” (client/customer) side are only slightly different from those in use before, and there’s a separate back-end network and protocol used only by the servers. The Celerra, even with MPFS, is kind of like this, as are Panasas and Isilon. Spinnaker, which was absorbed by NetApp to provide a scale-out strategy, was of this type also. The other approach does away with the back-end network and puts all of the traffic on the front side, with clients using a fundamentally different – and more complex – kind of protocol than Plain Old NFS ever was. Lustre and PVFS are examples here, with significant traction in HPC but little elsewhere. In the first approach the ideal is to use the fastest back-end network you can, without regard for issues like familiarity or ubiquity, lest you find your product hobbled by an interconnect that can’t keep up (as the Celerra was when I was working on that stuff). In the second approach, the ideal is to use the same interconnect your clients already have for computation. Yes, there are arguments in favor of keeping the two kinds of traffic from interfering with one another, but isolating them on separate networks doesn’t achieve that as long as they can interfere with each other on the hosts themselves (as they tend to). Isolation becomes a weak argument compared to the cost or administrative-complexity arguments favoring a unified network. (Note that the isolation argument is stronger in the separate-back-end model where the two networks only meet at servers and not at clients.)
Going back to what I originally said, though, I think scale-out architectures will eventually dominate storage the way they already dominate computation, and for the same reasons. The real question is how much of the scale-out architecture will be directly visible to clients.
Hmmmmm.
Lots there for me to think about. Thanks!