Amazon has announced Cluster Compute Instances for EC2. This is a very welcome development. Having come from SiCortex, where we provided a somewhat cloud-like ability for users to allocate large numbers of very well-connected nodes on demand, I’ve been talking to people for at least the past year about the idea of provisioning cloud resources on special machines like this. In that light, I find a couple of things about the announcement a bit surprising. Let’s go through the specs first.

  • There’s a single type of HPC instance – dual quad-core “Nehalem” processors with 23GB of memory. The Amazon page points out that even this small amount of transparency about the exact CPU type allows people to do processor-specific optimization that they generally can’t do elsewhere in EC2.
  • Each instance comes with 1.7TB of instance storage. Performance is not mentioned, but at modern disk drive sizes that might well be just two drives.
  • Connectivity is via 10GbE (NIC and switch vendors not specified). Yuk. 10GbE still lags InfiniBand on both bandwidth and latency, in absolute terms and per dollar. Much has been made lately of the significant and increasing dominance of IB in the HPC world, especially in the Top500, and the customers Amazon is trying to attract are likely to consider 10GbE a strange choice at best.
  • There is a default limit of eight Cluster Compute Instances without filling out a request form. Eight machines is not enough for serious work of this type, even when the machines are this powerful, so that’s going to affect – and annoy – practically every user (the launch sketch just after this list shows where the cap bites).
  • The instances are $1.60 per hour, which works out to $38.40 per day, or roughly $1,150 for a 30-day month. There are others far better qualified to comment on the economics, so I’ll leave it at that.
  • Cluster Compute Instances are only available in one availability zone.
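
For concreteness, here’s roughly what launching a batch of these looks like. This is a minimal sketch using Python and the boto3 SDK (modern tooling, not what existed at launch, but the calls map onto the same API); the region and AMI ID are placeholders, and cc1.4xlarge is the API name for this instance type.

```python
import boto3

# Cluster Compute Instances live in a single Availability Zone, so one
# region client suffices (us-east-1 assumed here).
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-12345678",      # placeholder: CC instances require an HVM AMI
    InstanceType="cc1.4xlarge",  # dual quad-core Nehalem, 23GB, 10GbE
    MinCount=8,
    MaxCount=8,                  # the default per-account cap discussed above
)

for inst in response["Instances"]:
    print(inst["InstanceId"], inst["Placement"]["AvailabilityZone"])
```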

My first thought is that the new offering as currently specified is nowhere near as interesting as it could be – and might yet be, as the service continues to evolve. Faster interconnects are one obvious way to make it more interesting. Removing the eight-machine default limit – which I strongly suspect is related to the capacity of the switches they’re using – is another. Then it gets even more interesting. When I’ve talked to people about heterogeneous clouds, which is what we’re heading towards here, I’ve generally meant far more kinds of specialization than this. How about instances optimized for communication instead of computation, with the same 10GbE (or better) but less powerful processors? How about instances optimized for disk I/O, with multiple spindles and/or SSDs? How about special GPU-equipped instances? Once you can deal with the kind of heterogeneity that today’s CCIs represent, it’s but a short step to handling these other variations as well, so today’s announcement might merely foreshadow even bigger things to come.

The other thought I have about this is that it’s not just about the individual instances. The ability to specify that several instances should be provisioned close to one another – probably on the same switch, for the reasons I mentioned above – is interesting with respect to both the user experience and the infrastructure needed to support it. Location transparency might be a defining feature of cloud computing, but only in the sense of absolute location. Relative location is still a very valid parameter for allocating cloud-computing resources. When you define a “cluster placement group” in EC2, you’re effectively saying that these instances should all be close to one another, regardless of where they all are relative to anything else. In other situations, such as disaster recovery, you might want to say that certain instances should definitely be far from each other instead. We’ve been thinking through a lot of these issues on Deltacloud, but this isn’t a work blog, so it would be both unwise and distasteful to say much more about that right now. Suffice it to say that facilitating this kind of placement requires a much more sophisticated cloud infrastructure than “grab whatever’s free wherever it is,” which is pretty much the current standard. When you consider relationships not only between instances but also between instances and the data or connectivity they need, it can become quite a science project. The possibility that Amazon might be doing some of that science is, to me, one of the most exciting things about this announcement.
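
To make the near-versus-far point concrete, here’s another minimal sketch in Python with boto3. The group names are invented, and the “spread” strategy shown for the keep-these-apart case is a later EC2 addition, included only to show that both directions of the constraint fit the same API shape.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# "Close together": a cluster group asks EC2 to co-locate instances,
# in practice on or near the same switch fabric.
ec2.create_placement_group(GroupName="mpi-job", Strategy="cluster")

# "Far apart": a spread group places each instance on distinct underlying
# hardware, closer to the disaster-recovery case described above.
ec2.create_placement_group(GroupName="dr-replicas", Strategy="spread")

# Relative location then becomes one extra argument at launch time.
ec2.run_instances(
    ImageId="ami-12345678",      # placeholder HVM AMI, as before
    InstanceType="cc1.4xlarge",
    MinCount=8,
    MaxCount=8,
    Placement={"GroupName": "mpi-job"},
)
```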