Iowa State University, Ames, Iowa
Until very recently, parallel computing came in a box, and an incredibly expensive one. Inside a boxed system is an activity center composed of anywhere from several to many thousands of central processing units (CPUs) that interpret data and execute instructions. It's like packing the processing power of a whole warehouse of computers into a shipping crate. What you get is a supercomputer, but that much power comes at an outrageous price.
Just one such boxed system can cost millions of dollars, and what if one or two CPUs go bad? How do you fix a supercomputer? Mark Gordon, Ames Lab program director for Applied Mathematics and Computational Sciences, explains the reality of supercomputer repair. "You pay the manufacturer 10 to 20 percent of the purchase price each year for maintenance," he says. "You can't get them fixed anywhere else."
David Halstead (standing) and Mark Gordon are working to improve the communications among personal computers in network clusters. Their efforts are helping to make parallel computing power available for greatly reduced costs.
But high-performance parallel computing doesn't have to cost a fortune. Gordon and several of his colleagues at Ames Laboratory's Scalable Computing Lab (SCL) are determined to make traditionally expensive supercomputing capabilities more economical and attainable for scientific and educational communities.
Playing an intriguing game of "workstation upset," Gordon and SCL researchers David Halstead, John Gustafson, Stephen Elbert, Don Heller, Dave Turner, and Bruce Harmon are dumping the traditional concept of boxed-system supercomputers. Incorporating the option to grow, or scale up, they are taking computing power out of the box and spreading it over networks of personal computers (PCs), creating clusters that operate at speeds comparable to today's best parallel computers, and for a fraction of the cost.
Parallel computers are like valued employees: they can handle a number of tasks at the same time. Their multiple processors work simultaneously on different parts of a single problem, making them far more efficient and able to handle more complex problems than sequential computers, which tackle problems one step at a time.
One of the ways the SCL team hopes to make parallel computing more economical is by devising a "cluster cookbook" for the World Wide Web, which will provide guidelines on how to construct PC clusters. Team members also plan to develop and host a hands-on workshop that will help bring the cost-saving cluster computing technique into university departments, individual research groups, and the classroom.
"What precipitated the clustering effort is that in the last few years manufacturers of personal computers have been making them with speeds that are equivalent to workstations," says Gordon. "This means that you can take these computers as individual units and use them for really high-performance computing in a sequential sense. You can put on one $3,000 personal computer what you used to put on a high-performance workstation, and it will perform just as well for you.
"What's hard is going the next step: networking clusters of these PCs to make a true, parallel, high-performance computer that's competitive with boxed systems that cost millions of dollars," Gordon says.
Gordon and his colleagues have made that intricate job look easy. "It's clear that small clusters work, small being eight," he says. "We have a math cluster with eight nodes running partial differential equations. And all of our quantum chemistry codes can run in parallel." Gordon adds that in other areas of the Lab, clusters of eight PCs are doing materials simulations and modeling new materials with desirable magnetic properties. "So the issue is, does clustering work when you scale up to 64? Does it work when you scale up to 128? That's really unknown."
Achieving parallel power with networks of personal computers requires connections on top of connections, and all correctly made.
Whether larger networks of PC clusters will work depends to a great extent on how well researchers can optimize message-passing between the various computers in a cluster. The communications system currently in use is called gigabit, which is supposed to be several times faster than previous systems. But the performance has not matched the prediction. So part of the SCL team's research is to determine how best to optimize gigabit.
"The Achilles' heel of PC clusters in the past has been the communications capabilities," says Thom Dunning, Battelle Fellow in Computational Sciences at Pacific Northwest National Laboratory. "And that's one of the major problems they're looking at here at Ames Lab. So they're tackling what is probably the most significant problem in realizing the potential that PC clusters have."
Configuration also has a lot to do with how well large networks of PC clusters perform. "Imagine having 64 PCs," says Gordon. "You can arrange them in lots of different ways, and the further one computer has to go to communicate with another, the more arrangements you have. So there's a traffic-directing problem to deal with."
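To make Gordon's point about arrangement concrete, here is a minimal Python sketch that counts the worst-case number of hops between any two of 64 machines under three textbook layouts. The layouts (a ring, an 8x8 grid, and a 6-dimensional hypercube) and the function names are illustrative assumptions. The article does not say which configurations the SCL team actually evaluated.

```python
from collections import deque


def ring(n):
    """Each node links to its two neighbours in a closed loop."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def mesh(side):
    """side x side grid without wraparound; links go up, down, left, right."""
    links = {}
    for r in range(side):
        for c in range(side):
            nbrs = []
            if r > 0:
                nbrs.append((r - 1) * side + c)
            if r < side - 1:
                nbrs.append((r + 1) * side + c)
            if c > 0:
                nbrs.append(r * side + c - 1)
            if c < side - 1:
                nbrs.append(r * side + c + 1)
            links[r * side + c] = nbrs
    return links


def hypercube(dim):
    """2**dim nodes; nodes whose labels differ in exactly one bit are linked."""
    return {i: [i ^ (1 << b) for b in range(dim)] for i in range(2 ** dim)}


def worst_case_hops(links):
    """Longest shortest path between any two nodes, found by BFS from each node."""
    worst = 0
    for start in links:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in links[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


for name, links in [("ring", ring(64)),
                    ("8x8 mesh", mesh(8)),
                    ("6-D hypercube", hypercube(6))]:
    print(name, worst_case_hops(links))
```

Under these assumptions the same 64 machines are at most 32, 14, or 6 hops apart, depending only on how they are wired. This is the kind of difference that turns configuration into a traffic-directing problem.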
In addition, Halstead reminds us of the obvious issue. "There's a definite storage concern for a large cluster network. Where do you put the thing? Also, all the energy that goes into a computer comes out as heat, so you'll use more power for air conditioning."
Each personal computer in the SCL's 64-node cluster contains two central processing units, making a powerful system of 128 processors.
SCL researchers believe the potential benefits of scalable cluster computing far outshine the issues they are working to resolve. A big advantage of taking the cluster path to parallel computing is that you can always reconfigure the system to meet the needs of the day. For instance, instead of having a 128-node computer, you might choose to have two 64-node computers. "You can set these things up as a computer lab, launch them to run a word processing program for a class, and reboot them to run as a single parallel computer," says Halstead. "You can trade off with other departments, whatever you want to do. With this kind of sharing of resources, you're only really limited by curriculum originality."
Halstead also notes that although PC clusters are not meant to replace the powerful supercomputers that operate at Department of Energy national labs, they can lessen the computing burden placed upon these machines and so reduce the barrier to national computer use.
Dunning explains, "The bigger systems are very full; they're oversubscribed because far too often we go to the big machines to solve routine problems. We're not taking advantage of the economics that could come if we had tiers of computing capability designed for specific tasks. I see PC clusters as being very effective at satisfying an intermediate set of demands that would then free up the big supercomputers to do the job that they are really intended to do, which is the very high-end computing."
Without a doubt, however, the biggest advantage of PC clusters is their low cost.
"Cluster computing is a happy synchronization of technologies," says Halstead. "If a PC dies, you throw it away and go down and buy another one, kind of like replacing a fan belt. The beauty is if a particular PC is not needed in a cluster, it's still a very powerful desktop machine. We're not buying into a hardware technology that in a year's time will be completely useless due to lack of software support."
Maybe the answer to whether clusters will work when scaled up to larger and larger systems will come soon. The SCL team recently constructed a cluster of 64 PCs, each with two central processing units, and is now testing its ability to perform parallel computations.
Halstead has high praise for the ISU students employed by the SCL, noting that the cluster, fondly named ALICE (Ames Lab/ISU Cluster Environment), was made possible by their extensive knowledge of the operating system and network communication protocols. The students include Vasily Lewis, Brian Smith, Chris Csanady, Stephanie Holeman and Chris Williams. "Vasily was responsible for the design and implementation of the ALICE compute node, the ease of which is essential to the project," Halstead says. "And Brian produced a cluster monitoring tool that allows the status of the cluster to be ascertained using a web browser. The talents of the entire student group were invaluable in addressing the difficult issue of configuring 64 PCs so that they act as a single computer.
"At the end of the day, this thing should be four times as fast and have four times the storage capacity and memory as the largest supercomputer in the SCL, which cost just shy of a million dollars," says Halstead. "So it's four times as fast for a third of the price."
The SCL researchers have compared the performance of their clusters with that of commercial parallel computers by using a computer benchmark called HINT, which Gustafson developed and for which he earned an R&D 100 Award in 1995. "I think HINT is better than just about any other way of benchmarking computers," says Gordon. "If you take John Gustafson's method and evaluate our clusters against one of the best parallel computers on the market today and then go the next step and divide that by the cost to get the price/performance ratio, the clusters blow everything else away. And that's our point: not to show people how to do really great parallel computing, because lots of people can do that, but to show them how they can do it in a very cost-effective way."
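The price/performance comparison Gordon describes comes down to dividing relative performance by relative cost. As a rough sketch of that arithmetic only (it is not the HINT benchmark, and the function name is a hypothetical one chosen for illustration), Halstead's "four times as fast for a third of the price" works out as follows in Python:

```python
from fractions import Fraction


def price_performance_gain(speedup, price_fraction):
    """Return how many times more performance per dollar the new system gives.

    speedup        -- new speed divided by old speed
    price_fraction -- new price divided by old price
    """
    return Fraction(speedup) / Fraction(price_fraction)


# Figures quoted by Halstead: four times as fast, a third of the price.
gain = price_performance_gain(4, Fraction(1, 3))
print(gain)  # 12
```

Taken at face value, the quoted figures imply roughly twelve times the performance per dollar, though the final result would depend on the measured benchmark scores and the actual purchase prices.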
Mark Gordon, 515-294-0452; gordon@ameslab.gov
David Halstead, 515-294-1943; halstead@ameslab.gov
Current research funded by: DOE Basic Energy Sciences Office
Published by Inquiry, winter 1998
The URL for this page is: http://www.scl.ameslab.gov/Publications/Inquiry,winter1998.html