Ten Steps for Managing Parallel Computing Projects
Don Heller
IEEE Parallel and Distributed Technology, Spring 1994, pp. 6-8
[updated]
Parallel computing can work in the real world. But it's not enough that
parallel computers can improve performance, capacity, price,
reliability, flexibility, and scalability compared with more
traditional systems. Anyone trying to introduce parallel systems -- or
any other alternative architecture -- into an organization must also
account for business plans and risk tolerance.
Here's
the setting: you have some existing hardware and a lot
of software,
and you want to consider some radically new hardware.
Chances are the
old programs won't run efficiently without changes.
Will the
reengineering effort, which extends beyond a simple programming
task,
return a sufficient reward? Can the risks be identified
and
managed?
The following simple rules should help in planning
and execution.
They are the result of 20 years of thinking about
parallel computers,
and 10 years of effort to put them into use. I
don't claim to
have followed all these rules myself, except perhaps
after violating
them.
Most of these ideas derive from my
direct experience during
1983-1993 working with Shell Oil's research,
application, and
system programs for a variety of parallel computers.
Shell was
nCUBE's first customer in 1984 and has been using its
systems
in a fully integrated operational setting since 1988. The
Royal
Dutch/Shell Group has also used Meiko and Parsytec systems.
The
applications have primarily been seismic exploration,
petroleum
reservoir simulation, and molecular dynamics.
1. Be clear about your objectives.
- Start with a
business need for the project, or at least an
anticipated need at
delivery time. Meet the need, and don't miss
your window of
opportunity.
- Be prepared to deal with fuzzy objectives at the
start, and
plan to clarify them as you go. If you can't clarify
them, you
won't have a specific goal to meet, and can't claim
success.
Multiple goals are not necessarily good. Poorly focused
objectives
may slow progress.
- Find a way to measure the
project's positive effects, even
(or maybe especially) if it seems
to have failed. On the chance
that you won't meet the goals, be
prepared to salvage some ideas
for another project.
- Allow
exploratory research, but don't confuse it with product
development. If everything works out, research will evolve into
a
product. If nothing works out and you still didn't learn anything,
it wasn't good research. If the research staff dissipates into
unrelated projects, you haven't gained anything.
2. Be realistic about your objectives.
- Systems rarely work as well as advertised. Don't promise
what
no one can deliver. Don't promise much more than *you* can
deliver.
- Place little weight on performance figures that come from
non-technical literature, from systems that are not commercially
available, or from application programs written by graduate students.
- Be sure you understand how the components operate and how
they
connect. Disregard advertised peak rates, and discover the
real
performance constraints. Try to operate at these limits.
Try to
acquire enough equipment to operate at the desired solution
rate.
- Removing one constraint will expose another. Some constraints
can't be removed. Removing a secondary constraint is either a
waste
of time or planning for the future. Can you predict which
case you
have?
- A false analogy to optimization theory could tempt you
to
use a function of the active constraints to measure system
balance
or scalability. Your active constraints may be unrelated to
the
alternative architecture. Rest assured, the sales staff will
inform you about the good qualities of the inactive constraints.
- Beware of scaling an existing machine's performance to a
nonexistent machine. The latter may never exist, may be beyond
your
budget, may add a constraint that invalidates the extrapolation,
or
may remove a constraint, making the measurements obsolete.
- If
there's only one of some component, the system won't scale,
though
that component might become faster or bigger. If there
are only a
few, the system will scale differently than if there
are many. If
there are very many, it will still not scale infinitely.
- A lot of optimal algorithms aren't, by the time you read the fine
print. PRAMs aren't sold on the open market, and that $\log \log N$
factor could easily matter less than getting the program to work.
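The point about single components limiting scalability can be
illustrated with Amdahl's law, a standard textbook model that the
article itself does not name. The sketch below assumes a fixed problem
size and a single non-parallel component that takes a fixed fraction of
the one-processor run time; the function name and the 5% figure are
illustrative assumptions, not measurements.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law for a fixed problem size.

    serial_fraction: share of single-processor run time spent in the one
                     component that cannot be replicated (assumed value).
    processors:      number of processors applied to the remaining work.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical program with 5% of its time in a non-scalable component.
    for p in (1, 16, 256, 4096):
        print(f"{p:5d} processors: speedup {amdahl_speedup(0.05, p):.2f}")
```

Under these assumptions the speedup can never exceed 1/0.05 = 20, no
matter how many processors are added. Real systems have several such
constraints, and removing one exposes the next.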
3. Keep pace with standard architectures. Better yet, keep ahead by
several years.
- Rapid progress is required with an alternative architecture;
otherwise, you could meet the same objectives -- and save development
costs -- by waiting for faster standard machines. Reduced goals can be
met at vastly reduced cost by using standard architectures later, and
"later" arrives sooner if you fall behind.
- If it is not possible to develop the final product
quickly,
develop a functional prototype that can be evaluated
quickly,
and perhaps used in a limited way.
- The alternative
machine must be large and scalable to stay
ahead of the best
workstations, whose speed doubles every 18-24
months, and which
will always be easier to program and more appropriate
for smaller
problems.
- As application problem sizes grow over time, the alternative
architecture must also grow. Increased memory will provide much of the
gain, so memory size should grow faster than processor speed. I/O
capacity never grows fast enough.
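The race against standard machines can be put into rough numbers. The
sketch below assumes that workstation speed grows as a smooth
exponential with a fixed doubling time (the article gives 18-24
months), that the alternative machine is never upgraded, and that its
advantage is a single speedup factor. The function names and the sample
figures are hypothetical.

```python
import math


def months_until_caught(speedup_today: float, doubling_months: float) -> float:
    """Months until a standard machine matches today's parallel speedup.

    Assumes standard-machine speed doubles every `doubling_months` and the
    parallel system stays as it is today.
    """
    if speedup_today < 1.0:
        raise ValueError("speedup_today must be at least 1")
    if doubling_months <= 0.0:
        raise ValueError("doubling_months must be positive")
    return doubling_months * math.log2(speedup_today)


def useful_window_months(speedup_today: float, doubling_months: float,
                         development_months: float) -> float:
    """Months of advantage left once the application is finally delivered."""
    return months_until_caught(speedup_today, doubling_months) - development_months


if __name__ == "__main__":
    # Hypothetical project: 8x faster than the best workstation today,
    # with 30 months of development before users see the application.
    for doubling in (18.0, 24.0):
        window = useful_window_months(8.0, doubling, 30.0)
        print(f"doubling every {doubling:.0f} months: {window:.0f} months of advantage")
```

With an 8x advantage and an 18-month doubling time, the standard
machine catches up in 54 months, so a 30-month development effort
leaves 24 months of advantage. A negative result means waiting would
have been cheaper.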
4. Keep pace with applications on standard architectures.
- When basically the
same program has been implemented on two
different systems, the two
versions should accept the same input,
produce the same output, be
modified by the same programmers,
and updated on the same schedule.
Users shouldn't be able to
notice which machine is being used,
except for different costs
and execution times.
- Since the
above requirement is probably not achievable, keep
the differences
below a level acceptable to the users. This requires
consultation
during development, not after.
- Consider software maintenance
costs. The support staff should
have to make large or fundamental
changes no more than once,
and preferably never. So, automate as
much as possible.
- For some programs, better algorithms improve
effectiveness
before better hardware increases speed. Algorithms
and hardware
contribute equally in the long run.
- Software
lives longer than hardware in most organizations,
and data formats
live even longer if they are shared among many
large programs.
Prior investments therefore motivate compatibility
among systems,
and future investments for one system should not
ignore those being
made for another.
5. Keep pace with improvements to the alternative architecture.
- There
are small improvements (primarily bug fixes and evolutionary
steps)
and large improvements (evolutionary leaps). Application
programs
must be able to follow the system's improvements, though
the leaps
may be harder to follow.
- If there won't be any more
improvements, you've got a problem.
Protect yourself with portable
code and allow for alternatives.
- If you can't suggest
improvements, you haven't been paying
attention.
- Promptly
communicate problems to the computer manufacturer,
who should
provide timely repairs or replacements. The number
of problems
should decrease over time. The nature of the problems
should change
over time, as you test the system's different limits.
- Don't
fret about new machines that are announced for delivery
after your
project is supposed to be finished. A chip in the
hand is worth two
in the press. Do fret about machines that are
actually available
and functional.
- New technology is a moving target, particularly
system software.
There are new ideas and tools from academia and
from other industrial
or government customers, and continuing
developments by the manufacturer.
You can help push the technology
forward, and you must be prepared
to use advances when they become
available.
6. Keep overhead under control.
- Programmers will
encounter new problems, such as communication
overhead in a
multiprocessor. Allow them time to adjust.
- There will be bugs: bugs in the hardware, bugs in the operating
system, bugs in the compiler, bugs in the new program, and especially
newly discovered bugs in the old program. It is impossible to predict
how much time you will spend chasing them, so assume 50%.
- Track down the
overhead in programs and the organization.
In a program, the
culprit is probably the niftiest feature. In
an organization, it's
the meetings and reports, but without them
it's difficult to build
a consensus or explain what you've done.
- Keep a diary. You will
be amazed.
7. Resolve operational system issues concurrently with application
program development.
- Enforce major
operational requirements -- such as reliability,
network
connections, remote execution, batch queue management
and system
partitioning -- throughout development, not just at
the end of the
project.
- Provide adequate resources so research, program
development,
system development, and operations do not conflict.
- The world is heterogeneous. No one computer product, no matter
how scalable, will meet all needs. Networks will therefore be
the
biggest headache, because they require compatibility and
cooperation.
- Laboratory use is not the same as field use. Lab users will more
likely tolerate inadequacy, because they expect future improvements.
Outside the lab, the future is now: failures are not tolerated.
- Equally important is the risk-reward
tradeoff: Sometimes
users will tolerate unsupported products that
deliver major benefits,
but they must know the risks in advance.
- Testing for correctness is insufficient: You may not find
all
the bugs, due to an inadequate test set, and cannot use tests
alone
to prove that there are no bugs. Even correctness theorems
are
suspect, as the specifications may be wrong (an improper
model,
poor numerical approximations, unexpected behavior on
some
unspecified point, and so on). But don't skip the tests
and
theorems -- they build confidence.
- User acceptance is crucial.
Frequent minor annoyances, and
unnecessary or unexplained
departures from current practice on
standard architectures, can
obscure a new technology's merits.
- Train the support staff
well, long before operations begin,
so they can improve their
tools.
8. Find a "champion" to help introduce new technology.
- Rarely is a new
tool so obviously useful that it is adopted
quickly, without
encouragement or advertising. A champion can
supply enough
resources and pressure to make the technology useful,
encourage its
use, verify its benefits, and see that it survives
long enough to
become useful if the benefits are not immediate.
- There are no guarantees that new is better. The champion must be
willing to share the risk if the technology fails, and should not turn
the organization against the technology in a way that might prevent its
use in a different environment or cause improvements to go
unrecognized.
9. Keep the financial people happy.
- There's a risk you'll go
over budget. If the total exceeds
the cost of a large standard
machine and a good research project,
management will balk.
- One reason to buy a microprocessor-based parallel computer
is
to ride the waves of processor and memory improvements. But
once
you purchase it, you get off those waves if you never update
the
processors or increase the memory. Amortization and depreciation
must account for frequent hardware updates. A system may not
be so
cheap if it is upgraded frequently.
- Balancing long-term
benefits and totally new markets against
short-term profits is not
easy. Even if you remind everyone how
much money this project will
earn or save, someone might notice
that they could save even more
by stopping it. Short-term spinoffs
will help defend your position.
Still, there may be no defense
against
shortsightedness.
10. Be prepared to redo it.
- It may become desirable to
make a lateral shift to a system
with equivalent capability but
much lower cost. However, this
will pose some portability problems.
Plan ahead, write portable
code, and isolate the compute and
communication kernels.
- Chances are you learned so much in the
first implementation
that you'd like to redo it -- maybe to get it
right, maybe to
make it faster or better. Think twice before
charging ahead.
The support staff always rewrites code from the
research staff
anyway.
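One way to isolate the compute and communication kernels is to put all
communication behind a small interface, so that a move to another
machine means rewriting one class instead of the whole application. The
sketch below is only an illustration: `Communicator`,
`SerialCommunicator`, `local_dot`, and `distributed_dot` are
hypothetical names, and a real port would supply an implementation
backed by the target machine's message-passing library.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class Communicator(ABC):
    """Hypothetical interface: the only place the application communicates."""

    @abstractmethod
    def allreduce_sum(self, value: float) -> float:
        """Return the sum of `value` over all participating processes."""


class SerialCommunicator(Communicator):
    """Single-process stand-in, useful for testing on a standard machine."""

    def allreduce_sum(self, value: float) -> float:
        # With one process, the global sum is just the local value.
        return value


def local_dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Compute kernel: no communication, so it can be tuned per machine."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


def distributed_dot(comm: Communicator,
                    x_local: Sequence[float],
                    y_local: Sequence[float]) -> float:
    """Application code: combines the two kernels, knows neither machine."""
    return comm.allreduce_sum(local_dot(x_local, y_local))


if __name__ == "__main__":
    print(distributed_dot(SerialCommunicator(), [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

The serial stand-in also helps with the earlier rule about keeping pace
with standard architectures, since the same application code can be run
and checked on both kinds of machine.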
11. Go back and review Rules 1 and 2.
Written by Don Heller while at the Center for Research on Parallel
Computation, Rice University
Contact: Don Heller dheller@cse.psu.edu