PARALLEL RESOURCE SCHEDULING


Perhaps the most challenging task in managing any multi-user computer resource is
dividing up the available CPU, memory, and disk resources among the users in a way
that provides the maximum benefit to the users. The existing queuing solutions have not
proved as successful on cluster computers as they did on systems in which there is a
single system image. The existing solutions were not designed to handle the concept
that a single job is running on multiple individual compute nodes. Therefore, they
exhibit many inadequacies on parallel systems. For example, simply starting a job is
not handled in a scalable fashion, producing start times of up to 10 minutes on today's
high-end systems. Consequently, a parallel queuing system is one of the key needs
identified by the HPC Open Source Working Group. The scheduler must be able to take
into account the heterogeneity among the nodes, both in memory and processors and in
specialized roles such as graphics and I/O nodes.
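
As a rough illustration of what taking node heterogeneity into account could mean,
the sketch below filters nodes by role, memory, and processor type before assigning
them to a job. It is a minimal sketch under stated assumptions, not the Maui or PBS
interface: the Node and JobRequest types, their fields, and the select_nodes function
are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one cluster node."""
    name: str
    memory_gb: int
    cpu_arch: str
    role: str = "compute"  # assumed roles: "compute", "graphics", "io"
    busy: bool = False


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical per-job resource request."""
    node_count: int
    min_memory_gb: int
    cpu_arch: Optional[str] = None  # None means any processor type is acceptable
    role: str = "compute"


def select_nodes(nodes: List[Node], job: JobRequest) -> Optional[List[str]]:
    """Return names of nodes satisfying the request, or None if too few qualify."""
    eligible = [
        n for n in nodes
        if not n.busy
        and n.role == job.role
        and n.memory_gb >= job.min_memory_gb
        and (job.cpu_arch is None or n.cpu_arch == job.cpu_arch)
    ]
    if len(eligible) < job.node_count:
        return None
    # Assumed policy: hand out the smallest adequate nodes first so that
    # large-memory nodes stay free for jobs that actually need them.
    eligible.sort(key=lambda n: (n.memory_gb, n.name))
    return [n.name for n in eligible[:job.node_count]]
```

The smallest-adequate-first ordering is only one possible policy. A production
scheduler would also weigh priorities, reservations, and network topology.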

Now that the Maui Scheduler has been successfully ported to the Portable Batch System
(PBS) environment, basic parallel scheduling is available. However, the port did not resolve
any of the parallel scalability and reliability problems. In addition, the existing PBS
infrastructure greatly limits what information is available to the scheduler. To resolve these
problems, we, along with other members of the Scalable Systems Software Center, will produce a
new integrated suite of tools for the management of large-scale parallel systems. Our efforts
will focus on the resource management tools, particularly the queue manager and security components.