Computational Clusters

Computational/High Performance Linux clusters started back in 1994 when Donald Becker and Thomas Sterling built a cluster for NASA. This cluster was made up of 16 DX4 processors connected by 10 Mbit Ethernet, and they named it Beowulf.

Since then, the Beowulf Project has been joined by other software projects that provide useful solutions for turning Commercial Off the Shelf (COTS) hardware into clusters capable of supercomputer speed.

These clusters have been used for everything from simple data mining, file serving, database serving, and web serving to flight simulation, computer graphics rendering, weather modelling, and ripping CDs at truly outstanding speeds.

This page will direct you to various places out there that will help you on your way to building a computational Linux cluster.






Clustering Software

  • Alinka Raisin: Alinka's Raisin software package can do everything from creation to administration of High Performance Linux clusters. It uses batch queuing systems such as PBS (or others such as LSF or NQE upon request), MOSIX for process migration, and the parallel file system PVFS, and comes with a web-based user interface.


  • The Beowulf Project: This is the project that started it all. This page contains a history of the Beowulf Project, links to Beowulf software and Networking Drivers, as well as links to various Beowulf clusters that are in use in various places around the world.


  • Black Lab: Black Lab is a tool for building and managing clusters running Yellow Dog Linux (Yellow Dog is a PPC distro).


  • CLIC: CLIC is Mandrake's project to build an HPC Linux distribution for 32- and 64-bit processors. The goal is to create a Linux distro designed specifically for clusters.


  • Clubmask: Clubmask is a collection of existing Open Source and new software for the installation, configuration, and management of Beowulf style high performance computing clusters. The design and goal of the project is to provide a "Physicist Proof", completely turnkey set of tools.


  • Cluster Infrastructure: Cluster Infrastructure for Linux (CI) aims at developing a common infrastructure for Linux clustering by extending cluster membership and internode communication subsystems from Compaq's NonStop Clusters for Unixware code base. This project also provides the basis for the SSI Clusters for Linux project.


  • Cluster Systems Management: Cluster Systems Management for Linux (CSM) enables organizations to build, manage, and expand clusters. A free demonstration of CSM, called the Cluster Starter Kit, is available here.


  • ClusterIt: ClusterIt is a collection of clustering tools, allowing for heterogeneous cluster makeup, including various architectures and operating systems, various authentication mechanisms, job sequencing, distributed virtual terminals, and more.


  • Clustermatic: Clustermatic is a collection of technologies being developed at the Cluster Research Lab at Los Alamos National Laboratory. Besides the new software being developed by the group, existing projects such as LinuxBIOS and BProc are integrated into it as well.


  • Cplant: Computational Plant (a.k.a. Cplant) is a newly released project coming from the folks at the Sandia National Laboratories. The goal is "to provide a commodity-based, large-scale computing resource that meets the level of compute performance needed by Sandia's critical applications."


  • EnFuzion: EnFuzion, by TurboLinux, supports clusters of up to 1000 machines, a combination of Linux and Windows NT machines, and provides fault tolerant, highly available supercomputing speed. It also has an API, allowing for easy integration of applications.


  • Ka: Ka is a toolkit designed to install and administer a cluster of boxes. It focuses on scalable parallel system installation, data distribution, and process launching. Ka has been tested on clusters of up to 225 nodes.


  • LCFG: LCFG (Local Configuration System) is a system for automatically installing and managing the configuration of large numbers of Unix systems. It is particularly well suited to environments with diverse and rapidly changing configurations.


  • MSC.Linux: MSC.Linux is a high performance/cluster distribution that is designed for computational environments in engineering and life sciences.


  • MOSIX: MOSIX is a software package that enhances the Linux kernel with cluster capabilities. The enhanced kernel supports any size cluster of x86/Pentium-based boxes. MOSIX allows for the automatic and transparent migration of processes to other nodes in the cluster, while standard Linux process control utilities, such as 'ps', still show each process as if it were running on the node where it originated.


  • openMosix: openMosix is a spin-off of the original MOSIX. The first version of openMosix is fully compatible with the last version of MOSIX, but the project intends to go its own way from there.


  • OSCAR: OSCAR (Open Source Cluster Application Resources) is a bundle of software designed to make it easy to build, maintain, and use a modest sized Linux cluster.


  • Rocks: The Rocks Clustering Toolkit, from the folks at NPACI, is a collection of Open Source tools to help build, manage, and monitor clusters.


  • SCE: SCE (Scalable Cluster Environment) is an easy-to-use set of interoperable Open Source tools that lets the user quickly install, configure, and use a Beowulf cluster.


  • Scali Software Platform: The Scali Software Platform (SSP) delivers a set of tools targeting all aspects of building, maintaining, and using clusters ranging from a handful to hundreds of nodes. It covers everything from low level drivers to high level administration.


  • SCore: SCore, by the Real World Computing Partnership (RWCP), is not Beowulf style software in the usual sense: it is designed for the high performance cluster environment without using the TCP/IP stack.


  • The Scyld Beowulf Cluster Operating System: Scyld is the second generation of the Beowulf clustering software. Scyld Computing Corporation was started by Donald Becker and a few other folks from the original Beowulf Project team. Scyld recently announced the first commercial release of its "Next Generation Beowulf Cluster Operating System" (see our news page for a link to the story).

    Some of the tools and options that Scyld Beowulf provides:

    • Automatic, remote installation of compute nodes
    • A single system image is used for the whole cluster
    • An optimized version of MPI, based on MPICH
    • A special set of kernel modifications, utilities, and libraries which, in essence, turn the whole cluster into one big shared PID space. Processes can be started on the master node and then migrated to the slave nodes, while all process control and monitoring can still take place on the master node using the standard Unix process control utilities.
    • GUI interfaces for configuring and monitoring clusters
  • Single System Image Clusters: Single System Image Clusters for Linux (SSI) aims at providing a full, highly available, single system image cluster environment for Linux, with the goals of availability, scalability, and manageability, built from standard servers.


  • Warewulf: Warewulf is a distribution of tools designed to aid in the implementation and administration of Beowulf style clusters. It has no underlying OS, so it can be installed on top of your favorite base system, and it lets the user choose which environment (LAM vs. MPICH, for example) they wish to use.




Clustering Tools and Libraries

  • ANTS: The ANTS Load Balancing System is a load balancer/queueing package.


  • Distributed Debugging Tool: DDT is a commercial debugger designed for debugging parallel code while programming in a cluster environment. Evaluation copies are available on the website.


  • Grid Engine: The Grid Engine Project is an open source project based on Sun's commercial product, "Sun Grid Engine," which can be seen here. Grid Engine is Distributed Resource Management software, used to create compute farms.


  • LSF: LSF (Load Sharing Facility) is a suite of software available for various Unixes and NT. It performs load sharing and balancing, and job scheduling.


  • MPI: MPI (Message Passing Interface) is a library specification for message passing, proposed as a standard by an industry consortium of vendors, implementors, and users. It has many free and commercial implementations, and because MPI is an open standard, anyone can tune an implementation for their own use, but the calling structure and API must remain unchanged. All manufacturers of commercial supercomputers provide a version of MPI with their systems. (A small example program appears after the list of sites below.)

    Here are some useful MPI sites:

    • LAM: LAM (Local Area Multicomputer) is an MPI programming environment and development system developed at the Ohio Supercomputer Center and Notre Dame University, now being developed and maintained by a group at Indiana University. It is freely available for download.
    • MP_Lite: MP_Lite is a lightweight message passing library designed to deliver maximum performance to applications in a portable and user-friendly manner.
    • MPICH: MPICH is a portable implementation of MPI, developed at Argonne National Laboratory. It is freely available and an intentionally vanilla implementation, which makes it easy to port to various Unixes. There is also a Windows NT version available.
    • MPI FAQ
    • MPI Forum: The Message Passing Interface Forum contains the official MPI standards documents, errata, and archives of the MPI Forum, which is an open group who define and maintain the MPI standard.
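
    As a rough illustration, here is a minimal MPI program in C (our own
    sketch, not taken from any of the sites above; the file name
    hello_mpi.c is hypothetical). Every rank other than 0 sends its rank
    number to rank 0, which receives and prints each one:

        /* hello_mpi.c -- a minimal sketch of MPI message passing */
        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char *argv[])
        {
            int rank, size;

            MPI_Init(&argc, &argv);                /* start the MPI runtime     */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank       */
            MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

            if (rank == 0) {
                int i, who;
                MPI_Status status;
                for (i = 1; i < size; i++) {
                    MPI_Recv(&who, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                             MPI_COMM_WORLD, &status);
                    printf("rank 0 got a message from rank %d of %d\n", who, size);
                }
            } else {
                MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }

            MPI_Finalize();                        /* shut down cleanly         */
            return 0;
        }

    With MPICH or LAM, this compiles with "mpicc hello_mpi.c -o hello_mpi"
    and runs with "mpirun -np 4 hello_mpi" (LAM users start the runtime
    first with lamboot).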

  • PADE: PADE (Parallel Applications Development Environment) provides a GUI, running on the user's development host, with all the utilities needed to develop and maintain programs that run on the nodes of a PVM virtual machine.


  • PBS: PBS (Portable Batch System) is a batch queueing and load balancing system originally developed for NASA. It is available for a variety of Unix platforms. There is an Open Source version of PBS, called OpenPBS, which is located here.


  • The Parallel Tools Consortium: Ptools provides a forum for interactions involving tool users, developers, and researchers, promotes the development and dissemination of usable tools, and serves as a liaison with other special-interest groups and standards efforts.


  • PETSc: PETSc (The Portable, Extensible Toolkit for Scientific Computation) is a suite of data structures and routines for the scalable solution of scientific applications modeled by partial differential equations. It is developed and maintained at Argonne National Laboratory.
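
    To give a feel for the toolkit, here is a rough sketch of our own
    (not from the PETSc distribution) of the typical flow: assemble a
    matrix and vectors, then solve the linear system Ax = b with a
    Krylov solver (KSP). Call signatures follow recent PETSc releases,
    and error checking is omitted for brevity:

        /* petsc_laplace.c -- solve a 1-D Laplacian system with PETSc */
        #include <petscksp.h>

        int main(int argc, char **argv)
        {
            Mat A; Vec x, b; KSP ksp;
            PetscInt i, n = 10;

            PetscInitialize(&argc, &argv, NULL, NULL);

            /* assemble the tridiagonal (-1, 2, -1) matrix; for simplicity
               every process inserts the whole matrix, though a real code
               would insert only its locally owned rows */
            MatCreate(PETSC_COMM_WORLD, &A);
            MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
            MatSetFromOptions(A);
            MatSetUp(A);
            for (i = 0; i < n; i++) {
                MatSetValue(A, i, i, 2.0, INSERT_VALUES);
                if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
                if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
            }
            MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
            MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

            VecCreate(PETSC_COMM_WORLD, &b);
            VecSetSizes(b, PETSC_DECIDE, n);
            VecSetFromOptions(b);
            VecDuplicate(b, &x);
            VecSet(b, 1.0);              /* right-hand side */

            KSPCreate(PETSC_COMM_WORLD, &ksp);
            KSPSetOperators(ksp, A, A);  /* system matrix = preconditioner */
            KSPSetFromOptions(ksp);      /* honor -ksp_type, -pc_type, ... */
            KSPSolve(ksp, b, x);

            KSPDestroy(&ksp); VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
            PetscFinalize();
            return 0;
        }

    Command line options such as -ksp_type cg or -pc_type jacobi select
    the solver and preconditioner at run time without recompiling.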


  • PVM: PVM (Parallel Virtual Machine) is a software package developed at Oak Ridge National Laboratory. It allows a heterogeneous collection of Unix and/or NT computers, connected by a network, to be used as a single parallel computer. Because it is not a kernel-level environment, any user with sufficient space and permissions to compile it, and accounts on the separate machines to be connected, can install and run this software.
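
    To give a feel for the PVM programming model, here is a hedged sketch
    of our own (the file name pvm_hello.c and the message text are
    hypothetical; the calls are the standard PVM 3 API). The program
    spawns a second copy of itself inside the virtual machine and
    receives a greeting from it:

        /* pvm_hello.c -- a minimal PVM sketch */
        #include <stdio.h>
        #include "pvm3.h"

        int main(void)
        {
            int parent = pvm_parent();   /* PvmNoParent if we are the master */

            if (parent == PvmNoParent) {
                int child, n;
                char msg[64];

                /* spawn one more copy of this executable somewhere in the VM */
                n = pvm_spawn("pvm_hello", (char **)0, PvmTaskDefault,
                              "", 1, &child);
                if (n != 1) {
                    printf("spawn failed\n");
                    pvm_exit();
                    return 1;
                }
                pvm_recv(child, 1);      /* block on message tag 1 from child */
                pvm_upkstr(msg);         /* unpack the greeting string        */
                printf("master received: %s\n", msg);
            } else {
                char msg[] = "hello from a spawned PVM task";
                pvm_initsend(PvmDataDefault);  /* fresh send buffer, XDR encoded */
                pvm_pkstr(msg);                /* pack the string                */
                pvm_send(parent, 1);           /* send to parent, tag 1          */
            }
            pvm_exit();                  /* leave the virtual machine */
            return 0;
        }

    Note that the executable has to live where the PVM daemons can find
    it (typically ~/pvm3/bin/$PVM_ARCH), and the virtual machine must be
    started first, e.g. from the pvm console.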




Miscellaneous Cluster Related Sites

  • The Beowulf Underground: Contains information about various vendors providing commercial hardware and software solutions, community developed software, announcements and news, and documentation, including the Beowulf HOWTO.


  • The History of Beowulf: This page gives the history behind the original Beowulf project.


  • The Legend of Beowulf: This link was taken from the Scyld website.




    This site maintained by Joe Greenseid
    Direct questions or comments to [email protected]