What fault tolerance techniques does Open MPI plan on supporting?
Open MPI plans on supporting the following fault tolerance techniques:

• Coordinated and uncoordinated process checkpoint and restart, similar to those implemented in LAM/MPI and MPICH-V, respectively.
• Message logging techniques, similar to those implemented in MPICH-V.
• Data reliability and network fault tolerance, similar to those implemented in LA-MPI.
• User-directed and communicator-driven fault tolerance, similar to those implemented in FT-MPI.

The Open MPI team will not limit its fault tolerance techniques to those listed above, and intends to extend beyond them in the future.

3. Does Open MPI support checkpoint and restart of parallel jobs (similar to LAM/MPI)?

The current stable release of Open MPI does not support checkpointing and restarting processes. However, the Open MPI development trunk does contain such support. The Open MPI team is actively working on integrating a variety of checkpoint and restart techniques into Open MPI, including similar functional