What is “fault tolerance”?
The phrase “fault tolerance” means many things to many people. Typical definitions range from user processes periodically dumping vital state to disk, to checkpoint/restart of running processes, to elaborate recreate-process-state-from-incremental-pieces schemes, to … (you get the idea). In the scope of Open MPI, we typically define “fault tolerance” to mean the ability to recover from one or more component failures in a well-defined manner, using either a transparent or an application-directed mechanism. A component failure may manifest as a corrupted transmission over a faulty network interface, or as the failure of one or more serial or parallel processes due to a processor or node failure. Open MPI strives to give the application a consistent view of the system while still providing a production-quality, high-performance implementation. Yes, that’s pretty much as all-inclusive as possible, and intentionally so! Remember that in addition to being a production-quality MPI implementatio