
What is “fault tolerance”?

April 26, 2017 · fault tolerance


2 Answers


Fault tolerance is a measure of a device’s or computer’s ability to keep working when errors occur. A system that is not fault tolerant fails as soon as it encounters an error.
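To make the contrast concrete, here is a minimal Python sketch, not taken from any particular library. The names `TransientError`, `make_flaky`, and `run_tolerant` are hypothetical and exist only for illustration, and retrying is just one simple way to tolerate a fault. Calling the flaky operation directly fails on the first error, while the wrapper retries a bounded number of times before giving up.

```python
class TransientError(Exception):
    """Hypothetical error type standing in for a recoverable fault."""


def make_flaky(failures):
    """Return an operation that raises TransientError `failures` times, then succeeds."""
    remaining = {"count": failures}

    def op():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientError("simulated fault")
        return "ok"

    return op


def run_tolerant(op, max_attempts=3):
    """Call `op` up to `max_attempts` times; re-raise the last error if all attempts fail.

    Returns a (result, attempts_used) tuple on success.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return op(), attempt
        except TransientError as exc:
            last_error = exc  # remember the fault and try again
    raise last_error
```

Calling `make_flaky(2)()` directly raises on the first fault, whereas `run_tolerant(make_flaky(2))` succeeds on its third attempt. If the operation fails more times than `max_attempts` allows, the wrapper still surfaces the error rather than hiding it.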


The phrase “fault tolerance” means many things to many people. Typical definitions range from user processes periodically dumping vital state to disk, to checkpoint/restart of running processes, to elaborate recreate-process-state-from-incremental-pieces schemes, to … (you get the idea). In the scope of Open MPI, we typically define “fault tolerance” to mean the ability to recover from one or more component failures in a well-defined manner with either a transparent or an application-directed mechanism. Component failures may manifest as a corrupted transmission over a faulty network interface, or as the failure of one or more serial or parallel processes due to a processor or node failure. Open MPI strives to provide the application with a consistent system view while still providing a production-quality, high-performance implementation. Yes, that’s pretty much as all-inclusive as possible, and intentionally so! Remember that in addition to being a production-quality MPI implementation …
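The simplest scheme mentioned above, a user process periodically dumping vital state to disk and resuming from it after a failure, can be sketched in a few lines of Python. This is an application-level illustration only and is not how Open MPI implements fault tolerance; the function names, the JSON checkpoint format, and the `fail_at` parameter used to simulate a crash are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk before renaming
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `checkpoint_every` steps.

    `fail_at` (hypothetical, for demonstration) raises at that step to mimic a crash.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated process failure")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If a run is interrupted, calling `run` again with the same checkpoint path resumes from the last saved step instead of starting over, so only the work done since the most recent checkpoint is repeated. Writing to a temporary file and renaming it keeps a crash during the save from corrupting the previous checkpoint.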
