Why is fault tolerance vital to the success of grid computing?
Comprehensive error handling and automated recovery are essential to robust, scalable systems. Grid computing pushes the limits of reliability especially hard, because it involves a large number of distributed components, each subject to unpredictable failures such as hardware breakdowns, operating system crashes, loss of communications, and abnormal exits from application software. Historically, it has been too difficult to program distributed systems that automatically recover from failures in individual components and keep working. For businesses, this meant that distributed computing on PCs was more an academic exercise than a commercial reality. Large-scale distributed computing is now practical, but only if the middleware is designed to handle a variety of inevitable failures. Without fault tolerance, systems can’t grow large or complex, which is exactly where grid computing is most needed and offers the greatest benefits.
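To make "recover from failures in individual components and keep working" concrete, here is a minimal Python sketch of one common recovery strategy: rerunning a failed task on another node. It is an illustration only, not any particular grid middleware's method. The names `NodeFailure`, `run_on_node`, and `run_with_failover`, and the `fail_nodes` parameter used to simulate broken machines, are all hypothetical.

```python
class NodeFailure(Exception):
    """Raised when a (simulated) grid node fails while running a task."""


def run_on_node(node, task, fail_nodes):
    # Hypothetical stand-in for dispatching work to a remote machine.
    # A real system would detect crashes, timeouts, or lost connections here.
    if node in fail_nodes:
        raise NodeFailure(f"{node} failed")
    return task()


def run_with_failover(task, nodes, fail_nodes=frozenset()):
    """Try each node in turn; return (node, result) from the first success."""
    errors = []
    for node in nodes:
        try:
            return node, run_on_node(node, task, fail_nodes)
        except NodeFailure as exc:
            # Record the failure and fall through to the next node.
            errors.append(str(exc))
    # Every node failed, so the failure is reported instead of hidden.
    raise RuntimeError("all nodes failed: " + "; ".join(errors))
```

For example, `run_with_failover(lambda: 6 * 7, ["a", "b", "c"], fail_nodes={"a", "b"})` skips the two failed nodes and returns `("c", 42)`. The caller never sees the individual failures unless every node fails, which is the property that lets a system keep growing as components are added.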