Is Microrebooting Useful in Clusters?
In a typical Internet cluster, the unit of recovery is a full node, which is small relative to the cluster as a whole. To learn whether microreboots can yield any benefit in such systems, we built a cluster of 8 independent application server nodes. Clusters of 2-4 J2EE servers are typical in enterprise settings, with high-end financial and telecom applications running on 10-24 nodes [15]; a few gigantic services, like eBay’s online auction service, run on pools of clusters totaling 2000 application servers [11]. We distribute incoming load among the nodes using a client-side load balancer, LB. Under failure-free operation, LB distributes new incoming login requests evenly among the nodes and, for established sessions, implements session affinity (i.e., non-login requests are directed to the node on which the session was originally established). We inject a microreboot-recoverable fault from Table 2 into one of the server instances, say Nbad; the failure detectors notice failures and rep