How does Google respond to outages and integrate the “lessons learned” into its operations?
Urs Holzle: In general, teams follow a postmortem process when an outage occurs, and produce action items such as “monitor timeouts to X” or “document failover procedure and train on-call engineers”. Engineers from affected teams are also quite happy to ask for and supplement a post-mortem as needed. Human beings tend to be quite fallible, so if possible we like to write either a specific or a general automated monitoring rule to notice problems. This is true of both software/configuration problems and hardware/datacenter problems.