
How does Google respond to outages and integrate the “lessons learned” into its operations?

April 26, 2017 | Tags: google, integrate, operations, outages, respond


1 Answer


Urs Hölzle: In general, teams follow a postmortem process when an outage occurs and produce action items such as “monitor timeouts to X” or “document failover procedure and train on-call engineers”. Engineers from affected teams are also quite happy to ask for and supplement a postmortem as needed. Human beings tend to be quite fallible, so where possible we like to write either a specific or a general automated monitoring rule to notice problems. This is true of both software/configuration problems and hardware/datacenter problems.
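To make an action item like “monitor timeouts to X” concrete, here is a minimal sketch of what an automated monitoring rule could look like. It is not Google's tooling: the `TimeoutRateRule` class, its `window` and `threshold` parameters, and the sample data are all hypothetical, chosen only to illustrate the idea of replacing a human check with a rule that notices the problem on its own.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TimeoutRateRule:
    """Hypothetical rule: fire when too many recent requests to a backend time out."""
    backend: str
    window: int = 100        # how many of the most recent requests to consider
    threshold: float = 0.05  # fire when the timeout fraction exceeds this value
    _samples: deque = field(default_factory=deque, repr=False)

    def record(self, timed_out: bool) -> None:
        """Record one request outcome, keeping only the last `window` samples."""
        self._samples.append(timed_out)
        if len(self._samples) > self.window:
            self._samples.popleft()

    def timeout_rate(self) -> float:
        """Fraction of recorded requests that timed out (0.0 if none recorded)."""
        if not self._samples:
            return 0.0
        return sum(self._samples) / len(self._samples)

    def firing(self) -> bool:
        """True when the timeout rate is above the configured threshold."""
        return self.timeout_rate() > self.threshold


# Illustrative data: 3 of the last 10 requests to backend "X" timed out.
rule = TimeoutRateRule(backend="X", window=10, threshold=0.2)
for timed_out in [False] * 7 + [True] * 3:
    rule.record(timed_out)
print(rule.backend, rule.timeout_rate(), rule.firing())
```

A rule like this would normally feed an alerting system rather than print to the console; the sliding window simply keeps one old burst of timeouts from keeping the alert on indefinitely.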