Service Outage Remedies and Post Mortem
Posted: May 1, 2013
We’re planning a short downtime tonight (4/30/2013 at 10pm PT) and a longer maintenance window over the weekend (5/4 from 5-8pm PT) to address some of the root causes of our recent downtime.
We’ll update this blog post with a much more detailed post-mortem, but here is the 6-word summary: an AWS instance went to lunch. That left a process in our monitoring system stuck waiting on a kernel lock that it couldn’t time out of.
During the longer maintenance window over the weekend, we’ll upgrade our monitoring system and roll out a much-needed upgrade to our app-server architecture so that we can handle these failures much more gracefully.