4/24/2014: app.thousandeyes.com outage postmortem

Last updated: Mon Feb 04 23:45:52 GMT 2019

Incident Summary

On 4/24/2014 at 18:10 UTC, the services behind app.thousandeyes.com and api.thousandeyes.com stopped responding to user requests.  The outage lasted 26 minutes, until 18:36 UTC, and affected all customers attempting to reach either URL.  While our presentation tier was unavailable, tests continued to run and alerts continued to be processed.  No user data was lost during the outage.

Problem Details

The servers behind our application and API services stopped responding at 18:10 UTC.  System alerts notified our operations team at 18:10 UTC, when the outage began, and the team started investigating immediately.  The @ThousandEyesOps Twitter feed was updated as soon as the outage was confirmed.

The root cause of the outage was a hardware failure in a shared storage system used for logging on our presentation tier.  The nature of the failure prevented automatic failover to a standby storage node, so our operations team had to intervene manually to restore service.

Service has been restored, and the affected storage system has been pulled from rotation until the hardware fault is repaired.  We have also taken steps to reduce the effect of this type of failure on system availability.
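
As an illustration only, one general way to keep a logging backend failure from stalling request handling is to place a bounded in-memory queue between the application and the storage-backed log writer.  The sketch below is not a description of the steps our team actually took; the class name DroppingQueueHandler, the queue size, and the logger name are hypothetical and chosen for the example.  It uses only Python's standard logging and queue modules.

```python
import logging
import logging.handlers
import queue


class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Hypothetical handler: hand records to a bounded queue without blocking.

    If the queue is full (for example, because the writer that drains it is
    stuck on failed storage), the record is dropped and counted instead of
    making the calling thread wait.
    """

    def __init__(self, log_queue):
        super().__init__(log_queue)
        self.dropped = 0

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            self.dropped += 1


# Small queue size chosen only to make the behaviour visible (assumption).
log_queue = queue.Queue(maxsize=2)
handler = DroppingQueueHandler(log_queue)

logger = logging.getLogger("presentation-tier-example")  # hypothetical name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

# Nothing drains the queue here, which simulates a stalled log writer.
for i in range(5):
    logger.info("request %d served", i)

print("queued:", log_queue.qsize(), "dropped:", handler.dropped)
```

In a real deployment a background logging.handlers.QueueListener would drain the queue into the storage-backed handler.  The trade-off in this sketch is that log records can be lost while the backend is unavailable, in exchange for request threads never waiting on it.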

Scheduled tests continued to run and check in with our agent collectors during the outage.  No data was lost as a result of the outage.

Services were fully restored at 18:36 UTC, resulting in 26 minutes of elapsed downtime.  A snapshot detailing the effects of the outage can be found at https://metwt.share.thousandeyes.com.
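
The 26-minute downtime figure can be checked against the start and end times given in the Incident Summary (18:10 UTC and 18:36 UTC).  The short sketch below does that arithmetic; the function name elapsed_minutes is a hypothetical helper introduced for the example.

```python
from datetime import datetime, timezone


def elapsed_minutes(start: datetime, end: datetime) -> int:
    """Return the whole number of minutes between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Times taken from the Incident Summary, both in UTC.
outage_start = datetime(2014, 4, 24, 18, 10, tzinfo=timezone.utc)
outage_end = datetime(2014, 4, 24, 18, 36, tzinfo=timezone.utc)

downtime = elapsed_minutes(outage_start, outage_end)
print(downtime)  # 26
```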

Snapshot image showing app.thousandeyes.com outage

At this time we consider the outage to be resolved.  For questions, please contact our Customer Success organization by emailing support@thousandeyes.com.