4/10/2015: network service degradation

Last updated: Thu Aug 22 13:03:33 GMT 2019

Incident Summary

On 4/10/2015 at 01:36 UTC, services serving app.thousandeyes.com and api.thousandeyes.com began dropping approximately 50% of responses to external connections targeting these services. The outage lasted 6 minutes, until 01:42 UTC, and affected all customers attempting to reach either target URL. Tests continued to run and alerts continued to be processed while our presentation tier was unavailable, and no user data was lost during this outage.

Problem Details

One of the load-balanced network uplinks handling traffic to our core services began experiencing problems, and approximately 50% of traffic stopped being returned to the requestor. Our operations team was notified via system alerts at 01:36 UTC, when the outage began, and commenced investigation immediately. The root cause of the outage was determined to be a bad network cable connecting the top-of-rack switches to the intermediate core switches. The nature of the failure required manual intervention to correct the problem.

Services were restored at 01:42 UTC, resulting in 6 minutes of degraded service. A snapshot detailing the effects of the outage can be found at https://wuivtuu.share.thousandeyes.com