Beginning at 13:08 UTC on Friday, January 22, the ThousandEyes platform experienced an outage. This affected all communications with ThousandEyes, including access to the front end and agent connectivity back to our collector.
The issue was detected at 13:08 UTC by our internal monitoring. Our operations team engaged and determined that the problem originated in the load balancer cluster at the edge of our primary datacenter. Due to an internal fault in one node of the cluster, services could not be failed over to a standby node.
At 13:15 UTC, we dispatched an engineer to resolve the issue manually, and at 13:22 UTC we moved the site into maintenance mode. The engineer arrived at the datacenter at 13:25 UTC and initiated the recovery process, ultimately power cycling the load balancing infrastructure to restore service. Services were fully restored by 13:45 UTC, for a total outage of 37 minutes. The component responsible for the loss of availability has since been removed from production, and we are investigating the root cause with the manufacturer.
A share link covering the timeline of the outage is available here: https://lgucp.share.thousandeyes.com
We apologize for any inconvenience this outage may have caused our customers. Should you have any questions or comments, please contact our support team and we'll be happy to help.
Thank you for your understanding.
Dave Fraleigh | VP/Customer Success