(Resolved) 2017-12-24 17:30 UTC: Degraded service

Last updated: Thu Dec 28 01:19:06 GMT 2017

12/24/2017 17:30 UTC: Access to the ThousandEyes platform was degraded due to a database issue. Logins to the application and the API were affected, as was the serving of Share Links, embedded Reports, and Dashboard widgets. Additionally, Enterprise Agents were shown as offline both in the application and locally on the Enterprise Agent. The issue was caused by performance degradation in a database cluster.

12/24/2017 21:00 UTC: The database cluster has been restored to normal performance levels. Data that could not be uploaded during the degraded period was stored locally on Cloud and Enterprise Agents and is now being uploaded.
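The store-and-forward behavior described above can be sketched roughly as follows. This is an illustrative model only, not ThousandEyes' actual agent code; the class and function names are hypothetical. The idea is that results are queued locally while uploads fail and the backlog drains once the platform recovers:

```python
from collections import deque

class ResultUploader:
    """Illustrative store-and-forward queue (hypothetical, not the real agent)."""

    def __init__(self, send):
        self.send = send          # callable that returns True on a successful upload
        self.backlog = deque()    # results held locally while the platform is degraded

    def submit(self, result):
        self.backlog.append(result)
        self.flush()

    def flush(self):
        # Drain the backlog in order, stopping at the first failed upload.
        while self.backlog and self.send(self.backlog[0]):
            self.backlog.popleft()

delivered = []
online = {"up": False}
uploader = ResultUploader(lambda r: online["up"] and (delivered.append(r) or True))

uploader.submit("m1")
uploader.submit("m2")              # platform degraded: both results stay queued locally
assert delivered == [] and len(uploader.backlog) == 2

online["up"] = True
uploader.flush()                   # service restored: the backlog drains in order
assert delivered == ["m1", "m2"]
```

Because the queue preserves ordering and nothing is dropped on failure, data from the degraded window arrives late rather than being lost, matching the behavior described in this update.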

Enterprise Agents should now have an Online status.

New test configuration is still pending. Any test created or modified during this period may not have received the updates yet. We expect this data to be available shortly.

12/25/2017 22:30 - 02:00 UTC: During the restoration of service, some Cloud and Enterprise Agents received a 'disable' command. Agents that received this command stopped running assigned tests. Tests assigned to the affected Agents will show no data for those Agents for some period within the time frame shown in this update. This data loss is permanent; Alerts and Reports will not reflect data from the affected Agents.


All affected Agents have been re-enabled. Data generation and collection have been verified to be working as expected. We are actively monitoring all affected systems to check for any remaining effects.

12/25/2017 02:00 UTC: The issue has been resolved.

One member of the database cluster failed to synchronize properly with healthy cluster members after a restart. When the restarting (joining) member attempted to synchronize with a running (donor) cluster member, a bug in the clustering software caused a full copy from the donor to the joiner instead of an incremental copy. Because the copy blocked operations on both the joiner and the donor until completion, and because the full copy took far longer than an incremental copy would have, the remaining active members of the cluster became overloaded while attempting to service requests.
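The decision between incremental and full synchronization can be sketched as follows. This is a simplified illustration, not the actual clustering software: the function name and parameters are hypothetical, and the bug is modeled by a `force_full` flag that bypasses the incremental path even when it would suffice:

```python
def choose_state_transfer(joiner_seqno, donor_cache_lo, donor_seqno, force_full=False):
    """Illustrative sketch: pick a state-transfer method for a rejoining member.

    An incremental copy is possible when the donor's write cache still covers
    the joiner's last-seen sequence number. force_full=True models the bug
    described above: a full (blocking, much slower) copy is performed anyway.
    """
    if force_full:
        return "full"
    if donor_cache_lo <= joiner_seqno <= donor_seqno:
        return "incremental"
    return "full"

# The joiner is only slightly behind, so an incremental copy should suffice:
assert choose_state_transfer(9_990, 9_000, 10_000) == "incremental"
# With the bug, the same state triggers a full copy, blocking donor and joiner:
assert choose_state_transfer(9_990, 9_000, 10_000, force_full=True) == "full"
```

The operational impact follows directly: while the full copy runs, both the donor and the joiner are unavailable, so the surviving members absorb all traffic and become overloaded.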

The affected cluster member was removed from service and normal service levels were restored. However, during the restoration process one of our data collectors did not receive a full list of active Agents. Any Agent not on this list was disabled by the collector at its next check-in. Once we determined that Agent data was not merely delayed but absent, we took the necessary steps to re-enable the disabled Agents.
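The failure mode above can be modeled in a few lines. This is a hypothetical sketch, not the collector's real logic or API: a collector that treats any Agent missing from its (truncated) active list as disabled will wrongly disable healthy Agents at their next check-in:

```python
def handle_checkin(agent_id, active_agents):
    """Illustrative sketch: the command a collector returns to a checking-in Agent.

    Any Agent absent from the active list is told to disable itself, so an
    incomplete list silently disables healthy Agents.
    """
    return "continue" if agent_id in active_agents else "disable"

full_list = {"agent-1", "agent-2", "agent-3"}
partial_list = {"agent-1"}          # the collector received a truncated list

assert handle_checkin("agent-2", full_list) == "continue"
# With the truncated list, a healthy Agent is wrongly disabled:
assert handle_checkin("agent-2", partial_list) == "disable"
```

Because a disabled Agent stops running tests entirely rather than buffering results, the gap it creates is permanent, which is why the data loss described in the 12/25 update could not be recovered.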

As a short-term solution, we have patched the clustering software. In the long term, we will scale up the infrastructure to handle a full database synchronization between multiple cluster members.