(Resolved) 2016-09-29 07:57 PDT Agent collector problems

Last updated: Mon Feb 04 23:48:02 GMT 2019

RCA: Beginning on 2016-09-29 at 07:57 PDT, our agent collectors began experiencing problems and rejecting agent check-ins.  The issue was escalated immediately to our engineering operations team and to our developers.  We restored the agent collector infrastructure beginning at 09:20 PDT.  All results were committed over the course of the following hour, and the platform has been up to date since 10:15 PDT.  The root cause was a fault in the task definition side of the collector, which affected all components of the agent collector.  A permanent fix is already in progress: we are splitting the agent collector processes into separate channels so that agents can continue to commit results independently of task definition download.  If you have any questions, comments or concerns related to this issue, please contact us at support@thousandeyes.com.

2016-09-29 10:15 PDT: Agent result backlog has cleared.  All agents are now up to date, and results have been committed.

2016-09-29 10:10 PDT: Agent result backlog is still in the process of clearing.

2016-09-29 09:20 PDT: All agent collectors are back online.  Agents are now checking in; clearing the agent commit backlogs will take between 30 and 45 minutes.  As agents check in, results will be populated in the platform.

2016-09-29 08:17 PDT: We are experiencing an issue with agent collectors.  Requests made from agents to the agent collectors are failing.  Test results captured on agents will be delayed until agents are able to check in.  Our operations team is aware of the issue and is working to resolve it.  Users monitoring agent logs will see the following message:

yyyy-MM-dd hh:mm:ss.nnn ERROR [0a5b7849] [te] {} Error calling check_in: Connection timeout.
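
For users who want to check their own agent logs for the message above, the following is a minimal sketch, not an official tool.  It assumes log lines follow the layout of the single sample shown in this notice; the pattern, the field names (timestamp, tag, reason) and the function name find_check_in_errors are illustrative assumptions, and the actual agent log format may differ.

```python
import re

# Pattern modeled on the one sample line in this notice. The real agent log
# format may differ, so every field captured here is an assumption.
CHECK_IN_ERROR = re.compile(
    r"^(?P<timestamp>\S+ \S+) ERROR \[(?P<tag>[0-9a-f]+)\] \[te\] \{\} "
    r"Error calling check_in: (?P<reason>.+?)\.?\s*$"
)


def find_check_in_errors(lines):
    """Return (timestamp, reason) pairs for lines matching the sample layout."""
    hits = []
    for line in lines:
        match = CHECK_IN_ERROR.match(line)
        if match:
            hits.append((match.group("timestamp"), match.group("reason")))
    return hits
```

The function takes any iterable of strings, so it can be given an open log file directly; the location of the agent log is not stated in this notice and has to be supplied by the user.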

2016-09-29 07:57 PDT: Agent check-ins are failing.  We have begun diagnosis.