Update timeline
- (Resolved) 2017-08-11 2340 UTC: We've identified the cause of the problem - Routeviews.org was publishing truncated RIB files for the rv/oreg collector, and as a result, prefixes missing from the RIB file were determined to have been withdrawn. We have implemented a series of software fixes which will prevent truncated RIB files from causing BGP reachability issues. In the event that our software detects a truncated RIB file, we will revert to the periodic updates published by routeviews.org in order to determine reachability, updates and path changes to prefix advertisements.
- (Reopened) 2017-08-11 1440 UTC: We have experienced two recurrences of the problem, both involving the same route aggregator. We are reopening this issue and working through resolution with the team at Route Views while we implement the process changes described below to mitigate the effect of delayed RIB availability.
- (Resolved) 2017-08-10 1845 UTC: Initial issue published.
Issue Description
The ThousandEyes platform issued several BGP alerts for prefix reachability, and customers may have noticed dips in reachability due to missing routing tables for certain BGP monitors. The issue occurred at the following times:
- 2017-08-09 0800-1000 UTC
- 2017-08-10 1400-1645 UTC
- 2017-08-11 1200-1600 UTC
Background
As described here, ThousandEyes syndicates data made available by the Route Views project to show prefix reachability from a number of global vantage points (BGP monitors). Data is extracted from full routing tables which are published every two hours, as well as updates published every 15 minutes. BGP view metrics for prefix reachability, updates and path changes are based on information extracted from these routing tables. When a Routing Information Base (RIB) file (either a full table or an update) is not available, reachability of monitored prefixes drops for each monitor contained in that RIB, since the absence of a routing table is interpreted to mean that the prefix is unreachable.
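To make the last point concrete, here is a minimal sketch of how a missing routing table lowers reachability. It is an illustration only, not the ThousandEyes implementation: the function name `prefix_reachability`, the dictionary layout, and the use of `None` to represent an unavailable table are all assumptions made for this example.

```python
from typing import Dict, Optional, Set


def prefix_reachability(prefix: str,
                        tables: Dict[str, Optional[Set[str]]]) -> float:
    """Return the percentage of monitors that have a route to `prefix`.

    `tables` maps a monitor name to the set of prefixes in its routing
    table, or to None when no table is available (hypothetical layout).
    """
    if not tables:
        return 0.0
    reachable = sum(
        1
        for routes in tables.values()
        # A missing table (None) is counted as "prefix unreachable".
        if routes is not None and prefix in routes
    )
    return 100.0 * reachable / len(tables)


# Example data: documentation prefixes, monitor names taken from this report.
tables = {
    "Seattle, WA": {"192.0.2.0/24"},
    "Sydney-1": {"192.0.2.0/24"},
    "Tokyo-4": None,                      # table not available
    "Chicago, IL": {"198.51.100.0/24"},   # table present, prefix absent
}
reachability = prefix_reachability("192.0.2.0/24", tables)
```

In this sketch a monitor whose table is missing is indistinguishable from a monitor that has genuinely lost its route, which is why an unavailable or incomplete RIB shows up as a reachability dip.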
What happened
The rv/oreg endpoint would periodically publish a truncated RIB file in the full routing table dump, which occurs every 2 hours. During processing, we would find no routes for certain prefixes in the routing table and would therefore show those prefixes as having been withdrawn. Alerts (if configured) would trigger based on the lack of reachability.
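The fix described in the update timeline (fall back to the periodic updates when a truncated RIB is detected) could look roughly like the sketch below. This is an assumption-laden illustration, not the actual fix: the report does not say how truncation is detected, so the size comparison against the last good table, the `MIN_SIZE_RATIO` threshold, and all function names here are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Hypothetical threshold: treat a full dump as truncated if it holds fewer
# than 90% of the prefixes seen in the last good table.
MIN_SIZE_RATIO = 0.9


def looks_truncated(new_rib: Set[str], last_good: Set[str],
                    min_ratio: float = MIN_SIZE_RATIO) -> bool:
    """Assumed heuristic: a sudden large drop in prefix count is suspicious."""
    if not last_good:
        return False  # nothing to compare against yet
    return len(new_rib) < min_ratio * len(last_good)


def apply_updates(table: Set[str],
                  updates: Iterable[Tuple[str, str]]) -> Set[str]:
    """Apply ("announce" | "withdraw", prefix) pairs to a copy of `table`."""
    result = set(table)
    for action, prefix in updates:
        if action == "announce":
            result.add(prefix)
        elif action == "withdraw":
            result.discard(prefix)
    return result


def next_table(new_rib: Set[str], last_good: Set[str],
               updates: Iterable[Tuple[str, str]]) -> Set[str]:
    """Use the new full dump unless it looks truncated; otherwise rebuild
    the table from the last good dump plus the periodic updates."""
    if looks_truncated(new_rib, last_good):
        return apply_updates(last_good, updates)
    return set(new_rib)
```

With this approach, prefixes that are merely missing from a truncated dump stay in the table, and only prefixes explicitly withdrawn in the update files are removed.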
The following BGP monitors have been affected by these issues during the course of the last 3 days:
Monitor name/location | Provider peer | Peer AS number |
---|---|---|
Seattle, WA | Level 3 | 3356 |
New York, NY-1 | AT&T | 7018 |
Stockton, CA | Sprint | 1239 |
Sydney-1 | Telstra | 1221 |
Chicago, IL | AOL | 1668 |
Amsterdam-2 | KPN | 286 |
Calgary, Canada | Telus | 852 |
Palo Alto, CA-2 | Level 3 | 3549 |
Los Angeles, CA | Cenic | 2152 |
Vancouver, Canada | Bell Canada | 6539 |
Frankfurt-2 | GTT | 3257 |
St. Petersburg-1 | Obit | 8492 |
London-9 | Global Crossing | 3549 |
Tokyo-4 | IIJ | 2497 |
San Francisco, CA | Savvis | 3561 |