(Resolved) 2017-08-09 through 2017-08-11: BGP reachability alerts

Last updated: Sat Aug 12 00:19:16 GMT 2017

Update timeline

  • (Resolved) 2017-08-11 2340 UTC: We've identified the cause of the problem: routeviews.org was publishing truncated RIB files for the rv/oreg collector, and as a result, prefixes missing from the RIB file were determined to have been withdrawn.  We have implemented a series of software fixes which will prevent truncated RIB files from causing BGP reachability issues.  If our software detects a truncated RIB file, we will revert to the periodic updates published by routeviews.org to determine reachability, updates and path changes for prefix advertisements.
  • (Reopened) 2017-08-11 1440 UTC: We have experienced two recurrences of the problem, based on the same route aggregator.  At this time we are reopening this issue and are working through resolution with the team at Route Views while we work to implement changes in our processes as described below to mitigate the effect of delayed RIB availability.
  • (Resolved) 2017-08-10 1845 UTC: Initial issue published.
 

Issue Description

The ThousandEyes platform issued several BGP reachability alerts, and customers may have noticed dips in prefix reachability caused by missing routing table data for certain BGP monitors. The issue occurred during the following periods:

  • 2017-08-09 0800-1000 UTC 
  • 2017-08-10 1400-1645 UTC
  • 2017-08-11 1200-1600 UTC

Background

As described here, ThousandEyes syndicates data made available by the Route Views project to show prefix reachability from a number of global vantage points (BGP monitors). Data is extracted from full routing tables which are published every two hours, as well as updates published every 15 minutes. BGP view metrics for prefix reachability, updates and path changes are based on information extracted from these routing tables. When a Routing Information Base (RIB) file (either a full table or an update) is not available, reachability of monitored prefixes drops for each monitor contained in that RIB, since the absence of a routing table is interpreted to mean that the prefix is unreachable.

What happened

The rv/oreg endpoint would periodically publish a truncated RIB file as part of the full routing table dump, which occurs every 2 hours.  During processing, we found no routes for certain prefixes in the truncated table, and therefore showed those prefixes as having been withdrawn.  Alerts (if configured) would trigger based on the resulting lack of reachability.
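
As a rough illustration of this failure mode (not our production code), the sketch below compares two routing table snapshots and treats any prefix missing from the newer one as withdrawn. The function name, the dictionary layout and the example prefixes and AS numbers are all hypothetical, chosen from documentation ranges. When the newer snapshot is truncated, prefixes that fall after the cut-off are reported as withdrawn even though they are still advertised.

```python
def withdrawn_prefixes(previous_rib, current_rib):
    """Return prefixes present in the previous snapshot but absent from the current one.

    Both arguments are hypothetical {prefix: as_path} mappings for one monitor.
    """
    return set(previous_rib) - set(current_rib)


# Hypothetical full snapshot seen by a single monitor.
previous_rib = {
    "192.0.2.0/24": [64500, 64510],
    "198.51.100.0/24": [64500, 64511],
    "203.0.113.0/24": [64500, 64512],
}

# A truncated dump: the file ends early, so the last prefix is simply missing.
truncated_rib = {
    "192.0.2.0/24": [64500, 64510],
    "198.51.100.0/24": [64500, 64511],
}

# The missing prefix is indistinguishable from a real withdrawal.
spurious = withdrawn_prefixes(previous_rib, truncated_rib)
```

Here `spurious` contains 203.0.113.0/24, which is why a truncated file alone was enough to trigger reachability alerts.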

The following BGP monitors have been affected by these issues during the course of the last 3 days:

Monitor name/location   Provider peer     Peer AS number
Seattle, WA             Level 3           3356
New York, NY-1          AT&T              7018
Stockton, CA            Sprint            1239
Sydney-1                Telstra           1221
Chicago, IL             AOL               1668
Amsterdam-2             KPN               286
Calgary, Canada         Telus             852
Palo Alto, CA-2         Level 3           3549
Los Angeles, CA         Cenic             2152
Vancouver, Canada       Bell Canada       6539
Frankfurt-2             GTT               3257
St. Petersburg-1        Obit              8492
London-9                Global Crossing   3549
Tokyo-4                 IIJ               2497
San Francisco, CA       Savvis            3561

Resolution

We've pushed a software update which will detect truncation in RIB files, and revert to updates (published every 15 minutes) in the event that a truncated file is detected.  At this time, we expect no changes from normal behavior, and expect no adverse timing impact from the perspective of data collection as a result of these changes.
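
The fallback behavior can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the deployed implementation: the truncation check used here (a sharp drop in prefix count relative to the previous table, controlled by a hypothetical `min_ratio` threshold) is an assumed heuristic, and the function names and the update record layout are invented for the example.

```python
def looks_truncated(previous_count, new_count, min_ratio=0.9):
    """Assumed heuristic: flag a dump whose prefix count falls well below the last table."""
    if previous_count == 0:
        return False  # nothing to compare against yet
    return new_count < previous_count * min_ratio


def apply_updates(table, updates):
    """Replay update records (hypothetical layout) on top of a copy of the table."""
    result = dict(table)
    for update in updates:
        for prefix in update.get("withdraw", []):
            result.pop(prefix, None)
        for prefix, as_path in update.get("announce", {}).items():
            result[prefix] = as_path
    return result


def next_table(previous_table, full_dump, updates, min_ratio=0.9):
    """Pick the source for the next routing table view.

    Returns (table, source), where source is "full" when the new dump is trusted
    and "updates" when the dump looks truncated and updates are replayed instead.
    """
    if looks_truncated(len(previous_table), len(full_dump), min_ratio):
        return apply_updates(previous_table, updates), "updates"
    return dict(full_dump), "full"
```

With a previous table of ten prefixes and a new dump containing only three, `next_table` would ignore the dump and replay the 15-minute updates on top of the previous table, so only prefixes explicitly withdrawn in an update are shown as withdrawn.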