Below is the Root Cause Analysis (RCA) of the network outage experienced on January 16, 2024, as provided by our data center.
Root Cause of Incident
The root cause was traced to a specific customer peering arrangement, from which malformed BGP packets originated. In BGP, a message is considered malformed when any of its attributes is incorrectly structured or invalid. On receiving these malformed update messages, the affected routers and switches experienced disruptions in their routing processes. This, in turn, caused a number of systems to either dump or stall, impacting overall network stability.
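To illustrate what "incorrectly structured or invalid" can mean in practice, here is a minimal Python sketch that walks the path-attributes block of a BGP UPDATE message and reports structural problems. It follows the attribute layout in RFC 4271 (flags octet, type code octet, one- or two-octet length, then the value). The function name `find_malformed_attributes` and the small set of checks are assumptions made for illustration; this is not the validation logic of the affected devices, and the actual malformation seen in this incident is not stated in the RCA.

```python
from typing import List

EXTENDED_LENGTH = 0x10  # RFC 4271 flag bit: the length field is 2 octets

# Fixed value lengths for a few well-known attributes (type code -> octets):
# 1 = ORIGIN, 3 = NEXT_HOP, 4 = MULTI_EXIT_DISC, 5 = LOCAL_PREF
FIXED_LENGTHS = {1: 1, 3: 4, 4: 4, 5: 4}


def find_malformed_attributes(data: bytes) -> List[str]:
    """Return a list of problems found in a raw path-attributes block.

    Hypothetical helper for illustration only; it checks a small subset
    of the rules a real BGP implementation enforces.
    """
    problems: List[str] = []
    offset = 0
    while offset < len(data):
        remaining = len(data) - offset
        flags = data[offset]
        header = 4 if flags & EXTENDED_LENGTH else 3
        if remaining < header:
            problems.append(f"truncated attribute header at offset {offset}")
            break
        type_code = data[offset + 1]
        length = int.from_bytes(data[offset + 2:offset + header], "big")
        value = data[offset + header:offset + header + length]
        if len(value) < length:
            # The declared length runs past the end of the block.
            problems.append(f"attribute {type_code} overruns the message")
            break
        expected = FIXED_LENGTHS.get(type_code)
        if expected is not None and length != expected:
            problems.append(
                f"attribute {type_code} has length {length}, expected {expected}"
            )
        elif type_code == 1 and value[0] > 2:
            # ORIGIN must be 0 (IGP), 1 (EGP), or 2 (INCOMPLETE).
            problems.append(f"ORIGIN has invalid value {value[0]}")
        offset += header + length
    return problems


# A well-formed ORIGIN attribute followed by one whose length overruns the data.
print(find_malformed_attributes(bytes([0x40, 1, 1, 0, 0x40, 2, 10, 0])))
```

A single bad length field like the one in the usage line is enough to make the rest of the message unparseable, which is why implementations that handle such errors strictly tend to fail in a disruptive way.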
Incident Resolution
Once the source of the malformed BGP packets was identified, we removed the suspected peers from our network. This step was necessary to prevent further propagation of the problematic messages and to isolate the affected nodes. After the suspected peers were removed, a controlled power cycle was performed on the affected devices to ensure a clean state and proper initialization, clearing any lingering effects of the disruption.
Corrective Actions
Following this incident, we are taking measures to improve the resilience of our network. The corrective action identified is to implement BGP error tolerance across the entire network. We will schedule a change control window during which the BGP error tolerance mechanisms will be deployed systematically across the relevant network components. This change is designed to mitigate the impact of malformed BGP packets and improve the overall stability of our network infrastructure.
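As a rough illustration of why error tolerance helps, the Python sketch below contrasts two ways a router might react to a malformed update from a peer. It assumes the mechanism resembles the "treat-as-withdraw" approach described in RFC 7606, where only the routes in the bad update are dropped and the session stays up. The RCA does not say which mechanism or vendor feature will be deployed, and the `Update` and `PeerSession` classes are hypothetical names used only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Update:
    """Simplified stand-in for a BGP UPDATE message (illustrative only)."""
    prefixes: List[str]
    malformed: bool = False


@dataclass
class PeerSession:
    """Toy model of one BGP session and the routes learned over it."""
    tolerant: bool
    established: bool = True
    routes: Dict[str, str] = field(default_factory=dict)  # prefix -> state

    def receive(self, update: Update) -> None:
        if not self.established:
            return  # a torn-down session accepts nothing further
        if not update.malformed:
            for prefix in update.prefixes:
                self.routes[prefix] = "active"
        elif self.tolerant:
            # Treat-as-withdraw: drop only the prefixes carried by the
            # malformed update and keep the session running.
            for prefix in update.prefixes:
                self.routes.pop(prefix, None)
        else:
            # Strict handling: reset the session, losing every route
            # learned from this peer.
            self.established = False
            self.routes.clear()


good = Update(prefixes=["192.0.2.0/24", "198.51.100.0/24"])
bad = Update(prefixes=["198.51.100.0/24"], malformed=True)

strict = PeerSession(tolerant=False)
tolerant = PeerSession(tolerant=True)
for session in (strict, tolerant):
    session.receive(good)
    session.receive(bad)

print(strict.established, sorted(strict.routes))
print(tolerant.established, sorted(tolerant.routes))
```

In this toy model the strict session loses all of its routes after one bad message, while the tolerant session keeps the unaffected prefix and stays established. Real devices apply more nuanced rules per attribute, so this should be read as a conceptual sketch rather than a description of the planned configuration.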