DynaFile Application Unavailability

Incident Report for DynaFile

Postmortem

Below is the Root Cause Analysis (RCA) of the network outage experienced on January 16, 2024, as provided by our data center.

Root Cause of Incident

The root cause of the problem was traced back to a specific customer peering arrangement, from which the malformed BGP packets originated. In the context of BGP, a message is considered malformed when any of its attributes are found to be incorrectly structured or invalid. As a consequence of receiving these malformed update messages, the affected routers and switches experienced disruptions in their route processes. This, in turn, caused a number of systems to either dump or stall, impacting the overall network stability.

Incident Resolution

In response to the identified source of the malformed BGP packets, we removed the suspected peers from our network. This step is crucial to prevent further propagation of problematic messages and to isolate the affected nodes. Following the removal of suspected peers, a controlled power cycle was initiated on the affected devices. This measure is intended to ensure a clean state and proper initialization of the devices, resolving any lingering effects of the disruption.

Corrective Actions

Following the incident involving the reception of malformed BGP packets and the resulting disruptions, we are taking proactive measures to enhance the resilience of our network. The corrective action identified is the implementation of BGP error tolerance across the entire network. To achieve this, we will schedule a change control window during which the BGP error tolerance mechanisms will be systematically deployed across relevant network components. This approach is designed to mitigate the impact of malformed BGP packets and enhance the overall stability of our network infrastructure.
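The RCA does not say which error tolerance mechanism the data center will deploy. As a rough illustration only, the sketch below assumes behavior along the lines of RFC 7606 (revised error handling for BGP UPDATE messages), where a malformed attribute causes the affected routes to be withdrawn or the attribute to be discarded, instead of disrupting the whole session. The names `Action`, `RULES`, and `classify_update` are hypothetical, and the length checks are simplified to a handful of well-known attributes.

```python
from enum import IntEnum


class Action(IntEnum):
    """Possible outcomes, ordered from least to most severe."""
    ACCEPT = 0
    ATTRIBUTE_DISCARD = 1   # drop only the bad attribute, keep the routes
    TREAT_AS_WITHDRAW = 2   # treat the routes in this UPDATE as withdrawn


# Simplified rules (an assumption for illustration, not a vendor config):
# attribute type code -> (name, valid value lengths, action if malformed)
RULES = {
    1: ("ORIGIN", {1}, Action.TREAT_AS_WITHDRAW),
    3: ("NEXT_HOP", {4}, Action.TREAT_AS_WITHDRAW),
    4: ("MULTI_EXIT_DISC", {4}, Action.TREAT_AS_WITHDRAW),
    6: ("ATOMIC_AGGREGATE", {0}, Action.ATTRIBUTE_DISCARD),
    7: ("AGGREGATOR", {6, 8}, Action.ATTRIBUTE_DISCARD),
}


def classify_update(attributes):
    """Return the most severe action required by any attribute.

    `attributes` is a list of (type_code, value_bytes) pairs taken from
    one UPDATE message. Only the value length is checked here; unknown
    type codes are ignored.
    """
    worst = Action.ACCEPT
    for type_code, value in attributes:
        rule = RULES.get(type_code)
        if rule is None:
            continue
        _name, valid_lengths, action = rule
        if len(value) not in valid_lengths:
            worst = max(worst, action)
    return worst


if __name__ == "__main__":
    well_formed = [(1, b"\x00"), (3, bytes(4))]
    bad_origin = [(1, b"\x00\x00"), (3, bytes(4))]
    print(classify_update(well_formed).name)
    print(classify_update(bad_origin).name)
```

In every branch of this sketch the BGP session itself stays up. Under the older handling in RFC 4271, a malformed UPDATE instead leads to a NOTIFICATION and a session reset, which is how one faulty peer can affect many routes at once.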

Posted Jan 23, 2024 - 16:36 MST

Resolved

This incident has been resolved.
Posted Jan 16, 2024 - 13:47 MST

Update

As of 1:40 PM MST, all systems are fully online and operational. We will continue to monitor for any abnormalities. We understand that this type of downtime is unacceptable to you and your team, and we will do everything in our power to prevent it in the future. You can expect a post-incident report from us within 3 business days that will contain a root cause analysis and future mitigation plans.
Posted Jan 16, 2024 - 13:44 MST

Update

We apologize for the delay in the last update. We have been unable to determine the underlying cause of the network connectivity issues at our primary datacenter and are beginning the failover process to our DR site. We will update this ticket at 2:00 PM MST with an ETA for when we expect services to be fully live in that location.
Posted Jan 16, 2024 - 13:26 MST

Update

While partial connectivity was available for about 30 minutes, it was not optimal, and the engineers are continuing to trace the root cause, which as of 11:45 AM has again caused a full outage. We are continuing to work with them on finding an optimal resolution while also preparing our DR site for failover if deemed necessary. We understand that the accessibility of DynaFile is of crucial importance to you and your team members, and we, along with our service providers, have all hands on deck to resolve this issue as soon as possible. Expect another update within the next 30 minutes.
Posted Jan 16, 2024 - 11:50 MST

Update

Partial network connectivity is restored, but we are still seeing higher than expected packet loss. While some access to DynaFile is possible, users may experience timeouts. We are continuing to work with our providers to eliminate this loss.
Posted Jan 16, 2024 - 11:21 MST

Update

Our service provider has informed us that they removed bouncing peers that could be contributing to the issue. They are also continuing to isolate the issue by power cycling one core router at a time. They have informed us that this should be resolvable in short order, so we are not implementing our DR failover plan yet. We are continuing to monitor the situation, and should the outage continue for an extended period, we will evaluate whether to fail over to our secondary datacenter.
Posted Jan 16, 2024 - 10:50 MST

Update

Network connectivity is down again as of 10:18 AM MST. We are continuing to work with our service providers to resolve the issue.
Posted Jan 16, 2024 - 10:22 MST

Update

We have been informed by our data center that there is a routing issue in the Denver market. IP cores were impacted, causing a reset of the routing process on some routers. Routing and connectivity have returned, but we are still seeing some brief interruptions. We are continuing to monitor and investigate this issue.
Posted Jan 16, 2024 - 09:58 MST

Monitoring

All systems appear to be working normally at this time. There was a period of about 6 minutes, from 9:04 AM MST until 9:10 AM MST, during which network connectivity was lost. We will continue to investigate the root cause, but for now, should you or your users experience any further issues, please contact your DynaFile account manager or email support@dynafile.com.
Posted Jan 16, 2024 - 09:17 MST

Investigating

We are aware that the DynaFile application is currently experiencing issues and some clients may be unable to access their DynaFile site. We are currently investigating the issue and will update this ticket within 30 minutes. Should you have an immediate need or concern, you may contact your DynaFile account manager or email support@dynafile.com; however, this ticket will contain the most up-to-date information on the issue.
Posted Jan 16, 2024 - 09:08 MST
This incident affected: DynaFile Application.