Rogers Communications has filed a 39-page reply to questions from the Canadian telecommunications regulator about the unprecedented outage of its internet and wireless network, again blaming a configuration change that deleted a routing filter, which caused its distribution routers to be overwhelmed.
However, large parts of the version publicly released late Friday by the Canadian Radio-television and Telecommunications Commission (CRTC) were blanked out by the commission, including the explanation of the root cause, for security or competitive reasons.
Also blanked out are the steps Rogers has taken to prevent a similar outage. “We have developed very specific measures, for the immediate term, short term and medium term, that will be implemented in the coming days, and weeks,” the document says. But the public version leaves out that list.
“Most importantly,” the submission adds, “Rogers is examining its ‘change, planning and implementation’ process to identify improvements to eliminate risk of further service interruptions. These include the following steps:” The list of steps is blanked out in the version released by the CRTC.
Since the July 8 outage, many experts have noted that in April 2021 the wireless side of the Rogers network was out for almost 22 hours, suggesting that the carrier may have serious issues with its infrastructure. In its submission, Rogers says the cause of that outage, a third party’s product update, was different from that of the July 8 incident. The submission includes a list of what Rogers has done since the 2021 crash to improve network resiliency. That list has been blanked out.
As a result, the public doesn’t know exactly why the code in the planned update to Rogers’ core IP network caused chaos: was it a simple coding syntax mistake, a failure to follow established devops standards, a failure to follow practices for testing code on an offline platform, or something else?
Rogers does say in the submission that updates to its core IP network are made “very carefully.”
According to the submission, the July 8 update went through a comprehensive planning process including scoping, budget approval, project approval, kickoff, design document, method of procedure, risk assessment, and testing, finally culminating in the engineering and implementation phases. “The update in question was the sixth phase of a seven-phase process that had begun weeks earlier. The first five phases had proceeded without incident,” Rogers emphasized. “We validated all aspects of this change.”
If so, it isn’t immediately clear why the carrier replaced its CTO last week.
These and other questions may be answered Monday when the House of Commons Industry Committee holds a hearing into the outage, starting at 11 a.m. Eastern. The hearing will be televised. Federal officials, including representatives of the CRTC, and Rogers will testify.
The document does add several interesting details to Rogers’ account of its response to the July 8 incident. The collapse and disconnection of some equipment was so bad that engineers lost access to the carrier’s virtual private network (VPN) system, hindering their ability to begin identifying the trouble and slowing down network restoration.
However, they were able to carry on with work through their cellphones thanks to a seven-year-old emergency preparedness plan. Under the Canadian Telecom Resiliency Working Group, a federal-telco committee that works on best practices, Bell, Rogers and Telus agreed in 2015 to allow certain employees to swap SIM cards on their devices in emergencies. An unspecified number of Rogers staff took advantage of the agreement to use competitors’ networks, which helped Rogers’ restoration efforts.
Rogers offered this account of what happened on July 8:
The implementation of the sixth phase of its maintenance update started at 2:27 a.m. Eastern. At 4:43 a.m. Eastern a specific coding change was introduced in its three Distribution Routers, which triggered the failure of the Rogers IP core network two minutes later.
“The configuration change deleted a routing filter and allowed for all possible routes to the Internet to pass through the routers. As a result, the routers immediately began propagating abnormally high volumes of routes throughout the core network. Certain network routing equipment became flooded, exceeded their capacity levels and were then unable to route traffic, causing the common core network to stop processing traffic. As a result, the Rogers network lost connectivity to the Internet for all incoming and outgoing traffic for both the wireless and wireline networks for our consumer and business customers.”
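To illustrate the mechanism Rogers describes, here is a minimal Python sketch of what a routing filter does and what changes when it is removed. It is a toy model, not a reconstruction of Rogers’ configuration, which is blanked out in the public document: the allowed prefixes, the function names `permitted` and `advertise`, and the size of the route table are all hypothetical.

```python
from ipaddress import ip_network

# Hypothetical policy: only these address blocks may be passed to the core.
ALLOWED = [ip_network("10.0.0.0/8"), ip_network("192.168.0.0/16")]

def permitted(prefix, allowed=ALLOWED):
    """Return True if the prefix falls inside one of the allowed blocks."""
    return any(prefix.subnet_of(block) for block in allowed)

def advertise(routes, route_filter=permitted):
    """Routes a distribution router would pass on to the core.

    route_filter=None models a deleted filter: every route passes.
    """
    if route_filter is None:
        return list(routes)
    return [r for r in routes if route_filter(r)]

# Toy stand-in for the Internet table: 1,000 external /24s plus 2 internal routes.
internet = [ip_network(f"100.{i // 256}.{i % 256}.0/24") for i in range(1000)]
internal = [ip_network("10.1.0.0/16"), ip_network("192.168.5.0/24")]
learned = internet + internal

with_filter = advertise(learned)
without_filter = advertise(learned, route_filter=None)
print(len(with_filter), len(without_filter))
```

In this toy model the filter passes only the two internal routes, while removing it passes every learned route. A real Internet routing table is far larger than the 1,000 entries used here, which is why an unfiltered feed can swamp downstream equipment.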
“Like many large telecommunications services providers (TSPs), Rogers uses a common core network, essentially one IP network infrastructure, that supports all wireless, wireline and enterprise services. The common core is the brain of the network that receives, processes, transmits and connects all Internet, voice, data and TV traffic for our customers.
“Again, similar to other TSPs around the world, Rogers uses a mixed vendor core network consisting of IP routing equipment from multiple tier one manufacturers. This is a common industry practice as different manufacturers have different strengths in routing equipment for Internet gateway, core and distribution routing. Specifically, the two IP routing vendors Rogers uses have their own design and approaches to managing routing traffic and to protect their equipment from being overwhelmed. In the Rogers network, one IP routing manufacturer uses a design that limits the number of routes that are presented by the Distribution Routers to the core routers. The other IP routing vendor relies on controls at its core routers. The impact of these differences in equipment design and protocols are at the heart of the outage that Rogers experienced.”
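The two protection approaches Rogers describes, a limit on the routes a sender presents versus a control applied at the core router, can be contrasted in a short Python sketch. Everything here is an assumption made for illustration: the class `CoreRouter`, the function `present`, and all the numbers are hypothetical, and the sketch does not claim to show which safeguard was or was not in place on July 8.

```python
class CoreRouter:
    """Toy core router with a fixed route-table capacity."""

    def __init__(self, capacity, max_accepted=None):
        self.capacity = capacity          # hypothetical table size
        self.max_accepted = max_accepted  # core-side control; None means no control
        self.routes = 0
        self.overwhelmed = False

    def receive(self, count):
        # Core-side control: reject an update that is too large.
        if self.max_accepted is not None and count > self.max_accepted:
            return False
        self.routes += count
        if self.routes > self.capacity:
            self.overwhelmed = True       # table exceeded, router can no longer route
        return True

def present(count, core, sender_limit=None):
    """Distribution router presents `count` routes to a core router.

    sender_limit models a design that caps what the sender may present.
    """
    if sender_limit is not None and count > sender_limit:
        return False                      # sender holds the update back
    return core.receive(count)

FLOOD = 5000  # hypothetical burst of routes after a filter is removed

core_a = CoreRouter(capacity=1000)
present(FLOOD, core_a, sender_limit=500)  # blocked at the sender

core_b = CoreRouter(capacity=1000, max_accepted=500)
present(FLOOD, core_b)                    # blocked at the core

core_c = CoreRouter(capacity=1000)
present(FLOOD, core_c)                    # no safeguard on either side

print(core_a.overwhelmed, core_b.overwhelmed, core_c.overwhelmed)
```

In the sketch, either safeguard alone keeps the core router’s table within capacity. Only the path with neither control is overwhelmed, which is the kind of gap that can open when equipment built on different assumptions is combined.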
The result was that Rogers’ network lost connectivity internally and to the Internet for all incoming and outgoing traffic, for both the wireless and wireline networks for consumer and business customers.
The submission lists the number of Rogers’ consumer, business, federal, provincial, territorial and municipal customers (some of whom may have redundant communications services). Those numbers have been blanked out in the public document.
Because wireless devices have become the dominant form of communication for the vast majority of Canadians, Rogers said its wireless network was the first focus of recovery efforts. Then it worked on the landline service and, finally, on restoring data services, particularly for critical care services and infrastructure.
In a letter to the CRTC accompanying the submission, Ted Woodhead, Rogers’ chief regulatory and government affairs officer, wrote that “the network outage experienced by Rogers was simply not acceptable. We failed in our commitment to be Canada’s most reliable network.”