Horizon Network Incident (SIP, Inbound, Horizon)
Incident Report for Windsor Telecom
Postmortem

Report

Please see below the postmortem from our network regarding the issues experienced on Thursday 11th July. Please note that the report below comes directly from our network.

Summary

Intermittent loss of connectivity through the Trafford data centre impacted multiple voice services, including access to the Inbound and Horizon Portal. The majority of services were restored by 11:40, but we continued to see some residual service impact into the afternoon.

Incident Summary

The incident followed a partial failure of a line card within one of the IP core routers hosted at the Trafford data centre.
The partial failure caused an IP traffic loop condition on the core router, which resulted in a very high processing load across the device. Traffic routed through this device would have experienced high latency, which would have impacted voice and portal services.

The traffic loop itself was identified and resolved at 11:40 following extensive analysis by both our IP and engineering teams. They cleared down the processing load and the majority of services stabilised, with the exception of links directly associated with the line card at fault.

The remaining network links associated with the faulty line card were then rerouted over alternate line cards within the core router; this work was largely completed by 14:30. At 19:00 an emergency change was carried out to replace the impacted line card. By 21:40 all hardware had been replaced, and all subsequent monitoring and testing was successful.

Pending a full root cause analysis from our vendors on the actual line card failure, we will be reviewing any design improvements we could implement for these exceptional circumstances, and how we can effectively plan and test for them from a business continuity perspective.

Posted Jul 17, 2019 - 09:14 BST

Resolved
We have had confirmation that all services are now stable and are behaving as expected. We would like to thank you for your patience. If you have any questions on this incident, please do not hesitate to contact your account manager directly or email cs@windsor-telecom.co.uk
Posted Jul 12, 2019 - 12:17 BST
Update
Further to the last update, the work to replace hardware was completed successfully as of 22:15, and all testing and monitoring overnight has been successful. Our network will continue to monitor as we move into the working day to ensure that all services remain stable. All new call recording is working as expected.
Posted Jul 12, 2019 - 09:14 BST
Update
Further to the last update regarding call recording, our network engineering teams are continuing to work towards a full resolution. They will be replacing IP hardware in Trafford this evening between 19:00 and midnight.
Posted Jul 11, 2019 - 17:21 BST
Update
Further to the last update, we are seeing the delay in Horizon registrations return to normal. If you are still having issues with your Horizon Clients, the advice is to re-establish the connection or restart your laptop/computer. We are aware of some issues with call recording on Horizon services; our network engineering teams are investigating the full impact and are working towards a full resolution.
Posted Jul 11, 2019 - 15:37 BST
Update
Further to the last update, we can confirm that call quality and call routing issues are resolved as of 11:40. There are still some residual issues with portal access and with delayed Horizon registrations. The engineering teams have identified the root cause and are working towards a full resolution.
Posted Jul 11, 2019 - 13:48 BST
Monitoring
Further to the last update, our network engineering team have now completed the restart of the core network device. The engineering teams are monitoring for improvements and are continuing to work on the root cause of this incident.
Posted Jul 11, 2019 - 12:15 BST
Identified
We are currently investigating alerts and customer reports of portal performance issues, and we have had a small number of reports of intermittent call failures and call quality issues. Our network engineering teams have made changes to call routes to reduce call failures. They are currently working towards a full resolution and are in the process of restarting a core network device within one of the data centres in Manchester as part of the recovery process. The next step is for our network to monitor the fault whilst working on a fix.
Posted Jul 11, 2019 - 11:07 BST
This incident affected: Horizon (Portal (Unlimited Horizon), Incoming Calls, Outgoing Calls, Call Recording).