Horizon outage
Incident Report for Windsor Telecom
Postmortem

Report

Please see below the postmortem from our network regarding the issues experienced on Thursday 16th May. Please note that the report below comes directly from our network.

Summary

Loss of connectivity impacting a proportion of Horizon handsets/clients, portal access and associated data access services due to an issue with a DNS Server within the network.

Details

The initial issue was caused by a corrupt DNS server which then failed to resolve DNS queries. As a result, traffic routed through the server was dropped, including Horizon registration attempts, Portal queries and DSL DNS queries.

When this was identified, the interface to the DNS server was shut down and we monitored for an increase in successful registrations across all platforms. Local access devices cache DNS records and normally refresh them every 60 minutes, which would have had a bearing on the overall recovery time.
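To illustrate why a 60-minute refresh cycle stretches recovery beyond the moment the server was isolated, here is a minimal sketch. It assumes each device refreshes on a fixed 60-minute schedule from its last refresh; the function name and the sample refresh times are hypothetical and are not taken from the network's report.

```python
from datetime import datetime, timedelta

def next_refresh_after(last_refresh, fix_time, interval=timedelta(minutes=60)):
    """Return the first scheduled refresh at or after fix_time.

    Assumption: the device refreshes its DNS data every `interval`,
    counted from `last_refresh` (a simplification for illustration).
    """
    if last_refresh >= fix_time:
        # The device already refreshed after the fix was applied.
        return last_refresh
    elapsed = fix_time - last_refresh
    # Ceiling division on timedeltas: number of whole intervals needed.
    periods = -(-elapsed // interval)
    return last_refresh + periods * interval

# The faulty server was removed from operation at 12:44 on 16 May 2019.
fix = datetime(2019, 5, 16, 12, 44)

# Hypothetical device that last refreshed at 12:30.
print(next_refresh_after(datetime(2019, 5, 16, 12, 30), fix))

# Hypothetical worst case: the device refreshed one minute before the fix.
print(next_refresh_after(datetime(2019, 5, 16, 12, 43), fix))
```

Under these assumptions, a device that picked up bad data just before the fix could keep using it for almost a full hour, which is consistent with a gradual rather than immediate recovery.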

The DNS servers form part of a group of distributed servers that work in an active load-balanced configuration across the network. The hierarchy includes DNS resolvers, which contain cached data for hostnames, and authoritative servers, which hold the actual up-to-date information for hostnames.

In this instance it was one of the eight authoritative servers which failed and thus did not respond with the correct information, resulting in end users/devices failing to connect to the correct destination address. Requests are answered on a first-response basis by one of the eight servers, so only a percentage of device/end-user requests were impacted until the corrupted server was removed from operation at 12:44pm.
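As a rough illustration of why only a share of requests was affected, the sketch below simulates the first response coming from one of eight servers, one of which is faulty. It assumes every server is equally likely to answer first, which the report does not state; real response times and load balancing would shift the proportion. The function name and parameters are hypothetical.

```python
import random

def simulate_failed_fraction(total_servers=8, faulty_servers=1,
                             queries=100_000, seed=0):
    """Estimate the share of queries whose first response is from a faulty server.

    Assumption: each server is equally likely to respond first.
    """
    rng = random.Random(seed)
    failed = 0
    for _ in range(queries):
        first_responder = rng.randrange(total_servers)
        # Servers numbered below `faulty_servers` are treated as faulty.
        if first_responder < faulty_servers:
            failed += 1
    return failed / queries

# Under the uniform assumption the expected share is 1/8 = 12.5%.
print(f"simulated share of failed queries: {simulate_failed_fraction():.3f}")
```

Because devices also cache answers, the share of affected devices at any moment need not match the share of affected queries.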

In addition to the above, we did receive reports of some devices, mainly soft clients and handsets, which were unable to register post-incident. This was due to the volume of registration attempts being processed by the platform servers, and this backlog cleared fully by 8pm. We are reviewing the server configuration to ensure we optimise the volume of registrations we process whilst ensuring the integrity of the platform.
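The time a registration backlog takes to clear depends on how far processing capacity exceeds the rate of new attempts. The sketch below shows that arithmetic only; the report gives no volumes, so every figure and name here is hypothetical.

```python
def backlog_clear_seconds(backlog, capacity_per_s, arrivals_per_s):
    """Seconds needed to drain a queue of pending registrations.

    Assumption: constant processing capacity and a constant rate of new
    attempts, both in registrations per second (hypothetical inputs).
    """
    if capacity_per_s <= arrivals_per_s:
        # The queue never shrinks if attempts arrive as fast as they are served.
        return float("inf")
    return backlog / (capacity_per_s - arrivals_per_s)

# Hypothetical example: 90,000 queued attempts, 10/s capacity, 5/s new attempts.
hours = backlog_clear_seconds(90_000, 10, 5) / 3600
print(f"backlog clears in about {hours:.1f} hours")
```

Raising the permitted registration rate shortens the drain time, but only as far as the platform can safely sustain, which is the trade-off the configuration review describes.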

In addition to the above actions, we are taking robust steps to review our change governance process relating to DNS-related infrastructure.

Posted May 20, 2019 - 16:11 BST

Resolved
Further to the previous updates, all services are now behaving as expected. Our engineers are continuing to monitor the service closely but no further issues are expected. This notification will be left open until 4 pm and then closed.
Posted May 17, 2019 - 11:53 BST
Update
We continue to monitor the network servers and registration attempts and services remain stable. If you are still experiencing any issues with handsets registering, please reboot your CPE and contact the service desk if the issue is not resolved.
At the moment we are still aware of some residual errors when logging into the Horizon GUI which our engineers are working with priority to resolve.
Posted May 17, 2019 - 10:11 BST
Update
We continue to monitor the network servers and registration attempts and services remain stable. If you are still experiencing any issues with handsets registering, please reboot your CPE and contact the service desk if the issue is not resolved. At the moment we are still aware of some residual errors when logging into the Horizon GUI which our engineers are working with priority to resolve. A further update will be provided by 11am.
Posted May 17, 2019 - 10:09 BST
Update
Further to the last update, we have monitored the network servers and registration attempts and these are now at expected levels with no errors. Our Engineering team will continue monitoring throughout the night and a further update will be provided at 10am tomorrow.
Posted May 16, 2019 - 20:56 BST
Update
Further to the last update, we are continuing to see gradual restoration of services and we are continuing to monitor the service. A further update will be published soon.
Posted May 16, 2019 - 19:04 BST
Update
Further to the last update, services have started to recover since 15:40. Customers should start to see registration recover gradually over the afternoon. We are monitoring the restoration and a further update will be provided.
Posted May 16, 2019 - 16:20 BST
Update
As a result of the earlier recovery process, we have started to receive reports of some further handset registration issues, which result in the handsets displaying Error 500. In addition, some users may experience issues when logging into or utilising the Horizon soft client. Our network engineering team are currently in discussion with the vendor to establish the quickest ways to fully resolve the ongoing issues.
Posted May 16, 2019 - 15:38 BST
Monitoring
Further to earlier updates, we have seen usual levels of Horizon traffic across the network as of 13:30 and customer reports indicate that services have been restored. Monitoring of service will continue throughout the day. Some users may continue to experience issues accessing the Horizon Portal; however, we expect full portal recovery within the next 30 minutes.
Posted May 16, 2019 - 14:19 BST
Update
We have had some Horizon users regain service following a broadband router reboot (power cycle) and a handset reboot, and we are advising all affected customers to reboot their router and Horizon handset.
Posted May 16, 2019 - 13:21 BST
Update
We are currently experiencing an issue. Customers are reporting that they are unable to access the Gamma Portal and the Horizon GUI, and we have some Horizon users with no service. Our engineers are engaged and working to resolve the issue. We have had some Horizon users regain service and we are advising customers to reboot their handsets. We will update you by 14:00 or sooner should information become available.
Posted May 16, 2019 - 13:08 BST
Identified
The Horizon network is experiencing some connection issues. These include handset registration and login access. Access to the portal is also currently unavailable. We will be updating every 30 minutes.
Posted May 16, 2019 - 12:45 BST
This incident affected: Horizon (Portal (Unlimited Horizon), Incoming Calls, Outgoing Calls, Collaborate, Connect, SIP).