Please see below the postmortem from our network regarding the issues experienced on Thursday 16th May. Please note that the report below comes directly from our network.
Loss of connectivity impacting a proportion of Horizon handsets/clients, portal access and associated data access services due to an issue with a DNS Server within the network.
The initial issue was caused by a corrupt DNS server that failed to resolve DNS queries. As a result, traffic routed through that server was dropped, including Horizon registration attempts, Portal queries and DSL DNS queries.
When this was identified, the interface to the DNS server was shut down and we monitored for an increase in successful registrations across all platforms. Local access devices store DNS data and normally refresh it every 60 minutes, which would have had a bearing on the overall recovery time.
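As a rough illustration of why recovery was gradual rather than immediate, the sketch below estimates the share of devices that would have refreshed their stored DNS data a given number of minutes after the faulty server was removed. It uses the 60-minute refresh interval described above and assumes, purely for illustration, that device refresh times are spread evenly across that interval. The function name and parameters are hypothetical and are not taken from the network's own tooling.

```python
def fraction_refreshed(minutes_since_fix: float,
                       refresh_interval_minutes: float = 60.0) -> float:
    """Estimate the share of devices that have refreshed DNS since the fix.

    Assumption (not from the report): each device's next refresh falls
    uniformly at random within one refresh interval.
    """
    if refresh_interval_minutes <= 0:
        raise ValueError("refresh interval must be positive")
    if minutes_since_fix <= 0:
        return 0.0
    # Under the uniform assumption the share grows linearly and is capped
    # at 100% once a full refresh interval has elapsed.
    return min(1.0, minutes_since_fix / refresh_interval_minutes)


if __name__ == "__main__":
    for minutes in (0, 15, 30, 60):
        share = fraction_refreshed(minutes)
        print(f"{minutes:>2} min after fix: ~{share:.0%} of devices refreshed")
```

Under these assumptions a device could take up to a full refresh interval after the fix before it picks up correct DNS data, so some delay in recovery after the server was shut down would be expected.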
The DNS servers form part of a group of servers distributed across the network, working in an active load-balanced configuration. The hierarchy includes DNS resolvers, which contain cached data for hostnames, and authoritative servers, which hold the actual up-to-date information for hostnames.
In this instance it was one of the eight authoritative servers that failed and so did not respond with the correct information, resulting in end users/devices failing to connect to the correct destination address. Requests are processed on a first-response basis from one of the eight servers, so only a percentage of device/end-user requests were impacted until the corrupted server was removed from operation at 12:44pm.
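To give a sense of scale, the sketch below works out the share of requests that might have been answered by the failed server. It assumes, purely for illustration, that requests are spread evenly across the eight authoritative servers; the report does not state the actual distribution, and the real figure would depend on the load balancing in use and on how devices retry. The function name is hypothetical.

```python
from fractions import Fraction


def affected_share(total_servers: int, failed_servers: int = 1) -> Fraction:
    """Share of requests answered by a failed server.

    Assumption (not from the report): every server is equally likely
    to be the one that answers a given request.
    """
    if total_servers <= 0:
        raise ValueError("total_servers must be positive")
    if not 0 <= failed_servers <= total_servers:
        raise ValueError("failed_servers must be between 0 and total_servers")
    return Fraction(failed_servers, total_servers)


if __name__ == "__main__":
    share = affected_share(total_servers=8, failed_servers=1)
    print(f"Estimated share of requests affected: {float(share):.1%}")
```

With one of eight servers failed, this even-spread assumption gives one request in eight, which is consistent with only a proportion of devices being affected rather than all of them.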
In addition to the above, we did receive reports of some devices, mainly soft clients and handsets, which were unable to register post-incident. This was due to the volume of registration attempts being processed by the platform servers; this backlog had cleared fully by 8pm. We are reviewing the server configuration to ensure we optimise the volume of registrations we process whilst ensuring the integrity of the platform.
In addition to the above actions, we are taking robust steps to review our change governance process relating to DNS-related infrastructure.