Network issues identified
Incident Report for Infinity
Postmortem

On April 17th 2017, between 15:12 and 15:33 UTC, Infinity experienced a network outage at our primary UK data centre.

The affected services were:

• Tracking system unavailable (15 minutes)

• Portals/APIs/Caller Insight/Real-time call dashboard unavailable (15 minutes)

• Hosted PBX unavailable – Some UK customers (2 minutes). Customers on our North America PBX servers were not affected.

• Call tracking calls – Approximately 25% of UK and EU inbound calls would have dropped at the start of the outage, but subsequent calls would have automatically re-routed to our other call servers. Calls handled by our North American and Asia-Pacific servers were not affected.

The issue was caused by a brief power outage in one of our racks. This caused the top-of-rack switch to reboot, which temporarily disconnected the servers in that rack from the network. Power and network connectivity were restored within 2 minutes.

At this point the Hosted PBX service was resumed.

The network drop also affected one firewall of our firewall pair, which was running in the affected rack (the other firewall is in a separate rack on a different switch and power distribution unit).

The firewalls monitor each other (over dual network links), and when one of them becomes unavailable the other takes over its services (Tracking System, Portal, APIs etc).
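
As a rough illustration of this kind of mutual monitoring, the sketch below treats the peer firewall as unavailable only when heartbeats have stopped on both links. It is not the software running on our firewalls; the class name, link names and the 3-second timeout are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PeerMonitor:
    """Tracks heartbeats received from the peer firewall over two links (hypothetical)."""
    timeout: float = 3.0  # assumed: seconds of silence before a link counts as down
    last_seen: dict = field(
        default_factory=lambda: {"link_a": 0.0, "link_b": 0.0}
    )

    def heartbeat(self, link: str, now: float) -> None:
        # Record the time of the most recent heartbeat seen on this link.
        self.last_seen[link] = now

    def peer_alive(self, now: float) -> bool:
        # The peer counts as alive while at least one link is still delivering heartbeats.
        return any(now - seen <= self.timeout for seen in self.last_seen.values())


def should_take_over(monitor: PeerMonitor, now: float) -> bool:
    # Take over the peer's services only when both links have gone silent.
    return not monitor.peer_alive(now)
```

Requiring silence on both links means that losing a single link does not by itself trigger a takeover.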

Unfortunately, in this particular outage scenario, the brief network drop caused the physical network interface to drop and then reconnect during the failover process. This caused the IP failover software running on the firewalls to enter a split-brain scenario in which both were trying to become active, which in turn caused the public service IPs to ‘flap’ from one firewall to the other. Our engineers were automatically alerted and manually disabled one of the firewalls, allowing the other to resume full service.
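
To make the ‘flap’ behaviour concrete, the following sketch shows one way an alerting check could tell a single clean failover apart from repeated ownership changes. It is illustrative only and does not describe our monitoring; the class name, the threshold of 3 changes and the 60-second window are assumptions.

```python
from collections import deque


class FlapDetector:
    """Flags a service IP whose owner changes too often within a time window (hypothetical)."""

    def __init__(self, max_changes: int = 3, window: float = 60.0):
        self.max_changes = max_changes  # assumed threshold
        self.window = window            # assumed window length in seconds
        self._changes = deque()         # timestamps of recent ownership changes
        self._owner = None

    def observe(self, owner: str, now: float) -> bool:
        """Record which firewall currently holds the IP; return True if it is flapping."""
        if self._owner is not None and owner != self._owner:
            self._changes.append(now)
        self._owner = owner
        # Forget ownership changes that have fallen outside the window.
        while self._changes and now - self._changes[0] > self.window:
            self._changes.popleft()
        return len(self._changes) >= self.max_changes
```

A single failover produces one ownership change and stays below the threshold, whereas a split-brain pair trading the IP back and forth crosses it quickly.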

Our analysis of the problem revealed that this is a known issue in the version of the software we use for managing service IPs. We have since completed a scheduled update to the software running on our firewalls to prevent this particular issue from occurring again.

We apologise for this incident.

Posted Apr 25, 2017 - 10:25 BST

Resolved
Our services have now returned to normal and we will continue to monitor the situation closely.
Posted Apr 17, 2017 - 17:13 BST
Investigating
We are experiencing network difficulties within our production data centre. We believe the situation has stabilised and we are monitoring it closely. We apologise for any inconvenience.
Posted Apr 17, 2017 - 16:42 BST
This incident affected: Global Core Services (Infinity API, Infinity Portal, Caller Insight Service, Core Tracking Service) and UK PBX (UK PBX London Host 02).