While I was away recently I got a call from AJ. It was the afternoon of Saturday 2nd August and I was at a festival. My phone had run out of charge, and she'd called several of the people she knew I was with before she reached me.
"I hate to be the bearer of bad news, but the server's down."
I talked her through the social media response while D tried to securely connect to the server via his phone to see if he could find out what was going on. The downtime lasted eight hours in the end. Bad timing - a lot of people want to watch porn on Saturday night.
The problem was not at our end - something went wrong for our hosting provider. I've heard some rather juicy rumours about exactly what caused the initial problem, and they never sent us a clear explanation even after things were back up and running, which rather inclines me to wonder whether the rumours are correct and they're too embarrassed to admit it. What they did say in statements was that "the connectivity break was on a third-party network 12km outside of London", and that "the failover connection didn't come up because somebody from the backup connectivity company had mistakenly removed a piece of physical hardware that was required for it to do so". Not our fault, guv.
The second unscheduled downtime was only ten days later. Again, it lasted eight hours, but luckily it fell on a Tuesday this time - a bit less intrusive than at the weekend. I was halfway through mustering some serious indignation that this would happen twice in as many weeks when D forwarded me this memo from our host:
This is an RFO (Reason For Outage) report for the outage that you experienced on Tuesday 12th August 2014.
Start of Incident: 2014-08-12 09:11:00 BST
Incident Resolution: 2014-08-12 17:25:00 BST
Length of incident: 8 hours 14 minutes
Impact to services: Packet loss to and from some internet locations, intermittent complete IP outages lasting up to 20 minutes
Reason for outage: The events of yesterday were caused by reported underlying upstream provider issues that impacted our client base. Of our upstream providers the following reported network incidents yesterday that affected a great number of Internet users across the globe
This wasn't a small scale problem. It wasn't just us - and it wasn't just our hosting provider, either. This was an internet meltdown on a sufficiently large scale that it was covered by The Register.
"Yesterday, 12 August, 2014, the internet hit an arbitrary limit of more than 512,000 routes. This 512K route limit is something we have known about for some time.
The fix for Cisco devices – and possibly others – is fairly straightforward. Internet service providers and businesses around the world chose not to address this issue in advance, as a result causing major outages around the world.
As part of the outage, punters experienced patchy – or even no – internet connectivity and lost access to all sorts of cloud-based services."
In other words, someone somewhere announced a batch of extra routes and tipped the global routing table over 512,000 entries - the default capacity of the fast lookup memory in a lot of older routers. Once past that limit, routers started dropping routes, so things randomly fell off everybody's networks. The outages affected different bits in each place, which was why at first nobody could figure out what was going on.
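If you want a feel for why the failures looked so random, here's a toy sketch of a router with a fixed-size fast routing table. Everything here is hypothetical - the class, the names, the drop-on-overflow behaviour - and real routers handled the overflow in different ways (some fell back to slow software lookups, some misbehaved outright), but the basic shape of the problem is this:

```python
# Toy model of a router whose fast routing table has a hard capacity,
# like the 512K default on some older hardware. Purely illustrative -
# not how any specific vendor's kit actually behaves.

class ToyRouter:
    def __init__(self, capacity):
        self.capacity = capacity
        self.table = set()    # routes we can actually forward to
        self.dropped = set()  # routes that arrived after the table filled

    def learn(self, prefix):
        """Install a route if there's room; otherwise it falls on the floor."""
        if len(self.table) < self.capacity:
            self.table.add(prefix)
        else:
            self.dropped.add(prefix)

    def can_reach(self, prefix):
        return prefix in self.table

# A tiny-capacity router for demonstration.
router = ToyRouter(capacity=3)
for prefix in ["10.0.0.0/8", "192.168.0.0/16",
               "172.16.0.0/12", "203.0.113.0/24"]:
    router.learn(prefix)

print(router.can_reach("10.0.0.0/8"))      # True - got in before the table filled
print(router.can_reach("203.0.113.0/24"))  # False - arrived too late, dropped
```

Which destinations become unreachable depends entirely on the order routes happened to arrive at each router - which is why the outage looked different from every vantage point.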
So we're sorry about the downtime. On both occasions, we extended all active memberships by 24 hours in compensation. But it's safe to say that there's absolutely nothing that we could have done about it. This was so far above our heads it was - literally - up in the cloud.
When I think about the sheer scale of the fuck-up I can't help seeing the funny side. There's no point taking it personally. This wasn't just our site, and it wasn't just our server - it was the whole damn internet.