We interrupt our regular news bulletin about our FLOSS-centric security-focused laptops and phones to bring you this special announcement about a recent temporary outage of our primary domain name.
Now this is a story all about how
our domain got flipped-turned upside down,
and I’d like to take a minute,
just sit right there,
I’ll tell you how we restored the glue records of wp.puri.sm.
Like with all major outages, our story begins in the middle of the night (well, the middle of the night in my timezone). Our monitoring servers alerted our sysadmins that our website was down, and when they started investigating, they discovered that the outage extended to all of our servers because DNS was no longer resolving. We host our own DNS, and it—along with the website and all of our other servers—was up and running fine. The problem was this:
$ whois wp.puri.sm
Domain Name: wp.puri.sm
Registration date: 05/05/2014
Even though our domain wasn’t up for renewal until May, it was marked as “suspended” for some reason unknown to us. This is significant, because when a domain gets suspended, the registry takes it offline by removing its DNS “glue” records at the TLD DNS servers, so instead of records for ns1.wp.puri.sm and friends we had:
$ dig +trace wp.puri.sm
; <<>> DiG 9.11.2-P1-1-Debian <<>> +trace wp.puri.sm
. . .
sm. 3600 IN SOA dns.omniway.sm. hostmaster.telecomitalia.sm. 2018022036 43200 3600 2419200 3600
;; Received 109 bytes from 18.104.22.168#53(dns.omniway.sm) in 187 ms
I truncated the above output, but you can run the same dig command on a console to see what the output should look like. Without our DNS glue records, even though all of our servers were up and running, DNS queries for wp.puri.sm would stop at the .sm name servers and never move on to our name servers. We were dead in the water.
To understand why we were offline even though our servers were up, it helps to understand a bit about how DNS works:
This sounds like a lot of steps, but in practice it’s much faster because your recursive name server (and usually your OS as well) will cache the responses it gets back based on the Time To Live (TTL) value each response includes. Generally the TTL for TLDs is relatively long (172800 seconds or 48 hours in the case of .sm) so recursive name servers don’t have to ask the root name servers for their records too often.
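To make the caching idea concrete, here is a minimal sketch of a TTL-based cache in Python. It is not how any particular resolver implements caching; the `TtlCache` class and its injectable `clock` are hypothetical names made up for illustration, and the only values taken from this post are the 172800-second TTL and the `dns.omniway.sm.` name server seen in the dig output.

```python
import time


class TtlCache:
    """Toy resolver cache: keeps each answer until its TTL runs out."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the example can fake time
        self._entries = {}       # name -> (answer, expiry timestamp)

    def put(self, name, answer, ttl_seconds):
        """Store an answer together with the moment it stops being valid."""
        self._entries[name] = (answer, self._clock() + ttl_seconds)

    def get(self, name):
        """Return the cached answer, or None if it is missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        answer, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]   # expired: the resolver must ask again
            return None
        return answer


SM_TTL = 172800  # seconds, the .sm TTL quoted above (48 hours)

if __name__ == "__main__":
    now = [0.0]                                  # fake clock, in seconds
    cache = TtlCache(clock=lambda: now[0])
    cache.put("sm.", ["dns.omniway.sm."], SM_TTL)

    now[0] = SM_TTL - 1
    print(cache.get("sm."))   # one second before expiry: still cached

    now[0] = SM_TTL
    print(cache.get("sm."))   # TTL elapsed: the entry is gone
```

The flip side of a long TTL is that a cached answer keeps being served until it expires, so changes at the TLD level are not seen by every resolver at the same moment.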
When you register a new domain, part of the registration process is to tell your registrar what DNS servers you wish to use (these days they tend to point to their own name servers by default, both to save you hassle and to upsell you on their DNS and hosting services). That registrar then communicates with the appropriate TLD registry over a secure channel and the registry adds the DNS glue records for that TLD. Any time you want to change those glue records, you need to go back to your registrar as they are the conduit to the TLD registry.
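The role those registry-side records play can be sketched with a toy model. This is a deliberate simplification, not real DNS data or a real resolver: `build_zones` and `walk` are hypothetical helpers, each zone is just a dictionary of the child zones it delegates, and the name server names are borrowed from this post for illustration only.

```python
def build_zones(suspended=False):
    """Return toy delegation data: zone -> {child zone: name servers}."""
    zones = {
        ".": {"sm.": ["dns.omniway.sm."]},   # the root delegates .sm
        "sm.": {},                           # filled in below unless suspended
        "wp.puri.sm.": {},                   # our own zone delegates nothing here
    }
    if not suspended:
        # The delegation the registry publishes on behalf of the registrar.
        zones["sm."]["wp.puri.sm."] = ["ns1.wp.puri.sm."]
    return zones


def walk(name, zones):
    """Follow referrals from the root and return the zones visited, in order."""
    visited = ["."]
    current = "."
    while True:
        # Child zones of the current zone that contain the name we want.
        candidates = [
            child for child in zones.get(current, {})
            if name == child or name.endswith("." + child)
        ]
        if not candidates:
            return visited                   # no referral: the walk stops here
        current = max(candidates, key=len)   # most specific delegation wins
        visited.append(current)


if __name__ == "__main__":
    print(walk("wp.puri.sm.", build_zones()))                 # reaches our zone
    print(walk("wp.puri.sm.", build_zones(suspended=True)))   # stops at sm.
```

In the suspended case the walk ends at `sm.` no matter how healthy the servers behind `ns1.wp.puri.sm.` are, because nothing ever refers the resolver to them.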
So, to repeat the problem: since our DNS glue records were removed from .sm, any recursive name server looking up a record under wp.puri.sm would stop at the .sm DNS servers. Since we use wp.puri.sm for most of our external and internal hosts, our outage didn’t stop at our website. It extended to our mail servers, internal chat, wikis, and everything else that ended in wp.puri.sm. For a brief time it also extended to the pureos.net and puri.st domains, because we had listed ns1-3.wp.puri.sm as their name servers.
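This kind of hidden dependency is easy to check for ahead of time. The sketch below is one rough way to do it, assuming you keep a simple inventory of which name servers each of your domains lists; `in_zone`, `affected_domains`, and the `example.org` entry are hypothetical and only there for illustration. It also ignores caching, so it describes the worst case rather than what every resolver sees at a given moment.

```python
def in_zone(hostname, zone):
    """True if hostname is the zone itself or sits somewhere beneath it."""
    hostname = hostname.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    return hostname == zone or hostname.endswith("." + zone)


def affected_domains(nameservers_by_domain, suspended_zone):
    """Return the domains whose listed name servers ALL live under the zone."""
    return sorted(
        domain
        for domain, nameservers in nameservers_by_domain.items()
        if nameservers and all(in_zone(ns, suspended_zone) for ns in nameservers)
    )


# Inventory modeled on the situation described above.
OUR_NS = ["ns1.wp.puri.sm", "ns2.wp.puri.sm", "ns3.wp.puri.sm"]
INVENTORY = {
    "wp.puri.sm": OUR_NS,
    "pureos.net": OUR_NS,
    "puri.st": OUR_NS,
    "example.org": ["ns1.example.net"],   # hypothetical, for contrast
}

if __name__ == "__main__":
    print(affected_domains(INVENTORY, "wp.puri.sm"))
```

Any domain that shows up in the result has a single point of failure: suspend the one zone that hosts its name servers and it goes dark too, even though its own registration is fine.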
So that’s what happened but that leads us to the more important question: why. As I had mentioned, our domain didn’t expire until May so that wasn’t the issue, and we hadn’t received any notices of non-compliance, so it must have been something else. Based on .sm’s rules of suspension, there were only a few possible explanations:
Our long-suffering and amazing sysadmins Theodotos and Stelios contacted our wp.puri.sm registrar, 101domain.com, to find out what was going on.
Thankfully, the registrar has 24×7 support, so even though it was the middle of the night US time, we were able to talk to their support team based out of Ireland. As with much tech support, it took some time to confirm with them that yes, we hadn’t changed this domain in months (if not years) and yes, everything was up on our side. Their immediate response was that it might take 24 to 48 hours to resolve the issue. As you might imagine, we stressed how urgent the issue was: we had an ecommerce site that was down, and therefore we were losing not just reputation but revenue.
They agreed to look into the problem, but with one complication: their “.sm specialist” was based out of their California office and the current team was unable to contact the registry! After trying multiple times to get 101domain.com to do something sooner, our sysadmins resigned themselves to the frustrating experience of waiting 3 hours (while our website, email, and almost everything else was down) until the registrar’s “.sm specialist” would wake up and arrive at the office. Before the registrar’s customer support ended the chat, they actually tried to upsell us on their TLS certificate services (you know, for when the domain finally came back up). Because obviously we needed to buy more critical services from them.
Around this time, I woke up to this disaster already in progress. I was brought up to speed (challenging, without corporate chat or email) and we waited another hour for the .sm specialist to arrive at the registrar’s California office. Hopefully we could get this resolved and the site could be back up in another hour or so after that.
We contacted the specialist first thing in the morning, and he had no idea why the domain was suspended. He said he would contact the .sm registry, but with one complication: the San Marino .sm registry office was now closed, so it might take until the next day for them to respond to the email! Because their office was closed, he said all he could do was put the ticket in the queue of the team on the next support shift: that’s right, the same team out of Ireland we originally contacted. Because their office hours mirrored the San Marino office hours, he assured us they would get to it first thing in the morning their time.
Our team at Purism is international and pretty good at taking matters into our own hands, so we decided not to just wait and see whether 101domain.com was going to contact the registry. Our sysadmin Stelios contacted the San Marino registry office directly the moment they opened, and discussed the issue with their friendly staff. Remember that list of reasons why a domain might be suspended? This is the most relevant one:
“non-payment of the registration fee as required by the RA;”
That’s right, our registrar didn’t pay their fees to .sm so the registry suspended all of 101domain.com’s .sm domains, including ours.
Stelios spent the rest of the day going through all of the authentication and paperwork (in Italian!) necessary to restore wp.puri.sm directly through the TLD. Shortly before I woke up in California, they agreed to restore our domain and said they would push to restore our DNS glue records as soon as they could.
At the same time, we were also curious what 101domain.com was going to do, so we also contacted the Ireland support team to check on the progress of our ticket. Their reply after a day of this? That they could not contact the registry themselves and we would have to wait for the .sm specialist in California! At this point, of course, we had already resolved matters without their help and a few hours later our DNS glue records were restored. An hour or so after that, our original support ticket was updated to say the domain was back online—as though it happened all on its own.
It’s hard to overstate just how frustrating an experience this was. Even if you have done everything right and your servers are up and running, your entire infrastructure can still be brought offline by a mistake at the registrar level. While we hope this never happens again, we want to be ready just in case, so we are taking some additional steps.
Due to how domain registration works, we can’t completely remove the possibility that our domain could be suspended again in the future. What we can do is try to mitigate the effects with redundancy. To that end, we have registered PurismSPC.com with a different registrar, and soon we will bring up a duplicate website and infrastructure under that domain. The idea here is not just to have a second domain with a second registrar, but to have a domain under a completely different and traditional TLD for redundancy. We will also update our TLS certificates to list both wp.puri.sm and purismspc.com, so you can be assured that we consider both sites valid when you visit with a web browser.
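As a rough illustration of how a client or monitoring script could take advantage of that second domain, here is a small sketch that tries each name in order and returns the first one that resolves. It is not a description of our actual tooling: `resolves` and `first_resolving` are hypothetical helpers, and resolving in DNS only tells you the name can be looked up, not that the site behind it is healthy.

```python
import socket


def resolves(hostname, lookup=socket.getaddrinfo):
    """True if the name can be looked up; the lookup is injectable for tests."""
    try:
        lookup(hostname, 443)
        return True
    except socket.gaierror:
        return False   # resolution failed, e.g. the delegation is gone


def first_resolving(hostnames, lookup=socket.getaddrinfo):
    """Return the first hostname that resolves, or None if none of them do."""
    for hostname in hostnames:
        if resolves(hostname, lookup):
            return hostname
    return None


if __name__ == "__main__":
    # Simulate the outage instead of touching the network.
    def outage_lookup(hostname, port):
        if hostname == "wp.puri.sm":
            raise socket.gaierror("simulated: name does not resolve")
        return [("simulated address", port)]

    print(first_resolving(["wp.puri.sm", "purismspc.com"], lookup=outage_lookup))
```

Passing the lookup function in as a parameter keeps the example runnable offline; left at its default, it uses the operating system's resolver through `socket.getaddrinfo`.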
Issues like this remind us why we want to empower you to have direct control over your computers and data. When you turn control over to other parties everything might be “OK” for a while, but inevitably something will go wrong someday, and when it does, you discover just how powerless you are.