On Monday 6th of January, we had a significant outage affecting all Raisely systems. Based on our global DNS monitoring, at the worst point of the incident about 50% of monitored internet providers were unable to reach Raisely. We're aware that visitors in the United Kingdom were particularly affected, along with some parts of Europe and North America. Most of North America, Australia, and New Zealand stayed online during the incident.
As is often the case when a service goes down, this was due to an unfortunate combination of human error and insufficient protections against human error.
Raisely exists to empower charities, and we all take uptime very seriously. We're committed to learning from the incident on Monday and ensuring a similar incident doesn't happen again. We're very sorry for the impact of this outage.
We’ve identified areas for improvement in our uptime and reliability and have already begun to implement some of the future mitigations listed below.
On 22nd of January, we made what should have been a routine change to our domain registration contact for raisely.com. Unfortunately, there was a mistake in the contact email that was entered. ICANN requires the email contact of a domain to be verified within 15 days, but due to the error, the verification email was not received.
On 6th of February, as the verification email was not received or confirmed, the domain registrar suspended the raisely.com domain. The suspension included changing the NS record (which indicates the authoritative domain name server for raisely.com) to ns2.registrant-verification.ispapi.net. That name server in turn resolved any lookups of A or AAAA records for raisely.com and any of its subdomains to its own server.
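As a rough illustration of what an NS change like this looks like to a check, the sketch below compares a set of observed name servers against the ones a domain is expected to use. It is only a sketch: the expected host names are hypothetical placeholders, not Raisely's real name servers, and fetching the live NS records (which would need a DNS library) is left out.

```python
def normalize(hostname):
    """Lower-case a DNS name and drop the optional trailing root dot."""
    return hostname.strip().rstrip(".").lower()


def unexpected_nameservers(observed, expected):
    """Return the observed name servers that are not in the expected set, sorted."""
    allowed = {normalize(name) for name in expected}
    return sorted({normalize(name) for name in observed} - allowed)


# Hypothetical expected delegation; these host names are placeholders.
EXPECTED_NS = ["ns1.dns-host.example", "ns2.dns-host.example"]

# The suspension name server named in this report.
observed_during_incident = ["NS2.registrant-verification.ispapi.net."]

print(unexpected_nameservers(observed_during_incident, EXPECTED_NS))
```

Any non-empty result means the domain is delegated somewhere it should not be, whatever the reason for the change.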
At 9:43pm AEDT, we were alerted to intermittent failures in our messaging and API, and the issue was escalated to the Head of Engineering.
At 10:00pm AEDT we identified that some hosts were receiving incorrect A records and traced it back to an incorrect NS record. We logged into our domain registrar to understand the issue and discovered the verification problem that had caused it.
At 10:10pm AEDT we created an email alias to receive the verification email at the incorrect address. As the domain registrar enforced strict limits on how often a verification email could be resent, we took several minutes to test and verify the alias to ensure that the verification email would arrive before resending it. At 10:17pm we resent the verification email and verified the contact details. It took some time before we were able to see that the NS record had been restored.
At 10:29pm AEDT we were able to see that the NS record had been corrected, and began monitoring for propagation.
We cleared the public DNS caches of Google, Cloudflare and OpenDNS to speed up the propagation of the correct NS records.
By 11:04pm AEDT we could see that raisely.com was resolving correctly for Oceania, most of the UK and parts of the USA, and by 11:41pm AEDT we could see that all major DNS providers were serving the correct NS and A records.
At 12:50am AEDT we noted that some of our internal services were continuing to cache incorrect DNS records, and we redeployed those services to force a cache clear.
We continued to monitor the service stability until 2:00am AEDT at which point we were confident all systems were functioning normally and DNS had propagated to all top level public DNS servers.
Unfortunately, the issue was compounded by the domain registrar setting the TTL on the NS, A and AAAA records to 1 hour. Since resolving the IP address for a Raisely host first involves finding the authoritative name server (looking up the NS record) and then asking that server for the A or AAAA record, it could take up to 2 hours for a top level DNS server to begin serving the correct records.
Downstream DNS servers may in turn fetch and cache records from those servers, which resulted in some customers continuing to experience the issue for many hours after it was resolved.
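The 2 hour figure follows from adding the TTLs of the two chained lookups. A minimal sketch of that arithmetic, assuming a resolver that caches the bad NS record just before the fix and then fetches the bad A record just before that NS entry expires (the function name is ours, for illustration only):

```python
from datetime import timedelta


def worst_case_stale_window(chained_ttls):
    """Upper bound on how long one resolver may keep serving stale answers.

    Each lookup in the chain can be cached just before the previous cache
    entry expires, so in the worst case the TTLs add up.
    """
    return sum(chained_ttls, timedelta())


ns_ttl = timedelta(hours=1)       # TTL the registrar set on the NS record
address_ttl = timedelta(hours=1)  # TTL the registrar set on the A and AAAA records

print(worst_case_stale_window([ns_ttl, address_ttl]))
```

This bound covers a single resolver only. It does not model downstream servers that keep records for longer than the TTL they were given.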
Some of our customers were able to work around this by using a VPN, or by changing their DNS settings to use public DNS servers (e.g. those operated by Google, Cloudflare or OpenDNS).
While we have other contact emails with the domain registrar, the domain registrar unfortunately did not use any of those to advise us of the configuration problem.
As we host our status page on status.raisely.com and our support email address is also on the raisely.com domain, customers impacted by the outage were often also unable to contact support or get any details of the outage or its progress towards resolution.
Because the TTL on our NS record is normally set to 24 hours, many DNS servers kept a correct copy of the NS records in their cache during the incident, which shielded some customers from the issue.
However, this also had a downside: much of our uptime monitoring had the correct DNS records cached as well, and so it took longer for us to be alerted to the issue.
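Both effects come from the same caching behaviour, which the toy model below tries to show. It is a simplified sketch rather than a real resolver: times are in seconds on an arbitrary clock, and the "good" name server host name is a hypothetical placeholder.

```python
class TtlCache:
    """Toy resolver cache: keeps an answer until its TTL runs out."""

    def __init__(self):
        self._entries = {}  # name -> (value, expiry time in seconds)

    def lookup(self, name, now, fetch):
        """Return the cached value if still fresh, else call fetch() and cache it.

        fetch() must return a (value, ttl_seconds) pair.
        """
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]
        value, ttl = fetch()
        self._entries[name] = (value, now + ttl)
        return value


HOUR = 3600
GOOD_NS = "ns1.dns-host.example"  # placeholder for the normal name server
BAD_NS = "ns2.registrant-verification.ispapi.net"


def good():
    return GOOD_NS, 24 * HOUR  # normal NS TTL of 24 hours


def suspended():
    return BAD_NS, 1 * HOUR  # TTL the registrar set during the suspension


# A resolver (or monitor) that cached the NS record an hour before the suspension
# still answers from its cache during the incident and never sees the change.
warm = TtlCache()
warm.lookup("raisely.com", now=-1 * HOUR, fetch=good)
warm_during = warm.lookup("raisely.com", now=0.5 * HOUR, fetch=suspended)

# A resolver with an empty cache picks up the suspended record straight away
# and keeps it until the 1 hour TTL expires, even if the fix has landed.
cold = TtlCache()
cold_during = cold.lookup("raisely.com", now=0.5 * HOUR, fetch=suspended)
cold_after_fix = cold.lookup("raisely.com", now=1.25 * HOUR, fetch=good)

print(warm_during, cold_during, cold_after_fix)
```

A monitor that resolves through a warm cache behaves like `warm` here, which is why checks that bypass resolver caches are more likely to notice a delegation change quickly.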
As I said, this issue, like so many, started with a human error. At Raisely we know that human errors will always happen, and so the only way to uphold our responsibility to our customers is to put systems in place to prevent these errors and alert us to them when those preventions fail.
When looking at issue mitigation we aim to look at the broader lessons to be learned. It’s very unlikely that we will make that same mistake again, and so we look for remedies that are not limited to this specific issue.
In this case we are taking steps to limit the impact of an issue with any single domain record, and to ensure that we catch issues with DNS sooner:
We are moving our domain name registration to a new registrar that provides the following additional safeguards: