DNS resolution issue for raisely.com

Incident Report for Raisely

Postmortem

On Monday 6th of February, we had a significant outage affecting all Raisely systems. Based on our global DNS monitoring, at the worst stage of the incident about 50% of monitored internet providers were unable to resolve Raisely correctly. We're aware that visitors in the United Kingdom were particularly affected, along with some parts of Europe and North America. Most of North America, Australia, and New Zealand stayed online during the incident.

As is often the case when a service goes down, this was due to an unfortunate combination of human error and insufficient protections against human error.

Raisely exists to empower charities, and we all take uptime very seriously. We're committed to learning from the incident on Monday and ensuring a similar incident doesn't happen again. We're very sorry for the impact of this outage.

We’ve identified areas for improvement in our uptime and reliability and have already begun to implement some of the future mitigations listed below.

Timeline of the incident

On 22nd of January, we made what should have been a routine change to our domain registration contact for raisely.com. Unfortunately, there was a mistake in the contact email that was entered. ICANN requires the email contact of a domain to be verified within 15 days, but due to the error, the verification email was not received.
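The dates above can be checked with a short calculation. This is a minimal sketch, assuming the 15-day window is counted in calendar days from the date of the change and that the year is 2023 (taken from the posting dates on this report). The function name is hypothetical.

```python
from datetime import date, timedelta

# Assumption: the verification window is 15 calendar days from the change.
VERIFICATION_WINDOW_DAYS = 15


def verification_deadline(change_date: date,
                          window_days: int = VERIFICATION_WINDOW_DAYS) -> date:
    """Return the date on which an unverified contact change runs out of time."""
    return change_date + timedelta(days=window_days)


change = date(2023, 1, 22)
print(verification_deadline(change).isoformat())  # 2023-02-06
```

Under those assumptions, the window closes on the same day the domain was suspended.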

On 6th of February, as the verification email was not received or confirmed, the domain registrar suspended the raisely.com domain. The suspension included changing the NS record (which indicates the authoritative domain name server for raisely.com) to ns2.registrant-verification.ispapi.net. That name server in turn resolved any lookups of A or AAAA records for raisely.com and any subdomains to the registrar's own server.

As this record propagated, customers would most commonly have encountered the problem as an SSL certificate error, because to browsers it appeared that raisely.com was using an SSL certificate for *.ispapi.net.
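To illustrate why browsers rejected the connection, here is a simplified sketch of wildcard certificate name matching. It assumes the common rule that `*` stands in for exactly one leftmost label. Real browsers apply further checks, and the function name is hypothetical.

```python
def cert_name_matches(hostname: str, cert_name: str) -> bool:
    """Simplified check: does a certificate name cover the requested hostname?

    A leading "*" label is treated as matching exactly one label.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    cert_labels = cert_name.lower().rstrip(".").split(".")
    if len(host_labels) != len(cert_labels):
        return False
    if cert_labels[0] == "*":
        # Compare everything to the right of the wildcard label.
        return host_labels[1:] == cert_labels[1:]
    return host_labels == cert_labels


print(cert_name_matches("raisely.com", "*.ispapi.net"))         # False
print(cert_name_matches("status.raisely.com", "*.ispapi.net"))  # False
```

Neither raisely.com nor any of its subdomains can match a certificate issued for *.ispapi.net, so the browser reports a certificate error instead of loading the page.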

At 9:43pm AEDT, we were alerted to intermittent failures in our messaging and API, and the issue was escalated to the Head of Engineering.

At 10:00pm AEDT we identified that some hosts were receiving incorrect A records and traced it back to an incorrect NS record. We logged into our name registrar to understand the issue and discovered the verification issue that had caused the problem.

At 10:10pm AEDT we created an email alias to receive the verification email at the incorrect address. As the domain registrar enforced strict limits on how often a verification email could be resent, we took several minutes to test and verify the alias to ensure that the verification email would arrive before resending it. At 10:17pm we resent the verification email and verified the contact details. It took some time before we were able to see that the NS record had been restored.

At 10:29pm AEDT we were able to see that the NS record had been corrected, and began monitoring for propagation.

We cleared the public DNS caches of Google, Cloudflare (1.1.1.1) and OpenDNS to speed up the propagation of the correct NS records.

By 11:04pm AEDT we could see that raisely.com was resolving correctly for Oceania, most of the UK and parts of the USA, and by 11:41pm AEDT we could see that all major DNS providers were serving the correct NS and A records.

At 12:50am AEDT we noted that some of our internal services were continuing to cache incorrect DNS records, and we redeployed those services to force a cache clear.

We continued to monitor the service stability until 2:00am AEDT at which point we were confident all systems were functioning normally and DNS had propagated to all top level public DNS servers.

Compounding factors

Unfortunately, the issue was compounded by the domain registrar setting the TTL on the NS, A and AAAA records to 1 hour. Resolving the IP address for a Raisely host first involves finding the authoritative name server (looking up the NS record) and then asking that server for the A or AAAA record. A DNS server could therefore cache the incorrect NS record for up to 1 hour, and then cache an incorrect A or AAAA record obtained from that name server for up to another hour, so it could take up to 2 hours for a top level DNS server to begin serving the correct records.
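The 2 hour figure follows from adding the two TTLs. This is a minimal sketch of that arithmetic, assuming a single caching resolver that fetched the incorrect NS record just before the fix and then fetched an address record from the incorrect name server just before that NS record expired. The function name is hypothetical.

```python
from datetime import timedelta


def worst_case_staleness(ns_ttl: timedelta, address_ttl: timedelta) -> timedelta:
    """Upper bound on how long one caching resolver can keep serving a bad address.

    The bad NS record can live for ns_ttl after the fix, and an A/AAAA record
    fetched from the bad name server at the last moment lives for address_ttl more.
    """
    return ns_ttl + address_ttl


# TTLs set by the registrar during the suspension, as described above.
bound = worst_case_staleness(timedelta(hours=1), timedelta(hours=1))
print(bound)  # 2:00:00
```

This bound applies to one resolver only. Resolvers that cache from other resolvers can extend the delay further.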

Downstream DNS servers may in turn fetch and cache records from those servers, which resulted in some customers continuing to experience the issue for many hours after it was resolved.

Some of our customers were able to work around this by using a VPN, or changing their DNS settings to use public DNS servers (e.g. 1.1.1.1, 8.8.4.4 or 8.8.8.8).

While we have other contact emails with the domain registrar, the domain registrar unfortunately did not use any of those to advise us of the configuration problem.

As we host our status page on status.raisely.com and our support at support@raisely.com, it meant customers impacted by the outage were often also unable to contact support or get any details of the outage or its progress towards resolution.

Mitigations

As the TTL on our NS record is set to 24 hours, it meant that many DNS servers maintained a correct copy of the NS records in their cache during the incident which shielded some customers from the issue.

However, this mitigation had the downside that much of our uptime monitoring also had the correct DNS records cached, and so it took longer for us to be alerted to the issue.

Future Mitigation

As noted above, this issue, like so many, started with a human error. At Raisely we know that human errors will always happen, and so the only way to uphold our responsibility to our customers is to put systems in place to prevent these errors and alert us to them when those preventions fail.

When looking at issue mitigation we aim to look at the broader lessons to be learned. It’s very unlikely that we will make that same mistake again, and so we look for remedies that are not limited to this specific issue.

In this case we are taking steps to limit the impact of an issue with any single domain record, and to ensure that we catch issues with DNS sooner:

We are updating our downtime monitoring to include monitoring of our DNS configuration.
We have purchased and are in the process of spreading critical services across new domains to limit the impact of a future DNS issue and ensure that we are still able to communicate with customers if such an issue were to occur.
We are moving our domain name registration to a new registrar that provides the following additional safeguards:
- Additional automatic and human checks on contact information updates to catch errors.
- Unverified changes such as the one that caused this issue will be rolled back rather than cause a domain suspension.
- More and faster options for contacting their support.
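As an illustration of the DNS configuration monitoring described in the list above, here is a minimal sketch of an NS record check. It is not Raisely's actual monitoring. The expected name server names are hypothetical placeholders, and the DNS lookup is passed in as a function so the check can be exercised without network access. In practice the lookup would query the parent zone or several public resolvers directly, so that a locally cached record does not hide a change.

```python
from typing import Callable, Iterable, List


def normalise(name: str) -> str:
    """Lower-case a DNS name and drop any trailing dot."""
    return name.strip().rstrip(".").lower()


def unexpected_name_servers(domain: str,
                            expected: Iterable[str],
                            lookup_ns: Callable[[str], Iterable[str]]) -> List[str]:
    """Return observed NS names for `domain` that are not in the expected set."""
    wanted = {normalise(n) for n in expected}
    observed = {normalise(n) for n in lookup_ns(domain)}
    return sorted(observed - wanted)


# Hypothetical expected name servers, for illustration only.
EXPECTED = ["ns1.example-dns.net", "ns2.example-dns.net"]


def fake_lookup(domain: str) -> List[str]:
    # Stand-in for a real DNS query; returns the NS value seen during the incident.
    return ["ns2.registrant-verification.ispapi.net."]


problems = unexpected_name_servers("raisely.com", EXPECTED, fake_lookup)
if problems:
    print("ALERT: unexpected NS records:", ", ".join(problems))
```

A check like this would alert as soon as the delegation changed, rather than waiting for cached records to expire and requests to start failing.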

Posted Feb 10, 2023 - 12:14 AEDT

Resolved

Raisely's error rates have been at normal levels for a while. This incident is now resolved.

Posted Feb 07, 2023 - 02:00 AEDT

Update

We are continuing to monitor for any further issues.

Posted Feb 07, 2023 - 00:51 AEDT

Update

Raisely is operational for all users now. We're monitoring and working to resolve high API error rates.

Posted Feb 07, 2023 - 00:50 AEDT

Update

We're continuing to monitor the rollout of our fix. Raisely is operating as normal for the vast majority of internet providers.

Posted Feb 06, 2023 - 23:41 AEDT

Update

We have seen the DNS records propagating.
Oceania, most of the UK and parts of the USA are now resolving correctly.

Posted Feb 06, 2023 - 23:04 AEDT

Update

We are checking on the propagation of the DNS fix.

Posted Feb 06, 2023 - 22:32 AEDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 06, 2023 - 22:29 AEDT

Identified

We've identified an issue with DNS configuration and are working to resolve it.

Posted Feb 06, 2023 - 22:19 AEDT

Investigating

We are aware of degraded performance of our API and Messaging systems and are currently investigating.

Posted Feb 06, 2023 - 21:43 AEDT

This incident affected: Websites, Marketing Automation, API, and Admin Panel.