On May 15 at 18:00 PDT, Nagios alerted us to the start of sporadic DNS resolution failures. This post summarizes the events, the impact, and the specific steps Mozilla IT is taking to avoid future disruptions of this nature.
This post is intended to be technical in nature. DNSSEC is fairly technical and DNSSEC failures tend to be similarly technical. As we’ve done before, we hope to share the failures we encounter in production so you don’t have to experience the same.
An SOA mismatch between SVN and the nameservers was caused by the DNSSEC signer refusing to sign with an expired ZSK. This was misdiagnosed as a KSK issue, leading to a DNS outage for DNSSEC-verifying resolvers.
In the afternoon of May 15, the nameservers refused to load SOA update 2013051500 for the mozilla.org DNSSEC-signed zone.
Investigation found that the DNSSEC signer was refusing to sign the zone, providing only the error “fatal: cannot find SOA RRSIGs”. In hindsight, this undocumented error indicates that the zone’s ZSK has expired.
Mozilla’s domain registrar publishes DS records for the mozilla.org KSK. When the expired key was found at 16:44, it was mistaken for a KSK rather than a ZSK. A new KSK was generated and its DS record was added at Mozilla’s domain registrar.
The new KSK did not resolve the signing errors. Mozilla’s domain registrar was found to rate-limit DS record changes, which prevented the new KSK from being reverted. DNS lookups began showing invalid DS records from Mozilla’s domain registrar, but this was later found to affect internal DNS only.
After examining the keys (both current and expired) more closely, the expired key was found to be a ZSK, rather than a KSK. Renewing the ZSK fixed the DNSSEC signer. The mozilla.org SOA 2013051500 was signed by both KSKs and the new ZSK, and then published.
Comcast users began reporting DNS resolution issues for mozilla.org, complicating access to various Mozilla properties. DNSSEC validation tools showed unexpected issues with the signed mozilla.org zone.
The DS records were confirmed to be correct externally, so the mozilla.org zone was re-signed without the old KSK, leaving only the new KSK and new ZSK. This resolved the validation issues for reasons unknown, and Comcast users reported DNS working correctly again.
Bugs have been filed to document the KSK/ZSK renewal process, to monitor the expiration times of those keys, and to monitor that the zones validate.
- 872818: mozilla.org SOA mismatch, DNSSEC signer refusing to sign
- 872831: alarm when DNSSEC signing keys are expiring soon
- 872884: document ZSK and KSK renewal/rollover process
- 872832: regenerate mozilla.org DNSSEC ZSK (resolved)
- 872885: regenerate mozilla.org DNSSEC KSK (resolved)
- 872927: monitoring: add full validation of DNSSEC zones
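As a rough illustration of the expiry alarm requested in bug 872831, the sketch below reads the timing comments from a public key file and reports whether the key is expired or close to expiring. It assumes BIND-style key files that carry timing metadata as comment lines (such as "; Inactive: 20130515000000 ..."). The sample key file, the 14-day warning window, and the names key_timings and expiry_status are made up for this example and are not our production monitoring.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed format: BIND-style timing comments, for example
# "; Inactive: 20130515000000 (Wed May 15 00:00:00 2013)".
_TIMING = re.compile(
    r"^;\s*(Created|Publish|Activate|Revoke|Inactive|Delete):\s*(\d{14})"
)

# Hypothetical key file for illustration only; the key material is a placeholder.
SAMPLE_KEYFILE = """\
; This is a zone-signing key, keyid 12345, for example.org.
; Created: 20130415000000 (Mon Apr 15 00:00:00 2013)
; Activate: 20130415000000 (Mon Apr 15 00:00:00 2013)
; Inactive: 20130515000000 (Wed May 15 00:00:00 2013)
example.org. IN DNSKEY 256 3 5 AwEAAQ==
"""


def key_timings(keyfile_text):
    """Return {label: UTC datetime} for every timing comment in the key file."""
    timings = {}
    for line in keyfile_text.splitlines():
        match = _TIMING.match(line)
        if match:
            stamp = datetime.strptime(match.group(2), "%Y%m%d%H%M%S")
            timings[match.group(1)] = stamp.replace(tzinfo=timezone.utc)
    return timings


def expiry_status(keyfile_text, now, warn_window=timedelta(days=14)):
    """Classify a key as 'ok', 'expiring', 'expired', or 'no-expiry'."""
    inactive = key_timings(keyfile_text).get("Inactive")
    if inactive is None:
        return "no-expiry"  # no Inactive time recorded in the file
    if inactive <= now:
        return "expired"
    if inactive - now <= warn_window:
        return "expiring"
    return "ok"


if __name__ == "__main__":
    print(expiry_status(SAMPLE_KEYFILE, datetime.now(timezone.utc)))
```

A monitoring check could run this over every key file and alert on anything other than "ok". Treating "no-expiry" as a warning rather than a pass is probably the safer choice, since a key with no recorded timing data cannot be monitored this way.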
TIMELINE (PST8PDT, UTC -0700)
- 15:32 – SOA mismatch detected between nameservers 2013051402 and svn 2013051500.
- 16:03 – Found DNSSEC signer refusing to sign mozilla.org 2013051500
- 16:44 – Found expired key preventing signing of mozilla.org
- 16:52 – Added new KSK to Mozilla’s domain registrar alongside existing KSK to renew expired key
- 17:06 – Found that expired key was ZSK, not KSK as previously thought.
- 17:27 – Signed mozilla.org with both KSKs and new ZSK
- 17:45 – Mozilla’s domain registrar publishing incorrect hash for new KSK (misleadingly, for internal lookups only)
- 18:00 – Comcast users reporting sporadic DNS resolution failures
- 18:20 – Validation issue found with signed zones
- 18:25 – Signed mozilla.org with new KSK and new ZSK
- 18:30 – Comcast users reporting DNS resolving successfully
- 18:35 – Validation issue confirmed resolved
- ZSK and KSK are the “zone signing key” and “key signing key” for mozilla.org. DNSSEC permits multiple KSKs and autoselects the latest ZSK. We sign with a single KSK, except between 17:30 and 18:25 in the timeline above.
- There is no filesystem difference between ZSKs and KSKs. The distinction is the word “zone” or “key” in the comment in the first line of the keyfile.
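Besides the comment line, the DNSKEY record inside the key file carries the same distinction in its flags field: per RFC 4034, a ZSK normally has flags 256 (zone key bit only) and a KSK has flags 257 (zone key bit plus the SEP bit). The sketch below classifies a key from that field, which would have avoided the misdiagnosis described above. It assumes the key file contains a standard presentation-format DNSKEY line; the sample records and the function name classify_key are hypothetical.

```python
ZONE_KEY_BIT = 0x0100  # set on every key that may sign zone data
SEP_BIT = 0x0001       # "secure entry point", conventionally set on KSKs

# Hypothetical records for illustration only; the key material is a placeholder.
SAMPLE_ZSK = """\
; This is a zone-signing key, keyid 12345, for example.org.
example.org. IN DNSKEY 256 3 5 AwEAAQ==
"""
SAMPLE_KSK = """\
; This is a key-signing key, keyid 54321, for example.org.
example.org. 3600 IN DNSKEY 257 3 5 AwEAAQ==
"""


def classify_key(keyfile_text):
    """Return 'KSK', 'ZSK', or 'unknown' based on the DNSKEY flags field."""
    for line in keyfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        fields = line.split()
        if "DNSKEY" not in fields:
            continue
        idx = fields.index("DNSKEY")
        # The flags field is the first token after the record type.
        if idx + 1 >= len(fields) or not fields[idx + 1].isdigit():
            return "unknown"
        flags = int(fields[idx + 1])
        if not flags & ZONE_KEY_BIT:
            return "unknown"  # not a zone key at all
        return "KSK" if flags & SEP_BIT else "ZSK"
    return "unknown"


if __name__ == "__main__":
    print(classify_key(SAMPLE_ZSK), classify_key(SAMPLE_KSK))
```

Checking the flags rather than the comment has the advantage that the flags are part of the published record, so the same test works on the output of a DNS lookup as well as on the file on disk.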