A few days ago one of the internal certificate authorities (CAs) expired and caused a wave of SSL errors across the firm. It took quite a while to diagnose and mitigate, with potential issues still ongoing as of this writing. The firm will remain unnamed throughout this post and is simply referred to as the firm.
In addition, this happened at a time of very high market activity related to COVID-19, which may or may not have been worsened by the firm being partially out-of-market.
After extensive investigation, the root cause turned out to be human error: a CSR was re-used a few years ago.
I was not able to find any literature or recommendations around the re-use of public keys, private keys and CSRs when renewing certificates. It is pretty clear after this experience that these are not designed for re-use and should not be re-used. Thus I am writing this postmortem to document it and help other people avoid repeating this mistake in the future.
We will get into the details of how the public key infrastructure operates on platforms including but not limited to Linux, Windows, Python and Java.
Public Key Infrastructure
You’re expected to understand the basics of certificates and certificate authorities. Refer to Wikipedia.
Certificates typically expire every year and need to be renewed in advance. A certificate is signed by a certificate authority.
More importantly, certificate authorities (CAs) expire every 5 to 20 years and need to be renewed as well.
CAs are preloaded on a system. Given a certificate, the associated CA can be determined automatically and used to verify it. A system is pre-configured to trust a set of certificate authorities, and all certificates issued by these are transitively trusted. Pretty simple.
The firm is using an internal certificate authority (CA).
- An older CA was created in 2015 and just expired.
- A newer CA was created in 2018 and will expire in 2023.
There are numerous CAs active simultaneously and they roll over.
However, the new CA was misconfigured when it was created back in 2018. It was created using the same settings and the same private key (probably the same CSR) as the older CA.
This made the new and the old CA virtually indistinguishable.
Remember how the certificate authority (CA) can be selected automatically when verifying a certificate? This doesn’t work anymore when two conflicting CAs are set up with the exact same settings.
What happens then depends on the library, the OS, the order of configuration, etc. Either or both CAs might be used for SSL verification, and the verification will fail whenever the expired CA is used.
That’s why all systems that were (properly) updated with the newer CA may have started experiencing erratic SSL connection errors the minute the old CA expired.
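To make the conflict concrete, here is a minimal sketch of how one might look for it from Python. It groups the CAs loaded in an `ssl` context by subject and reports any subject that appears more than once, along with whether each copy has expired. The function name `find_conflicting_cas` is my own, and I am assuming the conflicting CAs share an identical subject, which is what "same settings" amounted to in our case. Note that `get_ca_certs()` only reports CAs loaded from a bundle file or added explicitly, not ones lazily picked up from a certificate directory.

```python
import ssl
import time
from collections import defaultdict


def find_conflicting_cas(ca_certs, now=None):
    """Report subjects that appear on more than one CA certificate.

    `ca_certs` is a list of dicts in the format returned by
    ssl.SSLContext.get_ca_certs().
    """
    now = time.time() if now is None else now

    # The subject is a nested tuple, so it can be used as a dict key.
    by_subject = defaultdict(list)
    for cert in ca_certs:
        by_subject[cert["subject"]].append(cert)

    conflicts = {}
    for subject, certs in by_subject.items():
        if len(certs) < 2:
            continue
        conflicts[subject] = [
            {
                "serial": cert["serialNumber"],
                "not_after": cert["notAfter"],
                "expired": ssl.cert_time_to_seconds(cert["notAfter"]) < now,
            }
            for cert in certs
        ]
    return conflicts


if __name__ == "__main__":
    context = ssl.create_default_context()
    for subject, entries in find_conflicting_cas(context.get_ca_certs()).items():
        print(subject)
        for entry in entries:
            print("   ", entry)
```

A subject listed with one expired and one valid entry is exactly the situation described above.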
To resolve the issue of conflicting certificate authorities, the older conflicting CA must be purged.
- For systems that store the CA standalone in a file, that file should be replaced by the newer CA.
- For systems that store the CA in a bundle (both CA and possibly many more bundled in one large file), the old CA must be removed from that file and the newer CA must be added.
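For the bundle case, a rough sketch of the purge could look like the following. Since the two CAs share the same settings and key, I identify the expired one by the SHA-256 fingerprint of its DER bytes, which still differs because the validity dates differ. The helper names (`fingerprint`, `purge_ca`) are hypothetical, and this simple version drops any comments between certificates, so treat it as a starting point rather than a finished tool.

```python
import base64
import hashlib
import re

# Matches one PEM certificate block; group 1 is the base64 body.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----\s+(.*?)\s+-----END CERTIFICATE-----",
    re.DOTALL,
)


def fingerprint(pem_body):
    """Return the lowercase hex SHA-256 digest of the certificate's DER bytes."""
    der = base64.b64decode("".join(pem_body.split()))
    return hashlib.sha256(der).hexdigest()


def purge_ca(bundle_text, expired_fingerprint):
    """Return the bundle without the certificate matching the fingerprint.

    The fingerprint may be given with or without colons, in any case.
    Text outside the PEM blocks (comments, labels) is not preserved.
    """
    wanted = expired_fingerprint.replace(":", "").lower()
    kept = [
        match.group(0)
        for match in PEM_BLOCK.finditer(bundle_text)
        if fingerprint(match.group(1)) != wanted
    ]
    return "\n".join(kept) + "\n"
```

The result can then be written back over the bundle file, with the newer CA appended if it was not already in there.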
There are different sources of CAs. The challenge in purging the old CA is figuring out what is using what from where.
The following goes into how major platforms and frameworks handle certificate verification.
For Linux hosts, the certificate subsystem is managed by the package
ca-certificates and files in
Individual certificates are stored in
/etc/pki/trust/anchors and more can be added. This can be managed by your favorite configuration management system (#ansible).
All CAs are packed into a bundle ready-to-use by applications:
/etc/ssl/certs/ca-certificates.crt (on Debian derivatives)
There are multiple bundle formats available. Calling
update-ca-trust will regenerate all the bundles, with the anchors.
What happens when conflicting CAs are configured? Say
/etc/pki/trust/anchors/new.pem using the same private keys.
They all get bundled together, in whichever order; how the bundle is then interpreted is up to whatever reads it.
Python urllib notably uses the host bundle on Linux (explained above) and on Windows (explained later) out of the box.
Specifically the file
After testing, urllib fails on Linux when both CAs are present in the bundle with the expired one appearing first. Note that the order is most likely a side effect of packaging, so it is better not to rely on it.
The expired CA must be removed to get things back to working order.
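When it is unclear which bundle a given Python interpreter actually reads, the `ssl` module can tell you. The sketch below simply wraps `ssl.get_default_verify_paths()`; the wrapper name and the dictionary keys are mine. The values depend on how the interpreter's OpenSSL was built and on environment variables, so run it on the host and interpreter you are debugging.

```python
import ssl


def describe_default_trust():
    """Return where this interpreter's OpenSSL looks for CAs by default."""
    paths = ssl.get_default_verify_paths()
    return {
        # Resolved bundle file and directory (None if they do not exist).
        "cafile": paths.cafile,
        "capath": paths.capath,
        # Compiled-in OpenSSL defaults.
        "openssl_cafile": paths.openssl_cafile,
        "openssl_capath": paths.openssl_capath,
        # Names of the environment variables that override the defaults.
        "cafile_env": paths.openssl_cafile_env,
        "capath_env": paths.openssl_capath_env,
    }


if __name__ == "__main__":
    for key, value in describe_default_trust().items():
        print(f"{key:15} {value}")
```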
requests does not interface with the host, neither on Linux nor on Windows.
It loads a CA bundle from the certifi package, a tiny Python package whose only goal is to provide a CA bundle.
That file has to be patched to adjust the CA configuration.
When testing with both conflicting CAs present, it seems the one that appears first is consistently used.
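To find the file in question on a given machine, something like the sketch below should do. It relies on `certifi.where()`, which returns the path of the bundled file, and it also checks the `REQUESTS_CA_BUNDLE` environment variable, which requests honors as an override in its default configuration. The function name `requests_ca_bundle` is hypothetical, and the function returns `None` when certifi is not installed.

```python
import importlib.util
import os


def requests_ca_bundle():
    """Best-effort guess of the CA bundle path requests will use."""
    # An explicit override takes precedence over the certifi bundle.
    override = os.environ.get("REQUESTS_CA_BUNDLE")
    if override:
        return override

    # certifi is a third-party package; do not assume it is installed.
    if importlib.util.find_spec("certifi") is None:
        return None

    import certifi

    return certifi.where()


if __name__ == "__main__":
    print(requests_ca_bundle())
```

Keep in mind that every virtualenv carries its own copy of certifi, so the answer can differ from one application to the next on the same host.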
Java embeds its own CA bundle within the JVM; have a look in
lib/security. The bundle there has to be patched to adjust the default CAs.
Otherwise, Java can be configured to use the host CA on Linux with
I haven’t tested extensively on multiple versions of Java, but it also seems to fail if the expired CA is present in the bundle.
Windows has its own certificate management subsystem with a dedicated API.
To view the configured CAs on a Windows host, run
Trusted Root Certification Authorities and
Intermediate Certification Authorities =>
Additional CAs are usually rolled out through Active Directory group policies.
When testing with Python urllib (which integrates with the Windows trust store out of the box), it is able to verify a certificate even if the expired CA is present in the Windows configuration, unlike on Linux, where it would fail.
Your mileage may vary depending on the framework in use (C#?) and other unknown factors.
Addendum, after many more hours of debugging: Python SSL is capable of loading system certificates on Windows. Low-level functions
ssl.enum_certificates("CA"). This is used in some python libraries,
ssl itself and
httplib for example. These will fail to establish SSL connections when Windows computers are misconfigured with both the new and the expired CA. The fix is to remove the expired CA, so that will be 200k+ Windows devices to emergency fix.
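For checking a single Windows machine from Python, here is a rough sketch built on `ssl.enum_certificates`. It loads the DER certificates from the "ROOT" and "CA" system stores into a throwaway context so that `get_ca_certs()` can decode them, then filters the expired ones. The function names are my own, the trust flags returned by Windows are ignored for simplicity, and on non-Windows platforms the store lookup just returns an empty list.

```python
import ssl
import sys
import time


def windows_store_cas(store_names=("ROOT", "CA")):
    """Decode the CA certificates found in the given Windows system stores."""
    if sys.platform != "win32":
        return []  # ssl.enum_certificates only exists on Windows

    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    for store in store_names:
        for der, encoding, _trust in ssl.enum_certificates(store):
            if encoding != "x509_asn":
                continue  # skip PKCS#7 entries
            try:
                context.load_verify_locations(cadata=der)
            except ssl.SSLError:
                continue  # skip certificates OpenSSL cannot parse
    return context.get_ca_certs()


def expired_cas(ca_certs, now=None):
    """Return the entries whose notAfter date is in the past."""
    now = time.time() if now is None else now
    return [
        cert
        for cert in ca_certs
        if ssl.cert_time_to_seconds(cert["notAfter"]) < now
    ]


if __name__ == "__main__":
    for cert in expired_cas(windows_store_cas()):
        print(cert["notAfter"], cert["subject"])
```

Any expired CA this prints that shares a subject with a valid one is a candidate for removal.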
Public key infrastructure and common cryptographic libraries do not work well when facing conflicting certificates. Do not reuse private keys or CSRs.
If it’s too late and it has already happened, refer to the mitigation section.
There are probably more edge cases that could be discovered if one were to look deeper around signatures, revocations and other aspects that rely on identifying unique certificates.