+++
title = "Let's Encrypt Certificates: DNS Blocked"
date = 2020-09-23T23:40:00-05:00
+++
The *certs* Jenkins job has been failing for a while, ever since I blocked
outbound DNS traffic to the Internet. The problem is that `lego` queries DNS
for each domain in the certificate request repeatedly until it sees the
`_acme-challenge` TXT record it created. With DNS traffic blocked, it can
never reach the configured DNS servers (formerly Cloudflare, now Quad9), so it
just waits until its timeout expires.
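
To illustrate the polling behavior described above, here is a minimal sketch of that kind of wait loop in shell. It is not `lego`'s actual implementation: the function name `wait_for_txt`, its arguments, and the default timeout and interval are all hypothetical, and it assumes `dig` is installed.

```sh
# Hypothetical helper (not lego's code): poll a resolver until the expected
# TXT value shows up, or give up once the timeout has elapsed.
wait_for_txt() {
    name=$1            # record to look up, e.g. _acme-challenge.example.com
    expected=$2        # challenge value expected in the TXT record
    server=$3          # resolver to query, e.g. 9.9.9.9 (Quad9)
    timeout=${4:-60}   # total seconds to keep trying (assumed default)
    interval=${5:-2}   # seconds between attempts (assumed default)
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # +short prints only the record data; TXT values come back quoted.
        if dig +short +time=2 +tries=1 TXT "$name" "@$server" 2>/dev/null \
                | grep -qF "\"$expected\""; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}
```

A call might look like `wait_for_txt _acme-challenge.example.com "$token" 9.9.9.9 120 5`, where the domain and `$token` are placeholders. With outbound DNS blocked, every query in a loop like this times out, so the function can only return once its own timeout expires, which matches the failure seen in the Jenkins job.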
## Attempt 1: `_acme-challenge` CNAME
At first, I thought the problem was simply that `lego` needed a DNS
server. I couldn't remember why I had configured it to use a third-party
server, so I just disabled that. By default, it uses the same name servers as
the operating system. Unfortunately, I quickly remembered the reason I needed
to use an external DNS server: the internal name servers have different
records for _pyrocufflink.blue_.

I remembered reading about using CNAME records to "redirect" ACME challenges to another domain, so I thought I would try that for _pyrocufflink.blue_:
```
_acme-challenge CNAME 5 _acme-challenge.o-ak4p9kqlmt5uuc.com
```
This _should_ tell Let's Encrypt to look for its TXT record in the
_o-ak4p9kqlmt5uuc.com_ domain instead of the _pyrocufflink.blue_ domain.
Unfortunately, it seems that `lego` does not support this for Namecheap, even
with `LEGO_EXPERIMENTAL_CNAME_SUPPORT=true`.
In any case, I later discovered that this would not have helped.
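
One way to check how such a delegation resolves from the outside is to ask a public resolver for the CNAME. This is only a sketch: the helper name `acme_challenge_target` is hypothetical, the default resolver (Quad9 at 9.9.9.9) is an assumption, and it has to run from a host that is allowed to make outbound DNS queries.

```sh
# Hypothetical helper: print the CNAME target of a domain's _acme-challenge
# label, with the trailing dot removed. Prints nothing if there is no CNAME.
acme_challenge_target() {
    domain=$1               # e.g. pyrocufflink.blue
    server=${2:-9.9.9.9}    # public resolver to ask (assumed default: Quad9)
    dig +short CNAME "_acme-challenge.$domain" "@$server" | sed 's/\.$//'
}
```

For example, `acme_challenge_target pyrocufflink.blue` would be expected to print the target from the record above if the CNAME is published in the public zone.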
## Attempt 2: DNS-over-HTTPS Proxy
Since I couldn't get `lego` to work with the CNAME trick, I decided to try
using a DNS-over-HTTPS (DoH) proxy to tunnel DNS queries to an external name
server. I looked at `dnscrypt-proxy` and `cloudflared`, as these were the only
two implementations of DNS-to-DoH proxies I could find. `cloudflared` is
simple and requires no configuration, but it's a 40 MB binary.
`dnscrypt-proxy`, on the other hand, is a bit smaller (10 MB) but more
complicated to run. It requires a configuration file and at least one
reference to a list of public resolvers, which it must fetch and load when it
starts up.

I made some modifications to the CI pipeline to support starting and stopping
the DoH proxy, and configured `lego` to send its queries there instead.
Unfortunately, this didn't work, either. It turns out `lego` only uses the
configured name server to find the `NS` records for the domain in question.
Once it gets the names of the authoritative name servers, it sends queries to
them _directly_, NOT through the configured server.

I was able to determine this by watching the network traffic with `tshark` for
both "normal" DNS and DoH-proxied DNS:
```sh
tshark -i any port domain
```
```sh
tshark -i lo -d tcp.port==5053,dns -d udp.port==5053,dns port 5053
```
(Port 5053 is where `dnscrypt-proxy` is listening.)

I could see `lego` making TXT and NS record requests to `dnscrypt-proxy`, and
then switching to making TXT record requests to external servers. I am not
sure why it bothers making the initial TXT request, since it does not seem to
care about the result, whether it is correct or not.
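
The two-step lookup seen in the capture can be approximated by hand with `dig`. The sketch below is a reconstruction of the observed traffic, not `lego`'s code: the function name `check_authoritative_txt` and its arguments are hypothetical. It asks the configured resolver (for example, the DoH proxy) for the zone's NS records, then sends the TXT query straight to each authoritative server.

```sh
# Hypothetical helper mirroring the observed behavior:
#   1. ask the configured resolver for the zone's NS records
#   2. query each authoritative server directly for the TXT record
check_authoritative_txt() {
    zone=$1          # e.g. example.com
    name=$2          # e.g. _acme-challenge.example.com
    resolver=$3      # configured resolver, e.g. 127.0.0.1 for a local proxy
    port=${4:-53}    # resolver port, e.g. 5053 for the proxy

    # Drop dig's ";;" diagnostic lines so only server names remain.
    for ns in $(dig -p "$port" +short NS "$zone" "@$resolver" | grep -v '^;'); do
        # This query bypasses the resolver and goes to the name server itself.
        answer=$(dig +short +time=2 +tries=1 TXT "$name" "@$ns")
        printf '%s %s\n' "$ns" "${answer:-(no answer)}"
    done
}
```

Running something like `check_authoritative_txt example.com _acme-challenge.example.com 127.0.0.1 5053` on a network that blocks outbound DNS should show the same split: the NS lookup succeeds through the proxy, while the direct TXT queries cannot reach the authoritative servers.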
## Temporary Solution
I am not sure exactly where to go from here. It seems `lego` is simply
incompatible with strict DNS egress filtering. I will most likely need to find
an alternate ACME client that:

1. Supports the Namecheap API
2. Works without access to the authoritative name servers
3. Is simple enough to install that it can be run from a Jenkins job

Alternatively, I may investigate
[acme-dns](https://github.com/joohoi/acme-dns). I may be able to use CNAME
records in the target domains, pointing to a (sub-)domain hosted by
_acme-dns_, to get `lego` to work correctly. I would just have to make sure
that the server is accessible both internally and externally.

In the meantime, I have added firewall rules to allow outbound DNS **to
Namecheap servers only**.