+++
title = "Let's Encrypt Certificates: DNS Blocked"
date = 2020-09-23T23:40:00-05:00
+++

The *certs* Jenkins job has been failing for a while, ever since I blocked outbound DNS traffic to the Internet. The problem is that `lego` repeatedly queries DNS for each domain in the certificate request until it sees the `_acme-challenge` TXT record it created. With DNS traffic blocked, it is never able to contact the configured DNS servers (formerly Cloudflare, now Quad9), so it just waits until its timeout expires.

## Attempt 1: `_acme-challenge` CNAME

At first, I thought `lego` simply needed a DNS server it could reach. I couldn't remember why I had configured it to use a third-party server, so I just disabled that setting. By default, it uses the same name servers as the operating system. Unfortunately, I quickly remembered the reason I needed an external DNS server: the internal name servers have different records for _pyrocufflink.blue_.

I remembered reading about using CNAME records to "redirect" ACME challenges to another domain, so I thought I would try that for _pyrocufflink.blue_:

```
_acme-challenge CNAME 5 _acme-challenge.o-ak4p9kqlmt5uuc.com
```

This _should_ tell Let's Encrypt to look for its TXT record in the _o-ak4p9kqlmt5uuc.com_ domain instead of the _pyrocufflink.blue_ domain. Unfortunately, it seems that `lego` does not support this for Namecheap, even with `LEGO_EXPERIMENTAL_CNAME_SUPPORT=true`. In any case, I later discovered that this would not have helped.

## Attempt 2: DNS-over-HTTPS Proxy

Since I couldn't get `lego` to work with the CNAME trick, I decided to try using a DNS-over-HTTPS (DoH) proxy to tunnel DNS queries to an external name server. I looked at `dnscrypt-proxy` and `cloudflared`, as these were the only two implementations of DNS-to-DoH proxies I could find. `cloudflared` is simple and requires no configuration, but it's a 40 megabyte binary. `dnscrypt-proxy`, on the other hand, is smaller (10 MB) but more complicated to run.
It requires a configuration file and at least one reference to a list of public resolvers, which it must fetch and load when it starts up.

I made some modifications to the CI pipeline to support starting and stopping the DoH proxy, and configured `lego` to send its queries there instead. Unfortunately, this didn't work, either. It turns out `lego` only uses the configured name server to find the `NS` records for the domain in question. Once it gets the names of the authoritative name servers, it sends queries to them _directly_, NOT through the configured server. I was able to determine this by watching the network traffic with `tshark` for both "normal" DNS and DoH-proxied DNS:

```sh
tshark -i any port domain
```

```sh
tshark -i lo -d tcp.port==5053,dns -d udp.port==5053,dns port 5053
```

(port 5053 is where `dnscrypt-proxy` is listening)

I could see `lego` making TXT and NS record requests to `dnscrypt-proxy`, and then switching to making TXT record requests to external servers. I am not sure why it bothers making the initial TXT request, since it does not seem to care whether the result is correct.

## Temporary Solution

I am not sure exactly where to go from here. It seems `lego` is simply incompatible with strict outbound DNS filtering. I will most likely need to find an alternate ACME client that:

1. Supports the Namecheap API
2. Works without access to the authoritative name servers
3. Is simple enough to install that it can be run from a Jenkins job

Alternatively, I may investigate [acme-dns](https://github.com/joohoi/acme-dns). I may be able to use CNAME records in the target domains that point to a (sub-)domain hosted by _acme-dns_ to get `lego` to work correctly. I would just have to make sure that the server is accessible both internally and externally.

In the meantime, I have added firewall rules to allow outbound DNS **to Namecheap servers only**.
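
As a rough illustration of the lookup pattern from Attempt 2, here is a minimal sketch that reproduces it by hand with `dig`. It assumes `dig` is installed, that the DoH proxy is listening on 127.0.0.1 port 5053, and that the argument is the zone apex. The function names and the `RESOLVER`/`RESOLVER_PORT` variables are made up for this example. It only approximates what showed up in the capture and is not `lego`'s actual code.

```sh
#!/bin/sh
# Hypothetical helpers approximating the two-step lookup seen in the capture.
# RESOLVER and RESOLVER_PORT are assumed names; the defaults point at the
# local DoH proxy (127.0.0.1:5053).

# Build the name of the DNS-01 challenge record for a domain,
# dropping a trailing dot if one was given.
challenge_name() {
    printf '_acme-challenge.%s\n' "${1%.}"
}

# Step 1: ask the configured resolver for the zone's authoritative servers.
authoritative_servers() {
    dig +short -p "${RESOLVER_PORT:-5053}" "@${RESOLVER:-127.0.0.1}" "$1" NS
}

# Step 2: query each authoritative server directly (port 53, bypassing the
# configured resolver) for the challenge TXT record.
check_propagation() {
    for ns in $(authoritative_servers "$1"); do
        printf '%s: ' "$ns"
        dig +short "@$ns" "$(challenge_name "$1")" TXT
    done
}
```

Running `check_propagation pyrocufflink.blue` with outbound DNS blocked should show step 1 succeeding through the proxy while the direct queries in step 2 time out, which is consistent with what `tshark` showed.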