The "Kubernetes" subnet is a /27, not a /28. There are hosts in that
upper section that was masked out, and these were unable to send e-mails
via the relay because they were excluded from the `mynetworks` value.
Specifically to allow the Synology to synchronize its clock, as it only
has an IPv6 address.
We also need to explicitly override `chrony_servers` to an empty list
for the firewall itself, since it syncs with the NTP pool, rather than
its next hop router.
Since we use the proxy when PXE booting to speed up Live OS image and
RPM package downloads, we need to allow machines using it to access the
kickstart files which are now hosted on the PXE server. Virtual
machines on the Kubernetes network (_pyrocufflink.black_ also need
access to those kickstarts, so we need to mark that subnet as trusted.
I've move the Unifi controller back to running on a Fedora Linux
machine. It therefore needs access to Fedora RPM repositories, as well
as the internal "dch" RPM repository, for system packages.
I also created a new custom container image for the Unifi Network
software (the linuxserver.io one sucks), so the server needs access to
the OCI repo on Gitea.
I continually struggle with machines' (physical and virtual, even the
Roku devices!) clocks getting out of sync. I have been putting off
fixing this because I wanted to set up a Windows-compatible NTP server
(i.e. on the domain controllers, with Kerberos signing), but there's
really no reason to wait for that to fix the clocks on all the
non-Windows machines, especially since there are exactly 0 Windows
machines on the network right now.
The *chrony* role and corresponding `chrony.yml` playbook are generic,
configured via the `chrony_pools`, `chrony_servers`, and `chrony_allow`
variables. The values for these variables will configure the firewall
to act as an NTP server, synchronizing with the NTP pool on the
Internet, while all other machines will synchronize with it. This
allows machines on networks without Internet access to keep their clocks
in sync.
Since the canonical location for Anaconda kickstart scripts is now
Gitea, we need to allow hosts to access them from there.
Also allowing access from the _pyrocufflink.red_ network for e.g.
installation testing.
Now that we have the serial terminal server managing `picocom` processes
for each serial port, and those `picocom` processes are configured to
log console output to files, we can configure Promtail to scrape these
log files and send them to Loki.
Gitea and Vaultwarden both have SQLite databases. We'll need to add
some logic to ensure these are in a consistent state before beginning
the backup. Fortunately, neither of them are very busy databases, so
the likelihood of an issue is pretty low. It's definitely more
important to get backups going again sooner, and we can deal with that
later.
Frigate uses the Github API to check for new releases. It then
populates the `update.frigate_server` entity in Home Assistant via MQTT
with the information it retrieved. If it is unable to access the Github
API, the Home Assistant entity will be marked as "unavailable," which
triggers an alert notification from Home Assistant. Thus, we need to
allow Frigate to access Github if we want to use that entity as an
indicator of whether or not Frigate is connected to the MQTT broker.
I don't want to allow access to the Github API to everything on the
Frigate server, just Frigate itself. To do that, I've assigned a unique
username and password for Frigate. Only requests with the proper
`Proxy-Authorization` header will be allowed access. By providing the
credentials only the Frigate container, we can ensure no other process
has access.
I think I did this mostly as an exercise; there's no particular reason
to disallow access to the Github API, since it's mostly read-only and
can't really be used to exfiltrate any data (probably?).
*nvr2.pyrocufflink.blue* originally ran Fedora CoreOS. Since I'm tired
of the tedium and difficulty involved in making configuration changes to
FCOS machines, I am migrating it to Fedora Linux, managed by Ansible.
The Frigate NVR servers, prod & test, need to be able to access Fedora
COPR (for the *gasket-dkms* package) and Github Container Registry (for
Frigate itself).
The UniFi Network server needs to be able access the
_linuxserver.io_/GitHub and Docker Hub OCI image registries for the
Unifi Network and Caddy container images, respectively.
All data have been migrated from the PostgreSQL server in Kubernetes and
the three applications that used it (Firefly-III, Authelia, and Home
Assistant) have been updated to point to the new server.
To avoid comingling the backups from the old server with those from the
new server, we're reconfiguring WAL-G to push and pull from a new S3
prefix.
*db0.pyrocufflink.blue* will be the primary server in the new PostgreSQL
database cluster. We're starting with Fedora 39 so we can have
PostgreSQL 15, to match the version managed by the Postgres Operator in
the Kubernetes cluster right now.
Files in the Nextcloud trash bin do not need to be backed up. They are
often large (i.e. my Signal backups), and presumably, they are not
needed anyway; why would they be in the trash otherwise?
Installing Fedora on a bunch of machines, simultaneously or in rapid
succession, can be painfully slow, as several large files need to be
downloaded. To speed this up, we download those files via the proxy and
cache them on the proxy server.
As a side-effect, the proxy needs to allow access to the Kickstart
"server" (i.e. my workstation, at least for now), since Anaconda will
use the configured proxy for everything it downloads.
*unifi2.pyrocufflink.blue*, which is connected to the management
network, can only access the Internet via the proxy. In order for
Zincati/`rpm-ostree` to automatically update the machine, the proxy
needs to allow access to the FCOS update servers.
*dnf-automatic* is an add-on for `dnf` that performs scheduled,
automatic updates. It works pretty much how I would want it to:
triggered by a systemd timer, sends email reports upon completion, and
only reboots for kernel et al. updates.
In its default configuration, `dnf-automatic.timer` fires every day. I
want machines to update weekly, but I want them to update on different
days (so as to avoid issues if all the machines reboot at once). Thus,
the _dnf-automatic_ role uses a systemd unit extension to change the
schedule. The day-of-the-week is chosen pseudo-randomly based on the
host name of the managed system.
The BIND server on the firewall is configured to write query logs and
RPZ rewrite logs to files under `/var/log/named`. We can scrape these
logs with Promtail and use the messages for analytics on the DNS-based
firewall, etc.
This machine is _not_ a member of the _pyrocufflink.blue_ AD domain, so
it does not inherit the settings from that group. Also, Jenkins does
not manage it, so only my personal keys are authorized.
Running Squid on the firewall makes sense; it's a sort of layer-7
firewall, after all. There's not much storage on that machine, though
so we don't really want to cache anything. In fact, it's only purpose
is to allow very limited web access for certain applications. All
outbound traffic is blocked, with two exceptions:
* Fedora package repositories (for the UniFi controller server)
* Google Fonts (for Invoice Ninja)
Unfortunately, the automatic transfer switch does not seem to work
correctly. When the standby source is a UPS running on battery, it does
*not* switch sources if the primary fails. In other words, when the
power is out and both UPS are running on battery, when the first one
dies, it will NOT switch to the second one. It has no trouble switching
when the second source is mains power, though, which is very strange.
I have tried messing with all the settings including nominal input
voltage, sensitivity, and frequency tolerence, but none seem to have any
effect.
Since it is more important for the machines to shut down safely than it
is to have an extra 10-15 minutes of runtime during an outage, the best
solution for now is to configure the hosts to shut down as soon as the
first UPS battery gets low. This is largely a waste of the second UPS,
but at least it will help prevent data loss.
The automatic transfer switch does not seem to work reliably when both
UPS sources are running on battery. This means all systems lose power
after the first UPS battery dies, even though the second UPS is still
online. To minimize the risk of data loss, at least until I figure out
what's wrong, I want both VM hosts to shut down as soon as the first UPS
signals that its battery is low.
`upsmon` is the component of [NUT] that monitors (local or remote) UPS
devices and reacts to changes in their state. Notably, it is
responsible for powering down the system when there is insufficient
power to the system.
AWS is going to begin charging extra for routable IPv4 addresses soon.
There's really no point in having a relay in the cloud anymore anyway,
since a) all outbound messages are sent via the local relay and b) no
messages are sent to anyone except me.
We're going to run MinIO on the BURP server to provide a backup target
for the [Postgres Operator][0]/[WAL-E][1]. Although the Postgres
Operator also supports backups via [WAL-G][2], which supports more
backup targets like SFTP, the operator does not support restoring from
those targets. As such, the best way to get fully-featured backups for
the Postgres Operator, including environment cloning, etc., is to use
S3. Since I absolutely do not want to store my backups "in the cloud,"
using MinIO seems a decent alternative. Running it on the BURP server
allows the backups to be stored and rotated along with regular system
backups.
[0]: https://github.com/zalando/postgres-operator/
[1]: https://github.com/wal-e/wal-e
[2]: https://github.com/wal-g/wal-g
The BURP storage volume is now backed by a Linux MD RAID array, so we
want to monitor its state. Furthermore, since this machine is a
physical device, we should monitor its thermal characteristics as well.
I moved the metrics Pi from the red network to the blue network. I
started to get uncormfortable with the firewall changes that were
required to host a service on the red network. I think it makes the
most sense to define the red network as egress only.
*mtrcs0.pyrocufflink.red* is a Raspberry Pi CM4 on a Waveshare
CM4-IO-BASE-B carrier board with a NVMe SSD. It runs a custom OS built
using Buildroot, and is not a member of the *pyrocufflink.blue* AD
domain.
*mtrcs0.p.r* hosts Victoria Metrics/`vmagent`, `vmalert`, AlertManager,
and Grafana. I've created a unique group and playbook for it,
*metricspi*, to manage all these applications together.
It seems with each new release of Fedora, some feature or other of
*collectd* gets broken. In Feodra 36, the *interfaces* plugin does not
seem to work reliably, and the *md* plugin logs a *lot* of errors.
While these issues are investigated upstream, we either need to manage
our own policy for collectd or mark the `collectd_t` domain permissive.
I chose the latter because I'm lazy and I don't consider collectd to be
that big of a threat to security.
*nvr1.pyrocufflink.blue* is the new video recording server. It is a
1U rack-mounted physical machine based on the [Jetway
JBC150F596-3160-B][0] barebone system. It replaces
*nvr0.pyrocufflink.blue* in this role.
[0]: https://www.jetwaycomputer.com/JBC150F596.html