There's really no reason why *install-packages.service* needs to
complete before users can log in. Indeed, being able to log in while it
is running may be necessary in order to troubleshoot issues.
The `flash.zsh` script now takes an optional `--image-url` argument,
which can be used to specify a different FCOS base image, for example to
use a custom image or simply to avoid downloading the same image from the
Internet repeatedly.
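For example (the URL and any remaining arguments here are purely
illustrative, not the script's actual interface):

```
# Flash using a locally mirrored FCOS image instead of the default download.
./flash.zsh --image-url https://mirror.example.com/fedora-coreos-aarch64.raw.xz ...
```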
I think I have finally decided that I want *collectd* to run in a
container on FCOS machines. It's much easier and quicker to deploy and
configure that way. The only drawback is how filesystems are monitored,
but I think I am okay with `ReportByDevice` now. In fact, I might even
like it better, since container hosts have *tons* of redundant mounts
that add noise to the disk usage charts.
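For reference, the relevant *df* plugin option looks like this; the
surrounding configuration is only a sketch:

```
# collectd.conf fragment (sketch)
<Plugin df>
  # Name filesystems by block device instead of by mount point.
  ReportByDevice true
</Plugin>
```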
When Fedora CoreOS first boots, Ignition modifies the partition table,
either to add partitions as requested in the config, or just to resize
the root filesystem. In any case, this has the side effect of erasing
the hybrid MBR partition table. If the hybrid MBR table is missing or
incorrect, Raspberry Pi 2 and 3 devices will not be able to boot. We
must therefore rebuild the missing table on first boot after Ignition
has run.
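A minimal sketch of that first-boot repair, assuming `sgdisk` is available
and that the partitions the Pi firmware needs to see are the first few
entries (the partition numbers and device path are assumptions):

```
# Recreate the hybrid MBR that Ignition's partition changes wiped out.
sgdisk --hybrid=1:2:3 /dev/mmcblk0
```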
The *apply-config-policy* service does what it says on the tin. It
fetches the *cfg.git* repository and applies the configuration policy
therein for the current host. This is a privileged container with
practically all isolation disabled, to allow the configuration tools to
manage the system.
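Roughly, the container is launched with most of Podman's isolation switched
off, along the lines of this sketch (the image name and the `/:/host` mount
are assumptions):

```
podman run --rm --privileged \
    --network=host --pid=host --ipc=host \
    --security-opt label=disable \
    --volume /:/host \
    registry.example.com/apply-config-policy:latest
```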
Installing packages on the host system via `rpm-ostree` is *insanely*
slow, especially on Raspberry Pi devices. The main reason I chose to go
that route for managing the SSH host certificates was to avoid having to
maintain the systemd units in multiple places. I think the trade-off is
worth it, though; bringing up a new Raspberry Pi is significantly
faster, by 15+ minutes, if we do not have to wait for `rpm-ostree` at
all.
Fedora CoreOS can be provisioned on a QEMU virtual machine by providing
the Ignition configuration via a `fw_cfg` entry. Unfortunately, the
`string` method does not work with JSON values, so we have to use
`file`. The configuration file has to be uploaded via SFTP, rather than
`virsh vol-import`, since the latter would create the file with the
wrong permissions, and QEMU does not automatically adjust the
permissions of files used this way (like it does for disks).
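For reference, this is what the `fw_cfg` entry looks like when passed
straight to QEMU; libvirt needs the equivalent via its QEMU command-line
passthrough (the file path is illustrative):

```
# Added to the usual QEMU invocation; the guest reads the config from fw_cfg.
-fw_cfg name=opt/com.coreos/config,file=/var/lib/libvirt/ignition/host.ign
```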
Bind-mount subdirectories of `/etc/nginx` individually so the
non-configuration files (e.g. MIME type database) distributed with the
container image are available.
Fix permissions of `/var/cache/nginx` and put PID file there.
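A sketch of the resulting arrangement; the subdirectory names and image are
assumptions, and the nginx configuration sets `pid /var/cache/nginx/nginx.pid;`
so the PID file lands on the writable cache volume:

```
# Mount only the configuration subdirectories, leaving the image's own
# /etc/nginx files (mime.types, fastcgi_params, ...) in place.
podman run --rm \
    --volume /etc/nginx/conf.d:/etc/nginx/conf.d:ro \
    --volume /etc/nginx/certs:/etc/nginx/certs:ro \
    --volume /var/cache/nginx:/var/cache/nginx \
    docker.io/library/nginx:stable
```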
The packages for the Kubelet are now installed by the
*install-packages* service, so they can be processed in the same
transaction as other packages (e.g. collectd).
Units that get installed via `rpm-ostree` on first boot cannot be
enabled by ignition, because they do not exist when it runs `systemctl
preset`. Thus, anything we want to start after it has been installed needs
to be explicitly started. To allow this in an extensible fashion, I've
added an `after-install.target` unit and modified the
`install-packages.sh` script to activate this unit once the installation
is complete. The script also re-runs `systemctl preset`, so services
will start automatically on subsequent boots.
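Units that should start right after installation hook onto the target (e.g.
via `WantedBy=after-install.target`); the tail of the script then looks
roughly like the following sketch, with the unit names as placeholders:

```
# End of install-packages.sh (sketch): pick up the freshly installed units,
# apply their presets so they start on later boots, then kick off anything
# hooked onto after-install.target for this boot.
systemctl daemon-reload
systemctl preset kubelet.service collectd.service
systemctl start --no-block after-install.target
```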
The *install-packages.service* unit has to be enabled, and the condition
checking for `/etc/ignition/packages.installed` was inverted.
Sending standard output to the console as well as the journal allows
watching progress.
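In unit-file terms the fix amounts to something like this (only the relevant
directives are shown; the install target is an assumption):

```
[Unit]
# Run only until the stamp file exists; the leading '!' negates the check.
ConditionPathExists=!/etc/ignition/packages.installed

[Service]
# Mirror progress to the console in addition to the journal.
StandardOutput=journal+console

[Install]
WantedBy=multi-user.target
```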
The default SELinux policy for *collectd* does not allow it all the
necessary access for the way we use it. Notably, it cannot bind to the
HTTP port to export Prometheus metrics, and it is not allowed to use
netlink to read interface statistics. The latter is not a huge deal, as
it can fall back to the legacy procfs interface, but the former is a
nonstarter.
Eventually, I should write an SELinux module with the correct
permissions (and submit the changes upstream), but for now, we'll just
make the `collectd_t` domain permissive.
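Making the domain permissive is a one-liner, assuming `semanage` (from
*policycoreutils-python-utils*) is available on the host:

```
# Log AVC denials for collectd_t but do not enforce them.
semanage permissive -a collectd_t
```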
Unfortunately, running *collectd* in a container is not going to work.
Although containers can be configured to share some of the host's
namespaces, one notable exception is the mount namespace. Naturally,
containers must have their own mount namespace, which prevents them from
seeing filesystems that are actually mounted on the host. For
*collectd*, this effectively makes the `df` plugin useless, which
ultimately prevents us from monitoring disk space.
This reverts commit 4048e5cc0a.
For some reason, the *zincati.service* unit has an `After=` dependency
on *multi-user.target*. This creates a dependency loop between
*local_exporter.service* and *zincati.service* if the former has an
`After=` dependency on the latter and an (implicit) `Before=` dependency
on *multi-user.target*. systemd will resolve this loop by removing one
or the other unit from the bootup sequence, so either Zincati or the
local exporter will not start at boot.
We can avoid this dependency loop by removing the `After=` dependency
from *local_exporter.service*. This may cause requests for Zincati
metrics to fail if one happens to come in after the local exporter starts
but before Zincati does, but that is unlikely to be an issue in practice.
The *collectd.service* unit may fail for various reasons. Notably, if
the container image is not present, it may fail to start if it is
activated before the network is fully available. Using systemd's
automatic restart mechanism will help ensure *collectd* is running
whenever possible.
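Concretely, this is just systemd's standard restart knobs; the interval
below is an arbitrary choice:

```
[Service]
# Retry if collectd exits abnormally, e.g. because the image could not be
# pulled before the network was fully up.
Restart=on-failure
RestartSec=30
```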
Although the official Fedora CoreOS documentation only provides
instructions for running CoreOS on a Raspberry Pi 4, it does actually
work on older boards as well. `coreos-installer` creates a GPT disk
label, which the older devices do not support, but this can be worked
around using a hybrid MBR label.
Unfortunately, after I put all the effort into refactoring this script
and adding support for the older devices, I realized that it was rather
pointless as those boards simply do not have enough memory to be useful
Kubernetes nodes. I was hoping to move the Zigbee and ZWave controllers
to a Raspberry Pi 3, but these processes take way too much memory for
that.
The `common.yaml` Butane configuration file merges in all the other
various Butane configuration files that we want to share among all
CoreOS machines. These include the authorized SSH keys list, collectd
deployment, SSH host certificate configuration, etc.
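Structurally it is just a list of merges. A sketch, where the file names and
spec version are illustrative and the merged files are the compiled Ignition
outputs of the individual fragments:

```
# common.yaml (sketch)
variant: fcos
version: 1.4.0
ignition:
  config:
    merge:
      - local: ssh-keys.ign
      - local: collectd.ign
      - local: ssh-host-certs.ign
```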
Now that we have an internal SSH certificate authority, instead of
explicitly listing all M×N keys for each user and client machine, we can
list only the CA certificate in the SSH authorized keys file for the
*core* user. This will allow any user who presents a valid, signed SSH
certificate for the *core* principal to log in.
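The authorized keys file then reduces to a single `cert-authority` line (the
key material and comment are placeholders):

```
# ~core/.ssh/authorized_keys (sketch)
cert-authority ssh-ed25519 AAAA... sshca@example.com
```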
The `ssh-bootstrap` script, which is run by the *ssh-bootstrap.service*
systemd unit, requests SSH host certificates for each of the existing
SSH host keys. The certificates are issued by the *POST /sshkeys/sign*
operation of the *dch-webhooks* web service.
The *step-ssh-renew* timer/service runs `step ssh renew`, in a
container, on a weekly basis to renew the SSH host certificate. A host
certificate must already exist, and its private key is used to
authenticate to the CA server.
Since `step ssh renew` can only operate on one certificate/key file at a
time, the `step-ssh-renew@.container` file defines a template unit. The
template instance specifies the key type (i.e. `rsa`, `ecdsa`, or
`ed25519`), which in turn defines which certificate and private key file
to use. The timer unit activates a target unit, which depends on the
concrete service units. Note that the target unit must have
`StopWhenUnneeded=yes` so that it can be started again the next time
the timer fires.
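A sketch of the target and timer glue; the service instances come from the
`step-ssh-renew@.container` template, and everything beyond the unit names
and weekly schedule described above is an assumption:

```
# step-ssh-renew.target
[Unit]
Description=Renew all SSH host certificates
# Pull in one instance per key type.
Wants=step-ssh-renew@rsa.service step-ssh-renew@ecdsa.service step-ssh-renew@ed25519.service
# Deactivate once the instances finish so the timer can start it again.
StopWhenUnneeded=yes

# step-ssh-renew.timer
[Timer]
OnCalendar=weekly
Unit=step-ssh-renew.target
```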
Installing packages with `rpm-ostree` is somewhat problematic. Notably,
if a new package needs an update of an already-installed package (e.g.
shared library), the new package cannot be installed until a new version
of CoreOS is published with the updated dependency.
In order for collectd to be effective, the container it runs in has to
have most isolation features disabled. Most importantly, the PID, UTS,
and network namespaces need to be shared with the host, so that
*collectd* can "see" the actual values. Additionally, the default
SELinux policy for containerized processes denies practically all of the
instrumentation syscalls *collectd* needs, so it needs to run in the
unconfined `spc_t` domain. Finally, the `/run` directory needs to be
shared with the host, so *collectd* can communicate with various daemons
via UNIX sockets.
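In Podman terms this boils down to flags along these lines (the image name
is a placeholder):

```
podman run --rm \
    --pid=host --uts=host --network=host \
    --security-opt label=type:spc_t \
    --volume /run:/run \
    registry.example.com/collectd:latest
```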
Zincati provides Prometheus metrics via a Unix socket. In order for
these to be scraped by `vmagent`, they need to be exposed over HTTP.
The `local_exporter` is designed to do exactly that.
Unfortunately, the Zincati metrics socket is only accessible by the
*zincati* user, so the `local_exporter` also needs to run as that user.
Hopefully, the user ID will remain consistent in future versions of
CoreOS.
Using nginx, we can expose the Frigate web server via HTTPS. Since
Frigate has no built-in authentication, we need to use Authelia via the
nginx proxy auth feature.
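A minimal sketch of the nginx side, assuming Frigate listens on its default
port 5000 and that Authelia's verification endpoint is reachable at the
address shown (both are assumptions):

```
location / {
    # Every request is checked against Authelia first.
    auth_request /internal/authelia/verify;
    proxy_pass http://127.0.0.1:5000;
}

location = /internal/authelia/verify {
    internal;
    proxy_pass http://authelia.example.com:9091/api/verify;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
}
```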
Since Fedora CoreOS machines are not managed by Ansible, we need another
way to keep the HTTPS certificate up-to-date. To that end, I've added
the `fetchcert.sh` script, along with a corresponding systemd service
and timer unit, that will fetch the latest certificate from the Secret
resource managed by the Kubernetes API. The script authenticates with
a long-lived bearer token associated with a particular Kubernetes
service account and downloads the current Secret to a local file. If
the certificate in the Secret is different from the one already in
place, the certificate and key files are updated and nginx is reloaded.
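The core of the script is little more than an authenticated GET against the
Secrets API. A sketch, with the API server, namespace, Secret name, file
paths, and reload mechanism all as assumptions:

```
#!/bin/sh
set -eu

APISERVER=https://kubernetes.example.com:6443
TOKEN=$(cat /etc/fetchcert/token)
mkdir -p /run/fetchcert

# Fetch the Secret that holds the current certificate.
curl --silent --fail \
    --cacert /etc/fetchcert/ca.crt \
    --header "Authorization: Bearer ${TOKEN}" \
    "${APISERVER}/api/v1/namespaces/ingress/secrets/frigate-tls" \
    > /run/fetchcert/secret.json

# Only rewrite the files and reload nginx if the certificate changed.
jq -r '.data."tls.crt"' /run/fetchcert/secret.json | base64 -d > /run/fetchcert/tls.crt
if ! cmp -s /run/fetchcert/tls.crt /etc/nginx/certs/tls.crt; then
    jq -r '.data."tls.key"' /run/fetchcert/secret.json | base64 -d > /run/fetchcert/tls.key
    install -m 0644 /run/fetchcert/tls.crt /etc/nginx/certs/tls.crt
    install -m 0600 /run/fetchcert/tls.key /etc/nginx/certs/tls.key
    systemctl reload nginx.service
fi
```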
The `collectd.yaml` Butane configuration fragment configures the machine
to install *collectd* and its various plugin packages directly on the
host using `rpm-ostree` (via *install-packages.service*).
Some machines may need to install multiple packages for separate use
cases. Requiring each use case to define a systemd unit that runs
`rpm-ostree install` directly would be cumbersome and also quite slow,
as each one would have to run in turn. Instead, now there is a single
*install-packages.service* which installs all of the packages listed in
files in `/etc/ignition/packages.d`. On first boot, all files in that
directory are read and all the packages they list will be installed in a
single `rpm-ostree install` invocation.
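The core of `install-packages.sh` is then roughly the following; the exact
`rpm-ostree` flags and stamp-file handling are assumptions beyond what is
described above:

```
# Install every package listed under packages.d in one transaction, then
# record that this has been done so the unit's condition skips it later.
cat /etc/ignition/packages.d/* | xargs rpm-ostree install --apply-live
touch /etc/ignition/packages.installed
```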
When `ProtectSystem` is enabled, systemd sets up a separate mount
namespace for the service. Unfortunately, this appears to interfere
with Podman and prevents it from cleaning up containers on shutdown.
To keep the API key a secret, we're encrypting the environment file in
the repository with GnuPG. The decrypted copy only lives in the work
tree and is never committed. Changes have to be re-encrypted and
committed.
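Workflow-wise this is plain GnuPG; the file names and recipient are
illustrative:

```
# Decrypt into the work tree (never committed):
gpg --decrypt --output frigate.env frigate.env.gpg

# After editing, re-encrypt and commit the ciphertext:
gpg --encrypt --recipient admin@example.com --output frigate.env.gpg frigate.env
```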
Enabling hardware acceleration using VA-API dramatically reduces
`ffmpeg` CPU usage. For this to work, the Frigate container needs
access to the DRI device node.
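That means passing the render node through to the container, e.g. (the
device path is the usual render node; adjust if the host numbers it
differently):

```
podman run ... --device /dev/dri/renderD128 ...
```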
Since *frigate.service* runs as root, the directories created by
`StateDirectory` are owned by root. The processes inside the container,
therefore, cannot access them. Thus, we have to use `systemd-tmpfiles`
to create the state directories with the appropriate permissions.
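A tmpfiles.d sketch; the paths and the UID/GID that container root maps to
on the host are assumptions:

```
# /etc/tmpfiles.d/frigate.conf
d /var/lib/frigate        0750 200000 200000 -
d /var/lib/frigate/media  0750 200000 200000 -
```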
When developing Butane/Ignition files, I frequently forget to update the
parent files after making a change to an included file. This causes a
lot of wasted time re-provisioning, only to discover that my change
did not take effect. To alleviate this, we'll use `make` with some
macro magic to scan the Butane files for their dependencies, and let it
generate whatever Ignition files need updating any time a dependency
changes.
I've also added a "publish" step to the Makefile, since I also
frequently forget to upload the regenerated Ignition files to the
server, causing the same headaches.
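A trimmed-down sketch of the Makefile; the dependency-scanning macro is only
gestured at in a comment, and the publish destination is a placeholder:

```
BUTANE   := $(wildcard *.yaml)
IGNITION := $(BUTANE:.yaml=.ign)

all: $(IGNITION)

# The real Makefile also generates per-file prerequisites by scanning each
# Butane source for the fragments it merges in.
%.ign: %.yaml
	butane --strict --files-dir . --output $@ $<

publish: all
	scp $(IGNITION) webserver.example.com:/srv/ignition/

.PHONY: all publish
```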
The *frigate* container must run as root, so we use a custom user
namespace to map root in the container to an unprivileged user on the
host.
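With Podman that is a `--uidmap`/`--gidmap` pair; the offset is arbitrary:

```
podman run ... --uidmap 0:200000:65536 --gidmap 0:200000:65536 ...
```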
For some reason, Podman (on CoreOS anyway) fails to stop a container
that uses a separate network namespace. It reports "invalid argument"
when attempting to unmount the `netns` file, which then causes the
container to get "stuck" in `Storage` state. Rebooting the host is
apparently the only way to get the container to start again correctly.
Fortunately, there's no particular reason to use an alternate network
namespace for Frigate, so it can use the host's network and avoid this
problem.
The *gasket-driver* container installs the `gasket` and `apex` kernel
modules, which provide the driver for the Google Coral EdgeTPU AI
accelerator module. The container image must be built ahead of time,
of course, and contains modules built for a specific Fedora kernel
version.
The udev rule has two purposes: to set the permissions on the device
node so that any user on the system can access it, and to "tag" the
device so that systemd will generate a `.device` unit for it. The
latter allows other units (e.g. Frigate) to express a `Requires=` and
`After=` dependency on the device unit, so that they do not start until
the driver is loaded.
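The rule itself is tiny; the subsystem match is based on how the Apex driver
names its device nodes, and the mode is whatever open permission is wanted:

```
# /etc/udev/rules.d/65-apex.rules (sketch)
SUBSYSTEM=="apex", MODE="0666", TAG+="systemd"
```

With the `systemd` tag in place, `/dev/apex_0` shows up as a
`dev-apex_0.device` unit that *frigate.service* can order itself after.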