51 Commits (master)

Author SHA1 Message Date
Dustin 3bbe380598 install-packages: Do not prevent login
There's really no reason why *install-packages.service* needs to
complete before users can log in.  Indeed, being able to log in while it
is running may be necessary in order to troubleshoot issues.
2024-01-25 20:49:53 -06:00
Dustin 61973c94cf flash: Add option to override console spec
The `flash.zsh` script now takes an optional `--console` argument, which
can be used to override the `console=` kernel command line argument.
2024-01-25 20:06:24 -06:00
Dustin 57815bdcc5 flash: Add option to specify image URL
The `flash.zsh` script now takes an optional `--image-url` argument,
which can be used to specify a different FCOS base image.  This could be
to use a custom image or to simply avoid downloading the same image from
the Internet repeatedly.
2024-01-25 20:06:24 -06:00
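As a rough illustration of the two optional `flash.zsh` flags described above (`--console` and `--image-url`), the sketch below parses them in portable shell. It is not the actual script: the function name `parse_flash_args` and its output format are assumptions, and the real script is written in zsh and may handle options differently.

```shell
# Sketch only: parse the optional --console and --image-url flags.
# Both "--flag value" and "--flag=value" spellings are accepted here;
# whether flash.zsh accepts both is an assumption.
parse_flash_args() {
    console=
    image_url=
    while [ $# -gt 0 ]; do
        case $1 in
            --console)
                [ $# -ge 2 ] || { echo "--console needs a value" >&2; return 2; }
                console=$2; shift 2 ;;
            --console=*)
                console=${1#--console=}; shift ;;
            --image-url)
                [ $# -ge 2 ] || { echo "--image-url needs a value" >&2; return 2; }
                image_url=$2; shift 2 ;;
            --image-url=*)
                image_url=${1#--image-url=}; shift ;;
            --)
                shift; break ;;
            -*)
                echo "unknown option: $1" >&2; return 2 ;;
            *)
                # First positional argument (e.g. the target device): stop.
                break ;;
        esac
    done
    # Report what was parsed; empty values mean "use the default".
    printf 'console=%s\nimage_url=%s\n' "$console" "$image_url"
}
```

An empty `console` or `image_url` would leave the script's defaults in place, which keeps both flags optional.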
Dustin eb0430392e install-packages: Exit on error
The machine gets into a pretty weird state if `install-packages.sh`
fails but continues running.
2024-01-25 20:06:24 -06:00
Dustin 9e790d055c common: Do not install collectd
I think I have finally decided that I want *collectd* to run in a
container on FCOS machines.  It's much easier and quicker to deploy and
configure that way.  The only drawback is how filesystems are monitored,
but I think I am okay with `ReportByDevice` now.  In fact, I might even
like it better, since container hosts have *tons* of redundant mounts
that add noise to the disk usage charts.
2024-01-25 20:06:24 -06:00
Dustin 17ba7d9d03 serial1: Add config for serial console machine 2024-01-25 20:06:24 -06:00
Dustin 91af50acc2 nut0: Add host Ignition config
*nut0.pyrocufflink.blue* is a Raspberry Pi 3 that will run `upsd` and
`upsmon` to monitor and control the UPS on the server rack.
2024-01-19 21:56:52 -06:00
Dustin 05d5312382 fix-hybrid-mbr: Fix Hybrid GPT/MBR on RPi3
When Fedora CoreOS first boots, Ignition modifies the partition table,
either to add partitions as requested in the config, or just to resize
the root filesystem.  In any case, this has the side effect of erasing
the hybrid MBR partition table.  If the hybrid MBR table is missing or
incorrect, Raspberry Pi 2 and 3 devices will not be able to boot.  We
must therefore rebuild the missing table on first boot after Ignition
has run.
2024-01-17 20:30:34 -06:00
Dustin 196ce46d54 cfg: Add apply-config-policy container unit
The *apply-config-policy* service does what it says on the tin.  It
fetches the *cfg.git* repository and applies the configuration policy
therein for the current host.  This is a privileged container with
practically all isolation disabled, to allow the configuration tools to
manage the system.
2024-01-17 20:30:34 -06:00
Dustin 647cdb8346 ssh-host-certs: Run sshca-cli from a container
Installing packages on the host system via `rpm-ostree` is _insanely_
slow, especially on Raspberry Pi devices.  The main reason I chose to go
that route for managing the SSH host certificates was to avoid having to
maintain the systemd units in multiple places.  I think the trade-off is
worth it, though; bringing up a new Raspberry Pi is significantly
faster, by 15+ minutes, if we do not have to wait for `rpm-ostree` at
all.
2024-01-17 20:30:34 -06:00
Dustin fd7778c01a mkvm: Add script to create FCOS VM
Fedora CoreOS can be provisioned on a QEMU virtual machine by providing
the Ignition configuration via `fw_cfg` value.  Unfortunately, the
`string` method does not work with JSON values, so we have to use
`file`.  The configuration file has to be uploaded via SFTP, rather than
`virsh vol-import`, since the latter would create the file with the
wrong permissions, and QEMU does not automatically adjust the
permissions of files used this way (like it does for disks).
2024-01-06 20:49:31 -06:00
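A minimal sketch of the `fw_cfg` approach described above, assuming the `opt/com.coreos/config` key that Fedora CoreOS reads its Ignition configuration from. The helper name `fw_cfg_arg` and the example path are hypothetical; this is not the `mkvm` script itself.

```shell
# Sketch only: build the QEMU argument that hands an Ignition file to
# the guest via fw_cfg, using the "file" method rather than "string".
fw_cfg_arg() {
    case $1 in
        # QEMU splits option values on commas, so refuse such paths
        # instead of trying to escape them.
        *,*) echo "path must not contain a comma: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "-fw_cfg name=opt/com.coreos/config,file=$1"
}

# Hypothetical usage with virt-install (the path is an assumption):
#   virt-install ... \
#     --qemu-commandline="$(fw_cfg_arg /var/lib/libvirt/images/k8s-amd64-n3.ign)"
```

The file still has to be readable by the QEMU process on the hypervisor, which is the permissions issue the commit message mentions.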
Dustin bdf31d7d1f k8s-amd64-n3: Add new K8s VM node
The three x86_64 Kubernetes nodes are starting to get full.  Adding
another VM will allow pods to be spread thinner.
2024-01-06 20:46:25 -06:00
Dustin 6dfde32a5e Switch from Step CA to SSHCA
SSH host certificates are now issued by SSHCA.  The *sshca-cli-systemd*
package contains the appropriate systemd units for it.
2024-01-06 19:57:48 -06:00
Dustin 78f9284f33 nginx: Fix configuration
Bind-mount subdirectories of `/etc/nginx` individually so the
non-configuration files (e.g. MIME type database) distributed with the
container image are available.

Fix permissions of `/var/cache/nginx` and put PID file there.
2024-01-06 19:50:42 -06:00
Dustin 910c7c56c9 local_exporter: Start after network online
The *local_exporter.service* cannot start on first boot without the
network, as it needs to pull the container image.
2024-01-06 19:49:41 -06:00
Dustin 7926769528 kubelet: Use install-packages service
The packages for the Kubelet are now installed by the
*install-packages* service, so they can be processed in the same
transaction as other packages (e.g. collectd).
2024-01-06 19:48:31 -06:00
Dustin bdeb44ae36 collectd: Start after install
The *collectd.service* unit is now started automatically after it is
installed on first boot.
2024-01-06 19:47:07 -06:00
Dustin ac6c31c5d8 packages: Add after-install target unit
Units that get installed via `rpm-ostree` on first boot cannot be
enabled by ignition, because they do not exist when it runs `systemctl
preset`.  Thus, anything we want to start after it's been installed needs
to be explicitly started.  To allow this in an extensible fashion, I've
added an `after-install.target` unit and modified the
`install-packages.sh` script to activate this unit once the installation
is complete.  The script also re-runs `systemctl preset`, so services
will start automatically on subsequent boots.
2024-01-06 19:43:08 -06:00
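The post-install steps described for `install-packages.sh` might look roughly like the sketch below. The function name `finish_install` and the `RUN` dry-run hook are hypothetical. The marker path is taken from a later commit in this log, and using `systemctl preset-all` for "re-runs `systemctl preset`" is an assumption.

```shell
# Sketch only: what could happen once the package installation is done.
# RUN is a hypothetical hook: set RUN=echo to print the commands
# instead of executing them.
RUN=${RUN:-}

finish_install() {
    marker=${1:-/etc/ignition/packages.installed}
    # Record that the first-boot installation has completed.
    $RUN touch "$marker"
    # Re-apply presets so newly installed units start on later boots.
    $RUN systemctl preset-all
    # Start everything hooked onto after-install.target right now.
    $RUN systemctl start --no-block after-install.target
}
```

Units that should start right after installation would then declare themselves as wanted by `after-install.target` instead of being started individually by the script.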
Dustin 9d941a9985 packages: Fix service start on first boot
The *install-packages.service* unit has to be enabled, and the condition
checking for `/etc/ignition/packages.installed` was inverted.
Sending standard output to the console as well as the journal allows
watching progress.
2024-01-06 19:41:07 -06:00
Dustin 1cdd12454f collectd: Set collectd_t domain permissive
The default SELinux policy for *collectd* does not allow it all the
necessary access for the way we use it.  Notably, it cannot bind to the
HTTP port to export Prometheus metrics, and it is not allowed to use
netlink to read interface statistics.  The latter is not a huge deal, as
it can fall back to the legacy procfs interface, but the former is a
nonstarter.

Eventually, I should write an SELinux module with the correct
permissions (and submit the changes upstream), but for now, we'll just
make the `collectd_t` domain permissive.
2023-10-04 21:01:38 -05:00
Dustin fb9684fa93 k8s-aarch6-n1: Add new Kubernetes node
This node provides an ARM64 build environment.
2023-10-03 19:59:14 -05:00
Dustin b5455e519a Revert "collectd: Run collectd in privileged container"
Unfortunately, running *collectd* in a container is not going to work.
Although containers can be configured to share some of the host's
namespaces, one notable exception is the mount namespace.  Naturally,
containers must have their own mount namespace, which prevents them from
seeing filesystems that are actually mounted on the host.  For
*collectd*, this effectively makes the `df` plugin useless, which
ultimately prevents us from monitoring disk space.

This reverts commit 4048e5cc0a.
2023-10-04 20:50:30 -05:00
Dustin 5862ff4cc2 local_exporter: Remove After=zincati dependency
For some reason, the *zincati.service* unit has an `After=` dependency
on *multi-user.target*.  This creates a dependency loop between
*local_exporter.service* and *zincati.service* if the former has an
`After=` dependency on the latter and an (implicit) `Before=` dependency
on *multi-user.target*.  systemd will resolve this loop by removing one
or the other unit from the bootup sequence, so either Zincati or the
local exporter will not start at boot.

We can avoid this dependency loop by removing the `After=` dependency
from *local_exporter.service*.  This may cause requests for Zincati
metrics to fail if they happen to arrive after the local exporter starts
but before Zincati does, but this is unlikely to actually be an issue.
2023-10-04 20:50:30 -05:00
Dustin dd3be7a24a collectd: Restart service automatically
The *collectd.service* unit may fail for various reasons.  Notably, if
the container image is not present, it may fail to start if it is
activated before the network is fully available.  Using systemd's
automatic restart mechanism will help ensure *collectd* is running
whenever possible.
2023-10-04 20:50:30 -05:00
Dustin 40bde4df26 flash: Clean up/add support for RPi 3
Although the official Fedora CoreOS documentation only provides
instructions for running CoreOS on a Raspberry Pi 4, it does actually
work on older boards as well.  `coreos-installer` creates a GPT disk
label, which the older devices do not support, but this can be worked
around using a hybrid MBR label.

Unfortunately, after I put all the effort into refactoring this script
and adding support for the older devices, I realized that it was rather
pointless as those boards simply do not have enough memory to be useful
Kubernetes nodes.  I was hoping to move the Zigbee and ZWave controllers
to a Raspberry Pi 3, but these processes take way too much memory for
that.
2023-10-04 20:50:30 -05:00
Dustin 364f4fed50 common: Add config shared by all hosts
The `common.yaml` Butane configuration file merges in all the other
various Butane configuration files that we want to share amongst all
CoreOS machines.  These include the authorized SSH keys list, collectd
deployment, SSH host certificate configuration, etc.
2023-10-03 20:07:29 -05:00
Dustin 859deb0664 sshkeys: Trust certificates issued by the CA
Now that we have an internal SSH certificate authority, instead of
explicitly listing all M×N keys for each user and client machine, we can
list only the CA certificate in the SSH authorized keys file for the
*core* user.  This will allow any user who presents a valid, signed SSH
certificate for the *core* principal to log in.
2023-10-03 20:06:37 -05:00
Dustin 88f165363d step-ssh: Automatically issue/renew SSH host certs
The `ssh-bootstrap` script, which is run by the *ssh-bootstrap.service*
systemd unit, requests SSH host certificates for each of the existing
SSH host keys.  The certificates are issued by the *POST /sshkeys/sign*
operation of the *dch-webhooks* web service.

The *step-ssh-renew* timer/service runs `step ssh renew`, in a
container, on a weekly basis to renew the SSH host certificate.  A host
certificate must already exist, and its private key is used to
authenticate to the CA server.

Since `step ssh renew` can only operate on one certificate/key file at a
time, the `step-ssh-renew@.container` defines a template unit.  The
template instance specifies the key type (i.e. `rsa`, `ecdsa`, or
`ed25519`), which in turn defines which certificate and private key file
to use.  The timer unit activates a target unit, which depends on the
concrete service units.  Note that the target unit must have
`StopWhenUnneeded=yes` so that it can be restarted again the next time
the timer fires.
2023-10-03 20:06:37 -05:00
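To illustrate how a template instance name can select the certificate and key files, here is a small sketch. The helper name `host_cert_paths` is hypothetical, and the paths assume OpenSSH's default host key naming under `/etc/ssh`, which the commit does not confirm.

```shell
# Sketch only: map a key type (the template instance name) to the
# certificate and private key files that a renewal would operate on.
host_cert_paths() {
    keytype=$1
    case $keytype in
        rsa|ecdsa|ed25519) ;;
        *) echo "unsupported key type: $keytype" >&2; return 1 ;;
    esac
    key=/etc/ssh/ssh_host_${keytype}_key
    # Print "<certificate> <private key>" on one line.
    printf '%s %s\n' "${key}-cert.pub" "$key"
}

# Hypothetical usage: list the file pairs for all three instances.
#   for t in rsa ecdsa ed25519; do host_cert_paths "$t"; done
```

Because each instance maps to exactly one certificate/key pair, the one-file-at-a-time limitation of `step ssh renew` is handled by running one instance per key type.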
Dustin 4048e5cc0a collectd: Run collectd in privileged container
Installing packages with `rpm-ostree` is somewhat problematic.  Notably,
if a new package needs an update of an already-installed package (e.g.
shared library), the new package cannot be installed until a new version
of CoreOS is published with the updated dependency.

In order for collectd to be effective, the container it runs in has to
have most isolation features disabled.  Most importantly, the PID, UTS,
and network namespaces need to be shared with the host, so that
*collectd* can "see" the actual values.  Additionally, the default
SELinux policy for containerized processes denies practically all of the
instrumentation syscalls *collectd* needs, so it needs to run in the
unconfined `spc_t` domain.  Finally, the `/run` directory needs to be
shared with the host, so *collectd* can communicate with various daemons
via UNIX sockets.
2023-10-03 20:03:21 -05:00
Dustin ebdf587de1 local_exporter: Exporter for Zincati metrics
Zincati provides Prometheus metrics via a Unix socket.  In order for
these to be scraped by `vmagent`, they need to be exposed over HTTP.
The `local_exporter` is designed to do specifically this.

Unfortunately, the Zincati metrics socket is only accessible by the
*zincati* user, so the `local_exporter` also needs to run as that user.
Hopefully, the user ID will remain consistent in future versions of
CoreOS.
2023-10-03 15:29:58 -05:00
Dustin 517151f2c8 sshkeys: Add Luma's SSH public key 2023-09-21 22:34:14 -05:00
Dustin cb282f0bce nvr1: Deploy notify-shutdown service 2023-09-21 22:34:14 -05:00
Dustin 11cd8ce8e9 notify-shutdown: Send a message on shutdown
Since Fedora CoreOS machines tend to reboot at seemingly random times
to apply updates, it would be nice to get a notification when they go
down.
2023-09-21 22:34:14 -05:00
Dustin 8828bb3069 nvr1: Deploy nginx
Deploying nginx on the NVR server to proxy for Frigate.
2023-09-21 22:34:14 -05:00
Dustin 9fd3aa0cd3 frigate: Configure nginx reverse proxy
Using nginx, we can expose the Frigate web server via HTTPS.  Since
Frigate has no built-in authentication, we need to use Authelia via the
nginx proxy auth feature.
2023-09-21 22:32:59 -05:00
Dustin d907b47db1 fetchcert: Add script to fetch certs from K8s
Since Fedora CoreOS machines are not managed by Ansible, we need another
way to keep the HTTPS certificate up-to-date.  To that end, I've added
the `fetchcert.sh` script, along with a corresponding systemd service
and timer unit, that will fetch the latest certificate from the Secret
resource managed by the Kubernetes API.  The script authenticates with
a long-lived bearer token associated with a particular Kubernetes
service account and downloads the current Secret to a local file.  If
the certificate in the Secret is different than the one already in
place, the certificate and key files are updated and nginx is reloaded.
2023-09-21 22:30:23 -05:00
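The "update only if different" step of `fetchcert.sh` could be sketched as follows. This is an illustration, not the script itself: the function name `install_if_changed`, the file paths in the usage comment, and the `reload_nginx` helper are all hypothetical, and fetching the Secret from the Kubernetes API is left out.

```shell
# Sketch only: copy NEW over CURRENT when their contents differ.
# Returns 0 if CURRENT was updated, 1 if it was already up to date.
install_if_changed() {
    new=$1
    current=$2
    if [ -f "$current" ] && cmp -s "$new" "$current"; then
        return 1
    fi
    cp "$new" "$current"
}

# Hypothetical usage: reload nginx only when the certificate changed.
#   if install_if_changed /run/fetchcert/tls.crt /etc/pki/nginx/tls.crt; then
#       cp /run/fetchcert/tls.key /etc/pki/nginx/tls.key
#       reload_nginx
#   fi
```

Comparing before copying keeps the timer cheap to run often, since nginx is only touched when the certificate actually rotates.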
Dustin 222f40426a nginx: Deploy nginx in a container 2023-09-21 22:29:51 -05:00
Dustin a32e6676eb nvr1: Install collectd
Also enabling the `md` plugin, which is disabled by default, to monitor
the software RAID array where Frigate recordings are stored.
2023-09-21 22:29:51 -05:00
Dustin d22a65c1bd collectd: Install and configure collectd
The `collectd.yaml` Butane configuration fragment configures the machine
to install *collectd* and its various plugin packages directly on the
host using `rpm-ostree` (via *install-packages.service*).
2023-09-21 22:29:51 -05:00
Dustin 2048713452 packages: Add framework for installing packages
Some machines may need to install multiple packages for separate use
cases.  Requiring each use case to define a systemd unit that runs
`rpm-ostree install` directly would be cumbersome and also quite slow,
as each one would have to run in turn.  Instead, now there is a single
*install-packages.service* which installs all of the packages listed in
files in `/etc/ignition/packages.d`.  On first boot, all files in that
directory are read and all the packages they list will be installed in a
single `rpm-ostree install` invocation.
2023-09-21 22:29:51 -05:00
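A minimal sketch of how the package lists might be gathered into a single transaction. The function name `collect_packages` is hypothetical, and the file format (one package per line, `#` comments allowed) is an assumption about the files in `/etc/ignition/packages.d`.

```shell
# Sketch only: print the unique package names listed in every file
# under a packages.d-style directory, one per line.
collect_packages() {
    dir=$1
    # Expand the directory contents; an empty directory leaves the
    # pattern unexpanded, in which case there is nothing to do.
    set -- "$dir"/*
    [ -e "$1" ] || return 0
    # Strip trailing whitespace, drop comments and blank lines,
    # then de-duplicate across files.
    sed -e 's/[[:space:]]*$//' -e '/^#/d' -e '/^$/d' "$@" | LC_ALL=C sort -u
}

# Hypothetical usage: one rpm-ostree invocation for all packages.
# Word splitting of $packages is intentional here.
#   packages=$(collect_packages /etc/ignition/packages.d)
#   [ -n "$packages" ] && rpm-ostree install $packages
```

De-duplicating lets several use cases list the same package without affecting the single `rpm-ostree install` call.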
Dustin 22c085b35d frigate: Disable systemd filesystem isolation
When `ProtectSystem` is enabled, systemd sets up a separate mount
namespace for the service.  Unfortunately, this appears to interfere
with Podman and prevents it from cleaning up containers on shutdown.
2023-09-21 22:29:51 -05:00
Dustin dffa17410f frigate: Enable Frigate+ integration
To keep the API key a secret, we're encrypting the environment file in
the repository with GnuPG.  The decrypted copy only lives in the work
tree and is never committed. Changes have to be re-encrypted and
committed.
2023-09-21 22:29:51 -05:00
Dustin b80bee461a frigate: Pass DRI device for hardware acceleration
Enabling hardware acceleration using VA-API dramatically reduces
`ffmpeg` CPU usage.  For this to work, the Frigate container needs
access to the DRI device node.
2023-09-19 10:46:52 -05:00
Dustin ddd137a2e9 frigate: Manage state dir with tmpfiles.d
Since *frigate.service* runs as root, the directories created by
`StateDirectory` are owned by root.  The processes inside the container,
therefore, cannot access them.  Thus, we have to use `systemd-tmpfiles`
to create the state directories with the appropriate permissions.
2023-09-19 10:44:34 -05:00
Dustin 2a0b23c9a8 meta: Add Makefile
When developing Butane/Ignition files, I frequently forget to update the
parent files after making a change to an included file.  This causes a
lot of wasted time re-provisioning, only to discover that my change
did not take effect.  To alleviate this, we'll use `make` with some
macro magic to scan the Butane files for their dependencies, and let it
generate whatever Ignition files need updating any time a dependent file
changes.

I've also added a "publish" step to the Makefile, since I also
frequently forget to upload the regenerated Ignition files to the
server, causing the same headaches.
2023-09-16 08:15:08 -05:00
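The dependency scan could be approximated outside of `make` with a small shell helper like the one below. This is a different technique from the Makefile macros the commit describes, shown only to illustrate the idea. The function name `butane_deps` is hypothetical, and it assumes dependencies appear as unquoted `local:` values in the Butane YAML.

```shell
# Sketch only: list the files a Butane config pulls in via "local:"
# keys (merged configs and local file contents), one per line.
# Quoted values and paths containing spaces are not handled.
butane_deps() {
    sed -n 's/^[[:space:]]*-\{0,1\}[[:space:]]*local:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1" |
        LC_ALL=C sort -u
}

# Hypothetical usage: show what common.yaml depends on.
#   butane_deps common.yaml
```

A generated list like this could feed prerequisite rules, so that a change to any included file triggers regeneration of the parent Ignition file.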
Dustin 2efce551ba zram: Configure swap-on-zram
CoreOS does not enable swap-on-zram by default.
2023-09-16 08:15:08 -05:00
Dustin 1a60688cc1 nvr1: Deploy Frigate on the nvr1.p.b 2023-09-16 08:13:03 -05:00
Dustin 533cdc2c09 frigate: Run Frigate in a container
The *frigate* container must run as root, so we use a custom user
namespace to map root in the container to an unprivileged user on the
host.

For some reason, Podman (on CoreOS anyway) fails to stop a container
that uses a separate network namespace.  It reports "invalid argument"
when attempting to unmount the `netns` file, which then causes the
container to get "stuck" in `Storage` state.  Rebooting the host is
apparently the only way to get the container to start again correctly.
Fortunately, there's no particular reason to use an alternate network
namespace for Frigate, so it can use the host's network and avoid this
problem.
2023-09-16 08:06:07 -05:00
Dustin 1d71f874cf gasket-driver: Install Coral EdgeTPU driver
The *gasket-driver* container installs the `gasket` and `apex` kernel
modules, which provide the driver for the Google Coral EdgeTPU AI
accelerator module.  The container image must be built ahead of time,
of course, and contains modules built for a specific Fedora kernel
version.

The udev rule has two purposes: to set the permissions on the device
node so that any user on the system can access it, and to "tag" the
device so that systemd will generate a `.device` unit for it.  The
latter allows other units (e.g. Frigate) to express a `Requires=` and
`After=` dependency on the device unit, so that they do not start until
the driver is loaded.
2023-09-16 07:58:48 -05:00
Dustin afadd7dcf5 Add flash.sh
This simple script helps automate the process of flashing Fedora CoreOS
onto an SD card for a Raspberry Pi.
2023-08-04 15:01:18 -05:00