Commit Graph

49 Commits (011058aec3d3400b682ba90fcfea8457c58e36cd)

Author SHA1 Message Date
Dustin 011058aec3 loki: Use fetchcert to manage server certificate
Before going into production with Grafana Loki, I want to set it up to
use TLS.  To that end, I have configured _cert-manager_ to issue it a
certificate, signed by _DCH CA_.  In order to use said certificate,
we need to configure `fetchcert` to run on the Loki server.
2024-02-18 11:35:13 -06:00
Dustin 29afcae52e fetchcert: Deploy tool to get cert from k8s Secret
The `fetchcert` tool is a short shell script that fetches an X.509
certificate and corresponding private key from a Kubernetes Secret,
using the Kubernetes API.  I originally wrote it for the Frigate server
so it could fetch the _pyrocufflink.blue_ wildcard certificate, which is
managed by _cert-manager_.  Since then, I have adapted it to be more
generic, so it will be useful to fetch the _loki.pyrocufflink.blue_
certificate for Grafana Loki.

Although the script is rather simple, it does have several required
configuration parameters.  It needs to know the URL of the Kubernetes
API server and have the certificate for the CA that signs the server
certificate, as well as an authorization token.  It also needs to know
the namespace and name of the Secret from which it will fetch the
certificate and private key.  Finally, it needs to know the paths to the
files where the fetched data will be written.
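
A minimal sketch of how such a script might use those parameters,
assuming `curl`, `jq`, and `base64` are available.  The function names
and example paths are hypothetical and are not taken from the real
`fetchcert` script.  The `tls.crt` and `tls.key` field names are the
ones Kubernetes uses for TLS Secrets.

```shell
# Hypothetical sketch; not the actual fetchcert script.

# Build the REST API URL for a Secret in a namespace.
secret_url() {
    # $1 = API server URL, $2 = namespace, $3 = Secret name
    printf '%s/api/v1/namespaces/%s/secrets/%s\n' "${1%/}" "$2" "$3"
}

# Fetch the Secret as JSON, authenticating with a bearer token and
# verifying the API server against the given CA certificate.
fetch_secret() {
    # $1 = API URL, $2 = CA cert file, $3 = token file,
    # $4 = namespace, $5 = Secret name
    curl -fsS --cacert "$2" \
        -H "Authorization: Bearer $(cat "$3")" \
        "$(secret_url "$1" "$4" "$5")"
}

# Decode one base64-encoded field of the Secret JSON read from stdin.
secret_field() {
    jq -r --arg k "$1" '.data[$k]' | base64 -d
}

# Example (every value below is a placeholder):
#   json=$(fetch_secret https://k8s.example:6443 /etc/fetchcert/ca.crt \
#       /etc/fetchcert/token loki loki-cert)
#   printf '%s' "$json" | secret_field tls.crt > /etc/loki/server.crt
#   printf '%s' "$json" | secret_field tls.key > /etc/loki/server.key
```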

Generally, after certificates are updated, some action needs to be
performed in order to make use of them.  This typically involves
restarting or reloading a daemon.  Since the `fetchcert` tool runs in
a container, it can't directly perform those actions, so it simply
indicates via a special exit code that the certificate has been updated
and some further action may be needed.  The
`/etc/fetchcert/postupdate.sh` script is executed by _systemd_ after
`fetchcert` finishes.  If the `EXIT_STATUS` environment variable (which
is set by _systemd_ to the return code of the main service process)
matches the expected code, the configured post-update actions will be
executed.
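
The check described above could look roughly like the following.  This
is only a sketch: the exit code value (`100`) and the function names
are assumptions, since the message does not state them.  _systemd_
sets `EXIT_STATUS` for commands it runs after the main process stops,
such as those in `ExecStopPost=`.

```shell
# Hypothetical /etc/fetchcert/postupdate.sh sketch.
# UPDATED_CODE is an assumed value; the real exit code is not stated.
UPDATED_CODE=${UPDATED_CODE:-100}

# Succeed only when the main process reported "certificate updated".
cert_updated() {
    [ "${EXIT_STATUS:-}" = "$UPDATED_CODE" ]
}

post_update() {
    if cert_updated; then
        # Placeholder; a real script might reload a daemon here.
        echo "certificate updated; running post-update actions"
    else
        echo "no update; nothing to do"
    fi
}

# A real script would end by calling: post_update
```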
2024-02-18 10:48:01 -06:00
Dustin b51428c363 Merge branch 'loki' 2024-02-17 16:49:35 -06:00
Dustin 2a84d810e0 reload-udev-rules: Add delay before copying files
Since *systemd* starts the *reload-udev-rules.service* unit as soon as
any file in the `/run/containers/udev-rules` directory changes, the `cp`
command may start before all of the files have been copied out of the
container.  If this happens, some of the rules will not get copied to
the final path, and thus will not be processed by *udev*.

To give the container a chance to finish copying all of the files before
we process them, we need a bit of a delay.  Obviously, this is not a
perfect solution, as it could potentially take longer than 250ms to copy
the files in some cases, but hopefully those cases are rare enough to
not worry about.
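
A sketch of the delayed copy, assuming a `sleep` that accepts
fractional seconds (GNU coreutils and BusyBox both do).  The function
name and the `DELAY` variable are illustrative; the directory paths in
the example are the ones named in these commit messages.

```shell
# Hypothetical sketch of the reload-udev-rules command sequence.
DELAY=${DELAY:-0.25}

copy_rules() (
    src=$1
    dst=$2
    # Give the container time to finish writing its rules files.
    sleep "$DELAY"
    # The destination may not exist yet, especially at boot.
    mkdir -p "$dst"
    cp "$src"/*.rules "$dst"/
)

# Example:
#   copy_rules /run/containers/udev-rules /run/udev/rules.d
#   udevadm control --reload && udevadm trigger
```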
2024-02-15 10:08:52 -06:00
Dustin ffe450cd30 loki: Run Grafana Loki in a container
Deploying Loki is pretty straightforward.  It just needs a container
unit file and a basic YAML configuration file.
2024-02-13 19:54:48 -06:00
Dustin 45285b9c47 host: Add loki0.p.b
*loki0.pyrocufflink.blue* will host [Grafana Loki][0], a log aggregation
system.

[0]: https://grafana.com/oss/loki/
2024-02-13 16:55:05 -06:00
Dustin 1738e4a1f1 host: Add k8s-aarch64-n{0,1} 2024-02-03 11:16:52 -06:00
Dustin 786145e914 env/prod: Collect common templates in module
In order to simplify the process of adding new template render
instructions to all hosts, I've created a list of templates in the
`env/prod` module.  This way, I only have to add templates there, and
all hosts that "inherit" from it will automatically get them.
2024-02-03 11:16:52 -06:00
Dustin b7f5d4a910 app/ssh: Configure sshd trusted user CA keys
Configuring the system-wide trusted user CA key list for *sshd(8)*.
2024-02-03 11:16:52 -06:00
Dustin afd65ea9b8 host/nvr1: Fix cue package name 2024-02-03 11:13:42 -06:00
Dustin 073f7a6845 host: Add k8s-amd64-n3
*k8s-amd64-n3.pyrocufflink.blue* is a Kubernetes worker node.
2024-02-03 11:12:55 -06:00
Dustin f886a1bd8a sudo: Configure pam_ssh_agent_auth
I do not like how Fedora CoreOS configures `sudo` to allow the *core*
user to run privileged processes without authentication.  Rather than
assign the user a password, which would then have to be stored
somewhere, we'll install *pam_ssh_agent_auth* and configure `sudo` to
use it for authentication.  This way, only users with the private key
corresponding to one of the configured public keys can run `sudo`.

Naturally, *pam_ssh_agent_auth* has to be installed on the host system.
We achieve this by executing `rpm-ostree` via `nsenter` to escape the
container.  Once it is installed, we configure the PAM stack for
`sudo` to use it and populate the authorized keys database.  We also
need to configure `sudo` to keep the `SSH_AUTH_SOCK` environment
variable, so *pam_ssh_agent_auth* knows where to look for the private
keys.  Finally, we disable the default NOPASSWD rule for `sudo`, if
and only if the new configuration was installed.
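
The pieces described above might be sketched as follows.  The
authorized keys path and the function names are assumptions; only the
`pam_ssh_agent_auth.so` module, its `file=` option, and the
`env_keep` setting for `sudo` are standard.

```shell
# Hypothetical sketch; file paths and function names are assumptions.
KEYS_FILE=${KEYS_FILE:-/etc/security/sudo_authorized_keys}

# PAM rule: accept a user whose agent holds a key listed in KEYS_FILE.
pam_rule() {
    printf 'auth sufficient pam_ssh_agent_auth.so file=%s\n' "$KEYS_FILE"
}

# sudoers rule: keep SSH_AUTH_SOCK so the PAM module can reach the agent.
sudoers_rule() {
    printf '%s\n' 'Defaults env_keep += "SSH_AUTH_SOCK"'
}

# Install the package on the host from inside the container by entering
# the mount namespace of PID 1 (needs a sufficiently privileged container).
install_on_host() {
    nsenter -t 1 -m -- rpm-ostree install pam_ssh_agent_auth
}
```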
2024-01-29 09:10:42 -06:00
Dustin d6751af326 prod/nut: Require both UPS to be online
Unfortunately, the automatic transfer switch does not seem to work
correctly.  When the standby source is a UPS running on battery, it does
*not* switch sources if the primary fails.  In other words, when the
power is out and both UPS are running on battery, when the first one
dies, it will NOT switch to the second one.  It has no trouble switching
when the second source is mains power, though, which is very strange.

I have tried messing with all the settings including nominal input
voltage, sensitivity, and frequency tolerance, but none seem to have any
effect.

Since it is more important for the machines to shut down safely than it
is to have an extra 10-15 minutes of runtime during an outage, the best
solution for now is to configure the hosts to shut down as soon as the
first UPS battery gets low.  This is largely a waste of the second UPS,
but at least it will help prevent data loss.
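
One way to get this behavior from `upsmon` is to give each UPS a power
value of 1 and set `MINSUPPLIES` to 2.  Whether this commit does
exactly that is an assumption.  The sketch below prints such a
fragment; the UPS names, host, user, and password are placeholders.

```shell
# Hypothetical sketch: print an upsmon.conf fragment that requires
# both UPSs to be supplying power.
render_upsmon() {
    # $1 = upsd host, $2 = user, $3 = password, $4 = primary|secondary
    printf 'MONITOR ups0@%s 1 %s %s %s\n' "$1" "$2" "$3" "$4"
    printf 'MONITOR ups1@%s 1 %s %s %s\n' "$1" "$2" "$3" "$4"
    # Each UPS counts as one supply; requiring two means that either
    # UPS going critical (on battery and low) triggers a shutdown.
    printf 'MINSUPPLIES 2\n'
}

# Example:
#   render_upsmon nut0.example upsmon-user secret secondary
```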
2024-01-25 21:12:33 -06:00
Dustin bd18d3a734 host: Add serial1.p.b
*serial1.pyrocufflink.blue* is a replacement for *serial0.p.b*.  It runs
Fedora CoreOS and just has `picocom` and `tmux`.
2024-01-25 20:17:00 -06:00
Dustin e25aa15eb1 prod/nut: Add user for dustin
The `upsrw` command, which is used to set individual UPS configuration
parameters like low battery level, etc., needs a username and password
to authenticate to `upsd`.
2024-01-21 15:10:56 -06:00
Dustin 0450617ae6 prod/nut: Add upsmon user for burp1 2024-01-19 20:57:47 -06:00
Dustin 48145c3573 nut: Enable Podman auto-update for containers
Setting `AutoUpdate=registry` will tell Podman to automatically fetch
an updated container image from its corresponding registry and restart
the container.  The `podman-auto-update.timer` systemd unit needs to be
active for this to happen on a schedule.
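
A sketch of what this looks like in a Quadlet `.container` unit.  The
image reference and the rest of the unit contents are placeholders;
only the `AutoUpdate=registry` key and the timer name come from the
message above.

```shell
# Hypothetical sketch: render a Quadlet .container unit with
# auto-update enabled.
render_container_unit() {
    # $1 = fully qualified image reference
    cat <<EOF
[Container]
Image=$1
AutoUpdate=registry

[Install]
WantedBy=multi-user.target
EOF
}

# Example:
#   render_container_unit registry.example/nut-server:latest \
#       > /etc/containers/systemd/nut-server.container
#   systemctl daemon-reload
#   systemctl enable --now podman-auto-update.timer
```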
2024-01-19 20:10:11 -06:00
Dustin 668b79aaac prod/nut: Add upsmon passwords for gw1, vmhost{0,1} 2024-01-19 19:56:20 -06:00
Dustin f4938c57e1 prod/nut: Reset password for nvr1
The original password worked, but caused a warning in the `upsd` log:

> Ignoring duplicate password for nvr1
2024-01-19 19:32:08 -06:00
Dustin bb3705939e nut: Fix upsmon reload hook
`upsmon.conf` is used by *nut-monitor* (`upsmon`) rather than
*nut-server* (`upsd`).
2024-01-19 18:01:42 -06:00
Dustin 36fd137897 nut: Infer role from server name, set commands
Since the "primary" `upsmon` is always (for our purposes) running on the
same host as `upsd`, there's no reason to specify both values.

All systems need a shutdown command; one is not set by default.

The primary system is the only one that should send notifications.
2024-01-19 17:57:20 -06:00
Dustin caccffcb65 nut: split out template for sysusers.d config
Hosts that run `upsmon` but not `upsd` still need the *nut* user.
2024-01-19 17:21:23 -06:00
Dustin ad42c2d883 nvr1: Add instructions to configure upsmon
*nvr1.pyrocufflink.blue* will run `upsmon` so it can shut itself down
safely when the power goes out.
2024-01-19 16:57:47 -06:00
Dustin a919a9f94b nut/monitor: Fix tmpfs mount syntax
`dest` is not a valid option for the `--mount` argument to `podman`.  To
specify the target path, only `target`, `destination`, and `dst`
are valid.
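
For illustration, a small helper that builds a valid tmpfs mount
argument.  The helper name, mount path, and image name are
hypothetical.

```shell
# Hypothetical helper: build a tmpfs --mount argument for podman.
tmpfs_mount() {
    # $1 = path inside the container.  "destination" could also be
    # written as "target" or "dst", but not "dest".
    printf '%s%s\n' '--mount=type=tmpfs,destination=' "$1"
}

# Example:
#   podman run --rm "$(tmpfs_mount /run/nut)" registry.example/nut-monitor
```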
2024-01-19 16:42:56 -06:00
Dustin fb74f0e81c nut: Configure upsmon
`upsmon` is the component of NUT that tracks the status of UPSs and
reacts to their changing by sending notifications and/or shutting down
the system.  It is a networked application that can run on any system;
it can run on a different system than `upsd`, and indeed can run on
multiple systems simultaneously.

Each system that runs `upsmon` will need a username and password for
each UPS it will monitor.  Using the CUE [function pattern][0], I've
made it pretty simple to declare the necessary values under
`nut.monitor`.

[0]: https://cuetorials.com/patterns/functions/
2024-01-19 08:52:14 -06:00
Dustin 227ce8cfcf collectd: Bind-mount journal log socket
*collectd* logs to syslog, so its output is lost when it's running in a
container.  We can capture messages from it by mounting the journald
syslog socket into the container.
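
A sketch of the bind mount, assuming the usual location of the
journald syslog socket.  The helper name and the image name are
hypothetical.

```shell
# Hypothetical helper: volume argument exposing the host's syslog
# socket.  systemd-journald listens for syslog messages on
# /run/systemd/journal/dev-log; on the host, /dev/log is normally a
# symlink to it.
syslog_volume() {
    printf '%s\n' '--volume=/run/systemd/journal/dev-log:/dev/log'
}

# Example:
#   podman run --rm "$(syslog_volume)" registry.example/collectd
```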
2024-01-18 20:35:22 -06:00
Dustin f1a55e3d5c collectd: Fix / bind mount directive 2024-01-18 20:27:25 -06:00
Dustin ec4b640170 reload-udev-rules: Ensure rules.d directory exists
The `/run/udev/rules.d` directory may not always exist, especially at
boot.  We need to ensure that it does before we try to copy rules
exported by containers into it, or the unit will fail.
2024-01-18 20:01:06 -06:00
Dustin 714df85183 collectd: Bind mount / into container
Even with *collectd* configured to report filesystem usage by device, it
still only reports filesystems that are mounted (in its namespace).
Thus, in order for it to report filesystems like `/boot`, these need to
be mounted in the container.
2024-01-18 19:58:11 -06:00
Dustin d3338a125b nut0: Configure collectd 2024-01-17 17:35:21 -06:00
Dustin 51aaccc861 collectd: Deploy collectd in a container
I keep going back-and-forth on whether or not collectd should run in a
container on Fedora CoreOS machines.  On the one hand, running it
directly on the host allows it to monitor filesystem usage by mount
point, which is consistent with how non-FCOS machines are monitored.
On the other hand, installing packages on FCOS with `rpm-ostree` is a
nightmare.  It's _incredibly_ slow.  There are also occasionally issues
installing packages if the base layer has not been updated in a while
and the new packages require an existing package to be updated.

For the NUT server specifically, I have changed my mind again: the
*collectd-nut* package depends on *nut-client*, which in turn depends on
Python.  I definitely want to avoid installing Python on the host, but I
do not want to lose the ability to monitor the UPSs via collectd.  Using
a container, I can strip out the unnecessary bits of *nut-client* and
avoid installing Python at all.  I think that's worth having to monitor
filesystem usage by device instead of by mount point.
2024-01-17 17:35:21 -06:00
Dustin 0bcbcbd199 base/schema: Fix instructions schema
Without the `...` prefix, CUE interprets a type enclosed in square
brackets as a list of exactly one of that type.  The ellipsis changes it
to mean a list of any number of that type.
2024-01-17 17:35:21 -06:00
Dustin 86f6943f5b Remove Containerfile
I don't want Jenkins to build a new runtime container every time I make
a change to the configuration policy.  As such, I've moved the container
image definition and corresponding CI pipeline script to their own
repository.
2024-01-17 17:35:21 -06:00
Dustin 41e9fa85d2 Restructure CUE packages
A bunch of stuff that wasn't schema definitions ended up in the `schema`
package.  Rather than split values up in a bunch of top-level packages,
I think it would be better to have a package-per-app model.
2024-01-17 17:35:18 -06:00
Dustin 52642d37d9 nut: Configure collectd NUT plugin
2024-01-17 07:18:37 -06:00
Dustin 44926c944f app/nut: Inherit container udev rules units
I missed getting the path and service unit file templates when rewriting
from KCL into CUE.
2024-01-15 17:34:45 -06:00
Dustin 37d65984c7 host/nut0: Switch to prod configuration
2024-01-15 16:15:47 -06:00
Dustin 47278c01e5 nut: Set container_use_devices SELinux tunable
By default, the Fedora SELinux policy does not allow containers to
access device nodes.  This setting is independent of CGroup device
rules.
2024-01-15 12:55:10 -06:00
Dustin 11f9957c11 Switch from KCL to CUE
Although KCL is unquestionably a more powerful language, and maps more
closely to my mental model of how host/environment/application
configuration is defined, the fact that it doesn't work on ARM (issue
982) makes it a non-starter.  It's also quite slow (owing to how it
compiles a program to evaluate the code) and cumbersome to distribute.
Fortunately, `tmpl` doesn't care how the values it uses were computed,
so we can freely change configuration languages, so long as whatever we use
generates JSON/YAML.

CUE is probably a lot more popular than KCL, and is quite a bit simpler.
It's more restrictive (values cannot be overridden once defined), but
still expressive enough for what I am trying to do (so far).
2024-01-15 11:40:58 -06:00
Dustin 8f31b0302c container: Install kcl, tmpl from binaries
`tmpl` takes a long time to compile on a Raspberry Pi, so I've created a
CI pipeline to build it separately.

`kcl` seems to have a [bug][0] that causes it to include the x86_64
builds of `kclvm_cli` and `libkclvm_cli_cdylib.so` on aarch64.  This
naturally doesn't work, so we need to fetch the correct builds
ourselves.

[0]: https://github.com/kcl-lang/cli/issues/31
2024-01-14 19:42:36 -06:00
Dustin f0ee31e3b1 Add Jenkinsfile 2024-01-14 19:24:55 -06:00
Dustin be1042cda7 nut: Do not run as privileged container
The only privilege NUT needs is access to the USB device nodes.  Using a
device CGroup rule to allow this is significantly better than disabling
all restrictions.  Especially since I discovered that `--privileged`
implies `--security-opt label=disable`, effectively disabling SELinux
confinement of the container.
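
A sketch of a device CGroup rule for USB device nodes.  It assumes the
nodes live under `/dev/bus/usb`, which use character major number 189
(`usb_device`) on Linux.  The helper name and image name are
hypothetical.

```shell
# Hypothetical helper: CGroup rule allowing access to USB device nodes.
usb_cgroup_rule() {
    # Format: type major:minor permissions
    # (r = read, w = write, m = mknod)
    printf '%s\n' 'c 189:* rwm'
}

# Example:
#   podman run --device-cgroup-rule="$(usb_cgroup_rule)" \
#       --volume /dev/bus/usb:/dev/bus/usb registry.example/nut-server
```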
2024-01-14 19:24:55 -06:00
Dustin 74508faf27 nut: Apply udev rules on the host
NUT needs some udev rules in order to set the proper permissions on USB
etc. devices so it can run as an otherwise unprivileged user.  Since
udev rules can only be processed on the host, these rules need to be
copied out of the container and evaluated before the NUT server starts.
To enable this, the *nut-server* container image copies the rules it
contains to `/etc/udev/rules.d` if that directory is a mount point.  By
bind mounting a directory on the host at that path, we can get a copy of
the rules files outside the container.  Then, using a systemd path unit,
we can tell the udev daemon to reload and reevaluate its rules.

SELinux prevents processes in containers from writing to
`/etc/udev/rules.d` directly, so we have to use an intermediate location
and then copy the rules files to their final destination.
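
The container-side half of this could be sketched as below, assuming
the `mountpoint` utility is present in the image.  The function name
and the source directory in the example are hypothetical.

```shell
# Hypothetical container entrypoint fragment; names are illustrative.
export_rules() (
    src=$1   # rules shipped in the image
    dst=$2   # e.g. /etc/udev/rules.d inside the container
    # Only export when the host has bind-mounted a directory here.
    if mountpoint -q "$dst"; then
        cp "$src"/*.rules "$dst"/
    fi
)

# Example:
#   export_rules /usr/lib/udev/rules.d /etc/udev/rules.d
```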
2024-01-14 19:24:55 -06:00
Dustin 0e046d062e nut: Reload systemd after updating container unit
Need to run `systemctl daemon-reload` after creating or modifying the
`nut-server.container` unit file, so that the corresponding service unit
will be generated.
2024-01-14 19:24:55 -06:00
Dustin e2f9cc7a3a container: Symlink /etc/{passwd,group} to /host
When `tmpl` runs `systemd-sysusers` after generating the `sysusers.d`
file for NUT, the `/etc/passwd` and `/etc/group` files on the host are
created anew and replaced, which "breaks" the bind mount.  Since new
files are put in their place, the container and the host no longer see
the same files.  We can work around this by using a symbolic link for
each file, pointing to the respective file in the `/host` directory
(which is the host's `/` directory bind mounted into the container's
namespace).  Since the symlinks follow the file by name rather than
inode, the container's view is always in sync with the host's.
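
The difference can be demonstrated in a temporary directory.  In this
sketch, `host` stands in for the `/host` bind mount and `etc` for the
container's `/etc`; the function name is hypothetical.

```shell
# Self-contained demonstration; touches only a temporary directory.
symlink_demo() (
    dir=$(mktemp -d) || exit 1
    mkdir "$dir/host" "$dir/etc"
    echo 'old' > "$dir/host/passwd"
    ln -s ../host/passwd "$dir/etc/passwd"
    # Replace the file the way an atomic rewrite does: write a new
    # file, then rename it over the old one (this creates a new inode).
    echo 'new' > "$dir/host/passwd.tmp"
    mv "$dir/host/passwd.tmp" "$dir/host/passwd"
    # The symlink resolves by name, so it sees the replacement.
    cat "$dir/etc/passwd"
    rm -rf "$dir"
)
```

Calling `symlink_demo` prints `new`, whereas a bind mount of the
original file would still refer to the old inode.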
2024-01-14 19:24:55 -06:00
Dustin 79de375b30 container: Fix kcl runtime
As it turns out, KCL literally *compiles* a program from the KCL
sources.  The program it creates needs to link with its runtime library,
`libkclvm_cli_cdylib.so`.  The `kcl` command extracts this library,
along with a helper utility `kclvm_cli`, which performs the actual
compilation and linking.  In a container, `/root/go` is probably mounted
read-only, so we need to extract these files ahead of time and put them
in another location, so the `kcl` command does not have to do it each
time it runs.
2024-01-14 19:24:55 -06:00
Dustin d44e7df8cf nut: Pass explicit path to systemd-sysusers
When `tmpl` substitutes the path of the generated file for `%s` in hook
commands, it uses the full path including the `destdir` prefix.  Since
we're running `tmpl` inside a container, but `systemd-sysusers` outside
it (via `nsenter -t 1`), that path is not correct.  Thus, we need to
explicitly pass the path as `systemd-sysusers` will see it.
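
To illustrate the mismatch, the helper below strips a `destdir` prefix
to get the path as the host sees it.  The commit passes an explicit
path instead; the helper name, the `/host` prefix, and the file name
in the example are assumptions.

```shell
# Hypothetical helper: translate a path under destdir to the host view.
host_path() {
    # $1 = destdir prefix, $2 = path as seen inside the container
    printf '%s\n' "${2#"$1"}"
}

# Example:
#   nsenter -t 1 -m -- systemd-sysusers \
#       "$(host_path /host /host/etc/sysusers.d/nut.conf)"
```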
2024-01-14 19:24:55 -06:00
Dustin 1d4d29c294 Add Containerfile 2024-01-14 19:24:55 -06:00
Dustin 778c6d440d Initial commit 2024-01-14 19:24:55 -06:00