infra/cfg

Commit Graph

Author	SHA1	Message	Date
Dustin	011058aec3	loki: Use fetchcert to manage server certificate Before going into production with Grafana Loki, I want to set it up to use TLS. To that end, I have configured _cert-manager_ to issue it a certificate, signed by _DCH CA_. In order to use said certificate, we need to configure `fetchcert` to run on the Loki server.	2024-02-18 11:35:13 -06:00
Dustin	29afcae52e	fetchcert: Deploy tool to get cert from k8s Secret The `fetchcert` tool is a short shell script that fetches an X.509 certificate and corresponding private key from a Kubernetes Secret, using the Kubernetes API. I originally wrote it for the Frigate server so it could fetch the _pyrocufflink.blue_ wildcard certificate, which is managed by _cert-manager_. Since then, I have adapted it to be more generic, so it will be useful to fetch the _loki.pyrocufflink.blue_ certificate for Grafana Loki. Although the script is rather simple, it does have several required configuration parameters. It needs to know the URL of the Kubernetes API server and have the certificate for the CA that signs the server certificate, as well as an authorization token. It also needs to know the namespace and name of the Secret from which it will fetch the certificate and private key. Finally, it needs to know the paths to the files where the fetched data will be written. Generally, after certificates are updated, some action needs to be performed in order to make use of them. This typically involves restarting or reloading a daemon. Since the `fetchcert` tool runs in a container, it can't directly perform those actions, so it simply indicates via a special exit code that the certificate has been updated and some further action may be needed. The `/etc/fetchcert/postupdate.sh` script is executed by _systemd_ after `fetchcert` finishes. If the `EXIT_STATUS` environment variable (which is set by _systemd_ to the return code of the main service process) matches the expected code, the configured post-update actions will be executed.	2024-02-18 10:48:01 -06:00
Dustin	b51428c363	Merge branch 'loki'	2024-02-17 16:49:35 -06:00
Dustin	2a84d810e0	reload-udev-rules: Add delay before copying files Since systemd starts the reload-udev-rules.service unit as soon as any file in the `/run/containers/udev-rules` directory changes, the `cp` command may start before all of the files have been copied out of the container. If this happens, some of the rules will not get copied to the final path, and thus will not be processed by udev. To give the container a chance to finish copying all of the files before we process them, we need a bit of a delay. Obviously, this is not a perfect solution, as it could potentially take longer than 250ms to copy the files in some cases, but hopefully those cases are rare enough to not worry about.	2024-02-15 10:08:52 -06:00
Dustin	ffe450cd30	loki: Run Grafana Loki in a container Deploying Loki is pretty straightforward. It just needs a container unit file and a basic YAML configuration file.	2024-02-13 19:54:48 -06:00
Dustin	45285b9c47	host: Add loki0.p.b loki0.pyrocufflink.blue will host [Grafana Loki][0], a log aggregation system. [0]: https://grafana.com/oss/loki/	2024-02-13 16:55:05 -06:00
Dustin	1738e4a1f1	host: Add k8s-aarch64-n{0,1}	2024-02-03 11:16:52 -06:00
Dustin	786145e914	env/prod: Collect common templates in module In order to simplify the process of adding new template render instructions to all hosts, I've created a list of templates in the `env/prod` module. This way, I only have to add templates there, and all hosts that "inherit" from it will automatically get them.	2024-02-03 11:16:52 -06:00
Dustin	b7f5d4a910	app/ssh: Configure sshd trusted user CA keys Configuring the system-wide trusted user CA key list for sshd(8).	2024-02-03 11:16:52 -06:00
Dustin	afd65ea9b8	host/nvr1: Fix cue package name	2024-02-03 11:13:42 -06:00
Dustin	073f7a6845	host: Add k8s-amd64-n3 k8s-amd64-n3.pyrocufflink.blue is a Kubernetes worker node.	2024-02-03 11:12:55 -06:00
Dustin	f886a1bd8a	sudo: Configure pam_ssh_agent_auth I do not like how Fedora CoreOS configures `sudo` to allow the core user to run privileged processes without authentication. Rather than assign the user a password, which would then have to be stored somewhere, we'll install pam_ssh_agent_auth and configure `sudo` to use it for authentication. This way, only users with the private key corresponding to one of the configured public keys can run `sudo`. Naturally, pam_ssh_agent_auth has to be installed on the host system. We achieve this by executing `rpm-ostree` via `nsenter` to escape the container. Once it is installed, we configure the PAM stack for `sudo` to use it and populate the authorized keys database. We also need to configure `sudo` to keep the `SSH_AUTH_SOCK` environment variable, so pam_ssh_agent_auth knows where to look for the private keys. Finally, we disable the default NOPASSWD rule for `sudo`, if and only if the new configuration was installed.	2024-01-29 09:10:42 -06:00
Dustin	d6751af326	prod/nut: Require both UPS to be online Unfortunately, the automatic transfer switch does not seem to work correctly. When the standby source is a UPS running on battery, it does not switch sources if the primary fails. In other words, when the power is out and both UPS are running on battery, when the first one dies, it will NOT switch to the second one. It has no trouble switching when the second source is mains power, though, which is very strange. I have tried messing with all the settings including nominal input voltage, sensitivity, and frequency tolerance, but none seem to have any effect. Since it is more important for the machines to shut down safely than it is to have an extra 10-15 minutes of runtime during an outage, the best solution for now is to configure the hosts to shut down as soon as the first UPS battery gets low. This is largely a waste of the second UPS, but at least it will help prevent data loss.	2024-01-25 21:12:33 -06:00
Dustin	bd18d3a734	host: Add serial1.p.b serial1.pyrocufflink.blue is a replacement for serial0.p.b. It runs Fedora CoreOS and just has `picocom` and `tmux`.	2024-01-25 20:17:00 -06:00
Dustin	e25aa15eb1	prod/nut: Add user for dustin The `upsrw` command, which is used to set individual UPS configuration parameters like low battery level, etc., needs a username and password to authenticate to `upsd`.	2024-01-21 15:10:56 -06:00
Dustin	0450617ae6	prod/nut: Add upsmon user for burp1	2024-01-19 20:57:47 -06:00
Dustin	48145c3573	nut: Enable Podman auto-update for containers Setting `AutoUpdate=registry` will tell Podman to automatically fetch an updated container image from its corresponding registry and restart the container. The `podman-auto-update.timer` systemd unit needs to be active for this to happen on a schedule.	2024-01-19 20:10:11 -06:00
Dustin	668b79aaac	prod/nut: Add upsmon passwords for gw1, vmhost{0,1}	2024-01-19 19:56:20 -06:00
Dustin	f4938c57e1	prod/nut: Reset password for nvr1 The original password worked, but caused a warning in the `upsd` log: > Ignoring duplicate password for nvr1	2024-01-19 19:32:08 -06:00
Dustin	bb3705939e	nut: Fix upsmon reload hook `upsmon.conf` is used by nut-monitor (`upsmon`) rather than nut-server (`upsd`).	2024-01-19 18:01:42 -06:00
Dustin	36fd137897	nut: Infer role from server name, set commands Since the "primary" `upsmon` is always (for our purposes) running on the same host as `upsd`, there's no reason to specify both values. All systems need a shutdown command; one is not set by default. The primary system is the only one that should send notifications.	2024-01-19 17:57:20 -06:00
Dustin	caccffcb65	nut: split out template for sysusers.d config Hosts that run `upsmon` but not `upsd` still need the nut user.	2024-01-19 17:21:23 -06:00
Dustin	ad42c2d883	nvr1: Add instructions to configure upsmon nvr1.pyrocufflink.blue will run `upsmon` so it can shut itself down safely when the power goes out.	2024-01-19 16:57:47 -06:00
Dustin	a919a9f94b	nut/monitor: Fix tmpfs mount syntax `dest` is not a valid option for the `--mount` argument to `podman`. To specify the target path, only `target`, `destination`, and `dst` are valid.	2024-01-19 16:42:56 -06:00
Dustin	fb74f0e81c	nut: Configure upsmon `upsmon` is the component of NUT that tracks the status of UPSs and reacts to their changing by sending notifications and/or shutting down the system. It is a networked application that can run on any system; it can run on a different system than `upsd`, and indeed can run on multiple systems simultaneously. Each system that runs `upsmon` will need a username and password for each UPS it will monitor. Using the CUE [function pattern][0], I've made it pretty simple to declare the necessary values under `nut.monitor`. [0]: https://cuetorials.com/patterns/functions/	2024-01-19 08:52:14 -06:00
Dustin	227ce8cfcf	collectd: Bind-mount journal log socket collectd logs to syslog, so its output is lost when it's running in a container. We can capture messages from it by mounting the journald syslog socket into the container.	2024-01-18 20:35:22 -06:00
Dustin	f1a55e3d5c	collectd: Fix / bind mount directive	2024-01-18 20:27:25 -06:00
Dustin	ec4b640170	reload-udev-rules: Ensure rules.d directory exists The `/run/udev/rules.d` directory may not always exist, especially at boot. We need to ensure that it does before we try to copy rules exported by containers into it, or the unit will fail.	2024-01-18 20:01:06 -06:00
Dustin	714df85183	collectd: Bind mount / into container Even with collectd configured to report filesystem usage by device, it still only reports filesystems that are mounted (in its namespace). Thus, in order for it to report filesystems like `/boot`, these need to be mounted in the container.	2024-01-18 19:58:11 -06:00
Dustin	d3338a125b	nut0: Configure collectd	2024-01-17 17:35:21 -06:00
Dustin	51aaccc861	collectd: Deploy collectd in a container I keep going back-and-forth on whether or not collectd should run in a container on Fedora CoreOS machines. On the one hand, running it directly on the host allows it to monitor filesystem usage by mount point, which is consistent with how non-FCOS machines are monitored. On the other hand, installing packages on FCOS with `rpm-ostree` is a nightmare. It's _incredibly_ slow. There are also occasionally issues installing packages if the base layer has not been updated in a while and the new packages require an existing package to be updated. For the NUT server specifically, I have changed my mind again: the collectd-nut package depends on nut-client, which in turn depends on Python. I definitely want to avoid installing Python on the host, but I do not want to lose the ability to monitor the UPSs via collectd. Using a container, I can strip out the unnecessary bits of nut-client and avoid installing Python at all. I think that's worth having to monitor filesystem usage by device instead of by mount point.	2024-01-17 17:35:21 -06:00
Dustin	0bcbcbd199	base/schema: Fix instructions schema Without the `...` prefix, CUE interprets a type enclosed in square brackets as a list of exactly one of that type. The ellipsis changes it to mean a list of any number of that type.	2024-01-17 17:35:21 -06:00
Dustin	86f6943f5b	Remove Containerfile I don't want Jenkins to build a new runtime container every time I make a change to the configuration policy. As such, I've moved the container image definition and corresponding CI pipeline script to their own repository.	2024-01-17 17:35:21 -06:00
Dustin	41e9fa85d2	Restructure CUE packages A bunch of stuff that wasn't schema definitions ended up in the `schema` package. Rather than split values up in a bunch of top-level packages, I think it would be better to have a package-per-app model.	2024-01-17 17:35:18 -06:00
Dustin	52642d37d9	nut: Configure collectd NUT plugin	2024-01-17 07:18:37 -06:00
Dustin	44926c944f	app/nut: Inherit container udev rules units I missed getting the path and service unit file templates when rewriting from KCL into CUE.	2024-01-15 17:34:45 -06:00
Dustin	37d65984c7	host/nut0: Switch to prod configuration	2024-01-15 16:15:47 -06:00
Dustin	47278c01e5	nut: Set container_use_devices SELinux tunable By default, the Fedora SELinux policy does not allow containers to access device nodes. This setting is independent of CGroup device rules.	2024-01-15 12:55:10 -06:00
Dustin	11f9957c11	Switch from KCL to CUE Although KCL is unquestionably a more powerful language, and maps more closely to my mental model of how host/environment/application configuration is defined, the fact that it doesn't work on ARM (issue 982) makes it a non-starter. It's also quite slow (owing to how it compiles a program to evaluate the code) and cumbersome to distribute. Fortunately, `tmpl` doesn't care how the values it uses were computed, so we can freely change configuration languages, so long as whatever we use generates JSON/YAML. CUE is probably a lot more popular than KCL, and is quite a bit simpler. It's more restrictive (values cannot be overridden once defined), but still expressive enough for what I am trying to do (so far).	2024-01-15 11:40:58 -06:00
Dustin	8f31b0302c	container: Install kcl, tmpl from binaries `tmpl` takes a long time to compile on a Raspberry Pi, so I've created a CI pipeline to build it separately. `kcl` seems to have a [bug][0] that causes it to include the x86_64 builds of `kclvm_cli` and `libkclvm_cli_cdylib.so` on aarch64. This naturally doesn't work, so we need to fetch the correct builds ourselves. [0]: https://github.com/kcl-lang/cli/issues/31	2024-01-14 19:42:36 -06:00
Dustin	f0ee31e3b1	Add Jenkinsfile	2024-01-14 19:24:55 -06:00
Dustin	be1042cda7	nut: Do not run as privileged container The only privilege NUT needs is access to the USB device nodes. Using a device CGroup rule to allow this is significantly better than disabling all restrictions. Especially since I discovered that `--privileged` implies `--security-opt label=disable`, effectively disabling SELinux confinement of the container.	2024-01-14 19:24:55 -06:00
Dustin	74508faf27	nut: Apply udev rules on the host NUT needs some udev rules in order to set the proper permissions on USB etc. devices so it can run as an otherwise unprivileged user. Since udev rules can only be processed on the host, these rules need to be copied out of the container and evaluated before the NUT server starts. To enable this, the nut-server container image copies the rules it contains to `/etc/udev/rules.d` if that directory is a mount point. By bind mounting a directory on the host at that path, we can get a copy of the rules files outside the container. Then, using a systemd path unit, we can tell the udev daemon to reload and reevaluate its rules. SELinux prevents processes in containers from writing to `/etc/udev/rules.d` directly, so we have to use an intermediate location and then copy the rules files to their final destination.	2024-01-14 19:24:55 -06:00
Dustin	0e046d062e	nut: Reload systemd after updating container unit Need to run `systemctl daemon-reload` after creating or modifying the `nut-server.container` unit file, so that the corresponding service unit will be generated.	2024-01-14 19:24:55 -06:00
Dustin	e2f9cc7a3a	container: Symlink /etc/{passwd,group} to /host When `tmpl` runs `systemd-sysusers` after generating the `sysusers.d` file for NUT, the `/etc/passwd` and `/etc/group` files on the host are created anew and replaced, which "breaks" the bind mount. Since new files are put in their place, the container and the host no longer see the same files. We can work around this by using a symbolic link for each file, pointing to the respective file in the `/host` directory (which is the host's `/` directory bind mounted into the container's namespace). Since the symlinks follow the file by name rather than inode, the container's view is always in sync with the host's.	2024-01-14 19:24:55 -06:00
Dustin	79de375b30	container: Fix kcl runtime As it turns out, KCL literally compiles a program from the KCL sources. The program it creates needs to link with its runtime library, `libkclvm_cli_cdylib.so`. The `kcl` command extracts this library, along with a helper utility `kclvm_cli`, which performs the actual compilation and linking. In a container, `/root/go` is probably mounted read-only, so we need to extract these files ahead of time and put them in another location, so the `kcl` command does not have to do it each time it runs.	2024-01-14 19:24:55 -06:00
Dustin	d44e7df8cf	nut: Pass explicit path to systemd-sysusers When `tmpl` substitutes the path of the generated file for `%s` in hook commands, it uses the full path including the `destdir` prefix. Since we're running `tmpl` inside a container, but `systemd-sysusers` outside it (via `nsenter -t 1`), that path is not correct. Thus, we need to explicitly pass the path as `systemd-sysusers` will see it.	2024-01-14 19:24:55 -06:00
Dustin	1d4d29c294	Add Containerfile	2024-01-14 19:24:55 -06:00
Dustin	778c6d440d	Initial commit	2024-01-14 19:24:55 -06:00
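
The post-update handshake described in the `fetchcert` commit (29afcae52e) can be illustrated with a small sketch. This is not the author's `postupdate.sh`, which is a shell script whose contents are not shown here; it only models the decision that commit describes: act only when systemd's `EXIT_STATUS` matches the "certificate updated" code. The value `64`, the function names, and the example action are all hypothetical, since the commit message does not state them.

```python
import subprocess

# Assumption: the commit message does not say which exit code fetchcert
# uses to signal "certificate updated"; 64 is only a placeholder.
UPDATED_EXIT_CODE = 64


def certificate_was_updated(environ, updated_code=UPDATED_EXIT_CODE):
    """Return True if EXIT_STATUS equals the 'updated' exit code.

    systemd passes EXIT_STATUS as a string (a number for a normal exit,
    a signal name if the process was killed), so compare as strings.
    """
    return environ.get("EXIT_STATUS", "") == str(updated_code)


def run_post_update(environ, actions, runner=subprocess.run):
    """Run each action (an argv list) only when the certificate changed.

    Returns the list of actions that were run, which is empty when
    EXIT_STATUS does not match the expected code.
    """
    if not certificate_was_updated(environ):
        return []
    ran = []
    for argv in actions:
        # check=True makes a failing action raise instead of passing silently.
        runner(argv, check=True)
        ran.append(argv)
    return ran
```

A hook built this way might be called as `run_post_update(os.environ, [["systemctl", "try-reload-or-restart", "loki.service"]])`, where the unit name is again an assumption. Keeping the exit-code check separate from the actions mirrors the split in the commit: the containerized tool only reports that something changed, and the host decides what to do about it.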

49 Commits (011058aec3d3400b682ba90fcfea8457c58e36cd)
