73 Commits (beb243d69a8f61d38606df501a3c7a6e4415dec4)

Author SHA1 Message Date
Dustin beb243d69a loki: Do not chcon/chown state dir at startup
_systemd_ automatically recursively changes the ownership of the paths
listed in `StateDirectory` when the unit is activated.  This can take a
very long time, as the Loki storage directory contains hundreds of
thousands of files.  Since we also have `podman` change the ownership,
that *doubles* the time taken.  Similarly, with `podman` also configured
to change the SELinux label of the files in that path, even more time is
wasted at startup.

To avoid all these time wasters, we need to avoid having _systemd_
manage the state directory and create it with the proper ownership and
SELinux label manually.  Here, we're only manipulating the metadata of
the top-level directory; anything within the directory is untouched.
This ensures that the directory is always there and has the correct
permissions, but does not spend any time changing anything that doesn't
need to be changed.
2024-04-25 09:47:21 -05:00
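The idea in the commit above (touch only the top-level directory's metadata, never its contents) can be sketched roughly as follows.  This is a minimal Python illustration, not the configuration used in the repository: the function name `ensure_state_dir`, its parameters, and the use of the `chcon` utility through `subprocess` are all assumptions made for the example.

```python
import os
import subprocess


def ensure_state_dir(path, uid, gid, mode=0o755, selinux_type=None):
    """Create `path` if needed and fix metadata on the top-level directory only.

    Nothing below `path` is visited, so the cost does not depend on how many
    files the directory already contains.
    """
    os.makedirs(path, exist_ok=True)
    os.chown(path, uid, gid)  # non-recursive: only the directory itself
    os.chmod(path, mode)
    if selinux_type is not None:
        # Assumption: `chcon` is installed.  Without -R it relabels only the
        # named path, leaving everything inside it untouched.
        subprocess.run(["chcon", "-t", selinux_type, path], check=True)
    return path
```

A caller might pass something like `selinux_type="container_file_t"`; whether that is the label used in this deployment is not stated in the commit.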
Dustin 837cec36f1 prod: frigate: Fix typo in go2rtc config 2024-04-07 14:42:02 -05:00
Dustin b933f37270 prod: promtail: Update CA certificate
*Loki* now has a certificate signed by DCH CA R2, so the Promtail
configuration needs to be updated to trust that root certificate.
2024-04-07 14:39:46 -05:00
Dustin 9c8e580c59 prod: caddy: Update ACME CA certificate
*step-ca* now uses DCH CA R2, so the Caddy ACME configuration needs
to be updated to trust that root certificate.
2024-04-07 14:23:51 -05:00
Dustin 1db158c150 nvr2: Deploy collectd 2024-04-07 11:18:42 -05:00
Dustin 97ba882cb2 prod: frigate: Set LIBVA driver name
Frigate defaults to using the *intel* VA-API driver, but *nvr2.p.b* has
an AMD GPU.
2024-04-05 22:26:41 -05:00
Dustin 41251a52cd wip: app/frigate: Deploy Caddy
Running Caddy in front of Frigate to provide HTTPS and authentication.
2024-04-05 22:26:36 -05:00
Dustin ee66e9ea18 caddy: Separate out from loki app
This will make it more clear when sharing Caddy resources with other
applications (e.g. Frigate).
2024-04-05 22:05:21 -05:00
Dustin b5fea000fa prod: Add upsmon password for nvr2 2024-04-05 22:05:21 -05:00
Dustin d432c673e9 host: Add nvr2.p.b
*nvr2.pyrocufflink.blue* runs Frigate video recording software.
2024-04-05 22:05:21 -05:00
Dustin aeddab46ff env/prod: Add values for Frigate
Imported as-is from *nvr1.pyrocufflink.blue*.
2024-04-05 22:05:21 -05:00
Dustin cd64b3bccb app/frigate: Add schema, templates for Frigate
[Frigate] is open source network video recording software with
advanced motion detection using machine learning object detection.  It
uses `ffmpeg` to stream video from one or more RTSP-capable IP video
cameras and passes the images through an object detection process.  To
improve the performance of the machine learning model, it supports using
a Coral EdgeTPU device, which requires special drivers: `gasket` and
`apex`.

Frigate is configured via a (rather complex) YAML document, some of the
schema of which is modeled in `schema.cue` (the parts I need, anyway).

[Frigate]: https://frigate.video/
2024-04-05 20:27:00 -05:00
Dustin c4dcb5a8de loki: Enable auto-restart
Sometimes Loki fails to start or otherwise isn't running.  To minimize
loss of log data, we need it to restart automatically when possible.
2024-03-28 10:11:38 -05:00
Dustin ba5ba257c1 loki: Increase start timeout
It can sometimes take a very long time for Loki to start, for reasons
that are not entirely clear...
2024-03-28 10:09:01 -05:00
Dustin d989994f25 serterm: Deploy serial terminal server
The serial terminal server ("serterm") is a collection of scripts that
automate launching multiple `picocom` processes, one per USB-serial
adapter connected to the system.  Each `picocom` process has its own
window in a `tmux` session, which is accessible via SSH on a dedicated
port (20022).  Clients connecting to that SSH server will be
automatically attached to the `tmux` session, allowing them to access
the serial terminal server quickly and easily.  The SSH server only
allows public-key authentication, so the authorized keys have to be
pre-configured.

In addition to automatically launching `picocom` windows for each serial
port when the terminal server starts, ports that are added (hot-plugged)
while the server is running will have windows created for them
automatically, by way of a udev rule.

Each `picocom` process is configured to log communications with its
respective serial port.  This may be useful, for example, to find
diagnostic messages that may not be captured by the `tmux` scrollback
buffer.
2024-03-21 21:24:12 -05:00
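As a rough illustration of the "one `picocom` window per serial adapter" arrangement described in the commit above, the following Python sketch builds the `tmux` command line for each device.  The real serterm scripts are not shown in this log, so everything here is an assumption: the function names, the `/dev/ttyUSB*` device pattern, the session name, the log directory, the baud rate, and the use of picocom's `--logfile` option (available in picocom 3.x, as far as I know).

```python
import glob
import os


def usb_serial_devices(pattern="/dev/ttyUSB*"):
    """List USB-serial adapters currently present (assumed naming pattern)."""
    return glob.glob(pattern)


def picocom_window_commands(devices, session="serterm",
                            log_dir="/var/log/serterm", baud=115200):
    """Return one `tmux new-window` argv per serial device.

    Each window runs its own picocom process and logs to its own file.
    """
    commands = []
    for dev in sorted(devices):
        name = os.path.basename(dev)
        log_file = os.path.join(log_dir, name + ".log")
        commands.append([
            "tmux", "new-window", "-d", "-t", session + ":", "-n", name,
            "picocom", "--baud", str(baud), "--logfile", log_file, dev,
        ])
    return commands
```

The same function could serve the hot-plug case: a udev-triggered helper would call it with the single newly added device.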
Dustin 9779ac795d Merge branch 'promtail' 2024-02-21 07:48:42 -06:00
Dustin 01d8f7043b loki: Require X-Grafana-User HTTP header
I discovered today that if anonymous Grafana users have Viewer
permission, they can use the Datasource API to make arbitrary queries
to any backend, even if they cannot access the Explore page directly.
This is documented ([issue #48313][0]) as expected behavior.

I don't really mind giving anonymous access to the Victoria Metrics
datasource, but I definitely don't want anonymous users to be able to
make Loki queries and view log data.  Since Grafana Datasource
Permissions is limited to Grafana Enterprise and not available in
the open source version of Grafana, the official recommendation from
upstream is to use a separate Organization for the Loki datasource.
Unfortunately, this would preclude having dashboards that have graphs
from both data sources.  Although I don't have any of those right now, I
like the idea and may build some eventually.

Fortunately, I discovered the `send_user_header` Grafana configuration
option.  With this enabled, Grafana will send an `X-Grafana-User` header
with the username of the user on whose behalf it is making a request to
the backend.  If the user is not logged in, it does not send the header.
Thus, we can detect the presence of this header on the backend and
refuse to serve query requests if it is missing.

[0]: https://github.com/grafana/grafana/issues/48313
2024-02-21 07:47:51 -06:00
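The header check described in the commit above amounts to a very small decision.  The sketch below expresses it in Python purely for illustration; in the repository the check lives in the reverse proxy configuration, and the function name `may_query` is hypothetical.

```python
def may_query(headers):
    """Return True only if Grafana identified a logged-in user.

    Grafana omits X-Grafana-User for anonymous sessions when
    `send_user_header` is enabled, so a missing or empty header means
    the request should be refused.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return bool(normalized.get("x-grafana-user", "").strip())
```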
Dustin cdd6a62b5d promtail: Update loki port
With Loki behind a reverse proxy now, clients access it using the
default HTTPS port (443).
2024-02-21 07:47:51 -06:00
Dustin 878ff7acb5 loki: Deploy Caddy in front of Loki
Grafana Loki explicitly eschews built-in authentication.  In fact, its
[documentation][0] states:

> Operators are expected to run an authenticating reverse proxy in front
> of your services.

While I don't really want to require authentication for agents sending
logs, I definitely want to restrict querying and viewing logs to trusted
users.

There are _many_ reverse proxy servers available, and normally I would
choose _nginx_.  In this case, though, I decided to try Caddy, mostly
because of its built-in ACME support.  I wasn't really happy with how
the `fetchcert` system turned out, particularly using the Kubernetes API
token for authentication.  Since the token will eventually expire, it
will require manual intervention to renew, thus mostly defeating the
purpose of having an auto-renewing certificate.  So instead of using
_cert-manager_ to issue the certificate and store it in Kubernetes, and
then having `fetchcert` download it via the Kubernetes API, I set up
_step-ca_ to handle issuing the certificate directly to the server. When
Caddy starts up, it contacts _step-ca_ via ACME and handles the
challenge verification automatically.  Further, it will automatically
renew the certificate as necessary, again using ACME.

I didn't spend a lot of time optimizing the Caddy configuration, so
there's some duplication there (i.e. the multiple `reverse_proxy`
statements), but the configuration works as desired.  Clients may
provide a certificate, which will be verified against the trusted issuer
CA.  If the certificate is valid, the client may access any Loki
resource.  Clients that do not provide a certificate can only access the
ingestion path, as well as the "ready" and "metrics" resources.

[0]: https://grafana.com/docs/loki/latest/operations/authentication/
2024-02-21 07:47:51 -06:00
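The access policy from the commit above can be summarized in a few lines.  This Python sketch is only a model of the decision, not the Caddy configuration itself.  The concrete paths are assumptions: the commit mentions only the "ingestion path" and the "ready" and "metrics" resources, and the values below are the standard Loki endpoints for those.

```python
# Assumption: paths reachable without a client certificate.
OPEN_PATHS = {"/loki/api/v1/push", "/ready", "/metrics"}


def is_allowed(path, has_valid_client_cert):
    """Decide whether a request may be proxied through to Loki."""
    if has_valid_client_cert:
        # A certificate verified against the trusted issuer CA unlocks
        # every Loki resource.
        return True
    # Everyone else may only push logs or hit the health/metrics endpoints.
    return path in OPEN_PATHS
```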
Dustin 5e10f2c1e7 promtail: Increase start timeout
The Promtail container image is pretty big, so it takes quite some time
to pull on a slow machine like a Raspberry Pi.  Let's increase the
startup timeout so the service is less likely to fail while the image is
still being pulled.
2024-02-20 07:27:11 -06:00
Dustin ae948489e3 Deploy Promtail to all non-Kubernetes nodes
All the stand-alone FCOS hosts now have Promtail running, forwarding
_systemd_ journal messages to Grafana Loki.  The Kubernetes nodes will
have Promtail deployed as a Kubernetes pod.

I would really like to come up with a way to define variables for groups
of hosts, so that I do not have to add `promtail: prod.#promtail` to
every host's values file individually...
2024-02-18 12:59:14 -06:00
Dustin 45c35c065a promtail: Deploy Loki Promtail Agent
[Promtail][0] is the log collection agent for Grafana Loki.  It reads
logs from various locations, including local files and the _systemd_
journal and sends them to Loki via HTTP.

Promtail configuration is a highly-structured YAML document.  Thus, instead
of using Tera template syntax for loops, conditionals, etc., we can use
the full power of CUE to construct the configuration.  Using the
`Marshal` function from the built-in `encoding/yaml` package, we
serialize the final configuration structure as a string and write it
verbatim to the configuration file.

I have modeled most of the Promtail configuration schema in the
`du5t1n.me/cfg/app/promtail/schema` package.  Having the schema modeled
will ensure the generated configuration is valid during development
(i.e. `cue export` will fail if it is not), which will save time pushing
changes to machines and having Loki complain.

The `#promtail` "function" in `du5t1n.me/cfg/env/prod` makes it easy to
build our desired configuration.  It accepts an optional `#scrape`
field, which can be used to provide specific log scraping definitions.
If it is unspecified, the default configuration is to scrape the systemd
journal.  Hosts with additional needs can supply their own list,
probably including the `promtail.scrape.journal` object in it to get the
default journal scrape job.

[0]: https://grafana.com/docs/loki/latest/send-data/promtail/
2024-02-18 11:35:13 -06:00
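The "function with an optional scrape list that defaults to the journal" pattern from the commit above is written in CUE in the repository.  The following Python analogue is only meant to show the shape of the idea.  The names `promtail_config`, `render`, and `JOURNAL_SCRAPE` are hypothetical, and JSON is used as a stand-in for YAML serialization (JSON documents are generally accepted by YAML 1.2 parsers), so this is not what `yaml.Marshal` would emit.

```python
import json

# Default scrape job: read the systemd journal.
JOURNAL_SCRAPE = {
    "job_name": "journal",
    "journal": {"labels": {"job": "systemd-journal"}},
}


def promtail_config(loki_url, scrape=None):
    """Build a Promtail configuration structure.

    If `scrape` is not given, the journal job is used.  Hosts with extra
    needs pass their own list, usually including JOURNAL_SCRAPE.
    """
    if scrape is None:
        scrape = [JOURNAL_SCRAPE]
    return {
        "clients": [{"url": loki_url}],
        "scrape_configs": list(scrape),
    }


def render(config):
    """Serialize the whole structure in one step, with no text templating."""
    return json.dumps(config, indent=2, sort_keys=True)
```

Building the structure first and serializing it once at the end is what removes the need for template loops and conditionals.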
Dustin 4608f19724 loki: Add ExecReload to systemd service unit
According to the [Grafana Loki documentation][0], sending SIGHUP to the
Loki process will instruct it to reload its configuration.  This is
necessary in order for it to re-read its server certificate after it has
been renewed.

[0]: https://grafana.com/docs/loki/latest/configure/#reload-at-runtime
2024-02-18 11:35:13 -06:00
Dustin 011058aec3 loki: Use fetchcert to manage server certificate
Before going into production with Grafana Loki, I want to set it up to
use TLS.  To that end, I have configured _cert-manager_ to issue it a
certificate, signed by _DCH CA_.  In order to use said certificate,
we need to configure `fetchcert` to run on the Loki server.
2024-02-18 11:35:13 -06:00
Dustin 29afcae52e fetchcert: Deploy tool to get cert from k8s Secret
The `fetchcert` tool is a short shell script that fetches an X.509
certificate and corresponding private key from a Kubernetes Secret,
using the Kubernetes API.  I originally wrote it for the Frigate server
so it could fetch the _pyrocufflink.blue_ wildcard certificate, which is
managed by _cert-manager_.  Since then, I have adapted it to be more
generic, so it will be useful to fetch the _loki.pyrocufflink.blue_
certificate for Grafana Loki.

Although the script is rather simple, it does have several required
configuration parameters.  It needs to know the URL of the Kubernetes
API server and have the certificate for the CA that signs the server
certificate, as well as an authorization token.  It also needs to know
the namespace and name of the Secret from which it will fetch the
certificate and private key.  Finally, it needs to know the paths to the
files where the fetched data will be written.

Generally, after certificates are updated, some action needs to be
performed in order to make use of them.  This typically involves
restarting or reloading a daemon.  Since the `fetchcert` tool runs in
a container, it can't directly perform those actions, so it simply
indicates via a special exit code that the certificate has been updated
and some further action may be needed.  The
`/etc/fetchcert/postupdate.sh` script is executed by _systemd_ after
`fetchcert` finishes.  If the `EXIT_STATUS` environment variable (which
is set by _systemd_ to the return code of the main service process)
matches the expected code, the configured post-update actions will be
executed.
2024-02-18 10:48:01 -06:00
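The flow described in the commit above (decode the Secret, write the files, signal "updated" through the exit code, and let a post-update hook react to `EXIT_STATUS`) might look roughly like this.  The actual tool is a shell script that is not reproduced in this log; this Python sketch is an assumption-laden model of it.  In particular, the exit code value `20` and all function names are hypothetical, and the Secret is assumed to use the conventional `tls.crt` and `tls.key` keys.

```python
import base64

EXIT_UPDATED = 20  # hypothetical "certificate changed" exit code


def extract_tls(secret):
    """Decode certificate and key from a TLS Secret object (parsed JSON)."""
    data = secret["data"]  # values are base64-encoded in the Kubernetes API
    return base64.b64decode(data["tls.crt"]), base64.b64decode(data["tls.key"])


def write_if_changed(path, content):
    """Write `content` to `path`; return True if the file was changed."""
    try:
        with open(path, "rb") as f:
            if f.read() == content:
                return False
    except FileNotFoundError:
        pass
    with open(path, "wb") as f:
        f.write(content)
    return True


def fetch_exit_code(secret, cert_path, key_path):
    """Return EXIT_UPDATED if either file changed, otherwise 0."""
    cert, key = extract_tls(secret)
    # Build a list so both writes always happen (no short-circuiting).
    changed = [write_if_changed(cert_path, cert),
               write_if_changed(key_path, key)]
    return EXIT_UPDATED if any(changed) else 0


def should_run_post_update(environ):
    """Post-update hook check: compare EXIT_STATUS to the expected code."""
    return environ.get("EXIT_STATUS") == str(EXIT_UPDATED)
```

A real implementation would also want to restrict the permissions of the private key file, which this sketch leaves out.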
Dustin f793249ed3 collectd: df: Ignore autofs mount points
When _collectd_ calls *statvfs(3)* on paths like
`/host/proc/sys/fs/binfmt_misc` which are configured for auto-mounting,
_systemd_ logs hundreds of messages like these:

```
systemd[1]: proc-sys-fs-binfmt_misc.automount: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 1303 (reader#3)
systemd[1]: proc-sys-fs-binfmt_misc.automount: Automount point already active?
```

Eventually, _collectd_ logs an error:

```
collectd[1132]: statvfs(/host/proc/sys/fs/binfmt_misc) failed: Too many levels of symbolic links
```

This happens on every scrape interval.

To avoid this, we can configure _collectd_ to skip calling *statvfs(3)*
on _autofs_ mount points.  Even if it did work correctly, we wouldn't
really want _collectd_ triggering automounts; that would pretty much
defeat the purpose of them.
2024-02-17 21:36:21 -06:00
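The fix in the commit above is a collectd configuration change, but the underlying rule is simply "filter by filesystem type before calling *statvfs(3)*".  The sketch below models that rule in Python over `/proc/mounts`-style text; the function name and the sample input used to exercise it are made up for illustration.

```python
def mount_points_to_stat(mounts_text, ignored_fstypes=("autofs",)):
    """Return mount points worth checking, skipping ignored filesystem types.

    Input lines follow the /proc/mounts layout:
        device mount_point fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        _device, mount_point, fstype = fields[:3]
        if fstype in ignored_fstypes:
            # Never touch autofs paths: a statvfs call there would
            # trigger (or fail on) an automount.
            continue
        result.append(mount_point)
    return result
```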
Dustin b51428c363 Merge branch 'loki' 2024-02-17 16:49:35 -06:00
Dustin 2a84d810e0 reload-udev-rules: Add delay before copying files
Since *systemd* starts the *reload-udev-rules.service* unit as soon as
any file in the `/run/containers/udev-rules` directory changes, the `cp`
command may start before all of the files have been copied out of the
container.  If this happens, some of the rules will not get copied to
the final path, and thus will not be processed by *udev*.

To give the container a chance to finish copying all of the files before
we process them, we need a bit of a delay.  Obviously, this is not a
perfect solution, as it could potentially take longer than 250ms to copy
the files in some cases, but hopefully those cases are rare enough to
not worry about.
2024-02-15 10:08:52 -06:00
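The "wait briefly, then copy" workaround from the commit above can be sketched as follows.  This is a Python illustration of the approach rather than the unit's actual `cp` invocation.  The function name, the `.rules` suffix filter, and the destination directory parameter are assumptions; only the 250 ms delay comes from the commit message.

```python
import os
import shutil
import time


def copy_rules_after_delay(src_dir, dst_dir, delay=0.25):
    """Pause, then copy every *.rules file from src_dir to dst_dir.

    The pause gives the container time to finish writing its files.  It is
    a heuristic: a slow copy could still outlast the delay.
    """
    time.sleep(delay)
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        if name.endswith(".rules"):
            shutil.copy2(os.path.join(src_dir, name),
                         os.path.join(dst_dir, name))
            copied.append(name)
    return copied
```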
Dustin ffe450cd30 loki: Run Grafana Loki in a container
Deploying Loki is pretty straightforward.  It just needs a container
unit file and a basic YAML configuration file.
2024-02-13 19:54:48 -06:00
Dustin 45285b9c47 host: Add loki0.p.b
*loki0.pyrocufflink.blue* will host [Grafana Loki][0], a log aggregation
system.

[0]: https://grafana.com/oss/loki/
2024-02-13 16:55:05 -06:00
Dustin 1738e4a1f1 host: Add k8s-aarch64-n{0,1} 2024-02-03 11:16:52 -06:00
Dustin 786145e914 env/prod: Collect common templates in module
In order to simplify the process of adding new template render
instructions to all hosts, I've created a list of templates in the
`env/prod` module.  This way, I only have to add templates there, and
all hosts that "inherit" from it will automatically get them.
2024-02-03 11:16:52 -06:00
Dustin b7f5d4a910 app/ssh: Configure sshd trusted user CA keys
Configuring the system-wide trusted user CA key list for *sshd(8)*.
2024-02-03 11:16:52 -06:00
Dustin afd65ea9b8 host/nvr1: Fix cue package name 2024-02-03 11:13:42 -06:00
Dustin 073f7a6845 host: Add k8s-amd64-n3
*k8s-amd64-n3.pyrocufflink.blue* is a Kubernetes worker node.
2024-02-03 11:12:55 -06:00
Dustin f886a1bd8a sudo: Configure pam_ssh_agent_auth
I do not like how Fedora CoreOS configures `sudo` to allow the *core*
user to run privileged processes without authentication.  Rather than
assign the user a password, which would then have to be stored
somewhere, we'll install *pam_ssh_agent_auth* and configure `sudo` to
use it for authentication.  This way, only users with the private key
corresponding to one of the configured public keys can run `sudo`.

Naturally, *pam_ssh_agent_auth* has to be installed on the host system.
We achieve this by executing `rpm-ostree` via `nsenter` to escape the
container.  Once it is installed, we configure the PAM stack for
`sudo` to use it and populate the authorized keys database.  We also
need to configure `sudo` to keep the `SSH_AUTH_SOCK` environment
variable, so *pam_ssh_agent_auth* knows where to look for the private
keys.  Finally, we disable the default NOPASSWD rule for `sudo`, if
and only if the new configuration was installed.
2024-01-29 09:10:42 -06:00
Dustin d6751af326 prod/nut: Require both UPS to be online
Unfortunately, the automatic transfer switch does not seem to work
correctly.  When the standby source is a UPS running on battery, it does
*not* switch sources if the primary fails.  In other words, when the
power is out and both UPS are running on battery, when the first one
dies, it will NOT switch to the second one.  It has no trouble switching
when the second source is mains power, though, which is very strange.

I have tried messing with all the settings including nominal input
voltage, sensitivity, and frequency tolerance, but none seem to have any
effect.

Since it is more important for the machines to shut down safely than it
is to have an extra 10-15 minutes of runtime during an outage, the best
solution for now is to configure the hosts to shut down as soon as the
first UPS battery gets low.  This is largely a waste of the second UPS,
but at least it will help prevent data loss.
2024-01-25 21:12:33 -06:00
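The policy in the commit above ("shut down as soon as the first UPS battery gets low") can be modeled as a minimum-supplies rule.  The sketch below is a simplified Python model, not NUT's actual logic.  It assumes each UPS counts as one power supply, that a UPS is critical when its status contains both `OB` (on battery) and `LB` (low battery), and that the deployed change corresponds to requiring two healthy supplies; none of those details are spelled out in the commit.

```python
def is_critical(status):
    """Treat a UPS as critical when it is on battery and the battery is low."""
    flags = status.split()
    return "OB" in flags and "LB" in flags


def should_shut_down(statuses, min_supplies):
    """Shut down when fewer than `min_supplies` UPSs are still healthy.

    `statuses` maps a UPS name to its status string, e.g. "OL" or "OB LB".
    """
    healthy = sum(1 for status in statuses.values() if not is_critical(status))
    return healthy < min_supplies
```

With two UPSs and `min_supplies=2`, the first low battery triggers a shutdown; with `min_supplies=1`, the host would instead keep running and rely on the transfer switch.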
Dustin bd18d3a734 host: Add serial1.p.b
*serial1.pyrocufflink.blue* is a replacement for *serial0.p.b*.  It runs
Fedora CoreOS and just has `picocom` and `tmux`.
2024-01-25 20:17:00 -06:00
Dustin e25aa15eb1 prod/nut: Add user for dustin
The `upsrw` command, which is used to set individual UPS configuration
parameters like low battery level, etc., needs a username and password
to authenticate to `upsd`.
2024-01-21 15:10:56 -06:00
Dustin 0450617ae6 prod/nut: Add upsmon user for burp1 2024-01-19 20:57:47 -06:00
Dustin 48145c3573 nut: Enable Podman auto-update for containers
Setting `AutoUpdate=registry` will tell Podman to automatically fetch
an updated container image from its corresponding registry and restart
the container.  The `podman-auto-update.timer` systemd unit needs to be
active for this to happen on a schedule.
2024-01-19 20:10:11 -06:00
Dustin 668b79aaac prod/nut: Add upsmon passwords for gw1, vmhost{0,1} 2024-01-19 19:56:20 -06:00
Dustin f4938c57e1 prod/nut: Reset password for nvr1
The original password worked, but caused a warning in the `upsd` log:

> Ignoring duplicate password for nvr1
2024-01-19 19:32:08 -06:00
Dustin bb3705939e nut: Fix upsmon reload hook
`upsmon.conf` is used by *nut-monitor* (`upsmon`) rather than
*nut-server* (`upsd`).
2024-01-19 18:01:42 -06:00
Dustin 36fd137897 nut: Infer role from server name, set commands
Since the "primary" `upsmon` is always (for our purposes) running on the
same host as `upsd`, there's no reason to specify both values.

All systems need a shutdown command; one is not set by default.

The primary system is the only one that should send notifications.
2024-01-19 17:57:20 -06:00
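The inference described in the commit above could be sketched like this.  It is a hypothetical Python model of logic that the repository expresses in CUE; the function names and the set of "local" aliases are assumptions.

```python
def infer_role(server, local_hostname):
    """Infer the upsmon role from the upsd server name.

    If the monitored upsd runs on this host, this upsmon is the primary;
    otherwise it is a secondary.
    """
    local_names = {"localhost", "127.0.0.1", "::1", local_hostname}
    return "primary" if server in local_names else "secondary"


def sends_notifications(role):
    """Only the primary system should send notifications."""
    return role == "primary"
```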
Dustin caccffcb65 nut: split out template for sysusers.d config
Hosts that run `upsmon` but not `upsd` still need the *nut* user.
2024-01-19 17:21:23 -06:00
Dustin ad42c2d883 nvr1: Add instructions to configure upsmon
*nvr1.pyrocufflink.blue* will run `upsmon` so it can shut itself down
safely when the power goes out.
2024-01-19 16:57:47 -06:00
Dustin a919a9f94b nut/monitor: Fix tmpfs mount syntax
`dest` is not a valid option for the `--mount` argument to `podman`.  To
specify the target path, only `target`, `destination`, and `dst` are
valid.
2024-01-19 16:42:56 -06:00
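A small guard against this class of typo might look like the following Python sketch.  The helper name `tmpfs_mount_arg` is hypothetical; the accepted key names are the three listed in the commit message above.

```python
VALID_TARGET_KEYS = ("target", "destination", "dst")


def tmpfs_mount_arg(path, key="destination"):
    """Build a `--mount` argument for a tmpfs, rejecting unknown target keys."""
    if key not in VALID_TARGET_KEYS:
        raise ValueError(
            "invalid mount option %r; use one of %s"
            % (key, ", ".join(VALID_TARGET_KEYS))
        )
    return "--mount=type=tmpfs,%s=%s" % (key, path)
```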
Dustin fb74f0e81c nut: Configure upsmon
`upsmon` is the component of NUT that tracks the status of UPSs and
reacts to their changing by sending notifications and/or shutting down
the system.  It is a networked application that can run on any system;
it can run on a different system than `upsd`, and indeed can run on
multiple systems simultaneously.

Each system that runs `upsmon` will need a username and password for
each UPS it will monitor.  Using the CUE [function pattern][0], I've
made it pretty simple to declare the necessary values under
`nut.monitor`.

[0]: https://cuetorials.com/patterns/functions/
2024-01-19 08:52:14 -06:00
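To illustrate what declaring values under `nut.monitor` ultimately has to produce, here is a Python sketch that renders `MONITOR` lines in the `upsmon.conf` format (`MONITOR <ups>@<host> <powervalue> <username> <password> <type>`).  The repository does this with CUE and templates; the function name and the dictionary keys below are hypothetical.

```python
def monitor_lines(monitors):
    """Render one upsmon.conf MONITOR directive per monitored UPS.

    Each entry needs: ups, server, username, password, role.
    `power_value` is optional and defaults to 1.
    """
    lines = []
    for m in monitors:
        lines.append(
            "MONITOR {ups}@{server} {power} {user} {password} {role}".format(
                ups=m["ups"],
                server=m["server"],
                power=m.get("power_value", 1),
                user=m["username"],
                password=m["password"],
                role=m["role"],
            )
        )
    return lines
```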
Dustin 227ce8cfcf collectd: Bind-mount journal log socket
*collectd* logs to syslog, so its output is lost when it's running in a
container.  We can capture messages from it by mounting the journald
syslog socket into the container.
2024-01-18 20:35:22 -06:00