https://20125.home/ is the URL the Status Android application loads in
its main WebView. This site is powered by a server that generates a
custom page showing the status of our self-hosted applications, based on
alerts retrieved from the AlertManager API.
Android WebView does not allow cleartext HTTP connections. It does,
however, allow connecting an HTTPS server and ignoring the certificate
it presents, which is effectively the same thing. Thus, we generate a
self-signed certificate for the Ingress for this site.
Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal
circumstances on aarch64 machines. This is not a problem, because it
cleans up old files on its own, so we do not need to alert on it.
Unfortunately, the _DiskUsage_ alert is already quite complex, and
adding in exclusions for these devices would make it even worse.
To simplify the logic, we can use a recording rule to precomupte the
used/free space ratio. By using `sum(...) without (type)` instead of
`sum(...) on (df, instance)`, we keep the other labels, which we can
then use to identify the metrics coming from machines we don't care to
monitor.
Instead of having different thresholds for different volumes
encoded in the same expression, we can use multiple alerts to alert on
"low" vs "very low" thresholds. Since this will of course cause
duplicate alerts for most volumes, we can use AlertManager inhibition
rules to disable the "low" alert once the metric crosses the "very low"
threshold.
*loki1.pyrocufflink.blue* is a regular Fedora machine, a member of the
AD domain, and managed by Ansible. Thus, it does not need to be
explicitly listed as a scrape target.
For scraping metrics from Loki itself, I've changed the job to use
DNS-SD because it seems like `vmagent` does _not_ re-resolve host names
from static configuration.
The `flower_events_total` metric is a counter, so its value only ever
increases (discounting restarts of the server process). As such,
nonzero values do not necessarily indicate a _current_ problem, but
rather that there was one at some point in the past. To identify
current issues, we need to use the `increase` function, and then apply
the `max_over_time` function so that the alert doesn't immediately reset
itself.
The Gotenberg container image uses UID 1001 for the _gotenberg_ user.
Using any other UID number, even when the home directory is set and
owned by that UID, results in random issues, especially when using
LibreOffice conversions.
The Paperless-ngx ecosystem consists of several services. Defining the
resources for each service in separate manifest files will make
maintenance a little bit easier.
Longhorn uses a special Secret resource to configure the backup target.
This secret includes the credentials and CA certificate for accessing
the MinIO S3 service.
Longhorn must be configured to use this Secret by setting the
`backup-target-credential-secret` setting to
`minio-backups-credentials`.
I was doing this to monitor Jenkins's certificate, but since that's
managed by _cert-manager_, there's really practically no risk of it
expiring without warning anymore. Since Jenkins is already being
scraped directly, having this extra check just gernerates extra
notifications when there is an issue without adding any real value.
Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues. I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up." I think it makes
sense, though, to just ping the upstream gateway for that check. If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
The alerts for Z-Wave device batteries in particular are pretty
annoying, as they tend to "flap" for some reason. I like having the
alerts show up on Alertmanager/Grafana dashboards, but I don't
necessarily need notifications about them. Fortunately, we can create a
special "none" receiver and route notifications there, which does
exactly what we want here.
Using Kustomize, we can define the configuration file separately from
the Kubernetes resources, and use `configMapGenerators` to generate the
ConfigMap for it. Additionally, this will make it possible to update
_ntfy_ using `updatebot`.
Tabitha wants to be able to accept Apple Pay payemnts via stripe, but
this requires an additional "domain verification" step. Apple needs to
make an HTTP request to the domain owned by the vendor, which in the
case of Invoice Ninja, must be the "app URL." Unfortunately, there
does not appear to be a way to tell Apple/Stripe/IN to use the client
portal domain or any other domain besides the app URL. Therefore, we
need to expose Invoice Ninja to the Internet under the public
_pyrocufflink.net_ domain, rather than the internal _pyrocufflink.blue_.
Let's run `updatebot` on Saturday morning, so I can apply the changes
over the weekend if I have time. If I don't, there's no harm in having
the PRs open for a few days until I can get to it during the week.