As with AlertManager, the point of having multiple replicas of `vmagent`
is so that one is always running, even if the other fails. Thus, we
want to start the pods in parallel so that if the first one does not
come up, the second one at least has a chance.
If something prevents the first AlertManager instance from starting, we
don't want to wait forever for it before starting the second. That
pretty much defeats the purpose of having two instances. Fortunately,
we can configure Kubernetes to bring up both instances simultaneously by
setting the pod management policy to `Parallel`.
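In the StatefulSet spec, that's a one-line change; a rough sketch (the
names, labels, and image here are just illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
spec:
  replicas: 2
  serviceName: alertmanager
  podManagementPolicy: Parallel  # default is OrderedReady, which starts pods one at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager
  template:
    metadata:
      labels:
        app.kubernetes.io/name: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: quay.io/prometheus/alertmanager  # illustrative
```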
We also don't need a 4 GB volume for AlertManager; even 500 MB is
way too big for the tiny amount of data it stores, but that's about the
smallest size a filesystem can be.
_bw0.pyrocufflink.blue_ was decommissioned some time ago, so it
doesn't get backed up any more. We want to keep its previous backups
around, though, in case we ever need to restore something. This
triggers the "no recent backups" alert, since the last snapshot is over
a week old. Let's ignore that hostname when generating this alert.
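The exclusion goes in the alert expression's label matcher; here's a
sketch with placeholder metric and alert names (not the real ones):

```yaml
- alert: NoRecentBackup
  expr: |
    # placeholder metric name; exclude the decommissioned host
    time() - backup_last_success_timestamp_seconds{host!="bw0.pyrocufflink.blue"}
      > 7 * 86400
  annotations:
    summary: '{{ $labels.host }} has not been backed up in over a week'
```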
The `vmagent` needs a place to spool data it has not yet sent to
Victoria Metrics, but it doesn't really need to be persistent. As long
as all of the `vmagent` nodes _and_ all of the `vminsert` nodes do not
go down simultaneously, there shouldn't be any data loss. If they are
all down at the same time, there's probably something else going on and
lost metrics are the least concerning problem.
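A rough sketch of the pod spec, pointing the spool directory at an
`emptyDir` via vmagent's `-remoteWrite.tmpDataPath` flag (image and
paths are illustrative):

```yaml
containers:
  - name: vmagent
    image: victoriametrics/vmagent  # illustrative
    args:
      - -remoteWrite.tmpDataPath=/spool
    volumeMounts:
      - name: spool
        mountPath: /spool
volumes:
  - name: spool
    emptyDir: {}  # ephemeral; lost if the pod is rescheduled, which is fine here
```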
Music Assistant doesn't expose any metrics natively. Since we really
only care about whether or not it's accessible, scraping it with the
blackbox exporter is fine.
_Firefly III_ and _phpipam_ don't export any Prometheus metrics, so we
have to scrape them via the Blackbox Exporter.
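These all follow the usual Blackbox Exporter relabeling pattern; a
sketch, with placeholder hostnames, module name, and exporter address:

```yaml
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://firefly.example.com/
        - https://phpipam.example.com/
        - https://music.example.com/
  relabel_configs:
    # pass the target URL to the exporter as the ?target= parameter
    - source_labels: [__address__]
      target_label: __param_target
    # keep the probed URL as the instance label
    - source_labels: [__param_target]
      target_label: instance
    # actually scrape the Blackbox Exporter itself
    - target_label: __address__
      replacement: blackbox-exporter:9115
```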
Paperless-ngx only exposes metrics via Flower, but since Flower runs in
the same container as the main application, we can assume that if
Flower is unavailable, the application is as well.
Docker Hub has blocked ("rate limited") my IP address. Moving as much
as I can to use images from other sources. Hopefully they'll unblock me
soon and I can deploy a caching proxy.
Scraping metrics from the Kubernetes API server has started taking 20+
seconds recently.  Until I figure out the underlying cause, I'm
increasing the scrape timeout so that the _vmagent_ doesn't give up and
report the API server as "down."
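Roughly, in the scrape job definition (the job name and the exact
values here are illustrative):

```yaml
- job_name: kubernetes-apiservers
  scheme: https
  scrape_interval: 1m
  scrape_timeout: 50s  # default is 10s; must not exceed the scrape interval
  kubernetes_sd_configs:
    - role: endpoints
```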
*unifi3.pyrocufflink.blue* has been replaced by
*unifi-nuptials.host.pyrocufflink.black*. The former was the last
Fedora CoreOS machine in use, so the entire Zincati scrape job is no
longer needed.
The `pg_stat_archiver_failed_count` metric is a counter, so once a WAL
archival has failed, it will increase and never return to `0`. To
ensure the alert is resolved once the WAL archival process recovers, we
need to use the `increase` function to turn it into a gauge. Finally,
we aggregate that gauge with `max_over_time` to keep the alert from
flapping if the WAL archive occurs less frequently than the scrape
interval.
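The resulting expression looks something like this; the alert name and
window sizes are illustrative:

```yaml
- alert: WALArchivalFailing
  expr: |
    # increase() drops back to 0 once failures stop; the subquery with
    # max_over_time holds the alert up long enough to avoid flapping
    max_over_time(
      increase(pg_stat_archiver_failed_count[10m])[1h:]
    ) > 0
```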
At some point this week, the front porch camera stopped sending video.
I'm not sure exactly what happened to it, but Frigate kept logging
"Unable to read frames from ffmpeg process." I power-cycled the camera,
which resolved the issue.
Unfortunately, no alerts were generated about this situation. Home
Assistant did not consider the camera entity unavailable, presumably
because Frigate was still reporting stats about it. Thus, I missed
several important notifications. To avoid this in the future, I have
enabled the "Camera FPS" sensors for all of the cameras in Home
Assistant, and added this alert to trigger when the reported framerate
is 0.
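A rough sketch of the rule, assuming the FPS sensors come through the
Home Assistant Prometheus integration as `homeassistant_sensor_state`
with an `entity` label (both of those names are guesses, not
confirmed):

```yaml
- alert: CameraNoFrames
  expr: |
    # entity naming pattern is an assumption
    homeassistant_sensor_state{entity=~"sensor\\..*_camera_fps"} == 0
  for: 5m
  annotations:
    summary: '{{ $labels.entity }} is reporting 0 frames per second'
```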
I really also need to get alerts for log events configured, as that
would also have indicated there was an issue.
We don't need a notification about paperless not scheduling email tasks
every time there is a gap in the metric. This can happen in some
innocuous situations like when the pod restarts or if there is a brief
disruption of service. Using the `absent_over_time` function with a
range vector, we can have the alert fire only if there have been no
email tasks scheduled within the last 12 hours.
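Something along these lines, with a placeholder metric name:

```yaml
- alert: PaperlessEmailTasksNotScheduled
  expr: |
    # fires only if the (placeholder) metric has been missing for the
    # entire 12-hour window, not just during a brief gap
    absent_over_time(paperless_mail_tasks_scheduled[12h])
```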
It turns out this alert is not very useful, and indeed quite annoying.
Many servers can go for days or even weeks with no changes, which is
completely normal.
Nextcloud uses a _client-side_ (Javascript) redirect to navigate the
browser to its `index.php`. The page it serves with this redirect is
static and will often load successfully, even if there is a problem with
the application.  This causes the Blackbox exporter to record the site
as "up," even when it definitely is not.  To avoid this, we can
scrape the `index.php` page explicitly, ensuring that the application is
loaded.
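The only real change is the probe target in the scrape job (the
hostname here is a placeholder):

```yaml
static_configs:
  - targets:
      # probe index.php directly so the PHP application has to respond,
      # rather than the static page that serves the client-side redirect
      - https://nextcloud.example.com/index.php
```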
Just like I did with the RAID-1 array in the old BURP server, I will
keep one member active and one in the fireproof safe, swapping them each
month. We can use the same metrics queries to alert on when the swap
should happen that we used with the BURP server.
The ephemeral Jenkins worker nodes that run in AWS don't have collectd,
promtail, or Zincati.  We don't need to get three alerts every time a
worker starts up to handle an ARM build job, so we drop these discovered
targets for these scrape jobs.
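A sketch of the relabeling rule; the source label and naming pattern
are assumptions:

```yaml
relabel_configs:
  # drop the ephemeral AWS workers from this job's discovered targets
  - source_labels: [__address__]
    regex: jenkins-worker-.*
    action: drop
```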
Paperless-ngx uses a Celery task to process uploaded files, converting
them to PDF, running OCR, etc. This task can be marked as "failed" for
various reasons, most of which are more about the document itself than
the health of the application. The GUI displays the results of failed
tasks when they occur. It doesn't really make sense to have an alert
about this scenario, especially since there's nothing to do to directly
clear the alert anyway.
Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal
circumstances on aarch64 machines. This is not a problem, because it
cleans up old files on its own, so we do not need to alert on it.
Unfortunately, the _DiskUsage_ alert is already quite complex, and
adding in exclusions for these devices would make it even worse.
To simplify the logic, we can use a recording rule to precompute the
used/free space ratio. By using `sum(...) without (type)` instead of
`sum(...) by (df, instance)`, we keep the other labels, which we can
then use to identify the metrics coming from machines we don't care to
monitor.
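A rough sketch of the recording rule, assuming collectd's `df` plugin
metric names:

```yaml
groups:
  - name: disk
    rules:
      - record: df:df_complex:used_ratio
        expr: |
          # "without (type)" keeps all the other labels, so the alert
          # expressions can still filter on them
          sum without (type) (collectd_df_df_complex{type="used"})
            /
          sum without (type) (collectd_df_df_complex)
```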
Instead of having different thresholds for different volumes
encoded in the same expression, we can use separate alerts for the
"low" and "very low" thresholds.  Since this will of course cause
duplicate alerts for most volumes, we can use AlertManager inhibition
rules to disable the "low" alert once the metric crosses the "very low"
threshold.
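The inhibition rule looks something like this; the alert names are
placeholders:

```yaml
inhibit_rules:
  # once the "very low" alert fires for a filesystem, suppress the
  # corresponding "low" alert for the same instance and mount point
  - source_matchers:
      - alertname = "DiskSpaceVeryLow"
    target_matchers:
      - alertname = "DiskSpaceLow"
    equal:
      - instance
      - df
```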
*loki1.pyrocufflink.blue* is a regular Fedora machine, a member of the
AD domain, and managed by Ansible. Thus, it does not need to be
explicitly listed as a scrape target.
For scraping metrics from Loki itself, I've changed the job to use
DNS-SD because it seems like `vmagent` does _not_ re-resolve host names
from static configuration.
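Roughly (the DNS name and port are placeholders):

```yaml
- job_name: loki
  dns_sd_configs:
    - names:
        - loki.pyrocufflink.blue  # re-resolved on every refresh interval
      type: A
      port: 3100
```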
The `flower_events_total` metric is a counter, so its value only ever
increases (discounting restarts of the server process). As such,
nonzero values do not necessarily indicate a _current_ problem, but
rather that there was one at some point in the past. To identify
current issues, we need to use the `increase` function, and then apply
the `max_over_time` function so that the alert doesn't immediately reset
itself.
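Something like this; the label selector and window sizes are
illustrative:

```yaml
- alert: PaperlessTaskFailed
  expr: |
    # increase() returns to 0 once failures stop; max_over_time keeps
    # the alert from resetting on the very next evaluation
    max_over_time(
      increase(flower_events_total{type="task-failed"}[15m])[1h:]
    ) > 0
```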
I was doing this to monitor Jenkins's certificate, but since that's
managed by _cert-manager_, there's practically no risk of it
expiring without warning anymore.  Since Jenkins is already being
scraped directly, having this extra check just generates extra
notifications when there is an issue without adding any real value.
Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues. I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up." I think it makes
sense, though, to just ping the upstream gateway for that check. If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
The alerts for Z-Wave device batteries in particular are pretty
annoying, as they tend to "flap" for some reason. I like having the
alerts show up on Alertmanager/Grafana dashboards, but I don't
necessarily need notifications about them. Fortunately, we can create a
special "none" receiver and route notifications there, which does
exactly what we want here.
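In the Alertmanager configuration, that looks roughly like this (the
matcher is just an example):

```yaml
receivers:
  # a receiver with no notification integrations: alerts routed here
  # still appear in the Alertmanager and Grafana dashboards
  - name: 'none'
route:
  routes:
    - matchers:
        - alertname = "ZWaveBatteryLow"  # illustrative matcher
      receiver: 'none'
```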
The VM hosts are now managed by the "main" Ansible inventory and thus
appear in the host list ConfigMap. As such, they do not need to be
listed explicitly in the static targets list.
Some machines have the same volume mounted multiple times (e.g.
container hosts, BURP). Alerts will fire for all of these
simultaneously when the filesystem usage passes the threshold. To avoid
getting spammed with a bunch of messages about the same filesystem,
we'll group alerts from the same machine.
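Something like this in the route configuration (grouping by `instance`
is my reading of "same machine"):

```yaml
route:
  group_by:
    # one notification per alert name per machine, covering every
    # mount point on that host
    - alertname
    - instance
```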
I'm not using Matrix for anything anymore, and it seems to have gone
offline. I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.