*loki1.pyrocufflink.blue* is a regular Fedora machine, a member of the
AD domain, and managed by Ansible. Thus, it does not need to be
explicitly listed as a scrape target.
For scraping metrics from Loki itself, I've changed the job to use
DNS-SD because it seems like `vmagent` does _not_ re-resolve host names
from static configuration.
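
As a sketch, the DNS-SD version of the job looks something like this (the job name, port, and refresh interval are assumptions, not the exact values in my config):

```yaml
scrape_configs:
  - job_name: loki                # hypothetical job name
    dns_sd_configs:
      - names:
          - loki1.pyrocufflink.blue
        type: A                   # look up A records instead of the default SRV
        port: 3100                # Loki's default HTTP port; adjust as needed
        refresh_interval: 30s     # re-resolve the name periodically
```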
The `flower_events_total` metric is a counter, so its value only ever
increases (discounting restarts of the server process). As such,
nonzero values do not necessarily indicate a _current_ problem, but
rather that there was one at some point in the past. To identify
current issues, we need to use the `increase` function, and then apply
the `max_over_time` function so that the alert doesn't immediately reset
itself.
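
A minimal sketch of such a rule, assuming the failure events carry a `type` label and using made-up windows and thresholds:

```yaml
groups:
  - name: flower
    rules:
      - alert: CeleryTaskFailures
        # increase() turns the ever-growing counter into "failures in the last
        # 10 minutes"; max_over_time() over a subquery holds the nonzero value
        # for an hour, so the alert does not clear the moment the counter stops
        # increasing.
        expr: >-
          max_over_time(increase(flower_events_total{type="task-failed"}[10m])[1h:]) > 0
        labels:
          severity: warning
        annotations:
          summary: Celery tasks have failed recently
```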
I was doing this to monitor Jenkins's certificate, but since that's
managed by _cert-manager_, there's practically no risk of it
expiring without warning anymore. Since Jenkins is already being
scraped directly, this extra check just generates extra
notifications when there is an issue, without adding any real value.
Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues. I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up." I think it makes
sense, though, to just ping the upstream gateway for that check. If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
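
Roughly, the probe just becomes an ICMP check against the gateway's address; something like this sketch, where the module name, exporter address, and gateway IP are all placeholders:

```yaml
scrape_configs:
  - job_name: blackbox-icmp
    metrics_path: /probe
    params:
      module: [icmp]              # ICMP "ping" module defined in blackbox.yml
    static_configs:
      - targets:
          - 203.0.113.1           # placeholder for the upstream gateway address
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # where Blackbox Exporter listens
```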
The alerts for Z-Wave device batteries in particular are pretty
annoying, as they tend to "flap" for some reason. I like having the
alerts show up on Alertmanager/Grafana dashboards, but I don't
necessarily need notifications about them. Fortunately, we can create a
special "none" receiver and route notifications there, which does
exactly what we want here.
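
In Alertmanager terms, that's a receiver with no notification integrations and a route that matches the battery alerts; an excerpt of what that might look like (the matcher is an assumption about how these alerts are labelled):

```yaml
receivers:
  - name: none            # no notifiers configured, so notifications go nowhere
route:
  routes:
    - matchers:
        - alertname =~ ".*Battery.*"   # hypothetical matcher for the battery alerts
      receiver: none
```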
The VM hosts are now managed by the "main" Ansible inventory and thus
appear in the host list ConfigMap. As such, they do not need to be
listed explicitly in the static targets list.
Some machines have the same volume mounted multiple times (e.g.
container hosts, BURP). Alerts will fire for all of these
simultaneously when the filesystem usage passes the threshold. To avoid
getting spammed with a bunch of messages about the same filesystem,
we'll group alerts from the same machine.
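
Concretely, that means grouping on the *instance* label in the Alertmanager route for these alerts; an excerpt of what that might look like (the alert name is an assumption):

```yaml
route:
  routes:
    - matchers:
        - alertname = FilesystemAlmostFull   # hypothetical alert name
      group_by: [instance]                   # one notification per machine
      group_wait: 1m
      group_interval: 5m
```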
I'm not using Matrix for anything anymore, and it seems to have gone
offline. I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.
This machine never worked correctly; the USB-RS232 adapters would stop
working randomly (and of course it would be whenever I needed to
actually use them). I thought it was something wrong with the server
itself (a Raspberry Pi 3), but the same thing happened when I tried
using a Pi 4.
The new backup server has a plethora of on-board RS-232 ports, so I'm
going to use it as the serial console server, too.
I've rebuilt the Unifi Network controller machine (again);
*unifi3.pyrocufflink.blue* has replaced *unifi2.p.b*. The
`unifi_exporter` no longer works with the latest version of Unifi
Network, so it's not deployed on the new machine.
After the incident this week with the CPU overheating on _vmhost1_, I
want to make sure I know as soon as possible when anything is starting
to get too hot.
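
For illustration only, an alert along those lines might look like the rule below; the metric name is a placeholder, since the real series depends on how the sensor readings are exported:

```yaml
groups:
  - name: hardware
    rules:
      - alert: TemperatureHigh
        # collectd_sensors_temperature is a placeholder metric name; the real
        # series name and threshold depend on the sensors plugin in use.
        expr: collectd_sensors_temperature > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: '{{ $labels.instance }} is running hot'
```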
When Frigate is down, multiple alerts are generated for each camera, as
Home Assistant creates camera entities for each tracked object. This is
extremely annoying, not to mention unnecessary. To address this, we'll
configure AlertManager to send a single notification for alerts in the
group.
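
Similar to the filesystem alerts above, this is just a matter of grouping in the route, e.g. collapsing everything with the same alert name into one notification (the matcher here is a guess at how these alerts are labelled):

```yaml
route:
  routes:
    - matchers:
        - alertname = CameraUnavailable   # hypothetical Home Assistant alert name
      group_by: [alertname]               # one notification covering every camera entity
```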
The [postgres exporter][0] exposes metrics about the operation and
performance of a PostgreSQL server. It's currently deployed on
_db0.pyrocufflink.blue_, the primary server of the main PostgreSQL
cluster.
[0]: https://github.com/prometheus-community/postgres_exporter
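
The scrape job itself is trivial; a sketch, assuming the exporter listens on its default port:

```yaml
scrape_configs:
  - job_name: postgres
    static_configs:
      - targets:
          - db0.pyrocufflink.blue:9187   # postgres_exporter's default port
```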
All the Kubernetes nodes (except *k8s-ctrl0*) are now running Fedora
CoreOS. We can therefore use the Kubernetes API to discover scrape
targets for the Zincati job.
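
Sketched out, the job can discover the nodes via the API and relabel the address to wherever Zincati's metrics are exposed (the port below is an assumption):

```yaml
scrape_configs:
  - job_name: zincati
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Address each node by name rather than by its discovered IP and port
      - source_labels: [__meta_kubernetes_node_name]
        target_label: __address__
        replacement: '${1}:9099'          # assumed port for Zincati metrics
      - source_labels: [__meta_kubernetes_node_name]
        target_label: instance
```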
One of the reasons for moving to 4 `vmstorage` replicas was to ensure
that the load was spread evenly between the physical VM host machines.
To ensure that is the case as much as possible, we need to keep one
pod per Kubernetes node.
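
One way to express that, sketched below, is a topology spread constraint on the `vmstorage` pods (the label selector is an assumption):

```yaml
# Excerpt from the vmstorage pod template
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # spread across individual nodes
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: vmstorage                    # assumed pod label
```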
Longhorn does not work well for very large volumes. It takes ages to
synchronize/rebuild them when migrating between nodes, which happens
all too frequently. This consumes a lot of resources, which impacts
the operation of the rest of the cluster, and can cause a cascading
failure in some circumstances.
Now that the cluster is set up to be able to mount storage directly from
the Synology, it makes sense to move the Victoria Metrics data there as
well. Similar to how I did this with Jenkins, I created
PersistentVolume resources that map to iSCSI volumes, and patched the
PersistentVolumeClaims (or rather the template for them defined by the
StatefulSet) to use these. Each `vmstorage` pod then gets an iSCSI
LUN, bypassing both Longhorn and QEMU to write directly to the NAS.
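
Each PersistentVolume is a plain iSCSI volume definition; a sketch with placeholder portal, IQN, and size values:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vmstorage-0                 # one PV per vmstorage pod (hypothetical name)
spec:
  capacity:
    storage: 200Gi                  # placeholder size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  iscsi:
    targetPortal: 172.30.0.5:3260   # placeholder address for the Synology
    iqn: iqn.2000-01.com.synology:nas.vmstorage-0   # placeholder IQN
    lun: 1
    fsType: xfs
```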
The migration process was relatively straightforward. I started by
scaling down the `vminsert` Deployment so the `vmagent` pods would
queue the metrics they had collected while the storage layer was down.
Next, I created a [native][0] export of all the time series in the
database. Then, I deleted the `vmstorage` StatefulSet and its
associated PVCs. Finally, I applied the updated configuration,
including the new PVs and patched PVCs, and brought the `vminsert`
pods back online. Once everything was up and running, I re-imported
the exported data.
[0]: https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-export-data-in-native-format
By default, Kubernetes waits for each pod in a StatefulSet to become
"ready" before starting the next one. If there is a problem starting
that pod, e.g. data corruption, then the others will never start. This
sort of defeats the purpose of having multiple replicas. Fortunately,
we can configure the pod management policy to start all the pods at
once, regardless of the status of any individual pod. This way, if
there is a problem with the first pod, the others will still come up
and serve whatever data they have.
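
This is a single field on the StatefulSet; a minimal sketch (the names and image are placeholders, `podManagementPolicy: Parallel` is the relevant part):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vmstorage
spec:
  serviceName: vmstorage
  replicas: 4
  # Start all replicas at once instead of waiting for each pod to become ready
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: vmstorage
  template:
    metadata:
      labels:
        app: vmstorage
    spec:
      containers:
        - name: vmstorage
          image: victoriametrics/vmstorage   # placeholder image reference
```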
We don't need to explicitly specify every single host individually.
Domain controllers, for example, are registered in DNS with SRV records.
Kubernetes nodes, of course, can be discovered using the Kubernetes API.
Both of these classes of nodes change frequently, so discovering them
dynamically is convenient.
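
For the domain controllers, for instance, the job can follow the SRV records Active Directory already maintains; a sketch, where the job name and the exporter port in the relabel rule are assumptions:

```yaml
scrape_configs:
  - job_name: domain-controllers    # hypothetical job name
    dns_sd_configs:
      - names:
          # SRV record AD registers for LDAP on every domain controller
          - _ldap._tcp.pyrocufflink.blue
    relabel_configs:
      # The SRV records point at the LDAP port; swap in the exporter port
      - source_labels: [__address__]
        regex: '(.+):\d+'
        target_label: __address__
        replacement: '${1}:9103'    # assumed metrics port on the DCs
```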
Since I added an IPv6 ULA prefix to the "main" VLAN (to allow
communicating with the Synology directly), the domain controllers now
have AAAA records. This causes the `sambadc` scrape job to fail because
Blackbox Exporter prefers IPv6 by default, but Kubernetes pods do not
have IPv6 addresses.
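
The fix is to tell the Blackbox Exporter module this job uses to prefer IPv4 instead (the module name below is an assumption):

```yaml
modules:
  tcp_connect:                      # whichever module the sambadc probe uses
    prober: tcp
    tcp:
      preferred_ip_protocol: ip4    # don't try the AAAA record first
```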
Just having the alert name and group name in the ntfy notification is
not enough to really indicate what the problem is, as some alerts can
generate notifications for many reasons. In the email notifications
AlertManager sends by default, the values (but not the keys) of all
labels are included in the subject, so we will reproduce that here.
I don't like having alerts sent by e-mail. Since I don't get e-mail
notifications on my watch, I often do not see alerts for quite some
time. They are also much harder to read in an e-mail client (Fastmail
web and K-9 Mail both display them poorly). I would much rather have
them delivered via _ntfy_, just like all the rest of the ephemeral
notifications I receive.
Fortunately, it is easy enough to integrate Alertmanager and _ntfy_
using the webhook notifier in Alertmanager. Since _ntfy_ does not
natively support the Alertmanager webhook API, though, a bridge is
necessary to translate from one data format to the other. There are a
few options for this bridge, but I chose
[alexbakker/alertmanager-ntfy][0] because it looked the most complete
while also having the simplest configuration format. Sadly, it does not
expose any Prometheus metrics itself, and since it's deployed in the
_victoria-metrics_ namespace, it needs to be explicitly excluded from
the VMAgent scrape configuration.
[0]: https://github.com/alexbakker/alertmanager-ntfy
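
On the Alertmanager side, the bridge is just a webhook receiver; a sketch where the service name, port, and path are assumptions:

```yaml
receivers:
  - name: ntfy
    webhook_configs:
      # Address of the alertmanager-ntfy bridge service (placeholder values)
      - url: http://alertmanager-ntfy.victoria-metrics.svc.cluster.local:8000/hook
        send_resolved: true
```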
Patroni, a component of the *postgres operator*, exports metrics about
the PostgreSQL database servers it manages. Notably, it provides
information about the current transaction log location for each server.
This allows us to monitor and alert on the health of database replicas.
The *promtail* job scrapes metrics from all the hosts running Promtail.
The static targets are Fedora CoreOS nodes that are not part of the
Kubernetes cluster.
The relabeling rules ensure that both the static targets and the
targets discovered via the Kubernetes Node API use the FQDN of the host
as the value of the *instance* label.
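
For example, just the relabeling portion might look like this sketch (address and port rewriting omitted):

```yaml
relabel_configs:
  # Static targets: drop the port so the instance label is just the FQDN
  - source_labels: [__address__]
    regex: '([^:]+)(?::\d+)?'
    target_label: instance
  # Kubernetes-discovered targets: the node name is already the FQDN
  - source_labels: [__meta_kubernetes_node_name]
    regex: '(.+)'
    target_label: instance
```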
Grafana Loki is hosted on a VM named *loki0.pyrocufflink.blue*. It runs
Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape
_collectd_ and _Zincati_ as well.
I did not realize the batteries on the garage door tilt sensors had
died. Adding alerts for various sensor batteries should help keep me
better informed.