*collectd* is now running on *k8s-aarch64-n0.pyrocufflink.blue*,
exposing system metrics. As it is not a member of the AD domain, it has
to be explicitly listed in the `scrape_collectd_extra_targets` variable.
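
As a rough illustration, the variable might look like the following. This assumes `scrape_collectd_extra_targets` is a plain list of host names; the real variable may also carry a port or other fields.

```yaml
# Assumed shape: a simple list of hosts that are not discovered through the
# AD domain and so must be named explicitly.
scrape_collectd_extra_targets:
- k8s-aarch64-n0.pyrocufflink.blue
```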
*nvr1.pyrocufflink.blue* has been migrated to Fedora CoreOS. As such,
it is no longer managed by Ansible; its configuration is done via
Butane/Ignition. It is no longer a member of the Active Directory
domain, but it does still run *collectd* and export Prometheus metrics.
When the RAID array is being resynchronized after the archived disk has
been reconnected, md changes the disk status from "missing" to "spare."
Once the synchronization is complete, it changes from "spare" to
"active." We only want to trigger the "disk needs archived" alert once
the synchronization process is complete; otherwise, both the "disks need
swapped" and "disk needs archived" alerts would be active at the same
time, which makes no sense. By adjusting the query for the "disk needs
archived" alert to consider disks in both "missing" and "spare" status,
we can delay firing that alert until the proper time.
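
A minimal sketch of such a rule follows. The metric name `collectd_md_md_disks`, its `type` label, and the group name are assumptions about how *collectd*'s md plugin is exported here, not the actual rule.

```yaml
groups:
- name: burp-raid  # hypothetical group name
  rules:
  - alert: disk needs archived
    # Assumed metric and label names. The expression only returns a result
    # when no disk is "missing" or "spare", i.e. after the resync has
    # finished and both disks are "active".
    expr: |
      sum(collectd_md_md_disks{type=~"missing|spare"}) == 0
```

Because a resynchronizing disk is counted as "spare," the sum stays above zero until the resync completes, which keeps the alert from firing early.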
Kubernetes exports a *lot* of metrics in Prometheus format. I am not
sure what all is there, yet, but apparently several thousand time series
were added.
To allow anonymous access to the metrics, I added this ClusterRole:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
```
```
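
A ClusterRole grants nothing until it is bound to a subject. A binding for unauthenticated requests could look roughly like this; the binding name is hypothetical, and `system:anonymous` is the user Kubernetes assigns to anonymous requests.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-anonymous  # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus  # the ClusterRole defined above
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: system:anonymous
```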
MinIO exposes metrics in Prometheus exposition format. By default, it
requires an authentication token to access the metrics, but I was unable
to get this to work. Fortunately, it can be configured to allow
anonymous access to the metrics, which is fine, in my opinion.
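
For reference, anonymous access is enabled on the MinIO side by setting `MINIO_PROMETHEUS_AUTH_TYPE=public`. A scrape job for it might then look like the sketch below; the job name, target host, scheme, and metrics path are assumptions that depend on the MinIO version and deployment.

```yaml
scrape_configs:
- job_name: minio  # hypothetical job name
  # Cluster metrics endpoint in recent MinIO releases; adjust if yours differs.
  metrics_path: /minio/v2/metrics/cluster
  scheme: https  # assumed
  static_configs:
  - targets:
    - minio.example.com:9000  # hypothetical host and port
```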
This alert will fire once the MD RAID resynchronization process has
completed and both disks in the array are online. It will clear when
one disk is disconnected and moved to the safe.
When BURP fails to even *start* a backup, it does not trigger a
notification at all. As a result, I may not notice for a few days when
backups are not happening. That was the case this week, when clients'
backups were failing immediately, because of a file permissions issue on
the server. To hopefully avoid missing backups for too long in the
future, I've added two new alerts:
* The *no recent backups* alert fires if there have not been *any* BURP
backups recently. This may also fire, for example, if the BURP
exporter is not working, or if there is something wrong with the BURP
data volume.
* The *missed client backup* alert fires if an active BURP client (i.e.
one that has had at least one backup in the past 90 days) has not been
backed up in the last 24 hours.
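
The two alerts could be sketched as follows. The metric name `burp_last_backup_timestamp_seconds` is hypothetical (the BURP exporter's real metric names are not shown here), and the 24-hour threshold on the first alert is an assumed value for "recently."

```yaml
groups:
- name: burp  # hypothetical group name
  rules:
  - alert: no recent backups
    # absent() returns a value when no client has a backup newer than 24h,
    # including when the exporter reports nothing at all.
    expr: |
      absent(burp_last_backup_timestamp_seconds > time() - 86400)
  - alert: missed client backup
    # Last backup is older than 24 hours but newer than 90 days,
    # so the client still counts as active.
    expr: |
      (time() - burp_last_backup_timestamp_seconds > 86400)
      and
      (time() - burp_last_backup_timestamp_seconds < 90 * 86400)
```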
Using a 30-day window for the `tlast_change_over_time` function
effectively "caps out" the value at 30 days. Thus, the alert reminding
me to swap the BURP backup volume will never fire, since the value will
never be greater than the 30-day threshold. Using a wider window
resolves that issue (though the query will still produce inaccurate
results beyond the window).
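
A sketch of the swap reminder with a window wider than the threshold follows. The metric and label names are assumptions, as is the `60d` window; any window comfortably larger than the 30-day threshold would serve the same purpose.

```yaml
groups:
- name: burp-raid  # hypothetical group name
  rules:
  - alert: disks need swapped
    # tlast_change_over_time() is a MetricsQL function that returns the
    # timestamp of the last value change inside the lookbehind window.
    # The window (60d) must exceed the threshold (30 days), or the
    # computed age can never cross it.
    expr: |
      time() - tlast_change_over_time(
        collectd_md_md_disks{type="active"}[60d]
      ) > 30 * 86400
```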
The `tlast_change_over_time` function needs an interval wide enough to
consider the range of time we are interested in. In this case, we want
to see if the BURP volume has been swapped in the last thirty days, so
the interval needs to be `30d`.
This alert counts how long it's been since the number of "active" disks
in the RAID array on the BURP server has changed. The assumption is
that the number will typically be `1`, but it will be `2` once the
second disk has synchronized, before the swap occurs.
1. Grafana 8 changed the format of the query string parameters for the
Explore page.
2. vmalert no longer needs the `-http.pathPrefix` argument when behind a
reverse proxy; instead, it uses the request path like the other
Victoria Metrics components.
The way I am handling swapping out the BURP disk now is by using the
Linux MD RAID driver to manage a RAID 1 mirror array. The array
normally operates with one disk missing, as it is in the fireproof safe.
When it is time to swap the disks, I reattach the offline disk, let the
array resync, then disconnect and store the other disk.
This works considerably better than the previous method, as it does not
require BURP or the NFS server to be offline during the synchronization.
I changed the naming convention for domain controller machines. They
are no longer "numbered," since the plan is to rotate through them
quickly. For each release of Fedora, we'll create two new domain
controllers, replacing the existing ones. Their names are now randomly
generated and contain letters and numbers, so the Blackbox Exporter
check for DNS records needs to account for this.
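
As an illustration only, a Blackbox Exporter DNS module that accepts alphanumeric host names might look like this. The module name, the queried record, and the `dc-` name prefix are all hypothetical; only the domain comes from the text above.

```yaml
modules:
  dns_domain_controllers:  # hypothetical module name
    prober: dns
    dns:
      query_name: _ldap._tcp.pyrocufflink.blue  # assumed record to check
      query_type: SRV
      validate_answer_rrs:
        # The probe fails if an answer record does not match the pattern.
        # [0-9a-z]+ allows both letters and digits, where a purely
        # numeric pattern such as [0-9]+ would not.
        fail_if_not_matches_regexp:
        - '.*dc-[0-9a-z]+\.pyrocufflink\.blue\.'
```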
The `-external.url` and `-external.alert.source` command line arguments
and their corresponding environment variables can be used to configure
the "Source" links associated with alerts created by `vmalert`.
The firewall hardware is too slow to run the *prometheus_speedtest*
program. It always showed *way* lower speeds than were actually
available. I've moved the service to the Kubernetes cluster and it
works a lot better there.
*mtrcs0.pyrocufflink.red* is a Raspberry Pi CM4 on a Waveshare
CM4-IO-BASE-B carrier board with an NVMe SSD. It runs a custom OS built
using Buildroot, and is not a member of the *pyrocufflink.blue* AD
domain.
*mtrcs0.p.r* hosts Victoria Metrics/`vmagent`, `vmalert`, AlertManager,
and Grafana. I've created a unique group and playbook for it,
*metricspi*, to manage all these applications together.