One of the reasons for moving to 4 `vmstorage` replicas was to ensure
that the load was spread evenly between the physical VM host machines.
To come as close to that as possible, we need to schedule at most one
pod per Kubernetes node.
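Something like a required anti-affinity rule on the hostname topology key
accomplishes that. This is only a sketch; the `app.kubernetes.io/name`
label is an assumption about how the `vmstorage` pods are actually
labelled:

```yaml
# Sketch: never schedule two vmstorage pods on the same node.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vmstorage
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app.kubernetes.io/name: vmstorage
```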
Longhorn does not work well for very large volumes. It takes ages to
synchronize/rebuild them when migrating between nodes, which happens
all too frequently. This consumes a lot of resources, which impacts
the operation of the rest of the cluster, and can cause a cascading
failure in some circumstances.
Now that the cluster is set up to be able to mount storage directly from
the Synology, it makes sense to move the Victoria Metrics data there as
well. Similar to how I did this with Jenkins, I created
PersistentVolume resources that map to iSCSI volumes, and patched the
PersistentVolumeClaims (or rather the template for them defined by the
StatefulSet) to use these. Each `vmstorage` pod then gets an iSCSI
LUN, bypassing both Longhorn and QEMU to write directly to the NAS.
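For reference, here is a minimal sketch of one such PersistentVolume; the
portal address, IQN, LUN number, capacity, and claim-template name are
placeholders, not the real values:

```yaml
# One pre-created PV per vmstorage pod, bound to its PVC via claimRef.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vmstorage-db-vmstorage-0
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  iscsi:
    targetPortal: 172.30.0.5:3260
    iqn: iqn.2000-01.com.synology:nas.vmstorage0
    lun: 1
    fsType: xfs
  claimRef:
    namespace: victoria-metrics
    name: vmstorage-db-vmstorage-0
```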
The migration process was relatively straightforward. I started by
scaling down the `vminsert` Deployment so the `vmagent` pods would
queue the metrics they had collected while the storage layer was down.
Next, I created a [native][0] export of all the time series in the
database. Then, I deleted the `vmstorage` StatefulSet and its
associated PVCs. Finally, I applied the updated configuration,
including the new PVs and patched PVCs, and brought the `vminsert`
pods back online. Once everything was up and running, I re-imported
the exported data.
[0]: https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-export-data-in-native-format
By default, Kubernetes waits for each pod in a StatefulSet to become
"ready" before starting the next one. If there is a problem starting
that pod, e.g. data corruption, then the others will never start. This
sort of defeats the purpose of having multiple replicas. Fortunately,
we can configure the pod management policy to start all the pods at
once, regardless of the status of any individual pod. This way, if
there is a problem with the first pod, the others will still come up
and serve whatever data they have.
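Concretely, that just means setting the pod management policy on the
StatefulSet:

```yaml
# With the Parallel policy, Kubernetes launches every pod in the
# StatefulSet immediately instead of waiting for each to become Ready.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vmstorage
spec:
  podManagementPolicy: Parallel
  serviceName: vmstorage
  replicas: 4
  # selector, template, and volumeClaimTemplates omitted
```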
We don't need to specify every host individually.
Domain controllers, for example, are registered in DNS with SRV records.
Kubernetes nodes, of course, can be discovered using the Kubernetes API.
Both of these classes of nodes change frequently, so discovering them
dynamically is convenient.
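In scrape-config terms, that looks roughly like the following. The SRV
record name is an assumption, and the relabeling that points each
discovered target at the right exporter port is omitted:

```yaml
scrape_configs:
  # Domain controllers register themselves in DNS; follow the SRV record
  - job_name: domain-controllers
    dns_sd_configs:
      - names: ['_ldap._tcp.pyrocufflink.blue']
        type: SRV
  # Kubernetes nodes come straight from the API server
  - job_name: kubernetes-nodes
    kubernetes_sd_configs:
      - role: node
```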
Since I added an IPv6 ULA prefix to the "main" VLAN (to allow
communicating with the Synology directly), the domain controllers now
have AAAA records. This causes the `sambadc` scrape job to fail because
Blackbox Exporter prefers IPv6 by default, but Kubernetes pods do not
have IPv6 addresses.
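One fix is telling Blackbox Exporter to prefer IPv4 in the module the
`sambadc` job uses. Which prober that actually is, I'm leaving as an
assumption; an ICMP module is shown as an example:

```yaml
# Blackbox Exporter module configuration (sketch)
modules:
  icmp:
    prober: icmp
    icmp:
      preferred_ip_protocol: ip4
      # Don't fall back to IPv6 if the IPv4 probe fails
      ip_protocol_fallback: false
```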
Just having the alert name and group name in the ntfy notification is
not enough to really indicate what the problem is, as some alerts can
generate notifications for many reasons. In the email notifications
Alertmanager sends by default, the values (but not the keys) of all
labels are included in the subject, so we will reproduce that here.
I don't like having alerts sent by e-mail. Since I don't get e-mail
notifications on my watch, I often do not see alerts for quite some
time. They are also much harder to read in an e-mail client (Fastmail
web and K-9 Mail both display them poorly). I would much rather have
them delivered via _ntfy_, just like all the rest of the ephemeral
notifications I receive.
Fortunately, it is easy enough to integrate Alertmanager and _ntfy_
using the webhook notifier in Alertmanager. Since _ntfy_ does not
natively support the Alertmanager webhook API, though, a bridge is
necessary to translate from one data format to the other. There are a
few options for this bridge, but I chose
[alexbakker/alertmanager-ntfy][0] because it looked the most complete
while also having the simplest configuration format. Sadly, it does not
expose any Prometheus metrics itself, and since it's deployed in the
_victoria-metrics_ namespace, it needs to be explicitly excluded from
the VMAgent scrape configuration.
[0]: https://github.com/alexbakker/alertmanager-ntfy
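On the Alertmanager side, the bridge is just a webhook receiver. A
sketch, where the Service name, port, and path are assumptions about how
the bridge is deployed:

```yaml
# Alertmanager configuration fragment
route:
  receiver: ntfy
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://alertmanager-ntfy.victoria-metrics.svc:8000/hook
        send_resolved: true
```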
Patroni, a component of the *postgres operator*, exports metrics about
the PostgreSQL database servers it manages. Notably, it provides
information about the current transaction log location for each server.
This allows us to monitor and alert on the health of database replicas.
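As a sketch of the kind of rule this enables (the Patroni metric names
and the 64 MiB threshold are assumptions, not verified against the
actual exporter output):

```yaml
groups:
  - name: postgresql
    rules:
      - alert: PostgresReplicaLagging
        # Assumed metric names: patroni_xlog_location (current WAL
        # position) and patroni_xlog_replayed_location (replica replay
        # position)
        expr: |
          scalar(max(patroni_xlog_location))
            - patroni_xlog_replayed_location > 64 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'PostgreSQL replica {{ $labels.instance }} is falling behind'
```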
The *promtail* job scrapes metrics from all the hosts running Promtail.
The static targets are Fedora CoreOS nodes that are not part of the
Kubernetes cluster.
The relabeling rules ensure that both the static targets and the
targets discovered via the Kubernetes Node API use the FQDN of the host
as the value of the *instance* label.
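Roughly, that relabeling looks like this. The Promtail port, the domain
suffix appended to discovered node names (unnecessary if the node names
are already fully qualified), and the example static target are
assumptions:

```yaml
scrape_configs:
  - job_name: promtail
    static_configs:
      # Fedora CoreOS hosts outside the cluster
      - targets: ['loki0.pyrocufflink.blue:9080']
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Static targets: strip the port so instance is just the FQDN
      - source_labels: [__address__]
        regex: '([^:]+)(?::\d+)?'
        target_label: instance
      # Discovered nodes: build the FQDN from the node name...
      - source_labels: [__meta_kubernetes_node_name]
        regex: '(.+)'
        replacement: '${1}.pyrocufflink.blue'
        target_label: instance
      # ...and point the scrape at Promtail rather than the kubelet
      - source_labels: [__meta_kubernetes_node_name]
        regex: '(.+)'
        replacement: '${1}.pyrocufflink.blue:9080'
        target_label: __address__
```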
Grafana Loki is hosted on a VM named *loki0.pyrocufflink.blue*. It runs
Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape
_collectd_ and _Zincati_ as well.
I did not realize the batteries on the garage door tilt sensors had
died. Adding alerts for various sensor batteries should help keep me
better informed.
Graylog is down because Elasticsearch corrupted itself again, and this
time, I'm just not going to bother fixing it. I practically never use
it anymore anyway, and I want to migrate to Grafana Loki, so now seems
like a good time to just get rid of it.
*vmalert* has been generating alerts and triggering notifications, but
not writing any `ALERTS`/`ALERTS_FOR_STATE` metrics. It turns out this
is because I had not correctly configured the remote read/write
URLs.
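For the cluster version, *vmalert* needs to write through `vminsert` and
read back through `vmselect`. A sketch of the relevant arguments (the
Service names and tenant ID are assumptions about this deployment):

```yaml
# vmalert container arguments (fragment)
args:
  - -datasource.url=http://vmselect:8481/select/0/prometheus
  - -remoteWrite.url=http://vminsert:8480/insert/0/prometheus
  - -remoteRead.url=http://vmselect:8481/select/0/prometheus
```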
If Frigate is running but not connected to the MQTT broker, the
`sensor.frigate_status` entity will be available, but the
`update.frigate_server` entity will not.
Kubernetes will not start additional Pods in a StatefulSet until the
existing ones are Ready. This means that if there is a problem bringing
up, e.g. `vmstorage-0`, it will never start `vmstorage-1` or
`vmstorage-2`. Since this pretty much defeats the purpose of having a
multi-node `vmstorage` cluster, we have to remove the readiness probe,
so the Pods will be Ready as soon as they start. If there is a problem
with one of them, it will matter less, as the others can still run.
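If the manifests are managed with Kustomize, a strategic-merge patch
that sets the probe to `null` does the trick. The container name here is
an assumption:

```yaml
# Patch: delete the readiness probe from the vmstorage container
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vmstorage
spec:
  template:
    spec:
      containers:
        - name: vmstorage
          readinessProbe: null
```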
The *virt* plugin for *collectd* sets `instance` to the name of the
libvirt domain the metric refers to. This makes it so there is no label
identifying which host the VM is running on. Thus, if we want to
classify metrics by VM host, we need to add that label explicitly.
Since the `__address__` label is not available during metric relabeling,
we need to store it in a temporary label, which gets dropped at the end
of the relabeling phase. We copy the value of that label into a new
label, but only for metrics that match the desired metric name.
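A sketch of that relabeling; the temporary label name, the final
`vmhost` label, and the `collectd_virt_` metric-name prefix are
assumptions:

```yaml
# Fragment of the collectd scrape job
relabel_configs:
  # Save the scrape address in a label that does not start with "__",
  # so it is still present during metric relabeling
  - source_labels: [__address__]
    regex: '([^:]+)(?::\d+)?'
    target_label: tmp_vmhost
metric_relabel_configs:
  # Copy the saved address into vmhost, but only for the virt plugin's
  # metrics (__name__ and tmp_vmhost are joined with ";" by default)
  - source_labels: [__name__, tmp_vmhost]
    regex: 'collectd_virt_[^;]*;(.+)'
    target_label: vmhost
  # Drop the temporary label from every series
  - action: labeldrop
    regex: tmp_vmhost
```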
I found the official container image for Prometheus Blackbox exporter.
It is hosted on Quay, which is why I didn't see it on Docker Hub when I
looked initially.
Scraping the public DNS servers doesn't work anymore since the firewall
routes traffic through Mullvad. Pinging public cloud providers should
give a pretty decent indication of Internet connectivity. It will also
serve as a benchmark for the local DNS performance, since the names will
have to be resolved.
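Presumably the probes go through Blackbox Exporter's ICMP module. A
sketch of such a job; the target hostnames and the exporter's Service
address are examples, not necessarily what is actually configured:

```yaml
scrape_configs:
  - job_name: ping-internet
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - aws.amazon.com
          - azure.microsoft.com
          - cloud.google.com
    relabel_configs:
      # Hand the original target to Blackbox Exporter as a parameter
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself
      - target_label: __address__
        replacement: blackbox-exporter:9115
```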
By default, the `instance` label for discovered metrics targets is set
to the scrape address. For Kubernetes pods, that is the IP address and
port of the pod, which naturally changes every time the pod is recreated
or moved. This would cause a high churn rate in the time series scraped
from the Longhorn manager pods.
To avoid this, we set the `instance` label to the name of the node the
pod is running on, which will not change because the Longhorn manager
pods are managed by a DaemonSet.
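The discovery and relabeling look roughly like this; the namespace, the
pod label used to select the managers, and the port handling (omitted)
are assumptions:

```yaml
scrape_configs:
  - job_name: longhorn-manager
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [longhorn-system]
    relabel_configs:
      # Only the longhorn-manager pods
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: longhorn-manager
        action: keep
      # Stable instance label: the node the pod runs on
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: instance
```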
Each Longhorn manager pod exports metrics about the node on which it is
running. Thus, we have to scrape every pod to get the metrics about the
whole ecosystem.
The original RBAC configuration allowed `vmagent` only to list the pods
in the `victoria-metrics` namespace. In order to allow it to monitor
other applications' pods, it needs to be assigned permission to list
pods in all namespaces.
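That means a ClusterRole and ClusterRoleBinding instead of a namespaced
Role. A sketch, with the ServiceAccount name assumed:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vmagent
rules:
  - apiGroups: [""]
    resources: [pods]
    verbs: [get, list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: vmagent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: vmagent
subjects:
  - kind: ServiceAccount
    name: vmagent
    namespace: victoria-metrics
```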
Since *mtrcs0.pyrocufflink.blue* (the Metrics Pi) seems to be dying,
I decided to move monitoring and alerting into Kubernetes.
I was originally planning to have a single, dedicated virtual machine
for Victoria Metrics and Grafana, similar to how the Metrics Pi was set
up, but running Fedora CoreOS instead of a custom Buildroot-based OS.
While I was working on the Ignition configuration for the VM, it
occurred to me that monitoring would be interrupted frequently, since
FCOS updates weekly and all updates require a reboot. I would rather
not have that many gaps in the data. Ultimately I decided that
deploying a cluster with Kubernetes would probably be more robust and
reliable, as updates can be performed without any downtime at all.
I chose not to use the Victoria Metrics Operator, but rather handle
the resource definitions myself. Victoria Metrics components are not
particularly difficult to deploy, so the overhead of running the
operator and using its custom resources would not be worth the minor
convenience it provides.