Music Assistant doesn't expose any metrics natively. Since we only
care whether it's accessible, scraping it with the Blackbox Exporter
is sufficient.
_Firefly III_ and _phpipam_ don't export any Prometheus metrics, so we
have to scrape them via the Blackbox Exporter.
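A Blackbox Exporter probe job follows the standard pattern below
(the target URLs and the exporter's address are placeholders, not the
actual values from this configuration):

```yaml
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://firefly.example.net/
        - https://phpipam.example.net/
  relabel_configs:
    # Pass the probed URL to the exporter as the ?target= parameter
    - source_labels: [__address__]
      target_label: __param_target
    # Keep the probed URL as the instance label
    - source_labels: [__param_target]
      target_label: instance
    # Actually scrape the Blackbox Exporter itself
    - target_label: __address__
      replacement: blackbox-exporter:9115
```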
Paperless-ngx only exposes metrics via Flower, but since Flower runs
in the same container as the main application, we can assume that if
Flower is unavailable, the main application is as well.
Scraping metrics from the Kubernetes API server has recently started
taking 20+ seconds. Until I figure out the underlying cause, I'm
increasing the scrape timeout so that _vmagent_ doesn't give up and
report the API server as "down."
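Raising the timeout is a one-line change to the scrape job; something
like this (the job name and value here are illustrative):

```yaml
- job_name: kubernetes-apiservers
  # The default scrape_timeout is 10s; the API server has recently
  # been taking 20+ seconds to respond, so give it extra headroom
  scrape_timeout: 30s
```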
*unifi3.pyrocufflink.blue* has been replaced by
*unifi-nuptials.host.pyrocufflink.black*. The former was the last
Fedora CoreOS machine in use, so the entire Zincati scrape job is no
longer needed.
Nextcloud uses a _client-side_ (JavaScript) redirect to navigate the
browser to its `index.php`. The page it serves with this redirect is
static and will often load successfully, even if there is a problem with
the application. This causes the Blackbox Exporter to record the site
as "up," even when it definitely is not. To avoid this, we can
scrape the `index.php` page explicitly, ensuring that the application is
loaded.
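Concretely, the probe target becomes the application entry point
rather than the site root; a sketch with a placeholder hostname:

```yaml
static_configs:
  - targets:
      # Probing / only fetches the static redirect page; probing
      # index.php forces the PHP application itself to load
      - https://nextcloud.example.net/index.php
```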
The ephemeral Jenkins worker nodes that run in AWS don't have collectd,
Promtail, or Zincati. We don't need to get three alerts every time a
worker starts up to handle an ARM build job, so we drop these discovered
targets for these scrape jobs.
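A drop rule in each affected job's relabeling handles this; a sketch
assuming the workers are discovered via EC2 service discovery and
carry a recognizable Name tag (both assumptions):

```yaml
relabel_configs:
  # Ephemeral Jenkins workers never run collectd, Promtail, or
  # Zincati, so discard them before they become scrape targets
  - source_labels: [__meta_ec2_tag_Name]
    regex: jenkins-worker-.*
    action: drop
```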
*loki1.pyrocufflink.blue* is a regular Fedora machine, a member of the
AD domain, and managed by Ansible. Thus, it does not need to be
explicitly listed as a scrape target.
For scraping metrics from Loki itself, I've changed the job to use
DNS-SD because it seems like `vmagent` does _not_ re-resolve host names
from static configuration.
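Unlike static targets, DNS-SD targets are re-resolved on every refresh
interval; a minimal sketch, assuming Loki's default HTTP port:

```yaml
- job_name: loki
  dns_sd_configs:
    - names:
        - loki1.pyrocufflink.blue
      # A records carry no port, so one must be given explicitly;
      # 3100 is Loki's default HTTP listen port
      type: A
      port: 3100
```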
I was doing this to monitor Jenkins's certificate, but since that's
managed by _cert-manager_, there's practically no risk of it
expiring without warning anymore. Since Jenkins is already being
scraped directly, this extra check just generates extra
notifications when there is an issue, without adding any real value.
Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues. I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up." I think it makes
sense, though, to just ping the upstream gateway for that check. If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
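The replacement check is an ICMP probe of the gateway; a sketch with a
documentation-range placeholder for the gateway address (the
relabeling is the same boilerplate as the HTTP probes):

```yaml
- job_name: blackbox-icmp
  metrics_path: /probe
  params:
    module: [icmp]
  static_configs:
    - targets:
        # Placeholder for the upstream gateway's actual address
        - 203.0.113.1
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:9115
```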
The VM hosts are now managed by the "main" Ansible inventory and thus
appear in the host list ConfigMap. As such, they do not need to be
listed explicitly in the static targets list.
I'm not using Matrix for anything anymore, and it seems to have gone
offline. I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.
This machine never worked correctly; the USB-RS232 adapters would stop
working randomly (and of course it would be whenever I needed to
actually use them). I thought it was something wrong with the server
itself (a Raspberry Pi 3), but the same thing happened when I tried
using a Pi 4.
The new backup server has a plethora of on-board RS-232 ports, so I'm
going to use it as the serial console server, too.
I've rebuilt the Unifi Network controller machine (again);
*unifi3.pyrocufflink.blue* has replaced *unifi2.pyrocufflink.blue*.
The `unifi_exporter` no longer works with the latest version of Unifi
Network, so it's not deployed on the new machine.
The [postgres exporter][0] exposes metrics about the operation and
performance of a PostgreSQL server. It's currently deployed on
_db0.pyrocufflink.blue_, the primary server of the main PostgreSQL
cluster.
[0]: https://github.com/prometheus-community/postgres_exporter
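The corresponding scrape job is a simple static target; a sketch
assuming the exporter's default port:

```yaml
- job_name: postgres
  static_configs:
    - targets:
        # 9187 is the postgres exporter's default listen port
        - db0.pyrocufflink.blue:9187
```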
All the Kubernetes nodes (except *k8s-ctrl0*) are now running Fedora
CoreOS. We can therefore use the Kubernetes API to discover scrape
targets for the Zincati job.
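A sketch of what Node-based discovery for that job could look like;
the Zincati metrics port below is a placeholder, since the actual port
depends on how the metrics are exposed on each node:

```yaml
- job_name: zincati
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    # Use the node name rather than the kubelet address as instance
    - source_labels: [__meta_kubernetes_node_name]
      target_label: instance
    # Discovery yields the kubelet port; rewrite to the (placeholder)
    # port where the Zincati metrics are served
    - source_labels: [__address__]
      regex: '([^:]+):.*'
      replacement: '${1}:9099'
      target_label: __address__
```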
We don't need to specify every single host explicitly. Domain
controllers, for example, are registered in DNS with SRV records, and
Kubernetes nodes can be discovered using the Kubernetes API. Both of
these classes of nodes change frequently, so discovering them
dynamically is convenient.
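For the domain controllers, that means an SRV-based DNS-SD block; a
sketch, assuming the standard AD LDAP record and rewriting the SRV
port to the exporter's (placeholder) port:

```yaml
- job_name: collectd-dcs
  dns_sd_configs:
    - names:
        # Every domain controller registers itself under this record
        - _ldap._tcp.pyrocufflink.blue
      type: SRV
  relabel_configs:
    # The SRV record carries the LDAP port (389); rewrite the target
    # to the metrics port (placeholder value)
    - source_labels: [__address__]
      regex: '([^:]+):.*'
      replacement: '${1}:9103'
      target_label: __address__
```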
I don't like having alerts sent by e-mail. Since I don't get e-mail
notifications on my watch, I often do not see alerts for quite some
time. They are also much harder to read in an e-mail client (the
Fastmail web client and K-9 Mail both display them poorly). I would
much rather have them delivered via _ntfy_, just like all the rest of
the ephemeral notifications I receive.
Fortunately, it is easy enough to integrate Alertmanager and _ntfy_
using the webhook notifier in Alertmanager. Since _ntfy_ does not
natively support the Alertmanager webhook API, though, a bridge is
necessary to translate from one data format to the other. There are a
few options for this bridge, but I chose
[alexbakker/alertmanager-ntfy][0] because it looked the most complete
while also having the simplest configuration format. Sadly, it does not
expose any Prometheus metrics itself, and since it's deployed in the
_victoria-metrics_ namespace, it needs to be explicitly excluded from
the VMAgent scrape configuration.
[0]: https://github.com/alexbakker/alertmanager-ntfy
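The exclusion amounts to a drop rule in the pod-discovery relabeling;
a sketch, assuming the bridge's pods carry an `app` label with this
value (both the label name and value are assumptions):

```yaml
relabel_configs:
  # alertmanager-ntfy serves no /metrics endpoint, so drop its pods
  # from the discovered targets
  - source_labels: [__meta_kubernetes_pod_label_app]
    regex: alertmanager-ntfy
    action: drop
```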
Patroni, a component of the *postgres operator*, exports metrics about
the PostgreSQL database servers it manages. Notably, it provides
information about the current transaction log location for each server.
This allows us to monitor and alert on the health of database replicas.
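Patroni serves these metrics from its REST API, so the scrape job is
just static targets; a sketch assuming the default port, with a
hypothetical replica host:

```yaml
- job_name: patroni
  static_configs:
    - targets:
        # 8008 is Patroni's default REST API port; the /metrics
        # endpoint lives on the same listener
        - db0.pyrocufflink.blue:8008
        - db1.pyrocufflink.blue:8008  # hypothetical replica
```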
The *promtail* job scrapes metrics from all the hosts running Promtail.
The static targets are Fedora CoreOS nodes that are not part of the
Kubernetes cluster.
The relabeling rules ensure that both the static targets and the
targets discovered via the Kubernetes Node API use the FQDN of the host
as the value of the *instance* label.
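A simplified sketch of relabeling that accomplishes this (the exact
rules are mine, not necessarily the actual configuration):

```yaml
relabel_configs:
  # Static targets: strip the port, leaving the bare FQDN
  - source_labels: [__address__]
    regex: '([^:]+)(?::\d+)?'
    replacement: '${1}'
    target_label: instance
  # Node-discovered targets: the node name is already the FQDN, so
  # it wins when present (the regex doesn't match an empty value)
  - source_labels: [__meta_kubernetes_node_name]
    regex: '(.+)'
    replacement: '${1}'
    target_label: instance
```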
Grafana Loki is hosted on a VM named *loki0.pyrocufflink.blue*. It runs
Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape
_collectd_ and _Zincati_ as well.
Graylog is down because Elasticsearch corrupted itself again, and this
time, I'm just not going to bother fixing it. I practically never use
it anymore anyway, and I want to migrate to Grafana Loki, so now seems
like a good time to just get rid of it.
The *virt* plugin for *collectd* sets `instance` to the name of the
libvirt domain the metric refers to. As a result, there is no label
identifying which host the VM is running on. Thus, if we want to
classify metrics by VM host, we need to add that label explicitly.
Since the `__address__` label is not available during metric relabeling,
we need to store it in a temporary label, which gets dropped at the end
of the relabeling phase. We copy the value of that label into a new
label, but only for metrics that match the desired metric name.
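Putting that together, the relabeling would look roughly like this
(the metric-name pattern is an assumption):

```yaml
relabel_configs:
  # __address__ is gone by the time metric relabeling runs, so save
  # it in a temporary label that survives target relabeling
  - source_labels: [__address__]
    target_label: tmp_vmhost
metric_relabel_configs:
  # Copy the saved address into a real label, but only for the
  # libvirt metrics
  - source_labels: [__name__, tmp_vmhost]
    regex: 'collectd_virt.*;(.+)'
    replacement: '${1}'
    target_label: vmhost
  # Drop the temporary label so it never reaches storage
  - regex: tmp_vmhost
    action: labeldrop
```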