Commit Graph

57 Commits

Author SHA1 Message Date
1fc1c5594e v-m: Scrape PiKVM metrics
PiKVM exports some rudimentary metrics, but requires authentication to
scrape them.  At the very least, this will provide alerting in case the
PiKVM systems go offline.
2025-12-01 12:19:15 -06:00
82c37a8dff v-m/scrape: Remove Promtail job 2025-11-09 10:21:49 -06:00
92cf0edc4b v-m/scrape: Scrape Music Assistant via Blackbox
Music Assistant doesn't expose any metrics natively.  Since we really
only care about whether or not it's accessible, scraping it with the
blackbox exporter is fine.
2025-09-07 08:27:19 -05:00
38ee60e099 v-m: Add alerts for Firefly, Paperless, phpipam
_Firefly III_ and _phpipam_ don't export any Prometheus metrics, so we
have to scrape them via the Blackbox Exporter.

Paperless-ngx only exposes metrics via Flower, but since it runs in the
same container as the main application, we can assume that if the former
is unavailable, the latter is as well.
2025-07-27 17:39:28 -05:00
093e909475 v-m/scrape: Scrape Victoria Logs 2025-07-06 15:20:16 -05:00
cc83a5115a v-m/scrape: Scrape MinIO metrics 2025-07-02 10:29:53 -05:00
fdb4bdb23d Merge branch 'unifi' 2025-06-21 14:00:38 -05:00
75edfb74cb v-m/scrape: Increase timeout for k8s job
Scraping metrics from the Kubernetes API server has started taking 20+
seconds recently.  Until I figure out the underlying cause, I'm
increasing the scrape timeout so that the _vmagent_ doesn't give up and
report the API server as "down."
2025-06-21 13:55:23 -05:00
52094da8fd v-m/scrape: Remove unifi3, Zincati
*unifi3.pyrocufflink.blue* has been replaced by
*unifi-nuptials.host.pyrocufflink.black*.  The former was the last
Fedora CoreOS machine in use, so the entire Zincati scrape job is no
longer needed.
2025-03-29 08:10:50 -05:00
6da330f2be v-m/scrape: Remove k8s SD config for Zincati
There are no more Kubernetes nodes running Fedora CoreOS.
2025-02-01 18:16:10 -06:00
11a0f84db7 v-m/scrape: Remove websites job
Websites are being scraped by the `vmagent` on the OVH VPS.
2025-02-01 18:16:10 -06:00
6e15b11f73 Merge branch 'fix-nextcloud-alert' 2024-12-21 11:58:41 -06:00
e0c633c21e v-m: scrape: Fix Nextcloud URL
Nextcloud uses a _client-side_ (JavaScript) redirect to navigate the
browser to its `index.php`.  The page it serves with this redirect is
static and will often load successfully, even if there is a problem with
the application.  This causes the Blackbox exporter to record the site
as "up," even when it definitely is not.  To avoid this, we can
scrape the `index.php` page explicitly, ensuring that the application is
loaded.
2024-11-17 18:43:00 +00:00
0209f921c3 v-m: Remove nut0 from scrape targets
_nut0.pyrocufflink.blue_ is decommissioned.
2024-11-12 08:02:00 -06:00
2380468658 v-m/scrape: Collect Jellyfin metrics 2024-11-04 20:38:25 -06:00
db7c07ee55 v-m/scrape: Ignore cloud Kubernetes nodes
The ephemeral Jenkins worker nodes that run in AWS don't have collectd,
promtail, or Zincati.  We don't need to get three alerts every time a
worker starts up to handle an ARM build job, so we drop these discovered
targets for these scrape jobs.
2024-11-04 20:35:17 -06:00
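
The effect of dropping discovered targets, as described in the commit above, can be sketched in Python. This is only an illustration of the filtering logic, not the actual `vmagent` relabeling configuration; the label name `CLOUD_LABEL` and the node names are hypothetical, since the commit does not say how the cloud nodes are identified.

```python
from typing import Dict, List

# Assumption: ephemeral cloud worker nodes carry this (hypothetical) label.
CLOUD_LABEL = "example.com/cloud-worker"


def keep_target(labels: Dict[str, str]) -> bool:
    """Return False for discovered targets that should be dropped."""
    return labels.get(CLOUD_LABEL) != "true"


def filter_targets(discovered: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the targets that are expected to run the scraped agents."""
    return [labels for labels in discovered if keep_target(labels)]


if __name__ == "__main__":
    nodes = [
        {"instance": "k8s-node1", CLOUD_LABEL: "true"},  # ephemeral worker
        {"instance": "k8s-node2"},                        # on-premises node
    ]
    print(filter_targets(nodes))
```

Because the dropped nodes never become scrape targets, they cannot trigger "target down" alerts when they start or stop.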
6cf11f9f61 v-m: Scrape HAProxy 2024-11-01 18:14:37 -05:00
7a768cbb76 v-m: Update jobs for new Loki server
*loki1.pyrocufflink.blue* is a regular Fedora machine, a member of the
AD domain, and managed by Ansible.  Thus, it does not need to be
explicitly listed as a scrape target.

For scraping metrics from Loki itself, I've changed the job to use
DNS-SD because it seems like `vmagent` does _not_ re-resolve host names
from static configuration.
2024-11-01 18:07:34 -05:00
d12e66f58a v-m: Scrape Frigate exporter 2024-11-01 17:47:51 -05:00
ea89e0cde4 v-m/scrape: Remove synapse job
The Synapse server is now completely decommissioned.
2024-10-17 06:50:27 -05:00
ffa47b9fba v-m: Scrape ntfy
_ntfy_ has supported Prometheus metrics for a while now, so let's
collect them.
2024-09-22 12:13:01 -05:00
9ec6b651c1 v-m: Scrape wal-g via statsd_exporter
The database server now runs _statsd_exporter_, which receives metrics
from WAL-G whenever it saves WAL segments or creates backups.
2024-09-22 12:11:59 -05:00
c83ceee994 v-m: Quit scraping Jenkins with blackbox_exporter
I was doing this to monitor Jenkins's certificate, but since that's
managed by _cert-manager_, there's practically no risk of it
expiring without warning anymore.  Since Jenkins is already being
scraped directly, having this extra check just generates extra
notifications when there is an issue without adding any real value.
2024-09-22 12:10:03 -05:00
3f39747557 v-m: Redo Internet/DNS connectivity checks (again)
Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues.  I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up."  I think it makes
sense, though, to just ping the upstream gateway for that check.  If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
2024-09-22 12:06:03 -05:00
f182479d34 v-m: Remove BURP metrics, alerts
BURP is officially decommissioned, replaced by Restic.
2024-09-05 20:16:01 -05:00
78afee9abc v-m/scrape: Remove static VM hosts from collectd
The VM hosts are now managed by the "main" Ansible inventory and thus
appear in the host list ConfigMap.  As such, they do not need to be
listed explicitly in the static targets list.
2024-08-23 09:28:05 -05:00
02001f61db v-m/scrape: websites: Stop scraping Matrix
I'm not using Matrix for anything anymore, and it seems to have gone
offline.  I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.
2024-08-17 10:57:22 -05:00
c7e4baa466 v-m: scrape: Remove nvr2.p.b Zincati scrape target
I've redeployed *nvr2.pyrocufflink.blue* as Fedora Linux, so it does not
run Zincati anymore.
2024-08-17 10:56:06 -05:00
1a631bf366 v-m: scrape: Remove serial1.p.b
This machine never worked correctly; the USB-RS232 adapters would stop
working randomly (and of course it would be whenever I needed to
actually use them).  I thought it was something wrong with the server
itself (a Raspberry Pi 3), but the same thing happened when I tried
using a Pi 4.

The new backup server has a plethora of on-board RS-232 ports, so I'm
going to use it as the serial console server, too.
2024-08-17 10:54:21 -05:00
6f7f09de85 v-m: scrape: Update Unifi server target
I've rebuilt the Unifi Network controller machine (again);
*unifi3.pyrocufflink.blue* has replaced *unifi2.p.b*.  The
`unifi_exporter` no longer works with the latest version of Unifi
Network, so it's not deployed on the new machine.
2024-08-17 10:52:51 -05:00
78cd26c827 v-m: Scrape metrics from RabbitMQ 2024-07-26 20:59:00 -05:00
248a9a5ae9 v-m: Scrape PostgreSQL exporter
The [postgres exporter][0] exposes metrics about the operation and
performance of a PostgreSQL server.  It's currently deployed on
_db0.pyrocufflink.blue_, the primary server of the main PostgreSQL
cluster.

[0]: https://github.com/prometheus-community/postgres_exporter
2024-07-02 18:16:05 -05:00
65e53ad16d v-m: Scrape Zincati metrics from K8s nodes
All the Kubernetes nodes (except *k8s-ctrl0*) are now running Fedora
CoreOS.  We can therefore use the Kubernetes API to discover scrape
targets for the Zincati job.
2024-07-02 18:16:05 -05:00
14be633843 v-m: Scrape Restic exporter 2024-06-26 18:29:49 -05:00
1c4b32925e v-m: Use dynamic discovery for some collectd nodes
We don't need to explicitly specify every single host individually.
Domain controllers, for example, are registered in DNS with SRV records.
Kubernetes nodes, of course, can be discovered using the Kubernetes API.
Both of these classes of nodes change frequently, so discovering them
dynamically is convenient.
2024-06-26 18:29:49 -05:00
48f20eac07 v-m: Scrape metrics from fleetlock 2024-05-31 15:18:55 -05:00
8939c1d02c v-m/scrape: Scrape unifi2.p.b
*unifi2.pyrocufflink.blue* is a Fedora CoreOS host, so it runs
*collectd*, *Promtail*, and *Zincati*.
2024-05-26 11:48:59 -05:00
3b74c3d508 v-m: Scrape metrics from Paperless-ngx Flower 2024-05-22 15:51:07 -05:00
d74e26d527 victoria-metrics: Send alerts via ntfy
I don't like having alerts sent by e-mail.  Since I don't get e-mail
notifications on my watch, I often do not see alerts for quite some
time.  They are also much harder to read in an e-mail client (Fastmail
web and K-9 Mail both display them poorly).  I would much rather have
them delivered via _ntfy_, just like all the rest of the ephemeral
notifications I receive.

Fortunately, it is easy enough to integrate Alertmanager and _ntfy_
using the webhook notifier in Alertmanager.  Since _ntfy_ does not
natively support the Alertmanager webhook API, though, a bridge is
necessary to translate from one data format to the other.  There are a
few options for this bridge, but I chose
[alexbakker/alertmanager-ntfy][0] because it looked the most complete
while also having the simplest configuration format.  Sadly, it does not
expose any Prometheus metrics itself, and since it's deployed in the
_victoria-metrics_ namespace, it needs to be explicitly excluded from
the VMAgent scrape configuration.

[0]: https://github.com/alexbakker/alertmanager-ntfy
2024-05-10 10:32:52 -05:00
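
The translation step that such a bridge performs can be sketched in Python. This is a minimal illustration of mapping an Alertmanager webhook payload onto ntfy's JSON publishing fields (`topic`, `title`, `message`, `tags`, `priority`); it is not how alexbakker/alertmanager-ntfy is implemented, and the choice of tags, priorities, and the topic name are assumptions made for the example.

```python
from typing import Any, Dict, List


def alerts_to_ntfy(payload: Dict[str, Any], topic: str) -> List[Dict[str, Any]]:
    """Convert an Alertmanager webhook payload into ntfy JSON messages.

    One ntfy message is produced per alert in the payload.
    """
    messages = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        firing = alert.get("status") == "firing"
        messages.append({
            "topic": topic,
            "title": labels.get("alertname", "alert"),
            # Prefer the summary annotation, fall back to the description.
            "message": annotations.get("summary")
            or annotations.get("description")
            or "",
            # Assumption: illustrative tag and priority choices.
            "tags": ["warning" if firing else "white_check_mark"],
            "priority": 4 if firing else 3,
        })
    return messages


if __name__ == "__main__":
    example = {
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "HostDown"},
            "annotations": {"summary": "db0 is unreachable"},
        }],
    }
    print(alerts_to_ntfy(example, "alerts"))
```

Each resulting dictionary could then be sent as the JSON body of a POST request to the ntfy server's root URL.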
1581a620ef v-m/scrape: Scrape nvr2.p.b
*nvr2.pyrocufflink.blue* has replaced *nvr1.pyrocufflink.blue* as the
Frigate/recording server.
2024-04-10 21:25:26 -05:00
de72776e73 v-m: Scrape metrics from Authelia
Authelia exposes Prometheus metrics from a different server socket,
which is not enabled by default.
2024-02-27 06:41:52 -06:00
e0b2b3f5ae v-m: Scrape metrics from Patroni
Patroni, a component of the *postgres operator*, exports metrics about
the PostgreSQL database servers it manages.  Notably, it provides
information about the current transaction log location for each server.
This allows us to monitor and alert on the health of database replicas.
2024-02-24 08:33:52 -06:00
83eeb46c93 v-m: Scrape Argo CD
*Argo CD* exposes metrics about itself and the applications it manages.
Notably, this can be useful for monitoring application health.
2024-02-22 07:10:01 -06:00
465f121e61 v-m: Scrape Promtail
The *promtail* job scrapes metrics from all the hosts running Promtail.
The static targets are Fedora CoreOS nodes that are not part of the
Kubernetes cluster.

The relabeling rules ensure that both the static targets and the
targets discovered via the Kubernetes Node API use the FQDN of the host
as the value of the *instance* label.
2024-02-22 07:10:01 -06:00
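
The intent of the relabeling rules in the commit above can be sketched in Python: whatever form the discovered address takes, the *instance* label ends up as the host's FQDN. This is an illustration of the normalization only, not the actual relabeling configuration; the default domain, host names, and port are hypothetical, and the sketch assumes `host[:port]` addresses without IPv6 literals.

```python
def instance_fqdn(address: str, domain: str = "example.org") -> str:
    """Derive an FQDN instance label from a scrape target address.

    Strips any ":port" suffix and appends the domain to short host names.
    """
    host = address.split(":", 1)[0]
    if "." not in host:
        # Short name, as node discovery might report it.
        host = f"{host}.{domain}"
    return host


if __name__ == "__main__":
    # A short node name and a static FQDN target yield labels of the same form.
    print(instance_fqdn("k8s-node1:9080"))
    print(instance_fqdn("loki0.example.org:9080"))
```

Keeping the label format identical across discovery methods means dashboards and alert rules can match on *instance* without caring how a target was found.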
5e4ab1d988 v-m: Update Loki scrape target
Now that Loki uses Caddy as a reverse proxy, we need to update the
scrape target to point to the correct port (443).
2024-02-22 07:10:01 -06:00
4c238a69aa v-m: Scrape Grafana Loki
Grafana Loki is hosted on a VM named *loki0.pyrocufflink.blue*.  It runs
Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape
_collectd_ and _Zincati_ as well.
2024-02-21 09:16:26 -06:00
1f28a623ae v-m: Do not scrape/alert on Graylog
Graylog is down because Elasticsearch corrupted itself again, and this
time, I'm just not going to bother fixing it.  I practically never use
it anymore anyway, and I want to migrate to Grafana Loki, so now seems
like a good time to just get rid of it.
2024-02-01 21:45:43 -06:00
834d0f804f v-m: Scrape Grafana
Grafana exports Prometheus metrics about its own performance.
2024-02-01 09:02:01 -06:00
8ae8bad112 v-m: Scrape serial1.p.b 2024-01-25 20:42:07 -06:00
ad37948fe2 v-m: Scrape all metrics components
We are now getting metrics from *vmstorage*, *vminsert*, *vmselect*,
*vmalert*, *alertmanager*, and *blackbox-exporter*, in addition to
*vmagent*.
2024-01-23 11:51:50 -06:00