Commit Graph

30 Commits (e3d0b5e91857dc758f70dd232dae7339151d4847)

Author SHA1 Message Date
Dustin 1b9543b88f metricspi: alerts: Increase Frigate disk threshold
We want the Frigate recording volume to be basically full at all times,
to ensure we are keeping as much recording as possible.
2023-10-15 09:52:12 -05:00
Dustin 2f554dda72 metricspi: Scrape k8s-aarch64-n1
I've added a new Kubernetes worker node,
*k8s-aarch64-n1.pyrocufflink.blue*.  This machine is a Raspberry Pi CM4
mounted on a Waveshare CM4-IO-Base A and clipped onto the DIN rail.
It's got 8 GB of RAM and 32 GB of eMMC storage.  I intend to use it to
build container images locally, instead of bringing up cloud instances.
2023-10-05 14:32:19 -05:00
Dustin a74113d95f metricspi: Scrape Zincati metrics from CoreOS hosts
Zincati is the automatic update manager on Fedora CoreOS.  It exposes
Prometheus metrics for host/update statistics, which are useful to track
the progress of automatic updates and identify update issues.

Zinciti actually exposes its metrics via a Unix socket on the
filesystem.  Another process, [local_exporter], is required to expose
the metrics from this socket via HTTP so Prometheus can scrape them.

[local_exporter]: https://github.com/lucab/local_exporter
2023-10-03 10:29:12 -05:00
Dustin d7f778b01c metricspi: Scrape metrics from k8s-aarch64-n0
*collectd* is now running on *k8s-aarch64-n0.pyrocufflink.blue*,
exposing system metrics.  As it is not a member of the AD domain, it has
to be explicitly listed in the `scrape_collectd_extra_targets` variable.
2023-10-03 10:29:11 -05:00
Dustin 50f4b565f8 hosts: Remove nvr1.p.b as managed system
*nvr1.pyrocufflink.blue* has been migrated to Fedora CoreOS.  As such,
it is no longer managed by Ansible; its configuration is done via
Butane/Ignition.  It is no longer a member of the Active Directory
domain, but it does still run *collectd* and export Prometheus metrics.
2023-09-27 20:24:47 -05:00
Dustin 0a68d84121 metricspi: Scrape hatchlearningcenter.org
To monitor site availability and certificate expiration.
2023-06-21 14:31:33 -05:00
Dustin 4e608e379f metricspi/alerts: Correct BURP archive alert query
When the RAID array is being resynchronized after the archived disk has
been reconnected, md changes the disk status from "missing" to "spare."
Once the synchronization is complete, it changes from "spare" to
"active."  We only want to trigger the "disk needs archived" alert once
the synchronization process is complete; otherwise, both the "disks need
swapped" and "disk needs archived" alerts would be active at the same
time, which makes no sense.  By adjusting the query for the "disk needs
archived" alert to consider disks in both "missing" and "spare" status,
we can delay firing that alert until the proper time.
2023-06-20 11:58:35 -05:00
Dustin 347cda74fd metrics: Scrape metrics from Kubernetes API server
Kubernetes exports a *lot* of metrics in Prometheus format.  I am not
sure what all is there, yet, but apparently several thousand time series
were added.

To allow anonymous access to the metrics, I added this RoleBinding:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
```
2023-05-22 21:21:08 -05:00
Dustin c0bb387b18 metricspi: Scrape metrics from MinIO backup storage
MinIO exposes metrics in Prometheus exposition format.  By default, it
requires an authentication token to access the metrics, but I was unable
to get this to work.  Fortunately, it can be configured to allow
anonymous access to the metrics, which is fine, in my opinion.
2023-05-22 21:19:25 -05:00
Dustin 2c002aa7c5 alerts: Add alert to archive BURP disk
This alert will fire once the MD RAID resynchronization process has
completed and both disks in the array are online.  It will clear when
one disk is disconnected and moved to the safe.
2023-05-16 08:33:13 -05:00
Dustin 877dcc3879 alerts: Add alerts for missed client backups
When BURP fails to even *start* a backup, it does not trigger a
notification at all.  As a result, I may not notice for a few days when
backups are not happening.  That was the case this week, when clients'
backups were failing immediately, because of a file permissions issue on
the server.  To hopefully avoid missing backups for too long in the
future, I've added two new alerts:

* The *no recent backups* alert fires if there have not been *any* BURP
  backups recently.  This may also fire, for example, if the BURP
  exporter is not working, or if there is something wrong with the BURP
  data volume.
* The *missed client backup* alert fires if an active BURP client (i.e.
  one that has had at least one backup in the past 90 days) has not been
  backed up in the last 24 hours.
2023-05-14 11:48:36 -05:00
Dustin a2bcd5ccbb alerts: Adjust BURP RAID disk swap alert
Using a 30-day window for the `tlast_change_over_time` function
effectively "caps out" the value at 30 days.  Thus, the alert reminding
me to swap the BURP backup volume will never fire, since the value will
never be greater than the 30-day threshold.  Using a wider window
resolves that issue (though the query will still produce inaccurate
results beyond the window).
2023-05-14 11:38:00 -05:00
Dustin 9722fed1b8 metricspi: Scrape dustinandtabitha.com 2023-05-09 21:30:11 -05:00
Dustin f6f286ac24 alerts: Correct BURP volume swap alert
The `tlast_change_over_time` function needs an interval wide enough to
consider the range of time we are intrested in.  In this case, we want
to see if the BURP volume has been swapped in the last thirty days, so
the interval needs to be `30d`.
2023-05-03 11:06:34 -05:00
Dustin a4cc9d0c46 metricspi: Scrape tabitha.biz 2023-04-23 20:03:43 -05:00
Dustin 1da4c17a8c alerts: Add alerts for HTTPS certificates
These alerts will generate notifications when websites' HTTPS
certificates are not properly renewed automatically and become in danger
of expiring.
2023-04-12 13:55:31 -05:00
Dustin bf4133652c metrics: Scrape Jenkins with blackbox exporter
This is mostly to monitor the HTTPS certificate expiration.
2023-04-12 13:55:31 -05:00
Dustin dc2a05dc8f alerts: Add alert for BURP RAID array swap
This alert counts how long its been since the number of "active" disks
in the RAID array on the BURP server has changed.  The assumption is
that the number will typically be `1`, but it will be `2` when the
second disk synchronized before the swap occurs.
2023-04-11 22:25:36 -05:00
Dustin 2394bf7436 metricspi: Fix vmalert links
1. Grafana 8 changed the format of the query string parameters for the
   Explore page.
2. vmalert no longer needs the http.pathPrefix argument when behind a
   reverse proxy, rather it uses the request path like the other
   Victoria Metrics components.
2023-04-11 21:46:43 -05:00
Dustin 6c562c9821 alerts: Ignore missing mdraid disk for BURP
The way I am handling swapping out the BURP disk now is by using the
Linux MD RAID driver to manage a RAID 1 mirror array.  The array
normally operates with one disk missing, as it is in the fireproof safe.
When it is time to swap the disks, I reattach the offline disk, let the
array resync, then disconnect and store the other disk.

This works considerably better than the previous method, as it does not
require BURP or the NFS server to be offline during the synchronization.
2023-04-11 20:08:07 -05:00
Dustin a59f24a8b5 metricspi: Stop scraping speedtest
Running the speed test periodically was just wasting bandwidth.  It
failed frequently, and generally did not provide useful information.
2023-04-02 11:05:16 -05:00
Dustin 632e1dd906 metricspi: Update LDAP configuration
All domain controllers now use the Let's Encrypt wildcard certificate
for the *pyrocufflink.blue* domain.  Further, *dc2.p.b* is
decommissioned.
2023-01-09 12:23:54 -06:00
Dustin 637289036a blackbox: Update pyrocufflink DNS check
I changed the naming convention for domain controller machines.  They
are no longer "numbered," since the plan is to rotate through them
quickly.  For each release of Fedora, we'll create two new domain
controllers, replacing the existing ones.  Their names are now randomly
generated and contain letters and numbers, so the Blackbox Exporter
check for DNS records needs to account for this.
2022-12-19 09:04:37 -06:00
Dustin 77c6408187 metricspi: Remove sensors scrape job
Sensor data are retrieved via Home Assistant.
2022-12-18 19:16:10 -06:00
Dustin bc60451949 metricspi: Update DNS server address
DNS is now handled by the border firewall.
2022-08-20 18:19:13 -05:00
Dustin dbc18022f2 metricspi: Increase scrape_timeout for speedtest
Running the Internet speed test can often take longer than a minute.
2022-08-12 14:54:49 -05:00
Dustin ce3e88932d vmalert: Allow configuring http.pathPrefix
*vmalert* requires explicit configuration when it is behind a reverse
proxy.
2022-08-12 13:10:36 -05:00
Dustin fe87edea21 r/vmalert: Allow configuring external source URLs
The `-external.url` and `-external.alert.source` command line arguments
and their corresponding environment variables can be used to configure
the "Source" links associated with alerts created by `vmalert`.
2022-08-12 12:58:53 -05:00
Dustin c57500a9f4 metricspi: Update speedtest scrape target
The firewall hardware is too slow to run the *prometheus_speedtest*
program.  It always showed *way* lower speeds than were actually
available.  I've moved the service to the Kubernetes cluster and it
works a lot better there.
2022-08-12 12:55:52 -05:00
Dustin 4ddbc9f256 hosts: Add mtrcs0.p.r
*mtrcs0.pyrocufflink.red* is a Raspberry Pi CM4 on a Waveshare
CM4-IO-BASE-B carrier board with a NVMe SSD.  It runs a custom OS built
using Buildroot, and is not a member of the *pyrocufflink.blue* AD
domain.

*mtrcs0.p.r* hosts Victoria Metrics/`vmagent`, `vmalert`, AlertManager,
and Grafana.  I've created a unique group and playbook for it,
*metricspi*, to manage all these applications together.
2022-08-11 21:40:19 -05:00