1
0
Fork 0
Commit Graph

82 Commits (adb103238b386fec319d6e794375b1cc65dcc43b)

Author SHA1 Message Date
Dustin dc835ddc9d v-m/alerts: Fix PostgreSQL WAL archive failed alert
The `pg_stat_archiver_failed_count` metric is a counter, so once a WAL
archival has failed, it will increase and never return to `0`.  To
ensure the alert is resolved once the WAL archival process recovers, we
need to use the `increase` function to turn it into a gauge.  Finally,
we aggregate that gauge with `max_over_time` to keep the alert from
flapping if the WAL archive occurs less frequently than the scrape
interval.
2025-02-05 10:42:35 -06:00
Dustin 6da330f2be v-m/scrape: Remove k8s SD config for Zincati
There are no more Kubernetes nodes running Fedora CoreOS.
2025-02-01 18:16:10 -06:00
Dustin 11a0f84db7 v-m/scrape: Remove websites job
Websites are being scraped by the `vmagent` on the OVH VPS.
2025-02-01 18:16:10 -06:00
Dustin a87b53e3ac v-m: Add alert for Frigate camera no video
At some point this week, the front porch camera stopped sending video.
I'm not sure exactly what happened to it, but Frigate kept logging
"Unable to read frames from ffmpeg process."  I power-cycled the camera,
which resolved the issue.

Unfortunately, no alerts were generated about this situation.  Home
Assistant did not consider the camera entity unavailable, presumably
because Frigate was still reporting stats about it.  Thus, I missed
several important notifications.  To avoid this in the future, I have
enabled the "Camera FPS" sensors for all of the cameras in Home
Assistant, and added this alert to trigger when the reported framerate
is 0.

I really also need to get alerts for log events configured, as that
would also indicated there was an issue.
2025-02-01 18:16:10 -06:00
Dustin b9d69ec0a3 v-m/alerts: Ignore missing backups from Toad, Luma
Toad and Luma can go offline for several days at a time if I don't use
them.  I don't need an alert telling me this.
2024-12-21 12:23:19 -06:00
Dustin a03d63841d v-m/alerts: Fire paperless email alert after 12h
We don't need a notification about paperless not scheduling email tasks
every time there is a gap in the metric.  This can happen in some
innocuous situations like when the pod restarts or if there is a brief
disruption of service.  Using the `absent_over_time` function with a
range vector, we can have the alert fire only if there have been no
email tasks scheduled within the last 12 hours.
2024-12-21 12:17:45 -06:00
Dustin d04c18cfcd v-m/alerts: Remove 'no file changes' alert
It turns out this alert is not very useful, and indeed quite annoying.
Many servers can go for days or even weeks with no changes, which is
completely normal.
2024-12-21 12:14:11 -06:00
Dustin 6e15b11f73 Merge branch 'fix-nextcloud-alert' 2024-12-21 11:58:41 -06:00
Dustin e0c633c21e v-m: scrape: Fix Nextcloud URL
Nextcloud uses a _client-side_ (Javascript) redirect to navigate the
browser to its `index.php`.  The page it serves with this redirect is
static and will often load successfully, even if there is a problem with
the application.  This causes the Blackbox exporter to record the site
as "up," even when it it definitely is not.  To avoid this, we can
scrape the `index.php` page explicitly, ensuring that the application is
loaded.
2024-11-17 18:43:00 +00:00
Dustin 0209f921c3 v-m: Remove nut0 from scrape targets
_nut0.pyrocufflink.blue_ is decommissioned.
2024-11-12 08:02:00 -06:00
Dustin 9f287d0f71 v-m/alerts: Add alerts for backup RAID array
Just like I did with the RAID-1 array in the old BURP server, I will
keep one member active and one in the fireproof safe, swapping them each
month.  We can use the same metrics queries to alert on when the swap
should happen that we used with the BURP server.
2024-11-04 20:46:03 -06:00
Dustin 2380468658 v-m/scrape: Collect Jellyfin metrics 2024-11-04 20:38:25 -06:00
Dustin db7c07ee55 v-m/scrape: Ignore cloud Kubernetes nodes
The ephemeral Jenkins worker nodes that run in AWS don't have colletcd,
promtail, or Zincati.  We don't needto get three alerts every time a
worker starts up to handle am ARM build job, so we drop these discovered
targets for these scrape jobs.
2024-11-04 20:35:17 -06:00
Dustin d76a1360c8 v-m/alerts: Ignore Paperless consume_file task
Paperless-ngx uses a Celery task to process uploaded files, converting
them to PDF, running OCR, etc.  This task can be marked as "failed" for
various reasons, most of which are more about the document itself than
the health of the application.  The GUI displays the results of failed
tasks when they occur.  It doesn't really make sense to have an alert
about this scenario, especially since there's nothing to do to directly
clear the alert anyway.
2024-11-04 20:28:11 -06:00
Dustin 8ecee4133f v-m/alerts: Rework free disk space alert
Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal
circumstances on aarch64 machines.  This is not a problem, because it
cleans up old files on its own, so we do not need to alert on it.
Unfortunately, the _DiskUsage_ alert is already quite complex, and
adding in exclusions for these devices would make it even worse.

To simplify the logic, we can use a recording rule to precomupte the
used/free space ratio.  By using `sum(...) without (type)` instead of
`sum(...) on (df, instance)`, we keep the other labels, which we can
then use to identify the metrics coming from machines we don't care to
monitor.

Instead of having different thresholds for different volumes
encoded in the same expression, we can use multiple alerts to alert on
"low" vs "very low" thresholds.  Since this will of course cause
duplicate alerts for most volumes, we can use AlertManager inhibition
rules to disable the "low" alert once the metric crosses the "very low"
threshold.
2024-11-02 09:38:02 -05:00
Dustin 4cef41688f v-m/alerts: Add Zigbee+ZWave network alerts 2024-11-01 18:14:56 -05:00
Dustin 6cf11f9f61 v-m: Scrape HAProxy 2024-11-01 18:14:37 -05:00
Dustin 7a768cbb76 v-m: Update jobs for new Loki server
*loki1.pyrocufflink.blue* is a regular Fedora machine, a member of the
AD domain, and managed by Ansible.  Thus, it does not need to be
explicitly listed as a scrape target.

For scraping metrics from Loki itself, I've changed the job to use
DNS-SD because it seems like `vmagent` does _not_ re-resolve host names
from static configuration.
2024-11-01 18:07:34 -05:00
Dustin 0101040634 v-m/alerts: Add Paperless-ngx email task alert
This alert should fire if the background task to fetch e-mail and import
them into Paperless-ngx has not run for a while.
2024-11-01 18:04:06 -05:00
Dustin 3f9601dc94 v-m/alerts: Improve Paperless-ngx Celery task alert
The `flower_events_total` metric is a counter, so its value only ever
increases (discounting restarts of the server process).  As such,
nonzero values do not necessarily indicate a _current_ problem, but
rather that there was one at some point in the past.  To identify
current issues, we need to use the `increase` function, and then apply
the `max_over_time` function so that the alert doesn't immediately reset
itself.
2024-11-01 18:00:50 -05:00
Dustin d12e66f58a v-m: Scrape Frigate exporter 2024-11-01 17:47:51 -05:00
Dustin e19e8f50ab v-m/alerts: Add alerts for Paperless-ngx 2024-10-17 07:18:23 -05:00
Dustin 78651eb5f8 v-m/alerts: Add alerts for PostgreSQL WAL archiver 2024-10-17 07:18:09 -05:00
Dustin ee3e078b20 v-m/alerts: Add alerts for Restic backups 2024-10-17 06:58:48 -05:00
Dustin ea89e0cde4 v-m/scrape: Remove synapse job
The Synapse server is now completely decommissioned.
2024-10-17 06:50:27 -05:00
Dustin ffa47b9fba v-m: Scrape ntfy
_ntfy_ has supported Prometheus metrics for a while now, so let's
collect them.
2024-09-22 12:13:01 -05:00
Dustin 9ec6b651c1 v-m: Scrape wal-g via statsd_exporter
The database server now runs _statsd_exporter_, which receives metrics
from WAL-G whenever it saves WAL segments or creates backups.
2024-09-22 12:11:59 -05:00
Dustin c83ceee994 v-m: Quit scraping Jenkins with blackbox_exporter
I was doing this to monitor Jenkins's certificate, but since that's
managed by _cert-manager_, there's really practically no risk of it
expiring without warning anymore.  Since Jenkins is already being
scraped directly, having this extra check just gernerates extra
notifications when there is an issue without adding any real value.
2024-09-22 12:10:03 -05:00
Dustin 3f39747557 v-m: Redo Internet/DNS connectivity checks (again)
Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues.  I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up."  I think it makes
sense, though, to just ping the upstream gateway for that check.  If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
2024-09-22 12:06:03 -05:00
Dustin 8f354a4460 v-m/alertmanager: Suppress battery low alerts
The alerts for Z-Wave device batteries in particular are pretty
annoying, as they tend to "flap" for some reason.  I like having the
alerts show up on Alertmanager/Grafana dashboards, but I don't
necessarily need notifications about them.  Fortunately, we can create a
special "none" receiver and route notifications there, which does
exactly what we want here.
2024-09-22 12:01:02 -05:00
Dustin f182479d34 v-m: Remove BURP metrics, alerts
BURP is officially decommissioned, replaced by Restic.
2024-09-05 20:16:01 -05:00
Dustin 78afee9abc v-m/scrape: Remove static VM hosts from collectd
The VM hosts are now managed by the "main" Ansible inventory and thus
appear in the host list ConfigMap.  As such, they do not need to be
listed explicitly in the static targets list.
2024-08-23 09:28:05 -05:00
Dustin 7dffb5195a v-m: alertmanager: Group disk usage alerts
Some machines have the same volume mounted multiple times (e.g.
container hosts, BURP).  Alerts will fire for all of these
simultaneously when the filesystem usage passes the threshold.  To avoid
getting spammed with a bunch of messages about the same filesystem,
we'll group alerts from the same machine.
2024-08-17 10:59:05 -05:00
Dustin 02001f61db v-m/scrape: webistes: Stop scraping Matrix
I'm not using Matrix for anything anymore, and it seems to have gone
offline.  I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.
2024-08-17 10:57:22 -05:00
Dustin c7e4baa466 v-m: scrape: Remove nvr2.p.b Zincati scrape target
I've redeployed *nvr2.pyrocufflink.blue* as Fedora Linux, so it does not
run Zincati anymore.
2024-08-17 10:56:06 -05:00
Dustin 1a631bf366 v-m: scrape: Remove serial1.p.b
This machine never worked correctly; the USB-RS232 adapters would stop
working randomly (and of course it would be whenever I needed to
actually use them).  I thought it was something wrong with the server
itself (a Raspberry Pi 3), but the same thing happened when I tried
using a Pi 4.

The new backup server has a plethora of on-board RS-232 ports, so I'm
going to use it as the serial console server, too.
2024-08-17 10:54:21 -05:00
Dustin 6f7f09de85 v-m: scrape: Update Unifi server target
I've rebuilt the Unifi Network controller machine (again);
*unifi3.pyrocufflink.blue* has replaced *unifi2.p.b*.  The
`unifi_exporter` no longer works with the latest version of Unifi
Network, so it's not deployed on the new machine.
2024-08-17 10:52:51 -05:00
Dustin 809676f691 v-m: alerts: Add Longhorn alerts 2024-08-17 10:51:13 -05:00
Dustin 78cd26c827 v-m: Scrape metrics from RabbitMQ 2024-07-26 20:59:00 -05:00
Dustin 8cb292a4b2 v-m: alerts: Add alert for temperatures
After the incident this week with the CPU overheating on _vmhost1_, I
want to make sure I know as soon as possible when anything is starting
to get too hot.
2024-07-11 22:07:27 -05:00
Dustin 8113e5a47f v-m: Fix syntax in AlertManager config
The `group_by` field takes a list of label names, rather than a single
string.
2024-07-06 07:13:27 -05:00
Dustin 952ab9f264 v-m: alertmanager: Group camera notifications
When Frigate is down, multiple alerts are generated for each camera, as
Home Assistant creates camera entities for each tracked object.  This is
extremely annoying, not to mention unnecessary.  To address this, we'll
configure AlertManager to send a single notification for alerts in the
group.
2024-07-05 07:30:30 -05:00
Dustin 9b26753e73 v-m: alerts: Add durations to spammy alerts
Let's avoid sending alerts immediately when something is unavailable,
because the issue might be transient and will resolve itself shortly.
2024-07-05 07:23:38 -05:00
Dustin 248a9a5ae9 v-m: Scrape PostgreSQL exporter
The [postgres exporter][0] exposes metrics about the operation and
performance of a PostgreSQL server.  It's currently deployed on
_db0.pyrocufflink.blue_, the primary server of the main PostgreSQL
cluster.

[0]: https://github.com/prometheus-community/postgres_exporter
2024-07-02 18:16:05 -05:00
Dustin a8ef4c7a80 v-m: Add component labels to configmaps
Adding a `component` label to each ConfigMap will make it possible to
target them specifically, e.g. with `kubectl apply -l`.
2024-07-02 18:16:05 -05:00
Dustin 65e53ad16d v-m: Scrape Zinciti metrics from K8s nodes
All the Kubernetes nodes (except *k8s-ctrl0*) are now running Fedora
CoreOS.  We can therefore use the Kubernetes API to discover scrape
targets for the Zincati job.
2024-07-02 18:16:05 -05:00
Dustin 2d7fec1cdf v-m: vmstorage: Add pod anti-affinity
One of the reasons for moving to 4 `vmstorage` replicas was to ensure
that the load was spread evenly between the physical VM host machines.
To ensure that is the case as much as possible, we need to keep one
pod per Kubernetes node.
2024-06-26 18:29:49 -05:00
Dustin f7f408ca8c v-m: Redo vmstorage persistent volumes
Longhorn does not work well for very large volumes.  It takes ages to
synchronize/rebuild them when migrating between nodes, which happens
all too frequently.  This consumes a lot of resources, which impacts
the operation of the rest of the cluster, and can cause a cascading
failure in some circumstances.

Now that the cluster is set up to be able to mount storage directly from
the Synology, it makes sense to move the Victoria Metrics data there as
well.  Similar to how I did this with Jenkins, I created
PersistentVolume resources that map to iSCSI volumes, and patched the
PersistentVolumeClaims (or rather the template for them defined by the
StatefulSet) to use these.  Each `vmstorage` pod then gets an iSCSI
LUN, bypassing both Longhorn and QEMU to write directly to the NAS.

The migration process was relatively straightforwrad.  I started by
scaling down the `vminsert` Deployment so the `vmagent` pods would
queue the metrics they had collected while the storage layer was down.
Next, I created a [native][0] export of all the time series in the
database.  Then, I deleted the `vmstorage` StatefulSet and its
associated PVCs.  Finally, I applied the updated configuration,
including the new PVs and patched PVCs, and brought the `vminsert`
pods back online.  Once everything was up and running, I re-imported
the exported data.

[0]: https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-export-data-in-native-format
2024-06-26 18:29:49 -05:00
Dustin ab458df415 v-m/vmstorage: Start pods in parallel
By default, Kubernetes waits for each pod in a StatefulSet to become
"ready" before starting the next one.  If there is a problem starting
that pod, e.g. data corruption, then the others will never start.  This
sort of defeats the purpose of having multiple replicas.  Fortunately,
we can configure the pod management policy to start all the pods at
once, regardless of the status of any individual pod.  This way, if
there is a problem with the first pod, the others will still come up
and serve whatever data they have.
2024-06-26 18:29:49 -05:00
Dustin 14be633843 v-m: Scrape Restic exporter 2024-06-26 18:29:49 -05:00