Commit Graph

400 Commits

Author SHA1 Message Date
9f287d0f71 v-m/alerts: Add alerts for backup RAID array
Just like I did with the RAID-1 array in the old BURP server, I will
keep one member active and one in the fireproof safe, swapping them each
month.  We can use the same metrics queries to alert on when the swap
should happen that we used with the BURP server.
2024-11-04 20:46:03 -06:00
2380468658 v-m/scrape: Collect Jellyfin metrics 2024-11-04 20:38:25 -06:00
db7c07ee55 v-m/scrape: Ignore cloud Kubernetes nodes
The ephemeral Jenkins worker nodes that run in AWS don't have colletcd,
promtail, or Zincati.  We don't needto get three alerts every time a
worker starts up to handle am ARM build job, so we drop these discovered
targets for these scrape jobs.
2024-11-04 20:35:17 -06:00
d76a1360c8 v-m/alerts: Ignore Paperless consume_file task
Paperless-ngx uses a Celery task to process uploaded files, converting
them to PDF, running OCR, etc.  This task can be marked as "failed" for
various reasons, most of which are more about the document itself than
the health of the application.  The GUI displays the results of failed
tasks when they occur.  It doesn't really make sense to have an alert
about this scenario, especially since there's nothing to do to directly
clear the alert anyway.
2024-11-04 20:28:11 -06:00
8ecee4133f v-m/alerts: Rework free disk space alert
Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal
circumstances on aarch64 machines.  This is not a problem, because it
cleans up old files on its own, so we do not need to alert on it.
Unfortunately, the _DiskUsage_ alert is already quite complex, and
adding in exclusions for these devices would make it even worse.

To simplify the logic, we can use a recording rule to precomupte the
used/free space ratio.  By using `sum(...) without (type)` instead of
`sum(...) on (df, instance)`, we keep the other labels, which we can
then use to identify the metrics coming from machines we don't care to
monitor.

Instead of having different thresholds for different volumes
encoded in the same expression, we can use multiple alerts to alert on
"low" vs "very low" thresholds.  Since this will of course cause
duplicate alerts for most volumes, we can use AlertManager inhibition
rules to disable the "low" alert once the metric crosses the "very low"
threshold.
2024-11-02 09:38:02 -05:00
4cef41688f v-m/alerts: Add Zigbee+ZWave network alerts 2024-11-01 18:14:56 -05:00
6cf11f9f61 v-m: Scrape HAProxy 2024-11-01 18:14:37 -05:00
7a768cbb76 v-m: Update jobs for new Loki server
*loki1.pyrocufflink.blue* is a regular Fedora machine, a member of the
AD domain, and managed by Ansible.  Thus, it does not need to be
explicitly listed as a scrape target.

For scraping metrics from Loki itself, I've changed the job to use
DNS-SD because it seems like `vmagent` does _not_ re-resolve host names
from static configuration.
2024-11-01 18:07:34 -05:00
0101040634 v-m/alerts: Add Paperless-ngx email task alert
This alert should fire if the background task to fetch e-mail and import
them into Paperless-ngx has not run for a while.
2024-11-01 18:04:06 -05:00
3f9601dc94 v-m/alerts: Improve Paperless-ngx Celery task alert
The `flower_events_total` metric is a counter, so its value only ever
increases (discounting restarts of the server process).  As such,
nonzero values do not necessarily indicate a _current_ problem, but
rather that there was one at some point in the past.  To identify
current issues, we need to use the `increase` function, and then apply
the `max_over_time` function so that the alert doesn't immediately reset
itself.
2024-11-01 18:00:50 -05:00
d12e66f58a v-m: Scrape Frigate exporter 2024-11-01 17:47:51 -05:00
045eea89a9 Merge remote-tracking branch 'refs/remotes/origin/master' 2024-10-19 09:49:59 -05:00
8ff45a8c01 paperless-ngx/gotenberg: Run as correct user
The Gotenberg container image uses UID 1001 for the _gotenberg_ user.
Using any other UID number, even when the home directory is set and
owned by that UID, results in random issues, especially when using
LibreOffice conversions.
2024-10-19 09:46:15 -05:00
d3e00680c0 Merge pull request 'home-assistant: Update to 2024.10.3' (#29) from updatebot/home-assistant into master
Reviewed-on: #29
2024-10-19 13:13:12 +00:00
bot
c5daf23f71 mosquitto: Update to 2.0.20 2024-10-19 11:32:16 +00:00
bot
6e2c8d1a25 zwavejs2mqtt: Update to 9.24.0 2024-10-19 11:32:16 +00:00
bot
0e3f719e32 whisper: Update to 2.2.0 2024-10-19 11:32:16 +00:00
bot
94e10207d2 home-assistant: Update to 2024.10.3 2024-10-19 11:32:15 +00:00
99c8f7694c paperless-ngx: Split resources into separate files
The Paperless-ngx ecosystem consists of several services.  Defining the
resources for each service in separate manifest files will make
maintenance a little bit easier.
2024-10-17 07:27:33 -05:00
e19e8f50ab v-m/alerts: Add alerts for Paperless-ngx 2024-10-17 07:18:23 -05:00
78651eb5f8 v-m/alerts: Add alerts for PostgreSQL WAL archiver 2024-10-17 07:18:09 -05:00
ee3e078b20 v-m/alerts: Add alerts for Restic backups 2024-10-17 06:58:48 -05:00
ea89e0cde4 v-m/scrape: Remove synapse job
The Synapse server is now completely decommissioned.
2024-10-17 06:50:27 -05:00
e581957f9d Merge remote-tracking branch 'refs/remotes/origin/master' 2024-10-15 07:59:42 -05:00
b01300f8cc Merge pull request 'zwavejs2mqtt: Update to 9.20.0' (#26) from updatebot/home-assistant into master
Reviewed-on: #26
2024-10-15 12:43:28 +00:00
bot
55ae979a1d mosquitto: Update to 2.0.19 2024-10-15 12:42:36 +00:00
bot
1de05f2ccc zwavejs2mqtt: Update to 9.23.0 2024-10-15 12:42:36 +00:00
bot
58f7f9e2cc zigbee2mqtt: Update to 1.40.2 2024-10-15 12:42:35 +00:00
bot
390eacf209 home-assistant: Update to 2024.10.2 2024-10-15 12:42:35 +00:00
145fa6286e storage: Add Longhorn backup target secret
Longhorn uses a special Secret resource to configure the backup target.
This secret includes the credentials and CA certificate for accessing
the MinIO S3 service.

Longhorn must be configured to use this Secret by setting the
`backup-target-credential-secret` setting to
`minio-backups-credentials`.
2024-10-13 14:03:49 -05:00
1b4bb234c8 Merge pull request 'gotenberg: Update to 8.10.0' (#25) from updatebot/paperless-ngx into master
Reviewed-on: #25
2024-10-12 20:44:58 +00:00
7e2512c261 Merge pull request 'authelia: Update to 4.38.12' (#28) from updatebot/authelia into master
Reviewed-on: #28
2024-10-12 20:44:41 +00:00
bot
281ec623c4 authelia: Update to 4.38.16 2024-10-12 11:33:03 +00:00
bot
51fe6f39af gotenberg: Update to 8.12.0 2024-10-12 11:33:00 +00:00
2ccbcd494c firefly-iii: Update to 6.1.21
Notably, this version fixes the ~4s delay when creating/editing
transactions.
2024-10-02 09:08:58 -05:00
e9bfc63a74 Merge remote-tracking branch 'refs/remotes/origin/master' 2024-10-02 09:08:31 -05:00
32171cc76e Merge pull request 'firefly-iii: Update to 6.1.20' (#27) from updatebot/firefly-iii into master
Reviewed-on: #27
2024-09-29 21:09:41 +00:00
bot
71f091fa05 firefly-iii: Update to 6.1.20 2024-09-28 11:32:18 +00:00
df50decba1 argocd: apps/authelia: Enable auto-sync
This way, merging PRs from *updatebot* will automatically trigger
updating Paperless-ngx et al.
2024-09-24 07:16:45 -05:00
0022171616 argocd: apps/ntfy: Enable auto-sync
This way, merging PRs from *updatebot* will automatically trigger
updating Paperless-ngx et al.
2024-09-24 07:16:34 -05:00
a149bc8761 updatebot: Manage Authelia 2024-09-24 07:15:41 -05:00
76588c3e20 updatebot: Manage Mosquitto 2024-09-24 07:08:56 -05:00
bdc24e1778 updatebot: Manage ntfy 2024-09-24 07:05:37 -05:00
982cd88255 Merge remote-tracking branch 'refs/remotes/origin/master' 2024-09-22 12:13:58 -05:00
ffa47b9fba v-m: Scrape ntfy
_ntfy_ has supported Prometheus metrics for a while now, so let's
collect them.
2024-09-22 12:13:01 -05:00
9ec6b651c1 v-m: Scrape wal-g via statsd_exporter
The database server now runs _statsd_exporter_, which receives metrics
from WAL-G whenever it saves WAL segments or creates backups.
2024-09-22 12:11:59 -05:00
c83ceee994 v-m: Quit scraping Jenkins with blackbox_exporter
I was doing this to monitor Jenkins's certificate, but since that's
managed by _cert-manager_, there's really practically no risk of it
expiring without warning anymore.  Since Jenkins is already being
scraped directly, having this extra check just gernerates extra
notifications when there is an issue without adding any real value.
2024-09-22 12:10:03 -05:00
3f39747557 v-m: Redo Internet/DNS connectivity checks (again)
Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues.  I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up."  I think it makes
sense, though, to just ping the upstream gateway for that check.  If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
2024-09-22 12:06:03 -05:00
8f354a4460 v-m/alertmanager: Suppress battery low alerts
The alerts for Z-Wave device batteries in particular are pretty
annoying, as they tend to "flap" for some reason.  I like having the
alerts show up on Alertmanager/Grafana dashboards, but I don't
necessarily need notifications about them.  Fortunately, we can create a
special "none" receiver and route notifications there, which does
exactly what we want here.
2024-09-22 12:01:02 -05:00
1c6286a977 ntfy: Migrate to Kustomize
Using Kustomize, we can define the configuration file separately from
the Kubernetes resources, and use `configMapGenerators` to generate the
ConfigMap for it.  Additionally, this will make it possible to update
_ntfy_ using `updatebot`.
2024-09-22 12:00:28 -05:00