As with AlertManager, the point of having multiple replicas of `vmagent`
is so that one is always running, even if the other fails. Thus, we
want to start the pods in parallel so that if the first one does not
come up, the second one at least has a chance.
If something prevents the first AlertManager instance from starting, we
don't want to wait forever for it before starting the second. That
pretty much defeats the purpose of having two instances. Fortunately,
we can configure Kubernetes to bring up both instances simultaneously by
setting the pod management policy to `Parallel`.
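In the StatefulSet spec, that's a one-line change; a rough sketch (the
names, labels, and image here are just illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
spec:
  replicas: 2
  serviceName: alertmanager
  podManagementPolicy: Parallel  # default is OrderedReady, which starts pods one at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager
  template:
    metadata:
      labels:
        app.kubernetes.io/name: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: quay.io/prometheus/alertmanager  # illustrative
```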
We also don't need a 4 GB volume for AlertManager; even 500 MB is
way too big for the tiny amount of data it stores, but that's about the
smallest size a filesystem can be.
_bw0.pyrocufflink.blue_ was decommissioned some time ago, so it
doesn't get backed up any more. We want to keep its previous backups
around, though, in case we ever need to restore something. This
triggers the "no recent backups" alert, since the last snapshot is over
a week old. Let's ignore that hostname when generating this alert.
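The exclusion goes in the alert expression's label matcher; here's a
sketch with placeholder metric and alert names (not the real ones):

```yaml
- alert: NoRecentBackup
  expr: |
    # placeholder metric name; exclude the decommissioned host
    time() - backup_last_success_timestamp_seconds{host!="bw0.pyrocufflink.blue"}
      > 7 * 86400
  annotations:
    summary: '{{ $labels.host }} has not been backed up in over a week'
```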
The `vmagent` needs a place to spool data it has not yet sent to
Victoria Metrics, but it doesn't really need to be persistent. As long
as all of the `vmagent` nodes _and_ all of the `vminsert` nodes do not
go down simultaneously, there shouldn't be any data loss. If they are
all down at the same time, there's probably something else going on and
lost metrics are the least concerning problem.
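A rough sketch of the pod spec, pointing the spool directory at an
`emptyDir` via vmagent's `-remoteWrite.tmpDataPath` flag (image and
paths are illustrative):

```yaml
containers:
  - name: vmagent
    image: victoriametrics/vmagent  # illustrative
    args:
      - -remoteWrite.tmpDataPath=/spool
    volumeMounts:
      - name: spool
        mountPath: /spool
volumes:
  - name: spool
    emptyDir: {}  # ephemeral; lost if the pod is rescheduled, which is fine here
```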
Music Assistant doesn't expose any metrics natively. Since we really
only care about whether or not it's accessible, scraping it with the
blackbox exporter is fine.
_Firefly III_ and _phpipam_ don't export any Prometheus metrics, so we
have to scrape them via the Blackbox Exporter.
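These all follow the usual Blackbox Exporter relabeling pattern; a
sketch, with placeholder hostnames, module name, and exporter address:

```yaml
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://firefly.example.com/
        - https://phpipam.example.com/
        - https://music.example.com/
  relabel_configs:
    # pass the target URL to the exporter as the ?target= parameter
    - source_labels: [__address__]
      target_label: __param_target
    # keep the probed URL as the instance label
    - source_labels: [__param_target]
      target_label: instance
    # actually scrape the Blackbox Exporter itself
    - target_label: __address__
      replacement: blackbox-exporter:9115
```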
Paperless-ngx only exposes metrics via Flower, but since Flower runs in
the same container as the main application, we can assume that if
Flower is unavailable, the application is as well.
Docker Hub has blocked ("rate limited") my IP address. Moving as much
as I can to use images from other sources. Hopefully they'll unblock me
soon and I can deploy a caching proxy.
Scraping metrics from the Kubernetes API server has started taking 20+
seconds recently.  Until I figure out the underlying cause, I'm
increasing the scrape timeout so that the _vmagent_ doesn't give up and
report the API server as "down."
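Roughly, in the scrape job definition (the job name and the exact
values here are illustrative):

```yaml
- job_name: kubernetes-apiservers
  scheme: https
  scrape_interval: 1m
  scrape_timeout: 50s  # default is 10s; must not exceed the scrape interval
  kubernetes_sd_configs:
    - role: endpoints
```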
*unifi3.pyrocufflink.blue* has been replaced by
*unifi-nuptials.host.pyrocufflink.black*. The former was the last
Fedora CoreOS machine in use, so the entire Zincati scrape job is no
longer needed.
The `pg_stat_archiver_failed_count` metric is a counter, so once a WAL
archival has failed, it will increase and never return to `0`. To
ensure the alert is resolved once the WAL archival process recovers, we
need to use the `increase` function to turn it into a gauge. Finally,
we aggregate that gauge with `max_over_time` to keep the alert from
flapping if the WAL archive occurs less frequently than the scrape
interval.
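The resulting expression looks something like this; the alert name and
window sizes are illustrative:

```yaml
- alert: WALArchivalFailing
  expr: |
    # increase() drops back to 0 once failures stop; the subquery with
    # max_over_time holds the alert up long enough to avoid flapping
    max_over_time(
      increase(pg_stat_archiver_failed_count[10m])[1h:]
    ) > 0
```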
At some point this week, the front porch camera stopped sending video.
I'm not sure exactly what happened to it, but Frigate kept logging
"Unable to read frames from ffmpeg process." I power-cycled the camera,
which resolved the issue.
Unfortunately, no alerts were generated about this situation. Home
Assistant did not consider the camera entity unavailable, presumably
because Frigate was still reporting stats about it. Thus, I missed
several important notifications. To avoid this in the future, I have
enabled the "Camera FPS" sensors for all of the cameras in Home
Assistant, and added this alert to trigger when the reported framerate
is 0.
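A rough sketch of the rule, assuming the FPS sensors come through the
Home Assistant Prometheus integration as `homeassistant_sensor_state`
with an `entity` label (both of those names are guesses, not
confirmed):

```yaml
- alert: CameraNoFrames
  expr: |
    # entity naming pattern is an assumption
    homeassistant_sensor_state{entity=~"sensor\\..*_camera_fps"} == 0
  for: 5m
  annotations:
    summary: '{{ $labels.entity }} is reporting 0 frames per second'
```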
I really also need to get alerts for log events configured, as that
would also have indicated there was an issue.
We don't need a notification about paperless not scheduling email tasks
every time there is a gap in the metric. This can happen in some
innocuous situations like when the pod restarts or if there is a brief
disruption of service. Using the `absent_over_time` function with a
range vector, we can have the alert fire only if there have been no
email tasks scheduled within the last 12 hours.
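Something along these lines, with a placeholder metric name:

```yaml
- alert: PaperlessEmailTasksNotScheduled
  expr: |
    # fires only if the (placeholder) metric has been missing for the
    # entire 12-hour window, not just during a brief gap
    absent_over_time(paperless_mail_tasks_scheduled[12h])
```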
It turns out this alert is not very useful, and indeed quite annoying.
Many servers can go for days or even weeks with no changes, which is
completely normal.
Nextcloud uses a _client-side_ (Javascript) redirect to navigate the
browser to its `index.php`. The page it serves with this redirect is
static and will often load successfully, even if there is a problem with
the application.  This causes the Blackbox exporter to record the site
as "up," even when it definitely is not.  To avoid this, we can
scrape the `index.php` page explicitly, ensuring that the application is
loaded.
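The only real change is the probe target in the scrape job (the
hostname here is a placeholder):

```yaml
static_configs:
  - targets:
      # probe index.php directly so the PHP application has to respond,
      # rather than the static page that serves the client-side redirect
      - https://nextcloud.example.com/index.php
```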
Just like I did with the RAID-1 array in the old BURP server, I will
keep one member active and one in the fireproof safe, swapping them each
month. We can use the same metrics queries to alert on when the swap
should happen that we used with the BURP server.
The ephemeral Jenkins worker nodes that run in AWS don't have collectd,
promtail, or Zincati.  We don't need to get three alerts every time a
worker starts up to handle an ARM build job, so we drop these discovered
targets for these scrape jobs.
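A sketch of the relabeling rule; the source label and naming pattern
are assumptions:

```yaml
relabel_configs:
  # drop the ephemeral AWS workers from this job's discovered targets
  - source_labels: [__address__]
    regex: jenkins-worker-.*
    action: drop
```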
Paperless-ngx uses a Celery task to process uploaded files, converting
them to PDF, running OCR, etc. This task can be marked as "failed" for
various reasons, most of which are more about the document itself than
the health of the application. The GUI displays the results of failed
tasks when they occur. It doesn't really make sense to have an alert
about this scenario, especially since there's nothing to do to directly
clear the alert anyway.
Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal
circumstances on aarch64 machines. This is not a problem, because it
cleans up old files on its own, so we do not need to alert on it.
Unfortunately, the _DiskUsage_ alert is already quite complex, and
adding in exclusions for these devices would make it even worse.
To simplify the logic, we can use a recording rule to precompute the
used/free space ratio. By using `sum(...) without (type)` instead of
`sum(...) by (df, instance)`, we keep the other labels, which we can
then use to identify the metrics coming from machines we don't care to
monitor.
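A rough sketch of the recording rule, assuming collectd's `df` plugin
metric names:

```yaml
groups:
  - name: disk
    rules:
      - record: df:df_complex:used_ratio
        expr: |
          # "without (type)" keeps all the other labels, so the alert
          # expressions can still filter on them
          sum without (type) (collectd_df_df_complex{type="used"})
            /
          sum without (type) (collectd_df_df_complex)
```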
Instead of having different thresholds for different volumes
encoded in the same expression, we can use separate alerts for the
"low" and "very low" thresholds.  Since this will of course cause
duplicate alerts for most volumes, we can use AlertManager inhibition
rules to disable the "low" alert once the metric crosses the "very low"
threshold.
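The inhibition rule looks something like this; the alert names are
placeholders:

```yaml
inhibit_rules:
  # once the "very low" alert fires for a filesystem, suppress the
  # corresponding "low" alert for the same instance and mount point
  - source_matchers:
      - alertname = "DiskSpaceVeryLow"
    target_matchers:
      - alertname = "DiskSpaceLow"
    equal:
      - instance
      - df
```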
*loki1.pyrocufflink.blue* is a regular Fedora machine, a member of the
AD domain, and managed by Ansible. Thus, it does not need to be
explicitly listed as a scrape target.
For scraping metrics from Loki itself, I've changed the job to use
DNS-SD because it seems like `vmagent` does _not_ re-resolve host names
from static configuration.
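Roughly (the DNS name and port are placeholders):

```yaml
- job_name: loki
  dns_sd_configs:
    - names:
        - loki.pyrocufflink.blue  # re-resolved on every refresh interval
      type: A
      port: 3100
```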
The `flower_events_total` metric is a counter, so its value only ever
increases (discounting restarts of the server process). As such,
nonzero values do not necessarily indicate a _current_ problem, but
rather that there was one at some point in the past. To identify
current issues, we need to use the `increase` function, and then apply
the `max_over_time` function so that the alert doesn't immediately reset
itself.
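Something like this; the label selector and window sizes are
illustrative:

```yaml
- alert: PaperlessTaskFailed
  expr: |
    # increase() returns to 0 once failures stop; max_over_time keeps
    # the alert from resetting on the very next evaluation
    max_over_time(
      increase(flower_events_total{type="task-failed"}[15m])[1h:]
    ) > 0
```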
I was doing this to monitor Jenkins's certificate, but since that's
managed by _cert-manager_, there's practically no risk of it
expiring without warning anymore.  Since Jenkins is already being
scraped directly, having this extra check just generates extra
notifications when there is an issue without adding any real value.
Using domain names in the "blackbox" probe makes it difficult to tell
the difference between a complete Internet outage and DNS issues. I
switched to using these names when I changed how the firewall routed
traffic to the public DNS servers, since those were the IP addresses
I was using to determine if the Internet was "up." I think it makes
sense, though, to just ping the upstream gateway for that check. If
EverFast changes their routing or numbering, we'll just have to update
our checks to match.
The alerts for Z-Wave device batteries in particular are pretty
annoying, as they tend to "flap" for some reason. I like having the
alerts show up on Alertmanager/Grafana dashboards, but I don't
necessarily need notifications about them. Fortunately, we can create a
special "none" receiver and route notifications there, which does
exactly what we want here.
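In the Alertmanager configuration, that looks roughly like this (the
matcher is just an example):

```yaml
receivers:
  # a receiver with no notification integrations: alerts routed here
  # still appear in the Alertmanager and Grafana dashboards
  - name: 'none'
route:
  routes:
    - matchers:
        - alertname = "ZWaveBatteryLow"  # illustrative matcher
      receiver: 'none'
```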
The VM hosts are now managed by the "main" Ansible inventory and thus
appear in the host list ConfigMap. As such, they do not need to be
listed explicitly in the static targets list.
Some machines have the same volume mounted multiple times (e.g.
container hosts, BURP). Alerts will fire for all of these
simultaneously when the filesystem usage passes the threshold. To avoid
getting spammed with a bunch of messages about the same filesystem,
we'll group alerts from the same machine.
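Something like this in the route configuration (grouping by `instance`
is my reading of "same machine"):

```yaml
route:
  group_by:
    # one notification per alert name per machine, covering every
    # mount point on that host
    - alertname
    - instance
```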
I'm not using Matrix for anything anymore, and it seems to have gone
offline. I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.