kubernetes

Author	SHA1	Message	Date
Dustin C. Hatch	db7c07ee55	v-m/scrape: Ignore cloud Kubernetes nodes The ephemeral Jenkins worker nodes that run in AWS don't have collectd, promtail, or Zincati. We don't need to get three alerts every time a worker starts up to handle an ARM build job, so we drop these discovered targets for these scrape jobs.	2024-11-04 20:35:17 -06:00
Dustin C. Hatch	d76a1360c8	v-m/alerts: Ignore Paperless consume_file task Paperless-ngx uses a Celery task to process uploaded files, converting them to PDF, running OCR, etc. This task can be marked as "failed" for various reasons, most of which are more about the document itself than the health of the application. The GUI displays the results of failed tasks when they occur. It doesn't really make sense to have an alert about this scenario, especially since there's nothing to do to directly clear the alert anyway.	2024-11-04 20:28:11 -06:00
Dustin C. Hatch	8ecee4133f	v-m/alerts: Rework free disk space alert Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal circumstances on aarch64 machines. This is not a problem, because it cleans up old files on its own, so we do not need to alert on it. Unfortunately, the _DiskUsage_ alert is already quite complex, and adding in exclusions for these devices would make it even worse. To simplify the logic, we can use a recording rule to precompute the used/free space ratio. By using `sum(...) without (type)` instead of `sum(...) on (df, instance)`, we keep the other labels, which we can then use to identify the metrics coming from machines we don't care to monitor. Instead of having different thresholds for different volumes encoded in the same expression, we can use multiple alerts to alert on "low" vs "very low" thresholds. Since this will of course cause duplicate alerts for most volumes, we can use AlertManager inhibition rules to disable the "low" alert once the metric crosses the "very low" threshold.	2024-11-02 09:38:02 -05:00
Dustin C. Hatch	4cef41688f	v-m/alerts: Add Zigbee+ZWave network alerts	2024-11-01 18:14:56 -05:00
Dustin C. Hatch	6cf11f9f61	v-m: Scrape HAProxy	2024-11-01 18:14:37 -05:00
Dustin C. Hatch	7a768cbb76	v-m: Update jobs for new Loki server loki1.pyrocufflink.blue is a regular Fedora machine, a member of the AD domain, and managed by Ansible. Thus, it does not need to be explicitly listed as a scrape target. For scraping metrics from Loki itself, I've changed the job to use DNS-SD because it seems like `vmagent` does _not_ re-resolve host names from static configuration.	2024-11-01 18:07:34 -05:00
Dustin C. Hatch	0101040634	v-m/alerts: Add Paperless-ngx email task alert This alert should fire if the background task that fetches e-mail messages and imports them into Paperless-ngx has not run for a while.	2024-11-01 18:04:06 -05:00
Dustin C. Hatch	3f9601dc94	v-m/alerts: Improve Paperless-ngx Celery task alert The `flower_events_total` metric is a counter, so its value only ever increases (discounting restarts of the server process). As such, nonzero values do not necessarily indicate a _current_ problem, but rather that there was one at some point in the past. To identify current issues, we need to use the `increase` function, and then apply the `max_over_time` function so that the alert doesn't immediately reset itself.	2024-11-01 18:00:50 -05:00
Dustin C. Hatch	d12e66f58a	v-m: Scrape Frigate exporter	2024-11-01 17:47:51 -05:00
Dustin C. Hatch	e19e8f50ab	v-m/alerts: Add alerts for Paperless-ngx	2024-10-17 07:18:23 -05:00
Dustin C. Hatch	78651eb5f8	v-m/alerts: Add alerts for PostgreSQL WAL archiver	2024-10-17 07:18:09 -05:00
Dustin C. Hatch	ee3e078b20	v-m/alerts: Add alerts for Restic backups	2024-10-17 06:58:48 -05:00
Dustin C. Hatch	ea89e0cde4	v-m/scrape: Remove synapse job The Synapse server is now completely decommissioned.	2024-10-17 06:50:27 -05:00
Dustin C. Hatch	ffa47b9fba	v-m: Scrape ntfy _ntfy_ has supported Prometheus metrics for a while now, so let's collect them.	2024-09-22 12:13:01 -05:00
Dustin C. Hatch	9ec6b651c1	v-m: Scrape wal-g via statsd_exporter The database server now runs _statsd_exporter_, which receives metrics from WAL-G whenever it saves WAL segments or creates backups.	2024-09-22 12:11:59 -05:00
Dustin C. Hatch	c83ceee994	v-m: Quit scraping Jenkins with blackbox_exporter I was doing this to monitor Jenkins's certificate, but since that's managed by _cert-manager_, there's practically no risk of it expiring without warning anymore. Since Jenkins is already being scraped directly, having this extra check just generates extra notifications when there is an issue without adding any real value.	2024-09-22 12:10:03 -05:00
Dustin C. Hatch	3f39747557	v-m: Redo Internet/DNS connectivity checks (again) Using domain names in the "blackbox" probe makes it difficult to tell the difference between a complete Internet outage and DNS issues. I switched to using these names when I changed how the firewall routed traffic to the public DNS servers, since those were the IP addresses I was using to determine if the Internet was "up." I think it makes sense, though, to just ping the upstream gateway for that check. If EverFast changes their routing or numbering, we'll just have to update our checks to match.	2024-09-22 12:06:03 -05:00
Dustin C. Hatch	8f354a4460	v-m/alertmanager: Suppress battery low alerts The alerts for Z-Wave device batteries in particular are pretty annoying, as they tend to "flap" for some reason. I like having the alerts show up on Alertmanager/Grafana dashboards, but I don't necessarily need notifications about them. Fortunately, we can create a special "none" receiver and route notifications there, which does exactly what we want here.	2024-09-22 12:01:02 -05:00
Dustin C. Hatch	f182479d34	v-m: Remove BURP metrics, alerts BURP is officially decommissioned, replaced by Restic.	2024-09-05 20:16:01 -05:00
Dustin C. Hatch	78afee9abc	v-m/scrape: Remove static VM hosts from collectd The VM hosts are now managed by the "main" Ansible inventory and thus appear in the host list ConfigMap. As such, they do not need to be listed explicitly in the static targets list.	2024-08-23 09:28:05 -05:00
Dustin C. Hatch	7dffb5195a	v-m: alertmanager: Group disk usage alerts Some machines have the same volume mounted multiple times (e.g. container hosts, BURP). Alerts will fire for all of these simultaneously when the filesystem usage passes the threshold. To avoid getting spammed with a bunch of messages about the same filesystem, we'll group alerts from the same machine.	2024-08-17 10:59:05 -05:00
Dustin C. Hatch	02001f61db	v-m/scrape: websites: Stop scraping Matrix I'm not using Matrix for anything anymore, and it seems to have gone offline. I haven't fully decommissioned it yet, but the Blackbox scrape is failing, so I'll just disable that bit for now.	2024-08-17 10:57:22 -05:00
Dustin C. Hatch	c7e4baa466	v-m: scrape: Remove nvr2.p.b Zincati scrape target I've redeployed nvr2.pyrocufflink.blue as Fedora Linux, so it does not run Zincati anymore.	2024-08-17 10:56:06 -05:00
Dustin C. Hatch	1a631bf366	v-m: scrape: Remove serial1.p.b This machine never worked correctly; the USB-RS232 adapters would stop working randomly (and of course it would be whenever I needed to actually use them). I thought it was something wrong with the server itself (a Raspberry Pi 3), but the same thing happened when I tried using a Pi 4. The new backup server has a plethora of on-board RS-232 ports, so I'm going to use it as the serial console server, too.	2024-08-17 10:54:21 -05:00
Dustin C. Hatch	6f7f09de85	v-m: scrape: Update Unifi server target I've rebuilt the Unifi Network controller machine (again); unifi3.pyrocufflink.blue has replaced unifi2.p.b. The `unifi_exporter` no longer works with the latest version of Unifi Network, so it's not deployed on the new machine.	2024-08-17 10:52:51 -05:00
Dustin C. Hatch	809676f691	v-m: alerts: Add Longhorn alerts	2024-08-17 10:51:13 -05:00
Dustin C. Hatch	78cd26c827	v-m: Scrape metrics from RabbitMQ	2024-07-26 20:59:00 -05:00
Dustin C. Hatch	8cb292a4b2	v-m: alerts: Add alert for temperatures After the incident this week with the CPU overheating on _vmhost1_, I want to make sure I know as soon as possible when anything is starting to get too hot.	2024-07-11 22:07:27 -05:00
Dustin C. Hatch	8113e5a47f	v-m: Fix syntax in AlertManager config The `group_by` field takes a list of label names, rather than a single string.	2024-07-06 07:13:27 -05:00
Dustin C. Hatch	952ab9f264	v-m: alertmanager: Group camera notifications When Frigate is down, multiple alerts are generated for each camera, as Home Assistant creates camera entities for each tracked object. This is extremely annoying, not to mention unnecessary. To address this, we'll configure AlertManager to send a single notification for alerts in the group.	2024-07-05 07:30:30 -05:00
Dustin C. Hatch	9b26753e73	v-m: alerts: Add durations to spammy alerts Let's avoid sending alerts immediately when something is unavailable, because the issue might be transient and will resolve itself shortly.	2024-07-05 07:23:38 -05:00
Dustin C. Hatch	248a9a5ae9	v-m: Scrape PostgreSQL exporter The [postgres exporter][0] exposes metrics about the operation and performance of a PostgreSQL server. It's currently deployed on _db0.pyrocufflink.blue_, the primary server of the main PostgreSQL cluster. [0]: https://github.com/prometheus-community/postgres_exporter	2024-07-02 18:16:05 -05:00
Dustin C. Hatch	a8ef4c7a80	v-m: Add component labels to configmaps Adding a `component` label to each ConfigMap will make it possible to target them specifically, e.g. with `kubectl apply -l`.	2024-07-02 18:16:05 -05:00
Dustin C. Hatch	65e53ad16d	v-m: Scrape Zincati metrics from K8s nodes All the Kubernetes nodes (except k8s-ctrl0) are now running Fedora CoreOS. We can therefore use the Kubernetes API to discover scrape targets for the Zincati job.	2024-07-02 18:16:05 -05:00
Dustin C. Hatch	2d7fec1cdf	v-m: vmstorage: Add pod anti-affinity One of the reasons for moving to 4 `vmstorage` replicas was to ensure that the load was spread evenly between the physical VM host machines. To ensure that is the case as much as possible, we need to keep one pod per Kubernetes node.	2024-06-26 18:29:49 -05:00
Dustin C. Hatch	f7f408ca8c	v-m: Redo vmstorage persistent volumes Longhorn does not work well for very large volumes. It takes ages to synchronize/rebuild them when migrating between nodes, which happens all too frequently. This consumes a lot of resources, which impacts the operation of the rest of the cluster, and can cause a cascading failure in some circumstances. Now that the cluster is set up to be able to mount storage directly from the Synology, it makes sense to move the Victoria Metrics data there as well. Similar to how I did this with Jenkins, I created PersistentVolume resources that map to iSCSI volumes, and patched the PersistentVolumeClaims (or rather the template for them defined by the StatefulSet) to use these. Each `vmstorage` pod then gets an iSCSI LUN, bypassing both Longhorn and QEMU to write directly to the NAS. The migration process was relatively straightforward. I started by scaling down the `vminsert` Deployment so the `vmagent` pods would queue the metrics they had collected while the storage layer was down. Next, I created a [native][0] export of all the time series in the database. Then, I deleted the `vmstorage` StatefulSet and its associated PVCs. Finally, I applied the updated configuration, including the new PVs and patched PVCs, and brought the `vminsert` pods back online. Once everything was up and running, I re-imported the exported data. [0]: https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-export-data-in-native-format	2024-06-26 18:29:49 -05:00
Dustin C. Hatch	ab458df415	v-m/vmstorage: Start pods in parallel By default, Kubernetes waits for each pod in a StatefulSet to become "ready" before starting the next one. If there is a problem starting that pod, e.g. data corruption, then the others will never start. This sort of defeats the purpose of having multiple replicas. Fortunately, we can configure the pod management policy to start all the pods at once, regardless of the status of any individual pod. This way, if there is a problem with the first pod, the others will still come up and serve whatever data they have.	2024-06-26 18:29:49 -05:00
Dustin C. Hatch	14be633843	v-m: Scrape Restic exporter	2024-06-26 18:29:49 -05:00
Dustin C. Hatch	1c4b32925e	v-m: Use dynamic discovery for some collectd nodes We don't need to explicitly specify every single host individually. Domain controllers, for example, are registered in DNS with SRV records. Kubernetes nodes, of course, can be discovered using the Kubernetes API. Both of these classes of nodes change frequently, so discovering them dynamically is convenient.	2024-06-26 18:29:49 -05:00
Dustin C. Hatch	b8015c0bed	v-m: blackbox: Force TCP probe to IPv4 Since I added an IPv6 ULA prefix to the "main" VLAN (to allow communicating with the Synology directly), the domain controllers now have AAAA records. This causes the `sambadc` scrape job to fail because Blackbox Exporter prefers IPv6 by default, but Kubernetes pods do not have IPv6 addresses.	2024-06-26 18:29:49 -05:00
Dustin C. Hatch	48f20eac07	v-m: Scrape metrics from fleetlock	2024-05-31 15:18:55 -05:00
Dustin C. Hatch	8939c1d02c	v-m/scrape: Scrape unifi2.p.b unifi2.pyrocufflink.blue is a Fedora CoreOS host, so it runs collectd, Promtail, and Zincati.	2024-05-26 11:48:59 -05:00
Dustin C. Hatch	3b74c3d508	v-m: Scrape metrics from Paperless-ngx Flower	2024-05-22 15:51:07 -05:00
Dustin C. Hatch	d5bfdaca25	v-m/alertmanager-ntfy: Add labels to notifications Just having the alert name and group name in the ntfy notification is not enough to really indicate what the problem is, as some alerts can generate notifications for many reasons. In the email notifications AlertManager sends by default, the values (but not the keys) of all labels are included in the subject, so we will reproduce that here.	2024-05-22 15:20:27 -05:00
Dustin C. Hatch	d74e26d527	victoria-metrics: Send alerts via ntfy I don't like having alerts sent by e-mail. Since I don't get e-mail notifications on my watch, I often do not see alerts for quite some time. They are also much harder to read in an e-mail client (Fastmail web and K-9 Mail both display them poorly). I would much rather have them delivered via _ntfy_, just like all the rest of the ephemeral notifications I receive. Fortunately, it is easy enough to integrate Alertmanager and _ntfy_ using the webhook notifier in Alertmanager. Since _ntfy_ does not natively support the Alertmanager webhook API, though, a bridge is necessary to translate from one data format to the other. There are a few options for this bridge, but I chose [alexbakker/alertmanager-ntfy][0] because it looked the most complete while also having the simplest configuration format. Sadly, it does not expose any Prometheus metrics itself, and since it's deployed in the _victoria-metrics_ namespace, it needs to be explicitly excluded from the VMAgent scrape configuration. [0]: https://github.com/alexbakker/alertmanager-ntfy	2024-05-10 10:32:52 -05:00
Dustin C. Hatch	ebea31fe55	v-m: alerts: Add alert for camera offline	2024-04-23 09:42:04 -05:00
Dustin C. Hatch	1581a620ef	v-m/scrape: Scrape nvr2.p.b nvr2.pyrocufflink.blue has replaced nvr1.pyrocufflink.blue as the Frigate/recording server.	2024-04-10 21:25:26 -05:00
Dustin C. Hatch	de72776e73	v-m: Scrape metrics from Authelia Authelia exposes Prometheus metrics from a different server socket, which is not enabled by default.	2024-02-27 06:41:52 -06:00
Dustin C. Hatch	e0b2b3f5ae	v-m: Scrape metrics from Patroni Patroni, a component of the postgres operator, exports metrics about the PostgreSQL database servers it manages. Notably, it provides information about the current transaction log location for each server. This allows us to monitor and alert on the health of database replicas.	2024-02-24 08:33:52 -06:00
Dustin C. Hatch	83eeb46c93	v-m: Scrape Argo CD Argo CD exposes metrics about itself and the applications it manages. Notably, this can be useful for monitoring application health.	2024-02-22 07:10:01 -06:00
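
As a rough illustration of the reasoning in commit 3f9601dc94 (a raw counter stays nonzero forever, so the alert applies `increase` and then `max_over_time`), the Python sketch below imitates those two functions on a list of evenly spaced samples. The helper implementations, window sizes, and sample data are made up for illustration; the arithmetic is a simplification and not the exact behavior of Prometheus or VictoriaMetrics.

```python
from typing import List


def increase(samples: List[float], window: int) -> List[float]:
    """Counter growth over the trailing `window` sample intervals.

    Simplified model: no extrapolation, and a drop in value is treated
    as a counter reset (the process restarted and counted up from zero).
    """
    out = []
    for i in range(len(samples)):
        start = max(0, i - window)
        total = 0.0
        for j in range(start + 1, i + 1):
            delta = samples[j] - samples[j - 1]
            total += delta if delta >= 0 else samples[j]
        out.append(total)
    return out


def max_over_time(values: List[float], window: int) -> List[float]:
    """Maximum of each value and the `window` values before it."""
    return [max(values[max(0, i - window): i + 1]) for i in range(len(values))]


# Hypothetical failure counter: one task fails at the third sample.
failed_total = [0, 0, 1, 1, 1, 1, 1, 1]

# Alerting on the raw counter never clears once a single failure occurred.
raw_alert = [v > 0 for v in failed_total]

# Alerting on the recent increase clears quickly; wrapping it in
# max_over_time holds the alert for a few more samples before it resets.
recent = increase(failed_total, window=2)
held_alert = [v > 0 for v in max_over_time(recent, window=3)]

print(raw_alert)
print(recent)
print(held_alert)
```

With this sample data the raw-counter alert stays active through the last sample, while the `increase`-based alert clears after the hold window passes without new failures.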

70 Commits