kubernetes

Author	SHA1	Message	Date
Dustin C. Hatch	ebea31fe55	v-m: alerts: Add alert for camera offline	2024-04-23 09:42:04 -05:00
Dustin C. Hatch	1581a620ef	v-m/scrape: Scrape nvr2.p.b nvr2.pyrocufflink.blue has replaced nvr1.pyrocufflink.blue as the Frigate/recording server.	2024-04-10 21:25:26 -05:00
Dustin C. Hatch	de72776e73	v-m: Scrape metrics from Authelia Authelia exposes Prometheus metrics from a different server socket, which is not enabled by default.	2024-02-27 06:41:52 -06:00
Dustin C. Hatch	e0b2b3f5ae	v-m: Scrape metrics from Patroni Patroni, a component of the postgres operator, exports metrics about the PostgreSQL database servers it manages. Notably, it provides information about the current transaction log location for each server. This allows us to monitor and alert on the health of database replicas.	2024-02-24 08:33:52 -06:00
Dustin C. Hatch	83eeb46c93	v-m: Scrape Argo CD Argo CD exposes metrics about itself and the applications it manages. Notably, this can be useful for monitoring application health.	2024-02-22 07:10:01 -06:00
Dustin C. Hatch	465f121e61	v-m: Scrape Promtail The promtail job scrapes metrics from all the hosts running Promtail. The static targets are Fedora CoreOS nodes that are not part of the Kubernetes cluster. The relabeling rules ensure that both the static targets and the targets discovered via the Kubernetes Node API use the FQDN of the host as the value of the instance label.	2024-02-22 07:10:01 -06:00
Dustin C. Hatch	5e4ab1d988	v-m: Update Loki scrape target Now that Loki uses Caddy as a reverse proxy, we need to update the scrape target to point to the correct port (443).	2024-02-22 07:10:01 -06:00
Dustin C. Hatch	4c238a69aa	v-m: Scrape Grafana Loki Grafana Loki is hosted on a VM named loki0.pyrocufflink.blue. It runs Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape _collectd_ and _Zincati_ as well.	2024-02-21 09:16:26 -06:00
Dustin C. Hatch	2acefd9a72	v-m: Add alert for sensor battery levels I did not realize the batteries on the garage door tilt sensors had died. Adding alerts for various sensor batteries should help keep me better informed.	2024-02-16 20:56:38 -06:00
Dustin C. Hatch	1f28a623ae	v-m: Do not scrape/alert on Graylog Graylog is down because Elasticsearch corrupted itself again, and this time, I'm just not going to bother fixing it. I practically never use it anymore anyway, and I want to migrate to Grafana Loki, so now seems like a good time to just get rid of it.	2024-02-01 21:45:43 -06:00
Dustin C. Hatch	834d0f804f	v-m: Scrape Grafana Grafana exports Prometheus metrics about its own performance.	2024-02-01 09:02:01 -06:00
Dustin C. Hatch	8ae8bad112	v-m: Scrape serial1.p.b	2024-01-25 20:42:07 -06:00
Dustin C. Hatch	ad37948fe2	v-m: Scrape all metrics components We are now getting metrics from vmstorage, vminsert, vmselect, vmalert, alertmanager, and blackbox-exporter, in addition to vmagent.	2024-01-23 11:51:50 -06:00
Dustin C. Hatch	bcb588407d	v-m: Correct vmalert remote read/write URLs vmalert has been generating alerts and triggering notifications, but not writing any `ALERTS`/`ALERTS_FOR_STATE` metrics. It turns out this is because I had not correctly configured the remote read/write URLs.	2024-01-23 10:45:40 -06:00
Dustin C. Hatch	119a8a74ae	v-m: alerts: Enhance Frigate unavailable alert If Frigate is running but not connected to the MQTT broker, the `sensor.frigate_status` entity will be available, but the `update.frigate_server` entity will not.	2024-01-22 18:27:30 -06:00
Dustin C. Hatch	54e7a25f93	v-m: vmstorage: Remove startup/ready probes Kubernetes will not start additional Pods in a StatefulSet until the existing ones are Ready. This means that if there is a problem bringing up, e.g. `vmstorage-0`, it will never start `vmstorage-1` or `vmstorage-2`. Since this pretty much defeats the purpose of having a multi-node `vmstorage` cluster, we have to remove the readiness probe, so the Pods will be Ready as soon as they start. If there is a problem with one of them, it will matter less, as the others can still run.	2024-01-22 16:43:46 -06:00
Dustin C. Hatch	ca02dfec62	v-m: Add host labels to collectd-virt metrics The virt plugin for collectd sets `instance` to the name of the libvirt domain the metric refers to. This makes it so there is no label identifying which host the VM is running on. Thus, if we want to classify metrics by VM host, we need to add that label explicitly. Since the `__address__` label is not available during metric relabeling, we need to store it in a temporary label, which gets dropped at the end of the relabeling phase. We copy the value of that label into a new label, but only for metrics that match the desired metric name.	2024-01-22 11:12:19 -06:00
Dustin C. Hatch	51775ede81	v-m/vmagent: Scrape nut0 nut0.pyrocufflink.blue is the new UPS monitor server. It runs Fedora CoreOS, with NUT in a container.	2024-01-15 18:46:46 -06:00
Dustin C. Hatch	90b293d5c8	v-m/vmagent: Scrape k8s-amd64-n3	2024-01-15 18:45:52 -06:00
Dustin C. Hatch	278be05121	v-m/blackbox: Switch to upstream container image I found the official container image for Prometheus Blackbox exporter. It is hosted on Quay, which is why I didn't see it on Docker Hub when I looked initially.	2024-01-15 18:45:25 -06:00
Dustin C. Hatch	539e25d9bd	v-m/vmagent: Scrape public clouds to test Internet Scraping the public DNS servers doesn't work anymore since the firewall routes traffic through Mullvad. Pinging public cloud providers should give a pretty decent indication of Internet connectivity. It will also serve as a benchmark for the local DNS performance, since the names will have to be resolved.	2024-01-15 18:44:46 -06:00
Dustin C. Hatch	98cdcdfe30	v-m/scrape: Stable instance label for Longhorn By default, the `instance` label for discovered metrics targets is set to the scrape address. For Kubernetes pods, that is the IP address and port of the pod, which naturally changes every time the pod is recreated or moved. This will cause a high churn rate for Longhorn manager pods. To avoid this, we set the `instance` label to the name of the node the pod is running on, which will not change because the Longhorn manager pods are managed by a DaemonSet.	2024-01-04 09:16:20 -06:00
Dustin C. Hatch	bac7de72f2	v-m: Scrape Longhorn manager metrics Each Longhorn manager pod exports metrics about the node on which it is running. Thus, we have to scrape every pod to get the metrics about the whole ecosystem.	2024-01-02 11:27:31 -06:00
Dustin C. Hatch	225fd8469c	v-m/vmagent: Allow listing all pods in cluster The original RBAC configuration allowed `vmagent` only to list the pods in the `victoria-metrics` namespace. In order to allow it to monitor other applications' pods, it needs to be assigned permission to list pods in all namespaces.	2024-01-02 11:25:54 -06:00
Dustin C. Hatch	8f088fb6ae	v-m: Deploy (clustered) Victoria Metrics Since mtrcs0.pyrocufflink.blue (the Metrics Pi) seems to be dying, I decided to move monitoring and alerting into Kubernetes. I was originally planning to have a single, dedicated virtual machine for Victoria Metrics and Grafana, similar to how the Metrics Pi was set up, but running Fedora CoreOS instead of a custom Buildroot-based OS. While I was working on the Ignition configuration for the VM, it occurred to me that monitoring would be interrupted frequently, since FCOS updates weekly and all updates require a reboot. I would rather not have that many gaps in the data. Ultimately I decided that deploying a cluster with Kubernetes would probably be more robust and reliable, as updates can be performed without any downtime at all. I chose not to use the Victoria Metrics Operator, but rather handle the resource definitions myself. Victoria Metrics components are not particularly difficult to deploy, so the overhead of running the operator and using its custom resources would not be worth the minor convenience it provides.	2024-01-01 17:48:10 -06:00

25 Commits
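
The relabeling approach described in commit ca02dfec62 (stash the scrape address in a temporary label, copy it into a new label only for matching metrics, then drop the temporary label) can be illustrated with a small simulation. This is only a sketch of the idea, not the repository's actual vmagent configuration: the label names `tmp_address` and `host`, the metric name pattern, and the example hostname are all assumptions made for illustration.

```python
import re

# Hypothetical names, chosen only for illustration; the real
# configuration may use different labels and a different pattern.
TMP_LABEL = "tmp_address"
HOST_LABEL = "host"
NAME_PATTERN = r"collectd_virt_.*"


def relabel_target(target):
    """Target relabeling phase: copy __address__ into a temporary label.

    Labels whose names start with "__" are discarded once target
    relabeling finishes, so the temporary label uses an ordinary name.
    """
    labels = dict(target)
    labels[TMP_LABEL] = labels["__address__"]
    return {k: v for k, v in labels.items() if not k.startswith("__")}


def relabel_metric(sample, target_labels):
    """Metric relabeling phase: add the host label, then drop the temp label.

    In this sketch the sample's own labels take precedence over the
    target labels, so `instance` keeps the libvirt domain name.
    """
    labels = {**target_labels, **sample}
    # Only metrics with a matching name receive the host label.
    if re.fullmatch(NAME_PATTERN, labels["__name__"]):
        labels[HOST_LABEL] = labels[TMP_LABEL]
    # The temporary label is always removed at the end.
    labels.pop(TMP_LABEL, None)
    return labels


if __name__ == "__main__":
    target_labels = relabel_target({"__address__": "vmhost0.example.org:9103"})
    sample = {"__name__": "collectd_virt_cpu_total", "instance": "vm1"}
    print(relabel_metric(sample, target_labels))
```

Metrics whose names do not match the pattern pass through without the extra label, and no metric keeps the temporary label.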