The [postgres exporter][0] exposes metrics about the operation and
performance of a PostgreSQL server. It's currently deployed on
_db0.pyrocufflink.blue_, the primary server of the main PostgreSQL
cluster.
[0]: https://github.com/prometheus-community/postgres_exporter
Home Assistant uses PostgreSQL for recording the history of entity
states. Since we had been using the in-cluster database server for
this, the data were migrated to the new external PostgreSQL server
automatically when the backup from the former was restored on the
latter. It follows, then, that we can point Home Assistant to the
new server as well.
Home Assistant uses SQLAlchemy, which in turn uses _libpq_ via
_psycopg_, as a client for PostgreSQL. It doesn't expose any
configuration parameters beyond the "database URL" directly, but we
can use the standard environment variables to specify the certificate
and private key for authentication. In fact, the empty `postgresql://`
URL is sufficient, and indicates that _all_ of the connection parameters
should be taken from environment variables. As a result, the
`wait-for-db` init container and the main container take exactly the
same environment variables, so we can use YAML anchors to share their
definitions.
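A minimal sketch of what that might look like in the pod spec; the
image, database name, and mount paths are illustrative placeholders,
while the `PG*` variables are the standard _libpq_ ones:

```yaml
initContainers:
  - name: wait-for-db
    image: registry.example.org/wait-for-db:latest   # placeholder image
    env: &pg-env
      - name: PGHOST
        value: db0.pyrocufflink.blue
      - name: PGDATABASE
        value: homeassistant          # illustrative database name
      - name: PGSSLMODE
        value: verify-full
      - name: PGSSLCERT
        value: /certs/tls.crt
      - name: PGSSLKEY
        value: /certs/tls.key
      - name: PGSSLROOTCERT
        value: /certs/ca.crt
containers:
  - name: home-assistant
    image: ghcr.io/home-assistant/home-assistant:stable
    # the main container reuses exactly the same variables via the anchor
    env: *pg-env
```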
Since the new database server outside the Kubernetes cluster, created
for Authelia, was seeded from a backup of the in-cluster server, it
already contained the data from Firefly-III as well. Thus, we can
switch Firefly-III to using it, too.
The documentation for Firefly-III does not mention anything about how
to configure it to use certificate-based authentication for PostgreSQL,
as is required by the new server. Fortunately, it ultimately uses
_libpq_, so the standard `PG...` environment variables work fine. We
just need a certificate issued by the _postgresql-ca_ ClusterIssuer and
the _DCH Root CA_ certificate mounted in the Firefly-III container.
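Roughly, the relevant pieces of the container spec might look like
this; the secret/ConfigMap names and mount paths are assumptions, not
the actual manifest:

```yaml
containers:
  - name: firefly-iii
    env:
      - name: PGSSLMODE
        value: verify-full
      - name: PGSSLCERT
        value: /var/run/postgresql-tls/tls.crt
      - name: PGSSLKEY
        value: /var/run/postgresql-tls/tls.key
      - name: PGSSLROOTCERT
        value: /etc/pki/dch-root-ca/ca.crt
    volumeMounts:
      - name: postgresql-tls
        mountPath: /var/run/postgresql-tls
        readOnly: true
      - name: dch-root-ca
        mountPath: /etc/pki/dch-root-ca
        readOnly: true
volumes:
  - name: postgresql-tls
    secret:
      # issued by the postgresql-ca ClusterIssuer; note that libpq refuses
      # keys that are group- or world-readable, so the file mode may need
      # to be adjusted
      secretName: firefly-iii-postgresql-tls
  - name: dch-root-ca
    configMap:
      name: dch-root-ca
```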
If there is an issue with the in-cluster database server, accessing the
Kubernetes API becomes impossible by normal means. This is because the
Kubernetes API uses Authelia for authentication and authorization, and
Authelia relies on the in-cluster database server. To solve this
chicken-and-egg scenario, I've set up a dedicated PostgreSQL database
server on a virtual machine, totally external to the Kubernetes cluster.
With this commit, I have changed the Authelia configuration to point at
this new database server. The contents of the new database server were
restored from a backup of the in-cluster server, so all of Authelia's
state was migrated automatically. Thus, updating the configuration is
all that is necessary to switch to using it.
The new server uses certificate-based authentication. In order for
Authelia to access it, it needs a certificate issued by the
_postgresql-ca_ ClusterIssuer, managed by _cert-manager_. Although the
environment variables for pointing to the certificate and private key
are not listed explicitly in the Authelia documentation, their names
can be inferred from the configuration document schema and work as
expected.
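For illustration, the client certificate might be requested with a
cert-manager Certificate like the one below; the resource and secret
names are placeholders, and the common name assumes the server maps the
certificate CN to the database role:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: authelia-postgresql
  namespace: authelia
spec:
  secretName: authelia-postgresql-tls
  commonName: authelia       # assumed to match the PostgreSQL role name
  usages:
    - client auth
  issuerRef:
    name: postgresql-ca
    kind: ClusterIssuer
```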
All the Kubernetes nodes (except *k8s-ctrl0*) are now running Fedora
CoreOS. We can therefore use the Kubernetes API to discover scrape
targets for the Zincati job.
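A sketch of the scrape job using Kubernetes node discovery; the port is
only a placeholder for wherever the Zincati metrics are actually
exposed on each host:

```yaml
scrape_configs:
  - job_name: zincati
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # point the target at the node's internal IP; 9101 is a placeholder
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        regex: "(.+)"
        target_label: __address__
        replacement: "${1}:9101"
      - source_labels: [__meta_kubernetes_node_name]
        target_label: instance
```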
I've created a _Pool Time_ calendar in Nextcloud that we can use to
mark when people are expected to be in the pool. Using this, we can
configure the "someone is in the pool" alert not to fire during times
when we know people will be in the pool. This will make it much less
annoying on HLC pool days.
One of the reasons for moving to 4 `vmstorage` replicas was to ensure
that the load was spread evenly between the physical VM host machines.
To ensure that is the case as much as possible, we need to keep one
pod per Kubernetes node.
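One way to express that constraint, sketched with an illustrative label
selector that would need to match the actual `vmstorage` pod labels:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: vmstorage            # illustrative label
        topologyKey: kubernetes.io/hostname
```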
Longhorn does not work well for very large volumes. It takes ages to
synchronize/rebuild them when migrating between nodes, which happens
all too frequently. This consumes a lot of resources, which impacts
the operation of the rest of the cluster, and can cause a cascading
failure in some circumstances.
Now that the cluster is set up to be able to mount storage directly from
the Synology, it makes sense to move the Victoria Metrics data there as
well. Similar to how I did this with Jenkins, I created
PersistentVolume resources that map to iSCSI volumes, and patched the
PersistentVolumeClaims (or rather the template for them defined by the
StatefulSet) to use these. Each `vmstorage` pod then gets an iSCSI
LUN, bypassing both Longhorn and QEMU to write directly to the NAS.
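An illustrative PersistentVolume for one `vmstorage` replica; the
portal address, IQN, LUN, and size are all placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vmstorage-db-0
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  iscsi:
    targetPortal: nas.example.net:3260   # placeholder for the NAS address
    iqn: iqn.2000-01.com.synology:nas.vmstorage-0
    lun: 1
    fsType: ext4
```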
The migration process was relatively straightforward. I started by
scaling down the `vminsert` Deployment so the `vmagent` pods would
queue the metrics they had collected while the storage layer was down.
Next, I created a [native][0] export of all the time series in the
database. Then, I deleted the `vmstorage` StatefulSet and its
associated PVCs. Finally, I applied the updated configuration,
including the new PVs and patched PVCs, and brought the `vminsert`
pods back online. Once everything was up and running, I re-imported
the exported data.
[0]: https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-export-data-in-native-format
Since all the nodes in the cluster run Fedora CoreOS now, we can
deploy collectd as a container, managed by a DaemonSet.
Note that while _collectd_ has to run as _root_ in order to collect
a lot of metrics, it should not run with all privileges. It does need
to run as a "super-privileged container" (`spc_t` SELinux domain), but
it does _not_ need most kernel capabilities.
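In the DaemonSet's container spec, that might translate to something
like the following; the capability added back is illustrative, since
only whatever collectd's plugins actually require should be granted:

```yaml
securityContext:
  runAsUser: 0
  seLinuxOptions:
    type: spc_t          # "super-privileged container" domain
  capabilities:
    drop:
      - ALL
    add:
      - SYS_PTRACE       # illustrative; add back only what is needed
```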
By default, Kubernetes waits for each pod in a StatefulSet to become
"ready" before starting the next one. If there is a problem starting
that pod, e.g. data corruption, then the others will never start. This
sort of defeats the purpose of having multiple replicas. Fortunately,
we can configure the pod management policy to start all the pods at
once, regardless of the status of any individual pod. This way, if
there is a problem with the first pod, the others will still come up
and serve whatever data they have.
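The relevant StatefulSet setting is a one-liner:

```yaml
apiVersion: apps/v1
kind: StatefulSet
spec:
  # the default, OrderedReady, waits for each pod to become ready
  # before creating the next one
  podManagementPolicy: Parallel
```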
The [restic-exporter][0] exposes metrics about Restic snapshots as
Prometheus metrics. This allows us to get similar data as we have for
BURP backups. Chiefly important among the metrics are last backup time
and size, which we can use to determine if backups are working
correctly.
[0]: https://github.com/ngosang/restic-exporter
The digital photo frame in the kitchen is powered by a small server
application that exposes a minimal HTTP API. Using this API, we can
e.g. advance
or backtrack the displayed photo. Exposing `rest_command` services
for these operations allows us to add buttons to dashboards to control
the frame.
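A hypothetical Home Assistant configuration snippet; the host name and
endpoint paths are placeholders for whatever the photo frame server
actually exposes:

```yaml
rest_command:
  photoframe_next:
    url: "http://photoframe.example.net:8080/api/next"
    method: post
  photoframe_back:
    url: "http://photoframe.example.net:8080/api/back"
    method: post
```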
We don't need to specify every single host individually.
Domain controllers, for example, are registered in DNS with SRV records.
Kubernetes nodes, of course, can be discovered using the Kubernetes API.
Both of these classes of nodes change frequently, so discovering them
dynamically is convenient.
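For the domain controllers, a DNS service discovery stanza along these
lines would do; the job name and the SRV record name (assuming the AD
domain is *pyrocufflink.blue*) are illustrative:

```yaml
scrape_configs:
  - job_name: domain-controllers
    dns_sd_configs:
      - names:
          - _ldap._tcp.pyrocufflink.blue   # assumed AD domain SRV record
        type: SRV
```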
Instead of routing iSCSI traffic from the Kubernetes network, through
the firewall, to the storage network, nodes now have a second network
adapter connected directly to the storage network. The nodes with
such an adapter are labelled `network.du5t1n.me/storage`, so we can pin
the Jenkins PersistentVolume to them via a node affinity rule.
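A sketch of the node affinity rule on the PersistentVolume; `Exists` is
used here since the label value (if any) isn't assumed:

```yaml
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: network.du5t1n.me/storage
              operator: Exists
```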
Using a volume claim template to define the persistent volume claim for
the Redis pod has two advantages: first, it enables using clustered
Redis, should that become necessary, and second, it makes
deleting and recreating the volume easier in the case of data
corruption. Simply scale down the StatefulSet to 0, delete the PVC, and
scale the StatefulSet back up.
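A minimal sketch of the claim template; the volume name, size, and
storage class are illustrative:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  # serviceName, selector, and pod template omitted for brevity
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
```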
By default, step-ca issues certificates that are valid for only one day.
This means that clients need to have multiple renew attempts scheduled
throughout the day; otherwise, missing one could mean having their
certificates expire. This is unnecessary, and not even possible in all
cases, so let's make the default validity period longer and avoid the
issue.
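In step-ca's `ca.json`, the certificate lifetimes are controlled by the
authority claims; something like the following, where the durations
shown are only an example and not necessarily the values chosen:

```json
{
  "authority": {
    "claims": {
      "minTLSCertDuration": "5m",
      "defaultTLSCertDuration": "720h",
      "maxTLSCertDuration": "720h"
    }
  }
}
```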
Since I added an IPv6 ULA prefix to the "main" VLAN (to allow
communicating with the Synology directly), the domain controllers now
have AAAA records. This causes the `sambadc` scrape job to fail because
Blackbox Exporter prefers IPv6 by default, but Kubernetes pods do not
have IPv6 addresses.
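The fix is to make the relevant Blackbox Exporter module prefer IPv4; a
sketch, assuming an ICMP probe module (the actual module name and
prober may differ):

```yaml
modules:
  icmp:                      # illustrative module name
    prober: icmp
    icmp:
      preferred_ip_protocol: ip4
      ip_protocol_fallback: true
```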
Managing the Jenkins volume with Longhorn has become increasingly
problematic. Because of its large size, whenever Longhorn needs to
rebuild/replicate it (which happens often for no apparent reason), it
can take several hours. While the synchronization is happening, the
entire cluster suffers from degraded performance.
Instead of using Longhorn, I've decided to try storing the data directly
on the Synology NAS and expose it to Kubernetes via iSCSI. The Synology
offers many of the same features as Longhorn, including
snapshots/rollbacks and backups. Using the NAS allows the volume to be
available to any Kubernetes node, without keeping multiple copies of
the data.
In order to expose the iSCSI service on the NAS to the Kubernetes nodes,
I had to make the storage VLAN routable. I kept it as IPv6-only,
though, as an extra precaution against unauthorized access. The
firewall only allows nodes on the Kubernetes network to access the NAS
via iSCSI.
I originally tried proxying the iSCSI connection via the VM hosts;
however, this failed because of how iSCSI target discovery works. The
provided "target host" is really only used to identify available LUNs;
follow-up communication is done with the IP address returned by the
discovery process. Since the NAS would return its IP address, which
differed from the proxy address, the connection would fail. Thus, I
resorted to reconfiguring the storage network and connecting directly
to the NAS.
To migrate the contents of the volume, I temporarily created a PVC with
a different name and bound it to the iSCSI PersistentVolume. Using a
pod with both the original PVC and the new PVC mounted, I used `rsync`
to copy the data. Once the copy completed, I deleted the Pod and both
PVCs, then created a new PVC with the original name (i.e. `jenkins`),
bound to the iSCSI PV. While doing this, Longhorn, for some reason,
kept re-creating the PVC whenever I deleted it, no matter how I
requested the deletion: whether I deleted the PV, the PVC, or the
Longhorn Volume, using either the Kubernetes API or the Longhorn UI,
the PVC would reappear almost immediately. Fortunately, there was just
enough of a delay between deleting it and Longhorn recreating it that I
was able to create the new PVC manually. Once I did that, Longhorn gave
up.
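The one-off copy pod from the first step might have looked roughly like
this; the image, the temporary PVC name (`jenkins-iscsi`), and the
rsync options are approximations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jenkins-copy
spec:
  restartPolicy: Never
  containers:
    - name: rsync
      image: registry.fedoraproject.org/fedora:latest
      command:
        - bash
        - -c
        # install rsync in the throwaway container, then copy preserving
        # hard links, ACLs, and extended attributes
        - dnf -y install rsync && rsync -aHAX --info=progress2 /old/ /new/
      volumeMounts:
        - name: old
          mountPath: /old
        - name: new
          mountPath: /new
  volumes:
    - name: old
      persistentVolumeClaim:
        claimName: jenkins            # the original Longhorn-backed PVC
    - name: new
      persistentVolumeClaim:
        claimName: jenkins-iscsi      # temporary PVC bound to the iSCSI PV
```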
Kitchen v0.5 brings a few changes that affect the deployment:
* The Bored Board is now backed by MQTT
* The pool temperature is now displayed in the weather pane
* The container image is now based on Fedora and includes its own time
zone database and root CA bundle
* The websocket server prevents the process from stopping correctly
unless the graceful shutdown feature of `uvicorn` is disabled
[fleetlock] is an implementation of the Zincati FleetLock reboot
coordination protocol. It only works for machines that are Kubernetes
nodes, but it does enable safe rolling updates for those machines.
Specifically, when a node acquires a lock (backed by a Kubernetes
Lease), fleetlock cordons that node and evicts pods from it. After the
node has rebooted into the new version of Fedora CoreOS, fleetlock
uncordons the node and releases the lock.
[fleetlock]: https://github.com/poseidon/fleetlock
Vaultwarden has started prompting for the master password occasionally
when syncing the vault. Thus, we need to make sure it is available in
the _sync_ container, by mounting the secret and providing the
`PINENTRY_PASSWORD_FILE` environment variable.
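A sketch of the relevant part of the _sync_ container spec; the secret
name and mount path are illustrative:

```yaml
containers:
  - name: sync
    env:
      - name: PINENTRY_PASSWORD_FILE
        value: /run/secrets/vaultwarden/master-password
    volumeMounts:
      - name: master-password
        mountPath: /run/secrets/vaultwarden
        readOnly: true
volumes:
  - name: master-password
    secret:
      secretName: vaultwarden-master-password   # illustrative secret name
```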
Just having the alert name and group name in the ntfy notification is
not enough to really indicate what the problem is, as some alerts can
generate notifications for many reasons. In the email notifications
Alertmanager sends by default, the values (but not the keys) of all
labels are included in the subject, so we will reproduce that here.
I don't like having alerts sent by e-mail. Since I don't get e-mail
notifications on my watch, I often do not see alerts for quite some
time. They are also much harder to read in an e-mail client (the
Fastmail web client and K-9 Mail both display them poorly). I would
much rather have
them delivered via _ntfy_, just like all the rest of the ephemeral
notifications I receive.
Fortunately, it is easy enough to integrate Alertmanager and _ntfy_
using the webhook notifier in Alertmanager. Since _ntfy_ does not
natively support the Alertmanager webhook API, though, a bridge is
necessary to translate from one data format to the other. There are a
few options for this bridge, but I chose
[alexbakker/alertmanager-ntfy][0] because it looked the most complete
while also having the simplest configuration format. Sadly, it does not
expose any Prometheus metrics itself, and since it's deployed in the
_victoria-metrics_ namespace, it needs to be explicitly excluded from
the VMAgent scrape configuration.
[0]: https://github.com/alexbakker/alertmanager-ntfy
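On the Alertmanager side, the hookup is a plain webhook receiver; the
Service name, port, and path below are placeholders for wherever the
bridge actually listens:

```yaml
route:
  receiver: ntfy
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://alertmanager-ntfy.victoria-metrics.svc:8000/alert
        send_resolved: true
```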
Although most libraries support Ed25519 signatures for X.509
certificates, Firefox does not. This means that any certificate signed
by DCH CA R3 cannot be verified by the browser and thus will always
present a certificate error.
I want to migrate internal services that do not need certificates
that are trusted by default (i.e. they are only accessed programmatically
or only I use them in the browser) back to using an internal CA instead
of the public *pyrocufflink.net* wildcard certificate. For applications
like Frigate and UniFi Network, the certificates still need to be signed
by a CA that the browser will trust, so an Ed25519 certificate is
inappropriate. Thus, I've decided to migrate back to DCH CA R2, which
does not use an Ed25519 signature and can therefore be trusted by
Firefox, etc.
The *hlcforms* application handles form submissions for the Hatch
Learning Center website. It has various features for Tabitha that are
only accessible internally, but the form submission handler itself of
course needs to be accessible anonymously.