kubernetes

Author	SHA1	Message	Date
Dustin C. Hatch	b8015c0bed	v-m: blackbox: Force TCP probe to IPv4 Since I added an IPv6 ULA prefix to the "main" VLAN (to allow communicating with the Synology directly), the domain controllers now have AAAA records. This causes the `sambadc` scrape job to fail because Blackbox Exporter prefers IPv6 by default, but Kubernetes pods do not have IPv6 addresses.	2024-06-26 18:29:49 -05:00
Dustin C. Hatch	7f3287297b	jenkins: Migrate to iSCSI persistent volume Managing the Jenkins volume with Longhorn has become increasingly problematic. Because of its large size, whenever Longhorn needs to rebuild/replicate it (which happens often for no apparent reason), it can take several hours. While the synchronization is happening, the entire cluster suffers from degraded performance. Instead of using Longhorn, I've decided to try storing the data directly on the Synology NAS and expose it to Kubernetes via iSCSI. The Synology offers many of the same features as Longhorn, including snapshots/rollbacks and backups. Using the NAS allows the volume to be available to any Kubernetes node, without keeping multiple copies of the data. In order to expose the iSCSI service on the NAS to the Kubernetes nodes, I had to make the storage VLAN routable. I kept it as IPv6-only, though, as an extra precaution against unauthorized access. The firewall only allows nodes on the Kubernetes network to access the NAS via iSCSI. I originally tried proxying the iSCSI connection via the VM hosts, however, this failed because of how iSCSI target discovery works. The provided "target host" is really only used to identify available LUNs; follow-up communication is done with the IP address returned by the discovery process. Since the NAS would return its IP address, which differed from the proxy address, the connection would fail. Thus, I resorted to reconfiguring the storage network and connecting directly to the NAS. To migrate the contents of the volume, I temporarily created a PVC with a different name and bound it to the iSCSI PersistentVolume. Using a pod with both the original PVC and the new PVC mounted, I used `rsync` to copy the data. Once the copy completed, I deleted the Pod and both PVCs, then created a new PVC with the original name (i.e. `jenkins`), bound to the iSCSI PV. 
While doing this, Longhorn, for some reason, kept re-creating the PVC whenever I would delete it, no matter how I requested the deletion. Deleting the PV, the PVC, or the Volume, using either the Kubernetes API or the Longhorn UI, they would all get recreated almost immediately. Fortunately, there was actually enough of a delay after deleting it before Longhorn would recreate it that I was able to create the new PVC manually. Once I did that, Longhorn seemed to give up.	2024-06-23 09:53:15 -05:00
Dustin C. Hatch	c3c9c0c555	kitchen: Run as non-root user The kitchen server service does not need to run as root or have any access to writable storage.	2024-06-06 11:03:42 -05:00
Dustin C. Hatch	b4d6dfeb07	kitchen: Re-enable graceful shutdown timeout Version 0.5.1 fixes the issue with `uvicorn` hanging on shutdown because of the WebSocket message queue.	2024-06-06 10:09:37 -05:00
Dustin C. Hatch	7b8b11111e	kitchen: Updates for v0.5 Kitchen v0.5 includes a few changes that affect the deployment: * The Bored Board is now backed by MQTT * The pool temperature is now displayed in the weather pane * The container image is now based on Fedora and includes its own time zone database and root CA bundle * The websocket server prevents the process from stopping correctly unless the graceful shutdown feature of `uvicorn` is disabled	2024-06-05 22:04:55 -05:00
Dustin C. Hatch	48f20eac07	v-m: Scrape metrics from fleetlock	2024-05-31 15:18:55 -05:00
Dustin C. Hatch	fc66058251	fleetlock: Deploy Zincati fleet lock manager [fleetlock] is an implementation of the Zincati FleetLock reboot coordination protocol. It only works for machines that are Kubernetes nodes, but it does enable safe rolling updates for those machines. Specifically, when a node acquires a lock (backed by a Kubernetes Lease), it cordons that node and evicts pods from it. After the node has rebooted into the new version of Fedora CoreOS, it uncordons the node and releases the lock. [fleetlock]: https://github.com/poseidon/fleetlock	2024-05-31 15:18:01 -05:00
Dustin C. Hatch	365334cea7	xactfetch: Provide Vaultwarden password for sync Vaultwarden has started prompting for the master password occasionally when syncing the vault. Thus, we need to make sure it is available in the _sync_ container, by mounting the secret and providing the `PINENTRY_PASSWORD_FILE` environment variable.	2024-05-29 09:36:30 -05:00
Dustin C. Hatch	8939c1d02c	v-m/scrape: Scrape unifi2.p.b unifi2.pyrocufflink.blue is a Fedora CoreOS host, so it runs collectd, Promtail, and Zincati.	2024-05-26 11:48:59 -05:00
Dustin C. Hatch	61bfd8ff1a	keyserv: Add age keys for unifi2 This key encrypts the password for unifi_exporter to connect to Unifi Network.	2024-05-26 11:48:12 -05:00
Dustin C. Hatch	3b74c3d508	v-m: Scrape metrics from Paperless-ngx Flower	2024-05-22 15:51:07 -05:00
Dustin C. Hatch	f83783fd58	paperless-ngx: Enable Flower Flower is the monitoring agent for Celery. It has a web UI, but more importantly, it exposes Celery performance metrics in Prometheus format.	2024-05-22 15:50:32 -05:00
Dustin C. Hatch	d5bfdaca25	v-m/alertmanager-ntfy: Add labels to notifications Just having the alert name and group name in the ntfy notification is not enough to really indicate what the problem is, as some alerts can generate notifications for many reasons. In the email notifications AlertManager sends by default, the values (but not the keys) of all labels are included in the subject, so we will reproduce that here.	2024-05-22 15:20:27 -05:00
Dustin C. Hatch	aedd4df9f6	sshca: Add machine ID for Toad	2024-05-22 15:20:09 -05:00
Dustin C. Hatch	d74e26d527	victoria-metrics: Send alerts via ntfy I don't like having alerts sent by e-mail. Since I don't get e-mail notifications on my watch, I often do not see alerts for quite some time. They are also much harder to read in an e-mail client (Fastmail web and K-9 Mail both display them poorly). I would much rather have them delivered via _ntfy_, just like all the rest of the ephemeral notifications I receive. Fortunately, it is easy enough to integrate Alertmanager and _ntfy_ using the webhook notifier in Alertmanager. Since _ntfy_ does not natively support the Alertmanager webhook API, though, a bridge is necessary to translate from one data format to the other. There are a few options for this bridge, but I chose [alexbakker/alertmanager-ntfy][0] because it looked the most complete while also having the simplest configuration format. Sadly, it does not expose any Prometheus metrics itself, and since it's deployed in the _victoria-metrics_ namespace, it needs to be explicitly excluded from the VMAgent scrape configuration. [0]: https://github.com/alexbakker/alertmanager-ntfy	2024-05-10 10:32:52 -05:00
Dustin C. Hatch	a4591950ba	home-assistant: Add time-to-go timer to watch view This way I can start the "time to go" timer from my watch as soon as Brandon says he's leaving work.	2024-05-10 09:24:34 -05:00
Dustin C. Hatch	ab916640cb	home-assistant: Re-enable 17track sensor	2024-05-10 09:24:02 -05:00
Dustin C. Hatch	7618bdcae6	firefly-iii: Replace importer access token The access token the Firefly III Importer service uses to communicate with Firefly III expired and needs to be replaced.	2024-05-10 09:23:04 -05:00
Dustin C. Hatch	ebea31fe55	v-m: alerts: Add alert for camera offline	2024-04-23 09:42:04 -05:00
Dustin C. Hatch	c2417b7960	authelia: Fix Jenkins OIDC client Authelia 4.38 introduced a change that broke logging in to Jenkins with OIDC. This setting is required to fix it.	2024-04-10 21:26:00 -05:00
Dustin C. Hatch	1581a620ef	v-m/scrape: Scrape nvr2.p.b nvr2.pyrocufflink.blue has replaced nvr1.pyrocufflink.blue as the Frigate/recording server.	2024-04-10 21:25:26 -05:00
Dustin C. Hatch	c2b595d3e2	keyserv: Add age key for nvr2/NUT monitor	2024-04-06 10:06:30 -05:00
Dustin C. Hatch	31b0b081a3	keyserv: Add key for Frigate/nvr2	2024-04-05 14:12:08 -05:00
Dustin C. Hatch	3ba83373f3	step-ca: Re-deploy (again) with DCH CA R2 Although most libraries support ED25519 signatures for X.509 certificates, Firefox does not. This means that any certificate signed by DCH CA R3 cannot be verified by the browser and thus will always present a certificate error. I want to migrate internal services that do not need certificates that are trusted by default (i.e. they are only accessed programmatically or only I use them in the browser) back to using an internal CA instead of the public pyrocufflink.net wildcard certificate. For applications like Frigate and UniFi Network, these need to be signed by a CA that the browser will trust, so the ED25519 certificate is inappropriate. Thus, I've decided to migrate back to DCH CA R2, which uses an ECDSA signature, and can therefore be trusted by Firefox, etc.	2024-04-05 13:03:34 -05:00
Dustin C. Hatch	5c34fdb1c6	sshca: Add Machine UUID for nvr2.p.b	2024-04-05 12:26:51 -05:00
Dustin C. Hatch	680709e670	authelia: Add auth rule for HLC forms submit The hlcforms application handles form submissions for the Hatch Learning Center website. It has various features for Tabitha that are only accessible internally, but the form submission handler itself of course needs to be accessible anonymously.	2024-03-25 08:43:55 -05:00
Dustin C. Hatch	c7223ff4fd	authelia: Enable dark theme A recent version of Authelia added a dark theme. Setting the `theme` option to `auto` enables it when the user agent has the "prefers dark mode" hint enabled.	2024-02-27 06:51:14 -06:00
Dustin C. Hatch	de72776e73	v-m: Scrape metrics from Authelia Authelia exposes Prometheus metrics from a different server socket, which is not enabled by default.	2024-02-27 06:41:52 -06:00
Dustin C. Hatch	e0b2b3f5ae	v-m: Scrape metrics from Patroni Patroni, a component of the postgres operator, exports metrics about the PostgreSQL database servers it manages. Notably, it provides information about the current transaction log location for each server. This allows us to monitor and alert on the health of database replicas.	2024-02-24 08:33:52 -06:00
Dustin C. Hatch	2442835edd	autoscaler: Add SealedSecret for AWS key	2024-02-22 09:59:16 -06:00
Dustin C. Hatch	83eeb46c93	v-m: Scrape Argo CD Argo CD exposes metrics about itself and the applications it manages. Notably, this can be useful for monitoring application health.	2024-02-22 07:10:01 -06:00
Dustin C. Hatch	465f121e61	v-m: Scrape Promtail The promtail job scrapes metrics from all the hosts running Promtail. The static targets are Fedora CoreOS nodes that are not part of the Kubernetes cluster. The relabeling rules ensure that both the static targets and the targets discovered via the Kubernetes Node API use the FQDN of the host as the value of the instance label.	2024-02-22 07:10:01 -06:00
Dustin C. Hatch	815eefdcf9	promtail: Deploy as DaemonSet Running Promtail in a pod controlled by a DaemonSet allows it to access the Kubernetes API via a ServiceAccount token. Since it needs the API in order to discover the Pods running on the current node in order to find their log files, this makes the authentication process a lot simpler.	2024-02-22 07:10:01 -06:00
Dustin C. Hatch	5e4ab1d988	v-m: Update Loki scrape target Now that Loki uses Caddy as a reverse proxy, we need to update the scrape target to point to the correct port (443).	2024-02-22 07:10:01 -06:00
Dustin C. Hatch	f468977d91	grafana: Enable send_user_header option I discovered today that if anonymous Grafana users have Viewer permission, they can use the Datasource API to make arbitrary queries to any backend, even if they cannot access the Explore page directly. This is documented ([issue #48313][0]) as expected behavior. I don't really mind giving anonymous access to the Victoria Metrics datasource, but I definitely don't want anonymous users to be able to make Loki queries and view log data. Since Grafana Datasource Permissions is limited to Grafana Enterprise and not available in the open source version of Grafana, the official recommendation from upstream is to use a separate Organization for the Loki datasource. Unfortunately, this would preclude having dashboards that have graphs from both data sources. Although I don't have any of those right now, I like the idea and may build some eventually. Fortunately, I discovered the `send_user_header` Grafana configuration option. With this enabled, Grafana will send an `X-Grafana-User` header with the username of the user on whose behalf it is making a request to the backend. If the user is not logged in, it does not send the header. Thus, we can detect the presence of this header on the backend and refuse to serve query requests if it is missing. [0]: https://github.com/grafana/grafana/issues/48313	2024-02-22 07:10:01 -06:00
Dustin C. Hatch	35ff500812	grafana: Configure Loki datastore Usually, Grafana datastores are configured using its web GUI. When setting up a datastore that requires TLS client authentication, the client certificate and private key have to be pasted into the form. For certificates that renew frequently, this method would require a frequent manual effort. Fortunately, Grafana supports defining datastores via its "provisioning" mechanism, reading the configuration from YAML files on the filesystem.	2024-02-22 07:10:01 -06:00
Dustin C. Hatch	d4efb735bf	loki-ca: Add cert-manager issuer for Loki CA The Loki CA is used to issue client certificates for Grafana Loki. This _cert-manager_ ClusterIssuer will allow applications running in Kubernetes (e.g. Grafana) to request a Certificate that they can use to access the Loki HTTP API.	2024-02-22 07:10:01 -06:00
Dustin C. Hatch	d08cc6fb0f	step-ca: Redeploy with DCH CA R3 I never ended up using _Step CA_ for anything, since I was initially focused on the SSH CA feature and I was unhappy with how it worked (which led me to write _SSHCA_). I didn't think about it much until I was working on deploying Grafana Loki. For that project, I wanted to use a certificate signed by a private CA instead of the wildcard certificate for _pyrocufflink.blue_. So, I created DCH CA R3 for that purpose. Then, for some reason, I used the exact same procedure to fetch the certificate from Kubernetes as I had set up for the _pyrocufflink.blue_ wildcard certificate, as used by Frigate. This of course defeated the purpose, since I could have just as easily used the wildcard certificate in that case. When I discovered that Grafana Loki expects to be deployed behind a reverse proxy in order to implement access control, I took the opportunity to reevaluate the certificate issuance process. Since a reverse proxy is required to implement the access control I want (anyone can push logs but only authenticated users can query them), it made sense to choose one with native support for requesting certificates via ACME. This would eliminate the need for `fetchcert` and the corresponding Kubernetes API token. Thus, I ended up deciding to redeploy _Step CA_ with the new _DCH CA R3_ for this purpose.	2024-02-22 07:10:01 -06:00
Dustin C. Hatch	4c238a69aa	v-m: Scrape Grafana Loki Grafana Loki is hosted on a VM named loki0.pyrocufflink.blue. It runs Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape _collectd_ and _Zincati_ as well.	2024-02-21 09:16:26 -06:00
Dustin C. Hatch	1777262c15	dch-root-ca: Update to DCH Root CA R3 Since I shut down _step-ca_, nothing uses _DCH Root CA R2_ anymore. I've created a new CA using ED25519 key pairs, named _DCH Root CA R3_.	2024-02-21 09:16:26 -06:00
Dustin C. Hatch	1d2b5260bb	keyserv: Add age key for loki0 This key is used to encrypt the Kubernetes access token for `fetchcert`, which downloads the certificate for Grafana Loki HTTPS.	2024-02-21 09:16:26 -06:00
Dustin C. Hatch	96928a2611	kitchen: Fix weather metrics API URI Apparently, I never bothered to check that the Kitchen HUD server was actually fetching data from Victoria Metrics when I updated it before; I only verified that the Unauthorized errors in the `vmselect` log went away. They did, but only because now the Kitchen server was failing to contact `vmselect` at all.	2024-02-21 08:01:35 -06:00
Dustin C. Hatch	2acefd9a72	v-m: Add alert for sensor battery levels I did not realize the batteries on the garage door tilt sensors had died. Adding alerts for various sensor batteries should help keep me better informed.	2024-02-16 20:56:38 -06:00
Dustin C. Hatch	9784b90743	cert-manager: Remove unused secrets These secrets were used by previous issuers/solvers and are no longer needed.	2024-02-16 20:56:08 -06:00
Dustin C. Hatch	0ad63e0613	authelia: Allow anonymous access to AlertManager Sometimes, I want to be able to look at active alerts without logging in. This rule allows read-only access to the AlertManager UI and API. Unfortunately, the user experience when attempting to create a new Silence using the UI without first logging in is suboptimal, but I think that's worth the trade-off.	2024-02-16 20:41:47 -06:00
Dustin C. Hatch	2f6c358860	invoice-ninja: Update PVC for restored backup The Longhorn volume for the invoice-ninja PVC got into a strange state following an unexpected shutdown this morning. One of its replicas seemed to have disappeared, and it also thought that the size had changed. As such, it got stuck in "expanding" state, but it was not actually being expanded. This issue is described in detail in the Longhorn documentation: [Troubleshooting: Unexpected expansion leads to degradation or attach failure][0]. Unfortunately, there is no way to recover a volume from that state, and it must be deleted and recreated from backup. This changes some of the properties of the PVC, so they need to be updated in the manifest. [0]: https://longhorn.io/kb/troubleshooting-unexpected-expansion-leads-to-degradation-or-attach-failure/	2024-02-15 09:45:57 -06:00
Dustin C. Hatch	80df160ceb	device-plugins: Allow FUSE plugin on Jenkins nodes Jenkins jobs that build container images need access to `/dev/fuse`. Thus, we have to allow Pods managed by the fuse-device-plugin DaemonSet to be scheduled on nodes that are tainted for use exclusively by Jenkins jobs.	2024-02-13 07:56:35 -06:00
Dustin C. Hatch	33fa951c68	Merge remote-tracking branch 'refs/remotes/origin/master'	2024-02-03 09:52:39 -06:00
Dustin C. Hatch	a395d176bc	sshca: Set group principals for Server Admins Members of the Server Admins group need to be able to log in to machines using their respective privileged accounts for e.g. provisioning or emergencies.	2024-02-02 21:02:40 -06:00
Dustin C. Hatch	1f28a623ae	v-m: Do not scrape/alert on Graylog Graylog is down because Elasticsearch corrupted itself again, and this time, I'm just not going to bother fixing it. I practically never use it anymore anyway, and I want to migrate to Grafana Loki, so now seems like a good time to just get rid of it.	2024-02-01 21:45:43 -06:00


295 Commits