295 Commits (1a631bf366c3678039f41ee1101684b688ad8f20)

Author SHA1 Message Date
Dustin b8015c0bed v-m: blackbox: Force TCP probe to IPv4
Since I added an IPv6 ULA prefix to the "main" VLAN (to allow
communicating with the Synology directly), the domain controllers now
have AAAA records.  This causes the `sambadc` scrape job to fail because
Blackbox Exporter prefers IPv6 by default, but Kubernetes pods do not
have IPv6 addresses.
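
A minimal sketch of a Blackbox Exporter module that pins the TCP probe to
IPv4.  The module name `tcp_connect_ipv4` is hypothetical; the actual
module used by the `sambadc` job may be named differently.

```yaml
modules:
  # Hypothetical module name; referenced from the scrape job's
  # `module` parameter.
  tcp_connect_ipv4:
    prober: tcp
    timeout: 5s
    tcp:
      # Resolve and connect over IPv4 only, even if AAAA records exist
      preferred_ip_protocol: ip4
      ip_protocol_fallback: false
```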
2024-06-26 18:29:49 -05:00
Dustin 7f3287297b jenkins: Migrate to iSCSI persistent volume
Managing the Jenkins volume with Longhorn has become increasingly
problematic.  Because of its large size, whenever Longhorn needs to
rebuild/replicate it (which happens often for no apparent reason), it
can take several hours.  While the synchronization is happening, the
entire cluster suffers from degraded performance.

Instead of using Longhorn, I've decided to try storing the data directly
on the Synology NAS and expose it to Kubernetes via iSCSI.  The Synology
offers many of the same features as Longhorn, including
snapshots/rollbacks and backups.  Using the NAS allows the volume to be
available to any Kubernetes node, without keeping multiple copies of
the data.
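
As a rough illustration, a statically provisioned iSCSI PersistentVolume
could look like the following.  The portal address, IQN, LUN number,
filesystem, and size are all placeholders, not the real values.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: jenkins
spec:
  capacity:
    storage: 100Gi                # placeholder size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""            # no dynamic provisioner involved
  iscsi:
    targetPortal: "[fd00::10]:3260"                 # placeholder NAS address
    iqn: iqn.2000-01.com.synology:nas.example       # placeholder IQN
    lun: 1                                          # placeholder LUN
    fsType: ext4                                    # assumed filesystem
```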

In order to expose the iSCSI service on the NAS to the Kubernetes nodes,
I had to make the storage VLAN routable.  I kept it as IPv6-only,
though, as an extra precaution against unauthorized access.  The
firewall only allows nodes on the Kubernetes network to access the NAS
via iSCSI.

I originally tried proxying the iSCSI connection via the VM hosts;
however, this failed because of how iSCSI target discovery works.  The
provided "target host" is really only used to identify available LUNs;
follow-up communication is done with the IP address returned by the
discovery process.  Since the NAS would return its IP address, which
differed from the proxy address, the connection would fail.  Thus, I
resorted to reconfiguring the storage network and connecting directly
to the NAS.

To migrate the contents of the volume, I temporarily created a PVC with
a different name and bound it to the iSCSI PersistentVolume.  Using a
pod with both the original PVC and the new PVC mounted, I used `rsync`
to copy the data.  Once the copy completed, I deleted the Pod and both
PVCs, then created a new PVC with the original name (i.e. `jenkins`),
bound to the iSCSI PV.  While doing this, Longhorn, for some reason,
kept re-creating the PVC whenever I would delete it, no matter how I
requested the deletion.  Whether I deleted the PV, the PVC, or the
Volume, using either the Kubernetes API or the Longhorn UI, it would get
recreated almost immediately.  Fortunately, there was actually enough of
a delay after deleting it before Longhorn would recreate it that I was
able to create the new PVC manually.  Once I did that, Longhorn seemed
to give up.
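
The copy step could be sketched as a one-off Pod like this.  The
temporary claim name `jenkins-iscsi`, the namespace, the image, and the
`rsync` flags are assumptions for illustration only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jenkins-migrate           # hypothetical name
  namespace: jenkins              # assumed namespace
spec:
  restartPolicy: Never
  containers:
    - name: rsync
      image: registry.example.com/rsync:latest   # any image that ships rsync
      # Preserve ownership, hard links, ACLs, and extended attributes
      command: ["rsync", "-aHAX", "--numeric-ids", "/old/", "/new/"]
      volumeMounts:
        - name: old
          mountPath: /old
        - name: new
          mountPath: /new
  volumes:
    - name: old
      persistentVolumeClaim:
        claimName: jenkins        # original Longhorn-backed PVC
    - name: new
      persistentVolumeClaim:
        claimName: jenkins-iscsi  # hypothetical temporary PVC bound to the iSCSI PV
```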
2024-06-23 09:53:15 -05:00
Dustin c3c9c0c555 kitchen: Run as non-root user
The *kitchen* server service does not need to run as root or have any
access to writable storage.
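
A sketch of the kind of security context this implies.  The UID/GID
values and the container name are placeholders.

```yaml
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000               # placeholder UID
    runAsGroup: 1000              # placeholder GID
  containers:
    - name: kitchen               # assumed container name
      securityContext:
        # No writable storage is needed, so lock down the root filesystem
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
```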
2024-06-06 11:03:42 -05:00
Dustin b4d6dfeb07 kitchen: Re-enable graceful shutdown timeout
Version 0.5.1 fixes the issue with `uvicorn` hanging on shutdown because
of the WebSocket message queue.
2024-06-06 10:09:37 -05:00
Dustin 7b8b11111e kitchen: Updates for v0.5
Kitchen v0.5 has a few changes that affect the deployment:

* The Bored Board is now backed by MQTT
* The pool temperature is now displayed in the weather pane
* The container image is now based on Fedora and includes its own time
  zone database and root CA bundle
* The websocket server prevents the process from stopping correctly
  unless the graceful shutdown feature of `uvicorn` is disabled
2024-06-05 22:04:55 -05:00
Dustin 48f20eac07 v-m: Scrape metrics from fleetlock 2024-05-31 15:18:55 -05:00
Dustin fc66058251 fleetlock: Deploy Zincati fleet lock manager
[fleetlock] is an implementation of the Zincati FleetLock reboot
coordination protocol.  It only works for machines that are Kubernetes
nodes, but it does enable safe rolling updates for those machines.
Specifically, when a node acquires a lock (backed by a Kubernetes
Lease), it cordons that node and evicts pods from it.  After the node
has rebooted into the new version of Fedora CoreOS, it uncordons the
node and releases the lock.

[fleetlock]: https://github.com/poseidon/fleetlock
2024-05-31 15:18:01 -05:00
Dustin 365334cea7 xactfetch: Provide Vaultwarden password for sync
Vaultwarden has started prompting for the master password occasionally
when syncing the vault.  Thus, we need to make sure it is available in
the _sync_ container, by mounting the secret and providing the
`PINENTRY_PASSWORD_FILE` environment variable.
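
Roughly, the Pod spec change looks like the fragment below.  The Secret
name, key, and mount path are hypothetical; only the
`PINENTRY_PASSWORD_FILE` variable comes from the description above.

```yaml
containers:
  - name: sync
    env:
      - name: PINENTRY_PASSWORD_FILE
        value: /run/secrets/vaultwarden/password   # hypothetical path
    volumeMounts:
      - name: vaultwarden-password
        mountPath: /run/secrets/vaultwarden
        readOnly: true
volumes:
  - name: vaultwarden-password
    secret:
      secretName: vaultwarden-password             # hypothetical Secret name
```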
2024-05-29 09:36:30 -05:00
Dustin 8939c1d02c v-m/scrape: Scrape unifi2.p.b
*unifi2.pyrocufflink.blue* is a Fedora CoreOS host, so it runs
*collectd*, *Promtail*, and *Zincati*.
2024-05-26 11:48:59 -05:00
Dustin 61bfd8ff1a keyserv: Add age keys for unifi2
This key encrypts the password for *unifi_exporter* to connect to Unifi
Network.
2024-05-26 11:48:12 -05:00
Dustin 3b74c3d508 v-m: Scrape metrics from Paperless-ngx Flower 2024-05-22 15:51:07 -05:00
Dustin f83783fd58 paperless-ngx: Enable Flower
Flower is the monitoring agent for Celery.  It has a web UI, but more
importantly, it exposes Celery performance metrics in Prometheus format.
2024-05-22 15:50:32 -05:00
Dustin d5bfdaca25 v-m/alertmanager-ntfy: Add labels to notifications
Just having the alert name and group name in the ntfy notification is
not enough to really indicate what the problem is, as some alerts can
generate notifications for many reasons.  In the email notifications
AlertManager sends by default, the values (but not the keys) of all
labels are included in the subject, so we will reproduce that here.
2024-05-22 15:20:27 -05:00
Dustin aedd4df9f6 sshca: Add machine ID for Toad 2024-05-22 15:20:09 -05:00
Dustin d74e26d527 victoria-metrics: Send alerts via ntfy
I don't like having alerts sent by e-mail.  Since I don't get e-mail
notifications on my watch, I often do not see alerts for quite some
time.  They are also much harder to read in an e-mail client (Fastmail
web and K-9 Mail both display them poorly).  I would much rather have
them delivered via _ntfy_, just like all the rest of the ephemeral
notifications I receive.

Fortunately, it is easy enough to integrate Alertmanager and _ntfy_
using the webhook notifier in Alertmanager.  Since _ntfy_ does not
natively support the Alertmanager webhook API, though, a bridge is
necessary to translate from one data format to the other.  There are a
few options for this bridge, but I chose
[alexbakker/alertmanager-ntfy][0] because it looked the most complete
while also having the simplest configuration format.  Sadly, it does not
expose any Prometheus metrics itself, and since it's deployed in the
_victoria-metrics_ namespace, it needs to be explicitly excluded from
the VMAgent scrape configuration.
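
On the Alertmanager side, the integration amounts to a webhook receiver
pointing at the bridge.  This is only a sketch: the Service hostname,
port, and `/hook` path are assumptions and depend on how the bridge is
deployed and configured.

```yaml
route:
  receiver: ntfy
receivers:
  - name: ntfy
    webhook_configs:
      # Assumed in-cluster URL of the alertmanager-ntfy bridge
      - url: http://alertmanager-ntfy.victoria-metrics.svc:8000/hook
        send_resolved: true
```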

[0]: https://github.com/alexbakker/alertmanager-ntfy
2024-05-10 10:32:52 -05:00
Dustin a4591950ba home-assistant: Add time-to-go timer to watch view
This way I can start the "time to go" timer from my watch as soon as
Brandon says he's leaving work.
2024-05-10 09:24:34 -05:00
Dustin ab916640cb home-assistant: Re-enable 17track sensor 2024-05-10 09:24:02 -05:00
Dustin 7618bdcae6 firefly-iii: Replace importer access token
The access token the Firefly III Importer service uses to communicate
with Firefly III expired and needs to be replaced.
2024-05-10 09:23:04 -05:00
Dustin ebea31fe55 v-m: alerts: Add alert for camera offline 2024-04-23 09:42:04 -05:00
Dustin c2417b7960 authelia: Fix Jenkins OIDC client
Authelia 4.38 introduced a change that broke logging in to Jenkins with
OIDC.  This setting is required to fix it.
2024-04-10 21:26:00 -05:00
Dustin 1581a620ef v-m/scrape: Scrape nvr2.p.b
*nvr2.pyrocufflink.blue* has replaced *nvr1.pyrocufflink.blue* as the
Frigate/recording server.
2024-04-10 21:25:26 -05:00
Dustin c2b595d3e2 keyserv: Add age key for nvr2/NUT monitor 2024-04-06 10:06:30 -05:00
Dustin 31b0b081a3 keyserv: Add key for Frigate/nvr2 2024-04-05 14:12:08 -05:00
Dustin 3ba83373f3 step-ca: Re-deploy (again) with DCH CA R2
Although most libraries support ED25519 signatures for X.509
certificates, Firefox does not.  This means that any certificate signed
by DCH CA R3 cannot be verified by the browser and thus will always
present a certificate error.

I want to migrate internal services that do not need certificates
that are trusted by default (i.e. they are only accessed programmatically
or only I use them in the browser) back to using an internal CA instead
of the public *pyrocufflink.net* wildcard certificate.  For applications
like Frigate and UniFi Network, these need to be signed by a CA that
the browser will trust, so the ED25519 certificate is inappropriate.
Thus, I've decided to migrate back to DCH CA R2, which does not use an
ED25519 signature, and can therefore be trusted by Firefox, etc.
2024-04-05 13:03:34 -05:00
Dustin 5c34fdb1c6 sshca: Add Machine UUID for nvr2.p.b 2024-04-05 12:26:51 -05:00
Dustin 680709e670 authelia: Add auth rule for HLC forms submit
The *hlcforms* application handles form submissions for the Hatch
Learning Center website.  It has various features for Tabitha that are
only accessible internally, but the form submission handler itself of
course needs to be accessible anonymously.
2024-03-25 08:43:55 -05:00
Dustin c7223ff4fd authelia: Enable dark theme
A recent version of *Authelia* added a dark theme.  Setting the `theme`
option to `auto` enables it when the user agent has the "prefers dark
mode" hint enabled.
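
In the Authelia configuration file this is a single top-level key:

```yaml
# Follow the browser's light/dark preference
theme: auto
```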
2024-02-27 06:51:14 -06:00
Dustin de72776e73 v-m: Scrape metrics from Authelia
Authelia exposes Prometheus metrics from a different server socket,
which is not enabled by default.
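
A sketch of the relevant Authelia settings, assuming the `telemetry`
section of the configuration file.  The listen address shown is an
assumption; the exact address format varies between Authelia versions.

```yaml
telemetry:
  metrics:
    enabled: true
    # Separate listener for Prometheus metrics (assumed address and port)
    address: tcp://0.0.0.0:9959
```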
2024-02-27 06:41:52 -06:00
Dustin e0b2b3f5ae v-m: Scrape metrics from Patroni
Patroni, a component of the *postgres operator*, exports metrics about
the PostgreSQL database servers it manages.  Notably, it provides
information about the current transaction log location for each server.
This allows us to monitor and alert on the health of database replicas.
2024-02-24 08:33:52 -06:00
Dustin 2442835edd autoscaler: Add SealedSecret for AWS key 2024-02-22 09:59:16 -06:00
Dustin 83eeb46c93 v-m: Scrape Argo CD
*Argo CD* exposes metrics about itself and the applications it manages.
Notably, this can be useful for monitoring application health.
2024-02-22 07:10:01 -06:00
Dustin 465f121e61 v-m: Scrape Promtail
The *promtail* job scrapes metrics from all the hosts running Promtail.
The static targets are Fedora CoreOS nodes that are not part of the
Kubernetes cluster.

The relabeling rules ensure that both the static targets and the
targets discovered via the Kubernetes Node API use the FQDN of the host
as the value of the *instance* label.
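
One way such a job could be written is sketched below.  It assumes that
Kubernetes node names are FQDNs and that Promtail listens on port 9080;
the static host name is a placeholder.

```yaml
scrape_configs:
  - job_name: promtail
    static_configs:
      - targets:
          - host1.example.com:9080          # placeholder non-cluster host
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # For discovered nodes, scrape Promtail on the node's FQDN
      # instead of the kubelet address (no-op for static targets)
      - source_labels: [__meta_kubernetes_node_name]
        regex: '(.+)'
        target_label: __address__
        replacement: '${1}:9080'
      # For all targets, use the host name without the port as `instance`
      - source_labels: [__address__]
        regex: '([^:]+)(?::\d+)?'
        target_label: instance
        replacement: '${1}'
```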
2024-02-22 07:10:01 -06:00
Dustin 815eefdcf9 promtail: Deploy as DaemonSet
Running Promtail in a pod controlled by a DaemonSet allows it to access
the Kubernetes API via a ServiceAccount token.  Since it needs the API
in order to discover the Pods running on the current node in order to
find their log files, this makes the authentication process a lot
simpler.
2024-02-22 07:10:01 -06:00
Dustin 5e4ab1d988 v-m: Update Loki scrape target
Now that Loki uses Caddy as a reverse proxy, we need to update the
scrape target to point to the correct port (443).
2024-02-22 07:10:01 -06:00
Dustin f468977d91 grafana: Enable send_user_header option
I discovered today that if anonymous Grafana users have Viewer
permission, they can use the Datasource API to make arbitrary queries
to any backend, even if they cannot access the Explore page directly.
This is documented ([issue #48313][0]) as expected behavior.

I don't really mind giving anonymous access to the Victoria Metrics
datasource, but I definitely don't want anonymous users to be able to
make Loki queries and view log data.  Since Grafana Datasource
Permissions is limited to Grafana Enterprise and not available in
the open source version of Grafana, the official recommendation from
upstream is to use a separate Organization for the Loki datasource.
Unfortunately, this would preclude having dashboards that have graphs
from both data sources.  Although I don't have any of those right now, I
like the idea and may build some eventually.

Fortunately, I discovered the `send_user_header` Grafana configuration
option.  With this enabled, Grafana will send an `X-Grafana-User` header
with the username of the user on whose behalf it is making a request to
the backend.  If the user is not logged in, it does not send the header.
Thus, we can detect the presence of this header on the backend and
refuse to serve query requests if it is missing.
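
The option lives in the `[dataproxy]` section of the Grafana
configuration.  In a Kubernetes deployment it can also be set through
Grafana's environment variable override, as in this sketch (the container
name is an assumption):

```yaml
containers:
  - name: grafana
    env:
      # Equivalent to `send_user_header = true` under [dataproxy]
      - name: GF_DATAPROXY_SEND_USER_HEADER
        value: "true"
```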

[0]: https://github.com/grafana/grafana/issues/48313
2024-02-22 07:10:01 -06:00
Dustin 35ff500812 grafana: Configure Loki datastore
Usually, Grafana datastores are configured using its web GUI.  When
setting up a datastore that requires TLS client authentication, the
client certificate and private key have to be pasted into the form.
For certificates that renew frequently, this method would require a
frequent manual effort.  Fortunately, Grafana supports defining
datastores via its "provisioning" mechanism, reading the configuration
from YAML files on the filesystem.
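
A provisioning file for this might look roughly like the following.  The
URL and file paths are placeholders, and reading the certificate material
with `$__file{}` assumes Grafana's file-based variable expansion is
available for provisioning files in the deployed version.

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: https://loki.example.com             # placeholder URL
    jsonData:
      tlsAuth: true
      tlsAuthWithCACert: true
    secureJsonData:
      # Placeholder paths; the files are re-read when Grafana starts,
      # so renewed certificates do not need to be pasted into the GUI
      tlsCACert: $__file{/run/secrets/loki/ca.crt}
      tlsClientCert: $__file{/run/secrets/loki/tls.crt}
      tlsClientKey: $__file{/run/secrets/loki/tls.key}
```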
2024-02-22 07:10:01 -06:00
Dustin d4efb735bf loki-ca: Add cert-manager issuer for Loki CA
The Loki CA is used to issue client certificates for Grafana Loki.  This
_cert-manager_ ClusterIssuer will allow applications running in
Kubernetes (e.g. Grafana) to request a Certificate that they can use to
access the Loki HTTP API.
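
A minimal sketch of a CA-backed ClusterIssuer and a client Certificate
requested from it.  All resource names, the namespace, and the common
name are hypothetical.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: loki-ca
spec:
  ca:
    # Secret holding the CA certificate and key (hypothetical name);
    # for a ClusterIssuer it lives in cert-manager's own namespace
    secretName: loki-ca
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: grafana-loki-client       # hypothetical
  namespace: grafana              # assumed namespace
spec:
  secretName: grafana-loki-client
  commonName: grafana
  usages:
    - client auth
  issuerRef:
    name: loki-ca
    kind: ClusterIssuer
```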
2024-02-22 07:10:01 -06:00
Dustin d08cc6fb0f step-ca: Redeploy with DCH CA R3
I never ended up using _Step CA_ for anything, since I was initially
focused on the SSH CA feature and I was unhappy with how it worked
(which led me to write _SSHCA_).  I didn't think about it much until I
was working on deploying Grafana Loki.  For that project, I wanted to
use a certificate signed by a private CA instead of the wildcard
certificate for _pyrocufflink.blue_.  So, I created *DCH CA R3* for that
purpose.  Then, for some reason, I used the exact same procedure to
fetch the certificate from Kubernetes as I had set up for the
_pyrocufflink.blue_ wildcard certificate, as used by Frigate.  This of
course defeated the purpose, since I could have just as easily used
the wildcard certificate in that case.

When I discovered that Grafana Loki expects to be deployed behind a
reverse proxy in order to implement access control, I took the
opportunity to reevaluate the certificate issuance process.  Since a
reverse proxy is required to implement the access control I want (anyone
can push logs but only authenticated users can query them), it made
sense to choose one with native support for requesting certificates via
ACME.  This would eliminate the need for `fetchcert` and the
corresponding Kubernetes API token.  Thus, I ended up deciding to
redeploy _Step CA_ with the new _DCH CA R3_ for this purpose.
2024-02-22 07:10:01 -06:00
Dustin 4c238a69aa v-m: Scrape Grafana Loki
Grafana Loki is hosted on a VM named *loki0.pyrocufflink.blue*.  It runs
Fedora CoreOS, so in addition to scraping Loki itself, we need to scrape
_collectd_ and _Zincati_ as well.
2024-02-21 09:16:26 -06:00
Dustin 1777262c15 dch-root-ca: Update to DCH Root CA R3
Since I shut down _step-ca_, nothing uses _DCH Root CA R2_ anymore.
I've created a new CA using ED25519 key pairs, named _DCH Root CA R3_.
2024-02-21 09:16:26 -06:00
Dustin 1d2b5260bb keyserv: Add age key for loki0
This key is used to encrypt the Kubernetes access token for `fetchcert`,
which downloads the certificate for Grafana Loki HTTPS.
2024-02-21 09:16:26 -06:00
Dustin 96928a2611 kitchen: Fix weather metrics API URI
Apparently, I never bothered to check that the Kitchen HUD server was
actually fetching data from Victoria Metrics when I updated it before; I
only verified that the Unauthorized errors in the `vmselect` log
went away.  They did, but only because now the Kitchen server was
failing to contact `vmselect` at all.
2024-02-21 08:01:35 -06:00
Dustin 2acefd9a72 v-m: Add alert for sensor battery levels
I did not realize the batteries on the garage door tilt sensors had
died.  Adding alerts for various sensor batteries should help keep me
better informed.
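
A sketch of what such a rule could look like.  The metric name assumes
the Home Assistant Prometheus integration, and the threshold and duration
are arbitrary example values.

```yaml
groups:
  - name: sensors
    rules:
      - alert: SensorBatteryLow
        # Assumed metric name; 20% is an example threshold
        expr: homeassistant_sensor_battery_percent < 20
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Battery low on {{ $labels.entity }}"
```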
2024-02-16 20:56:38 -06:00
Dustin 9784b90743 cert-manager: Remove unused secrets
These secrets were used by previous issuers/solvers and are no longer
needed.
2024-02-16 20:56:08 -06:00
Dustin 0ad63e0613 authelia: Allow anonymous access to AlertManager
Sometimes, I want to be able to look at active alerts without logging
in.  This rule allows read-only access to the AlertManager UI and API.
Unfortunately, the user experience when attempting to create a new
Silence using the UI without first logging in is suboptimal, but I think
that's worth the trade-off.
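
Such a rule could be sketched as follows in the Authelia access control
configuration.  The domain is a placeholder, and limiting the rule to
`GET` and `HEAD` is one assumed way of keeping the access read-only.

```yaml
access_control:
  rules:
    - domain: alertmanager.example.com     # placeholder domain
      policy: bypass
      # Only safe methods are allowed anonymously; creating a Silence
      # (POST) still falls through to the authenticated rules
      methods:
        - GET
        - HEAD
```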
2024-02-16 20:41:47 -06:00
Dustin 2f6c358860 invoice-ninja: Update PVC for restored backup
The Longhorn volume for the *invoice-ninja* PVC got into a strange state
following an unexpected shutdown this morning.  One of its replicas
seemed to have disappeared, and it also thought that the size had
changed.  As such, it got stuck in "expanding" state, but it was not
actually being expanded.  This issue is described in detail in the
Longhorn documentation: [Troubleshooting: Unexpected expansion leads to
degradation or attach failure][0].  Unfortunately, there is no way to
recover a volume from that state, and it must be deleted and recreated
from backup.  This changes some of the properties of the PVC, so they
need to be updated in the manifest.

[0]: https://longhorn.io/kb/troubleshooting-unexpected-expansion-leads-to-degradation-or-attach-failure/
2024-02-15 09:45:57 -06:00
Dustin 80df160ceb device-plugins: Allow FUSE plugin on Jenkins nodes
Jenkins jobs that build container images need access to `/dev/fuse`.
Thus, we have to allow Pods managed by the *fuse-device-plugin*
DaemonSet to be scheduled on nodes that are tainted for use exclusively
by Jenkins jobs.
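
In the DaemonSet's Pod template this amounts to a toleration along these
lines.  The taint key and effect shown are hypothetical and must match
whatever taint the Jenkins nodes actually carry.

```yaml
spec:
  template:
    spec:
      tolerations:
        # Hypothetical taint key used to reserve nodes for Jenkins jobs
        - key: example.com/jenkins
          operator: Exists
          effect: NoSchedule
```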
2024-02-13 07:56:35 -06:00
Dustin 33fa951c68 Merge remote-tracking branch 'refs/remotes/origin/master' 2024-02-03 09:52:39 -06:00
Dustin a395d176bc sshca: Set group principals for Server Admins
Members of the *Server Admins* group need to be able to log in to
machines using their respective privileged accounts for e.g.
provisioning or emergencies.
2024-02-02 21:02:40 -06:00
Dustin 1f28a623ae v-m: Do not scrape/alert on Graylog
Graylog is down because Elasticsearch corrupted itself again, and this
time, I'm just not going to bother fixing it.  I practically never use
it anymore anyway, and I want to migrate to Grafana Loki, so now seems
like a good time to just get rid of it.
2024-02-01 21:45:43 -06:00