Commit Graph

549 Commits (f72639bec6a0a71ebdd78067d4f2716e1a86f074)

Author SHA1 Message Date
Dustin 94b7168b1e home-assistant: Add restart MQTTMarionette script
There's obviously a bug or something in `mqttmarionette` because it
occasionally gets "stuck" in a state where it is running but does
not reconnect to the MQTT broker.  In such situations, it has to be
restarted (and even then it doesn't shut down correctly but has to
be killed with SIGKILL, usually).  I have been doing this manually, but
with this shell script and a corresponding "shell command" integration
in Home Assistant, it can be done automatically.  This is similar to
how Home Assistant restarts Mopidy on the living room stereo when it
gets into the same kind of state.
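
The Home Assistant side is just a `shell_command` entry; a minimal
sketch, with the key name, SSH identity, and target host all assumed:

```yaml
# Hypothetical configuration.yaml snippet; the identity file, user, and
# host are assumptions, not the real values.
shell_command:
  restart_mqttmarionette: >-
    ssh -i /config/.ssh/id_ed25519 automation@mqtt-host
    sudo /usr/local/bin/restart-mqttmarionette
```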
2024-08-23 09:24:46 -05:00
Dustin 7dffb5195a v-m: alertmanager: Group disk usage alerts
Some machines have the same volume mounted multiple times (e.g.
container hosts, BURP).  Alerts will fire for all of these
simultaneously when the filesystem usage passes the threshold.  To avoid
getting spammed with a bunch of messages about the same filesystem,
we'll group alerts from the same machine.
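
A minimal sketch of the AlertManager route, assuming the alert and
label names:

```yaml
route:
  routes:
    # The alertname matcher is an assumption
    - matchers:
        - alertname = "FilesystemUsageHigh"
      # One notification per machine, no matter how many mounts fire
      group_by: [instance]
```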
2024-08-17 10:59:05 -05:00
Dustin 02001f61db v-m/scrape: websites: Stop scraping Matrix
I'm not using Matrix for anything anymore, and it seems to have gone
offline.  I haven't fully decommissioned it yet, but the Blackbox scrape
is failing, so I'll just disable that bit for now.
2024-08-17 10:57:22 -05:00
Dustin c7e4baa466 v-m: scrape: Remove nvr2.p.b Zincati scrape target
I've redeployed *nvr2.pyrocufflink.blue* as Fedora Linux, so it does not
run Zincati anymore.
2024-08-17 10:56:06 -05:00
Dustin 1a631bf366 v-m: scrape: Remove serial1.p.b
This machine never worked correctly; the USB-RS232 adapters would stop
working randomly (and of course it would be whenever I needed to
actually use them).  I thought it was something wrong with the server
itself (a Raspberry Pi 3), but the same thing happened when I tried
using a Pi 4.

The new backup server has a plethora of on-board RS-232 ports, so I'm
going to use it as the serial console server, too.
2024-08-17 10:54:21 -05:00
Dustin 6f7f09de85 v-m: scrape: Update Unifi server target
I've rebuilt the Unifi Network controller machine (again);
*unifi3.pyrocufflink.blue* has replaced *unifi2.p.b*.  The
`unifi_exporter` no longer works with the latest version of Unifi
Network, so it's not deployed on the new machine.
2024-08-17 10:52:51 -05:00
Dustin 809676f691 v-m: alerts: Add Longhorn alerts 2024-08-17 10:51:13 -05:00
Dustin 9977bb3de4 Merge remote-tracking branch 'refs/remotes/origin/master' 2024-08-06 08:03:42 -05:00
Dustin dcd3f898c7 xactmon: Deploy Invoice Ninja importer for HLC
Bank notifications sent to Tabitha's mailbox are now processed by
`xactmon` and imported into Invoice Ninja as expenses for Hatch Learning
Center.
2024-08-03 13:39:17 -05:00
Dustin 5b34547730 h-a: Config Zigbee2MQTT w/ env vars
Zigbee2MQTT commits the cardinal sin of storing state in its
configuration file.  This means the file has to be writable and thus
stored in persistent storage rather than in a ConfigMap.  As a
consequence, making changes to the configuration when the application is
not running is rather difficult.  Case in point: when I added the
internal alias for _mqtt.pyrocufflink.blue_ pointing to the in-cluster
service, Zigbee2MQTT became unable to connect to the broker because it
was using the node port instead of the internal port.  Since it could
not connect to the broker, it refused to start, and thus the container
would not stay running long enough to fix the configuration to point
to the correct port.

Fortunately, Zigbee2MQTT also allows configuring settings via
environment variables, which can be managed with a ConfigMap.  The
values read from environment variables override those from the
configuration file, so pointing to the correct broker port with an
environment variable was sufficient to allow the application to start.
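
A minimal sketch of the ConfigMap, with the broker URL assumed;
Zigbee2MQTT maps `ZIGBEE2MQTT_CONFIG_*` variables onto keys in
`configuration.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: zigbee2mqtt-env
data:
  # Overrides the mqtt.server key from the (stateful) configuration
  # file; the URL here is an assumption.
  ZIGBEE2MQTT_CONFIG_MQTT_SERVER: mqtt://mqtt.pyrocufflink.blue:1883
```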
2024-08-01 09:27:52 -05:00
Dustin b366532c88 cert-manager, step-ca: Bypass cluster DNS
Having name overrides for in-cluster services breaks ACME challenges,
because the server tries to connect to the Service instead of the
Ingress.  To fix this, we need to configure both _cert-manager_ and
_step-ca_ to *only* resolve names using the network-wide DNS server.
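
This amounts to a patch like the following on each Deployment; the
nameserver address is an assumption:

```yaml
spec:
  template:
    spec:
      # Skip cluster DNS entirely; resolve only via the network-wide server
      dnsPolicy: None
      dnsConfig:
        nameservers:
          - 172.30.0.1  # hypothetical address of the network-wide DNS server
```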
2024-07-29 20:58:18 -05:00
Dustin a785fcec73 sshca: Allow Jenkins jobs to restart the Deployment
The Jenkins job for the SSHCA Server restarts the Deployment after
building a new container image.
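
The Kubernetes side of this is RBAC permitting `kubectl rollout
restart`, which works by patching the Deployment's pod template; a
sketch, with assumed names:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: restart-sshca  # hypothetical name
rules:
  - apiGroups: [apps]
    resources: [deployments]
    verbs: [get, patch]  # rollout restart = get, then patch an annotation
```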
2024-07-27 13:10:20 -05:00
Dustin a26857819a step-ca: Add Ingress resource
It turns out, `step ca renew` _can_ renew certificates without mTLS; it
has a `--mtls=false` command-line argument that configures it to use
a JWT signed by the certificate's key, instead of using the certificate at
the transport layer.  This allows clients to renew their certificates
without needing another authentication mechanism, even with the
TLS-terminating proxy.
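
A sketch of the resource; the host, names, and backend port are
assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: step-ca  # assumed name
  annotations:
    # step-ca itself serves TLS, so the proxy must speak HTTPS to it
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
spec:
  rules:
    - host: ca.pyrocufflink.blue  # assumed host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: step-ca
                port:
                  number: 443  # assumed port
```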
2024-07-27 13:07:26 -05:00
Dustin 079c3871b9 invoice-ninja: Fix document upload feature
Invoice Ninja allows attaching documents to invoices, payments,
expenses, etc.  Tabitha wants to use this feature to attach receipts for
her expenses, but the photos her phone takes of them are too large for
the default nginx client body limit.  We can raise this limit on the
ingress, but we also need to raise it on the "inner" nginx.
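
On the ingress, this is a single annotation (the limit shown is an
assumed value); the inner nginx needs a matching `client_max_body_size`:

```yaml
metadata:
  annotations:
    # Raise the request body limit for photo uploads; 32m is an assumption
    nginx.ingress.kubernetes.io/proxy-body-size: 32m
```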
2024-07-27 13:04:02 -05:00
Dustin e74a6b3142 invoice-ninja: Run in a mutable container
The Invoice Ninja container is not designed to be immutable at all; it
makes a bunch of changes to its own contents when it starts up.
Notably, it copies the contents of the `public` and `storage`
directories from the container image to the persistent volume _and then
deletes the source_.  Additionally, being a Laravel application, it
needs write access to its own code for caching, etc.  Previously, the
`init.sh` script copied the entire `app` directory to a temporary
directory, and then the runtime container mounted that volume over the
top of the original location.  This allowed the root filesystem of the
container to be read-only, while the `app` directory was still mutable.
Unfortunately, this makes the startup process incredibly slow, as it
takes a couple of minutes to copy the whole application.  It's also
pretty pointless, because the application runs as an unprivileged
process, so it wouldn't have write access to the rest of the filesystem
anyway.  As such, I've decided to remove the `readOnlyRootFilesystem`
restriction, and allow the container to run as upstream intends, albeit
begrudgingly.
2024-07-27 12:57:02 -05:00
Dustin 78cd26c827 v-m: Scrape metrics from RabbitMQ 2024-07-26 20:59:00 -05:00
Dustin e56a38c034 cert-manager: Add dch-ca issuer
In-cluster services can now get certificates signed by the DCH CA via
`step-ca`.  This issuer uses ACME with the HTTP-01 challenge, so it
can only issue certificates for names in the _pyrocufflink.blue_ zone
that point to the ingress controllers.
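
A minimal sketch of the issuer; the ACME directory URL and secret name
are assumptions:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: dch-ca
spec:
  acme:
    # Assumed URL for the step-ca ACME provisioner's directory endpoint
    server: https://ca.pyrocufflink.blue/acme/acme/directory
    privateKeySecretRef:
      name: dch-ca-acme-account  # hypothetical secret name
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
```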
2024-07-26 20:59:00 -05:00
Dustin 54187176ba ingress: Proxy AMQP
Passing port 5671 through the ingress-nginx proxy to the `rabbitmq`
service will allow clients outside the cluster to connect to it.

While we're at it, we'll move the definition of the `tcp-services`
ConfigMap to its own file to make it easier to maintain.
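
For reference, the ingress-nginx convention maps an exposed port to a
`namespace/service:port` value; the namespace here is an assumption:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  "5671": rabbitmq/rabbitmq:5671  # AMQPS passthrough to the rabbitmq Service
```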
2024-07-26 20:59:00 -05:00
Dustin 1a1d8ff27d rabbitmq: Deploy RabbitMQ Server
RabbitMQ is an AMQP message broker.  It will be used by `xactmon` to
pass messages between the components.

Although RabbitMQ can be deployed in a high-availability cluster, we
don't really need that level of robustness for `xactmon`, so we will
just run a single instance.  Deploying a single-host RabbitMQ server
is pretty straightforward.

We're using mTLS authentication; clients need to have a certificate
issued by the *RabbitMQ CA* in order to connect to the message broker.
The `rabbitmq-ca` _cert-manager_ ClusterIssuer issues these certificates
for in-cluster services like `xactmon`.
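
A client like `xactmon` requests its certificate with a Certificate
resource along these lines (names and paths are assumptions):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: xactmon-amqp-client  # hypothetical name
spec:
  secretName: xactmon-amqp-client-tls
  commonName: xactmon
  usages:
    - client auth
  issuerRef:
    name: rabbitmq-ca
    kind: ClusterIssuer
```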
2024-07-26 20:59:00 -05:00
Dustin a04a2b5334 xactmon: Deploy xactmon
`xactmon` is a new tool I developed to parse transaction notifications
from banks and automatically import them into my personal finance
tracker.  It is designed in a modular fashion, composed of three main
components:

* Receiver
* Processor
* Importer

Components communicate with one another using an AMQP exchange.
Hypothetically, there could be multiple implementations of the receiver
and importer components.  Right now, there is only a JMAP receiver,
which fetches email messages (from Fastmail), and a Firefly III
importer.  The processor is a singleton, handling notifications from the
receiver, parsing them into a normalized format, and passing them on to
the importer.  It uses a set of rules to decide how to parse the
messages, and supports using either a regular expression with named
capture groups or an Awk script to extract the relevant information.
2024-07-26 20:53:19 -05:00
Dustin ccc46288c2 Merge remote-tracking branch 'refs/remotes/origin/master' 2024-07-22 08:12:11 -05:00
Dustin f4d41c0ec7 invoice-ninja: Add Ingress for HLC client portal
Tabitha wants to use the Invoice Ninja Client Portal and Stripe
integration for customer payments.
2024-07-14 15:41:14 -05:00
Dustin 989556d458 cert-manager: Update to v1.14.5 2024-07-14 15:14:44 -05:00
Dustin 74fa9264df xactfetch: Configure secretsocket
The `xactfetch` script now uses a helper tool, `secretsocket`, to
handle looking up secrets.  This tool supports various secret source
types, including files, environment variables, and external commands.
Separating this functionality out of the main script makes it a lot
more flexible and pluggable.  Its main purpose, though, was actually
to allow `xactfetch` to run in a container while communicating with
`rbw` outside that container, specifically for development purposes.

The `secretsocket` tool reads its configuration from a TOML document.
This document defines the secrets the tool handles, and how to look
them up.

Note that the `xactfetch` container image no longer defines the
`XDG_CONFIG_HOME` environment variable, as it uses Chromium instead of
Firefox now, and the former does not work with a read-only config
directory.  As such, we have to mount the `rbw` configuration in the
default location.
2024-07-11 22:49:07 -05:00
Dustin 71ca910ef7 home-assistant: Add Tabitha's HLC calendar 2024-07-11 22:15:56 -05:00
Dustin ee00412bf6 xactfetch: Use separate CronJobs per bank
Usually, `xactfetch` will only fail for one bank or the other.  Rarely
do we want to redownload the data from both banks just because one
failed.  The latest version of `xactfetch` supports specifying a bank
name as a CLI argument, so now we can define separate jobs for each
bank.  Then, when one Job fails, only that one will be retried later.

It's kind of a bummer that it's so repetitive to define two CronJobs
that differ by only a single command-line argument.  I suppose that's
a good argument for using one of the preprocessor tools like Jsonnet
or KCL.
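
A sketch of one of the pair; its twin differs only in the final
argument.  The schedule, image, and bank name are assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: xactfetch-bank-a  # hypothetical bank name
spec:
  schedule: "0 6 * * *"  # assumed schedule
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: xactfetch
              image: xactfetch:latest  # assumed image reference
              args: [bank-a]
```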
2024-07-11 22:09:27 -05:00
Dustin c741d04d54 xactfetch: Skip wait for manual runs
When the `xactfetch` CronJob is triggered manually, it will now skip
the `sleep` step.  Presumably, whoever triggered it wants the script
to run _right now_, probably to diagnose a problem.
2024-07-11 22:07:54 -05:00
Dustin 8cb292a4b2 v-m: alerts: Add alert for temperatures
After the incident this week with the CPU overheating on _vmhost1_, I
want to make sure I know as soon as possible when anything is starting
to get too hot.
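
A minimal sketch of the rule; the metric name and threshold are
assumptions:

```yaml
groups:
  - name: temperature
    rules:
      - alert: TemperatureHigh
        # Assumed collectd metric name; the 80°C threshold is also an
        # assumption
        expr: collectd_thermal_temperature_celsius > 80
        for: 5m
        annotations:
          summary: '{{ $labels.instance }} is running hot'
```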
2024-07-11 22:07:27 -05:00
Dustin 8113e5a47f v-m: Fix syntax in AlertManager config
The `group_by` field takes a list of label names, rather than a single
string.
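
That is (the label names here are assumptions):

```yaml
# Wrong: group_by: alertname
group_by:
  - alertname
  - instance
```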
2024-07-06 07:13:27 -05:00
Dustin 952ab9f264 v-m: alertmanager: Group camera notifications
When Frigate is down, multiple alerts are generated for each camera, as
Home Assistant creates camera entities for each tracked object.  This is
extremely annoying, not to mention unnecessary.  To address this, we'll
configure AlertManager to send a single notification for alerts in the
group.
2024-07-05 07:30:30 -05:00
Dustin 9b26753e73 v-m: alerts: Add durations to spammy alerts
Let's avoid sending alerts immediately when something is unavailable,
because the issue might be transient and will resolve itself shortly.
2024-07-05 07:23:38 -05:00
Dustin fa80b15a71 jenkins: Remove Argo CD sync hook
Since Jenkins no longer uses a Longhorn volume, this sync hook is not
useful.
2024-07-04 06:53:58 -05:00
Dustin 248a9a5ae9 v-m: Scrape PostgreSQL exporter
The [postgres exporter][0] exposes metrics about the operation and
performance of a PostgreSQL server.  It's currently deployed on
_db0.pyrocufflink.blue_, the primary server of the main PostgreSQL
cluster.
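
A minimal sketch of the scrape job; 9187 is the exporter's default
port:

```yaml
scrape_configs:
  - job_name: postgres
    static_configs:
      - targets:
          - db0.pyrocufflink.blue:9187
```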

[0]: https://github.com/prometheus-community/postgres_exporter
2024-07-02 18:16:05 -05:00
Dustin 215b2c6975 home-assistant: Use external PostgreSQL server
Home Assistant uses PostgreSQL for recording the history of entity
states.  Since we had been using the in-cluster database server for
this, the data were migrated to the new external PostgreSQL server
automatically when the backup from the former was restored on the
latter.  It follows, then, that we can point Home Assistant to the
new server as well.

Home Assistant uses SQLAlchemy, which in turn uses _libpq_ via
_psycopg_, as a client for PostgreSQL.  It doesn't expose any
configuration parameters beyond the "database URL" directly, but we
can use the standard environment variables to specify the certificate
and private key for authentication.  In fact, the empty `postgresql://`
URL is sufficient, and indicates that _all_ of the connection parameters
should be taken from environment variables.  As a result, the
`wait-for-db` init container and the main container take exactly the
same environment variables, so we can use YAML anchors to share their
definitions.
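
A minimal sketch of the anchor trick; the variable values are
assumptions:

```yaml
initContainers:
  - name: wait-for-db
    env: &pg-env  # define the shared list once...
      - name: PGHOST
        value: db0.pyrocufflink.blue  # assumed host
      - name: PGSSLCERT
        value: /certs/tls.crt         # assumed mount path
containers:
  - name: home-assistant
    env: *pg-env  # ...and reuse it in the main container
```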
2024-07-02 18:16:05 -05:00
Dustin a269f8a1ae firefly-iii: Connect to external PostgreSQL
Since the new database server outside the Kubernetes cluster, created
for Authelia, was seeded from a backup of the in-cluster server, it
already contained the data from Firefly-III as well.  Thus, we can
switch Firefly-III to using it, too.

The documentation for Firefly-III does not mention anything about how
to configure it to use certificate-based authentication for PostgreSQL,
as is required by the new server.  Fortunately, it ultimately uses
_libpq_, so the standard `PG...` environment variables work fine.  We
just need a certificate issued by the _postgresql-ca_ ClusterIssuer and
the _DCH Root CA_ certificate mounted in the Firefly-III container.
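
A sketch of the relevant environment, with assumed mount paths; all of
these are standard _libpq_ variables:

```yaml
env:
  - name: PGHOST
    value: db0.pyrocufflink.blue  # assumed host name
  - name: PGSSLMODE
    value: verify-full
  - name: PGSSLCERT
    value: /certs/tls.crt         # from the postgresql-ca Certificate
  - name: PGSSLKEY
    value: /certs/tls.key
  - name: PGSSLROOTCERT
    value: /certs/ca.crt          # the DCH Root CA certificate
```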
2024-07-02 18:16:05 -05:00
Dustin 92497004be authelia: Point to external PostgreSQL server
If there is an issue with the in-cluster database server, accessing the
Kubernetes API becomes impossible by normal means.  This is because the
Kubernetes API uses Authelia for authentication and authorization, and
Authelia relies on the in-cluster database server.  To solve this
chicken-and-egg scenario, I've set up a dedicated PostgreSQL database
server on a virtual machine, totally external to the Kubernetes cluster.

With this commit, I have changed the Authelia configuration to point at
this new database server.  The contents of the new database server were
restored from a backup of the in-cluster server, so all of Authelia's
state was migrated automatically.  Thus, updating the configuration is
all that is necessary to switch to using it.

The new server uses certificate-based authentication.  In order for
Authelia to access it, it needs a certificate issued by the
_postgresql-ca_ ClusterIssuer, managed by _cert-manager_.  Although the
environment variables for pointing to the certificate and private key
are not listed explicitly in the Authelia documentation, their names
can be inferred from the configuration document schema and work as
expected.
2024-07-02 18:16:05 -05:00
Dustin a8ef4c7a80 v-m: Add component labels to configmaps
Adding a `component` label to each ConfigMap will make it possible to
target them specifically, e.g. with `kubectl apply -l`.
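
For example (the component value is an assumption):

```yaml
metadata:
  labels:
    component: vmagent  # assumed value
```

Then `kubectl apply -f . -l component=vmagent` touches only the
matching objects.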
2024-07-02 18:16:05 -05:00
Dustin 65e53ad16d v-m: Scrape Zincati metrics from K8s nodes
All the Kubernetes nodes (except *k8s-ctrl0*) are now running Fedora
CoreOS.  We can therefore use the Kubernetes API to discover scrape
targets for the Zincati job.
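
A minimal sketch of the discovery part of the job; the relabeling to
the actual metrics endpoint (whose port is an assumption) is elided:

```yaml
scrape_configs:
  - job_name: zincati
    kubernetes_sd_configs:
      - role: node  # one target per cluster node
```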
2024-07-02 18:16:05 -05:00
Dustin 31345bee7b home-assistant: Add Pool Time WebDAV calendar
I've created a _Pool Time_ calendar in Nextcloud that we can use to
mark when people are expected to be in the pool.  Using this, we can
configure the "someone is in the pool" alert not to fire during times
when we know people will be in the pool.  This will make it much less
annoying on HLC pool days.
2024-07-02 18:16:05 -05:00
Dustin 2d7fec1cdf v-m: vmstorage: Add pod anti-affinity
One of the reasons for moving to 4 `vmstorage` replicas was to ensure
that the load was spread evenly between the physical VM host machines.
To ensure that is the case as much as possible, we need to keep one
pod per Kubernetes node.
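
A sketch of the rule, assuming the pod label:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: vmstorage  # assumed label
        topologyKey: kubernetes.io/hostname
```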
2024-06-26 18:29:49 -05:00
Dustin f7f408ca8c v-m: Redo vmstorage persistent volumes
Longhorn does not work well for very large volumes.  It takes ages to
synchronize/rebuild them when migrating between nodes, which happens
all too frequently.  This consumes a lot of resources, which impacts
the operation of the rest of the cluster, and can cause a cascading
failure in some circumstances.

Now that the cluster is set up to be able to mount storage directly from
the Synology, it makes sense to move the Victoria Metrics data there as
well.  Similar to how I did this with Jenkins, I created
PersistentVolume resources that map to iSCSI volumes, and patched the
PersistentVolumeClaims (or rather the template for them defined by the
StatefulSet) to use these.  Each `vmstorage` pod then gets an iSCSI
LUN, bypassing both Longhorn and QEMU to write directly to the NAS.

The migration process was relatively straightforward.  I started by
scaling down the `vminsert` Deployment so the `vmagent` pods would
queue the metrics they had collected while the storage layer was down.
Next, I created a [native][0] export of all the time series in the
database.  Then, I deleted the `vmstorage` StatefulSet and its
associated PVCs.  Finally, I applied the updated configuration,
including the new PVs and patched PVCs, and brought the `vminsert`
pods back online.  Once everything was up and running, I re-imported
the exported data.
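
A sketch of one such PersistentVolume; the portal, IQN, and size are
assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: vmstorage-0
spec:
  capacity:
    storage: 512Gi  # assumed size
  accessModes:
    - ReadWriteOnce
  iscsi:
    targetPortal: synology.example:3260           # assumed portal address
    iqn: iqn.2000-01.com.synology:nas.vmstorage0  # assumed IQN
    lun: 1
    fsType: xfs
```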

[0]: https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html#how-to-export-data-in-native-format
2024-06-26 18:29:49 -05:00
Dustin 0f24341e5c collectd: Add DaemonSet for collectd
Since all the nodes in the cluster run Fedora CoreOS now, we can
deploy collectd as a container, managed by a DaemonSet.

Note that while _collectd_ has to run as _root_ in order to collect
a lot of metrics, it should not run with all privileges.  It does need
to run as a "super-privileged container" (`spc_t` SELinux domain), but
it does _not_ need most kernel capabilities.
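
A sketch of the container security context; which capabilities, if
any, need adding back is an assumption:

```yaml
securityContext:
  runAsUser: 0     # collectd needs root to read most system stats
  seLinuxOptions:
    type: spc_t    # "super-privileged container" SELinux domain
  capabilities:
    drop: [ALL]    # the exact set to retain, if any, is an assumption
```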
2024-06-26 18:29:49 -05:00
Dustin ab458df415 v-m/vmstorage: Start pods in parallel
By default, Kubernetes waits for each pod in a StatefulSet to become
"ready" before starting the next one.  If there is a problem starting
that pod, e.g. data corruption, then the others will never start.  This
sort of defeats the purpose of having multiple replicas.  Fortunately,
we can configure the pod management policy to start all the pods at
once, regardless of the status of any individual pod.  This way, if
there is a problem with the first pod, the others will still come up
and serve whatever data they have.
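
The relevant StatefulSet field:

```yaml
spec:
  podManagementPolicy: Parallel  # the default is OrderedReady
```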
2024-06-26 18:29:49 -05:00
Dustin 14be633843 v-m: Scrape Restic exporter 2024-06-26 18:29:49 -05:00
Dustin 5079599423 restic-exporter: Deploy Restic Prometheus exporter
The [restic-exporter][0] exposes metrics about Restic snapshots as
Prometheus metrics.  This allows us to get similar data as we have for
BURP backups.  Chiefly important among the metrics are last backup time
and size, which we can use to determine if backups are working
correctly.

[0]: https://github.com/ngosang/restic-exporter
2024-06-26 18:29:49 -05:00
Dustin ebcf9e3d42 authelia: Scale up to 2 replicas
Since Authelia is stateless, we can run a second instance to improve
availability.
2024-06-26 18:29:49 -05:00
Dustin 21e8ad2afd home-assistant: Add commands to control photoframe
The digital photo frame in the kitchen is powered by a small server
application, which exposes a minimal HTTP API.  Using this API, we can
e.g. advance or backtrack the displayed photo.  Exposing
or backtrack the displayed photo.  Exposing `rest_command` services
for these operations allows us to add buttons to dashboards to control
the frame.
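
A sketch of the `rest_command` entries; the host, port, and endpoint
paths are assumptions:

```yaml
rest_command:
  photoframe_next:
    url: http://photoframe.pyrocufflink.blue:8080/next      # assumed URL
    method: post
  photoframe_previous:
    url: http://photoframe.pyrocufflink.blue:8080/previous  # assumed URL
    method: post
```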
2024-06-26 18:29:49 -05:00
Dustin 1c4b32925e v-m: Use dynamic discovery for some collectd nodes
We don't need to explicitly specify every single host individually.
Domain controllers, for example, are registered in DNS with SRV records.
Kubernetes nodes, of course, can be discovered using the Kubernetes API.
Both of these classes of nodes change frequently, so discovering them
dynamically is convenient.
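
For the domain controllers, the scrape job can use Prometheus DNS
service discovery; the SRV name and the rewrite to the collectd
exporter port (9103 is its registered default) are assumptions:

```yaml
scrape_configs:
  - job_name: collectd-dc
    dns_sd_configs:
      - names:
          - _ldap._tcp.pyrocufflink.blue  # assumed SRV record
        type: SRV
    relabel_configs:
      # Point at the collectd write_prometheus port rather than the SRV port
      - source_labels: [__meta_dns_srv_record_target]
        regex: '(.+)\.'
        replacement: '${1}:9103'
        target_label: __address__
```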
2024-06-26 18:29:49 -05:00
Dustin 98651cf9d9 jenkins: Force iSCSI volume on specific nodes
Instead of routing iSCSI traffic from the Kubernetes network, through
the firewall, to the storage network, nodes now have a second network
adapter connected directly to the storage network.  The nodes with
such an adapter are labelled `network.du5t1n.me/storage`, so we can pin
the Jenkins PersistentVolume to them via a node affinity rule.
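
A sketch of the affinity rule on the PersistentVolume, using the label
mentioned above:

```yaml
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: network.du5t1n.me/storage
              operator: Exists
```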
2024-06-26 18:29:49 -05:00
Dustin a2225e583e paperless-ngx: Use volume claim template for redis
Using a volume claim template to define the persistent volume claim for
the Redis pod has two advantages: first, it enables using clustered
Redis, if we decide that becomes necessary, and second, it makes
deleting and recreating the volume easier in the case of data
corruption.  Simply scale down the StatefulSet to 0, delete the PVC, and
scale the StatefulSet back up.
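
A minimal sketch of the template; the name, size, and storage class
are assumptions:

```yaml
volumeClaimTemplates:
  - metadata:
      name: redis-data  # hypothetical name
    spec:
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 1Gi  # assumed size
```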
2024-06-26 18:29:49 -05:00