1
0
Fork 0
Commit Graph

582 Commits (9d18173b3e31529e9cf910eddded83bd50f5eef4)

Author SHA1 Message Date
Dustin fefbaa9991 ingress: Use Deployment+Service with externalIPs
Now that we have `keepalived` managing the "virtual" IP address for the
ingress controller, we can change _ingress-nginx_ to run as a Deployment
rather than a DaemonSet.  It no longer needs to use the host network
namespace, as `kube-proxy` will route all traffic sent to the configured
external IP address to the controller pods.  Using the _Local_ external
traffic policy disables NAT, so incoming traffic is seen by the
nginx unmodified.
2024-11-22 22:35:37 -06:00
Dustin e7ea2b0659 keepalived: Initial commit
Running `keepalived` as a DaemonSet will allow managing floating
"virtual" IP addresses for Kubernetes services with configured external
IP addresses.  The main services we want to expose outside the cluster
are _ingress-nginx_, Mosquitto, and RabbitMQ.  The `keepalived` cluster
will negotiate using the VRRF protocol to determine which node should
have each external address.  Using the process tracking feature of
`keepalived`, we can steer traffic directly to the node where the target
service is running.
2024-11-22 22:26:48 -06:00
Dustin 5c78bb89b5 Merge remote-tracking branch 'refs/remotes/origin/master' 2024-11-22 19:38:00 -06:00
Dustin 0a6086eb2a longhorn: Run on dedicated nodes
I've created new worker nodes that are dedicated to running Longhorn
replicas.  These nodes are tainted with the
`node-role.kubernetes.io/longhorn` taint, so no regular pods will be
scheduled there by default.  Longhorn pods thus needs to be configured
to tolerate that taint, and to be scheduled on nodes with the
similarly-named label.
2024-11-21 22:59:14 -06:00
Dustin d6c83565ec rabbitmq: Update to 4.0
RabbitMQ Server 3.13 is out of support now.
2024-11-21 22:59:14 -06:00
Dustin 121e6e7111 rabbitmq: Switch to using volume claim templates
This will make it easier to "blow away" the RabbitMQ data volume on the
occasions when it gets into a weird state.  Simply scale the StatefulSet
down to 0 replicas, delete the PVC, then scale back up.  Kubernetes will
handle creating a new PVC automatically.
2024-11-21 22:59:14 -06:00
Dustin 3d5dd52eb9 ingress: Use upstream resources w/ patches
This will make it easier to upgrade, since we keep track of _exactly_
what we changed from the upstream resources with Kustomize patches.
2024-11-21 19:42:35 -06:00
Dustin 3b3d4c38ed dynk8s: Move Wireguard config to SealedSecret 2024-11-21 19:41:55 -06:00
Dustin da81a336e1 dynk8s-provisioner: Migrate to Kustomize 2024-11-19 10:43:42 -06:00
Dustin e0c633c21e v-m: scrape: Fix Nextcloud URL
Nextcloud uses a _client-side_ (Javascript) redirect to navigate the
browser to its `index.php`.  The page it serves with this redirect is
static and will often load successfully, even if there is a problem with
the application.  This causes the Blackbox exporter to record the site
as "up," even when it it definitely is not.  To avoid this, we can
scrape the `index.php` page explicitly, ensuring that the application is
loaded.
2024-11-17 18:43:00 +00:00
Dustin 14492d827a Merge pull request 'home-assistant: Update to 2024.11.2' (#34) from updatebot/home-assistant into master
Reviewed-on: #34
2024-11-16 18:04:43 +00:00
Dustin 444686cb1e Merge pull request 'paperless-ngx: Update to 2.13.0' (#31) from updatebot/paperless-ngx into master
Reviewed-on: #31
2024-11-16 17:55:04 +00:00
Dustin ceea84d7f9 Merge pull request 'firefly-iii: Update to 6.1.22' (#33) from updatebot/firefly-iii into master
Reviewed-on: #33
2024-11-16 17:45:08 +00:00
bot 4d2cc40b5e tika: Update to 3.0.0.0 2024-11-16 12:32:14 +00:00
bot c31db5fde2 gotenberg: Update to 8.13.0 2024-11-16 12:32:14 +00:00
bot 74ce0e1b0a paperless-ngx: Update to 2.13.5 2024-11-16 12:32:14 +00:00
bot f0b16fd53c firefly-iii: Update to 6.1.22 2024-11-16 12:32:12 +00:00
bot acd9a0fa92 zwavejs2mqtt: Update to 9.27.2 2024-11-16 12:32:08 +00:00
bot 115b4ade39 home-assistant: Update to 2024.11.2 2024-11-16 12:32:08 +00:00
Dustin c1927eecfc Merge pull request 'home-assistant: Update to 2024.10.4' (#30) from updatebot/home-assistant into master
Reviewed-on: #30
2024-11-12 15:56:50 +00:00
Dustin 04ef1faf75 Merge pull request 'authelia: Update to 4.38.17' (#32) from updatebot/authelia into master
Reviewed-on: #32
2024-11-12 15:14:50 +00:00
Dustin 0209f921c3 v-m: Remove nut0 from scrape targets
_nut0.pyrocufflink.blue_ is decommissioned.
2024-11-12 08:02:00 -06:00
Dustin 62b19e942b sshca: Add machine ID for nut1.p.b 2024-11-10 11:19:53 -06:00
bot b956e9ac05 authelia: Update to 4.38.17 2024-11-09 12:32:16 +00:00
bot f7eb3b49e7 zwavejs2mqtt: Update to 9.26.0 2024-11-09 12:32:08 +00:00
bot 0db830a670 zigbee2mqtt: Update to 1.41.0 2024-11-09 12:32:08 +00:00
bot 6d137af6dc home-assistant: Update to 2024.11.1 2024-11-09 12:32:08 +00:00
Dustin 3d40424cf7 fleetlock: Use patched server from Github PR
The _fleetlock_ server drains all pods from a node before allocating the
reboot lock to that node.  Unfortunately, it doesn't actually wait for
those pods to be completely evicted.  If some pods take too long to shut
down, they may get stuck in `Terminating` state once the machine starts
rebooting.  This makes it so those pods cannot be replaced on another
node with the original one is offline, which pretty much defeats the
purpose of using Fleetlock in the first place.

It seems upstream has abandoned this project, as there is an open [Pull
Request][0] to fix this issue that has so far been ignored.
Fortunately, building a new container image containing the patch is easy
enough, so we can run our own patched build.

[0]: https://github.com/poseidon/fleetlock/pull/271
2024-11-05 07:05:55 -06:00
Dustin ac62a77c96 Merge branch '20125' 2024-11-05 07:05:19 -06:00
Dustin e1d9833e83 cert-manager: Add cert for apps.du5t1n.xyz 2024-11-05 07:04:27 -06:00
Dustin 4ad5518f18 cert-manager: Migrate config to configMapGenerator 2024-11-05 07:04:09 -06:00
Dustin 9f287d0f71 v-m/alerts: Add alerts for backup RAID array
Just like I did with the RAID-1 array in the old BURP server, I will
keep one member active and one in the fireproof safe, swapping them each
month.  We can use the same metrics queries to alert on when the swap
should happen that we used with the BURP server.
2024-11-04 20:46:03 -06:00
Dustin 2380468658 v-m/scrape: Collect Jellyfin metrics 2024-11-04 20:38:25 -06:00
Dustin db7c07ee55 v-m/scrape: Ignore cloud Kubernetes nodes
The ephemeral Jenkins worker nodes that run in AWS don't have colletcd,
promtail, or Zincati.  We don't needto get three alerts every time a
worker starts up to handle am ARM build job, so we drop these discovered
targets for these scrape jobs.
2024-11-04 20:35:17 -06:00
Dustin d76a1360c8 v-m/alerts: Ignore Paperless consume_file task
Paperless-ngx uses a Celery task to process uploaded files, converting
them to PDF, running OCR, etc.  This task can be marked as "failed" for
various reasons, most of which are more about the document itself than
the health of the application.  The GUI displays the results of failed
tasks when they occur.  It doesn't really make sense to have an alert
about this scenario, especially since there's nothing to do to directly
clear the alert anyway.
2024-11-04 20:28:11 -06:00
Dustin 71b52e4c6f 20125: Deploy Status server
https://20125.home/ is the URL the Status Android application loads in
its main WebView.  This site is powered by a server that generates a
custom page showing the status of our self-hosted applications, based on
alerts retrieved from the AlertManager API.

Android WebView does not allow cleartext HTTP connections.  It does,
however, allow connecting an HTTPS server and ignoring the certificate
it presents, which is effectively the same thing.  Thus, we generate a
self-signed certificate for the Ingress for this site.
2024-11-02 19:51:53 -05:00
Dustin 8ecee4133f v-m/alerts: Rework free disk space alert
Fedora CoreOS fills `/boot` beyond the 75% alert threshold under normal
circumstances on aarch64 machines.  This is not a problem, because it
cleans up old files on its own, so we do not need to alert on it.
Unfortunately, the _DiskUsage_ alert is already quite complex, and
adding in exclusions for these devices would make it even worse.

To simplify the logic, we can use a recording rule to precomupte the
used/free space ratio.  By using `sum(...) without (type)` instead of
`sum(...) on (df, instance)`, we keep the other labels, which we can
then use to identify the metrics coming from machines we don't care to
monitor.

Instead of having different thresholds for different volumes
encoded in the same expression, we can use multiple alerts to alert on
"low" vs "very low" thresholds.  Since this will of course cause
duplicate alerts for most volumes, we can use AlertManager inhibition
rules to disable the "low" alert once the metric crosses the "very low"
threshold.
2024-11-02 09:38:02 -05:00
Dustin 4cef41688f v-m/alerts: Add Zigbee+ZWave network alerts 2024-11-01 18:14:56 -05:00
Dustin 6cf11f9f61 v-m: Scrape HAProxy 2024-11-01 18:14:37 -05:00
Dustin 7a768cbb76 v-m: Update jobs for new Loki server
*loki1.pyrocufflink.blue* is a regular Fedora machine, a member of the
AD domain, and managed by Ansible.  Thus, it does not need to be
explicitly listed as a scrape target.

For scraping metrics from Loki itself, I've changed the job to use
DNS-SD because it seems like `vmagent` does _not_ re-resolve host names
from static configuration.
2024-11-01 18:07:34 -05:00
Dustin 0101040634 v-m/alerts: Add Paperless-ngx email task alert
This alert should fire if the background task to fetch e-mail and import
them into Paperless-ngx has not run for a while.
2024-11-01 18:04:06 -05:00
Dustin 3f9601dc94 v-m/alerts: Improve Paperless-ngx Celery task alert
The `flower_events_total` metric is a counter, so its value only ever
increases (discounting restarts of the server process).  As such,
nonzero values do not necessarily indicate a _current_ problem, but
rather that there was one at some point in the past.  To identify
current issues, we need to use the `increase` function, and then apply
the `max_over_time` function so that the alert doesn't immediately reset
itself.
2024-11-01 18:00:50 -05:00
Dustin d12e66f58a v-m: Scrape Frigate exporter 2024-11-01 17:47:51 -05:00
Dustin 045eea89a9 Merge remote-tracking branch 'refs/remotes/origin/master' 2024-10-19 09:49:59 -05:00
Dustin 8ff45a8c01 paperless-ngx/gotenberg: Run as correct user
The Gotenberg container image uses UID 1001 for the _gotenberg_ user.
Using any other UID number, even when the home directory is set and
owned by that UID, results in random issues, especially when using
LibreOffice conversions.
2024-10-19 09:46:15 -05:00
giteadmin d3e00680c0 Merge pull request 'home-assistant: Update to 2024.10.3' (#29) from updatebot/home-assistant into master
Reviewed-on: #29
2024-10-19 13:13:12 +00:00
bot c5daf23f71 mosquitto: Update to 2.0.20 2024-10-19 11:32:16 +00:00
bot 6e2c8d1a25 zwavejs2mqtt: Update to 9.24.0 2024-10-19 11:32:16 +00:00
bot 0e3f719e32 whisper: Update to 2.2.0 2024-10-19 11:32:16 +00:00
bot 94e10207d2 home-assistant: Update to 2024.10.3 2024-10-19 11:32:15 +00:00