Commit Graph

965 Commits (221d3a2be93eea43b6e5f71d810f86f06a0806f0)

Author SHA1 Message Date
Dustin 431b7dfacc facts: Do not collect facts in first play
The first play in the `facts.yml` playbook contains a single task: clear
the existing fact cache.  It makes *no* sense to gather facts for this
play.
2023-10-27 17:40:50 -05:00
Dustin 7b23f6a4ac r/winbind: Disable offline login by default
The `winbind offline login` setting seems to cause issues when one of
the domain controllers is offline.  Rather than try the other DC,
winbind seems to just "give up" and return NT_STATUS_NO_SUCH_USER for
all authentication requests until the offline cache is flushed.  There's
not really any reason to use this setting on servers anyway, since they
are always connected to the LAN, as opposed to laptops that may
occasionally disconnect.  Let's disable this option in the hopes that it
makes logins more resilient to DC downtime.  After all, there's not much
point in having multiple DCs if they all have to be available in order
to log in.
2023-10-27 17:37:49 -05:00
Dustin 686817571e smtp-relay: Switch to Fastmail
AWS is going to begin charging extra for routable IPv4 addresses soon.
There's really no point in having a relay in the cloud anymore anyway,
since a) all outbound messages are sent via the local relay and b) no
messages are sent to anyone except me.
2023-10-24 17:27:21 -05:00
Dustin d2eb61cce1 r/sudo: Tag install tasks
Tasks that install packages need to be tagged as `install` so they can
be skipped by Jenkins daily runs.
2023-10-21 22:16:28 -05:00
Dustin 7c6ed667be r/system-auth: Tag install tasks
Tasks that install packages need to be tagged as `install` so they can
be skipped by Jenkins daily runs.
2023-10-21 22:16:28 -05:00
Dustin 6a6765ac06 r/system-auth: Remove uninstall authconfig task
The *authconfig* package has been gone from Fedora for ages.  There's
no reason to have this no-op step any more, especially since it has the
side-effect of making a network request to refresh the dnf cache.
2023-10-21 13:11:25 -05:00
Dustin 1b9543b88f metricspi: alerts: Increase Frigate disk threshold
We want the Frigate recording volume to be basically full at all times,
to ensure we are keeping as much recording as possible.
2023-10-15 09:52:12 -05:00
Dustin 2f554dda72 metricspi: Scrape k8s-aarch64-n1
I've added a new Kubernetes worker node,
*k8s-aarch64-n1.pyrocufflink.blue*.  This machine is a Raspberry Pi CM4
mounted on a Waveshare CM4-IO-Base A and clipped onto the DIN rail.
It's got 8 GB of RAM and 32 GB of eMMC storage.  I intend to use it to
build container images locally, instead of bringing up cloud instances.
2023-10-05 14:32:19 -05:00
Dustin a74113d95f metricspi: Scrape Zincati metrics from CoreOS hosts
Zincati is the automatic update manager on Fedora CoreOS.  It exposes
Prometheus metrics for host/update statistics, which are useful to track
the progress of automatic updates and identify update issues.

Zincati actually exposes its metrics via a Unix socket on the
filesystem.  Another process, [local_exporter], is required to expose
the metrics from this socket via HTTP so Prometheus can scrape them.

[local_exporter]: https://github.com/lucab/local_exporter
2023-10-03 10:29:12 -05:00
Dustin d7f778b01c metricspi: Scrape metrics from k8s-aarch64-n0
*collectd* is now running on *k8s-aarch64-n0.pyrocufflink.blue*,
exposing system metrics.  As it is not a member of the AD domain, it has
to be explicitly listed in the `scrape_collectd_extra_targets` variable.
2023-10-03 10:29:11 -05:00
Dustin 50f4b565f8 hosts: Remove nvr1.p.b as managed system
*nvr1.pyrocufflink.blue* has been migrated to Fedora CoreOS.  As such,
it is no longer managed by Ansible; its configuration is done via
Butane/Ignition.  It is no longer a member of the Active Directory
domain, but it does still run *collectd* and export Prometheus metrics.
2023-09-27 20:24:47 -05:00
Dustin e4c2b36dfd r/scrape-collectd: Also scrape unmanaged targets
The `scrape_collectd_extra_targets` variable can be used to specify a
list of additional targets to scrape, in addition to the hosts in the
*collectd-prometheus* group.  This will allow us to scrape hosts that
are not managed by the configuration policy, but still expose Prometheus
metrics via collectd.
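
As a rough illustration of how the two sources might combine, the
following Python sketch merges the inventory group members with the
extra targets.  The function name, the sample managed host, and the
port (9103, the default for collectd's `write_prometheus` plugin) are
assumptions for illustration; the role's actual template is not shown
here.

```python
def collectd_scrape_targets(group_hosts, extra_targets, port=9103):
    """Return an ordered, de-duplicated list of host:port scrape targets."""
    targets = []
    for host in [*group_hosts, *extra_targets]:
        # Entries that already carry a port are left untouched.
        target = host if ":" in host else f"{host}:{port}"
        if target not in targets:
            targets.append(target)
    return targets


managed = ["host0.example.test"]        # hypothetical group member
unmanaged = ["nvr1.pyrocufflink.blue"]  # host no longer managed by Ansible
print(collectd_scrape_targets(managed, unmanaged))
```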
2023-09-27 20:24:47 -05:00
Dustin d3799607ec hosts: Move nvr1.p.b back to main inventory
*nvr1.pyrocufflink.blue* is no longer offline.
2023-09-26 07:40:33 -05:00
Dustin 0037a3c281 r/minio: Reload server after changing cert
MinIO is supposed to automatically reload itself when the certificate
changes, but this does not appear to happen in all cases.  To ensure the
updated certificate gets used, we need to send SIGHUP to the MinIO
server process.
2023-09-22 07:29:05 -05:00
Dustin 1b63332872 r/jellyfin: Restrict HTTPS redirect to Jellyfin
Since Jellyfin is running on the file server, which also hosts a few
other websites that do not define virtual hosts, the HTTP-to-HTTPS
redirect was applied to *all* requests.  To avoid this, we simply add a
rewrite condition so that the redirect only applies to requests for
Jellyfin.
2023-09-13 10:06:12 -05:00
Dustin a2b3f9b5b9 jellyfin: Deploy Jellyfin media server
Jellyfin is a multimedia library manager. Clients can browse and stream
music, movies, and TV shows from the server and play them locally
(including in the browser).
2023-09-12 13:38:35 -05:00
Dustin 226a6bef46 Revert "hosts: Move serial0.p.b offline"
This reverts commit 9d29961b38.
2023-08-07 11:41:06 -05:00
Dustin 9d29961b38 hosts: Move serial0.p.b offline
It seems this machine has died and probably needs to be rebuilt.
2023-07-26 11:49:46 -05:00
Dustin 16d05fcfb4 hosts: Move nvr1.p.b offline
This machine is offline until I get the cameras installed at the new
house.
2023-07-26 11:48:38 -05:00
Dustin 7120e4ebf8 hosts: Decommission hass2.p.b
Home Assistant is now hosted in Kubernetes.
2023-07-24 11:33:12 -05:00
Dustin 4cdb5dee70 certs/samba: Add missing symlink for dc-ag62kz.p.b 2023-07-24 08:36:20 -05:00
Dustin 7a9c678ff3 burp-server: Keep more backups
New retention policy:

* 7 daily backups
* 4 weekly backups
* 12 ~monthly backups
* 5 ~yearly backups
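
The "~" on the last two tiers can be explained with a little
arithmetic.  This sketch assumes the policy is expressed as stacked
BURP `keep` options (7, 4, 12, 5) and that one backup runs per day;
under those assumptions each tier's spacing is the product of the
`keep` values before it.  The function name is made up for
illustration.

```python
def keep_intervals(keeps):
    """Spacing, in backups, between retained backups at each tier."""
    intervals = []
    step = 1
    for keep in keeps:
        intervals.append(step)
        step *= keep  # the next tier keeps every `step`-th backup
    return intervals


keeps = [7, 4, 12, 5]
for keep, interval in zip(keeps, keep_intervals(keeps)):
    print(f"keep {keep:>2} backups spaced {interval} day(s) apart")
```

With daily backups this gives spacings of 1, 7, 28, and 336 days, so
the "monthly" and "yearly" tiers are really 4-week and 48-week tiers.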
2023-07-17 16:36:37 -05:00
Dustin 06782b03bb vm-hosts: Update VM autostart list
* *dc2* has been gone for a long time, replaced by two new domain controllers
* *unifi0* was recently replaced by *unifi1*
2023-07-07 10:05:22 -05:00
Dustin 6a5d1437e8 hosts: add unifi1.p.b
*unifi1.pyrocufflink.blue* is a Fedora machine that hosts the Unifi
Network controller software.
2023-07-07 10:05:01 -05:00
Dustin 71a43ccf07 unifi: Deploy Unifi Network controller
Since Ubiquiti only publishes Debian packages for the Unifi Network
controller software, running it on Fedora has historically been nigh
impossible.  Fortunately, a modern solution is available: containers.
The *linuxserver.io* project publishes a container image for the
controller software, making it fairly easy to deploy on any host with an
OCI runtime.  I briefly considered creating my own image, since theirs
must be run as root, but I decided the maintenance burden would not be
worth it.  Using Podman's user namespace functionality, I was able to
work around this requirement anyway.
2023-07-07 10:05:01 -05:00
Dustin 61844e8a95 pyrocufflink: Add Luma SSH keys for root
Sometimes I need to connect to a machine when there is an AD issue (e.g.
domain controllers are down, clocks are out of sync, etc.) but I can't
do it from my desktop.
2023-07-05 16:35:57 -05:00
Dustin 9f221cf734 web/dustinandtabitha: Disable RSVP form
The spammers have found our wedding RSVP form.
2023-06-27 09:02:54 -05:00
Dustin 0a68d84121 metricspi: Scrape hatchlearningcenter.org
To monitor site availability and certificate expiration.
2023-06-21 14:31:33 -05:00
Dustin 4e608e379f metricspi/alerts: Correct BURP archive alert query
When the RAID array is being resynchronized after the archived disk has
been reconnected, md changes the disk status from "missing" to "spare."
Once the synchronization is complete, it changes from "spare" to
"active."  We only want to trigger the "disk needs archived" alert once
the synchronization process is complete; otherwise, both the "disks need
swapped" and "disk needs archived" alerts would be active at the same
time, which makes no sense.  By adjusting the query for the "disk needs
archived" alert to consider disks in both "missing" and "spare" status,
we can delay firing that alert until the proper time.
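
The intended behavior can be sketched in Python.  This models only the
decision described above, not the actual alert query; the function name
and the list-of-strings representation of disk states are assumptions.

```python
def archive_alert_should_fire(disk_states):
    """Fire only once no array member is still "missing" or "spare"."""
    pending = [s for s in disk_states if s in ("missing", "spare")]
    return bool(disk_states) and not pending


print(archive_alert_should_fire(["active", "missing"]))  # disk in the safe
print(archive_alert_should_fire(["active", "spare"]))    # resync running
print(archive_alert_should_fire(["active", "active"]))   # resync finished
```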
2023-06-20 11:58:35 -05:00
Dustin b05edbf7fb r/minio: Configure firewall
The firewall needs to allow inbound connections to the MinIO HTTP API
and web UI ports.
2023-06-08 10:07:32 -05:00
Dustin 4776303db2 k8s-node: Deploy NFS client
Longhorn's new RWX (read-write many) mode requires the NFS client
utilities to be installed on the host machine.
2023-06-08 10:06:02 -05:00
Dustin 679ea47bf7 r/homeassistant: Protect ~/.ssh
When the Home Assistant container restarts, Podman relabels the entire
`/var/lib/homeassistant` directory as `container_file_t`.  Since the
*homeassistant* user's home directory is `/var/lib/homeassistant`, its
`~/.ssh` directory is thus also relabeled, preventing the SSH daemon
from accessing it.  Since Home Assistant itself does not need access to
this path, we can tell systemd to mount an empty tmpfs filesystem there
in the service unit's mount namespace.  This way, when Podman relabels
the directory, it will change the label of the tmpfs mount point instead
of the actual directory.
2023-06-08 10:05:36 -05:00
Dustin bf4d57b5cb frigate: Configure journal2ntfy for MD RAID
The Frigate server has a RAID array that it uses to store video
recordings.  Since there have been a few occasions where the array has
suddenly stopped functioning, probably because of the cheap SATA
controller, it will be nice to get an alert as soon as the kernel
detects the problem, so as to minimize data loss.
2023-06-08 10:05:36 -05:00
Dustin 87e8ec2ed4 synapse: Back up data using BURP
Most of the Synapse server's state is in its SQLite database.  It also
has a `media_store` directory that needs to be backed up, though.

In order to back up the SQLite database while the server is running, the
database must be in "WAL mode."  By default, Synapse leaves the database
in the default "rollback journal mode," which prevents multiple
processes from accessing the database, even for read-only operations.
To change the journal mode:

```sh
sudo systemctl stop synapse
sudo -u synapse sqlite3 /var/lib/synapse/homeserver.db 'PRAGMA journal_mode=WAL;'
sudo systemctl start synapse
```
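
The same change can be made, and verified, from Python's standard
`sqlite3` module.  This is a minimal sketch rather than part of the
deployment: the helper names are made up, and it should only be run
against the real database while Synapse is stopped, as in the commands
above.

```python
import sqlite3


def journal_mode(db_path):
    """Report the journal mode currently recorded for the database."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("PRAGMA journal_mode;").fetchone()[0]
    finally:
        conn.close()


def enable_wal(db_path):
    """Switch the database to WAL mode and return the resulting mode."""
    conn = sqlite3.connect(db_path)
    try:
        # SQLite answers with the mode now in effect; "wal" means success.
        return conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
    finally:
        conn.close()
```

WAL mode is stored in the database file itself, so a later call to
`journal_mode()` from a new connection should still report `wal`.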
2023-05-23 09:52:50 -05:00
Dustin 74243080bb r/burp-client: Support pre/post-restore scripts
BURP can run scripts before and after restore.  This may be useful, for
example, to clean up files in a backup that may be in an inconsistent
state.
2023-05-23 09:52:50 -05:00
Dustin 66d0a9157f burp-client: Switch from cron to systemd timer
systemd timer units are supported on all relevant OS versions now.
There is no longer any reason to use cron.
2023-05-23 09:51:07 -05:00
Dustin cd1f7b354b ci: Add Jenkins pipeline for MinIO 2023-05-23 08:33:09 -05:00
Dustin d26de78b3d r/samba-dc: Rotate KDC log weekly
The Samba KDC log file seems to grow rather quickly sometimes, outpacing
the monthly rotation policy.  Let's rotate it weekly and keep 4
historical versions.
2023-05-23 08:31:58 -05:00
Dustin 78296f7198 Merge branch 'journal2ntfy' 2023-05-23 08:31:52 -05:00
Dustin 347cda74fd metrics: Scrape metrics from Kubernetes API server
Kubernetes exports a *lot* of metrics in Prometheus format.  I am not
sure what all is there, yet, but apparently several thousand time series
were added.

To allow anonymous access to the metrics, I added this ClusterRole:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
```
2023-05-22 21:21:08 -05:00
Dustin c0bb387b18 metricspi: Scrape metrics from MinIO backup storage
MinIO exposes metrics in Prometheus exposition format.  By default, it
requires an authentication token to access the metrics, but I was unable
to get this to work.  Fortunately, it can be configured to allow
anonymous access to the metrics, which is fine, in my opinion.
2023-05-22 21:19:25 -05:00
Dustin a7319c561d journal2ntfy: Script to send log messages via ntfy
The `journal2ntfy.py` script follows the systemd journal by spawning
`journalctl` as a child process and reading from its standard output
stream.  Any command-line arguments passed to `journal2ntfy` are passed
to `journalctl`, which allows the caller to specify message filters.
For any matching journal message, `journal2ntfy` sends a message via
the *ntfy* web service.

For the BURP server, we're going to use `journal2ntfy` to generate
alerts about the RAID array.  When I reconnect the disk that was in the
fireproof safe, the kernel will log a message from the *md* subsystem
indicating that the resynchronization process has begun.  Then, when
the disks are again in sync, it will log another message, which will
let me know it is safe to archive the other disk.
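
The approach described above can be sketched roughly as follows.  This
is not the actual `journal2ntfy.py`: the topic URL, the function names,
and the choice of `--output=json` are assumptions made for
illustration.

```python
import json
import subprocess
import urllib.request

# Hypothetical topic URL; the real script's configuration is not shown.
NTFY_URL = "https://ntfy.sh/example-topic"


def journalctl_command(extra_args):
    """Build the journalctl invocation, forwarding the caller's filters."""
    return ["journalctl", "--follow", "--lines=0", "--output=json", *extra_args]


def extract_message(line):
    """Return the MESSAGE field of one JSON journal record, or None."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    message = record.get("MESSAGE")
    return message if isinstance(message, str) else None


def notify(message, url=NTFY_URL):
    """Publish a message by POSTing its text to the ntfy topic URL."""
    request = urllib.request.Request(
        url, data=message.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


def main(argv):
    """Follow the journal and forward every matching message."""
    proc = subprocess.Popen(
        journalctl_command(argv), stdout=subprocess.PIPE, text=True
    )
    for line in proc.stdout:
        message = extract_message(line)
        if message:
            notify(message)
    return proc.wait()
```

Calling `main(["-k"])`, for example, would forward only kernel
messages, since `-k` is handed straight to `journalctl`.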
2023-05-17 14:51:21 -05:00
Dustin 2c002aa7c5 alerts: Add alert to archive BURP disk
This alert will fire once the MD RAID resynchronization process has
completed and both disks in the array are online.  It will clear when
one disk is disconnected and moved to the safe.
2023-05-16 08:33:13 -05:00
Dustin 877dcc3879 alerts: Add alerts for missed client backups
When BURP fails to even *start* a backup, it does not trigger a
notification at all.  As a result, I may not notice for a few days when
backups are not happening.  That was the case this week, when clients'
backups were failing immediately, because of a file permissions issue on
the server.  To hopefully avoid missing backups for too long in the
future, I've added two new alerts:

* The *no recent backups* alert fires if there have not been *any* BURP
  backups recently.  This may also fire, for example, if the BURP
  exporter is not working, or if there is something wrong with the BURP
  data volume.
* The *missed client backup* alert fires if an active BURP client (i.e.
  one that has had at least one backup in the past 90 days) has not been
  backed up in the last 24 hours.
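
The *missed client backup* rule can be expressed as a small Python
sketch.  It models only the logic stated above, not the real alert
query; the function name, the sample client names, and the dictionary
of last-backup timestamps are assumptions.

```python
from datetime import datetime, timedelta

ACTIVE_WINDOW = timedelta(days=90)  # a client is "active" within this window
MISSED_AFTER = timedelta(hours=24)  # a backup older than this is "missed"


def missed_clients(last_backup, now):
    """Return active clients whose latest backup is older than 24 hours."""
    missed = []
    for client, when in sorted(last_backup.items()):
        age = now - when
        # Clients with no backup in 90 days are treated as retired.
        if MISSED_AFTER < age <= ACTIVE_WINDOW:
            missed.append(client)
    return missed


now = datetime(2023, 5, 14, 12, 0)
backups = {
    "fresh": now - timedelta(hours=2),
    "stale": now - timedelta(days=3),
    "retired": now - timedelta(days=200),
}
print(missed_clients(backups, now))
```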
2023-05-14 11:48:36 -05:00
Dustin a2bcd5ccbb alerts: Adjust BURP RAID disk swap alert
Using a 30-day window for the `tlast_change_over_time` function
effectively "caps out" the value at 30 days.  Thus, the alert reminding
me to swap the BURP backup volume will never fire, since the value will
never be greater than the 30-day threshold.  Using a wider window
resolves that issue (though the query will still produce inaccurate
results beyond the window).
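
A simplified model of the capping effect, in Python.  It is not how
`tlast_change_over_time` is implemented; it only assumes that the
observed age of the last change can never exceed the lookbehind window.
The function names and the sample numbers are illustrative.

```python
def observed_age_days(true_age_days, window_days):
    """Age of the last change as seen through a lookbehind window."""
    return min(true_age_days, window_days)


def swap_alert_fires(true_age_days, window_days, threshold_days=30):
    """The alert needs the observed age to exceed the threshold."""
    return observed_age_days(true_age_days, window_days) > threshold_days


print(swap_alert_fires(45, window_days=30))  # capped at 30, never fires
print(swap_alert_fires(45, window_days=90))  # wider window, fires
```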
2023-05-14 11:38:00 -05:00
Dustin ad9fb6798e samba-dc: Omit tls cafile setting
The `tls cafile` setting in `smb.conf` is not necessary.  It is used for
verifying peer certificates for mutual TLS authentication, not to
specify the intermediate certificate authority chain like I thought.

The setting cannot simply be left out, though.  If it is not specified,
Samba will attempt to load a file from a built-in default path, which
will fail, causing the server to crash.  This is avoided by setting the
value to the empty string.
2023-05-10 08:28:49 -05:00
Dustin 5ebe10fb0b Merge branch 'minio' 2023-05-10 08:05:03 -05:00
Dustin a3ea838cac burp-server: Deploy MinIO
We're going to run MinIO on the BURP server to provide a backup target
for the [Postgres Operator][0]/[WAL-E][1].  Although the Postgres
Operator also supports backups via [WAL-G][2], which supports more
backup targets like SFTP, the operator does not support restoring from
those targets.  As such, the best way to get fully-featured backups for
the Postgres Operator, including environment cloning, etc., is to use
S3.  Since I absolutely do not want to store my backups "in the cloud,"
using MinIO seems a decent alternative.  Running it on the BURP server
allows the backups to be stored and rotated along with regular system
backups.

[0]: https://github.com/zalando/postgres-operator/
[1]: https://github.com/wal-e/wal-e
[2]: https://github.com/wal-g/wal-g
2023-05-09 21:55:25 -05:00
Dustin f54bc44a48 minio: Install and configure MinIO
[MinIO][0] is an S3-compatible object storage server.  It is designed to
provide storage for cloud-native applications for on-premises
deployments.

MinIO has not been packaged for Fedora (yet?).  As such, the best way to
deploy it is using its official container image.  Here, we are using
`podman-system-generator` (Quadlet) to generate a systemd service
unit to manage the container process.
2023-05-09 21:37:46 -05:00
Dustin 9722fed1b8 metricspi: Scrape dustinandtabitha.com 2023-05-09 21:30:11 -05:00