719be9a4e9 Deploy Radarr, Sonarr, Prowlarr on file0.p.b
I had originally intended to deploy Radarr, Sonarr, and Prowlarr on
Kubernetes.  Unfortunately, this turned out to be problematic, as I
would need a way to share the download directory between Radarr/Sonarr
and Aria2, and the media directory between Radarr/Sonarr and Jellyfin.
The only way I could fathom to do this would be to expose both
directories via NFS and mount that share into the pods.  I decided this
would be too much of a hassle for no real gain, at least not in the
short term.  Instead, it makes more sense to deploy the *arr suite on
the same server as Aria2 and Jellyfin, which is essentially what the
community expects.

The recommended images for deploying the applications in containers are
pretty crappy.  I didn't really want to mess with trying to get them
to work natively on Fedora, nor deal with installing them from
tarballs with Ansible, so I created my own Debian-based container images
for them and deployed those via Podman+Quadlet.  These images are
published to the _Packages_ organization in Gitea, which is not public
and requires authentication.  We can use the Kubernetes Secret to obtain
the authentication token used to pull the images.
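
As a rough illustration of the shape this takes, the deployment boils down
to a registry login plus one Quadlet unit per application.  Everything in
the sketch below is hypothetical (registry host, account, token variable,
volume paths); only the Podman+Quadlet approach and the private Gitea
registry come from this change.

```yaml
- name: Log in to the private Gitea container registry
  containers.podman.podman_login:
    registry: git.pyrocufflink.blue        # placeholder registry host
    username: package-pull                 # placeholder account
    password: '{{ gitea_pull_token }}'     # token read from the Kubernetes Secret

- name: Install the Radarr Quadlet container unit
  ansible.builtin.copy:
    dest: /etc/containers/systemd/radarr.container
    content: |
      [Container]
      Image=git.pyrocufflink.blue/packages/radarr:latest
      Volume=/var/lib/radarr:/config
      Volume=/srv/download:/srv/download
      Volume=/srv/media:/srv/media
      PublishPort=7878:7878
      AutoUpdate=registry

      [Install]
      WantedBy=multi-user.target
```

After a `systemctl daemon-reload`, Quadlet generates a _radarr.service_
unit that runs the container like any other service.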
2025-12-03 23:05:21 -06:00
23670338b3 sonarr: Deploy Sonarr in a Podman container
The `sonarr.yml` playbook and corresponding role deploy Sonarr, the
TV series library/download manager, in a Podman container.

Note that we're relocating the log files from the Sonarr AppData
directory to `/var/log/sonarr` so they can be picked up by Fluent Bit.
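
The same relocation applies to the Prowlarr and Radarr roles below.  A
minimal sketch of the matching Fluent Bit input, in its YAML configuration
format (the path glob and tag are assumptions, not copied from the role):

```yaml
pipeline:
  inputs:
    - name: tail
      path: /var/log/sonarr/*.txt   # assumed glob for Sonarr's rolling log files
      tag: sonarr
      read_from_head: false
```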
2025-12-03 23:00:54 -06:00
9223dbe820 prowlarr: Deploy Prowlarr in a Podman container
The `prowlarr.yml` playbook and corresponding role deploy Prowlarr, the
indexer manager for the *arr suite, in a Podman container.

Note that we're relocating the log files from the Prowlarr AppData
directory to `/var/log/prowlarr` so they can be picked up by Fluent Bit.
2025-12-03 23:00:54 -06:00
a41a3fa3d0 radarr: Deploy Radarr in a Podman container
The `radarr.yml` playbook and corresponding role deploy Radarr, the
movie library/download manager, in a Podman container.

Note that we're relocating the log files from the Radarr AppData
directory to `/var/log/radarr` so they can be picked up by Fluent Bit.
2025-12-03 23:00:54 -06:00
fd8cc42720 hosts: Move PiKVM to separate inventory
There's no reason for Jenkins to be messing with this machine.  It's too
different from the rest of the hosts it manages, so it's been quite
difficult getting it to work anyway.  Let's just move it to a separate
inventory file that we have to specify manually when we want to apply a
Playbook to it.
2025-12-02 08:52:22 -06:00
e9d2d21ec3 hosts: Add pikvm-nvr2.m.p.b
This is a Raspberry Pi 2 with HDMI-CSI adapter and Raspberry Pi Pico,
connected to _nvr2.pyrocufflink.blue_, as the latter does not have a
serial console.
2025-12-01 10:03:05 -06:00
cce485db54 pikvm: Add role/playbook for PiKVM
PiKVM comes with its own custom Arch Linux-based operating system.  We
want to be able to manage it with our configuration policy, especially
for setting up authentication, etc.  It won't really work with the
host-provisioner without some pretty significant changes to the base
playbooks, but we can control some bits directly.
2025-12-01 10:01:07 -06:00
0334b1b77a Merge branch 'fluent-bit' 2025-11-24 07:49:05 -06:00
04f62a1467 hosts: Remove nvr2 from AD domain
The NVMe drive in _nvr2.pyrocufflink.blue_ died, so I had to re-install
Fedora on a new drive.  This time around, it will not be a domain
member, as with the other new servers added recently.
2025-11-16 16:48:20 -06:00
a500e0ece4 hosts: Decommission dc-headphone.p.b
_dc-headphone.pyrocufflink.blue_ has been replaced by
_dc-backless.pyrocufflink.blue_.
2025-11-01 22:28:43 -05:00
7929176b4e create-dc: Update to use new provisioning process
Instead of running `virt-install` directly from the `create-dc.sh`
script, it now relies on `newvm.sh`.  This will ensure that VMs created
to be domain controllers will conform to the same expectations as all
other machines, such as using the libvirt domain metadata to build
dynamic inventory.

Similarly, the `create-dc.yml` playbook now imports the `host-setup.yml`
playbook, which covers the basic setup of a new machine.  Again, this
ensures that the same policy is applied to DCs as to other machines.

Finally, domain controller machines no longer use _winbind_ for
OS user accounts and authentication.  This never worked particularly
well on DCs anyway (mostly because of the way _winbind_ insists on
using domain-prefixed user accounts when it runs on a DC), and is now
worse with recent Fedora changes.  Instead, DCs now have local users who
authenticate via SSH certificates, the same as other current-generation
servers.
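
The resulting playbook is roughly shaped like this; only the playbook
names come from this message, while the group and role names are
hypothetical:

```yaml
# create-dc.yml (assumed shape)
- import_playbook: host-setup.yml   # shared base setup for all new machines

- hosts: samba_dc                   # hypothetical group name
  roles:
    - samba-dc                      # hypothetical role name
```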
2025-10-27 12:53:27 -05:00
2cba5eb2e4 fluent-bit: Make ntfy pipeline steps optional
Most hosts will not need to send any messages to ntfy.  Let's define the
ntfy pipeline stages only for the machines that need them.  There are
currently two use cases for ntfy:

* MD RAID status messages (from Chromie and nvr2)
* WAN Link status messages (from gw1)

Breaking up the pipeline into smaller pieces allows both of these use
cases to define their appropriate filters while still sharing the common
steps.  The other machines that have no use for these steps now omit
them entirely.
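
For example, the MD RAID hosts might carry something like the following in
their group variables.  The variable name, tag, and ntfy endpoint are all
hypothetical; only the idea of a per-host grep filter feeding an ntfy
output comes from this change.

```yaml
fluent_bit_extra_stages:
  filters:
    - name: grep
      match: 'journal.*'
      regex: MESSAGE mdadm
  outputs:
    - name: http
      match: 'journal.*'
      host: ntfy.pyrocufflink.blue   # hypothetical ntfy server
      port: 443
      uri: /alerts
      tls: 'on'
      format: json_lines
```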
2025-09-15 10:46:45 -05:00
57a5f83262 nextcloud: Run an SMTP relay locally
For some reason, Nextcloud seems to have trouble sending mail via the
network-wide relay.  It opens a connection, then just sits there and
never sends anything until it times out.  This happens probably 4 out of
5 times it attempts to send e-mail messages.

Running Postfix locally and directing Nextcloud to send mail through it
and then on to the network-wide relay seems to work much more reliably.
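
The Postfix side of this amounts to a loopback-only listener that forwards
everything to the upstream relay.  A minimal sketch, with the upstream
relay hostname being a guess:

```yaml
- name: Configure Postfix as a loopback-only relay
  ansible.builtin.command: >-
    postconf -e
    inet_interfaces=loopback-only
    relayhost=[smtp.pyrocufflink.blue]
```

Nextcloud's `mail_smtphost` setting then just points at `127.0.0.1`.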
2025-08-23 22:43:45 -05:00
b72676a1bb nextcloud: Fetch HTTPS cert from Kubernetes
Since Nextcloud uses the _pyrocufflink.net_ wildcard certificate, we can
load it directly from the Kubernetes Secret, rather than from the file
in the _certs_ submodule, just like Gitea et al.
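
One way the fetch can look from Ansible; the Secret name, namespace, and
destination paths are guesses:

```yaml
- name: Fetch the pyrocufflink.net wildcard certificate Secret
  kubernetes.core.k8s_info:
    kind: Secret
    namespace: default                      # placeholder namespace
    name: pyrocufflink-net-wildcard         # placeholder Secret name
  register: wildcard_secret
  delegate_to: localhost

- name: Install the HTTPS certificate and private key
  ansible.builtin.copy:
    content: '{{ wildcard_secret.resources[0].data[item.key] | b64decode }}'
    dest: '{{ item.dest }}'
    mode: '0600'
  loop:
    - { key: tls.crt, dest: /etc/pki/tls/certs/pyrocufflink.net.crt }
    - { key: tls.key, dest: /etc/pki/tls/private/pyrocufflink.net.key }
```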
2025-08-11 10:39:54 -05:00
8a93ef0fc1 hosts: Remove chromie.p.b from AD domain
Since it was updated to Fedora 42, Jenkins configuration management jobs
have been failing to apply policy to _chromie.pyrocufflink.blue_.  It
claims "jenkins is not in the sudoers file," apparently because
`winbind` keeps "forgetting" that _jenkins_ is a member of the _server
admins_ group, which is listed in the `sudoers` file.

I'm getting tired of messing with `winbind` and its barrage of bugs and
quirks.  There's no particular reason for _chromie_ to be an AD domain
member, so let's just remove it and manage its users statically.
2025-08-07 15:07:02 -05:00
e6ac6ae202 hosts: Decommission k8s-ctrl0
Just a few days before its third birthday 🎂

There are now three Kubernetes control plane nodes:

* _ctrl-2ed8d3.k8s.pyrocufflink.black_ (Raspberry Pi CM4)
* _ctrl-crave.k8s.pyrocufflink.black_ (virtual machine)
* _ctrl-sycamore.k8s.pyrocufflink.black_ (virtual machine)
2025-07-28 17:52:11 -05:00
e1c157ce87 raspberry-pi: Add collectd sensors, thermal plugins
All the Raspberry Pi machines should have the _sensors_ and _thermal_
plugins enabled so we can monitor their CPU etc. temperatures.
2025-07-28 17:50:39 -05:00
53c0107651 hosts: Add CM4 k8s cluster nodes
These three machines are Raspberry Pi CM4 nodes on the DeskPi Super 6c
cluster board.  The worker nodes have a 256 GB NVMe SSD attached.
2025-07-27 17:47:24 -05:00
c67e5f4e0c cm4-k8s-node: Add group
The Raspberry Pi CM4 nodes on the DeskPi Super 6c cluster board are
members of the _cm4-k8s-node_ group.  This group is a child of
_k8s-node_ which overrides the data volume configuration and node
labels.
2025-07-27 17:45:46 -05:00
0e6cc4882d Add k8s-test group
This group is used for temporary machines while testing Kubernetes node
deployment changes.
2025-07-22 16:21:49 -05:00
a5b47eb661 hosts: Add vm-hosts to collectd group
Now that the VM hosts are not members of the AD domain, they need to be
added to the _collectd_ group directly.
2025-07-18 12:47:55 -05:00
906819dd1c r/apache: Use variables for HTTPS cert/key content
Using files for certificates and private keys is less than ideal.
The only way to "share" a certificate between multiple hosts is with
symbolic links, which means the configuration policy has to be prepared
for each managed system.  As we're moving toward a much more dynamic
environment, this becomes problematic; the host-provisioner will never
be able to copy a certificate to a new host that was just created.
Further, I have never really liked the idea of storing certificates and
private keys in Git anyway, even if it is in a submodule with limited
access.
2025-07-13 16:02:57 -05:00
a399591f16 hosts: Decommission node-refrain.k.p.b
I did something stupid to this machine trying to clear up its
`/var/lib/containers/storage` volume and now it won't start any new
pods.  Killing it and replacing.
2025-06-21 17:51:06 -05:00
025f2ddd8c hosts: Remove VM hosts from AD domain
Having the VM hosts as members of the domain has been troublesome since
the very beginning.  In full shutdown events, it's often difficult or
impossible to log in to the VM hosts while the domain controller VMs are
down or still coming up, even with winbind caching.

Now that we have the `users.yml` playbook, the SSH certificate
authority, and `doas`+*pam_ssh_agent_auth*, we really don't need the AD
domain for centralized authentication.
2025-06-08 09:04:27 -05:00
d4d3f0ef81 r/victoria-logs: Deploy VictoriaLogs
I've become rather frustrated with Grafana Loki lately.  It has several
bugs that affect my usage, including issues with counting and
aggregation, completely broken retention and cleanup, spamming itself
with bogus error log messages, and more.  Now that VictoriaLogs has
first-class support in Grafana and support for alerts, it seems like a
good time to try it out.  It's under very active development, with bugs
getting fixed extremely quickly, and new features added constantly.
Indeed, as I was experimenting with it, I thought, "it would be nice if
the web UI could decode ANSI escapes for terminal colors," and just a
few days later, that feature was added!  Native support for syslog is
also a huge benefit, as it will allow me to collect logs directly from
network devices, without first collecting them into a file on the Unifi
controller.

This new role deploys VictoriaLogs in a manner very similar to how I
have Loki set up, as a systemd-managed Podman container.  As it has no
built-in authentication or authorization, we rely on Caddy to handle
that.  As with Loki, mTLS is used to prevent anonymous access when
querying the logs; however, authentication via Authelia is also an
option for human+browser usage.  I'm re-using the same certificate
authority as with Loki to simplify Grafana configuration.  Eventually, I
would like to have a more robust PKI, probably using OpenBao, at which
point I will (hopefully) have decided which log database I will be
using, and can use a proper CA for it.
2025-05-30 21:19:05 -05:00
6df0cc39da unifi: Back up with Restic
The Unifi Network data will now be backed up by Restic.
2025-03-29 09:36:37 -05:00
78d70af574 hosts: Add Unifi controllers to needproxy group
Since the network device management network does not have access to the
Internet, the Unifi controller machines must access it via the proxy.
2025-03-19 07:50:52 -05:00
db54b03aa8 r/unifi: Switching to custom container image
The _linuxserver.io_ image for UniFi Network is deprecated.  It sucked
anyway.  I've created a simple image based on Debian that installs the
_unifi_ package from the upstream apt repository.  This image doesn't
require running anything as _root_, so it doesn't need a user namespace.
2025-03-16 16:40:57 -05:00
c300dc1b6c chrony: Add role/PB for chrony
I continually struggle with machines' (physical and virtual, even the
Roku devices!) clocks getting out of sync.  I have been putting off
fixing this because I wanted to set up a Windows-compatible NTP server
(i.e. on the domain controllers, with Kerberos signing), but there's
really no reason to wait for that to fix the clocks on all the
non-Windows machines, especially since there are exactly 0 Windows
machines on the network right now.

The *chrony* role and corresponding `chrony.yml` playbook are generic,
configured via the `chrony_pools`, `chrony_servers`, and `chrony_allow`
variables.  The values for these variables will configure the firewall
to act as an NTP server, synchronizing with the NTP pool on the
Internet, while all other machines will synchronize with it.  This
allows machines on networks without Internet access to keep their clocks
in sync.
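
Concretely, the split looks something like this; the pool name, subnet,
and gateway hostname are placeholders:

```yaml
# group_vars for the firewall/gateway:
chrony_pools:
  - pool.ntp.org
chrony_allow:
  - 172.30.0.0/16              # placeholder internal networks

# group_vars for everything else:
chrony_servers:
  - gw1.pyrocufflink.blue      # placeholder gateway hostname
```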
2025-03-16 16:37:19 -05:00
5f4b1627db hosts: Add nut1.p.b to pyrocufflink group
*nut1.pyrocufflink.blue* is a member of the *pyrocufflink.blue* AD
domain.  I'm not sure how it got to be so without belonging to the
_pyrocufflink_ Ansible group...
2025-02-25 21:03:14 -06:00
f705e98fab hosts: Add k8s-iot-net-ctrl group
The *k8s-iot-net-ctrl* group is for the Raspberry Pi that has the Zigbee
and Z-Wave controllers connected to it.  This node runs the Zigbee2MQTT
and ZWaveJS2MQTT servers as Kubernetes pods.
2025-01-31 19:49:51 -06:00
b1c29fc12a hosts: Remove hostvds group
Since the _hostvds_ group is not defined in the static inventory but by
the OpenStack inventory plugin via `hostvds.openstack.yml`, when the
static inventory is used by itself, Ansible fails to load it with an
error:

> Section [vps:children] includes undefined group: hostvds

To fix this, we could explicitly define an empty _hostvds_ group in the
static inventory, but since we aren't currently running any HostVDS
instances, we might as well just get rid of it.
2025-01-31 19:45:58 -06:00
ec4fa25bd8 Merge remote-tracking branch 'refs/remotes/origin/master' 2025-01-30 21:15:40 -06:00
c00d6f49de hosts: Add OVH VPS
It turns out, $0.99/mo might be _too_ cheap for a cloud server.  Running
the Blackbox Exporter+vmagent on the HostVDS instance worked for a few
days, but then it started having frequent timeouts when probing the
websites.  I tried redeploying the instance, switching to a larger
instance, and moving it to different networks.  Unfortunately, none of
this seemed to help.

Switching over to a VPS running in OVH cloud.  OVH VPS servers are
managed statically, as opposed to via API, so we can't use Pulumi to
create them.  This one was created for me when I signed up for an OVH
account.
2025-01-26 13:08:59 -06:00
33f315334e users: Configure sudo on some machines
`doas` is not available on AlmaLinux, so we still have to use `sudo` on
the VPS.
2025-01-26 13:08:59 -06:00
ad0bd7d4a5 remote-blackbox: Add group
The _remote-blackbox_ group defines a system that runs
_blackbox-exporter_ and _vmagent_ in a remote (cloud) location.  This
system will monitor our public web sites.  This will give a better idea
of their availability from the perspective of a user on the Internet,
which can be affected by factors that are not necessarily visible from
within the network.
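
For reference, the usual shape of a vmagent scrape job that drives the
Blackbox Exporter; the probed URL and the exporter address are
placeholders:

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.example.org/     # placeholder public site
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115      # blackbox-exporter listen address
```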
2025-01-26 13:08:59 -06:00
f5bee79bac hosts: Decommission bw0.p.b
Vaultwarden is now hosted in Kubernetes.
2025-01-10 20:09:53 -06:00
d993d59bee Deploy new Kubernetes nodes
The *stor-* nodes are dedicated to Longhorn replicas.  The other nodes
handle general workloads.
2024-11-24 10:33:21 -06:00
0f600b9e6e kubernetes: Manage worker nodes
So far, I have been managing Kubernetes worker nodes with Fedora CoreOS
Ignition, but I have decided to move everything back to Fedora and
Ansible.  I like the idea of an immutable operating system, but the FCOS
implementation is not really what I want.  I like the automated updates,
but that can be accomplished with _dnf-automatic_.  I do _not_ like
giving up control of when to upgrade to the next Fedora release.
Mostly, I never did come up with a good way to manage application-level
configuration on FCOS machines.  None of my experiments (Cue+tmpl,
KCL+etcd+Luci) were successful, which mostly resulted in my manually
managing configuration on nodes individually.  Managing OS-level
configuration is also rather cumbersome, since it requires redeploying
the machine entirely.  Altogether, I just don't think FCOS fits with my
model of managing systems.

This commit introduces a new playbook, `kubernetes.yml`, and a handful of
new roles to manage Kubernetes worker nodes running Fedora Linux.  It
also adds two new deploy scripts, `k8s-worker.sh` and `k8s-longhorn.sh`,
which fully automate the process of bringing up worker nodes.
2024-11-24 10:33:21 -06:00
a82700a257 chromie: Configure serial terminal server 2024-11-10 13:15:08 -06:00
010f652060 hosts: Add loki1.p.b
_loki1.pyrocufflink.blue_ replaces _loki0.pyrocufflink.blue_.  The
former runs Fedora Linux and is managed by Ansible, while the latter ran
Fedora CoreOS and was managed by Ignition and _cfg_.
2024-11-05 06:54:27 -06:00
4cd983d5f4 loki: Add role+playbook for Grafana Loki
The current Grafana Loki server, *loki0.pyrocufflink.blue*, runs Fedora
CoreOS and is managed by Ignition and *cfg*.  Since I have declared
*cfg* a failed experiment, I'm going to re-deploy Loki on a new VM
running Fedora Linux and managed by Ansible.

The *loki* role installs Podman and defines a systemd-managed container
to run Grafana Loki.
2024-10-20 12:10:55 -05:00
ceaef3f816 hosts: Decommission burp1.p.b
Everything has finally been moved to Chromie.
2024-10-13 17:52:48 -05:00
5ced24f2be hosts: Decommission matrix0.p.b
The Synapse server hasn't been working for a while, but we don't use it
for anything any more anyway.
2024-10-13 12:53:49 -05:00
621f82c88d hosts: Migrate remaining hosts to Restic
Gitea and Vaultwarden both have SQLite databases.  We'll need to add
some logic to ensure these are in a consistent state before beginning
the backup.  Fortunately, neither of them are very busy databases, so
the likelihood of an issue is pretty low.  It's definitely more
important to get backups going again sooner, and we can deal with that
later.
2024-09-07 20:45:24 -05:00
c2c283c431 nextcloud: Back up Nextcloud with Restic
Now that the database is hosted externally, we don't have to worry about
backing it up specifically.  Restic only backs up the data on the
filesystem.
2024-09-04 17:41:42 -05:00
0f4dea9007 restic: Add role+playbook for Restic backups
The `restic.yml` playbook applies the _restic_ role to hosts in the
_restic_ group.  The _restic_ role installs `restic` and creates a
systemd timer and service unit to run `restic backup` every day.

Restic doesn't really have a configuration file; all its settings are
controlled either by environment variables or command-line options. Some
options, such as the list of files to include in or exclude from
backups, take paths to files containing the values.  We can make use of
these to provide some configurability via Ansible variables.  The
`restic_env` variable is a map of environment variables and values to
set for `restic`.  The `restic_include` and `restic_exclude` variables
are lists of paths/patterns to include and exclude, respectively.
Finally, the `restic_password` variable contains the password to decrypt
the repository contents.  The password is written to a file and exposed
to the _restic-backup.service_ unit using [systemd credentials][0].

When using S3 or a compatible service for repository storage, Restic of
course needs authentication credentials.  These can be set using the
`restic_aws_credentials` variable.  If this variable is defined, it
should be a map containing the `aws_access_key_id` and
`aws_secret_access_key` keys, which will be written to an AWS shared
credentials file.  This file is then exposed to the
_restic-backup.service_ unit using [systemd credentials][0].

[0]: https://systemd.io/CREDENTIALS/
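
Putting the variables together, a host's backup configuration might look
roughly like this; the repository URL, paths, and vaulted variable names
are placeholders:

```yaml
restic_env:
  RESTIC_REPOSITORY: s3:https://s3.example.net/backups/host0    # placeholder
restic_include:
  - /etc
  - /var/lib/gitea
restic_exclude:
  - '**/.cache'
restic_password: '{{ vault_restic_password }}'                  # placeholder vault variable
restic_aws_credentials:
  aws_access_key_id: '{{ vault_restic_access_key_id }}'
  aws_secret_access_key: '{{ vault_restic_secret_access_key }}'
```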
2024-09-04 09:40:29 -05:00
708bcbc87e Merge remote-tracking branch 'refs/remotes/origin/master' 2024-09-03 17:18:18 -05:00
a0378feda8 nextcloud: Move database to db0
Moving the Nextcloud database to the central PostgreSQL server will
allow it to take advantage of the monitoring and backups in place there.
For backups specifically, this will make it easier to switch from BURP
to Restic, since now only the contents of the filesystem need to be backed up.

The PostgreSQL server on _db0_ requires certificate authentication for
all clients.  The certificate for Nextcloud is stored in a Secret in
Kubernetes, so we need to use the _nextcloud-db-cert_ role to install
the script to fetch it.  Nextcloud configuration doesn't expose the
parameters for selecting the certificate and private key files, but
fortunately, they can be encoded in the value provided to the `host`
parameter, though it makes for a rather cumbersome value.
2024-09-02 21:03:33 -05:00
d3a09a2e88 hosts: Add chromie, nvr2 to nut-monitor group
Deploy `nut-monitor` on these physical machines so they will shut down
safely in the event of a power outage.
2024-09-01 18:52:33 -05:00