Commit Graph

1192 Commits

Author SHA1 Message Date
68b045d6d1 websites: Drop unnecessary cert for hatch.chat
The Synapse server has been gone for a long time.
2025-11-17 07:56:38 -06:00
c1944fc78a site: Remove frigate PB
The `frigate` playbook cannot be applied by the host provisioner for
several reasons.  First, it needs manual intervention in order to enroll
the MOK which is used to sign the `gasket-driver` kernel modules.
Further, it needs several encrypted values from Ansible Vault, which are
not available to the _host-provisioner_.
2025-11-16 16:49:15 -06:00
2d53fe6acd gw1/squid: Allow pxe.p.b via HTTPS
Now that Kickstart files are hosted on _pxe.pyrocufflink.blue_, we can
allow access to that entire (sub-)domain, enabling clients to fetch the
files over HTTPS.  Previously, this was not possible because in order to
allow access to Kickstart files but nothing else on Gitea, we had to
rely on full URL matching.
2025-11-16 16:49:15 -06:00
2aca0429eb useproxy: Add ntfy.p.b to NO_PROXY
Specifically for _fluent-bit_, which does not correctly handle wildcards
or subdomains in `NO_PROXY`, to send real-time notifications from logs
via ntfy.
2025-11-16 16:49:15 -06:00
04f62a1467 hosts: Remove nvr2 from AD domain
The NVMe drive in _nvr2.pyrocufflink.blue_ died, so I had to re-install
Fedora on a new drive.  This time around, it will not be a domain
member, as with the other new servers added recently.
2025-11-16 16:48:20 -06:00
60b7a20e1f frigate: Switch to pre-compiled gasket-driver RPM
The DKMS package for the _gasket-driver_ kernel modules is something of
a problem.  For one thing, upstream seems to have abandoned the driver
itself, and it now requires several patches in order to compile for
current kernel versions.  These patches are not included in the DKMS
package, and thus have to be applied manually after installing it.  More
generally, I don't really like how DKMS works anyway.  Besides requiring
a full kernel development toolchain on a production system, it's
impossible to know if a module will compile successfully until _after_
the new kernel has been installed and booted.  This has frequently meant
that Frigate won't come up after an update because building the module
failed.  I would much rather have a notification about a compatibility
issue for an _upcoming_ update, rather than an applied one.

To rectify these issues, I have created a new RPM package that contains
pre-built, signed kernel modules for the Coral EdgeTPU device.  Unlike
the DKMS package, this package needs to be rebuilt for every kernel
version; however, this is done by Jenkins before the updated kernel gets
installed on the machine.  It also expresses a dependency on an exact
kernel version, so the kernel cannot be updated until a corresponding
_gasket-driver_ package is available.
2025-11-16 16:30:51 -06:00
94a777fec8 r/collectd-sensors: Add missing handlers file 2025-11-16 16:30:51 -06:00
0df95c8378 Drop .certs submodule
Nothing uses these certificates anymore, and nothing manages/renews
them.  Everything has either been converted to ACME, or fetches the
_pyrocufflink.net_ wildcard certificate directly from the Kubernetes
Secret.
2025-11-16 16:28:49 -06:00
daa91e71a1 Merge remote-tracking branch 'refs/remotes/origin/master' 2025-11-16 16:24:04 -06:00
fce060bdec r/ssh-host-certs: Fix circular dep in reload.path
The `reload-ssh-cert.path` unit introduced a circular ordering
dependency with `sshd.service` by way of `paths.target`.  There's no
particular reason for this dependency here, so we need to remove it to
resolve the issue.
2025-11-13 18:40:52 -06:00
44c3dba46a r/gitea: Update to v1.24.7 2025-11-12 17:48:09 -06:00
4b91e088ea r/apache: Reduce amount of logs stored
There's really no reason to keep four 256 MiB log files, especially access
logs.  In any case, most of the web servers only have 1 GiB log volume,
so this configuration tends to fill them up.
2025-11-09 13:23:02 -06:00
28ecc2974c fluent-bit: Remove Promtail 2025-11-06 09:44:22 -06:00
a500e0ece4 hosts: Decommission dc-headphone.p.b
_dc-headphone.pyrocufflink.blue_ has been replaced by
_dc-backless.pyrocufflink.blue_.
2025-11-01 22:28:43 -05:00
5af25bcccf r/dch-yum: Trust GPG key
We need to explicitly add the GPG signing key for the _dch_ repository
to the system trust store; otherwise, _dnf-automatic_ will fail, as it
cannot implicitly add new keys during an update.
2025-10-27 12:54:07 -05:00
1804bc06f0 domain-controller: Remove vault secrets
The secret values stored in this vault file were never actually used.
They weren't even correct.
2025-10-27 12:54:07 -05:00
7929176b4e create-dc: Update to use new provisioning process
Instead of running `virt-install` directly from the `create-dc.sh`
script, it now relies on `newvm.sh`.  This will ensure that VMs created
to be domain controllers will conform to the same expectations as all
other machines, such as using the libvirt domain metadata to build
dynamic inventory.

Similarly, the `create-dc.yml` playbook now imports the `host-setup.yml`
playbook, which covers the basic setup of a new machine.  Again, this
ensures that the same policy is applied to DCs as to other machines.

Finally, domain controller machines no longer use _winbind_ for
OS user accounts and authentication.  This never worked particularly
well on DCs anyway (particularly because of the way _winbind_ insists on
using domain-prefixed user accounts when it runs on a DC), and is now
worse with recent Fedora changes.  Instead, DCs now have local users who
authenticate via SSH certificates, the same as other current-generation
servers.
2025-10-27 12:53:27 -05:00
3f761eacb4 newvm: Add support for specifying static IP config
Although rare, there are scenarios where we may want to deploy a new
virtual machine with a static, manually-configured IP address.
Anaconda/Dracut support this via the `ip=` kernel command-line argument.
To simplify populating that argument, the `newvm` script now takes
additional command-line arguments for IP address (in CIDR notation),
default gateway, and name server address(es) and creates the appropriate
string from these discrete values.
2025-10-24 11:17:11 -05:00
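
The argument assembly described in 3f761eacb4 could look roughly like the following. This is a minimal sketch in Python, not the actual `newvm` shell code: the function name and its parameters are hypothetical, it handles IPv4 only, and the field order (client IP, peer, gateway, netmask, hostname, interface, autoconf method) follows the `ip=` syntax as documented in dracut.cmdline(7). Name servers are passed as separate `nameserver=` arguments.

```python
import ipaddress


def build_ip_args(address_cidr, gateway, nameservers=(), hostname="", interface=""):
    """Build dracut-style static network arguments from discrete values.

    Hypothetical helper: the real newvm script may assemble this differently.
    """
    # ip_interface() accepts "address/prefix" and exposes both parts.
    iface = ipaddress.ip_interface(address_cidr)
    fields = [
        str(iface.ip),       # client IP
        "",                  # peer (unused)
        gateway,             # default gateway
        str(iface.netmask),  # netmask derived from the CIDR prefix
        hostname,            # optional client hostname
        interface,           # optional interface name
        "none",              # static configuration, no autoconf
    ]
    args = ["ip=" + ":".join(fields)]
    args.extend(f"nameserver={ns}" for ns in nameservers)
    return " ".join(args)


# Example with addresses from the 192.0.2.0/24 documentation range.
print(build_ip_args("192.0.2.10/28", "192.0.2.1", ["192.0.2.2"]))
```

Deriving the netmask from the prefix keeps the caller's interface to a single CIDR value, which matches the commit's goal of populating `ip=` from a few discrete arguments.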
3bed59055c users: Do not apply sudo role on Samba DCs
Users, auth, etc. for domain controllers will be handled by the
`create-dc.yml` playbook.  I haven't decided exactly how this playbook
will get applied, but I want to make sure the host provisioner is able to
successfully provision machines in the _samba-dc_ group nonetheless.
2025-10-22 21:13:03 -05:00
7308b45047 fluent-bit: Enable EPEL repo if needed
The _fluent-bit_ package is provided by EPEL for Red Hat/CentOS/AlmaLinux.
2025-10-19 09:28:47 -05:00
0b914d617e ci: Optionally allow installing packages
Usually, we do not want the continuous enforcement jobs installing or
upgrading software packages.  Sometimes, though, we may want to use a
Jenkins job to roll out something new, so this new `ALLOW_INSTALL`
parameter will control whether or not Ansible tasks tagged with
`install` are skipped.
2025-10-19 09:04:27 -05:00
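
One way the `ALLOW_INSTALL` switch from 0b914d617e might translate into an Ansible invocation is sketched below. The function name and the playbook file name are hypothetical, and the actual Jenkins pipeline is not shown in this log; the only real interface relied on is the `--skip-tags` option of `ansible-playbook`.

```python
def ansible_command(playbook, allow_install=False):
    """Return an ansible-playbook argument list (hypothetical helper).

    Tasks tagged `install` are skipped unless allow_install is set.
    """
    args = ["ansible-playbook", playbook]
    if not allow_install:
        # Default for continuous enforcement: never install or upgrade packages.
        args += ["--skip-tags", "install"]
    return args


print(ansible_command("fluent-bit.yml"))
print(ansible_command("fluent-bit.yml", allow_install=True))
```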
ea1253c9b8 ci: Remove remount RO/RW stages
None of the extant servers have read-only root filesystems any more, so
these stages are no longer necessary.
2025-10-19 08:57:19 -05:00
bcfe7cc699 ci: Add pipeline for fluent-bit Playbook 2025-10-17 07:53:10 -05:00
dc8961de92 fluent-bit: Do not apply to K8s nodes
We'll manage Fluent-Bit on Kubernetes nodes as a DaemonSet.  This will
be necessary in order to grant it access to the Kubernetes API so it can
augment log records with Kubernetes metadata (labels, pod name, etc.).
2025-10-17 07:51:32 -05:00
96ac5be3b5 r/kubelet: Schedule automatic image prune
As pods move around between nodes, applications are updated, etc., nodes
tend to accumulate images in their container stores that are no longer
used.  These take up space unnecessarily, eventually triggering disk
usage alarms.  From now on, the _kubelet_ role installs a systemd timer and
service unit to periodically clean up these unused images.
2025-10-13 09:54:20 -05:00
142682ce2f r/ssh-host-certs: Fix restart handler
The _ssh-host-certs.target_ unit does not exist any more.  It was
provided by the _sshca-cli-systemd_ package to allow machines to
automatically request their SSH host certificates on first boot.  It had
a `ConditionFirstBoot=` requirement, which made it not work at any other
time, so there was no reason to move it into the Ansible configuration
policy.  Instead, we can use the _ssh-host-certs-renew.target_ unit to
trigger requesting or renewing host certificates.
2025-09-17 06:40:20 -05:00
8a7faac35b r/ssh-host-certs: Reload sshd after renewing certs
In Fedora 41, it seems the SSH daemon no longer automatically uses the
new certificate after its host certificates have been renewed.  To get
it to pick up the new ones, we have to explicitly tell it to reload.  To
handle that automatically, I've added a new systemd path unit that
monitors the certificate files.  When it detects that one of them has
changed, it will send the signal to the SSH daemon to tell it to reload.
2025-09-14 15:08:41 -05:00
37e6622351 r/ssh-host-certs: Import systemd unit files
The _sshca-cli_ package no longer provides a _-systemd_ sub-package
containing the systemd unit files for automatically requesting and
renewing SSH host certificates.  Its original intent was to support
automatically signing certificates on first boot by having the unit
files installed by Anaconda, but this never really worked for various
reasons.  Since I'd rather not have to rebuild the RPMs every time I
need to make a change to the systemd units, and Ansible is required to
actually get the certificates issued anyway, it makes more sense to have
the unit files in the configuration policy instead.
2025-09-14 15:08:41 -05:00
8e8c109bf6 websites/pyrocufflink: Switch to mod_md for cert
The _pyrocufflink.net_ site now obtains its certificate from Let's
Encrypt using the Apache _mod_md_ (managed domain) module.  This
dramatically simplifies the deployment of this certificate, eliminating
the need for _cert-manager_ to obtain it, _cert-exporter_ to add it to
_certs.git_, and Jenkins to push it out to the web server.
2025-09-04 10:04:37 -05:00
29cdafac2a scripts/shutdown-vmhost: Skip Longhorn nodes
We **DO NOT** want to shut down the Longhorn Kubernetes nodes!  Doing so
would pretty much nuke everything running in the cluster.  The shutdown
script will need to migrate them online; fortunately, since they don't
run anything except Longhorn, they should be able to migrate fine.
2025-08-29 21:38:12 -05:00
c11a792eb8 websites/hlc: Drop formsubmit config tasks
_formsubmit_ has been running in Kubernetes for some time now.
2025-08-25 09:00:20 -05:00
524ac0931a websites/hlc: Switch to mod_md for cert management
To avoid having separate certificates for the canonical
_www.hatchlearningcenter.org_ site and all the redirects, we'll combine
these virtual hosts into one.  We can use a `RewriteCond` to avoid the
redirect for the canonical name itself.
2025-08-25 09:00:20 -05:00
fb93598586 dch-proxy: Use PROXY protocol v1 for Nextcloud
Apache doesn't fully support the PROXY v2 protocol.  When it's enabled,
it spams its error log with messages about unsupported features, e.g.:

> [remoteip:error] [pid 1257:tid 1302] [client 172.30.0.6:45614]
> AH03507: RemoteIPProxyProtocol: unsupported command 20
2025-08-23 22:52:08 -05:00
57a5f83262 nextcloud: Run an SMTP relay locally
For some reason, Nextcloud seems to have trouble sending mail via the
network-wide relay.  It opens a connection, then just sits there and
never sends anything until it times out.  This happens probably 4 out of
5 times it attempts to send e-mail messages.

Running Postfix locally and directing Nextcloud to send mail through it
and then on to the network-wide relay seems to work much more reliably.
2025-08-23 22:43:45 -05:00
1a3f68e18b Merge remote-tracking branch 'refs/remotes/origin/master' 2025-08-23 22:43:00 -05:00
1c1bff3ec0 r/nextcloud: Fix a bunch of deployment warnings
The Nextcloud administration overview page listed a bunch of deployment
configuration warnings that needed to be addressed:

* Set the default phone region
* Define a maintenance window starting at 0600 UTC
* Increase the PHP memory limit to 1GiB
* Increase the PHP OPCache interned strings buffer size
* Increase the allowed PHP OPcache memory limit
* Fix Apache rewrite rules for /.well-known paths
2025-08-23 22:39:44 -05:00
6cd576dd2b dch-proxy: Proxy for Authelia
Authelia is now exposed to the public Internet, under the name
_auth.pyrocufflink.net_, which allows it to protect public websites as
well.
2025-08-23 22:29:28 -05:00
70909d1b13 websites: Enable PROXY protocol for HTTPS sites
Since the reverse proxy does TLS pass-through instead of termination,
the original source address is lost.  Since the source address is
important for logging, rate limiting, and access control, we need to use
the HAProxy PROXY protocol to pass it along to the web server.

Since the PROXY protocol works at the TCP layer, _all_ connections must
use it. Fortunately, all of the sites hosted by the public web server
are in fact public and only accessed through HAProxy.  Similarly,
enabling it for one named virtual host enables it for all virtual hosts
on that port.  Thus, we only have to explicitly set it for one site, and
all the rest will use it as well.
2025-08-23 22:21:54 -05:00
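
To illustrate what 70909d1b13 relies on: with the PROXY protocol, the proxy sends one extra line at the start of each TCP connection carrying the original client address, before any TLS bytes. The sketch below parses the human-readable version 1 header as described in the HAProxy PROXY protocol specification. It is an illustration only, not how Apache implements it, and the function name is hypothetical.

```python
def parse_proxy_v1(line):
    """Parse a PROXY protocol v1 header line (bytes, including CRLF).

    Returns (src_addr, src_port, dst_addr, dst_port), or None for UNKNOWN.
    Hypothetical helper for illustration.
    """
    if not line.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    parts = line[:-2].decode("ascii").split(" ")
    if parts[0] != "PROXY" or len(parts) < 2:
        raise ValueError("not a PROXY v1 header")
    if parts[1] == "UNKNOWN":
        # The proxy could not or would not report the original addresses.
        return None
    if parts[1] not in ("TCP4", "TCP6") or len(parts) != 6:
        raise ValueError("malformed PROXY v1 header")
    _, _, src, dst, sport, dport = parts
    return src, int(sport), dst, int(dport)


# The web server would log 203.0.113.7 instead of the proxy's own address.
print(parse_proxy_v1(b"PROXY TCP4 203.0.113.7 192.0.2.10 56324 443\r\n"))
```

Because the header precedes everything else on the connection, a listener that expects it cannot also accept plain connections on the same port, which is why enabling it for one virtual host affects all of them.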
717a8f90c6 websites: Remove formsubmit
Nothing is using _formsubmit_ right now, but it's been moved to
Kubernetes anyway.
2025-08-23 20:44:41 -05:00
7fc3465d56 smtp1: Fix mynetworks setting for k8s network
The "Kubernetes" subnet is a /27, not a /28.  There are hosts in the
upper half of that range, which the /28 masked out, and these were unable
to send e-mails via the relay because they were excluded from the
`mynetworks` value.
2025-08-20 07:11:27 -05:00
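
The effect fixed in 7fc3465d56 can be checked with Python's `ipaddress` module. The addresses below come from the 192.0.2.0/24 documentation range and are placeholders, since the real subnet is not given in the commit message; `can_relay` is a hypothetical stand-in for the membership test Postfix performs against `mynetworks`.

```python
import ipaddress


def can_relay(client, mynetworks):
    """Return True if the client address falls inside any listed network."""
    addr = ipaddress.ip_address(client)
    return any(addr in ipaddress.ip_network(net) for net in mynetworks)


# A /28 covers 16 addresses, a /27 covers 32, so the upper 16 are cut off.
print(ipaddress.ip_network("192.0.2.0/28").num_addresses)
print(ipaddress.ip_network("192.0.2.0/27").num_addresses)

# A host in the upper half is rejected with the /28 and accepted with the /27.
print(can_relay("192.0.2.20", ["192.0.2.0/28"]))
print(can_relay("192.0.2.20", ["192.0.2.0/27"]))
```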
5dbe26fc60 r/repohost: Optimize createrepo queue loop
Instead of waking every 30 seconds, the queue loop in
`repohost-createrepo.sh` now only wakes when it receives an inotify
event indicating the queue file has been modified.  To avoid missing
events that occurred while a `createrepo` process was running, there's
now an inner loop that runs until the queue is completely empty, before
returning to blocking on `inotifywait`.
2025-08-20 07:11:27 -05:00
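
The loop structure described in 5dbe26fc60 can be sketched as follows. This is a Python approximation rather than the actual `repohost-createrepo.sh`: all names are hypothetical, `wait_for_change` stands in for blocking on `inotifywait`, and claiming the queue by renaming it is an assumption of this sketch, not something the commit message states.

```python
import os
from pathlib import Path


def take_batch(queue):
    """Claim the current queue contents; return [] if nothing is queued."""
    work = queue.with_name(queue.name + ".work")
    try:
        # Rename first so entries appended afterwards land in a fresh file.
        os.replace(queue, work)
    except FileNotFoundError:
        return []
    entries = [line.strip() for line in work.read_text().splitlines() if line.strip()]
    work.unlink()
    return entries


def drain(queue, process):
    """Inner loop: keep processing until the queue is completely empty."""
    handled = 0
    while True:
        batch = take_batch(queue)
        if not batch:
            return handled
        # Several requests for the same repository need only one run.
        for entry in dict.fromkeys(batch):
            process(entry)
            handled += 1


def run_forever(queue, process, wait_for_change):
    """Outer loop: drain, then block until the queue changes again."""
    while True:
        drain(queue, process)
        wait_for_change(queue)
```

The inner loop is what covers entries queued while `process` was busy: they are picked up on the next pass of `drain` instead of waiting for a future notification.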
2d51e2001d gw1: Allow internal IPv6 clients
Specifically to allow the Synology to synchronize its clock, as it only
has an IPv6 address.

We also need to explicitly override `chrony_servers` to an empty list
for the firewall itself, since it syncs with the NTP pool, rather than
its next hop router.
2025-08-17 20:52:36 -05:00
f8d58ef0ed websites/dcow: Transition to static site
We don't really use this site for screenshot sharing any more.  It's
still nice to be able to look at old screenshots, so I've saved a static
snapshot of it that can be hosted by plain ol' Apache.
2025-08-16 08:55:28 -05:00
b72676a1bb nextcloud: Fetch HTTPS cert from Kubernetes
Since Nextcloud uses the _pyrocufflink.net_ wildcard certificate, we can
load it directly from the Kubernetes Secret, rather than from the file
in the _certs_ submodule, just like Gitea et al.
2025-08-11 10:39:54 -05:00
f5ab739c9e websites: dustinandtabitha: Switch to mod_md for cert
The _dustinandtabitha.com_ site now obtains its certificate from Let's
Encrypt using the Apache _mod_md_ (managed domain) module.  This
dramatically simplifies the deployment of this certificate, eliminating
the need for _cert-manager_ to obtain it, _cert-exporter_ to add it to
_certs.git_, and Jenkins to push it out to the web server.
2025-08-11 10:34:30 -05:00
33da25209d r/lego: Fix timer unit trigger
`OnActiveSec` only fires once.  To trigger the renew periodically, we
need to use `OnCalendar`.
2025-08-10 17:45:46 -05:00
713fd794a3 remote-blackbox: Scrape HTTPS for some sites
Now that the Blackbox exporter does not follow redirects, we need to
explicitly tell it to scrape the HTTPS variant of sites that have it
enabled.  Otherwise, we only get info about the first HTTP-to-HTTPS
redirect response, which is not helpful for watching certificate expiry.
2025-08-08 11:09:28 -05:00
8a93ef0fc1 hosts: Remove chromie.p.b from AD domain
Since it was updated to Fedora 42, Jenkins configuration management jobs
have been failing to apply policy to _chromie.pyrocufflink.blue_.  It
claims "jenkins is not in the sudoers file," apparently because
`winbind` keeps "forgetting" that _jenkins_ is a member of the _server
admins_ group, which is listed in the `sudoers` file.

I'm getting tired of messing with `winbind` and its barrage of bugs and
quirks.  There's no particular reason for _chromie_ to be an AD domain
member, so let's just remove it and manage its users statically.
2025-08-07 15:07:02 -05:00
423f28ea53 remote-blackbox: Do not follow HTTP redirects
There are a couple of websites we scrape that simply redirect to another
name (e.g. _pyrocufflink.net_ → _dustin.hatch.name_, _tabitha.biz_ →
_hatchlearningcenter.org_).  For these, we want to track the
availability of the first step, not the last, especially with regard to
their certificate lifetimes.
2025-08-07 11:55:31 -05:00
0e15c6a635 needproxy: Add logs.p.b to NO_PROXY
`fluent-bit` has a bug ([#3619], [#3907], [#6759]) in its handling of
the `NO_PROXY` environment variable.  Instead of matching a domain and
all its subdomains, like it claims to do in its [documentation][0], it
only does an exact string match on the full host name.  To work around
this, we need to explicitly list `logs.pyrocufflink.blue` in the
`no_proxy` value; this will not have any impact on other consumers of
this variable, but will make `fluent-bit` work as expected, connecting
directly to Victoria Logs instead of through the proxy.

[0]: https://docs.fluentbit.io/manual/administration/http-proxy#no_proxy
[#3619]: https://github.com/fluent/fluent-bit/issues/3619
[#3907]: https://github.com/fluent/fluent-bit/issues/3907
[#6759]: https://github.com/fluent/fluent-bit/issues/6759
2025-08-06 10:46:03 -05:00
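
The difference described in 0e15c6a635 can be illustrated by comparing the two matching strategies side by side. This is a sketch of the behaviors as the commit message describes them, not code taken from `fluent-bit` or any other client; both function names are hypothetical.

```python
def exact_match(host, no_proxy):
    """Bypass the proxy only when the full host name is listed verbatim."""
    entries = [e.strip().lower() for e in no_proxy.split(",") if e.strip()]
    return host.lower() in entries


def suffix_match(host, no_proxy):
    """Bypass the proxy for a listed domain and all of its subdomains."""
    host = host.lower()
    for entry in no_proxy.split(","):
        # Treat "example.com", ".example.com" and "*.example.com" alike.
        entry = entry.strip().lower().lstrip("*.")
        if entry and (host == entry or host.endswith("." + entry)):
            return True
    return False


no_proxy = "localhost,pyrocufflink.blue"
host = "logs.pyrocufflink.blue"

# Suffix matching covers the subdomain; exact matching does not.
print(suffix_match(host, no_proxy))
print(exact_match(host, no_proxy))

# Listing the full host name satisfies both strategies.
print(exact_match(host, no_proxy + ",logs.pyrocufflink.blue"))
```

Adding the fully qualified name is harmless for consumers that already match on suffixes, since the extra entry is redundant for them.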