1180 Commits

Author SHA1 Message Date
4b91e088ea r/apache: Reduce amount of logs stored
There's really no reason to keep four 256 MiB log files, especially access
logs.  In any case, most of the web servers only have 1 GiB log volume,
so this configuration tends to fill them up.
2025-11-09 13:23:02 -06:00
28ecc2974c fluent-bit: Remove Promtail 2025-11-06 09:44:22 -06:00
a500e0ece4 hosts: Decommission dc-headphone.p.b
_dc-headphone.pyrocufflink.blue_ has been replaced by
_dc-backless.pyrocufflink.blue_.
2025-11-01 22:28:43 -05:00
1804bc06f0 domain-controller: Remove vault secrets
The secret values stored in this vault file were never actually used.
They weren't even correct.
2025-10-27 12:54:07 -05:00
7929176b4e create-dc: Update to use new provisioning process
Instead of running `virt-install` directly from the `create-dc.sh`
script, it now relies on `newvm.sh`.  This will ensure that VMs created
to be domain controllers will conform to the same expectations as all
other machines, such as using the libvirt domain metadata to build
dynamic inventory.

Similarly, the `create-dc.yml` playbook now imports the `host-setup.yml`
playbook, which covers the basic setup of a new machine.  Again, this
ensures that the same policy is applied to DCs as to other machines.

Finally, domain controller machines now no longer use _winbind_ for
OS user accounts and authentication.  This never worked particularly
well on DCs anyway (particularly because of the way _winbind_ insists on
using domain-prefixed user accounts when it runs on a DC), and is now
worse with recent Fedora changes.  Instead, DCs now have local users who
authenticate via SSH certificates, the same as other current-generation
servers.
2025-10-27 12:53:27 -05:00
3f761eacb4 newvm: Add support for specifying static IP config
Although rare, there are scenarios where we may want to deploy a new
virtual machine with a static, manually-configured IP address.
Anaconda/Dracut support this via the `ip=` kernel command-line argument.
To simplify populating that argument, the `newvm` script now takes
additional command-line arguments for IP address (in CIDR notation),
default gateway, and name server address(es) and creates the appropriate
string from these discrete values.
2025-10-24 11:17:11 -05:00
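
As a rough illustration of the string assembly described in this commit, here is a minimal Python sketch. It assumes the Dracut field order `client-IP:[peer]:gateway:netmask:hostname:interface:autoconf` and separate `nameserver=` arguments, and it handles IPv4 only. The function name and its parameters are hypothetical; the actual `newvm` script may build the argument differently.

```python
import ipaddress


def build_ip_args(address, gateway, nameservers=(), hostname="", interface=""):
    """Build Dracut-style ip=/nameserver= arguments from discrete values.

    `address` is an IPv4 address in CIDR notation, e.g. "192.0.2.10/24".
    All names here are illustrative, not taken from the newvm script.
    """
    iface = ipaddress.ip_interface(address)
    if iface.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    gw = ipaddress.ip_address(gateway)
    # Field order: client-IP:[peer]:gateway:netmask:hostname:interface:autoconf
    fields = [
        str(iface.ip),
        "",  # peer address, unused
        str(gw),
        str(iface.netmask),
        hostname,
        interface,
        "none",  # static configuration, no autoconfiguration
    ]
    args = ["ip=" + ":".join(fields)]
    # One nameserver= argument per validated name server address
    args += [f"nameserver={ipaddress.ip_address(ns)}" for ns in nameservers]
    return " ".join(args)
```

Validating each value with the `ipaddress` module before joining means a malformed address fails early rather than producing a kernel command line that silently misconfigures the installer.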
3bed59055c users: Do not apply sudo role on Samba DCs
Users, auth, etc. for domain controllers will be handled by the
`create-dc.yml` playbook.  I haven't decided exactly how this playbook
will get applied, but I want to make sure the host provisioner is able to
successfully provision machines in the _samba-dc_ group nonetheless.
2025-10-22 21:13:03 -05:00
7308b45047 fluent-bit: Enable EPEL repo if needed
The _fluent-bit_ package is provided by EPEL for Red Hat/CentOS/AlmaLinux.
2025-10-19 09:28:47 -05:00
0b914d617e ci: Optionally allow installing packages
Usually, we do not want the continuous enforcement jobs installing or
upgrading software packages.  Sometimes, though, we may want to use a
Jenkins job to roll out something new, so this new `ALLOW_INSTALL`
parameter will control whether or not Ansible tasks tagged with
`install` are skipped.
2025-10-19 09:04:27 -05:00
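
The effect of the parameter can be sketched as follows. This is only an assumption about how a job might translate `ALLOW_INSTALL` into an Ansible invocation; the helper name is hypothetical and the real Jenkins pipeline is not shown in this log. The `--skip-tags` option itself is a standard `ansible-playbook` flag.

```python
def ansible_command(playbook, allow_install=False):
    """Return the ansible-playbook argument list for a CI run.

    Tasks tagged `install` are skipped unless allow_install is true.
    (Hypothetical helper; the actual pipeline may differ.)
    """
    cmd = ["ansible-playbook", playbook]
    if not allow_install:
        # Default: continuous enforcement must not install or upgrade packages
        cmd += ["--skip-tags", "install"]
    return cmd
```

Making "skip" the default keeps routine enforcement runs safe; installing packages requires an explicit opt-in.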
ea1253c9b8 ci: Remove remount RO/RW stages
None of the extant servers have read-only root filesystems any more, so
these stages are no longer necessary.
2025-10-19 08:57:19 -05:00
bcfe7cc699 ci: Add pipeline for fluent-bit Playbook 2025-10-17 07:53:10 -05:00
dc8961de92 fluent-bit: Do not apply to K8s nodes
We'll manage Fluent-Bit on Kubernetes nodes as a DaemonSet.  This will
be necessary in order to grant it access to the Kubernetes API so it can
augment log records with Kubernetes metadata (labels, pod name, etc.).
2025-10-17 07:51:32 -05:00
96ac5be3b5 r/kubelet: Schedule automatic image prune
As pods move around between nodes, applications are updated, etc., nodes
tend to accumulate images in their container stores that are no longer
used.  These take up space unnecessarily, eventually triggering disk
usage alarms.  From now on, the _kubelet_ role installs a systemd timer and
service unit to periodically clean up these unused images.
2025-10-13 09:54:20 -05:00
142682ce2f r/ssh-host-certs: Fix restart handler
The _ssh-host-certs.target_ unit does not exist any more.  It was
provided by the _sshca-cli-systemd_ package to allow machines to
automatically request their SSH host certificates on first boot.  It had
a `ConditionFirstBoot=` requirement, which made it not work at any other
time, so there was no reason to move it into the Ansible configuration
policy.  Instead, we can use the _ssh-host-certs-renew.target_ unit to
trigger requesting or renewing host certificates.
2025-09-17 06:40:20 -05:00
8a7faac35b r/ssh-host-certs: Reload sshd after renewing certs
In Fedora 41, it seems the SSH daemon no longer automatically uses the
new certificate after its host certificates have been renewed.  To get
it to pick up the new ones, we have to explicitly tell it to reload.  To
handle that automatically, I've added a new systemd path unit that
monitors the certificate files.  When it detects that one of them has
changed, it will send the signal to the SSH daemon to tell it to reload.
2025-09-14 15:08:41 -05:00
37e6622351 r/ssh-host-certs: Import systemd unit files
The _sshca-cli_ package no longer provides a _-systemd_ sub-package
containing the systemd unit files for automatically requesting and
renewing SSH host certificates.  Its original intent was to support
automatically signing certificates on first boot by having the unit
files installed by Anaconda, but this never really worked for various
reasons.  Since I'd rather not have to rebuild the RPMs every time I
need to make a change to the systemd units, and Ansible is required to
actually get the certificates issued anyway, it makes more sense to have
the unit files in the configuration policy instead.
2025-09-14 15:08:41 -05:00
8e8c109bf6 websites/pyrocufflink: Switch to mod_md for cert
The _pyrocufflink.net_ site now obtains its certificate from Let's
Encrypt using the Apache _mod_md_ (managed domain) module.  This
dramatically simplifies the deployment of this certificate, eliminating
the need for _cert-manager_ to obtain it, _cert-exporter_ to add it to
_certs.git_, and Jenkins to push it out to the web server.
2025-09-04 10:04:37 -05:00
29cdafac2a scripts/shutdown-vmhost: Skip Longhorn nodes
We **DO NOT** want to shut down the Longhorn Kubernetes nodes!  Doing so
would pretty much nuke everything running in the cluster.  The shutdown
script will need to migrate them online; fortunately, since they don't
run anything except Longhorn, they should be able to migrate fine.
2025-08-29 21:38:12 -05:00
c11a792eb8 websites/hlc: Drop formsubmit config tasks
_formsubmit_ has been running in Kubernetes for some time now.
2025-08-25 09:00:20 -05:00
524ac0931a websites/hlc: Switch to mod_md for cert management
To avoid having separate certificates for the canonical
_www.hatchlearningcenter.org_ site and all the redirects, we'll combine
these virtual hosts into one.  We can use a `RewriteCond` to avoid the
redirect for the canonical name itself.
2025-08-25 09:00:20 -05:00
fb93598586 dch-proxy: Use PROXY protocol v1 for Nextcloud
Apache doesn't fully support the PROXY v2 protocol.  When it's enabled,
it spams its error log with messages about unsupported features, e.g.:

> [remoteip:error] [pid 1257:tid 1302] [client 172.30.0.6:45614]
> AH03507: RemoteIPProxyProtocol: unsupported command 20
2025-08-23 22:52:08 -05:00
57a5f83262 nextcloud: Run an SMTP relay locally
For some reason, Nextcloud seems to have trouble sending mail via the
network-wide relay.  It opens a connection, then just sits there and
never sends anything until it times out.  This happens probably 4 out of
5 times it attempts to send e-mail messages.

Running Postfix locally and directing Nextcloud to send mail through it
and then on to the network-wide relay seems to work much more reliably.
2025-08-23 22:43:45 -05:00
1a3f68e18b Merge remote-tracking branch 'refs/remotes/origin/master' 2025-08-23 22:43:00 -05:00
1c1bff3ec0 r/nextcloud: Fix a bunch of deployment warnings
The Nextcloud administration overview page listed a bunch of deployment
configuration warnings that needed to be addressed:

* Set the default phone region
* Define a maintenance window starting at 0600 UTC
* Increase the PHP memory limit to 1GiB
* Increase the PHP OPcache interned strings buffer size
* Increase the allowed PHP OPcache memory limit
* Fix Apache rewrite rules for /.well-known paths
2025-08-23 22:39:44 -05:00
6cd576dd2b dch-proxy: Proxy for Authelia
Authelia is now exposed to the public Internet, under the name
_auth.pyrocufflink.net_, which allows it to protect public websites as
well.
2025-08-23 22:29:28 -05:00
70909d1b13 websites: Enable PROXY protocol for HTTPS sites
Since the reverse proxy does TLS pass-through instead of termination,
the original source address is lost.  Since the source address is
important for logging, rate limiting, and access control, we need to use
the HAProxy PROXY protocol to pass it along to the web server.

Since the PROXY protocol works at the TCP layer, _all_ connections must
use it. Fortunately, all of the sites hosted by the public web server
are in fact public and only accessed through HAProxy.  Similarly,
enabling it for one named virtual host enables it for all virtual hosts
on that port.  Thus, we only have to explicitly set it for one site, and
all the rest will use it as well.
2025-08-23 22:21:54 -05:00
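
To show what the web server receives at the start of each connection, here is a minimal Python sketch that parses a PROXY protocol version 1 header (the human-readable variant). It is an illustration only, not how Apache implements it, and the function name is hypothetical.

```python
def parse_proxy_v1(line: bytes):
    """Parse a PROXY protocol v1 header line.

    Returns (src_ip, src_port, dst_ip, dst_port), or None when the proxy
    reports the UNKNOWN protocol family.
    """
    if not line.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    parts = line[:-2].decode("ascii").split(" ")
    if len(parts) < 2 or parts[0] != "PROXY":
        raise ValueError("not a PROXY v1 header")
    if parts[1] == "UNKNOWN":
        return None
    if parts[1] not in ("TCP4", "TCP6") or len(parts) != 6:
        raise ValueError("malformed PROXY v1 header")
    _, _, src_ip, dst_ip, src_port, dst_port = parts
    return src_ip, int(src_port), dst_ip, int(dst_port)
```

Because the header is the first thing sent on the TCP connection, before the TLS handshake, a listener that expects it cannot also accept clients that connect directly without it.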
717a8f90c6 websites: Remove formsubmit
Nothing is using _formsubmit_ right now, but it's been moved to
Kubernetes anyway.
2025-08-23 20:44:41 -05:00
7fc3465d56 smtp1: Fix mynetworks setting for k8s network
The "Kubernetes" subnet is a /27, not a /28.  There are hosts in that
upper section that was masked out, and these were unable to send e-mails
via the relay because they were excluded from the `mynetworks` value.
2025-08-20 07:11:27 -05:00
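
The arithmetic behind this fix can be checked with Python's `ipaddress` module. The subnet below is a documentation range standing in for the real Kubernetes network, which is not given in the commit message, and the check only models CIDR membership, not Postfix's full `mynetworks` syntax.

```python
import ipaddress


def allowed(client, mynetworks):
    """Return True if the client address falls inside any listed network."""
    addr = ipaddress.ip_address(client)
    return any(addr in ipaddress.ip_network(net) for net in mynetworks)


# A /28 covers 16 addresses (.0-.15); a /27 covers 32 (.0-.31).
upper_half_host = "192.0.2.20"  # placeholder address in the upper half of the /27
print(allowed(upper_half_host, ["192.0.2.0/28"]))  # excluded by the too-narrow mask
print(allowed(upper_half_host, ["192.0.2.0/27"]))  # included once the mask is corrected
```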
5dbe26fc60 r/repohost: Optimize createrepo queue loop
Instead of waking every 30 seconds, the queue loop in
`repohost-createrepo.sh` now only wakes when it receives an inotify
event indicating the queue file has been modified.  To avoid missing
events that occurred while a `createrepo` process was running, there's
now an inner loop that runs until the queue is completely empty, before
returning to blocking on `inotifywait`.
2025-08-20 07:11:27 -05:00
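
The control flow described here can be sketched in Python. The three callables are hypothetical stand-ins: `wait_for_change` plays the role of blocking on `inotifywait`, `pop_next` removes the next entry from the queue file, and `process` runs `createrepo`. The real script is a shell script and is not reproduced in this log.

```python
def run_queue_loop(wait_for_change, pop_next, process):
    """Block until the queue changes, then drain it completely.

    wait_for_change() blocks until the queue is modified and returns
    False when the loop should stop. pop_next() returns the next queued
    item, or None when the queue is empty.
    """
    while wait_for_change():
        # Inner loop: keep going until the queue is empty, so items added
        # while process() was running are not left waiting for another event.
        while True:
            item = pop_next()
            if item is None:
                break
            process(item)
```

Without the inner loop, an item queued during a long `process()` call would sit unhandled until some later modification happened to wake the outer loop.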
2d51e2001d gw1: Allow internal IPv6 clients
Specifically to allow the Synology to synchronize its clock, as it only
has an IPv6 address.

We also need to explicitly override `chrony_servers` to an empty list
for the firewall itself, since it syncs with the NTP pool, rather than
its next hop router.
2025-08-17 20:52:36 -05:00
f8d58ef0ed websites/dcow: Transition to static site
We don't really use this site for screenshot sharing any more.  It's
cool to keep to look at old screenshots, so I've saved a static snapshot
of it that can be hosted by plain ol' Apache.
2025-08-16 08:55:28 -05:00
b72676a1bb nextcloud: Fetch HTTPS cert from Kubernetes
Since Nextcloud uses the _pyrocufflink.net_ wildcard certificate, we can
load it directly from the Kubernetes Secret, rather than from the file
in the _certs_ submodule, just like Gitea et al.
2025-08-11 10:39:54 -05:00
f5ab739c9e websites: dustinandtabitha: Switch to mod_md for cert
The _dustinandtabitha.com_ site now obtains its certificate from Let's
Encrypt using the Apache _mod_md_ (managed domain) module.  This
dramatically simplifies the deployment of this certificate, eliminating
the need for _cert-manager_ to obtain it, _cert-exporter_ to add it to
_certs.git_, and Jenkins to push it out to the web server.
2025-08-11 10:34:30 -05:00
33da25209d r/lego: Fix timer unit trigger
`OnActiveSec` only fires once.  To trigger the renew periodically, we
need to use `OnCalendar`.
2025-08-10 17:45:46 -05:00
713fd794a3 remote-blackbox: Scrape HTTPS for some sites
Now that the Blackbox exporter does not follow redirects, we need to
explicitly tell it to scrape the HTTPS variant of sites that have it
enabled.  Otherwise, we only get info about the first HTTP-to-HTTPS
redirect response, which is not helpful for watching certificate expiry.
2025-08-08 11:09:28 -05:00
8a93ef0fc1 hosts: Remove chromie.p.b from AD domain
Since it was updated to Fedora 42, Jenkins configuration management jobs
have been failing to apply policy to _chromie.pyrocufflink.blue_.  It
claims "jenkins is not in the sudoers file," apparently because
`winbind` keeps "forgetting" that _jenkins_ is a member of the _server
admins_ group, which is listed in the `sudoers` file.

I'm getting tired of messing with `winbind` and its barrage of bugs and
quirks.  There's no particular reason for _chromie_ to be an AD domain
member, so let's just remove it and manage its users statically.
2025-08-07 15:07:02 -05:00
423f28ea53 remote-blackbox: Do not follow HTTP redirects
There are a couple of websites we scrape that simply redirect to another
name (e.g. _pyrocufflink.net_ → _dustin.hatch.name_, _tabitha.biz_ →
_hatchlearningcenter.org_).  For these, we want to track the
availability of the first step, not the last, especially with regard to
their certificate lifetimes.
2025-08-07 11:55:31 -05:00
0e15c6a635 needproxy: Add logs.p.b to NO_PROXY
`fluent-bit` has a bug ([#3619], [#3907], [#6759]) in its handling of
the `NO_PROXY` environment variable.  Instead of matching a domain and
all its subdomains, like it claims to do in its [documentation][0], it
only does an exact string match on the full host name.  To work around
this, we need to explicitly list `logs.pyrocufflink.blue` in the
`no_proxy` value; this will not have any impact on other consumers of
this variable, but will make `fluent-bit` work as expected, connecting
directly to Victoria Logs instead of through the proxy.

[0]: https://docs.fluentbit.io/manual/administration/http-proxy#no_proxy
[#3619]: https://github.com/fluent/fluent-bit/issues/3619
[#3907]: https://github.com/fluent/fluent-bit/issues/3907
[#6759]: https://github.com/fluent/fluent-bit/issues/6759
2025-08-06 10:46:03 -05:00
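
The difference between the two matching behaviours can be modelled in a few lines of Python. This is a simplified model of the behaviour described in the commit message, not `fluent-bit`'s actual code, and both function names are hypothetical.

```python
def exact_match(host, no_proxy):
    """Model of the buggy behaviour: only an exact host name matches."""
    return host in [entry.strip() for entry in no_proxy.split(",")]


def suffix_match(host, no_proxy):
    """Model of the documented behaviour: a domain matches its subdomains."""
    for entry in no_proxy.split(","):
        domain = entry.strip().lstrip(".")
        if domain and (host == domain or host.endswith("." + domain)):
            return True
    return False
```

With `no_proxy` set to `pyrocufflink.blue` alone, `suffix_match("logs.pyrocufflink.blue", ...)` succeeds but `exact_match` does not; adding `logs.pyrocufflink.blue` to the list satisfies both, which is why the workaround is harmless to other consumers of the variable.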
daa602495c r/frigate: Add udev rules for coral tpu
Since the _frigate.service_ unit depends on _dev-apex_0.device_,
`/dev/apex_0` needs to have the `systemd` "tag" on its udev device info.
Without this tag, systemd will not "see" the device and thus will not
mark the `.device` unit as active.
2025-08-06 09:04:04 -05:00
9b4232d01a Merge remote-tracking branch 'refs/remotes/origin/master' 2025-08-05 18:17:13 -05:00
6bc0475e89 raid-array: Fix md re-add automation
Recent versions of `mdadm` stopped accepting `/dev/disk/by-id` symlinks
as the MD device:

> mdadm: Value "/dev/disk/by-id/md-name-backup5" cannot be set as devname. Reason: Cannot be started from '/' or '<'.

To work around this, we need a script to resolve the symlink and pass
the real block device name.
2025-08-05 10:31:33 -05:00
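
The symlink resolution step amounts to the following, shown here as a Python sketch. The function name is hypothetical, and the actual script in the repository, including how it then invokes `mdadm`, is not shown in this log.

```python
import os


def resolve_md_device(path):
    """Resolve a /dev/disk/by-id style symlink to the real device path."""
    real = os.path.realpath(path)
    if not os.path.exists(real):
        raise FileNotFoundError(f"{path} does not resolve to an existing device")
    # The resolved path can be passed to mdadm instead of the symlink
    return real
```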
dcef009353 fluent-bit: send md alerts to ntfy
For machines that have Linux MD RAID arrays, I want to receive
notifications about the status of the arrays immediately via _ntfy_.  I
had this before with `journal2ntfy`, but I never got around to setting
it up for the current generation of machines (_nvr2_, _chromie_).  Now
that we have `fluent-bit` deployed, we can use its pipeline capabilities
to select the subset of messages for which we want immediate alerts and
send them directly to _ntfy_.  We use a Lua function to transform the
log record into a body compatible with _ntfy_'s JSON publish request;
`fluent-bit` doesn't have any other way to set array values, as needed
for the `tags` member.
2025-08-05 10:28:20 -05:00
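
The transformation performed by the Lua function can be sketched in Python for illustration. The output keys (`topic`, `title`, `message`, `tags`) are fields of _ntfy_'s JSON publish format; the input record keys, the topic name, and the tag value are assumptions, since the actual record layout and Lua code are not shown in this log.

```python
import json


def to_ntfy_body(record, topic="alerts"):
    """Turn a log record into an ntfy JSON publish body.

    `record` keys ("hostname", "message") and the default topic are
    hypothetical placeholders.
    """
    return {
        "topic": topic,
        "title": f"MD RAID event on {record.get('hostname', 'unknown host')}",
        "message": record.get("message", ""),
        # `tags` must be a JSON array, which is the reason a scripted
        # transform is needed rather than a plain key/value rewrite
        "tags": ["warning"],
    }


body = to_ntfy_body({"hostname": "nvr2", "message": "example array event"})
print(json.dumps(body))
```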
0fe296f7f3 fluent-bit: Deploy log collector for Victoria Logs
[fluent-bit][0] is a generic, highly-configurable log collector.  It was
apparently initially developed for fluentd, but it has so many output
capabilities that it works with many different log aggregation systems,
including Victoria Logs.

Although Victoria Logs supports the Loki input format, and therefore
_Promtail_ would work, I want to try to avoid depending on third-party
repositories.  _fluent-bit_ is packaged by Fedora, so there shouldn't be
any dependency issues, etc.

[0]: https://fluentbit.io
2025-08-05 07:14:08 -05:00
c35c7b8520 r/apache: log errors to syslog by default
Logging to syslog will allow messages to be aggregated in the central
server (Loki now, Victoria Logs eventually), so I don't have to SSH into
the web server to check for errors.
2025-08-04 09:49:19 -05:00
84a8a0d4af websites: dustin.hatch.n: Switch to mod_md for cert
The _dustin.hatch.name_ site now obtains its certificate from Let's
Encrypt using the Apache _mod_md_ (managed domain) module.  This
dramatically simplifies the deployment of this certificate, eliminating
the need for _cert-manager_ to obtain it, _cert-exporter_ to add it to
_certs.git_, and Jenkins to push it out to the web server.
2025-08-04 09:49:19 -05:00
71b1363c58 r/vmhost: Install nmap-ncat
While clients can use `virt-ssh-helper` to communicate with `libvirtd`,
they need `nc` in order to forward SPICE graphics communication.
2025-07-31 10:19:11 -05:00
9e7b9420f4 k8s-iot-net-ctrl: Add node role taints
Previously, _node-474c83.k8s.pyrocufflink.black_ was tainted
`du5t1n.me/machine=raspberrypi`, which prevented arbitrary pods from
being scheduled on it.  Now that there are two more Raspberry Pi nodes
in the cluster, and arbitrary pods _should_ be scheduled on them, this
taint no longer makes sense.  Instead, having specific taints for the
node's roles is more clear.
2025-07-29 21:44:29 -05:00
7f8e39ebd4 websites: chmod777.sh: Switch to mod_md for cert
The _chmod777.sh_ site now obtains its certificate from Let's
Encrypt using the Apache _mod_md_ (managed domain) module.  This
dramatically simplifies the deployment of this certificate, eliminating
the need for _cert-manager_ to obtain it, _cert-exporter_ to add it to
_certs.git_, and Jenkins to push it out to the web server.
2025-07-28 18:53:58 -05:00
2b12ce769c remote-blackbox: Scrape Invoice Ninja 2025-07-28 18:28:30 -05:00
3270011fee r/vmhost: Work around libvirt SELinux policy bug
With the transition to modular _libvirt_ daemons, the SELinux policy is
a bit more granular.  Unfortunately, the new policy has a funny [bug]: it
assumes directories named `storage` under `/run/libvirt` must be for
_virtstoraged_ and labels them as such, which prevents _virtnetworkd_
from managing a virtual network named `storage`.

To work around this, we need to give `/run/libvirt/network` a special
label so that its children do not match the file transition pattern for
_virtstoraged_ and thus keep their `virtnetworkd_var_run_t` label.

[bug]: https://bugzilla.redhat.com/show_bug.cgi?id=2362040
2025-07-28 18:23:24 -05:00