Commit Graph

1206 Commits

Author SHA1 Message Date
85fc29d511 remote-blackbox: Increase scrape timeout
In order to avoid false positives, especially with Invoice Ninja, I'm
increasing the timeout values for scraping the public-facing websites.
They can occasionally be quite slow, either because of our Internet
connection, or load on the servers.
2025-11-25 21:56:20 -06:00
0334b1b77a Merge branch 'fluent-bit' 2025-11-24 07:49:05 -06:00
f1b61a8d0a v-l: Enable useRemoteIP for syslog
Victoria Logs can now record the source address for syslog messages in a
`remoteIP` field.  This has to be enabled specifically, although I can't
think of a reason why someone would _not_ want to record that
information.
2025-11-24 07:47:35 -06:00
8aa1e986d4 r/gitea: Enable PROXY protocol
Using the PROXY protocol allows the publicly-facing reverse proxy to
pass through the original source address of the client, without doing
TLS termination.  Clients on the internal network will not go through
the proxy, though, so we have to disable the PROXY protocol for those
addresses.  Unfortunately, the syntax for this is kind of cumbersome,
because Apache only has a deny list, not an allow list, so we have to
enumerate all of the possible internal addresses _except_ the proxy.
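
A minimal sketch of how such an enumeration could be generated, using Python's standard `ipaddress` module to compute the CIDR blocks that cover a network minus one host. The network and proxy address are hypothetical documentation values, and the directive name assumes the setting in question is mod_remoteip's `RemoteIPProxyProtocolExceptions`; neither is taken from this commit.

```python
import ipaddress


def proxy_protocol_exceptions(internal_net: str, proxy_addr: str) -> list[str]:
    """Return CIDR blocks covering internal_net except the proxy address."""
    network = ipaddress.ip_network(internal_net)
    proxy = ipaddress.ip_network(proxy_addr)  # a bare address becomes a /32
    # address_exclude() yields the subnets left over once `proxy` is removed
    remaining = sorted(network.address_exclude(proxy))
    return [str(block) for block in remaining]


# Hypothetical values, for illustration only
blocks = proxy_protocol_exceptions("192.0.2.0/28", "192.0.2.6")
print("RemoteIPProxyProtocolExceptions " + " ".join(blocks))
```

The number of blocks grows with the prefix length of the network, which is why the resulting configuration is cumbersome to write by hand.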
2025-11-19 07:43:29 -06:00
25d813144c r/web/hlc: Drop cert role
The certificate for _hatchlearningcenter.org_ is managed by Apache
*mod_md* now.
2025-11-17 08:00:45 -06:00
68b045d6d1 websites: Drop unnecessary cert for hatch.chat
The Synapse server has been gone for a long time.
2025-11-17 07:56:38 -06:00
c1944fc78a site: Remove frigate PB
The `frigate` playbook cannot be applied by the host provisioner for
several reasons.  First, it needs manual intervention in order to enroll
the MOK which is used to sign the `gasket-driver` kernel modules.
Further, it needs several encrypted values from Ansible Vault, which are
not available to the _host-provisioner_.
2025-11-16 16:49:15 -06:00
2d53fe6acd gw1/squid: Allow pxe.p.b via HTTPS
Now that Kickstart files are hosted on _pxe.pyrocufflink.blue_, we can
allow access to that entire (sub-)domain, enabling clients to fetch the
files over HTTPS.  Previously, this was not possible because in order to
allow access to Kickstart files but nothing else on Gitea, we had to
rely on full URL matching.
2025-11-16 16:49:15 -06:00
2aca0429eb useproxy: Add ntfy.p.b to NO_PROXY
Specifically for _fluent-bit_, which does not correctly handle wildcards
or subdomains in `NO_PROXY`, to send real-time notifications from logs
via ntfy.
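
The difference can be sketched in Python by comparing a client that only honors exact `NO_PROXY` entries with one that also honors domain suffixes. This illustrates the general problem and is not a model of Fluent Bit's actual matching code; the host names are hypothetical.

```python
def parse_no_proxy(value: str) -> list[str]:
    """Split a comma-separated NO_PROXY value into normalized entries."""
    return [entry.strip().lower() for entry in value.split(",") if entry.strip()]


def bypass_exact(host: str, no_proxy: str) -> bool:
    """Bypass the proxy only when the host is listed verbatim."""
    return host.lower() in parse_no_proxy(no_proxy)


def bypass_suffix(host: str, no_proxy: str) -> bool:
    """Bypass the proxy for listed hosts and for any of their subdomains."""
    host = host.lower()
    for entry in parse_no_proxy(no_proxy):
        domain = entry.lstrip("*").lstrip(".")
        if host == domain or host.endswith("." + domain):
            return True
    return False


# Hypothetical names: a domain-wide entry is not enough for the exact matcher
print(bypass_suffix("ntfy.example.test", ".example.test"))
print(bypass_exact("ntfy.example.test", ".example.test"))
print(bypass_exact("ntfy.example.test", ".example.test,ntfy.example.test"))
```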
2025-11-16 16:49:15 -06:00
04f62a1467 hosts: Remove nvr2 from AD domain
The NVMe drive in _nvr2.pyrocufflink.blue_ died, so I had to re-install
Fedora on a new drive.  This time around, it will not be a domain
member, as with the other new servers added recently.
2025-11-16 16:48:20 -06:00
60b7a20e1f frigate: Switch to pre-compiled gasket-driver RPM
The DKMS package for the _gasket-driver_ kernel modules is something of
a problem.  For one thing, upstream seems to have abandoned the driver
itself, and it now requires several patches in order to compile for
current kernel versions.  These patches are not included in the DKMS
package, and thus have to be applied manually after installing it.  More
generally, I don't really like how DKMS works anyway.  Besides requiring
a full kernel development toolchain on a production system, it's
impossible to know if a module will compile successfully until _after_
the new kernel has been installed and booted.  This has frequently meant
that Frigate won't come up after an update because building the module
failed.  I would much rather have a notification about a compatibility
issue for an _upcoming_ update, rather than an applied one.

To rectify these issues, I have created a new RPM package that contains
pre-built, signed kernel modules for the Coral EdgeTPU device.  Unlike
the DKMS package, this package needs to be rebuilt for every kernel
version; however, Jenkins does this before the updated kernel gets
installed on the machine.  It also expresses a dependency on an exact
kernel version, so the kernel cannot be updated until a corresponding
_gasket-driver_ package is available.
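
The effect of the exact-version dependency can be sketched as a simple gate in Python: a kernel is installable only if a driver package was built for that same version. The version strings are hypothetical, and the real check is performed by the package manager's dependency resolution, not by code like this.

```python
def installable_kernels(candidates: list[str], driver_builds: set[str]) -> list[str]:
    """Keep only the kernels that have a driver build for the exact same version."""
    return [kernel for kernel in candidates if kernel in driver_builds]


# Hypothetical versions: the newest kernel has no matching driver build yet
candidates = ["6.17.7-300.fc43", "6.17.8-300.fc43"]
driver_builds = {"6.17.7-300.fc43"}
print(installable_kernels(candidates, driver_builds))
```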
2025-11-16 16:30:51 -06:00
94a777fec8 r/collectd-sensors: Add missing handlers file 2025-11-16 16:30:51 -06:00
0df95c8378 Drop .certs submodule
Nothing uses these certificates anymore, and nothing manages/renews
them.  Everything has either been converted to ACME, or fetches the
_pyrocufflink.net_ wildcard certificate directly from the Kubernetes
Secret.
2025-11-16 16:28:49 -06:00
daa91e71a1 Merge remote-tracking branch 'refs/remotes/origin/master' 2025-11-16 16:24:04 -06:00
fce060bdec r/ssh-host-certs: Fix circular dep in reload.path
The `reload-ssh-cert.path` unit introduced a circular ordering
dependency with `sshd.service` by way of `paths.target`.  There's no
particular reason for this dependency here, so we need to remove it to
resolve the issue.
2025-11-13 18:40:52 -06:00
44c3dba46a r/gitea: Update to v1.24.7 2025-11-12 17:48:09 -06:00
4b91e088ea r/apache: Reduce amount of logs stored
There's really no reason to keep four 256 MiB log files, especially access
logs.  In any case, most of the web servers only have 1 GiB log volume,
so this configuration tends to fill them up.
2025-11-09 13:23:02 -06:00
28ecc2974c fluent-bit: Remove Promtail 2025-11-06 09:44:22 -06:00
a500e0ece4 hosts: Decommission dc-headphone.p.b
_dc-headphone.pyrocufflink.blue_ has been replaced by
_dc-backless.pyrocufflink.blue_.
2025-11-01 22:28:43 -05:00
5af25bcccf r/dch-yum: Trust GPG key
We need to explicitly add the GPG signing key for the _dch_ repository
to the system trust store; otherwise, _dnf-automatic_ will fail, as it
cannot implicitly add new keys during an update.
2025-10-27 12:54:07 -05:00
1804bc06f0 domain-controller: Remove vault secrets
The secret values stored in this vault file were never actually used.
They weren't even correct.
2025-10-27 12:54:07 -05:00
7929176b4e create-dc: Update to use new provisioning process
Instead of running `virt-install` directly from the `create-dc.sh`
script, it now relies on `newvm.sh`.  This will ensure that VMs created
to be domain controllers will conform to the same expectations as all
other machines, such as using the libvirt domain metadata to build
dynamic inventory.

Similarly, the `create-dc.yml` playbook now imports the `host-setup.yml`
playbook, which covers the basic setup of a new machine.  Again, this
ensures that the same policy is applied to DCs as to other machines.

Finally, domain controller machines now no longer use _winbind_ for
OS user accounts and authentication.  This never worked particularly
well on DCs anyway (particularly because of the way _winbind_ insists on
using domain-prefixed user accounts when it runs on a DC), and is now
worse with recent Fedora changes.  Instead, DCs now have local users who
authenticate via SSH certificates, the same as other current-generation
servers.
2025-10-27 12:53:27 -05:00
3f761eacb4 newvm: Add support for specifying static IP config
Although rare, there are scenarios where we may want to deploy a new
virtual machine with a static, manually-configured IP address.
Anaconda/Dracut support this via the `ip=` kernel command-line argument.
To simplify populating that argument, the `newvm` script now takes
additional command-line arguments for IP address (in CIDR notation),
default gateway, and name server address(es) and creates the appropriate
string from these discrete values.
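
A minimal Python sketch of how such a string could be assembled from the discrete values, following the `ip=<client-IP>:[<peer>]:<gateway-IP>:<netmask>:<hostname>:<interface>:<method>` form documented for Dracut. The function name, the IPv4-only handling, and the choice to leave the hostname and interface fields empty are assumptions; the exact string the `newvm` script emits is not shown in this commit.

```python
import ipaddress


def build_ip_args(address_cidr: str, gateway: str, nameservers: list[str]) -> str:
    """Build kernel command-line arguments for a static IPv4 configuration."""
    iface = ipaddress.ip_interface(address_cidr)  # e.g. "192.0.2.10/24"
    # Fields: client IP, peer (empty), gateway, netmask, hostname (empty),
    # interface (empty), and "none" to disable autoconfiguration
    ip_arg = f"ip={iface.ip}::{gateway}:{iface.netmask}:::none"
    ns_args = [f"nameserver={server}" for server in nameservers]
    return " ".join([ip_arg, *ns_args])


# Hypothetical addresses from the documentation range
print(build_ip_args("192.0.2.10/24", "192.0.2.1", ["192.0.2.53"]))
```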
2025-10-24 11:17:11 -05:00
3bed59055c users: Do not apply sudo role on Samba DCs
Users, auth, etc. for domain controllers will be handled by the
`create-dc.yml` playbook.  I haven't decided exactly how this playbook
will get applied, but I want to make sure the host provisioner is able
to successfully provision machines in the _samba-dc_ group nonetheless.
2025-10-22 21:13:03 -05:00
7308b45047 fluent-bit: Enable EPEL repo if needed
The _fluent-bit_ package is provided by EPEL for Red Hat/CentOS/AlmaLinux.
2025-10-19 09:28:47 -05:00
0b914d617e ci: Optionally allow installing packages
Usually, we do not want the continuous enforcement jobs installing or
upgrading software packages.  Sometimes, though, we may want to use a
Jenkins job to roll out something new, so this new `ALLOW_INSTALL`
parameter will control whether or not Ansible tasks tagged with
`install` are skipped.
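
The parameter's effect can be sketched in Python as a command builder that adds `--skip-tags install` unless installation is explicitly allowed. How the Jenkins pipeline actually passes the parameter is an assumption here (an environment variable holding `true` or `false`), as are the function names.

```python
import os


def allow_install_from_env(env=os.environ) -> bool:
    """Read ALLOW_INSTALL, treating anything other than "true" as disabled."""
    return env.get("ALLOW_INSTALL", "false").lower() == "true"


def ansible_command(playbook: str, allow_install: bool) -> list[str]:
    """Build the ansible-playbook command, skipping install tasks by default."""
    command = ["ansible-playbook", playbook]
    if not allow_install:
        command += ["--skip-tags", "install"]
    return command


print(ansible_command("fluent-bit.yml", allow_install_from_env({})))
print(ansible_command("fluent-bit.yml", allow_install_from_env({"ALLOW_INSTALL": "true"})))
```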
2025-10-19 09:04:27 -05:00
ea1253c9b8 ci: Remove remount RO/RW stages
None of the extant servers have read-only root filesystems any more, so
these stages are no longer necessary.
2025-10-19 08:57:19 -05:00
bcfe7cc699 ci: Add pipeline for fluent-bit Playbook 2025-10-17 07:53:10 -05:00
dc8961de92 fluent-bit: Do not apply to K8s nodes
We'll manage Fluent-Bit on Kubernetes nodes as a DaemonSet.  This will
be necessary in order to grant it access to the Kubernetes API so it can
augment log records with Kubernetes metadata (labels, pod name, etc.).
2025-10-17 07:51:32 -05:00
96ac5be3b5 r/kubelet: Schedule automatic image prune
As pods move around between nodes, applications are updated, etc., nodes
tend to accumulate images in their container stores that are no longer
used.  These take up space unnecessarily, eventually triggering disk
usage alarms.  From now on, the _kubelet_ role installs a systemd timer and
service unit to periodically clean up these unused images.
2025-10-13 09:54:20 -05:00
142682ce2f r/ssh-host-certs: Fix restart handler
The _ssh-host-certs.target_ unit does not exist any more.  It was
provided by the _sshca-cli-systemd_ package to allow machines to
automatically request their SSH host certificates on first boot.  It had
a `ConditionFirstBoot=` requirement, which made it not work at any other
time, so there was no reason to move it into the Ansible configuration
policy.  Instead, we can use the _ssh-host-certs-renew.target_ unit to
trigger requesting or renewing host certificates.
2025-09-17 06:40:20 -05:00
4601b4d092 victoria-logs: Update to v1.33.1 2025-09-15 11:13:01 -05:00
c2d26f1f59 r/fluent-bit: Drop network.target requirement
The _network.target_ unit should be used for ordering only.  Listing it
as a `Requires=` dependency can cause _fluent-bit.service_ to fail to
start at all if the network takes slightly too long to initialize at
boot.
2025-09-15 10:49:32 -05:00
2cba5eb2e4 fluent-bit: Make ntfy pipeline steps optional
Most hosts will not need to send any messages to ntfy.  Let's define the
ntfy pipeline stages only for the machines that need them.  There are
currently two use cases for ntfy:

* MD RAID status messages (from Chromie and nvr2)
* WAN Link status messages (from gw1)

Breaking up the pipeline into smaller pieces allows both of these use
cases to define their appropriate filters while still sharing the common
steps.  The other machines that have no use for these steps now omit
them entirely.
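
The composition idea can be sketched in Python: every host gets the common steps, and only hosts with an ntfy use case append the extra filter and output. All step and use-case names below are hypothetical labels, not the variable names used in the repository.

```python
# Hypothetical step labels shared by every host
COMMON_STEPS = ["journal-input", "victoria-logs-output"]

# Hypothetical per-use-case filters and the shared ntfy output
USE_CASE_FILTERS = {
    "md-raid": ["md-raid-status-filter"],
    "wan-link": ["wan-link-status-filter"],
}
NTFY_STEPS = ["ntfy-output"]


def build_pipeline(use_cases: list[str]) -> list[str]:
    """Combine the common steps with the pieces each use case needs."""
    steps = list(COMMON_STEPS)
    for use_case in use_cases:
        steps += USE_CASE_FILTERS[use_case]
    if use_cases:
        steps += NTFY_STEPS  # only hosts that notify get the ntfy output
    return steps


print(build_pipeline([]))
print(build_pipeline(["wan-link"]))
```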
2025-09-15 10:46:45 -05:00
faf4822918 fluent-bit: Ignore all HTTP output status messages
If the Fluent Bit pipeline includes multiple HTTP outputs, we need to
suppress the `HTTP status=200` messages from _all_ of them.
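
A Python sketch of the matching logic: the pattern has to cover every HTTP output instance and not only the first one. The sample log lines are assumptions about how Fluent Bit labels its output instances (`http.0`, `http.1`, and so on), and the real filter lives in the Fluent Bit configuration, not in Python.

```python
import re

# Match the success message from any HTTP output instance, not only http.0
HTTP_OK = re.compile(r"\[output:http:[^\]]+\].*HTTP status=200\b")


def keep(line: str) -> bool:
    """Return False for routine HTTP 200 status messages."""
    return HTTP_OK.search(line) is None


# Assumed sample lines, for illustration only
lines = [
    "[ info] [output:http:http.0] logs.example.test:9428, HTTP status=200",
    "[ info] [output:http:http.1] ntfy.example.test:443, HTTP status=200",
    "[error] [output:http:http.1] ntfy.example.test:443, HTTP status=502",
]
print([line for line in lines if keep(line)])
```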
2025-09-15 08:01:42 -05:00
3d4bf3dd6c fluent-bit: Add hostname field to all records
Messages from sources other than the systemd journal do not have a
`hostname` field by default.  This could make filtering logs difficult
if there are multiple servers that host the same application.  Thus, we
need to inject the host name statically into every record, to ensure
they can be correctly traced to their source machine.
2025-09-15 08:00:16 -05:00
414cb828e1 unifi: Configure Fluent Bit for Unifi server
The Unifi Network server writes a bunch of log files that we need to
forward to Victoria Logs.  This commit introduces components to the
Fluent Bit pipeline to read these files with the `tail` input plugin,
parse them using regular expressions to extract the correct time stamp
from the messages, and send them to Victoria Logs.
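
The parsing step can be sketched in Python as a regular expression with named groups, with the time stamp then converted to a real datetime. The log line format below is hypothetical; the actual Unifi log formats and the parser definitions in this commit are not shown here.

```python
import re
from datetime import datetime

# Hypothetical format: "[2025-09-15T07:58:29,123] INFO  message text"
LINE_RE = re.compile(r"^\[(?P<time>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def parse_line(line: str):
    """Split a log line into fields, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Use the time stamp from the message itself, not the time it was read
    record["time"] = datetime.strptime(record["time"], "%Y-%m-%dT%H:%M:%S,%f")
    return record


print(parse_line("[2025-09-15T07:58:29,123] INFO  device adopted"))
```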
2025-09-15 07:58:29 -05:00
75061c4d78 all: Split up Fluent Bit vars
Instead of defining the common values for Fluent Bit inputs, filters,
and outputs directly in the variables used by the _fluent-bit_ role, we
need to split these into reusable pieces.  This way, hosts and groups
that need to use a slightly different pipeline configuration can access
the default values without having to redefine them.
2025-09-15 07:55:43 -05:00
0331a55b3e r/fluent-bit: Set HOSTNAME environment variable
Fluent-bit does not have any native capability for setting a field with
the hostname of the machine, but it can set a field with the value of an
environment variable.  Thus, we can set the `HOSTNAME` environment
variable and then use that to set the field in the pipeline.
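
The approach amounts to the following, sketched in Python: read the host name from the environment once and stamp it onto any record that lacks one. Leaving an existing `hostname` value untouched and the `unknown` fallback are assumptions of this sketch, not documented behavior of the role.

```python
import os


def add_hostname(record: dict, env=os.environ) -> dict:
    """Return a copy of the record with a hostname field filled in."""
    stamped = dict(record)
    # Records from the systemd journal may already carry a hostname
    stamped.setdefault("hostname", env.get("HOSTNAME", "unknown"))
    return stamped


# Hypothetical record read from a plain log file
print(add_hostname({"message": "link up"}, {"HOSTNAME": "gw1.example.test"}))
```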
2025-09-15 07:53:13 -05:00
d0bffdeb15 r/fluent-bit: Support configuring parsers
When ingesting logs from sources other than systemd, such as
unstructured log files written by uncooperative services, it may be
necessary to define custom parsers.
2025-09-15 07:51:39 -05:00
8a7faac35b r/ssh-host-certs: Reload sshd after renewing certs
In Fedora 41, it seems the SSH daemon no longer automatically uses the
new certificate after its host certificates have been renewed.  To get
it to pick up the new ones, we have to explicitly tell it to reload.  To
handle that automatically, I've added a new systemd path unit that
monitors the certificate files.  When it detects that one of them has
changed, it will send the signal to the SSH daemon to tell it to reload.
2025-09-14 15:08:41 -05:00
37e6622351 r/ssh-host-certs: Import systemd unit files
The _sshca-cli_ package no longer provides a _-systemd_ sub-package
containing the systemd unit files for automatically requesting and
renewing SSH host certificates.  Its original intent was to support
automatically signing certificates on first boot by having the unit
files installed by Anaconda, but this never really worked for various
reasons.  Since I'd rather not have to rebuild the RPMs every time I
need to make a change to the systemd units, and Ansible is required to
actually get the certificates issued anyway, it makes more sense to have
the unit files in the configuration policy instead.
2025-09-14 15:08:41 -05:00
8e8c109bf6 websites/pyrocufflink: Switch to mod_md for cert
The _pyrocufflink.net_ site now obtains its certificate from Let's
Encrypt using the Apache _mod_md_ (managed domain) module.  This
dramatically simplifies the deployment of this certificate, eliminating
the need for _cert-manager_ to obtain it, _cert-exporter_ to add it to
_certs.git_, and Jenkins to push it out to the web server.
2025-09-04 10:04:37 -05:00
29cdafac2a scripts/shutdown-vmhost: Skip Longhorn nodes
We **DO NOT** want to shut down the Longhorn Kubernetes nodes!  Doing so
would pretty much nuke everything running in the cluster.  The shutdown
script will need to migrate them online; fortunately, since they don't
run anything except Longhorn, they should be able to migrate fine.
2025-08-29 21:38:12 -05:00
c11a792eb8 websites/hlc: Drop formsubmit config tasks
_formsubmit_ has been running in Kubernetes for some time now.
2025-08-25 09:00:20 -05:00
524ac0931a websites/hlc: Switch to mod_md for cert management
To avoid having separate certificates for the canonical
_www.hatchlearningcenter.org_ site and all the redirects, we'll combine
these virtual hosts into one.  We can use a `RewriteCond` to avoid the
redirect for the canonical name itself.
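
The condition behaves like the following Python sketch: requests for any other name handled by the combined virtual host are redirected to the canonical one, while requests for the canonical name are served directly. The alias host name in the example and the use of `https` in the target are assumptions.

```python
CANONICAL = "www.hatchlearningcenter.org"


def redirect_target(host: str, path: str):
    """Return the redirect URL, or None when the request is already canonical."""
    if host.lower() == CANONICAL:
        return None  # equivalent to the condition that skips the redirect
    return f"https://{CANONICAL}{path}"


print(redirect_target("hatchlearningcenter.org", "/about"))
print(redirect_target("www.hatchlearningcenter.org", "/about"))
```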
2025-08-25 09:00:20 -05:00
fb93598586 dch-proxy: Use PROXY protocol v1 for Nextcloud
Apache doesn't fully support the PROXY v2 protocol.  When it's enabled,
it spams its error log with messages about unsupported features, e.g.:

> [remoteip:error] [pid 1257:tid 1302] [client 172.30.0.6:45614]
> AH03507: RemoteIPProxyProtocol: unsupported command 20
2025-08-23 22:52:08 -05:00
57a5f83262 nextcloud: Run an SMTP relay locally
For some reason, Nextcloud seems to have trouble sending mail via the
network-wide relay.  It opens a connection, then just sits there and
never sends anything until it times out.  This happens probably 4 out of
5 times it attempts to send e-mail messages.

Running Postfix locally and directing Nextcloud to send mail through it
and then on to the network-wide relay seems to work much more reliably.
2025-08-23 22:43:45 -05:00
1a3f68e18b Merge remote-tracking branch 'refs/remotes/origin/master' 2025-08-23 22:43:00 -05:00
1c1bff3ec0 r/nextcloud: Fix a bunch of deployment warnings
The Nextcloud administration overview page listed a bunch of deployment
configuration warnings that needed to be addressed:

* Set the default phone region
* Define a maintenance window starting at 0600 UTC
* Increase the PHP memory limit to 1 GiB
* Increase the PHP OPcache interned strings buffer size
* Increase the allowed PHP OPcache memory limit
* Fix Apache rewrite rules for /.well-known paths
2025-08-23 22:39:44 -05:00