We don't need a notification about paperless not scheduling email tasks
every time there is a gap in the metric. This can happen in some
innocuous situations like when the pod restarts or if there is a brief
disruption of service. Using the `absent_over_time` function with a
range vector, we can have the alert fire only if there have been no
email tasks scheduled within the last 12 hours.
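A sketch of what the rule could look like as a PrometheusRule
resource; the metric and alert names here are placeholders, not the
actual ones exposed by Paperless:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: paperless
spec:
  groups:
    - name: paperless
      rules:
        - alert: PaperlessEmailTasksNotScheduled
          # absent_over_time only returns a value when the metric has
          # been missing for the entire 12-hour window, so brief gaps
          # (pod restarts, short outages) no longer fire the alert
          expr: absent_over_time(paperless_email_task_scheduled[12h])
          labels:
            severity: warning
```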
It turns out this alert is not very useful, and indeed quite annoying.
Many servers can go for days or even weeks with no changes, which is
completely normal.
Since the IP address assigned to the ingress controller is now managed
by keepalived and known to Kubernetes, the network policy needs to allow
access to it by pod namespace rather than by IP address.  It seems
that namespace-based rules take precedence over IP-based rules, so
even though the IP address was explicitly allowed, traffic was not
permitted because it was destined for a Kubernetes service that was
not itself allowed.
Since _ingress-nginx_ no longer runs in the host network namespace,
traffic will appear to come from pods' internal IP addresses now.
Similarly, the network policy for Invoice Ninja needs to be updated to
allow traffic _to_ the ingress controllers' new addresses.
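Roughly, both rules could look like this, assuming the controller
runs in an `ingress-nginx` namespace and Invoice Ninja in an
`invoiceninja` namespace (both names are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-nginx
  namespace: invoiceninja
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # match the controller pods by namespace instead of by IP address,
    # since their traffic now originates from pod-network addresses
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
  egress:
    # allow traffic back out _to_ the controllers' new addresses
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
```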
Clients outside the cluster can now communicate with RabbitMQ directly
on port 5671 by using its dedicated external IP address. This address
is automatically assigned to the node where RabbitMQ is running by
`keepalived`.
Clients outside the cluster can now communicate with Mosquitto directly
on port 8883 by using its dedicated external IP address. This address
is automatically assigned to the node where Mosquitto is running by
`keepalived`.
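Both Services follow the same pattern; here is a minimal sketch for
RabbitMQ, with the address and label as placeholders (Mosquitto's
Service would differ only in name and port 8883):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
spec:
  selector:
    app.kubernetes.io/name: rabbitmq
  ports:
    - name: amqps
      port: 5671
  externalIPs:
    # dedicated address; keepalived assigns it to whichever node is
    # currently running RabbitMQ
    - 192.168.1.60
```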
Now that we have `keepalived` managing the "virtual" IP address for the
ingress controller, we can change _ingress-nginx_ to run as a Deployment
rather than a DaemonSet. It no longer needs to use the host network
namespace, as `kube-proxy` will route all traffic sent to the configured
external IP address to the controller pods.  Using the _Local_
external traffic policy disables source NAT, so incoming traffic is
seen by nginx unmodified.
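The Service definition might look something like the following.  The
address is a placeholder, and the type is NodePort here only because
`externalTrafficPolicy` generally requires a NodePort or LoadBalancer
Service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: https
      port: 443
      targetPort: https
  externalIPs:
    - 192.168.1.50   # the "virtual" address managed by keepalived
  # route external traffic only to pods on the receiving node and
  # skip source NAT, so nginx sees the real client addresses
  externalTrafficPolicy: Local
```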
Running `keepalived` as a DaemonSet will allow managing floating
"virtual" IP addresses for Kubernetes services with configured external
IP addresses. The main services we want to expose outside the cluster
are _ingress-nginx_, Mosquitto, and RabbitMQ. The `keepalived` cluster
will negotiate using the VRRP protocol to determine which node should
have each external address. Using the process tracking feature of
`keepalived`, we can steer traffic directly to the node where the target
service is running.
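A sketch of the keepalived configuration for one such address,
mounted into the DaemonSet from a ConfigMap; the interface, router
ID, and address are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: keepalived
data:
  keepalived.conf: |
    # raise this node's priority when an ingress controller process
    # is running locally, so the VIP follows the service
    vrrp_track_process ingress_nginx {
        process nginx
        weight 50
    }
    vrrp_instance ingress {
        state BACKUP
        interface eth0
        virtual_router_id 51
        priority 100
        virtual_ipaddress {
            192.168.1.50/24
        }
        track_process {
            ingress_nginx
        }
    }
```

Note that the DaemonSet pods need the host's network namespace for
VRRP advertisements, and the host's PID namespace so that process
tracking can see processes belonging to other pods.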
I've created new worker nodes that are dedicated to running Longhorn
replicas. These nodes are tainted with the
`node-role.kubernetes.io/longhorn` taint, so no regular pods will be
scheduled there by default.  Longhorn pods thus need to be configured
to tolerate that taint, and to be scheduled on nodes with the
similarly-named label.
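Concretely, that means something like the following in the Longhorn
pods' spec (the taint effect is an assumption):

```yaml
tolerations:
  - key: node-role.kubernetes.io/longhorn
    operator: Exists
    effect: NoSchedule
nodeSelector:
  # the similarly-named label applied to the dedicated nodes
  node-role.kubernetes.io/longhorn: ""
```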
This will make it easier to "blow away" the RabbitMQ data volume on the
occasions when it gets into a weird state. Simply scale the StatefulSet
down to 0 replicas, delete the PVC, then scale back up. Kubernetes will
handle creating a new PVC automatically.
Nextcloud uses a _client-side_ (JavaScript) redirect to navigate the
browser to its `index.php`. The page it serves with this redirect is
static and will often load successfully, even if there is a problem with
the application. This causes the Blackbox exporter to record the site
as "up," even when it it definitely is not. To avoid this, we can
scrape the `index.php` page explicitly, ensuring that the application is
loaded.
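Assuming the Blackbox exporter is driven through the
prometheus-operator _Probe_ resource, the scrape target would change
to something like this (the hostname and module name are
placeholders):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: nextcloud
spec:
  module: http_2xx
  prober:
    url: blackbox-exporter:9115
  targets:
    staticConfig:
      static:
        # probe index.php itself rather than the static page that
        # redirects to it
        - https://nextcloud.example.com/index.php
```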
The _fleetlock_ server drains all pods from a node before allocating the
reboot lock to that node. Unfortunately, it doesn't actually wait for
those pods to be completely evicted. If some pods take too long to shut
down, they may get stuck in `Terminating` state once the machine starts
rebooting.  This means those pods cannot be replaced on another node
while the original one is offline, which pretty much defeats the
purpose of using Fleetlock in the first place.
It seems upstream has abandoned this project, as there is an open [Pull
Request][0] to fix this issue that has so far been ignored.
Fortunately, building a new container image containing the patch is easy
enough, so we can run our own patched build.
[0]: https://github.com/poseidon/fleetlock/pull/271