Found another race condition: If the first pod evicted is deleted
quickly, before any other pods are evicted, the wait list will become
empty immediately, causing the `wait_drained` function to return too
early.
I've completely rewritten the `drain_node` function (again) to hopefully
handle all of these races. Now, it's purely reactive: instead of
getting a list of pods to evict ahead of time, it uses the `Added`
events of the watch stream to determine the pods to evict. As soon as a
pod is determined to be a candidate for eviction, it is added to the
wait list. If eviction of a pod fails irrecoverably, that pod
is removed from the wait list, to prevent the loop from running forever.
This works because `Added` events for all current pods will arrive as
soon as the stream is opened. `Deleted` events will start arriving once
all the `Added` events are processed. The key difference between this
implementation and the previous one, though, is when pods are added to
the wait list. Previously, we only added them to the list _after_ they
were evicted, but this made populating the list too slow. Now, since we
add them to the list _before_ they are evicted, we can be sure the list
is never empty until every pod is deleted (or unable to be evicted at
all).
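Roughly, the new loop has this shape (a minimal sketch only: `PodEvent`
and `evict_pod` are simplified stand-ins for the real watch events and
eviction call, and filtering of non-candidate pods is omitted):

```rust
use std::collections::HashSet;

/// Simplified stand-in for the `Added`/`Deleted` watch events the real code
/// consumes from the Kubernetes API.
enum PodEvent {
    Added(String),
    Deleted(String),
}

/// Reactive drain loop: a pod goes onto the wait list as soon as its `Added`
/// event is seen, *before* it is evicted, and leaves the list only when it
/// is deleted or its eviction fails irrecoverably.
async fn drain_node(mut events: tokio::sync::mpsc::Receiver<PodEvent>) {
    let mut waiting: HashSet<String> = HashSet::new();
    let mut saw_pod = false;

    while let Some(event) = events.recv().await {
        match event {
            PodEvent::Added(name) => {
                saw_pod = true;
                waiting.insert(name.clone());
                // An irrecoverable eviction failure removes the pod from the
                // wait list so the loop can still terminate.
                if evict_pod(&name).await.is_err() {
                    waiting.remove(&name);
                }
            }
            PodEvent::Deleted(name) => {
                waiting.remove(&name);
            }
        }

        // `Added` events for all existing pods arrive before any `Deleted`
        // events, so once the list empties the node really is drained.
        if saw_pod && waiting.is_empty() {
            break;
        }
    }
}

/// Placeholder for the real eviction call (which retries recoverable
/// errors such as PodDisruptionBudget violations).
async fn evict_pod(_name: &str) -> Result<(), ()> {
    Ok(())
}
```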
The new `kubeRolloutRestart` function works similarly to the
`kubeRestartDeployment` function, but supports any kind of pod
controller, including DaemonSet.
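For reference, `kubectl rollout restart` works across controller kinds by
stamping a `kubectl.kubernetes.io/restartedAt` annotation onto the pod
template, and that mechanism is what the sketch below shows; the Rust and
kube-rs code and the `rollout_restart_daemonset` name are purely
illustrative, not the actual implementation of `kubeRolloutRestart`:

```rust
use k8s_openapi::api::apps::v1::DaemonSet;
use kube::api::{Api, Patch, PatchParams};

/// Trigger a rolling restart of a DaemonSet by bumping the pod template's
/// `restartedAt` annotation, the same trick `kubectl rollout restart` uses.
/// The equivalent patch works for Deployments and StatefulSets too.
async fn rollout_restart_daemonset(
    client: kube::Client,
    namespace: &str,
    name: &str,
    timestamp: &str, // e.g. an RFC 3339 timestamp for "now"
) -> kube::Result<DaemonSet> {
    let api: Api<DaemonSet> = Api::namespaced(client, namespace);
    let patch = serde_json::json!({
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": timestamp
                    }
                }
            }
        }
    });
    api.patch(name, &PatchParams::default(), &Patch::Merge(&patch)).await
}
```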
Using a channel to transfer the list of pods from the task that is
evicting the pods to the task that is waiting for them to be deleted
creates a race condition. It is possible for the watch event stream to
handle the pod delete event _before_ the channel delivers the pod
identifier, so the pod gets added to the wait list _after_ it's already
been deleted. This results in the `wait_drained` task waiting forever
for the pod to be deleted, even though it is already gone.
To address this, we need to construct the wait list in the `drain_node`
task, as we are evicting pods. This way, we can be sure that every pod
that was evicted is in the wait list immediately.
Instead of replacing the current process with the reboot command
directly via `exec`, we need to run it in a child process and keep
the current process running. The former method has the interesting
side-effect of getting the machine into a state where it can never
reboot:
1. When the reboot sentinel file appears, the coordinator acquires the
lock and drains the node, then `exec`s the reboot command.
2. The DaemonSet pod goes into _Completed_ state once the reboot command
finishes. If the reboot command starts the reboot process
immediately, there is no issue, but if it starts a delayed reboot,
trouble ensues.
3. After a timeout, Kubernetes restarts the DaemonSet pod, starting the
coordinator process again.
4. The coordinator notices that the reboot sentinel already exists and
immediately `exec`s the reboot command again.
5. The reboot command restarts the delayed reboot process, pushing the
actual reboot time further into the future.
6. Return to step 2.
To break this loop, someone needs to either remove the reboot sentinel
file, letting the coordinator start up and run without doing anything,
or forcibly reboot the node.
We can avoid this loop by never exiting from the process managed by the
pod. The reboot command runs and exits, but the parent process
continues until it's signalled to stop.
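The difference, sketched with the standard library (assuming a Unix
target; the function names are illustrative):

```rust
use std::os::unix::process::CommandExt;
use std::process::Command;

/// The old behaviour: replace the coordinator with the reboot command.  The
/// pod's process exits as soon as the command does, so Kubernetes eventually
/// restarts it and the whole dance starts over.
fn reboot_via_exec(reboot_cmd: &str, args: &[&str]) -> std::io::Error {
    // `exec` only returns if spawning the new program failed.
    Command::new(reboot_cmd).args(args).exec()
}

/// The new behaviour: run the reboot command as a child process and keep the
/// coordinator running until it is signalled to stop (or the node reboots).
fn reboot_via_child(reboot_cmd: &str, args: &[&str]) -> std::io::Result<()> {
    let status = Command::new(reboot_cmd).args(args).status()?;
    if !status.success() {
        eprintln!("reboot command exited with {status}");
    }
    // Fall through and return to the main loop; the process stays alive.
    Ok(())
}
```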
After some initial testing, I decided that the HTTP API approach to
managing the reboot lock is not going to work. I originally implemented
it this way so that the reboot process on the nodes could stay the same
as it had always been, only adding a systemd unit to interact with the
server to obtain the lock and drain the node. Unfortunately, this does
not actually work in practice because there is no way to ensure that the
new unit runs _first_ during the shutdown process. In fact, systemd
practically _insists_ on stopping all running containers before any
other units. The only solution, therefore, is to obtain the reboot lock
and drain the node before initiating the actual shutdown procedure.
I briefly considered installing a script on each node to handle all of
this, and configuring _dnf-automatic_ to run that. I decided against
that, though, as I would prefer to have as much of the node
configuration managed by Kubernetes as possible; I don't want to have
to maintain that script with Ansible.
I decided that the best way to resolve these issues was to rewrite the
coordinator as a daemon that runs on every node. It waits for a
sentinel file to appear (`/run/reboot-needed` by default), and then
tries to obtain the reboot lock, drain the node, and reboot the machine.
All of the logic is contained in the daemon and deployed by Kubernetes;
the only change that has to be deployed by Ansible is configuring
_dnf-automatic_ to run `touch /run/reboot-needed` instead of `shutdown
-r +5`.
This implementation is heavily inspired by [kured](https://kured.dev).
Both rely on a sentinel file to trigger the reboot, but Kured uses a
naive polling method for detecting it, which either means wasting a lot
of CPU checking frequently, or introducing large delays by checking
infrequently. Kured also implements the reboot lock without using a
Lease, which could be problematic if multiple nodes try to reboot
simultaneously.
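One way to wait for the sentinel without polling is inotify, for example
via the `notify` crate; this is an assumption about the mechanism, not
necessarily what the daemon actually does:

```rust
use std::path::Path;
use std::sync::mpsc;

use notify::{recommended_watcher, RecursiveMode, Watcher};

/// Block until the sentinel file exists, using a filesystem watcher (inotify
/// on Linux) instead of polling.
fn wait_for_sentinel(sentinel: &Path) -> notify::Result<()> {
    let dir = sentinel.parent().expect("sentinel path has a parent");

    // Watch the directory first, then check for the file, so a file created
    // between the two steps is not missed.
    let (tx, rx) = mpsc::channel();
    let mut watcher = recommended_watcher(tx)?;
    watcher.watch(dir, RecursiveMode::NonRecursive)?;

    if sentinel.exists() {
        return Ok(());
    }

    for event in rx {
        if event?.paths.iter().any(|p| p.as_path() == sentinel) {
            return Ok(());
        }
    }
    Ok(())
}
```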
There was a race condition while waiting for a node to be drained,
especially if there are pods that cannot be evicted immediately when the
wait starts. It was possible for the `wait_drained` function to return
before all of the pods had been deleted, if the wait list temporarily
became empty at some point. This could happen, for example, if multiple
`WatchEvent` messages were processed from the stream before any messages
were processed from the channel; even though there were pod identifiers
waiting in the channel to be added to the wait list, if the wait list
became empty after processing the watch events, the loop would complete.
This is made much more likely if a PodDisruptionBudget temporarily
prevents a pod from being evicted; it could take 5 or more seconds for
that pod's identifier to be pushed to the channel, and in that time, the
rest of the pods could be deleted.
To resolve this, we need to ensure that the `wait_drained` function
never returns until the sender side of the channel is dropped. This
way, we are sure that no more pods will be added to the wait list, so
when it gets emptied, we are sure we are actually done.
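The exit condition then has roughly this shape (a sketch of the
termination logic only; `next_deleted_pod` is a stand-in for reading
`Deleted` events from the watch stream):

```rust
use std::collections::HashSet;
use tokio::sync::mpsc;

/// Placeholder for the watch stream: resolves with the name of the next pod
/// the stream reports as deleted.
async fn next_deleted_pod() -> String {
    std::future::pending().await
}

/// Wait until every evicted pod is gone.  The exit condition requires the
/// `evicted` channel to be closed (i.e. the sender in `drain_node` has been
/// dropped) before an empty wait list is allowed to end the loop.
async fn wait_drained(mut evicted: mpsc::Receiver<String>) {
    let mut waiting: HashSet<String> = HashSet::new();
    let mut more_coming = true;

    loop {
        if !more_coming && waiting.is_empty() {
            // No more pods can be added, and none are left: actually drained.
            break;
        }

        tokio::select! {
            maybe_name = evicted.recv(), if more_coming => {
                match maybe_name {
                    Some(name) => { waiting.insert(name); }
                    // The sender was dropped: no more pods will be evicted.
                    None => more_coming = false,
                }
            }
            name = next_deleted_pod() => {
                waiting.remove(&name);
            }
        }
    }
}
```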
Specifying these compute resources ensures that builds do not run on
Raspberry Pi nodes, but instead trigger the autoscaler to launch an EC2
instance to run them.
The term _controller_ has a specific meaning in the Kubernetes context, and
this process doesn't really fit it. It doesn't monitor any Kubernetes
resources, custom or otherwise. It does use Kubernetes as a data store
(via the lease), but I don't really think that counts. Anyway, the term
_coordinator_ fits better in my opinion.
If evicting a pod fails with an HTTP 429 Too Many Requests error, it
means there is a PodDisruptionBudget that prevents the pod from being
deleted. This can happen, for example, when draining a node that has
Longhorn volumes attached, as Longhorn creates a PDB for its instance
manager pods on such nodes. Longhorn will automatically remove the PDB
once there are no workloads on that node that use its Volumes, so we
must continue to evict other pods and try evicting the failed pods again
later. This behavior mostly mimics what `kubectl drain` does to handle
this same condition.
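A sketch of what that looks like, assuming kube-rs and its
`Api<Pod>::evict` helper (the surrounding types are illustrative):

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, EvictParams};

/// Outcome of a single eviction attempt.
enum Eviction {
    Evicted,
    /// A PodDisruptionBudget is currently blocking the eviction; retry it
    /// later while continuing with the other pods.
    Blocked,
}

async fn try_evict(pods: &Api<Pod>, name: &str) -> kube::Result<Eviction> {
    match pods.evict(name, &EvictParams::default()).await {
        Ok(_) => Ok(Eviction::Evicted),
        // 429 Too Many Requests: the eviction API's way of saying a PDB
        // would be violated right now.
        Err(kube::Error::Api(err)) if err.code == 429 => Ok(Eviction::Blocked),
        Err(other) => Err(other),
    }
}
```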
Whenever a lock request is made for a host that is a node in the current
Kubernetes cluster, the node will now be cordoned and all pods evicted
from it. The HTTP request will not return until all pods are gone,
making the lock request suitable for use in a system shutdown step.
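Cordoning itself is just marking the node unschedulable; a minimal sketch
assuming kube-rs (illustrative, not necessarily the server's exact code):

```rust
use k8s_openapi::api::core::v1::Node;
use kube::api::{Api, Patch, PatchParams};

/// Mark a node unschedulable, which is all `kubectl cordon` does.
async fn cordon(client: kube::Client, node_name: &str) -> kube::Result<Node> {
    let nodes: Api<Node> = Api::all(client);
    let patch = serde_json::json!({ "spec": { "unschedulable": true } });
    nodes
        .patch(node_name, &PatchParams::default(), &Patch::Merge(&patch))
        .await
}
```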
This commit introduces two HTTP endpoints:
* POST /api/v1/lock: Acquire a reboot lock
* POST /api/v1/unlock: Release a reboot lock
Both operations take a _multipart/form-data_ or
_application/x-www-form-urlencoded_ body with a required `hostname`
field. This field indicates the name of the host acquiring/releasing
the lock. The `lock` operation also takes an optional `wait` field. If
`wait` is given a `false` value, and the reboot lock cannot be acquired
immediately, the request will fail with an HTTP 409 Conflict. If a `true`
value is provided, or the field is omitted, the request will block until
the lock can be acquired.
Locking is implemented with a Kubernetes Lease resource using
Server-Side Apply. By setting the field manager for the apply to match
the `holderIdentity` value, we can ensure that there are no race
conditions in acquiring the lock; Kubernetes will reject the apply with a
conflict if the field is already owned by a different field manager (i.e.
a different holder). This is significantly safer than a naïve
check-then-set approach.
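A sketch of acquiring the lock this way, assuming kube-rs; the lease name
and namespace are made up for illustration, and note the absence of
`force`, so a conflicting owner makes the apply fail:

```rust
use k8s_openapi::api::coordination::v1::Lease;
use kube::api::{Api, Patch, PatchParams};

/// Try to take the reboot lock by applying `holderIdentity = hostname` with
/// `hostname` as the field manager.  If another manager (another host)
/// already owns the field, the apply fails with a conflict instead of
/// silently overwriting it.
async fn try_acquire(client: kube::Client, hostname: &str) -> kube::Result<Lease> {
    // Hypothetical lease name and namespace, purely for illustration.
    let leases: Api<Lease> = Api::namespaced(client, "reboot-coordinator");
    let lease = serde_json::json!({
        "apiVersion": "coordination.k8s.io/v1",
        "kind": "Lease",
        "metadata": { "name": "reboot-lock" },
        "spec": { "holderIdentity": hostname }
    });
    // No `.force()`: a conflict from another field manager is surfaced as an
    // error, which is exactly the "lock is already held" signal we want.
    let params = PatchParams::apply(hostname);
    leases.patch("reboot-lock", &params, &Patch::Apply(&lease)).await
}
```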
Since the API provided by this service is intended to be used on the
command line e.g. with `curl`, we need our responses to have a trailing
newline. This ensures that, when used interactively, the next shell
prompt is correctly placed on a new line, and when used
non-interactively, line-buffered output is correctly flushed (e.g. when
redirected to a log file).