18 Commits

cbe2f2cfd2 kubernetes: Add toleration for control plane nodes
We want to schedule reboots on control plane nodes just the same as
regular workers.
2025-10-13 10:32:25 -05:00
f6e8becc3a drain: Handle yet another race condition
Found another race condition: If the first pod evicted is deleted
quickly, before any other pods are evicted, the wait list will become
empty immediately, causing the `wait_drained` function to return too
early.

I've completely rewritten the `drain_node` function (again) to hopefully
handle all of these races.  Now, it's purely reactive: instead of
getting a list of pods to evict ahead of time, it uses the `Added`
events of the watch stream to determine the pods to evict.  As soon as a
pod is determined to be a candidate for eviction, it is added to the
wait list.  If eviction of a pod fails irrecoverably, that pod
is removed from the wait list, to prevent the loop from running forever.

This works because `Added` events for all current pods will arrive as
soon as the stream is opened.  `Deleted` events will start arriving once
all the `Added` events are processed.  The key difference between this
implementation and the previous one, though, is when pods are added to
the wait list.  Previously, we only added them to the list _after_ they
were evicted, but this made populating the list too slow.  Now, since we
add them to the list _before_ they are evicted, we can be sure the list
is never empty until every pod is deleted (or unable to be evicted at
all).
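
As a rough illustration of this reactive approach, here is a minimal sketch using the kube crate's raw watch API. The field selector, key format, and error handling are simplified assumptions rather than the coordinator's exact code (for instance, real draining also has to skip DaemonSet pods and retry evictions blocked by a PodDisruptionBudget):

```rust
use std::collections::HashSet;

use futures::{StreamExt, TryStreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, EvictParams, WatchEvent, WatchParams};
use kube::ResourceExt;

async fn drain_node(client: kube::Client, node: &str) -> kube::Result<()> {
    let all_pods: Api<Pod> = Api::all(client.clone());
    let wp = WatchParams::default().fields(&format!("spec.nodeName={node}"));
    let mut stream = all_pods.watch(&wp, "0").await?.boxed();

    let mut waiting: HashSet<String> = HashSet::new();
    while let Some(event) = stream.try_next().await? {
        match event {
            // `Added` events for every existing pod arrive as soon as the
            // stream opens; add each one to the wait list *before* evicting
            // it, so the list cannot empty out prematurely.
            WatchEvent::Added(pod) => {
                let ns = pod.namespace().unwrap_or_else(|| "default".into());
                let key = format!("{ns}/{}", pod.name_any());
                waiting.insert(key.clone());
                let pods: Api<Pod> = Api::namespaced(client.clone(), &ns);
                if let Err(err) = pods.evict(&pod.name_any(), &EvictParams::default()).await {
                    // Irrecoverable eviction failure: drop the pod from the
                    // wait list so the loop can still terminate.
                    eprintln!("giving up on {key}: {err}");
                    waiting.remove(&key);
                }
            }
            // A pod only leaves the wait list once it is actually gone; the
            // node is drained when the list empties again.
            WatchEvent::Deleted(pod) => {
                let ns = pod.namespace().unwrap_or_else(|| "default".into());
                waiting.remove(&format!("{ns}/{}", pod.name_any()));
                if waiting.is_empty() {
                    break;
                }
            }
            _ => {}
        }
    }
    Ok(())
}
```

This sketch also assumes the node has at least one pod; handling an already-empty node and the retry behavior are left out for brevity.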
2025-10-13 10:16:53 -05:00
07be7027f4 ci: Restart DaemonSet after build
The new `kubeRolloutRestart` function works similarly to the
`kubeRestartDeployment` function, but supports any kind of pod
controller, including DaemonSet.
2025-10-12 10:31:28 -05:00
46b26199b0 drain: Fix a race condition while waiting for pods
Using a channel to transfer the list of pods from the task that is
evicting the pods to the task that is waiting for them to be deleted
creates a race condition.  It is possible for the watch event stream to
handle the pod delete event _before_ the channel delivers the pod
identifier, so the pod gets added to the wait list _after_ it's already
been deleted.  This results in the `wait_drained` task waiting forever
for the pod to be deleted, even though it is already gone.

To address this, we need to construct the wait list in the `drain_node`
task, as we are evicting pods.  This way, we can be sure that every pod
that was evicted is in the wait list immediately.
2025-10-10 08:32:59 -05:00
48b19604fd Do not replace current process with reboot command
Instead of replacing the current process with the reboot command
directly via `exec`, we need to run it in a child process and keep
the current process running.  The former method has the interesting
side-effect of getting the machine into a state where it can never
reboot:

1. When the reboot sentinel file appears, the coordinator acquires the
   lock and drains the node, then `exec`s the reboot command.
2. The DaemonSet pod goes into _Completed_ state once the reboot command
   finishes.  If the reboot command starts the reboot process
   immediately, there is no issue, but if it starts a delayed reboot,
   trouble ensues.
3. After a timeout, Kubernetes restarts the DaemonSet pod, starting the
   coordinator process again.
4. The coordinator notices that the reboot sentinel already exists and
   immediately `exec`s the reboot command again.
5. The reboot command restarts the delayed reboot process, pushing the
   actual reboot time further into the future.
6. Return to step 2.

To break this loop, someone needs to either remove the reboot sentinel
file, letting the coordinator start up and run without doing anything,
or forcibly reboot the node.

We can avoid this loop by never exiting from the process managed by the
pod.  The reboot command runs and exits, but the parent process
continues until it's signalled to stop.
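
A minimal sketch of the difference, assuming tokio and using `systemctl reboot` as a stand-in for the configurable reboot command: the command runs in a child process, and the coordinator keeps running after it exits.

```rust
use tokio::process::Command;

async fn run_reboot_command() -> std::io::Result<()> {
    // Spawn the reboot command as a child and wait for it, instead of
    // exec-replacing the coordinator process with it.
    let status = Command::new("systemctl").arg("reboot").status().await?;
    if !status.success() {
        eprintln!("reboot command exited with {status}");
    }
    // The coordinator simply returns here and stays alive (e.g. waiting for
    // SIGTERM), so the DaemonSet pod never goes Completed and gets restarted.
    Ok(())
}
```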
2025-10-08 20:19:48 -05:00
40e55a984b Rewrite to run directly on nodes
After some initial testing, I decided that the HTTP API approach to
managing the reboot lock is not going to work.  I originally implemented
it this way so that the reboot process on the nodes could stay the same
as it had always been, only adding a systemd unit to interact with the
server to obtain the lock and drain the node.  Unfortunately, this does
not actually work in practice because there is no way to ensure that the
new unit runs _first_ during the shutdown process.  In fact, systemd
practically _insists_ on stopping all running containers before any
other units.  The only solution, therefore, is to obtain the reboot lock
and drain the node before initiating the actual shutdown procedure.

I briefly considered installing a script on each node to handle all of
this, and configuring _dnf-automatic_ to run that.  I decided against
that, though, as I would prefer to have as much of the node
configuration managed by Kubernetes as possible; I don't want to have
to maintain that script with Ansible.

I decided that the best way to resolve these issues was to rewrite the
coordinator as a daemon that runs on every node.  It waits for a
sentinel file to appear (`/run/reboot-needed` by default), and then
tries to obtain the reboot lock, drain the node, and reboot the machine.
All of the logic is contained in the daemon and deployed by Kubernetes;
the only change that has to be deployed by Ansible is configuring
_dnf-automatic_ to run `touch /run/reboot-needed` instead of `shutdown
-r +5`.

This implementation is heavily inspired by [kured](https://kured.dev).
Both rely on a sentinel file to trigger the reboot, but Kured uses a
naive polling method for detecting it, which either means wasting a lot
of CPU checking frequently, or introducing large delays by checking
infrequently.  Kured also implements the reboot lock without using a
Lease, which may or may not be problematic if multiple nodes try to
reboot simultaneously.
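
For illustration, here is a minimal sketch of non-polling sentinel detection, assuming the `notify` crate (inotify on Linux); the daemon's actual detection mechanism and configuration may differ:

```rust
use std::path::Path;
use std::sync::mpsc::channel;

use notify::{RecursiveMode, Watcher};

fn wait_for_sentinel(sentinel: &Path) -> notify::Result<()> {
    let (tx, rx) = channel();
    let mut watcher = notify::recommended_watcher(tx)?;
    // Watch the parent directory so we are told when the file gets created.
    let dir = sentinel.parent().unwrap_or(Path::new("/run"));
    watcher.watch(dir, RecursiveMode::NonRecursive)?;

    // The loop condition also covers a sentinel that already exists when the
    // daemon starts up.
    while !sentinel.exists() {
        let _event = rx.recv().expect("watcher dropped");
    }
    // From here: acquire the reboot lock, drain the node, run the reboot
    // command.
    Ok(())
}
```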
2025-10-08 12:41:05 -05:00
d4638239b3 drain: Wait for outer loop to complete
There was a race condition while waiting for a node to be drained,
especially if there are pods that cannot be evicted immediately when the
wait starts.  It was possible for the `wait_drained` function to return
before all of the pods had been deleted, if the wait list temporarily
became empty at some point.  This could happen, for example, if multiple
`WatchEvent` messages were processed from the stream before any messages
were processed from the channel; even though there were pod identifiers
waiting in the channel to be added to the wait list, if the wait list
became empty after processing the watch events, the loop would complete.
This is made much more likely if a PodDisruptionBudget temporarily
prevents a pod from being evicted; it could take 5 or more seconds for
that pod's identifier to be pushed to the channel, and in that time, the
rest of the pods could be deleted.

To resolve this, we need to ensure that the `wait_drained` function
never returns until the sender side of the channel is dropped.  This
way, we are sure that no more pods will be added to the wait list, so
when it gets emptied, we are sure we are actually done.
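
A minimal sketch of that invariant, with both the evicted-pod reports and the deletions modeled as tokio channels for brevity (the real code reads deletions from the watch stream): `wait_drained` only treats an empty wait list as drained once the eviction side has hung up.

```rust
use std::collections::HashSet;

use tokio::sync::mpsc;

async fn wait_drained(mut evicted: mpsc::Receiver<String>, mut deleted: mpsc::Receiver<String>) {
    let mut waiting: HashSet<String> = HashSet::new();
    let mut senders_done = false;
    loop {
        tokio::select! {
            evt = evicted.recv(), if !senders_done => match evt {
                Some(key) => { waiting.insert(key); }
                // `None` means the sender side was dropped, so no more pods
                // will ever be added to the wait list.
                None => senders_done = true,
            },
            Some(key) = deleted.recv() => { waiting.remove(&key); },
            else => break,
        }
        // Only when both conditions hold can we be sure the node is drained.
        if senders_done && waiting.is_empty() {
            break;
        }
    }
}
```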
2025-09-29 07:08:12 -05:00
93c86f0fe3 ci: Restart production deployment after build
2025-09-27 09:46:05 -05:00
2596864c8c ci: Begin Jenkins build pipeline
Specifying these compute resources ensures that builds do not run on
Raspberry Pi nodes, but instead trigger the autoscaler to launch an EC2
instance to run them.
2025-09-26 10:23:12 -05:00
0f57f2c582 kubernetes: Add example manifests 2025-09-25 18:23:43 -05:00
976518dd03 Rename controller → coordinator
The term _controller_ has a specific meaning in a Kubernetes context, and
this process doesn't really fit it.  It doesn't monitor any Kubernetes
resources, custom or otherwise.  It does use Kubernetes as a data store
(via the lease), but I don't really think that counts.  Anyway, the term
_coordinator_ fits better in my opinion.
2025-09-25 18:03:41 -05:00
afb0f53c7b Add container build script 2025-09-25 18:01:39 -05:00
d937bd6fb2 drain: Retry failed evictions
If evicting a pod fails with an HTTP 429 Too Many Requests error, it
means there is a PodDisruptionBudget that prevents the pod from being
deleted.  This can happen, for example, when draining a node that has
Longhorn volumes attached, as Longhorn creates a PDB for its instance
manager pods on such nodes.  Longhorn will automatically remove the PDB
once there are no workloads on that node that use its Volumes, so we
must continue to evict other pods and try evicting the failed pods again
later.  This behavior mostly mimics what `kubectl drain` does to handle
this same condition.
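
As a rough sketch of that behavior (the back-off interval and function name are illustrative, not the coordinator's actual configuration), an eviction can simply be retried whenever the API returns 429:

```rust
use std::time::Duration;

use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, EvictParams};

async fn evict_with_retry(pods: &Api<Pod>, name: &str) -> kube::Result<()> {
    loop {
        match pods.evict(name, &EvictParams::default()).await {
            Ok(_) => return Ok(()),
            // 429 Too Many Requests: a PodDisruptionBudget is currently
            // blocking this eviction (e.g. Longhorn's instance-manager PDB).
            // Back off and try again, as `kubectl drain` does.
            Err(kube::Error::Api(resp)) if resp.code == 429 => {
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
            Err(err) => return Err(err),
        }
    }
}
```

The real logic keeps evicting other pods in the meantime rather than blocking on a single one; the status-code check is the key piece.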
2025-09-25 12:26:00 -05:00
7d8ee51016 drain: Add tools to drain pods from nodes on lock
Whenever a lock request is made for a host that is a node in the current
Kubernetes cluster, the node will now be cordoned and all pods evicted
from it.  The HTTP request will not return until all pods are gone,
making the lock request suitable for use in a system shutdown step.
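
The cordon half of that is small. A minimal sketch, assuming the hostname in the lock request matches the Kubernetes node name:

```rust
use k8s_openapi::api::core::v1::Node;
use kube::api::{Api, Patch, PatchParams};
use serde_json::json;

async fn cordon(client: kube::Client, node_name: &str) -> kube::Result<Node> {
    let nodes: Api<Node> = Api::all(client);
    // Mark the node unschedulable (the same thing `kubectl cordon` does)
    // before its pods are evicted.
    let patch = json!({ "spec": { "unschedulable": true } });
    nodes
        .patch(node_name, &PatchParams::default(), &Patch::Merge(&patch))
        .await
}
```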
2025-09-24 19:38:39 -05:00
2ea03c6670 lock: Return message text on success
Since the lock API is intended to be used from command-line utilities
and shell scripts, we should return a helpful message when successful.
2025-09-24 14:44:17 -05:00
4bb72900fa Begin lock/unlock implementation
This commit introduces two HTTP path operations:

* POST /api/v1/lock: Acquire a reboot lock
* POST /api/v1/unlock: Release a reboot lock

Both operations take a _multipart/form-data_ or
_application/x-www-form-urlencoded_ body with a required `hostname`
field.  This field indicates the name of the host acquiring/releasing
the lock.  the `lock` operation also takes an optional `wait` field.  If
this value is provided with a `false` value, and the reboot lock cannot
be acquired immediately, the request will fail with an HTTP 419
conflict.  If a `true` value is provided, or the field is omitted, the
request will block until the lock can be acquired.

Locking is implemented with a Kubernetes Lease resource using
Server-Side Apply.  By setting the field manager of the `holderIdentity`
field to match its value, we can ensure that there are no race
conditions in acquiring the lock; Kubernetes will reject the update with
a conflict unless the new value and the field manager match the existing
ones.  This is
significantly safer than a more naïve check-then-set approach.
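
A minimal sketch of that locking technique, assuming a Lease named `reboot-lock` in some writable namespace (the resource name and namespace here are illustrative): because the field manager is set to the hostname itself, a second host's apply fails with a 409 Conflict instead of silently overwriting the holder.

```rust
use k8s_openapi::api::coordination::v1::Lease;
use kube::api::{Api, Patch, PatchParams};
use serde_json::json;

async fn try_acquire(client: kube::Client, hostname: &str) -> kube::Result<bool> {
    let leases: Api<Lease> = Api::namespaced(client, "default");
    let apply = json!({
        "apiVersion": "coordination.k8s.io/v1",
        "kind": "Lease",
        "metadata": { "name": "reboot-lock" },
        "spec": { "holderIdentity": hostname }
    });
    // Field manager == holder identity, and no `.force()`: a Server-Side
    // Apply conflict is exactly how mutual exclusion is enforced here.
    let params = PatchParams::apply(hostname);
    match leases.patch("reboot-lock", &params, &Patch::Apply(&apply)).await {
        Ok(_) => Ok(true),
        Err(kube::Error::Api(resp)) if resp.code == 409 => Ok(false),
        Err(err) => Err(err),
    }
}
```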
2025-09-24 08:25:32 -05:00
2a10c815be catcher: Add trailing newline to body text
Since the API provided by this service is intended to be used on the
command line e.g. with `curl`, we need our responses to have a trailing
newline.  This ensures that, when used interactively, the next shell
prompt is correctly placed on a new line, and when used
non-interactively, line-buffered output is correctly flushed (e.g. to a
log file).
2025-09-24 08:17:06 -05:00
6f7160fc02 Initial commit 2025-09-24 08:17:03 -05:00