2 Commits

48b19604fd Do not replace current process with reboot command
Instead of replacing the current process with the reboot command
directly via `exec`, we need to run it in a child process and keep
the current process running.  The former method has the interesting
side-effect of getting the machine into a state where it can never
reboot:

1. When the reboot sentinel file appears, the coordinator acquires the
   lock and drains the node, then `exec`s the reboot command.
2. The DaemonSet pod goes into _Completed_ state once the reboot command
   finishes.  If the reboot command starts the reboot process
   immediately, there is no issue, but if it starts a delayed reboot,
   trouble ensues.
3. After a timeout, Kubernetes restarts the DaemonSet pod, starting the
   coordinator process again.
4. The coordinator notices that the reboot sentinel already exists and
   immediately `exec`s the reboot command again.
5. The reboot command restarts the delayed reboot process, pushing the
   actual reboot time further into the future.
6. Return to step 2.

To break this loop, someone needs to either remove the reboot sentinel
file, letting the coordinator start up and run without doing anything,
or forcibly reboot the node.

We can avoid this loop by never exiting from the process managed by the
pod.  The reboot command runs and exits, but the parent process
continues until it's signalled to stop.
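A minimal sketch of the fix in Python (the coordinator's actual language and function names are assumptions here): the reboot command runs in a child process via `subprocess.run`, and the parent then blocks on an event until it is signalled to stop, so the pod never goes _Completed_.

```python
import signal
import subprocess
import threading

def trigger_reboot(cmd):
    """Run the reboot command in a child process instead of exec'ing it.

    subprocess.run() forks a child and waits for it to exit; the
    coordinator process itself keeps running, so the pod never enters
    the Completed state and Kubernetes never restarts it into the
    re-exec loop described above.
    """
    return subprocess.run(cmd).returncode

def serve_until_stopped():
    # Hypothetical main loop: after triggering the (possibly delayed)
    # reboot, block until signalled rather than exiting.
    stop = threading.Event()
    signal.signal(signal.SIGTERM, lambda *_: stop.set())
    trigger_reboot(["/bin/true"])  # stand-in for the real reboot command
    stop.wait()
```

Even if the reboot command schedules a delayed reboot and returns immediately, the parent stays alive, so the sentinel check never re-runs.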
2025-10-08 20:19:48 -05:00
40e55a984b Rewrite to run directly on nodes
After some initial testing, I decided that the HTTP API approach to
managing the reboot lock is not going to work.  I originally implemented
it this way so that the reboot process on the nodes could stay the same
as it had always been, only adding a systemd unit to interact with the
server to obtain the lock and drain the node.  Unfortunately, this does
not actually work in practice because there is no way to ensure that the
new unit runs _first_ during the shutdown process.  In fact, systemd
practically _insists_ on stopping all running containers before any
other units.  The only solution, therefore, is to obtain the reboot lock
and drain the node before initiating the actual shutdown procedure.

I briefly considered installing a script on each node to handle all of
this, and configuring _dnf-automatic_ to run that.  I decided against
that, though, as I would prefer to have as much of the node
configuration managed by Kubernetes as possible; I don't want to have
to maintain that script with Ansible.

I decided that the best way to resolve these issues was to rewrite the
coordinator as a daemon that runs on every node.  It waits for a
sentinel file to appear (`/run/reboot-needed` by default), and then
tries to obtain the reboot lock, drain the node, and reboot the machine.
All of the logic is contained in the daemon and deployed by Kubernetes;
the only change that has to be deployed by Ansible is configuring
_dnf-automatic_ to run `touch /run/reboot-needed` instead of `shutdown
-r +5`.
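The dnf-automatic side might look like the following fragment. This is a sketch, not the actual deployed config; the `reboot` and `reboot_command` options exist in recent dnf-automatic versions (the default `reboot_command` is the familiar `shutdown -r +5 ...`), but check the documentation for your version.

```ini
# /etc/dnf/automatic.conf (fragment) -- hypothetical sketch
[commands]
apply_updates = yes
reboot = when-needed
# touch the sentinel instead of the default "shutdown -r +5 ..."
reboot_command = "touch /run/reboot-needed"
```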

This implementation is heavily inspired by [kured](https://kured.dev).
Both rely on a sentinel file to trigger the reboot, but Kured uses a
naive polling method for detecting it, which either means wasting a lot
of CPU checking frequently, or introducing large delays by checking
infrequently.  Kured also implements the reboot lock without using a
Lease, which may or may not be problematic if multiple nodes try to
reboot simultaneously.
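By contrast, a Lease-based lock can lean on the API server's optimistic concurrency. Here is a sketch of just the acquisition decision, assuming a dict mirroring the relevant `coordination.k8s.io/v1` Lease spec fields (`holderIdentity`, `renewTime`); a real implementation would write the Lease back with its `resourceVersion` so that two nodes racing for the lock cannot both win.

```python
import datetime

def try_acquire(lease, me, now, ttl_seconds=600):
    """Decide whether node `me` may take the reboot lock.

    `lease` mirrors a Lease spec: holderIdentity (str or None) and
    renewTime (datetime or None).  The lock is free if it has no
    holder, we already hold it, or the holder's renewTime is older
    than the TTL.  This is only the decision logic; the conflict
    detection itself comes from the API server rejecting stale writes.
    """
    holder = lease.get("holderIdentity")
    renew = lease.get("renewTime")
    expired = renew is None or (now - renew).total_seconds() > ttl_seconds
    if holder is None or holder == me or expired:
        lease["holderIdentity"] = me
        lease["renewTime"] = now
        return True
    return False
```

With this shape, a second node attempting to acquire a fresh lease is refused until the holder's renew time ages past the TTL.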
2025-10-08 12:41:05 -05:00