k8s-reboot-coordinator

dustin/k8s-reboot-coordinator

Dustin C. Hatch 48b19604fd

Do not replace current process with reboot command

Instead of replacing the current process with the reboot command
directly via `exec`, we need to run it in a child process and keep
the current process running.  The former method has the interesting
side-effect of getting the machine into a state where it can never
reboot:

1. When the reboot sentinel file appears, the coordinator acquires the
   lock and drains the node, then `exec`s the reboot command.
2. The DaemonSet pod goes into _Completed_ state once the reboot command
   finishes.  If the reboot command starts the reboot process
   immediately, there is no issue, but if it starts a delayed reboot,
   trouble ensues.
3. After a timeout, Kubernetes restarts the DaemonSet pod, starting the
   coordinator process again.
4. The coordinator notices that the reboot sentinel already exists and
   immediately `exec`s the reboot command again.
5. The reboot command restarts the delayed reboot process, pushing the
   actual reboot time further into the future.
6. Return to step 2.

To break this loop, someone needs to either remove the reboot sentinel
file, letting the coordinator start up and run without doing anything,
or forcibly reboot the node.

We can avoid this loop by never exiting from the process managed by the
pod.  The reboot command runs and exits, but the parent process
continues until it's signalled to stop.
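
As a rough illustration of the child-process approach (not the actual
contents of main.rs), the sketch below uses only the Rust standard
library.  The function names `run_reboot_command` and
`run_and_stay_alive`, and the `wait_for_shutdown` callback, are
hypothetical and stand in for whatever the coordinator really uses to
block until it is signalled.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Run the reboot command as a child process and wait for it to exit.
///
/// Unlike `exec`, which replaces the current process image, `status()`
/// spawns a child, so control returns to the caller afterwards.
fn run_reboot_command(program: &str, args: &[&str]) -> io::Result<ExitStatus> {
    Command::new(program).args(args).status()
}

/// Run the reboot command, then keep the parent process alive.
///
/// `wait_for_shutdown` is a hypothetical hook: it is assumed to block
/// until the pod is told to stop (for example, on SIGTERM).  Because the
/// parent does not exit when the child does, the pod never reaches the
/// Completed state and Kubernetes has no reason to restart it.
fn run_and_stay_alive(
    program: &str,
    args: &[&str],
    wait_for_shutdown: impl FnOnce(),
) -> io::Result<ExitStatus> {
    let status = run_reboot_command(program, args)?;
    wait_for_shutdown();
    Ok(status)
}
```

With this shape, a delayed reboot command exits straight away, but the
coordinator keeps running, so the command is not issued a second time
and the scheduled reboot time is not pushed back.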

2025-10-08 20:19:48 -05:00

main.rs
