The _fleetlock_ server drains all pods from a node before granting the
reboot lock to that node. Unfortunately, it does not wait for those
pods to be completely evicted. If some pods take too long to shut down,
they may get stuck in the `Terminating` state once the machine starts
rebooting. Those pods then cannot be replaced on another node while the
original one is offline, which largely defeats the purpose of using
fleetlock in the first place.
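
The missing step is a wait between "evict the pods" and "grant the
lock". As a rough illustration only (this is not the code from the
upstream patch, and `wait_for_pods_gone` and `list_pods` are
hypothetical names), the idea is a bounded polling loop:

```python
import time
from typing import Callable, Iterable


def wait_for_pods_gone(
    list_pods: Callable[[], Iterable[str]],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until list_pods() returns nothing or the timeout expires.

    list_pods is a hypothetical callback that returns the names of the
    pods still present on the node being drained. Returns True if the
    node emptied out in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        remaining = list(list_pods())
        if not remaining:
            return True  # Safe to grant the reboot lock.
        if clock() >= deadline:
            return False  # Pods are still terminating; do not grant it.
        sleep(interval)
```

The lock would only be granted when this returns `True`, so a slow pod
delays the reboot instead of being stranded in `Terminating`.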

It seems upstream has abandoned this project: there is an open [Pull
Request][0] that fixes this issue, and it has so far been ignored.

Fortunately, building a new container image containing the patch is easy
enough, so we can run our own patched build.

[0]: https://github.com/poseidon/fleetlock/pull/271

[fleetlock] is an implementation of the Zincati FleetLock reboot
coordination protocol. It only works for machines that are Kubernetes
nodes, but it does enable safe rolling updates for those machines.
Specifically, when a node acquires a lock (backed by a Kubernetes
Lease), fleetlock cordons that node and evicts pods from it. After the
node has rebooted into the new version of Fedora CoreOS, fleetlock
uncordons the node and releases the lock.

[fleetlock]: https://github.com/poseidon/fleetlock
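
For reference, the lock and unlock calls in the FleetLock protocol are
plain HTTP POSTs. Zincati normally sends them itself, so the sketch
below is only meant to show the shape of a request; the helper name
`fleetlock_request`, the example URL, and the `default` group are
assumptions made for illustration:

```python
import json
import urllib.request


def fleetlock_request(
    base_url: str, action: str, node_id: str, group: str = "default"
) -> urllib.request.Request:
    """Build (but do not send) a FleetLock request.

    action is "pre-reboot" to ask for the reboot lock, or
    "steady-state" to release it after the node is back up.
    """
    if action not in ("pre-reboot", "steady-state"):
        raise ValueError(f"unknown FleetLock action: {action}")
    body = json.dumps({"client_params": {"id": node_id, "group": group}})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/{action}",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            # The protocol requires this header on every request.
            "fleet-lock-protocol": "true",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it; a
successful response to `pre-reboot` means the node holds the lock and
may reboot.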