Dustin C. Hatch f6e8becc3a
drain: Handle yet another race condition
Found another race condition: If the first pod evicted is deleted
quickly, before any other pods are evicted, the wait list will become
empty immediately, causing the `wait_drained` function to return too
early.

I've completely rewritten the `drain_node` function (again) to hopefully
handle all of these races.  Now, it's purely reactive: instead of
getting a list of pods to evict ahead of time, it uses the `Added`
events of the watch stream to determine the pods to evict.  As soon as a
pod is determined to be a candidate for eviction, it is added to the
wait list.  If eviction of a pod fails irrecoverably, that pod is
removed from the wait list to prevent the loop from running forever.

This works because `Added` events for all current pods will arrive as
soon as the stream is opened.  `Deleted` events will start arriving once
all the `Added` events are processed.  The key difference between this
implementation and the previous one, though, is when pods are added to
the wait list.  Previously, we only added them to the list _after_ they
were evicted, but this made populating the list too slow.  Now, since we
add them to the list _before_ they are evicted, we can be sure the list
is never empty until every pod is deleted (or unable to be evicted at
all).
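
For illustration, here is a minimal sketch of such a reactive drain loop.
It assumes the Rust `kube` crate's raw watch API (`WatchEvent::Added`/`Deleted`
with `WatchParams`, and `Api::<Pod>::evict`); the `is_eviction_candidate`
helper and the error handling are hypothetical placeholders, not the actual
k8s-reboot-coordinator code.

```rust
use std::collections::HashSet;

use futures::{StreamExt, TryStreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::api::{Api, EvictParams, WatchEvent, WatchParams};

async fn drain_node(client: kube::Client, node: &str) -> anyhow::Result<()> {
    let pods: Api<Pod> = Api::all(client);
    // For brevity this watches all pods and filters client-side; a
    // spec.nodeName field selector would normally narrow the watch.
    let mut stream = pods.watch(&WatchParams::default(), "0").await?.boxed();

    // Pods we are still waiting on.  Each candidate is added *before* its
    // eviction is attempted, so the set cannot drain to empty while
    // evictions are still pending.
    let mut waiting: HashSet<String> = HashSet::new();
    let mut started = false;

    while let Some(event) = stream.try_next().await? {
        match event {
            // Added events for all current pods arrive as soon as the
            // stream is opened, before any Deleted events.
            WatchEvent::Added(pod) => {
                let on_node = pod
                    .spec
                    .as_ref()
                    .and_then(|s| s.node_name.as_deref())
                    == Some(node);
                if !on_node || !is_eviction_candidate(&pod) {
                    continue;
                }
                let name = pod.metadata.name.clone().unwrap_or_default();
                waiting.insert(name.clone());
                started = true;
                if let Err(err) = pods.evict(&name, &EvictParams::default()).await {
                    // Treating every eviction error as irrecoverable is a
                    // simplification; the real code would retry failures
                    // such as a blocking PodDisruptionBudget (HTTP 429).
                    eprintln!("giving up on {name}: {err}");
                    waiting.remove(&name);
                }
            }
            WatchEvent::Deleted(pod) => {
                if let Some(name) = pod.metadata.name.as_deref() {
                    waiting.remove(name);
                }
            }
            _ => {}
        }
        // Done once every candidate seen so far has been deleted or given
        // up on.  (A node with no candidates would wait until the watch
        // expires; a real implementation would handle that explicitly.)
        if started && waiting.is_empty() {
            break;
        }
    }
    Ok(())
}

// Placeholder predicate: the real check would also skip mirror pods,
// DaemonSet-managed pods, and so on.
fn is_eviction_candidate(pod: &Pod) -> bool {
    pod.metadata.deletion_timestamp.is_none()
}
```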