k8s-reboot-coordinator

Commit Graph

Author	SHA1	Message	Date
Dustin	976518dd03	Rename controller → coordinator The term _controller_ has a specific meaning in Kubernetes context, and this process doesn't really fit it. It doesn't monitor any Kubernetes resources, custom or otherwise. It does use Kubernetes as a data store (via the lease), but I don't really think that counts. Anyway, the term _coordinator_ fits better in my opinion.	2025-09-25 18:03:41 -05:00
Dustin	d937bd6fb2	drain: Retry failed evictions If evicting a pod fails with an HTTP 429 Too Many Requests error, it means there is a PodDisruptionBudget that prevents the pod from being deleted. This can happen, for example, when draining a node that has Longhorn volumes attached, as Longhorn creates a PDB for its instance manager pods on such nodes. Longhorn will automatically remove the PDB once there are no workloads on that node that use its volumes, so we must continue to evict other pods and try evicting the failed pods again later. This behavior mostly mimics what `kubectl drain` does to handle this same condition.	2025-09-25 12:26:00 -05:00
Dustin	7d8ee51016	drain: Add tools to drain pods from nodes on lock Whenever a lock request is made for a host that is a node in the current Kubernetes cluster, the node will now be cordoned and all pods evicted from it. The HTTP request will not return until all pods are gone, making the lock request suitable for use in a system shutdown step.	2025-09-24 19:38:39 -05:00
Dustin	2ea03c6670	lock: Return message text on success Since the lock API is intended to be used from command-line utilities and shell scripts, we should return a helpful message when successful.	2025-09-24 14:44:17 -05:00
Dustin	4bb72900fa	Begin lock/unlock implementation This commit introduces two HTTP path operations: * POST /api/v1/lock: Acquire a reboot lock * POST /api/v1/unlock: Release a reboot lock Both operations take a _multipart/form-data_ or _application/x-www-form-urlencoded_ body with a required `hostname` field. This field indicates the name of the host acquiring/releasing the lock. The `lock` operation also takes an optional `wait` field. If this value is provided with a `false` value, and the reboot lock cannot be acquired immediately, the request will fail with an HTTP 409 Conflict. If a `true` value is provided, or the field is omitted, the request will block until the lock can be acquired. Locking is implemented with a Kubernetes Lease resource using Server-Side Apply. By setting the field manager of the `holderIdentity` field to match its value, we can ensure that there are no race conditions in acquiring the lock; Kubernetes will reject the update if both the new value and the field manager do not match. This is significantly safer than a more naïve check-then-set approach.	2025-09-24 08:25:32 -05:00
Dustin	2a10c815be	catcher: Add trailing newline to body text Since the API provided by this service is intended to be used on the command line e.g. with `curl`, we need our responses to have a trailing newline. This ensures that, when used interactively, the next shell prompt is correctly placed on a new line, and when used non-interactively, line-buffered output is correctly flushed (i.e. to a log file).	2025-09-24 08:17:06 -05:00
Dustin	6f7160fc02	Initial commit	2025-09-24 08:17:03 -05:00

7 Commits (master)
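
The retry behavior described in commit d937bd6fb2 can be illustrated with a small sketch. This is not the project's code: `drain`, `make_fake_cluster`, and the `evict` callable are hypothetical names introduced here, and the fake cluster only imitates a PodDisruptionBudget that clears once the other workloads are gone. The idea is to keep evicting the remaining pods when one eviction returns HTTP 429, then try the blocked pods again in a later pass.

```python
from collections import deque

TOO_MANY_REQUESTS = 429  # eviction refused because of a PodDisruptionBudget


def drain(pods, evict, max_rounds=10):
    """Evict every pod, re-queuing any pod whose eviction returns 429.

    `evict` is a hypothetical callable that takes a pod name and returns
    an HTTP status code. Returns the pods still blocked after
    `max_rounds` passes (an empty list means the drain finished).
    """
    pending = deque(pods)
    for _ in range(max_rounds):
        if not pending:
            break
        blocked = deque()
        while pending:
            pod = pending.popleft()
            status = evict(pod)
            if status == TOO_MANY_REQUESTS:
                # A PDB is in the way; move on and retry this pod later.
                blocked.append(pod)
            elif status >= 400:
                raise RuntimeError(f"evicting {pod} failed with HTTP {status}")
        # A real implementation would pause here before the next pass.
        pending = blocked
    return list(pending)


def make_fake_cluster(workloads, guarded, attempts):
    """Build a stand-in `evict` (assumption, for illustration only).

    The `guarded` pod answers 429 until every pod in `workloads` has
    been evicted, loosely imitating a PDB that is removed once nothing
    on the node depends on it.
    """
    remaining = set(workloads)

    def evict(pod):
        attempts.append(pod)
        if pod == guarded and remaining:
            return TOO_MANY_REQUESTS
        remaining.discard(pod)
        return 200

    return evict


attempts = []
evict = make_fake_cluster({"web", "db"}, "instance-manager", attempts)
leftover = drain(["instance-manager", "web", "db"], evict)
print(leftover)   # pods that could not be evicted
print(attempts)   # order in which evictions were attempted
```

In this sketch the guarded pod is attempted first, refused, and then evicted on the second pass after the other two pods are gone. Bounding the number of passes is a choice made for the example; the commit message does not say whether the real coordinator gives up at some point.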