Commit Graph

11 Commits (master)

Author SHA1 Message Date
Dustin 93c86f0fe3 ci: Restart production deployment after build
2025-09-27 09:46:05 -05:00
Dustin 2596864c8c ci: Begin Jenkins build pipeline
Specifying these compute resources ensures that builds do not run on
Raspberry Pi nodes, but instead trigger the autoscaler to launch an EC2
instance to run them.
2025-09-26 10:23:12 -05:00
Dustin 0f57f2c582 kubernetes: Add example manifests 2025-09-25 18:23:43 -05:00
Dustin 976518dd03 Rename controller → coordinator
The term _controller_ has a specific meaning in Kubernetes context, and
this process doesn't really fit it.  It doesn't monitor any Kubernetes
resources, custom or otherwise.  It does use Kubernetes as a data store
(via the lease), but I don't really think that counts.  Anyway, the term
_coordinator_ fits better in my opinion.
2025-09-25 18:03:41 -05:00
Dustin afb0f53c7b Add container build script 2025-09-25 18:01:39 -05:00
Dustin d937bd6fb2 drain: Retry failed evictions
If evicting a pod fails with an HTTP 429 Too Many Requests error, it
means there is a PodDisruptionBudget that prevents the pod from being
deleted.  This can happen, for example, when draining a node that has
Longhorn volumes attached, as Longhorn creates a PDB for its instance
manager pods on such nodes.  Longhorn will automatically remove the PDB
once there are no workloads on that node that use its Volumes, so we
must continue to evict other pods and try evicting the failed pods again
later.  This behavior mostly mimics what `kubectl drain` does to handle
this same condition.
2025-09-25 12:26:00 -05:00
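
A minimal sketch of the retry behavior described in d937bd6fb2, written in Python purely for illustration; the coordinator's own code is not shown on this page. The names `drain_with_retry` and `evict`, the `delay` and `max_rounds` parameters, and the choice to treat a 404 as "pod already gone" are all assumptions.

```python
import time
from typing import Callable, Iterable, Optional


def drain_with_retry(
    pods: Iterable[str],
    evict: Callable[[str], int],
    delay: float = 5.0,
    max_rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Evict every pod, retrying those rejected with HTTP 429.

    `evict` is a hypothetical callback that requests eviction of one pod
    and returns the HTTP status code of the response.
    Returns the number of rounds that were needed.
    """
    pending = list(pods)
    rounds = 0
    while pending:
        rounds += 1
        blocked = []
        for pod in pending:
            status = evict(pod)
            if status == 429:
                # A PodDisruptionBudget currently forbids this eviction;
                # keep going with the other pods and come back later.
                blocked.append(pod)
            elif status >= 400 and status != 404:
                # Assumption: 404 means the pod is already gone.
                raise RuntimeError(f"evicting {pod} failed: HTTP {status}")
        if blocked:
            if max_rounds is not None and rounds >= max_rounds:
                raise TimeoutError(f"still blocked: {blocked}")
            sleep(delay)
        pending = blocked
    return rounds
```

The point of the loop is that a blocked pod does not stop the drain: the remaining pods are still evicted in the same round, which is what eventually lets the blocking PDB disappear.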
Dustin 7d8ee51016 drain: Add tools to drain pods from nodes on lock
Whenever a lock request is made for a host that is a node in the current
Kubernetes cluster, the node will now be cordoned and all pods evicted
from it.  The HTTP request will not return until all pods are gone,
making the lock request suitable for use in a system shutdown step.
2025-09-24 19:38:39 -05:00
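
The ordering described in 7d8ee51016 (cordon first, then evict, then block until the node is empty) can be sketched roughly as follows. This is an illustrative Python outline, not the project's code: `cordon`, `list_pods`, and `evict` are hypothetical callbacks standing in for Kubernetes API calls, and `list_pods` is assumed to return only the pods that are expected to leave the node.

```python
import time
from typing import Callable, List


def cordon_and_drain(
    node: str,
    cordon: Callable[[str], None],
    list_pods: Callable[[str], List[str]],
    evict: Callable[[str], None],
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Cordon `node`, evict its pods, and return only once none remain."""
    # Cordon first so the scheduler cannot place new pods on the node
    # while the existing ones are being evicted.
    cordon(node)
    for pod in list_pods(node):
        evict(pod)
    # Eviction is asynchronous: pods need time to terminate, so poll
    # until the node reports no pods before letting the caller proceed.
    while list_pods(node):
        sleep(poll_interval)
```

Because the function returns only after the last pod is gone, a caller that wraps it in an HTTP handler gets the blocking behavior that makes the lock request usable as a shutdown step.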
Dustin 2ea03c6670 lock: Return message text on success
Since the lock API is intended to be used from command-line utilities
and shell scripts, we should return a helpful message when successful.
2025-09-24 14:44:17 -05:00
Dustin 4bb72900fa Begin lock/unlock implementation
This commit introduces two HTTP path operations:

* POST /api/v1/lock: Acquire a reboot lock
* POST /api/v1/unlock: Release a reboot lock

Both operations take a _multipart/form-data_ or
_application/x-www-form-urlencoded_ body with a required `hostname`
field.  This field indicates the name of the host acquiring/releasing
the lock.  The `lock` operation also takes an optional `wait` field.  If
this field is set to `false`, and the reboot lock cannot
be acquired immediately, the request will fail with an HTTP 409
Conflict.  If a `true` value is provided, or the field is omitted, the
request will block until the lock can be acquired.
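
As an illustration of the request shape described above, here is a small Python sketch of a client that builds (but does not send) such a request using only the standard library. The helper name `build_lock_request` and the base URL in the usage note are assumptions; only the paths and the `hostname` and `wait` fields come from the commit message.

```python
from urllib import parse, request


def build_lock_request(
    base_url: str,
    hostname: str,
    wait: bool = True,
    operation: str = "lock",
) -> request.Request:
    """Build a POST request for /api/v1/lock or /api/v1/unlock."""
    if operation not in ("lock", "unlock"):
        raise ValueError(f"unknown operation: {operation}")
    fields = {"hostname": hostname}
    # `wait` only applies to the lock operation; omitting the field
    # means the server blocks until the lock can be acquired.
    if operation == "lock" and not wait:
        fields["wait"] = "false"
    body = parse.urlencode(fields).encode("ascii")
    return request.Request(
        f"{base_url.rstrip('/')}/api/v1/{operation}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the result to `urllib.request.urlopen` would send it; with `wait=False`, a failed acquisition would then surface as an `urllib.error.HTTPError`, since `urlopen` raises on 4xx responses.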

Locking is implemented with a Kubernetes Lease resource using
Server-Side Apply.  By setting the field manager of the `holderIdentity`
field to match its value, we can ensure that there are no race
conditions in acquiring the lock; Kubernetes will reject the update with
a conflict if both the new value and the field manager differ from the
current ones.  This is significantly safer than a more naïve
check-then-set approach.
2025-09-24 08:25:32 -05:00
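
The ownership rule that 4bb72900fa relies on can be illustrated with a toy in-memory model. This Python sketch does not talk to Kubernetes and greatly simplifies Server-Side Apply; `ToyLease`, `try_lock`, and `unlock` are hypothetical names, and the model tracks only the single `holderIdentity` field.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


class Conflict(Exception):
    """Raised when an apply would overwrite a field owned by another manager."""


@dataclass
class ToyLease:
    holder: Optional[str] = None          # stands in for holderIdentity
    managers: Set[str] = field(default_factory=set)

    def apply(self, holder: Optional[str], field_manager: str) -> None:
        if holder is None:
            # Applying without the field gives up this manager's ownership;
            # the field is dropped once nobody owns it.
            self.managers.discard(field_manager)
            if not self.managers:
                self.holder = None
        elif self.holder is None or self.holder == holder:
            # Unset field, or same value: the manager becomes a (co-)owner.
            self.holder = holder
            self.managers.add(field_manager)
        elif self.managers == {field_manager}:
            # The sole owner may change its own field.
            self.holder = holder
        else:
            # Different value and different manager: rejected.
            raise Conflict(f"holderIdentity is owned by {sorted(self.managers)}")


def try_lock(lease: ToyLease, hostname: str) -> bool:
    """Attempt to take the lock, using the hostname as the field manager."""
    try:
        lease.apply(hostname, field_manager=hostname)
    except Conflict:
        return False
    return True


def unlock(lease: ToyLease, hostname: str) -> None:
    lease.apply(None, field_manager=hostname)
```

Because the field manager always equals the value being written, a second host necessarily differs in both, so its apply is rejected in a single step rather than relying on a separate read followed by a write.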
Dustin 2a10c815be catcher: Add trailing newline to body text
Since the API provided by this service is intended to be used on the
command line, e.g. with `curl`, we need our responses to have a trailing
newline.  This ensures that, when used interactively, the next shell
prompt is correctly placed on a new line, and when used
non-interactively, line-buffered output is correctly flushed (e.g. to a
log file).
2025-09-24 08:17:06 -05:00
Dustin 6f7160fc02 Initial commit 2025-09-24 08:17:03 -05:00