projects: Add dynk8s page

projects
Dustin 2024-08-18 08:59:57 -05:00
parent 97a5cf4ac3
commit 62c4477478
4 changed files with 563 additions and 0 deletions

@@ -0,0 +1,164 @@
+++
title = "Dynamic Cloud Worker Nodes for On-Premises Kubernetes"
description = """\
Automatically launch EC2 instances as worker nodes in an on-premises Kubernetes
cluster when they are needed, and remove them when they are not
"""
[extra]
image = "projects/dynk8s/cloudcontainer.jpg"
+++
One of the first things I wanted to do with my Kubernetes cluster at home was
start using it for Jenkins jobs. With the [Kubernetes][0] plugin, Jenkins can
create ephemeral Kubernetes pods to use as worker nodes to execute builds.
Migrating all of my jobs to use this mechanism would allow me to get rid of the
static agents running on VMs and Raspberry Pis.

Getting the plugin installed and configured was relatively straightforward, and
defining pod templates for CI pipelines was simple enough. It did not take
long to migrate the majority of the jobs that can run on x86_64 machines. The
aarch64 jobs, though, needed some more attention.

It's no secret that Raspberry Pis are *slow*. They are fine for very light
use, or for dedicated single-application purposes, but trying to compile code,
especially Rust, on one is a nightmare. So, while I was redoing my Jenkins
jobs, I took the opportunity to try to find a better, faster solution.

Jenkins has an [Amazon EC2][1] plugin, which dynamically launches EC2 instances
to execute builds and terminates them when they are no longer needed. We use
this plugin at work, and it is a decent solution. I could configure Jenkins to
launch Graviton instances to build aarch64 code. Unfortunately, I would either
need to pre-create AMIs with all of the necessary build dependencies and run
the jobs directly on the worker nodes, or use the [Docker Pipeline][2] plugin
to run them in Docker containers. What I really wanted, though, was to be able
to use Kubernetes for all of the jobs, so I set out to find a way to
dynamically add cloud machines to my local Kubernetes cluster.

The [Cluster Autoscaler][3] is a component for Kubernetes that integrates with
cloud providers to automatically launch and terminate instances in response to
demand in the Kubernetes cluster. That is all it does, though; it does not
integrate with the Kubernetes API to perform TLS bootstrapping or register the
node in the cluster. The [Autoscaler FAQ][4] hints at how to handle this
limitation, though:

> Example: If you use `kubeadm` to provision your cluster, it is up to you to
> automatically execute `kubeadm join` at boot time via some script.

With that in mind, I set out to build a solution that uses the Cluster
Autoscaler, WireGuard, and `kubeadm` to automatically provision nodes in the
cloud to run Jenkins jobs on pods created by the Jenkins Kubernetes plugin.

[0]: https://plugins.jenkins.io/kubernetes
[1]: https://plugins.jenkins.io/ec2
[2]: https://plugins.jenkins.io/docker-workflow
[3]: https://github.com/kubernetes/autoscaler
[4]: https://github.com/kubernetes/autoscaler/blob/de560600991a5039fd9157b0eeeb39ec59247779/cluster-autoscaler/FAQ.md#how-does-scale-up-work

## Process
<div style="text-align: center;">
[![Sequence Diagram](sequence.svg)](sequence.svg)
</div>

1. When Jenkins starts running a job that is configured to run in a Kubernetes
Pod, it uses the job's pod template to create the Pod resource. It also
creates a worker node and waits for the JNLP agent in the pod to attach
itself to that node.
2. Kubernetes attempts to schedule the pod Jenkins created. If there is not a
node available, the scheduling fails.
3. The Cluster Autoscaler detects that scheduling the pod failed. It checks
the requirements for the pod, matches them to an EC2 Autoscaling Group, and
determines that scheduling would succeed if it increased the capacity of the
group.
4. The Cluster Autoscaler increases the desired capacity of the EC2 Autoscaling
Group, launching a new EC2 instance.
5. Amazon EventBridge sends a notification, via Amazon Simple Notification
Service, to the provisioning service, indicating that a new EC2 instance has
started.
6. The provisioning service generates a `kubeadm` bootstrap token for the new
instance and stores it as a Secret resource in Kubernetes.
7. The provisioning service looks for an available Secret resource in
Kubernetes containing WireGuard configuration and marks it as assigned to
the new EC2 instance.
8. The EC2 instance, via a script executed by *cloud-init*, fetches the
WireGuard configuration assigned to it from the provisioning service.
9. The provisioning service searches for the Secret resource in Kubernetes
containing the WireGuard configuration assigned to the EC2 instance and
returns it in the HTTP response.
10. The *cloud-init* script on the EC2 instance uses the returned WireGuard
configuration to configure a WireGuard interface and connect to the VPN.
11. The *cloud-init* script on the EC2 instance generates a
[`JoinConfiguration`][7] document (sketched after this list) with cluster
discovery configuration pointing to the provisioning service and passes it to
`kubeadm join`.
12. The provisioning service looks up the Secret resource in Kubernetes
containing the bootstrap token assigned to the EC2 instance and generates a
*kubeconfig* file containing the cluster configuration information and that
token. The *kubeconfig* file is returned in the HTTP response.
13. `kubeadm join`, running on the EC2 instance, communicates with the
Kubernetes API server, over the WireGuard tunnel, to perform TLS
bootstrapping and configure the Kubelet as a worker node in the cluster.
14. When the Kubelet on the new EC2 instance is ready, Kubernetes detects that
the pod created by Jenkins can now be scheduled to run on it and instructs
the Kubelet to start the containers in the pod.
15. The Kubelet on the new EC2 instance starts the pod's containers. The JNLP
agent, running as one of the containers in the pod, connects to the Jenkins
controller.
16. Jenkins assigns the job run to the new agent, which executes the job.

[7]: https://kubernetes.io/docs/reference/config-api/kubeadm-config.v1beta3/#kubeadm-k8s-io-v1beta3-JoinConfiguration
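
As referenced in step 11, here is a minimal sketch of the kind of
`JoinConfiguration` the *cloud-init* script might generate; the provisioner
hostname and URL path are hypothetical:

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
discovery:
  file:
    # kubeadm accepts a URL here; the provisioning service returns a
    # kubeconfig containing the cluster details and the node's bootstrap
    # token (hypothetical endpoint)
    kubeConfigPath: https://provisioner.example.com/v1/cluster-config
```
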
## Components
### Jenkins Kubernetes Plugin
The [Kubernetes plugin][0] for Jenkins is responsible for dynamically creating
Kubernetes pods from templates associated with pipeline jobs. Jobs provide a
pod template that describes the containers and configuration they require in
order to run. Jenkins creates the corresponding resources using the Kubernetes
API.
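
As a rough illustration, a job's pod template is essentially an ordinary Pod
spec; something like the following, where the image, selector, and resource
numbers are made-up values:

```yaml
# Hypothetical pod template for an aarch64 build job. The Kubernetes plugin
# injects and manages the jnlp agent container itself.
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: rust
      image: rust:1.80            # hypothetical build image
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "2"
          memory: 2Gi
  nodeSelector:
    kubernetes.io/arch: arm64     # steer the pod to an aarch64 node
```

The resource requests and node selector matter here: they are what the
Cluster Autoscaler later uses to decide whether, and in which group, a new
node is needed.
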
### Autoscaler
The [Cluster Autoscaler][3] is an optional Kubernetes component that integrates
with cloud provider APIs to create or destroy worker nodes. It does not handle
any configuration on the machines themselves (e.g. running `kubeadm join`),
but it does watch the cluster state and determines when to create new nodes
or destroy existing ones based on pending pod requests.
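
For the AWS provider, the autoscaler maps node groups onto EC2 Auto Scaling
groups via its command-line flags; a sketch of the relevant excerpt, with a
hypothetical group name and bounds:

```yaml
# Excerpt from a hypothetical cluster-autoscaler Deployment spec
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.2
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      # min:max:name of the EC2 Auto Scaling group (hypothetical group)
      - --nodes=0:4:jenkins-arm64-workers
      # scale back down once a node has been unneeded for this long
      - --scale-down-unneeded-time=10m
```
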
### cloud-init
[cloud-init][5] is a tool that comes pre-installed on most cloud machine images
(including the official Fedora AMIs) that can be used to automatically
provision machines when they are first launched. It can install packages,
create configuration files, run commands, etc.

[5]: https://cloud-init.io/
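
A sketch of the user-data that could drive the provisioning steps above;
the package list and provisioner URL are assumptions:

```yaml
#cloud-config
packages:
  - wireguard-tools        # package names vary by distribution
runcmd:
  # Fetch the WireGuard config assigned to this instance (hypothetical API)
  - curl -fsSL -o /etc/wireguard/wg0.conf https://provisioner.example.com/v1/wireguard-config
  - systemctl enable --now wg-quick@wg0
  # Join the cluster over the tunnel; join.yaml is the JoinConfiguration
  # sketched earlier, written out via write_files (omitted here)
  - kubeadm join --config /etc/kubeadm/join.yaml
```
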
### WireGuard
[WireGuard][6] is a simple and high-performance VPN protocol. It will provide
the cloud instances with connectivity back to the private network, and
therefore access to internal resources including the Kubernetes API.

Unfortunately, WireGuard is not particularly amenable to "dynamic" clients
(i.e. peers that come and go). This means either building custom tooling to
configure WireGuard peers on the fly, or pre-generating configuration for a
fixed number of peers and ensuring that no more than that number of instances
are ever online simultaneously.

[6]: https://www.wireguard.com/
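
With the pre-generated approach, each allocation is just a standard
`wg-quick` configuration file held in a Secret until an instance claims it;
all values below are placeholders:

```ini
# /etc/wireguard/wg0.conf on the EC2 instance (placeholder keys/addresses)
[Interface]
PrivateKey = <client-private-key>
Address = 10.90.0.11/32

[Peer]
# The on-premises WireGuard endpoint
PublicKey = <server-public-key>
Endpoint = vpn.example.com:51820
# Route the internal network (and thus the Kubernetes API) via the tunnel
AllowedIPs = 10.0.0.0/16
PersistentKeepalive = 25
```
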
### Provisioning Service
This is a custom piece of software that is responsible for provisioning
secrets, etc. for the dynamic nodes. Since it will be responsible for handing
out WireGuard keys, it will have to be accessible directly over the Internet.
It will have to authenticate requests somehow to ensure that they are from
authorized clients (i.e. EC2 nodes created by the k8s Autoscaler) before
generating any keys/tokens.
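
The bootstrap tokens it creates are ordinary kubeadm bootstrap-token Secrets;
roughly the following, with placeholder token values and a hypothetical label
recording which instance the token was issued for:

```yaml
apiVersion: v1
kind: Secret
metadata:
  # kubeadm requires the name to be bootstrap-token-<token-id>
  name: bootstrap-token-abcdef
  namespace: kube-system
  labels:
    # hypothetical label used by the provisioning service
    dynk8s.example.com/instance-id: i-0123456789abcdef0
type: bootstrap.kubernetes.io/token
stringData:
  token-id: abcdef
  token-secret: 0123456789abcdef
  expiration: "2024-08-19T00:00:00Z"
  usage-bootstrap-authentication: "true"
  usage-bootstrap-signing: "true"
```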


@@ -0,0 +1,36 @@
@startuml
' Provisioning flow: a pending Jenkins pod triggers an autoscaler scale-up;
' the new EC2 instance joins the cluster over WireGuard and runs the pod.
box Internal Network
participant Jenkins
participant Pod
participant Kubernetes
participant Autoscaler
participant Provisioner
Jenkins -> Kubernetes : Create Pod
Kubernetes -> Autoscaler : Scale Up
end box
Autoscaler -> AWS : Launch Instance
create "EC2 Instance"
AWS -> "EC2 Instance" : Start
AWS --> Provisioner : Instance Started
Provisioner -> Provisioner : Generate Bootstrap Token
Provisioner -> Kubernetes : Store Bootstrap Token
Provisioner -> Kubernetes : Allocate WireGuard Config
"EC2 Instance" -> Provisioner : Request WireGuard Config
Provisioner -> Kubernetes : Request WireGuard Config
Kubernetes -> Provisioner : Return WireGuard Config
Provisioner -> "EC2 Instance" : Return WireGuard Config
"EC2 Instance" -> "EC2 Instance" : Configure WireGuard
"EC2 Instance" -> Provisioner : Request Cluster Config
Provisioner -> "EC2 Instance" : Return Cluster Config
group WireGuard Tunnel
"EC2 Instance" -> Kubernetes : Request Certificate
Kubernetes -> "EC2 Instance" : Return Certificate
"EC2 Instance" -> Kubernetes : Join Cluster
Kubernetes -> "EC2 Instance" : Acknowledge Join
Kubernetes -> "EC2 Instance" : Schedule Pod
"EC2 Instance" -> Kubernetes : Pod Started
end
Kubernetes -> Jenkins : Pod Started
create Pod
Jenkins -> Pod : Execute job
@enduml
