diff --git a/content/projects/dynk8s/cloudcontainer.jpg b/content/projects/dynk8s/cloudcontainer.jpg new file mode 100644 index 0000000..838fc6d Binary files /dev/null and b/content/projects/dynk8s/cloudcontainer.jpg differ diff --git a/content/projects/dynk8s/index.md b/content/projects/dynk8s/index.md new file mode 100644 index 0000000..5d9346c --- /dev/null +++ b/content/projects/dynk8s/index.md @@ -0,0 +1,164 @@ ++++ +title = "Dynamic Cloud Worker Nodes for On-Premises Kubernetes" +description = """\ +Automatically launch EC2 instances as worker nodes in an on-premises Kubernetes +cluster when they are needed, and remove them when they are not +""" + +[extra] +image = "projects/dynk8s/cloudcontainer.jpg" ++++ + +One of the first things I wanted to do with my Kubernetes cluster at home was +start using it for Jenkins jobs. With the [Kubernetes][0] plugin, Jenkins can +create ephemeral Kubernetes pods to use as worker nodes to execute builds. +Migrating all of my jobs to use this mechanism would allow me to get rid of the +static agents running on VMs and Raspberry Pis. + +Getting the plugin installed and configured was relatively straightforward, and +defining pod templates for CI pipelines was simple enough. It did not take +long to migrate the majority of the jobs that can run on x86_64 machines. The +aarch64 jobs, though, needed some more attention. + +It's no secret that Raspberry Pis are *slow*. They are fine for very light +use, or for dedicated single-application purposes, but trying to compile code, +especially Rust, on one is a nightmare. So, while I was redoing my Jenkins +jobs, I took the opportunity to try to find a better, faster solution. + +Jenkins has an [Amazon EC2][1] plugin, which dynamically launches EC2 instances +to execute builds and terminates them when they are no longer needed. We use +this plugin at work, and it is a decent solution. I could configure Jenkins to +launch Graviton instances to build aarch64 code. 
Unfortunately, I would either +need to pre-create AMIs with all of the necessary build dependencies and run +the jobs directly on the worker nodes, or use the [Docker Pipeline][2] plugin +to run them in Docker containers. What I really wanted, though, was to be able +to use Kubernetes for all of the jobs, so I set out to find a way to +dynamically add cloud machines to my local Kubernetes cluster. + +The [Cluster Autoscaler][3] is a component for Kubernetes that integrates with +cloud providers to automatically launch and terminate instances in response to +demand in the Kubernetes cluster. That is all it does, though; it does not +integrate with the Kubernetes API to perform TLS bootstrapping or register the +node in the cluster. In the [Autoscaler FAQ][4], it hints at how to handle +this limitation, though: + +> Example: If you use `kubeadm` to provision your cluster, it is up to you to +> automatically execute `kubeadm join` at boot time via some script. + +With that in mind, I set out to build a solution that uses the Cluster +Autoscaler, WireGuard, and `kubeadm` to automatically provision nodes in the +cloud to run Jenkins jobs on pods created by the Jenkins Kubernetes plugin. + +[0]: https://plugins.jenkins.io/kubernetes +[1]: https://plugins.jenkins.io/ec2 +[2]: https://plugins.jenkins.io/docker-workflow +[3]: https://github.com/kubernetes/autoscaler +[4]: https://github.com/kubernetes/autoscaler/blob/de560600991a5039fd9157b0eeeb39ec59247779/cluster-autoscaler/FAQ.md#how-does-scale-up-work + + +## Process + +
+ +[![Sequence Diagram](sequence.svg)](sequence.svg) + +
+ + +1. When Jenkins starts running a job that is configured to run in a Kubernetes + Pod, it uses the job's pod template to create the Pod resource. It also + creates a worker node and waits for the JNLP agent in the pod to attach + itself to that node. +2. Kubernetes attempts to schedule the pod Jenkins created. If there is not a + node available, the scheduling fails. +3. The Cluster Autoscaler detects that scheduling the pod failed. It checks + the requirements for the pod, matches them to an EC2 Autoscaling Group, and + determines that scheduling would succeed if it increased the capacity of the + group. +4. The Cluster Autoscaler increases the desired capacity of the EC2 Autoscaling + Group, launching a new EC2 instance. +5. Amazon EventBridge sends a notification, via Amazon Simple Notification + Service, to the provisioning service, indicating that a new EC2 instance has + started. +6. The provisioning service generates a `kubeadm` bootstrap token for the new + instance and stores it as a Secret resource in Kubernetes. +7. The provisioning service looks for an available Secret resource in + Kubernetes containing WireGuard configuration and marks it as assigned to + the new EC2 instance. +8. The EC2 instance, via a script executed by *cloud-init*, fetches the + WireGuard configuration assigned to it from the provisioning service. +9. The provisioning service searches for the Secret resource in Kubernetes + containing the WireGuard configuration assigned to the EC2 instance and + returns it in the HTTP response. +10. The *cloud-init* script on the EC2 instance uses the returned WireGuard + configuration to configure a WireGuard interface and connect to the VPN. +11. The *cloud-init* script on the EC2 instance generates a + [`JoinConfiguration`][7] document with cluster discovery configuration + pointing to the provisioning service and passes it to `kubeadm join`. +12. 
The provisioning service looks up the Secret resource in Kubernetes + containing the bootstrap token assigned to the EC2 instance and generates a + *kubeconfig* file containing the cluster configuration information and that + token. The *kubeconfig* file is returned in the HTTP response. +13. `kubeadm join`, running on the EC2 instance, communicates with the + Kubernetes API server, over the WireGuard tunnel, to perform TLS + bootstrapping and configure the Kubelet as a worker node in the cluster. +14. When the Kubelet on the new EC2 instance is ready, Kubernetes detects that + the pod created by Jenkins can now be scheduled to run on it and instructs + the Kubelet to start the containers in the pod. +15. The Kubelet on the new EC2 instance starts the pod's containers. The JNLP + agent, running as one of the containers in the pod, connects to the Jenkins + controller. +16. Jenkins assigns the job run to the new agent, which executes the job. + +[7]: https://kubernetes.io/docs/reference/config-api/kubeadm-config.v1beta3/#kubeadm-k8s-io-v1beta3-JoinConfiguration + + +## Components + +### Jenkins Kubernetes Plugin + +The [Kubernetes plugin][0] for Jenkins is responsible for dynamically creating +Kubernetes pods from templates associated with pipeline jobs. Jobs provide a +pod template that describes the containers and configuration they require in +order to run. Jenkins creates the corresponding resources using the Kubernetes +API. + +### Autoscaler + +The [Cluster Autoscaler][3] is an optional Kubernetes component that integrates +with cloud provider APIs to create or destroy worker nodes. It does not handle +any configuration on the machines themselves (i.e. running `kubeadm join`), but +it does watch the cluster state and determine when to create or destroy new +nodes based on pod requests. 
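
To make the scale-up decision a little more concrete, here is a toy model of it in Python. This is only a sketch of the idea described above, not the Cluster Autoscaler's actual algorithm: the `PodRequest` and `NodeGroup` types, the group names, and the sizes are all made up for illustration. The only real identifier is the well-known `kubernetes.io/arch` node label.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PodRequest:
    """Resources and node selector of a pod that could not be scheduled."""
    cpu_millicores: int
    memory_mib: int
    node_selector: Dict[str, str] = field(default_factory=dict)


@dataclass
class NodeGroup:
    """Hypothetical stand-in for an EC2 Auto Scaling group and its node shape."""
    name: str
    labels: Dict[str, str]
    cpu_millicores: int
    memory_mib: int
    desired: int
    max_size: int


def fits(pod: PodRequest, group: NodeGroup) -> bool:
    """Would a fresh, empty node from this group be able to run the pod?"""
    selector_ok = all(
        group.labels.get(key) == value
        for key, value in pod.node_selector.items()
    )
    return (
        selector_ok
        and pod.cpu_millicores <= group.cpu_millicores
        and pod.memory_mib <= group.memory_mib
    )


def group_to_scale(pod: PodRequest, groups: List[NodeGroup]) -> Optional[NodeGroup]:
    """Pick the first group that has headroom and whose nodes fit the pod."""
    for group in groups:
        if group.desired < group.max_size and fits(pod, group):
            return group
    return None


groups = [
    NodeGroup("x86-builders", {"kubernetes.io/arch": "amd64"}, 4000, 8192, 0, 2),
    NodeGroup("arm-builders", {"kubernetes.io/arch": "arm64"}, 4000, 8192, 0, 2),
]
pod = PodRequest(2000, 4096, {"kubernetes.io/arch": "arm64"})

chosen = group_to_scale(pod, groups)
if chosen is not None:
    # Raising the desired capacity is what causes a new instance to launch.
    chosen.desired += 1
    print(f"scale {chosen.name} to {chosen.desired}")
```

The `max_size` check matters in this setup because the number of cloud nodes online at once is bounded by how many VPN peer configurations exist.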
+ +### cloud-init + +[cloud-init][5] is a tool that comes pre-installed on most cloud machine images +(including the official Fedora AMIs) that can be used to automatically +provision machines when they are first launched. It can install packages, +create configuration files, run commands, etc. + +[5]: https://cloud-init.io/ + +### WireGuard + +[WireGuard][6] is a simple and high-performance VPN protocol. It will provide +the cloud instances with connectivity back to the private network, and +therefore access to internal resources including the Kubernetes API. + +Unfortunately, WireGuard is not particularly amenable to "dynamic" clients +(i.e. peers that come and go). This means either building custom tooling +to configure WireGuard peers on the fly or pre-generating +configuration for a set number of peers and ensuring that no more than that +number of instances are ever online simultaneously. + +[6]: https://www.wireguard.com/ + +### Provisioning Service + +This is a custom piece of software that is responsible for provisioning +secrets, etc. for the dynamic nodes. Since it will be responsible for handing +out WireGuard keys, it will have to be accessible directly over the Internet. +It will have to authenticate requests somehow to ensure that they are from +authorized clients (i.e. EC2 nodes created by the k8s Autoscaler) before +generating any keys/tokens. 
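
The pre-generated WireGuard configuration approach boils down to a small allocation problem: hand each new instance one free configuration, and never hand out more than exist. Below is a minimal sketch of that bookkeeping in Python, assuming an in-memory dictionary in place of the Kubernetes Secret resources. The `ConfigPool` class, the `wg-peer-N` names, and the `release` step are assumptions for illustration and do not describe how the real service is written.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ConfigPool:
    """In-memory stand-in for the Secrets holding pre-generated configs."""

    # Maps a (hypothetical) config name to the EC2 instance ID it is
    # assigned to; None means the config is still available.
    assignments: Dict[str, Optional[str]] = field(default_factory=dict)

    def assign(self, instance_id: str) -> Optional[str]:
        """Reserve a free config for the instance, or return None if none is left."""
        # Return the existing assignment so a repeated notification is harmless.
        existing = self.lookup(instance_id)
        if existing is not None:
            return existing
        for name, owner in self.assignments.items():
            if owner is None:
                self.assignments[name] = instance_id
                return name
        return None

    def lookup(self, instance_id: str) -> Optional[str]:
        """Find the config already assigned to the instance, if any."""
        for name, owner in self.assignments.items():
            if owner == instance_id:
                return name
        return None

    def release(self, instance_id: str) -> None:
        """Assumed cleanup step: free the config when the instance goes away."""
        name = self.lookup(instance_id)
        if name is not None:
            self.assignments[name] = None


pool = ConfigPool({"wg-peer-0": None, "wg-peer-1": None})
first = pool.assign("i-aaa")    # instance-started notification
second = pool.assign("i-bbb")
third = pool.assign("i-ccc")    # pool exhausted, nothing to hand out
fetched = pool.lookup("i-aaa")  # what the instance's later HTTP request would get
```

Keeping the assignment and the lookup as separate steps mirrors the process above, where the notification reserves a configuration and the instance fetches it later.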
diff --git a/content/projects/dynk8s/sequence.plantuml b/content/projects/dynk8s/sequence.plantuml new file mode 100644 index 0000000..e0fec04 --- /dev/null +++ b/content/projects/dynk8s/sequence.plantuml @@ -0,0 +1,36 @@ +@startuml +box Internal Network +participant Jenkins +participant Pod +participant Kubernetes +participant Autoscaler +participant Provisioner +Jenkins -> Kubernetes : Create Pod +Kubernetes -> Autoscaler : Scale Up +end box +Autoscaler -> AWS : Launch Instance +create "EC2 Instance" +AWS -> "EC2 Instance" : Start +AWS --> Provisioner : Instance Started +Provisioner -> Provisioner : Generate Bootstrap Token +Provisioner -> Kubernetes : Store Bootstrap Token +Provisioner -> Kubernetes : Allocate WireGuard Config +"EC2 Instance" -> Provisioner : Request WireGuard Config +Provisioner -> Kubernetes : Request WireGuard Config +Kubernetes -> Provisioner : Return WireGuard Config +Provisioner -> "EC2 Instance" : Return WireGuard Config +"EC2 Instance" -> "EC2 Instance" : Configure WireGuard +"EC2 Instance" -> Provisioner : Request Cluster Config +Provisioner -> "EC2 Instance" : Return Cluster Config +group WireGuard Tunnel +"EC2 Instance" -> Kubernetes : Request Certificate +Kubernetes -> "EC2 Instance" : Return Certificate +"EC2 Instance" -> Kubernetes : Join Cluster +Kubernetes -> "EC2 Instance" : Acknowledge Join +Kubernetes -> "EC2 Instance" : Schedule Pod +"EC2 Instance" -> Kubernetes : Pod Started +end +Kubernetes -> Jenkins : Pod Started +create Pod +Jenkins -> Pod : Execute job +@enduml diff --git a/content/projects/dynk8s/sequence.svg b/content/projects/dynk8s/sequence.svg new file mode 100644 index 0000000..76a5423 --- /dev/null +++ b/content/projects/dynk8s/sequence.svg @@ -0,0 +1,363 @@ +[SVG markup lost in extraction; rendered sequence diagram generated from sequence.plantuml above] \ No newline at end of file