Commit Graph

13 Commits (f531b03e7ccab5dcea0ff512a658ac64ab26021a)

Author SHA1 Message Date
Dustin f531b03e7c tf/userdata: Use IMDSv2 tokens
The Fedora 40 AMIs require IMDSv2.  Our `kubeadm-join` script therefore
needs to fetch the auth token and include it with metadata requests.
2024-11-03 12:31:27 -06:00
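The IMDSv2 flow described in the commit above takes two steps: a `PUT` to the token endpoint, then a `GET` that carries the token in a header. The actual `kubeadm-join` script is not shown in this log, so the following is only a minimal Python sketch of the protocol. The function names and the default TTL are assumptions made for illustration.

```python
import urllib.request

# Link-local address of the EC2 instance metadata service.
IMDS = "http://169.254.169.254/latest"


def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Build the PUT request that asks IMDSv2 for a session token."""
    return urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def metadata_request(path: str, token: str) -> urllib.request.Request:
    """Build a GET request for a metadata path, authenticated by the token."""
    return urllib.request.Request(
        f"{IMDS}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch_metadata(path: str, timeout: float = 2.0) -> str:
    """Fetch one metadata value; only works from inside an EC2 instance."""
    with urllib.request.urlopen(token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(metadata_request(path, token), timeout=timeout) as resp:
        return resp.read().decode()
```

On an instance, `fetch_metadata("instance-id")` would return the instance ID. Without the token header, an IMDSv2-only instance rejects the request.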
Dustin 0ec109b088 tf/asg: Update to Fedora 40
Upstream changed the naming convention for Fedora AMIs.  It also seems
they've stopped publishing "release" artifacts; all the AMIs are now
date-stamped.  We should probably consider running `terraform apply`
periodically to keep up-to-date.
2024-11-03 12:31:11 -06:00
Dustin c63c4d9e8c tf/userdata: Taint node for Jenkins only
If a Jenkins job runs for a while, Kubernetes may eventually schedule
other Pods on its node.  If a long-running Pod gets assigned to the ephemeral
node, the Cluster Autoscaler won't be able to scale down the ASG.  To
prevent this, we apply a taint to the node so normal Pods will not get
assigned to it.  We have to apply the corresponding toleration to Pods
for Jenkins jobs.
2024-02-13 07:52:54 -06:00
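The taint and toleration pairing in the commit above can be illustrated with a small Python model of the Kubernetes matching rule. The taint key `example.com/jenkins` and its value are hypothetical, since the log does not name the real taint, and `tolerates` is a simplified sketch rather than the scheduler's implementation.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint (simplified rule)."""
    # An empty or missing effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # An empty key combined with Exists tolerates every taint.
        return key == "" or key == taint["key"]
    # Equal (the default operator): key and value must both match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


# Hypothetical taint applied to the ephemeral node at join time.
JENKINS_TAINT = {"key": "example.com/jenkins", "value": "true", "effect": "NoSchedule"}

# Hypothetical toleration added to the Pod spec of Jenkins jobs.
JENKINS_TOLERATION = {
    "key": "example.com/jenkins",
    "operator": "Equal",
    "value": "true",
    "effect": "NoSchedule",
}
```

A Pod without any toleration fails the check for this taint, so the scheduler keeps it off the node, and the node can drain once the Jenkins job finishes.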
Dustin 925d22b9d2 tf/userdata: Provision instance storage
The *c7gd.xlarge* instance type has a directly-attached NVMe disk.
Let's use it for Kubernetes Pod storage to increase performance a bit.
2024-02-13 07:50:43 -06:00
Dustin 6f279430c2 tf/asg: Use larger instance type
I'd rather spend a few extra pennies on beefier ephemeral worker nodes
to speed up builds.
2024-02-13 07:41:05 -06:00
Dustin 3c4f84e039 tf/userdata: Remove default CRI-O CNI config
Fedora AMIs have the default locale set to en_US.UTF-8, which sorts
`100-crio-bridge.conflist` before `10-calico.conflist`.  As a result,
Pods end up with incorrect network configuration, and cannot be reached
from other Pods on the container network.  Since we do not need the
default configuration, the easiest way to resolve this is to just delete
it.
2024-02-05 20:58:31 -06:00
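The ordering problem in the commit above can be reproduced in a few lines of Python. Plain code-point sorting puts `10-calico.conflist` first because `-` sorts before `0`. The second function is only a crude stand-in for the `en_US.UTF-8` collation, on the assumption that punctuation carries little weight there; it does not call the real locale machinery.

```python
NAMES = ["100-crio-bridge.conflist", "10-calico.conflist"]


def codepoint_order(names: list[str]) -> list[str]:
    """Sort by code point, as the C/POSIX locale would."""
    return sorted(names)


def punctuation_blind_key(name: str) -> str:
    """Drop punctuation so that only letters and digits are compared."""
    return "".join(ch for ch in name if ch.isalnum())


def punctuation_blind_order(names: list[str]) -> list[str]:
    """Approximate a collation that ignores punctuation on the first pass."""
    return sorted(names, key=punctuation_blind_key)
```

With punctuation ignored, the comparison becomes `100crio...` against `10calico...`, and the digit `0` sorts before the letter `c`, so the CRI-O bridge configuration comes first.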
Dustin f6910f04df tf/asg: Add CA resource tag for FUSE device plugin
Jenkins jobs that build container images in user namespaces need access
to `/dev/fuse`, which is provided by the [fuse-device-plugin][0].  This
plugin runs as a DaemonSet, which updates the status of the node it's
running on when it starts to indicate that the FUSE device is available.
When scaling up from zero nodes, Cluster Autoscaler has no way to know
that this will occur, and therefore cannot determine that scaling up the
ASG will create a node with the required resources.  Thus, the ASG needs
a tag to inform CA that the nodes it creates will indeed have the
resources and scaling it up will allow the pod to be scheduled.

Although this feature of CA was added in 1.14, it apparently got broken
at some point and no longer works in 1.22.  It works again in 1.26,
though.

[0]: https://github.com/kuberenetes-learning-group/fuse-device-plugin/tree/master
2024-01-14 11:42:46 -06:00
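Cluster Autoscaler reads such scale-from-zero hints from ASG tags whose keys begin with `k8s.io/cluster-autoscaler/node-template/resources/`. The sketch below builds one such tag as a Python dict shaped like a Terraform ASG `tag` block. The resource name `github.com/fuse` and the quantity `1` are assumptions, since the commit above does not quote the actual tag.

```python
# Tag key prefix that Cluster Autoscaler inspects on AWS Auto Scaling Groups.
NODE_TEMPLATE_PREFIX = "k8s.io/cluster-autoscaler/node-template/resources/"


def node_template_resource_tag(resource_name: str, quantity: str) -> dict:
    """Build an ASG tag advertising an extended resource to Cluster Autoscaler."""
    return {
        "key": NODE_TEMPLATE_PREFIX + resource_name,
        "value": quantity,
        "propagate_at_launch": True,
    }


# Assumed resource name and quantity, for illustration only.
FUSE_TAG = node_template_resource_tag("github.com/fuse", "1")
```

With such a tag in place, CA can assume that a new node from an empty ASG will offer the resource, even though the DaemonSet only registers it after the node starts.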
Dustin 5a79680b22 tf/userdata: Install CRI-O from Fedora base
The *cri-o* package has moved from its own module into the base Fedora
repository, as Fedora is [eliminating modules][0].  The last modular
version was 1.25, which is too old to run pods with user namespaces.
Version 1.26 is available in the base repository, which does support
user namespaces.

[0]: https://fedoraproject.org/wiki/Changes/RetireModularity
2024-01-13 10:10:46 -06:00
Dustin 473e279a18 tf/userdata: Remove default DNS configuration
Lately, cloud nodes seem to be failing to come up more frequently.  I
traced this to the fact that `/etc/resolv.conf` in the `kube-proxy`
container contains both the AWS-provided DNS server and the on-premises
server set by WireGuard.  This evidently "works" correctly sometimes,
but not always.  When it doesn't, the `kube-proxy` cannot resolve the
Kubernetes API server address, and thus cannot create the necessary
netfilter rules to forward traffic correctly.  This causes pods to be
unable to communicate.

I am not entirely sure what the "correct" solution to this problem would
be, since there are various issues in play here.  Fortunately, cloud
nodes are only ever around for a short time, and never need to be
rebooted.  As such, we can use a "quick fix" and simply remove the
AWS-provided DNS configuration.
2023-11-13 19:52:57 -06:00
Dustin 2f0f134223 terraform: userdata: Add Longhorn issue workaround
There's apparently a bug in open-iscsi (see
[issue #4988](https://github.com/longhorn/longhorn/issues/4988)) that
prevents Longhorn from working on Fedora 36+.  We need a SELinux policy
patch to work around it.
2023-01-10 21:09:46 -06:00
Dustin b01841ab72 terraform: Update node template to Fedora 36
2023-01-10 17:19:20 -06:00
Dustin e11f98b430 terraform: Add config for auto-scaling group
The Cluster Autoscaler uses EC2 Auto-Scaling Groups to configure the
instances it launches when it determines additional worker nodes are
necessary.  Auto-Scaling Groups have an associated Launch Template,
which describes the properties of the instances, such as AMI ID,
instance type, security groups, etc.

When instances are first launched, they need to be configured to join
the on-premises Kubernetes cluster.  This is handled by *cloud-init*
using the configuration in the instance user data.  The configuration
supplied here specifies the Fedora packages that need to be installed on
a Kubernetes worker node, plus some additional configuration required by
`kubeadm`, `kubelet`, and/or `cri-o`.  It also includes a script that
fetches the WireGuard client configuration and connects to the VPN,
finalizes the setup process, and joins the cluster.
2022-10-11 21:40:42 -05:00
Dustin 8e1165eb95 terraform: Begin AWS configuration
The `terraform` directory contains the resource descriptions for all AWS
services that need to be configured in order for the dynamic K8s
provisioner to work.  Specifically, it defines the EventBridge rule and
SNS topic/subscriptions that instruct AWS to send EC2 instance state
change notifications to the *dynk8s-provisioner*'s HTTP interface.
2022-09-27 12:58:51 -05:00
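An EventBridge rule of the kind described in the commit above selects events with an event pattern. The sketch below shows the standard pattern for EC2 instance state changes as a Python dict, plus a deliberately simplified matcher that handles only top-level exact-match fields. Whether the repository's rule uses exactly this pattern is an assumption, and the sample event is made up.

```python
import json

# Pattern selecting EC2 instance state-change events.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: each pattern field must list the event's value."""
    for field, allowed in pattern.items():
        if event.get(field) not in allowed:
            return False
    return True


# Made-up sample event in the shape EventBridge delivers.
SAMPLE_EVENT = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "running"},
}

# JSON form of the pattern, as it would appear in a rule definition.
PATTERN_JSON = json.dumps(EVENT_PATTERN)
```

Events that match are forwarded to the rule's target, here the SNS topic, whose subscription delivers them to the provisioner's HTTP interface.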