CRI-O now installs more `.conflist` files in `/etc/cni/net.d`. Their
presence interferes with Calico, so they need to be deleted in order to
have fully working Pod networking, especially for pods that start very
early (before Calico is completely ready).
Upstream changed the naming convention for Fedora AMIs. It also seems
they've stopped publishing "release" artifacts; all the AMIs are now
date-stamped. We should probably consider running `terraform apply`
periodically to keep up-to-date.
If a Jenkins job runs for a while, Kubernetes may eventually schedule
other Pods on the node where it is running. If a long-running Pod gets
assigned to the ephemeral node, the Cluster Autoscaler won't be able to
scale down the ASG. To prevent this, we apply a taint to the node so
that normal Pods will not be scheduled on it, and apply the
corresponding toleration to the Pods for Jenkins jobs.
Fedora AMIs have the default locale set to en_US.UTF-8, which sorts
`100-crio-bridge.conflist` before `10-calico.conflist`. As a result,
Pods end up with incorrect network configuration, and cannot be reached
from other Pods on the container network. Since we do not need the
default configuration, the easiest way to resolve this is to just delete
it.
The default root block device for Fedora EC2 instances is only 10 GiB.
This is insufficient for many jobs, especially those that build large
container images.
Jenkins jobs that build container images in user namespaces need access
to `/dev/fuse`, which is provided by the [fuse-device-plugin][0]. This
plugin runs as a DaemonSet which, when it starts, updates the status of
the node it is running on to indicate that the FUSE device is available.
When scaling up from zero nodes, Cluster Autoscaler has no way to know
that this will happen, and therefore cannot determine that scaling up
the ASG will create a node with the required resources. Thus, the ASG
needs a tag informing CA that the nodes it creates will indeed have
those resources, so scaling it up will allow the pod to be scheduled.
Although this feature of CA was added in 1.14, it apparently got broken
at some point and no longer works in 1.22. It works again in 1.26,
though.
[0]: https://github.com/kuberenetes-learning-group/fuse-device-plugin/tree/master
The *cri-o* package has moved from its own module into the base Fedora
repository, as Fedora is [eliminating modules][0]. The last modular
version was 1.25, which is too old to run pods with user namespaces.
Version 1.26 is available in the base repository, which does support
user namespaces.
[0]: https://fedoraproject.org/wiki/Changes/RetireModularity
Instead of hard-coding the AMI ID of the Fedora build we want, we can
use the `aws_ami` data source to search for it. The Fedora release team
has a consistent naming scheme for AMIs, so finding the correct one is
straightforward.
Lately, cloud nodes seem to be failing to come up more frequently. I
traced this to the fact that `/etc/resolv.conf` in the `kube-proxy`
container contains both the AWS-provided DNS server and the on-premises
server set by WireGuard. This evidently "works" sometimes, but not
always. When it doesn't, `kube-proxy` cannot resolve the Kubernetes API
server address, and thus cannot create the netfilter rules needed to
forward traffic correctly, leaving Pods unable to communicate.
I am not entirely sure what the "correct" solution to this problem would
be, since there are various issues in play here. Fortunately, cloud
nodes are only ever around for a short time, and never need to be
rebooted. As such, we can use a "quick fix" and simply remove the
AWS-provided DNS configuration.
The default configuration for the *kubelet.service* unit does not
specify the path to the `config.yml` generated by `kubeadm`. Thus, any
settings defined in the `kubelet-config` ConfigMap do not take effect.
To resolve this, we have to explicitly set the path in the `config`
property of the `kubeletExtraArgs` object in the join configuration.
The Cluster Autoscaler uses EC2 Auto-Scaling Groups to configure the
instances it launches when it determines additional worker nodes are
necessary. Auto-Scaling Groups have an associated Launch Template,
which describes the properties of the instances, such as AMI ID,
instance type, security groups, etc.
When instances are first launched, they need to be configured to join
the on-premises Kubernetes cluster. This is handled by *cloud-init*
using the configuration in the instance user data. The configuration
supplied here specifies the Fedora packages that need to be installed on
a Kubernetes worker node, plus some additional configuration required by
`kubeadm`, `kubelet`, and/or `cri-o`. It also includes a script that
fetches the WireGuard client configuration and connects to the VPN,
finalizes the setup process, and joins the cluster.
Initially, I thought it was necessary to use a ClusterRole in order to
assign permissions in one namespace to a service account in another. It
turns out, this is not necessary, as RoleBinding rules can refer to
subjects in any namespace. Thus, we can limit the privileges of the
*dynk8s-provisioner* service account by only allowing it access to the
Secret and ConfigMap resources in the *kube-system* and *kube-public*
namespaces, respectively, plus the Secret resources in its own
namespace.
The Cluster Autoscaler does not delete the Node resource in Kubernetes
after it terminates an instance:
> It does not delete the Node object from Kubernetes. Cleaning up Node
> objects corresponding to terminated instances is the responsibility of
> the cloud node controller, which can run as part of
> kube-controller-manager or cloud-controller-manager.
On-premises clusters are probably not running the Cloud Controller
Manager, so Node resources are liable to be left behind after a
scale-down event.
To keep unused Node resources from accumulating, the
*dynk8s-provisioner* will now delete the Node resource associated with
an EC2 instance when it receives a state-change event indicating the
instance has been terminated. To identify the correct Node, it compares
the value of the `providerID` field of each existing node with the
instance ID mentioned in the event. An exact match is not possible,
since the provider ID includes the availability zone of the instance,
which is not included in the event; however, instance IDs are unique
enough that this "should" never be an issue.
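
A minimal sketch of that matching logic with the `kube` and
`k8s-openapi` crates (names and error handling here are illustrative,
not the actual implementation):

```rust
use k8s_openapi::api::core::v1::Node;
use kube::api::{Api, DeleteParams, ListParams};
use kube::Client;

// Delete any Node whose providerID ends with the terminated instance's ID,
// e.g. "aws:///us-east-2a/i-0123456789abcdef0" for "i-0123456789abcdef0".
async fn delete_node_for_instance(client: Client, instance_id: &str) -> kube::Result<()> {
    let nodes: Api<Node> = Api::all(client);
    for node in nodes.list(&ListParams::default()).await?.items {
        let provider_id = node.spec.as_ref().and_then(|s| s.provider_id.clone());
        if provider_id.map_or(false, |p| p.ends_with(instance_id)) {
            let name = node.metadata.name.clone().unwrap_or_default();
            nodes.delete(&name, &DeleteParams::default()).await?;
        }
    }
    Ok(())
}
```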
Cargo uses the sources in the `tests` directory to build and run
integration tests. For each `tests/foo.rs` or `tests/foo/main.rs`, it
creates an executable that runs the test functions therein. These
executables are separate crates from the main package, and thus do not
have access to its private members. Integration tests are expected to
test only the public functionality of the package.
Application crates do not have any public members; their public
interface is the command line. Integration tests would typically run
the command (e.g. using `std::process::Command`) and test its output.
Since *dynk8s-provisioner* is not really a command-line tool, testing it
this way would be difficult; each test would need to start the server,
make requests to it, and then stop it. This would be slow and
cumbersome.
In order to avoid this tedium and be able to use Rocket's built-in test
client, I have converted *dynk8s-provisioner* into a library crate that
also includes an executable. The library makes the `rocket` function
public, which allows the integration tests to import it and pass it to
the Rocket test client.
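
For example, an integration test can now build and exercise the
instance directly; the route and crate-internal names in this sketch
are illustrative:

```rust
// tests/http.rs (illustrative)
use rocket::http::Status;
use rocket::local::blocking::Client;

#[test]
fn server_builds_and_responds() {
    // `rocket()` is the function exported by the library crate.
    let client = Client::tracked(dynk8s_provisioner::rocket())
        .expect("valid Rocket instance");
    // The route used here is just an example; any registered route works.
    let response = client.get("/").dispatch();
    assert_ne!(response.status(), Status::InternalServerError);
}
```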
The point of integration tests, of course, is to validate the
functionality of the application as a whole. This necessarily requires
allowing it to communicate with the Kubernetes API. In the Jenkins CI
environment, the application will need the appropriate credentials, and
will need to use a separate Kubernetes namespace from the production
deployment. The `setup.yaml` manifest in the `tests` directory defines
the resources necessary to run integration tests, and the
`genkubeconfig.sh` script can be used to create the appropriate
kubeconfig file containing the credentials. The kubeconfig is exposed
to the tests via the `KUBECONFIG` environment variable, which is
populated from a Jenkins secret file credential.
Note: The `data` directory moved from `test` to `tests` to avoid
duplication and confusing names.
When an instance is terminated, any bootstrap tokens assigned to it are
now deleted. Though these would expire anyway, deleting them ensures
that they cannot be used again if they happened to be leaked while the
instance was running. Further, it ensures that attempting to fetch the
`kubeadm` configuration for the instance will return an HTTP 404 Not
Found response once the instance has terminated.
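
Assuming the tokens carry an instance-ID label such as
`dynk8s.du5t1n.me/ec2-instance-id` (the exact key used for tokens is an
assumption here), the cleanup could be a single label-selector delete
with the `kube` crate:

```rust
use k8s_openapi::api::core::v1::Secret;
use kube::api::{Api, DeleteParams, ListParams};

// Remove every Secret labelled with the terminated instance's ID. The label
// key is assumed; the real code may track the association differently.
async fn purge_instance_secrets(secrets: Api<Secret>, instance_id: &str) -> kube::Result<()> {
    let selector = format!("dynk8s.du5t1n.me/ec2-instance-id={instance_id}");
    let lp = ListParams::default().labels(&selector);
    secrets.delete_collection(&DeleteParams::default(), &lp).await?;
    Ok(())
}
```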
The *GET /kubeadm/kubeconfig/<instance-id>* operation returns a
configuration document for `kubeadm` to add the node to the cluster as a
worker. The document is derived from the kubeconfig stored in the
`cluster-info` ConfigMap, which includes the external URL of the
Kubernetes API server and the root CA certificate used in the cluster.
The bootstrap token assigned to the specified instance is added to the
document for `kubeadm` to use for authentication. The kubeconfig is
stored in the ConfigMap as a string, so extracting data from it requires
deserializing the YAML document first.
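
A rough sketch of that extraction with the `kube` and `serde_yaml`
crates (simplified; not the actual implementation):

```rust
use k8s_openapi::api::core::v1::ConfigMap;
use kube::{Api, Client};

// Fetch the kubeconfig embedded in the kube-public/cluster-info ConfigMap and
// parse it so the server URL and CA data can be copied into the kubeadm join
// configuration.
async fn cluster_info_kubeconfig(client: Client) -> anyhow::Result<serde_yaml::Value> {
    let config_maps: Api<ConfigMap> = Api::namespaced(client, "kube-public");
    let cm = config_maps.get("cluster-info").await?;
    let raw = cm
        .data
        .and_then(|d| d.get("kubeconfig").cloned())
        .ok_or_else(|| anyhow::anyhow!("cluster-info has no kubeconfig entry"))?;
    // The kubeconfig is stored as a YAML string, so parse it before reading
    // out the server address and CA certificate data.
    Ok(serde_yaml::from_str(&raw)?)
}
```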
In order to access the cluster information ConfigMap, the service
account bound to the pod running the provisioner service must have the
appropriate permissions.
The *GET /wireguard/config/<instance-id>* resource returns the
WireGuard client configuration assigned to the specified instance ID.
The resource contents are stored in the Kubernetes Secret, in a data
field named `wireguard-config`. The contents of this field are returned
directly as a string, without any transformation. Thus, the value must
be a complete, valid WireGuard configuration document. Instances will
fetch and save this configuration when they first launch, to configure
their access to the VPN.
Setting up the WireGuard client requires several pieces of information
beyond the node's private key: the peer's public key, the peer endpoint
address and port, and the node's IP address are all required.
As such, naming the resource a "key" is somewhat misleading.
In order to join the on-premises Kubernetes cluster, EC2 instances will
need to first connect to the WireGuard VPN. The *dynk8s* provisioner
will provide keys to instances to configure their WireGuard clients.
WireGuard keys must be pre-configured on the server and stored in
Kubernetes as *dynk8s.du5t1n.me/wireguard-key* Secret resources. They
must also have a `dynk8s.du5t1n.me/ec2-instance-id` label. If this
label is empty, the key is available to be assigned to an instance.
When an EventBridge event is received indicating an instance is now
running, a WireGuard key is assigned to that instance (by setting the
`dynk8s.du5t1n.me/ec2-instance-id` label). Conversely, when an event is
received indicating that the instance is terminated, any WireGuard keys
assigned to that instance are freed.
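
A sketch of the assignment step with the `kube` crate (the selection
and patching details are illustrative, not the actual implementation):

```rust
use k8s_openapi::api::core::v1::Secret;
use kube::api::{Api, ListParams, Patch, PatchParams};

// Claim the first unassigned WireGuard key Secret for the given instance by
// writing the instance ID into its label. Returns the Secret's name, if any.
async fn assign_wireguard_key(
    secrets: Api<Secret>,
    instance_id: &str,
) -> kube::Result<Option<String>> {
    for secret in secrets.list(&ListParams::default()).await?.items {
        let is_wireguard_key =
            secret.type_.as_deref() == Some("dynk8s.du5t1n.me/wireguard-key");
        let unassigned = secret
            .metadata
            .labels
            .as_ref()
            .and_then(|l| l.get("dynk8s.du5t1n.me/ec2-instance-id"))
            .map_or(false, |v| v.is_empty());
        if is_wireguard_key && unassigned {
            let name = secret.metadata.name.clone().unwrap_or_default();
            let patch = serde_json::json!({
                "metadata": {
                    "labels": { "dynk8s.du5t1n.me/ec2-instance-id": instance_id }
                }
            });
            secrets
                .patch(&name, &PatchParams::default(), &Patch::Merge(&patch))
                .await?;
            return Ok(Some(name));
        }
    }
    Ok(None)
}
```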
The lifecycle of ephemeral Kubernetes worker nodes is driven by events
emitted by Amazon EventBridge and delivered via Amazon Simple
Notification Service. These events trigger the *dynk8s* provisioner to
take the appropriate action based on the state of an EC2 instance.
In order to add a node to the cluster using `kubeadm`, a "bootstrap
token" needs to be created. When manually adding a node, this would be
done e.g. using `kubeadm token create`. Since bootstrap tokens are just
a special type of Secret, they can be easily created programmatically as
well. When a new EC2 instance enters the "running" state, the
provisioner creates a new bootstrap token and associates it with the
instance by storing the instance ID in a label in the Secret resource's
metadata.
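
A rough sketch of that creation step with the `kube` crate follows; the
Secret name format, type, and usage flags are the standard
bootstrap-token conventions, while the label key is an assumption:

```rust
use std::collections::BTreeMap;

use k8s_openapi::api::core::v1::Secret;
use k8s_openapi::apimachinery::pkg::apis::meta::v1::ObjectMeta;
use kube::api::{Api, PostParams};

// Create a bootstrap token Secret, assuming `secrets` is scoped to the
// kube-system namespace, and tag it with the EC2 instance ID.
async fn create_bootstrap_token(
    secrets: Api<Secret>,
    instance_id: &str,
    token_id: &str,
    token_secret: &str,
) -> kube::Result<Secret> {
    let string_data = BTreeMap::from([
        ("token-id".to_string(), token_id.to_string()),
        ("token-secret".to_string(), token_secret.to_string()),
        ("usage-bootstrap-authentication".to_string(), "true".to_string()),
        ("usage-bootstrap-signing".to_string(), "true".to_string()),
    ]);
    let token = Secret {
        metadata: ObjectMeta {
            name: Some(format!("bootstrap-token-{token_id}")),
            labels: Some(BTreeMap::from([(
                "dynk8s.du5t1n.me/ec2-instance-id".to_string(),
                instance_id.to_string(),
            )])),
            ..ObjectMeta::default()
        },
        type_: Some("bootstrap.kubernetes.io/token".to_string()),
        string_data: Some(string_data),
        ..Secret::default()
    };
    secrets.create(&PostParams::default(), &token).await
}
```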
The initial implementation of the event handler is rather naïve. It
generates a token for every instance, though some instances may not be
intended to be used as Kubernetes workers. Ideally, the provisioner
would only allocate tokens for instances matching some configurable
criteria, such as AWS tags. Further, a token is allocated every time
the instance enters the running state, even if a token already exists or
is not needed.
The `terraform` directory contains the resource descriptions for all AWS
services that need to be configured in order for the dynamic K8s
provisioner to work. Specifically, it defines the EventBridge rule and
SNS topic/subscriptions that instruct AWS to send EC2 instance state
change notifications to the *dynk8s-provisioner*'s HTTP interface.
Fedora 36 has OpenSSL 3, while the *rust* container image has OpenSSL
1.1. Since Fedora 35 is still supported, and it includes OpenSSL 1.1,
we can use it as our base for the runtime image.
Upon receipt of a notification or unsubscribe confirmation message from
SNS, after the message signature has been verified, the receiver will
now write the re-serialized contents of the message out to the
filesystem. This will allow the messages to be inspected later in order
to develop additional functionality for this service.
The messages are saved in a `messages` directory within the current
working directory. This directory contains a subdirectory for each SNS
topic. Within the topic subdirectories, each message is saved in a
file named with the message timestamp and ID.
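
Illustratively, saving a message boils down to something like the
following (the filename separator and extension are assumptions):

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

// Write a re-serialized SNS message to messages/<topic>/<timestamp>-<id>.json
// under the current working directory.
fn save_message(topic: &str, timestamp: &str, message_id: &str, body: &str) -> io::Result<PathBuf> {
    let dir = PathBuf::from("messages").join(topic);
    fs::create_dir_all(&dir)?;
    let path = dir.join(format!("{timestamp}-{message_id}.json"));
    fs::write(&path, body)?;
    Ok(path)
}
```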
This commit introduces the HTTP interface for the dynamic K8s node
provisioner. It will serve as the main communication point between the
ephemeral nodes in the cloud, sharing the keys and tokens they require
in order to join the Kubernetes cluster.
The initial functionality is simply an Amazon SNS notification receiver.
SNS notifications will be used to manage the lifecycle of the dynamic
nodes.
For now, the notification receiver handles subscription confirmation
messages by following the link provided to confirm the subscription.
All other messages are simply written to the filesystem; these will be
used to implement and test future functionality.
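
Confirming a subscription amounts to a single GET of the `SubscribeURL`
carried in the message; a sketch assuming the `reqwest` crate:

```rust
// Follow the SubscribeURL from a SubscriptionConfirmation message to confirm
// the subscription with SNS.
async fn confirm_subscription(subscribe_url: &str) -> reqwest::Result<()> {
    reqwest::get(subscribe_url).await?.error_for_status()?;
    Ok(())
}
```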
The `model::sns::Message` enumeration provides a mechanism for
deserializing a JSON document into the correct type. It will be used by
the HTTP operation that receives messages from SNS in order to determine
the correct action to take in response to the message.
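
The deserialization hinges on the `Type` field that SNS includes in
every message; a trimmed-down sketch of the approach (not the crate's
actual definition):

```rust
use serde::Deserialize;

// SNS sets "Type" on every message, which serde can use as an internal tag to
// pick the right variant.
#[derive(Debug, Deserialize)]
#[serde(tag = "Type")]
pub enum Message {
    SubscriptionConfirmation {
        #[serde(rename = "SubscribeURL")]
        subscribe_url: String,
    },
    Notification {
        #[serde(rename = "Message")]
        message: String,
    },
    UnsubscribeConfirmation {
        #[serde(rename = "Token")]
        token: String,
    },
}

// The HTTP handler can then deserialize the body and match on the variant.
pub fn parse(body: &str) -> serde_json::Result<Message> {
    serde_json::from_str(body)
}
```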
In order to prevent arbitrary clients from using the provisioner to
retrieve WireGuard keys and Kubernetes bootstrap tokens, access to those
resources *must* be restricted to the EC2 machines created by the
Kubernetes Cluster Autoscaler. The key to the authentication process will
be SNS notifications from AWS to indicate when new EC2 instances are
created; everything that the provisioner does will be associated with an
instance it discovered through an SNS notification.
SNS messages are signed using PKCS#1 v1.5 RSA-SHA1, with a public key
distributed in an X.509 certificate. To ensure that messages received
are indeed from AWS, the provisioner will need to verify those
signatures. Messages with missing or invalid signatures will be
considered unsafe and ignored.
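
At its core, the check is a standard X.509 / RSA-SHA1 verification; a
minimal sketch with the `openssl` crate, assuming the canonical string
and the signing certificate have already been obtained:

```rust
use openssl::base64;
use openssl::hash::MessageDigest;
use openssl::sign::Verifier;
use openssl::x509::X509;

// Verify the base64-encoded Signature field against the canonical string
// built from the message fields, using the certificate from SigningCertURL.
fn signature_is_valid(
    signing_cert_pem: &[u8],
    canonical_string: &[u8],
    signature_b64: &str,
) -> Result<bool, openssl::error::ErrorStack> {
    let cert = X509::from_pem(signing_cert_pem)?;
    let public_key = cert.public_key()?;
    let signature = base64::decode_block(signature_b64)?;
    // The default RSA padding for Verifier is PKCS#1 v1.5.
    let mut verifier = Verifier::new(MessageDigest::sha1(), &public_key)?;
    verifier.update(canonical_string)?;
    verifier.verify(&signature)
}
```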
The `model::sns` module includes the data structures that represent SNS
messages. The `sns::sig` module includes the primitive operations for
implementing signature verification.