Buildroot jobs really benefit from having a persistent workspace volume
instead of an ephemeral one. This way, only the packages and other
artifacts that have changed since the last build need to be rebuilt,
instead of the whole toolchain and operating system.
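For illustration, the workspace claim could be as simple as the
following; the name, size, and namespace are placeholders rather than
the exact manifest used here:

```yaml
# Hypothetical PVC for the Buildroot workspace; name, size, and
# namespace are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: buildroot-workspace
  namespace: jenkins-jobs
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
```

Job pods would mount this claim over the Buildroot output directory so
a new run picks up where the previous one left off.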
We don't want to hard-code a namespace for the `ssh-known-hosts`
ConfigMap because that makes it less useful for other projects besides
Jenkins. Instead, we omit the namespace specification and allow
consumers to specify their own.
The _jenkins_ project doesn't have a default namespace, since it
specifies resources in both the `jenkins` and `jenkins-jobs`
namespaces. We therefore need to create a sub-project to set the
namespace for the `ssh-known-hosts` ConfigMap, as sketched below.
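Roughly, assuming a kustomize-style layout (the paths and names here
are illustrative), the sub-project just wraps the namespace-less base
and pins the namespace:

```yaml
# Hypothetical sub-project kustomization: reuse the shared
# ssh-known-hosts base, but fix its namespace for this project only.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: jenkins-jobs
resources:
  - ../../base/ssh-known-hosts
```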
Jenkins jobs that build Gentoo-based systems, like Aimee OS, need a
persistent storage volume for the Gentoo ebuild repository. The Job
initially populates the repository using `emerge-webrsync`, and then the
CronJob keeps it up-to-date by running `emaint sync` daily.
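The daily sync could look roughly like the following CronJob; the
image, schedule, and claim name are assumptions, not the exact
manifest:

```yaml
# Hypothetical CronJob that keeps the Gentoo ebuild repository current.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: portage-sync
  namespace: jenkins-jobs
spec:
  schedule: "0 3 * * *"          # once a day
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: sync
              image: gentoo/stage3        # illustrative image
              command: ["emaint", "sync", "--auto"]
              volumeMounts:
                - name: repos
                  mountPath: /var/db/repos/gentoo
          volumes:
            - name: repos
              persistentVolumeClaim:
                claimName: gentoo-repos   # hypothetical PVC name
```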
In addition to the Portage repository, we also need a volume to store
built binary packages. Jenkins job pods can mount this volume to make
binary packages they build available for subsequent runs.
Both of these volumes are also exposed for use outside the cluster via
`rsync` running in daemon mode, which is useful for e.g. local builds.
The new machines have names in the _pyrocufflink.black_ zone. We need
to trust the SSHCA certificate authority to sign host keys for these
names in order to connect to the machines and manage them with Ansible.
Instead of routing iSCSI traffic from the Kubernetes network, through
the firewall, to the storage network, nodes now have a second network
adapter connected directly to the storage network. The nodes with
such an adapter are labelled `network.du5t1n.me/storage`, so we can pin
the Jenkins PersistentVolume to them via a node affinity rule.
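On the PersistentVolume itself, the pinning looks something like this
(matching on label existence is an assumption; the real rule may
require a specific value):

```yaml
# Excerpt from the Jenkins PersistentVolume: restrict it to nodes that
# have a direct storage-network adapter.
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: network.du5t1n.me/storage
              operator: Exists
```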
Managing the Jenkins volume with Longhorn has become increasingly
problematic. Because of its large size, whenever Longhorn needs to
rebuild/replicate it (which happens often for no apparent reason), it
can take several hours. While the synchronization is happening, the
entire cluster suffers from degraded performance.
Instead of using Longhorn, I've decided to try storing the data directly
on the Synology NAS and exposing it to Kubernetes via iSCSI. The Synology
offers many of the same features as Longhorn, including
snapshots/rollbacks and backups. Using the NAS allows the volume to be
available to any Kubernetes node, without keeping multiple copies of
the data.
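A minimal sketch of the iSCSI-backed PersistentVolume; the portal
address, IQN, LUN, and size are placeholders (the node-affinity rule
shown earlier would also be part of this spec):

```yaml
# Hypothetical PersistentVolume backed by an iSCSI LUN on the NAS.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: jenkins
spec:
  capacity:
    storage: 100Gi                # illustrative size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  iscsi:
    targetPortal: "[fd12:3456:789a::10]:3260"      # placeholder IPv6 portal
    iqn: iqn.2000-01.com.synology:nas.jenkins      # placeholder IQN
    lun: 1
    fsType: ext4
```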
In order to expose the iSCSI service on the NAS to the Kubernetes nodes,
I had to make the storage VLAN routable. I kept it as IPv6-only,
though, as an extra precaution against unauthorized access. The
firewall only allows nodes on the Kubernetes network to access the NAS
via iSCSI.
I originally tried proxying the iSCSI connection via the VM hosts;
however, this failed because of how iSCSI target discovery works. The
provided "target host" is really only used to identify available LUNs;
follow-up communication is done with the IP address returned by the
discovery process. Since the NAS would return its IP address, which
differed from the proxy address, the connection would fail. Thus, I
resorted to reconfiguring the storage network and connecting directly
to the NAS.
To migrate the contents of the volume, I temporarily created a PVC with
a different name and bound it to the iSCSI PersistentVolume. Using a
pod with both the original PVC and the new PVC mounted, I used `rsync`
to copy the data. Once the copy completed, I deleted the Pod and both
PVCs, then created a new PVC with the original name (i.e. `jenkins`),
bound to the iSCSI PV. While doing this, Longhorn, for some reason,
kept re-creating the PVC whenever I deleted it, no matter how I
requested the deletion: whether I removed the PV, the PVC, or the
Longhorn Volume, via either the Kubernetes API or the Longhorn UI, it
would be recreated almost immediately. Fortunately, there was just
enough of a delay between the deletion and Longhorn's re-creation that
I was able to create the new PVC manually; once I did, Longhorn seemed
to give up.
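For reference, the one-off copy pod looked roughly like this; the image
and claim names are illustrative:

```yaml
# Hypothetical one-off pod that mounts both claims and copies the data.
apiVersion: v1
kind: Pod
metadata:
  name: jenkins-migrate
  namespace: jenkins
spec:
  restartPolicy: Never
  containers:
    - name: copy
      image: rsync-image            # placeholder: any image that provides rsync
      command:
        - rsync
        - -aHAX
        - --delete
        - /old/
        - /new/
      volumeMounts:
        - name: old
          mountPath: /old
        - name: new
          mountPath: /new
  volumes:
    - name: old
      persistentVolumeClaim:
        claimName: jenkins          # original Longhorn-backed PVC
    - name: new
      persistentVolumeClaim:
        claimName: jenkins-iscsi    # temporary PVC bound to the iSCSI PV
```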
Since (almost) all managed hosts have SSH certificates signed by SSHCA
now, the need to maintain a pseudo-dynamic SSH key list is winding down.
If we include the SSH CA key in the global known hosts file, and
explicitly list the couple of hosts that do not have a certificate, we
can let Ansible use that instead of fetching the host keys on each run.
The *jenkins-repohost* Secret contains an SSH private key Jenkins jobs
can use to publish RPM packages to the Yum repo host on
*files.pyrocufflink.blue*.
The *rpm-gpg-key* and *rpm-gpg-key-passphrase* Secrets contain the GnuPG
private key and its encryption passphrase, respectively, that can be
used to sign RPM packages. This key is trusted by managed nodes on the
Pyrocufflink network.
The [Kubernetes Credentials Provider][0] plugin for Jenkins allows
Jenkins to expose Kubernetes Secret resources as Jenkins Credentials.
Jobs can use them like normal Jenkins credentials, e.g. using
`withCredentials`, `sshagent`, etc. The only drawback is that every
credential exposed this way is available to every job, at least until
[PR #40][1] is merged. Fortunately, jobs managed by this Jenkins
instance are all trusted; no anonymous pull requests are possible, so
the risk is mitigated.
[0]: https://jenkinsci.github.io/kubernetes-credentials-provider-plugin/
[1]: https://github.com/jenkinsci/kubernetes-credentials-provider-plugin/pull/40
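As an example, the *jenkins-repohost* key could be exposed with
something like the following, assuming the label and field names
documented by the plugin (`jenkins.io/credentials-type`,
`basicSSHUserPrivateKey`); the username and namespace are placeholders:

```yaml
# Hypothetical Secret consumed by the Kubernetes Credentials Provider.
apiVersion: v1
kind: Secret
metadata:
  name: jenkins-repohost
  namespace: jenkins               # the plugin watches Jenkins' own namespace by default
  labels:
    jenkins.io/credentials-type: basicSSHUserPrivateKey
  annotations:
    jenkins.io/credentials-description: SSH key for publishing to the Yum repo host
type: Opaque
stringData:
  username: repo-publisher         # illustrative
  privateKey: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    ...
    -----END OPENSSH PRIVATE KEY-----
```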
Setting the `imagePullSecrets` property on the default service account
for the *jenkins-jobs* namespace allows jobs to run from private
container images automatically, without additional configuration in the
pipeline definitions.
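Concretely, this is just the `imagePullSecrets` field on the
ServiceAccount; the secret name below is a placeholder:

```yaml
# Default service account for the jenkins-jobs namespace; pods that do
# not specify a service account inherit this pull secret automatically.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: jenkins-jobs
imagePullSecrets:
  - name: registry-credentials     # hypothetical dockerconfigjson Secret
```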
The Raspberry Pi usually has the most free RAM of all the Kubernetes
nodes, so pods tend to get assigned there even when it would not be
appropriate. Jenkins, for example, definitely does not need to run
there, so let's force it to run on the bigger nodes.
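One way to express this, assuming the Raspberry Pi is the only arm64
node (the actual rule may differ), is a node affinity term in the
Jenkins pod spec:

```yaml
# Hypothetical excerpt from the Jenkins pod spec: avoid arm64 nodes,
# assuming the bigger nodes are all amd64.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/arch
              operator: NotIn
              values:
                - arm64
```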
Argo CD will delete and re-create this Job each time it synchronizes the
*jenkins* application. The job creates a snapshot of the Jenkins volume
using an HTTP request to the Longhorn UI.
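A rough sketch of such a Job follows. The hook annotations are standard
Argo CD (a `BeforeHookCreation` delete policy gives the
delete-and-re-create behavior), but the Longhorn service address,
volume name, and endpoint path are assumptions, not the exact request
used:

```yaml
# Hypothetical snapshot Job run as an Argo CD hook.
apiVersion: batch/v1
kind: Job
metadata:
  name: jenkins-snapshot
  namespace: jenkins
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: snapshot
          image: curlimages/curl
          command:
            - curl
            - -X
            - POST
            # Service name, volume name, and endpoint path are
            # assumptions; consult the Longhorn API for the real call.
            - "http://longhorn-frontend.longhorn-system/v1/volumes/jenkins?action=snapshotCreate"
```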
When cloning/fetching a Git repository in a Jenkins pipeline, the Git
Client plugin uses the configured *Host Key Verification Strategy* to
verify the SSH host key of the remote Git server. Unfortunately, there
does not seem to be any way to use the configured strategy from the
`git` command line in a Pipeline job, so e.g. `git push` does not
respect it. This causes jobs to fail to push changes to the remote if
the container they're using does not already have the SSH host key for
the remote in its known hosts database.
This commit adds a ConfigMap to the *jenkins-jobs* namespace that can be
mounted in containers to populate the SSH host key database.
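The ConfigMap looks roughly like this, with placeholder hosts and keys;
containers mount it at `/etc/ssh/ssh_known_hosts` so that `git push`
over SSH can verify the remote:

```yaml
# Hypothetical ssh-known-hosts ConfigMap; real entries would carry the
# actual Git servers' host keys (or a @cert-authority line for SSHCA).
apiVersion: v1
kind: ConfigMap
metadata:
  name: ssh-known-hosts
  namespace: jenkins-jobs
data:
  ssh_known_hosts: |
    git.example.org ssh-ed25519 AAAA...placeholder...
    @cert-authority *.pyrocufflink.blue ssh-ed25519 AAAA...placeholder...
```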
I don't want Jenkins updating itself whenever the pod restarts, so I'm
going to pin it to a specific version. This way, I can be sure to take
a snapshot of the data volume before upgrading.
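In the pod spec, that just means an explicit image tag instead of a
floating one; the tag below is illustrative, assuming the stock
jenkins/jenkins image:

```yaml
# Pin the controller to an exact version; bump it deliberately, after
# snapshotting the volume.
containers:
  - name: jenkins
    image: jenkins/jenkins:2.440.1-lts   # illustrative tag
```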
Setting a static SELinux level for the container allows CRI-O to skip
relabeling all the files in the persistent volume each time the
container starts. For this to work, the pod needs a special annotation,
and CRI-O itself has to be configured to respect it:
```toml
[crio.runtime.runtimes.runc]
allowed_annotations = ["io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel"]
```
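On the pod side, that means the annotation named above plus a fixed
level in `seLinuxOptions`; the level value here is illustrative:

```yaml
# Excerpt from the Jenkins pod: let CRI-O skip relabeling the volume.
metadata:
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: "true"
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"   # illustrative static MCS level
```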
This *dramatically* improves the start time of the Jenkins container.
Instead of taking 5+ minutes, it now starts instantly.
https://github.com/cri-o/cri-o/issues/6185#issuecomment-1334719982
Running Jenkins in Kubernetes is relatively straightforward. The
Kubernetes plugin automatically discovers all the connection and
authentication configuration, so a `kubeconfig` file is no longer
necessary. I did set the *Jenkins tunnel* option, though, so that
agents will connect directly to the Jenkins JNLP port instead of going
through the ingress controller.
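Expressed with the Configuration as Code plugin (an assumption about
how this instance is configured), the relevant settings would look
roughly like this; the service names are placeholders:

```yaml
# Hypothetical JCasC excerpt for the Kubernetes cloud. With no explicit
# serverUrl or credentials, the plugin uses the in-cluster configuration.
jenkins:
  clouds:
    - kubernetes:
        name: kubernetes
        namespace: jenkins-jobs
        jenkinsUrl: "http://jenkins.jenkins.svc.cluster.local:8080"
        jenkinsTunnel: "jenkins-agent.jenkins.svc.cluster.local:50000"
```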
Jobs now run in pods in the *jenkins-jobs* namespace instead of the
*jenkins* namespace. The latter is now where the Jenkins controller
runs, and the controller should not have permission to modify its own
resources.
Jenkins doesn't really need full control of all resources in its
namespace. Rather, it only needs to be able to manage Pod and
PersistentVolumeClaim resources.
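The trimmed-down Role might look like the following sketch; I've
included the `pods/exec` and `pods/log` subresources because the
Kubernetes plugin typically needs them, though the exact rule set may
differ:

```yaml
# Hypothetical Role for the Jenkins controller in the jenkins-jobs
# namespace: just enough to launch agent pods and their volumes.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jenkins
  namespace: jenkins-jobs
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "delete"]
```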