+++
title = 'Speed Up Jenkins Startup Time in Kubernetes'
date = 2022-12-01T21:40:17-06:00
+++

I recently migrated my Jenkins server at home to run inside my Kubernetes
cluster. I am very happy with it overall; upgrades are a lot simpler, and
Longhorn volume snapshots make rolling back bad plugin updates a breeze. One
issue that troubled me for a while, though, was that it took a *really* long
time for the Jenkins server container to start. Kubernetes would list the pod
in `ContainerCreating` state for several minutes, and then in
`CreateContainerError` for a while, before finally starting the process. It
turns out this was because of the huge number of files in the Jenkins home
directory. When the container starts up, the container runtime has to go
through every file in the persistent volume and fix its ownership and SELinux
label. My Jenkins instance has over 1.5 million files, so scanning and
modifying them all takes a very long time.

I was finally able to fix this issue today, after messing with it for a week or
so. Two things get rewritten on every file in the persistent volume each time
the pod starts:

1. The group ownership/GID
2. The SELinux label

Fixing the first problem is straightforward: set `fsGroupChangePolicy` to
`OnRootMismatch` in the pod's `securityContext` (it is a pod-level field, not a
container-level one). Kubernetes will then check the ownership and permissions
of the root directory of the persistent volume, and if they are already
correct, skip checking the rest of the files and directories.

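As a minimal sketch, the pod-level fragment might look like the following. The
`fsGroup` value of `1000` is an assumption for illustration; use whichever GID
owns your Jenkins home directory.

```yaml
spec:
  securityContext:
    # Assumed GID for illustration; use the group that owns your Jenkins home.
    fsGroup: 1000
    # Skip the recursive ownership change when the volume root already matches.
    fsGroupChangePolicy: OnRootMismatch
```

For a Deployment or StatefulSet, the same block goes under
`spec.template.spec` instead.
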
The second problem was quite a bit trickier, but with the help of a [CRI-O
GitHub issue][0], I finally got it right. The key is to configure the
container to have a static SELinux context; by default, the container runtime
assigns a random category when the container starts. Naturally, this means the
context labels of all the files in the persistent volume have to be changed
every time, to match the new category. Fortunately, the
`securityContext.seLinuxOptions.level` setting on the pod/container makes it
possible to pin the level. I looked at the category of the current Jenkins
process and set `level` to that:

```sh
ps Z -p $(pgrep -f 'jenkins\.war')
```

```
LABEL PID TTY STAT TIME COMMAND
system_u:system_r:container_t:s0:c525,c600 196790 ? Sl 0:50 java -Duser.home=/var/jenkins_home -Djenkins.model.Jenkins.slaveAgentPort=50000 -Dhudson.lifecycle=hudson.lifecycle.ExitLifecycle -jar /usr/share/jenkins/jenkins.war
```

The *level* is the last two colon-separated fields of the process's label
(here `s0:c525,c600`): the sensitivity, followed by the context's categories.

```yaml
spec:
  containers:
  - securityContext:
      seLinuxOptions:
        level: s0:c525,c600
```

With this setting in place, the container will start with the same SELinux
context every time, so if the files are already labelled correctly, they do not
have to be changed. Unfortunately, by default, CRI-O still walks the whole
directory tree to make sure. It can be configured to skip that step, though,
similar to the `fsGroupChangePolicy`. The pod needs a special annotation:

```yaml
metadata:
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true'
```

CRI-O itself also has to be configured to respect that annotation. CRI-O's
configuration is not well documented, but I was able to determine that these
two lines need to be added to `/etc/crio/crio.conf`:

```toml
[crio.runtime.runtimes.runc]
allowed_annotations = ["io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel"]
```

In summary, there were four steps to configure the container runtime not to
scan and touch every file in the persistent volume when starting the Jenkins
container:

1. Set `securityContext.fsGroupChangePolicy` to `OnRootMismatch`
2. Set `securityContext.seLinuxOptions.level` to a static value
3. Add the `io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel` annotation
4. Configure CRI-O to respect said annotation

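Put together, the pod-side settings (steps 1 through 3) might look something
like the sketch below. The pod name, image tag, `fsGroup` value, and claim name
are placeholders for illustration; only the annotation, the
`fsGroupChangePolicy`, the `level`, and the `/var/jenkins_home` path come from
the steps above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jenkins                     # hypothetical name
  annotations:
    # Step 3: ask CRI-O to skip relabeling when the volume root already matches.
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true'
spec:
  securityContext:
    fsGroup: 1000                   # assumed GID
    # Step 1: only change ownership when the volume root does not match.
    fsGroupChangePolicy: OnRootMismatch
  containers:
  - name: jenkins
    image: jenkins/jenkins:lts      # assumed image
    securityContext:
      seLinuxOptions:
        # Step 2: static level, read from the running Jenkins process.
        level: s0:c525,c600
    volumeMounts:
    - name: jenkins-home
      mountPath: /var/jenkins_home
  volumes:
  - name: jenkins-home
    persistentVolumeClaim:
      claimName: jenkins            # hypothetical claim name
```

Step 4 is not part of the manifest; it lives in the CRI-O configuration on each
node that may run the pod.
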
After completing all four steps, the Jenkins container starts up in seconds
instead of minutes.

[0]: https://github.com/cri-o/cri-o/issues/6185