+++
title = 'Speed Up Jenkins Startup Time in Kubernetes'
date = 2022-12-01T21:40:17-06:00
+++

I recently migrated my Jenkins server at home to run inside my Kubernetes cluster. I am very happy with it overall; upgrades are a lot simpler, and Longhorn volume snapshots make rolling back bad plugin updates a breeze. One issue that troubled me for a while, though, was that it took a *really* long time for the Jenkins server container to start. Kubernetes would list the pod in `ContainerCreating` state for several minutes, and then in `CreateContainerError` for a while, before finally starting the process.

It turns out this was because of the huge number of files in the Jenkins home directory. When the container starts up, the container runtime has to go through every file in the persistent volume and fix its permissions. My Jenkins instance has over 1.5 million files, so scanning and modifying them all takes a very long time.

I was finally able to fix this issue today, after messing with it for a week or so.

There are two changes the container runtime has to make to every file in the persistent volume:

1. The group ownership/GID
2. The SELinux label

Fixing the first problem is straightforward: set `securityContext.fsGroupChangePolicy` on the pod to `OnRootMismatch`. The container runtime will check the GID of the root directory of the persistent volume and, if it is correct, skip checking the rest of the files and directories.
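To make this concrete, here is a minimal sketch of the relevant part of the pod spec. The `fsGroup` value of 1000 is an assumption based on the GID the official Jenkins image runs as; use whatever group actually owns your volume.

```yaml
spec:
  securityContext:
    # Assumed GID for the official Jenkins image; match it to your volume's owner.
    fsGroup: 1000
    # Only chown/chmod the volume contents if the GID on its root directory is wrong.
    fsGroupChangePolicy: OnRootMismatch
```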
The second problem was quite a bit trickier, but still fixable. It took me a bit longer to get the solution right, but with the help of a [cri-o GitHub issue][0], I finally managed. The key is to configure the container to have a static SELinux context; by default, the container runtime assigns a random category each time the container starts. That means the context labels of all the files in the persistent volume have to be changed every time to match the new category. Fortunately, the `securityContext.seLinuxOptions.level` setting on the pod or container can pin the level to a fixed value. I looked at the category of the current Jenkins process and set `level` to match:

```sh
ps Z -p $(pgrep -f 'jenkins\.war')
```

```
LABEL                                         PID TTY      STAT   TIME COMMAND
system_u:system_r:container_t:s0:c525,c600 196790 ?        Sl     0:50 java -Duser.home=/var/jenkins_home -Djenkins.model.Jenkins.slaveAgentPort=50000 -Dhudson.lifecycle=hudson.lifecycle.ExitLifecycle -jar /usr/share/jenkins/jenkins.war
```

The *level* field is the final part of the process's label, after the last colon.

```yaml
spec:
  containers:
    - securityContext:
        seLinuxOptions:
          level: s0:c525,c600
```

With this setting in place, the container will start with the same SELinux context every time, so if the files are already labelled correctly, they do not have to be changed. Unfortunately, by default, CRI-O still walks the whole directory tree to make sure. It can be configured to skip that step, though, similar to the `fsGroupChangePolicy` optimization. The pod needs a special annotation:

```yaml
metadata:
  annotations:
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true'
```

CRI-O itself also has to be configured to respect that annotation. CRI-O's configuration is not well documented, but I was able to determine that these two lines need to be added to `/etc/crio/crio.conf`:

```toml
[crio.runtime.runtimes.runc]
allowed_annotations = ["io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel"]
```

In summary, there were four steps to configure the container runtime not to scan and touch every file in the persistent volume when starting the Jenkins container:

1. Set `securityContext.fsGroupChangePolicy` to `OnRootMismatch`
2. Set `securityContext.seLinuxOptions.level` to a static value
3. Add the `io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel` annotation
4. Configure CRI-O to respect said annotation

After completing all four steps, the Jenkins container starts up in seconds instead of minutes.

[0]: https://github.com/cri-o/cri-o/issues/6185
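For reference, here is a minimal sketch of how the Kubernetes-side settings fit together in a single pod spec. It is an illustration rather than my actual manifest: the pod name, image, `fsGroup`, volume names, and SELinux categories are example values and will differ on your cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: jenkins
  annotations:
    # Requires CRI-O with allowed_annotations configured as shown above.
    io.kubernetes.cri-o.TrySkipVolumeSELinuxLabel: 'true'
spec:
  securityContext:
    fsGroup: 1000                        # example GID; match your volume's owner
    fsGroupChangePolicy: OnRootMismatch  # skip the chown walk if the root GID already matches
  containers:
    - name: jenkins
      image: jenkins/jenkins:lts         # example image
      securityContext:
        seLinuxOptions:
          level: s0:c525,c600            # the static category observed with `ps Z`
      volumeMounts:
        - name: jenkins-home
          mountPath: /var/jenkins_home
  volumes:
    - name: jenkins-home
      persistentVolumeClaim:
        claimName: jenkins-home          # example claim name
```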