Fast fully containerized CI pipelines

20 min read

When evaluating CI solutions for my homelab, I found it striking that many of them require you to describe how to build things three times:

  1. Once in the remote pipeline definition.
  2. Once in commands or scripts that you run locally.
  3. Once in a Dockerfile to build the image.

Since I will never go back to how I was deploying things before item 3 came into existence, let's try to eliminate the first two. In a Dockerfile, you can write additional stages that you can build and run either locally or in a CI pipeline, but there are several blockers:

  • Whatever speed benefits you had from the incremental compilation done by the local or remote toolchain are gone.
  • You have to send heavy images that contain the toolchain and build artifacts between build stages.
  • Running tests with external dependencies like a database requires more control over how images are executed by the CI runners.

This article shows how to make this work in a simple Drone CI pipeline, leveraging various BuildKit caching mechanisms to keep things fast.

§
Caching strategies

Let's start with a simple Go app to demonstrate common caching strategies:

main.go
package main

import (
    "fmt"

    "github.com/google/uuid"
)

func RunID() string {
    return fmt.Sprintf("RunID-%s", uuid.NewString())
}

func main() {
    fmt.Println(RunID())
}

Initialize the Go module and lock its dependencies (information saved in go.mod and go.sum):

Console
$ go mod init example.com/hello
$ go mod tidy

§
Layer-level

Here's a simple Dockerfile to build this app:

Dockerfile
FROM golang:1.23-alpine
WORKDIR /src
COPY . .
RUN go build -x -o /app .
CMD ["/app"]

You can run the build command:

Console
$ docker build --pull -t hello .
[+] Building 38.7s (9/9) FINISHED
 => [1/4] FROM docker.io/library/golang:1.23-alpine@sha256:2c49857f2295e89b23b28386e57e018a86620a8fede5003900f2d138ba9c4037
 => [2/4] WORKDIR /src
 => [3/4] COPY . .
 => [4/4] RUN go build -x -o /app .

§
Reducing cache invalidation

If you run this command again, you will see that all the layers have been cached. Same input, same output:

Console
$ docker build --pull -t hello .
[+] Building 1.7s (9/9) FINISHED
 => [1/4] FROM docker.io/library/golang:1.23-alpine@sha256:2c49857f2295e89b23b28386e57e018a86620a8fede5003900f2d138ba9c4037
 => CACHED [2/4] WORKDIR /src
 => CACHED [3/4] COPY . .
 => CACHED [4/4] RUN go build -x -o /app .

Now we can start making changes to the source file to test the cache invalidation in a typical development workflow.

After changing main.go (or any part of the source code), COPY . . invalidates all subsequent instructions, which means that go build will re-download and recompile the dependencies every time you build the image; this can be quite time-consuming.

To solve this issue, you can add an intermediate step that will re-download the dependencies only when go.mod or go.sum changes:

Dockerfile
FROM golang:1.23-alpine
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download -x
COPY . .
RUN go build -x -o /app .
CMD ["/app"]

The most important thing to keep in mind is to put instructions that change often at the end. And this is pretty much all we can do with layer caching.

§
Multi-stage builds

There is still some room for improvement if you want to reduce the final image size:

  • FROM golang:1.23-alpine provides the Go development toolchain which is no longer needed after building the executable.
  • COPY . . adds source files that are not needed to run the app.
  • RUN go build -o /app . downloads dependencies, produces intermediate build artifacts, and adds debugging symbols to the executable.

All of this can be avoided by leveraging multi-stage builds and a few compiler options to keep the final image as slim as possible:

Dockerfile
FROM golang:1.23-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download -x
COPY . .
RUN go build -x -ldflags='-s -w' -trimpath -o /app .

FROM scratch
COPY --from=build /app /app
CMD ["/app"]

§
Distributing the layer cache

Now you may wonder how you could distribute the layer cache so it benefits the entire fleet of CI runners. BuildKit has features to import and export layer caches to a regular registry. When all runners are configured accordingly, this is a simple way to distribute the layer cache. You can test this feature with the --cache-from and --cache-to build options:

Console
$ docker build \
    --cache-from type=registry,ref=registry/app:buildcache \
    --cache-to type=registry,ref=registry/app:buildcache \
    -t registry/app:latest .

Unfortunately, this technique is not as efficient as it looks, because there is a trade-off between the time gained from caching and the time lost to network transfers. For instance, a compilation step often produces intermediate build artifacts faster than they can be retrieved from the network.

Also, it has all the limitations of local layer caching. It is likely that each build contains changes to the source files, which means that starting at some instruction (like COPY . .), the cache is useless because all the following layers must be rebuilt. This is exactly what prevents us from fully optimizing the recompilation step in our current Dockerfile. And you will pay the price of pushing these new cache layers after each build even if they are not reused.

§
Builder-level

In the previous section, we eliminated the bottleneck of downloading the dependencies, so now we can tackle the bottleneck of their recompilation. For that, we need an orthogonal caching mechanism introduced with BuildKit, called builder caching.

The idea is that you can mount a cache volume into the image being built, and reuse it for subsequent builds on the same builder. Unlike layer caches, these cache volumes cannot be exported. That means there is no way to actively distribute this cache: each builder has to create and maintain its own copy (which is not that inefficient, for the reasons outlined in the previous section).

The Go compiler relies on the following locations to improve recompilation times:

  • /go/pkg/mod: downloaded dependencies.
  • ~/.cache/go-build: intermediate build artifacts.
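
You can confirm these paths from inside the build image with go env (in golang:1.23-alpine, GOPATH is /go and builds run as root, so the build cache ends up under /root):

Console
$ docker run --rm golang:1.23-alpine go env GOMODCACHE GOCACHE
/go/pkg/mod
/root/.cache/go-build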

We can mount build caches at these locations using the --mount option for RUN:

Dockerfile
FROM golang:1.23-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    go mod download -x
COPY . .
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    go build -x -ldflags='-s -w' -trimpath -o /app .

FROM scratch
COPY --from=build /app /app
CMD ["/app"]

If you add a new dependency or make some changes to the source files, you should see that only missing dependencies are downloaded and that the compilation step is much faster. But there are a few things to keep in mind:

  • You should not assume that these caches are always available; your build must also work without them. An easy way to check is to remove them with docker builder prune -a and build again (see the commands after this list).

  • Each time you invoke the go command (or any command that reads from this kind of cache), you have to explicitly mount all the necessary caches again.

  • Build caches can be used concurrently by multiple build jobs running on the same builder. To prevent issues with concurrent access, BuildKit provides several sharing policies: shared, locked, and private.

  • Each cache volume is associated with a target location. A rogue Dockerfile could mount such a volume and poison it, so there are security considerations when sharing builder instances between untrusted and production builds.

  • BuildKit's default GC policies are pretty aggressive, since they prune build caches older than 48 hours when they exceed 512 MB, but this is configurable.
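
To see what a builder currently holds, or to start from a clean slate as mentioned above, you can inspect the build cache and prune it entirely (this also removes the cache mounts):

Console
$ docker buildx du
$ docker builder prune --all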

Here are some additional examples:

Dockerfile
FROM alpine:3 AS apk
RUN --mount=type=cache,sharing=locked,target=/var/cache/apk \
    apk add -U curl

FROM debian:bullseye-slim AS apt
RUN --mount=type=cache,sharing=locked,target=/var/lib/apt \
    --mount=type=cache,sharing=locked,target=/var/cache/apt \
    rm -f /etc/apt/apt.conf.d/docker-clean && \
    apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends curl

FROM python:3.12-alpine AS pip
RUN python -m venv .venv
COPY requirements.txt .
RUN --mount=type=cache,target=/root/.cache/pip \
    ./.venv/bin/pip install -r requirements.txt

FROM node:20-slim AS yarn
COPY package.json yarn.lock ./
RUN --mount=type=cache,target=/usr/local/share/.cache/yarn \
    yarn install --frozen-lockfile --production

§
Multi-step pipelines

In some pipelines, building a Docker image is only one of many steps, typically among the final ones. So if you want to test your application, you will usually have a dedicated step where you will encounter exactly the same caching issues when downloading and building the dependencies prior to running the tests.

With GitHub Actions for example, these difficulties are mostly handled by third-party packages which provide appropriate caching hooks. Container-oriented solutions like Drone CI or GitLab's Docker-based runners will typically have you use a base image like golang:1.23-alpine for the ephemeral test container and run go test commands directly inside it, without giving a good solution for caching.

In this section, you will see how you can reuse the caching strategies from the previous section by implementing intermediate stages directly in the Dockerfile, which are then used by each pipeline stage.

§
Building a test image

First, let's add some tests:

main_test.go
package main

import (
    "strings"
    "testing"
)

func TestRunID(t *testing.T) {
    if !strings.HasPrefix(RunID(), "RunID-") {
        t.Fatalf("RunID doesn't start with \"RunID-\"")
    }
}

Executing this test locally is just a matter of running go test, which works the same way inside a Dockerfile:

Dockerfile
FROM golang:1.23-alpine AS base
WORKDIR /src
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    go mod download -x
COPY . .

FROM base AS test
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    go test -v ./...

FROM base AS build
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    go build -x -ldflags='-s -w' -trimpath -o /app .

FROM scratch
COPY --from=build /app /app
CMD ["/app"]

Since the final image doesn't depend on the test target, you have to invoke it explicitly with the --target <name> build option:

Console
$ docker build -t foo . --target test

If you invalidate the layer cache (for instance by adding a dummy environment variable before the RUN instruction) and you re-run the build command, the tests are considered cached by go test since there is no change to the source code and the previous run was saved to the cache volume:

Console
$ docker build -t foo . --target test --progress=plain
...
#10 1.006 === RUN   TestRunID
#10 1.006 --- PASS: TestRunID (0.00s)
#10 1.006 PASS
#10 1.006 ok    example.com/hello       (cached)
...

You could just as easily add a lint stage using the golangci-lint base image:

Dockerfile
FROM golangci/golangci-lint:v1.63.4-alpine AS lint
WORKDIR /src
COPY . .
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    --mount=type=cache,target=/root/.cache/golangci-lint \
    golangci-lint run -v
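
Like the test stage, the lint stage is not part of the final image's dependency chain, so it has to be requested explicitly:

Console
$ docker build -t lint . --target lint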

This works as long as your tests can be contained inside the image build process.

§
Running a test image

What happens if your tests depend on an external system like a database? Instead of running the test directly with RUN, you could prepare everything inside the image and use CMD to define the runtime test command. You would then set relevant environment variables like DATABASE_HOST and run both the database and the test image you've just built to execute your tests.
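
As a rough local sketch (the postgres image, the network name, and the DATABASE_HOST convention are illustrative assumptions, not something the sample app actually reads), this could look like:

Console
$ docker network create ci
$ docker run -d --rm --name db --network ci \
    -e POSTGRES_PASSWORD=secret postgres:16-alpine
$ docker build -t hello-test . --target test
$ docker run --rm --network ci -e DATABASE_HOST=db hello-test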

There are two main problems with that:

  1. If you use CMD go test, the tests will always be rebuilt and re-executed at runtime, because they cannot rely on the build-time cache volumes.
  2. In a CI pipeline, how do you run an image that you've just built without copying it to an external registry?

§
Pre-building the tests

There are various ways to solve the test caching problem in Go:

  1. Mount the cache directories at runtime (see the sketch after this list). It's a little inelegant since we already populate cache directories at build time, and it adds extra options to pass to docker run that are not documented in the Dockerfile.
  2. Run go test -v -run DummyTestTarget ./... so no tests are actually executed. You must be careful not to initialize global state that depends on external dependencies.
  3. Run go test -v -c ./... -o ./tests to pre-compile all the test executables into ./tests/ and run them with CMD.
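
For reference, the first option would look roughly like the following, assuming the image's CMD runs go test. The named volumes (arbitrary names here) are separate from the build-time cache mounts, which is part of what makes this approach inelegant:

Console
$ docker build -t hello-test . --target test
$ docker run --rm \
    -v hello-go-mod:/go/pkg/mod \
    -v hello-go-build:/root/.cache/go-build \
    hello-test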

Let's go with the third option:

Dockerfile
FROM base AS test
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    mkdir -p tests; \
    go test -v -c ./... -o ./tests
CMD set -ex; \
    for f in ./tests/*; do \
        "$f" -test.v; \
    done

Build and run the tests:

Console
$ docker build -t test . --target test --progress=plain
$ docker run --rm -it test
+ ./tests/hello.test -test.v
=== RUN   TestRunID
--- PASS: TestRunID (0.00s)
PASS

If your CI pipeline allows direct access to docker, then this is all you need.

§
Docker-in-Docker

Most often, CI pipelines can run Docker images but do not give access to the underlying Docker daemon (for obvious security reasons). Drone CI provides a plugin system that allows you to build Docker images, so you could first push the test image to a registry and then run it:

drone.yml
kind: pipeline
name: default
steps:
- name: build
  image: plugins/docker
  settings:
    target: test
    repo: foo/hello-test
    tags: ${DRONE_COMMIT_SHA}
- name: test
  image: foo/hello-test:${DRONE_COMMIT_SHA}

In Drone CI's execution model, the runner is a container with access to the host Docker daemon, so both plugins/docker and foo/hello-test are executed by the runner on the host Docker daemon. However, the Docker plugin itself doesn't get access to the host's Docker daemon, so images are built using Docker-in-Docker (DinD). Unfortunately, that prevents running a freshly built image without first pushing it to an external registry.

An obvious solution to this problem would be to share the host Docker daemon by mounting its socket, but that would be a clear security risk. As an alternative, you can use a Docker-in-Docker service and persist its data on the host:

drone.yml
kind: pipeline
name: default
steps:
- name: build
  image: docker
  volumes:
  - name: dockersock
    path: /var/run
  commands:
  - until docker info > /dev/null 2>&1; do sleep 1; done
  - docker build --target=test -t foo/hello-test:${DRONE_COMMIT_SHA} .
- name: test
  image: docker
  volumes:
  - name: dockersock
    path: /var/run
  commands:
  - docker run --rm foo/hello-test:${DRONE_COMMIT_SHA}
services:
- name: docker
  image: docker:dind
  privileged: true
  volumes:
  - name: docker
    path: /var/lib/docker
  - name: dockersock
    path: /var/run
volumes:
- name: docker
  host:
    path: /srv/docker
- name: dockersock
  temp: {}

The advantage is that we can build, tag, and run images locally without having to push them to an external registry, but there are two major issues:

  • The data from the DinD service must be persisted, otherwise the build cache would be deleted at the end. Because it is bound to a single directory on the host, you cannot have two pipelines executing simultaneously, so the runner concurrency must not exceed 1.

  • This pipeline requires bypassing Drone's security model, since DinD requires privileged execution and its data must be persisted on the host. Unfortunately, as soon as you enable trusted builds, a rogue pull request can change the pipeline definition to mount any other path from the host.

§
Runner-in-Docker-in-Docker

To better isolate the runner from the host, you can run it inside DinD. See drone-dind-runner for a sample compose.yml. There is no need to run an additional nested DinD service, so the daemon's socket can be mounted directly, which simplifies the pipeline configuration:

drone.yml
kind: pipeline
name: default
steps:
- name: build
  image: docker
  volumes:
  - name: dockersock
    path: /var/run/docker.sock
  commands:
  - docker build --target=test -t foo/hello-test:${DRONE_COMMIT_SHA} .
- name: test
  image: foo/hello-test:${DRONE_COMMIT_SHA}
volumes:
- name: dockersock
  host:
    path: /var/run/docker.sock

Enabling trusted builds is still required, but the security implications are slightly less concerning, at least for a private runner. (For a public runner however, direct access to the Docker daemon cannot be allowed.)

§
Further notes

This article is the result of trying to optimize Drone CI builds for the backend of this blog (written in Rust). Due to complex compile-time checks, Rust builds can be very slow, and when you add all the inefficiencies we've seen, it becomes a nightmare. Over time, I applied various optimizations to reduce the build time:

  1. Better layer caching using cargo-chef. The principle is the same as go mod download, although it goes a step further by also pre-building the dependencies, which saves considerable time (from 2 hours down to 20 minutes between a cold and hot build). Of course that requires being able to export the cache to an external registry so it can be reused between steps / builds. (.drone.yml, Dockerfile)

  2. Using a persistent builder to decrease layer cache download times. I used BuildKit directly, but this is conceptually identical to starting a Docker daemon. Not relying on Drone's Docker plugin allows downloading the necessary cache layers lazily, instead of always downloading everything. (.drone.yml, Dockerfile)

  3. Replacing cargo-chef with builder caching and using the "Runner-in-Docker-in-Docker" approach to persist both the layer and the builder cache, which allows reusing temporary images without having to push them to an external registry. Nowadays cold builds take 20 minutes and hot builds take 5 minutes, largely dominated by Rust release compilation and linking in the final steps. (.drone.yml, Dockerfile)

Other resources on this topic: