Creating SECCOMP profiles for docker containers

Goal

Docker's default seccomp profile is enabled out of the box, yet it still allows more than 300 syscalls, roughly three quarters of the syscalls available on Linux.

We pay a hefty performance cost for enabling seccomp so we may as well get some serious protection from it!

The same is true for containerd/Kubernetes, although GKE, for example, does not enable seccomp by default outside of Autopilot.

Profiles can be created by hand, but that takes expertise few people possess. Fortunately a great tool exists, oci-seccomp-bpf-hook, but it is designed and documented to run with podman, not with docker and docker compose.


Getting oci-seccomp-bpf-hook to play nicely with docker compose

Problems

Docker is loosely OCI compliant, but it cannot run OCI hooks natively.

docker compose only recently started supporting annotations, and a bug kept that support from being usable until very recently.


Solution (Ubuntu 24.04)

Note that none of this is needed to use a generated seccomp profile; it is only needed to generate a new one.

Building the tools

Some dependencies to install:

sudo apt install bpfcc-tools libseccomp-dev golang

Remove the docker-compose-v2 package.

git clone and build the following repos:

Configuration

Configure oci-seccomp-bpf-hook in oci-add-hooks:

/etc/docker/oci-add-hooks.json

{
  "hooks": {
    "prestart": [
      {
        "path": "/usr/local/libexec/oci/hooks.d/oci-seccomp-bpf-hook",
        "args": ["oci-seccomp-bpf-hook", "-s"]
      }
    ]
  }
}

Configure oci-add-hooks in docker:

/etc/docker/daemon.json

{
  "runtimes": {
    "oci-add-hook": {
      "path": "/usr/local/bin/oci-add-hooks",
      "runtimeArgs": ["--hook-config-path",
        "/etc/docker/oci-add-hooks.json",
        "--runtime-path",
        "/usr/sbin/runc"]
    }
  }
}
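
If /etc/docker/daemon.json already holds other settings, the runtime entry has to be merged into it rather than pasted over the whole file. Here is a minimal Python sketch of that merge, using the paths from the configuration above. The function name add_runtime and the sample existing configuration are made up for illustration, and the script only prints the merged result instead of writing to /etc/docker:

```python
import json

# Runtime entry taken from the daemon.json shown above.
OCI_ADD_HOOK_RUNTIME = {
    "path": "/usr/local/bin/oci-add-hooks",
    "runtimeArgs": [
        "--hook-config-path", "/etc/docker/oci-add-hooks.json",
        "--runtime-path", "/usr/sbin/runc",
    ],
}


def add_runtime(daemon_config, name="oci-add-hook", runtime=None):
    """Return a copy of daemon_config with the extra runtime registered.

    Existing top-level keys and other runtimes are left untouched.
    (add_runtime is a hypothetical helper, not part of any Docker tooling.)
    """
    merged = dict(daemon_config)
    runtimes = dict(merged.get("runtimes", {}))
    runtimes[name] = dict(runtime or OCI_ADD_HOOK_RUNTIME)
    merged["runtimes"] = runtimes
    return merged


if __name__ == "__main__":
    # Hypothetical pre-existing configuration, for illustration only.
    existing = {"log-driver": "json-file"}
    print(json.dumps(add_runtime(existing), indent=2))
```

Review the printed JSON before copying it to /etc/docker/daemon.json and restarting dockerd.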

You should now be able to restart the dockerd service, and everything should work as usual. The only difference is that an extra runtime, oci-add-hook, is available:

$ sudo docker info | grep Runtime
 Runtimes: io.containerd.runc.v2 oci-add-hook runc
 Default Runtime: runc

Using our new powers

To run the hook, we need to choose oci-add-hook as a runtime for a particular service, and we need to annotate the service to tell oci-seccomp-bpf-hook how to behave.

Example: securing the traefik-forward-auth middleware

traefik-forward-auth is a traefik middleware that intercepts calls to an application and redirects users to an OIDC endpoint, such as keycloak, if they are not authenticated. It turns traefik into an authenticating access proxy.

Surely it doesn't need access to more than 300 syscalls...

To instrument it, it is enough to add this to the service definition:

    runtime: oci-add-hook
    annotations:
      - "io.containers.trace-syscall=of:/tmp/traefik-forward-auth.json;if:/etc/docker/default-seccomp.json"

The if: part is optional; see the oci-seccomp-bpf-hook documentation.

After re-creating the container, let it handle some normal, real-world traffic, then stop it. You can now remove the special runtime from the service definition.

If it all went well, you should obtain a nice /tmp/traefik-forward-auth.json file that looks like this (once beautified with jq):

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_AARCH64"
  ],
  "syscalls": [
    {
      "names": [
        "accept4",
        "bind",
        "brk",
        "capget",
        "capset",
        "chdir",
        "clone",
        "close",
        "connect",
        "dup3",
        "epoll_create1",
        "epoll_ctl",
        "epoll_pwait",
        "execve",
        "exit_group",
        "faccessat2",
        "fchdir",
        "fchown",
        "fcntl",
        "fstat",
        "fstatfs",
        "futex",
        "getcwd",
        "getdents64",
        "getpeername",
        "getpid",
        "getppid",
        "getrandom",
        "getsockname",
        "getsockopt",
        "gettid",
        "listen",
        "madvise",
        "mmap",
        "mount",
        "nanosleep",
        "newfstatat",
        "openat",
        "pipe2",
        "pivot_root",
        "prctl",
        "prlimit64",
        "read",
        "rt_sigaction",
        "rt_sigprocmask",
        "rt_sigreturn",
        "sched_getaffinity",
        "sched_yield",
        "setgid",
        "setgroups",
        "sethostname",
        "setsockopt",
        "setuid",
        "sigaltstack",
        "socket",
        "statfs",
        "tgkill",
        "umask",
        "umount2",
        "write"
      ],
      "action": "SCMP_ACT_ALLOW",
      "args": [],
      "comment": "",
      "includes": {},
      "excludes": {}
    },
    {
      "action": "SCMP_ACT_ALLOW",
      "args": [],
      "comment": "",
      "includes": {},
      "excludes": {}
    }
  ]
}

That is a grand total of 60 syscalls, at least 80% fewer than the 300-plus the default profile allows, and the kernel attack surface shrinks accordingly, at no extra performance cost!

Of course, traefik-forward-auth is a fairly simple middleware. What about bigger applications? I profiled a Nextcloud container: it needs 132 syscalls, which is still a good reduction.
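
Counting by hand gets tedious on larger profiles. The short Python sketch below shows one way to do it, assuming the profile layout shown above (a "syscalls" list of rules, each with an optional "names" list and an "action"). The function name allowed_syscalls and the inline sample profile are illustrative only:

```python
import json


def allowed_syscalls(profile):
    """Return the sorted, de-duplicated syscall names the profile allows."""
    names = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            # Rules without a "names" list contribute nothing.
            names.update(rule.get("names") or [])
    return sorted(names)


def allowed_syscalls_from_file(path):
    """Load a profile from a JSON file and list its allowed syscalls."""
    with open(path) as handle:
        return allowed_syscalls(json.load(handle))


if __name__ == "__main__":
    # Tiny stand-in profile; a real run would instead call
    # allowed_syscalls_from_file("/tmp/traefik-forward-auth.json").
    sample = {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": ["read", "write", "close"], "action": "SCMP_ACT_ALLOW"},
            {"action": "SCMP_ACT_ALLOW", "args": []},
        ],
    }
    print(len(allowed_syscalls(sample)), "syscalls allowed")
```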

Putting it to the test

Once you have your JSON file, make it part of your deployment and enforce it with the security_opt service element in docker compose. For example, assuming the file was copied into the deployment as traefik-forward-auth-seccomp.json:

    security_opt:
      - seccomp:traefik-forward-auth-seccomp.json

Then re-create your container. It should now be limited to the syscalls that were encountered during the profiling phase.
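
Because the profile only contains what was seen while profiling, a rarely exercised code path may later fail under enforcement. One way to track that down is to profile the service again and compare the new profile with the enforced one. Here is a minimal Python sketch of that comparison, assuming both files use the layout shown earlier. The function names and the sample data are hypothetical:

```python
def syscall_names(profile):
    """Collect the syscall names allowed by a seccomp profile dict."""
    names = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            names.update(rule.get("names") or [])
    return names


def missing_syscalls(enforced, reprofiled):
    """Syscalls seen in the new profiling run that the enforced profile lacks."""
    return sorted(syscall_names(reprofiled) - syscall_names(enforced))


if __name__ == "__main__":
    # Stand-in data; real profiles would be loaded with json.load().
    enforced = {"syscalls": [
        {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"},
    ]}
    reprofiled = {"syscalls": [
        {"names": ["read", "write", "fsync"], "action": "SCMP_ACT_ALLOW"},
    ]}
    print(missing_syscalls(enforced, reprofiled))  # prints ['fsync']
```

Any names it reports are candidates to add to the enforced profile after reviewing why the application needs them.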
