== Docker, Linux, Security. Kinda.

We will be exploring some Linux features in the context of a docker application container. Another way of putting it: we will talk about how to make more secure application containers. We will not talk about firewalls or apparmor, because those tools enhance security on the host in general and are not specific to a docker application container. A more secure host means a more secure application container, but that is a discussion for another post. We will focus on Linux containers, since FreeBSD containers are still experimental (see https://wiki.freebsd.org/Docker[here] and https://github.com/samuelkarp/runj[here]). Yes, Windows containers exist. We will also not discuss performance. Here be performance penalties, but again, that is not the focus of this post.

Before we begin: Linux docker containers are Linux. They use mostly functionality that existed before application containers in the form of docker were a thing. Knowing Linux better means you know Linux docker containers (application container is the more accurate term) better. We will see this point throughout this post.

=== Base Image

We start with the first building block of a new docker image, the base image. By far the most used base image is the Alpine docker base image, followed by the Debian and Ubuntu docker base images. These distros have two major differences that we want to focus on:

* the C standard library implementation
* the userspace utility implementation

Debian and Ubuntu (we are not forgetting that Ubuntu itself is a Debian derivative) both use glibc, as in GNU's https://www.gnu.org/software/libc/[libc] implementation. Alpine uses https://www.musl-libc.org/[musl] as its C standard library implementation. The major difference, which will come into play again later, is that glibc has been around for much longer, so it has to keep backwards compatibility for a much longer period of time and for far more things.
The general attitude of the glibc team is also that they have to support everything, since if they don't, who will? musl, on the other hand, does not try to support everything under the sun; it is a comparatively newer project and keeps its codebase lean. As a result not all applications work with musl, but a good number of them do. In simpler terms, musl has a far smaller attack surface than glibc.

On to our second point, the CLI utilities' implementation. Debian and Ubuntu use GNU's https://www.gnu.org/software/coreutils/[Coreutils] while Alpine uses https://busybox.net/[Busybox] (remember, we are talking about the most used application container bases; you can install a desktop version of Alpine with GNU Coreutils). Here we have the same situation as before: the GNU Coreutils are bigger, do more and have a larger attack surface. Busybox is smaller and does not support as many features as GNU Coreutils, but supports enough of them to be useful. Needless to say, because Busybox is small, it has a smaller attack surface.

To get a feel for how this plays out in the real world, look at some of the popular images that come in both Debian and Alpine flavours on dockerhub and compare the number of reported vulnerabilities for both bases. The theme we observe is simple: the bigger the attack surface, the bigger the number of vulnerabilities.

Alpine images are small, lean and functional, just like musl and Busybox, but there are still quite a few things on an Alpine image that are extraneous. We can take them out and still have a perfectly functioning application container. That's how we get https://github.com/GoogleContainerTools/distroless[distroless]. Distroless base images follow the same pattern as Alpine base docker images: less functionality, while still keeping enough to do the job, in order to minimize the attack surface.
Minimizing a base image like this means the base images become very specialized, so we have base images for golang, python, java and the like.

=== Docker Runtimes

By default docker uses containerd, which in turn uses runc as the runtime. There are two additional runtimes that we want to focus on, which try to provide a more secure runtime environment for docker:

* gVisor
* Kata

==== gVisor

gVisor creates a sandbox environment. Containers interact with the host through this sandbox. gVisor has two components: Sentry and Gofer. Sentry is an application kernel that runs the containers; it intercepts and responds to the system calls made by the application, so the application never directly issues syscalls to the host. Gofer handles filesystem access (other than `/proc`) for the application. The application itself is a regular application. gVisor aims to provide an environment equivalent to Linux 4.4, but it presently does not implement every system call, `/proc` file or `/sys` file; you can find the list of supported system calls for amd64 https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/[here]. Every sandbox environment gets its own instance of Sentry, and every container in the sandbox gets its own instance of Gofer.

....
 -------------
 |Application|
 -------------
      |system calls
      |
 --------    9p    -------
 |Sentry| <------> |Gofer|
 --------          -------
    | limited         |system
    | syscalls        |calls
 ---------------
 | Host Kernel |
 ---------------
        |
        |hardware
....

==== Kata

Kata creates a sandbox environment that containers interact with as a proxy, not too dissimilar to gVisor, but the main point of difference is that Kata uses a VM to achieve this. gVisor and Kata Containers allow us to implement defense in depth when it comes to application containers and host system security.

=== Capabilities and Syscalls

Let's talk about capabilities for a bit.
From https://manpages.debian.org/bookworm/manpages/capabilities.7.en.html[man 7 capabilities]:

[source,txt]
----
For the purpose of performing permission checks, traditional UNIX
implementations distinguish two categories of processes: privileged
processes (whose effective user ID is 0, referred to as superuser or
root), and unprivileged processes (whose effective UID is nonzero).
Privileged processes bypass all kernel permission checks, while
unprivileged processes are subject to full permission checking based
on the process's credentials (usually: effective UID, effective GID,
and supplementary group list).

Starting with Linux 2.2, Linux divides the privileges traditionally
associated with superuser into distinct units, known as capabilities,
which can be independently enabled and disabled. Capabilities are a
per-thread attribute.
----

Capabilities give you more granular control over which privileges to grant, instead of the all-or-nothing choice between root and non-root. Docker lets us choose which capabilities to give to a container. We can, for example, allow a non-privileged process to bind to privileged ports using capabilities.

As an example, a simple application making calls to API endpoints and writing the results back to a database does not require any capabilities. It can run under a non-privileged user with no capabilities and still do all the tasks it needs to do. That being said, determining which capabilities an application requires can be challenging, since there is no straightforward way of finding out. In some cases we can get away with dropping all capabilities, running our application and then figuring out, based on the error messages we receive, which capability is missing and needs to be granted. In other cases this may not be feasible or practical.
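To make the per-thread nature of capabilities concrete, a process can query its own capability bounding set directly. The short C sketch below (an illustration of mine, not part of docker or any of the tools discussed here) uses raw `prctl(2)` with `PR_CAPBSET_READ`, so it needs no extra libraries:

[source,c]
----
#include <stdio.h>
#include <sys/prctl.h>          /* prctl, PR_CAPBSET_READ */
#include <linux/capability.h>   /* CAP_* constants */

/* Ask the kernel whether a capability is in this thread's bounding set. */
static void probe(const char *name, int cap) {
  int r = prctl(PR_CAPBSET_READ, (unsigned long)cap, 0UL, 0UL, 0UL);
  printf("%-22s %s\n", name, r == 1 ? "in bounding set" : "dropped");
}

int main(void) {
  probe("CAP_NET_BIND_SERVICE", CAP_NET_BIND_SERVICE);
  probe("CAP_SYS_ADMIN", CAP_SYS_ADMIN);
  probe("CAP_NET_RAW", CAP_NET_RAW);
  return 0;
}
----

Running the same binary inside a container started with `docker run --cap-drop ALL` should report every probed capability as dropped, while `--cap-add` can selectively put one back.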
From https://manpages.debian.org/bookworm/manpages-dev/syscalls.2.en.html[man 2 syscalls]:

[source,txt]
----
The system call is the fundamental interface between an application
and the Linux kernel.
----

The Linux kernel lets us choose which of these interface calls an application is allowed to make. We can essentially filter, on a per-application basis, which syscalls are allowed and which are not. Docker exposes this functionality with an arguably friendlier interface.

Capabilities and syscall filtering are tools to implement the principle of least privilege. Ideally, we would like a container to have access to exactly what it needs: no more, and obviously no less.

==== Capabilities in the Wild

Capabilities are a Linux feature; docker allows us to use them with application containers. For a very simple example of setting capabilities on a regular executable on Linux, see https://manpages.debian.org/bookworm/libcap2-bin/setcap.8.en.html[man 8 setcap], which lets us set capabilities on a file.

==== Syscall Filtering in the Wild

As an example we will look at https://manpages.debian.org/bookworm/bubblewrap/bwrap.1.en.html[man 1 bwrap]. https://github.com/containers/bubblewrap[Bubblewrap] allows us to sandbox an application, not too dissimilar to docker; Flatpak uses Bubblewrap as part of its sandbox. Bubblewrap can optionally take in a list of syscalls to https://www.kernel.org/doc/html/v4.19/userspace-api/seccomp_filter.html[filter]. The filter is expressed as a BPF (Berkeley Packet Filter) program - remember when I said docker gives you a https://docs.docker.com/engine/security/seccomp/[friendlier] interface to seccomp? Below is a short program that generates a BPF filter which, when passed to an application through bwrap, logs all the syscalls the application makes to syslog.
[source,c]
----
#include <seccomp.h> /* libseccomp: seccomp_init, seccomp_export_* */

void log_all_syscalls(void) {
  /* default action: allow every syscall but log it */
  scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_LOG);
  seccomp_arch_add(ctx, SCMP_ARCH_X86_64);
  /* write the raw BPF program to stdout (fd 1) ... */
  seccomp_export_bpf(ctx, 1);
  /* ... and its human-readable PFC form to stderr (fd 2) */
  seccomp_export_pfc(ctx, 2);
  seccomp_release(ctx);
}

int main(int argc, char **argv) {
  log_all_syscalls();
}
----

Building is straightforward. Just remember to link against `libseccomp` with `-lseccomp`.

[source,bash]
----
gcc main.c -lseccomp
----

Running the above code we get this:

[source,txt]
----
> 5@#
# pseudo filter code start
#
# filter for arch x86_64 (3221225534)
if ($arch == 3221225534)
  # default action
  action LOG;
# invalid architecture action
action KILL;
#
# pseudo filter code end
#
----

[source,bash]
----
#!/usr/bin/dash
TEMP_LOG=/tmp/seccomp_logging_filter.bpf
./a.out > ${TEMP_LOG}
bwrap --seccomp 9 9<${TEMP_LOG} bash
----

Then we can go and see where the logs end up. On my host, they are logged under `/var/log/audit/audit.log` and they look like this:

....
type=SECCOMP msg=audit(1716144132.339:4036728): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036729): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036730): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036731): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036732): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036733): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=14 compat=0 ip=0x7fa5859664f4 code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigprocmask
type=SECCOMP msg=audit(1716144132.339:4036734): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036735): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=1 compat=0 ip=0x7fa5859ce5d0 code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=write
type=SECCOMP msg=audit(1716144132.339:4036736): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=1 compat=0 ip=0x7fa5859ce5d0 code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=write
type=SECCOMP msg=audit(1716144132.339:4036737): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=270 compat=0 ip=0x7fa5859d77bc code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=pselect6
....

Docker allows us to do the https://docs.docker.com/engine/security/seccomp/[same]. We can give docker a seccomp profile that filters out the syscalls that are not required for a specific container.
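For illustration, a docker seccomp profile is a JSON document. The fragment below is a minimal sketch of the profile shape, not something to ship as-is (the syscall list is intentionally incomplete): it denies every syscall by default, returning an error, and allowlists only a handful.

[source,json]
----
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "exit", "exit_group", "rt_sigreturn"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
----

A profile like this is handed to a container with `docker run --security-opt seccomp=/path/to/profile.json`.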
You can find the default docker seccomp profile https://github.com/moby/moby/blob/master/profiles/seccomp/default.json[here].

=== Namespaces

....
A namespace wraps a global system resource in an abstraction that
makes it appear to the processes within the namespace that they have
their own isolated instance of the global resource. Changes to the
global resource are visible to other processes that are members of
the namespace, but are invisible to other processes. One use of
namespaces is to implement containers.
....

From https://manpages.debian.org/bookworm/manpages/namespaces.7.en.html[man 7 namespaces].

You can think of a Linux namespace as doing almost the same thing a namespace does in some programming languages: it scopes what its members can see. Docker puts its containers in their own namespaces to further isolate the application containers from the host system.

==== Namespaces in the Wild

As an example, let's look at the script provided below. Here we are creating a new network namespace. The new interface is provided by simply connecting an Android phone for USB tethering. Depending on your setup and your `udev` naming rules the interface name will differ, but the concept is the same. We are creating a new network namespace for a second internet provider, which in this case is our Android phone. We then use this network namespace to execute commands in the context of that specific namespace. Essentially, we can choose which applications get to use the phone's internet and which ones use whatever we were previously connected to.
[source,sh]
----
#!/usr/bin/env sh
PHONE_NS=phone_ns
IF=enp0s20f0u6

sudo ip netns add ${PHONE_NS}
sudo ip link set ${IF} netns ${PHONE_NS}
sudo ip netns exec ${PHONE_NS} ip link set ${IF} up
sudo ip netns exec ${PHONE_NS} ip link set dev lo up
sudo ip netns exec ${PHONE_NS} dhclient ${IF}
----

[source,sh]
----
$ sudo ip netns exec phone_ns curl -4 icanhasip.com
113.158.237.102
$ curl -4 icanhasip.com
114.201.132.98
----

*_HINT_*: The IP addresses are made up. The only thing that matters is that they are different.

Since the Android phone's interface lives in another namespace, the two uplinks cannot interfere with each other. This is pretty much how docker uses namespaces. Without network namespaces we would have to make a small VM, run a VPN on the VM, expose a socks5 proxy from the VM to the host and then have applications pass their traffic through that socks5 proxy, with varying degrees of success.

*_NOTE_*: Since we are not running the script on a hook, you might knock out your networking by having two upstreams at the same time. In that case, run the script and then restart NetworkManager or whatever network manager you use.

=== SBOM and Provenance Attestation

What is an SBOM? NIST defines an SBOM as a ``formal record containing the details and supply chain relationships of various components used in building software''. It contains details about the components used to create a certain piece of software. SBOMs are meant to help mitigate the threat of supply chain attacks (remember xz?).

What is provenance?

....
The provenance attestations include facts about the build process, including details such as:

Build timestamps
Build parameters and environment
Version control metadata
Source code details
Materials (files, scripts) consumed during the build
....

https://docs.docker.com/build/attestations/sbom/[source]

==== Example

Let's review everything we learned in the form of a light exercise. For the first build, we use a non-vendored version.
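As an aside, for a Go codebase the vendored variant is cheap to produce. The sketch below (assuming the Go toolchain is installed, run from the module root) is all it takes:

[source,sh]
----
# Copy every dependency listed in go.mod/go.sum into ./vendor.
go mod vendor

# Build against the vendored copies; with a vendor/ directory present,
# recent Go toolchains use it automatically, but we can be explicit.
go build -mod=vendor ./...
----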
Vendoring means that you store your dependencies locally, which puts you in control of them. You don't need to pull them from a remote, so even if one or more of your dependencies disappears from its remote, your build still works. One of the more famous examples is Lua: the Lua team actually recommends vendoring your Lua dependency. Vendoring also helps with build reproducibility.

We will use https://github.com/terminaldweller/milla[milla] as an example. It's a simple Go codebase.

[source,dockerfile]
----
FROM alpine:3.19 as builder

RUN apk update && \
    apk upgrade && \
    apk add go git

WORKDIR /milla

COPY go.sum go.mod /milla/
RUN go mod download
COPY *.go /milla/
RUN go build

FROM alpine:3.19

ENV HOME /home/user
RUN set -eux; \
    adduser -u 1001 -D -h "$HOME" user; \
    mkdir "$HOME/.irssi"; \
    chown -R user:user "$HOME"

COPY --from=builder /milla/milla "$HOME/milla"
RUN chown user:user "$HOME/milla"

ENTRYPOINT ["/home/user/milla"]
----

The first docker image build is fairly simple. We copy the source code in, fetch our dependencies and build the executable. For the second stage of the build, we simply put the executable into a new base image and we are done.

The second build is a vendored build on a golang distroless base. We copy over the source code for the project along with all its vendored dependencies and then do the same as before.

[source,dockerfile]
----
FROM golang:1.21 as builder
WORKDIR /milla
COPY go.sum go.mod /milla/
COPY vendor/ /milla/vendor/
COPY *.go /milla/
RUN CGO_ENABLED=0 go build -mod=vendor

FROM gcr.io/distroless/static-debian12
COPY --from=builder /milla/milla "/usr/bin/milla"
ENTRYPOINT ["milla"]
----

Below you can see an example docker compose file. Milla can optionally use a postgres database to store messages, so we also include a pgadmin instance. Now let's talk about the docker compose file.

[source,yaml]
----
services:
  terra:
    image: milla_distroless_vendored
    build:
      context: .
      dockerfile: ./Dockerfile_distroless_vendored
    deploy:
      resources:
        limits:
          memory: 128M
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
    networks:
      - terranet
    user: "1000:1000"
    restart: unless-stopped
    entrypoint: ["/usr/bin/milla"]
    command: ["--config", "/config.toml"]
    volumes:
      - ./config.toml:/config.toml
      - /etc/localtime:/etc/localtime:ro
    cap_drop:
      - ALL
    runtime: runsc
  postgres:
    image: postgres:16-alpine3.19
    deploy:
      resources:
        limits:
          memory: 4096M
    logging:
      driver: "json-file"
      options:
        max-size: "200m"
    restart: unless-stopped
    ports:
      - "127.0.0.1:5455:5432/tcp"
    volumes:
      - terra_postgres_vault:/var/lib/postgresql/data
      - ./scripts/:/docker-entrypoint-initdb.d/:ro
    environment:
      - POSTGRES_PASSWORD_FILE=/run/secrets/pg_pass_secret
      - POSTGRES_USER_FILE=/run/secrets/pg_user_secret
      - POSTGRES_INITDB_ARGS_FILE=/run/secrets/pg_initdb_args_secret
      - POSTGRES_DB_FILE=/run/secrets/pg_db_secret
    networks:
      - terranet
      - dbnet
    secrets:
      - pg_pass_secret
      - pg_user_secret
      - pg_initdb_args_secret
      - pg_db_secret
    runtime: runsc
  pgadmin:
    image: dpage/pgadmin4:8.6
    deploy:
      resources:
        limits:
          memory: 1024M
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
    environment:
      - PGADMIN_LISTEN_PORT=${PGADMIN_LISTEN_PORT:-5050}
      - PGADMIN_DEFAULT_EMAIL=${PGADMIN_DEFAULT_EMAIL:-devi@terminaldweller.com}
      - PGADMIN_DEFAULT_PASSWORD_FILE=/run/secrets/pgadmin_pass
      - PGADMIN_DISABLE_POSTFIX=${PGADMIN_DISABLE_POSTFIX:-YES}
    ports:
      - "127.0.0.1:5050:5050/tcp"
    restart: unless-stopped
    volumes:
      - terra_pgadmin_vault:/var/lib/pgadmin
    networks:
      - dbnet
    secrets:
      - pgadmin_pass

networks:
  terranet:
    driver: bridge
  dbnet:

volumes:
  terra_postgres_vault:
  terra_pgadmin_vault:

secrets:
  pg_pass_secret:
    file: ./pg/pg_pass_secret
  pg_user_secret:
    file: ./pg/pg_user_secret
  pg_initdb_args_secret:
    file: ./pg/pg_initdb_args_secret
  pg_db_secret:
    file: ./pg/pg_db_secret
  pgadmin_pass:
    file: ./pgadmin/pgadmin_pass
----

We are assigning memory usage limits for the containers.
We are also limiting the size of the logs we keep on disk. One thing we did not talk about before is the networking side of compose. As you can see, the postgres and pgadmin containers share one network, while the postgres container and milla share another. This means milla and pgadmin have no access to each other, in line with the principle of least privilege: milla and pgadmin don't need to talk to each other, so they can't. We also refrain from using host networking.

We bind the open ports to the host's localhost interface, which prevents anything outside the host from connecting to the endpoints directly. In our example we don't need the ports to be exposed to the internet, but we do need access to them ourselves. What we can do is bind the open ports to the host's localhost and then use ssh to forward the ports onto our own machine, assuming the docker host is a remote:

[source,sh]
----
ssh -L 127.0.0.1:5460:127.0.0.1:5455 user@remotehost
----

While building milla, in the second stage of the build, we made a non-privileged user, and we are now mapping a non-privileged user on the host to that user. We drop all capabilities from milla, since milla only makes requests and has no server functionality; it will only need to bind to high-numbered ports, which requires no special privileges. We run both postgres and milla with gVisor's runsc runtime, since it is possible to do so. Finally, we use docker secrets to put the secrets into the containers' runtime environment.

Now, on to the attestations. In order to view the SBOM for the image we will use docker https://docs.docker.com/scout/install/[scout].

[source,sh]
----
docker scout sbom milla
docker scout sbom milla_distroless_vendored
----

The SBOMs can be viewed https://gist.github.com/terminaldweller/8e8ecdcb68d4052aecb6804823648b4d[here] and https://gist.github.com/terminaldweller/f4ede7122f159506f8e6e6be2bfd6a8b[here] respectively.

Now let's look at the provenance attestations.
[source,sh]
----
docker buildx imagetools inspect terminaldweller/milla:main --format "{{ json .Provenance.SLSA }}"
----

And https://gist.github.com/terminaldweller/033ae07a9e685db85b18eb822dea4be3[here] you can look at the result.

=== Further Reading

* https://manpages.debian.org/bookworm/manpages/cgroups.7.en.html[man 7 cgroups]
* system containers using https://github.com/lxc/incus[lxc/incus]
* https://katacontainers.io/[katacontainers]

timestamp:1716163133 +
version:1.0.0 +
https://blog.terminaldweller.com/rss/feed +
https://raw.githubusercontent.com/terminaldweller/blog/main/mds/securedocker.md