path: root/mds/securedocker.txt



== Docker, Linux, Security. Kinda.

We will be exploring some Linux features in the context of a docker
application container. Another way of explaining it would be to say we
will talk about how to make more secure application containers. We will
not talk about firewall and apparmor because they are tools that enhance
security on the host in general and not specific to a docker application
container. A secure host means a more secure application container but
that is discussion for another post. We will focus on Linux containers
since FreeBSD containers are still experimental(see
https://wiki.freebsd.org/Docker[here] and
https://github.com/samuelkarp/runj[here]). Yes, windows containers
exist. We will not discuss performance. Here be performance penalties,
but again that is not the focus of this post.

Before we begin, Linux docker containers are Linux. They are using most
of the functionality that existed before application containers in the
form of docker were a thing. Knowing Linux better means you know Linux
Docker containers(application containers is a more correct term) better.
We will see this point throughout this post.

=== Base Image

We start with the first building block of a new docker image, The base
image. By far the most used base images are the Alpine docker base
image, followed by Debian and Ubuntu docker base images. These distros
have two major differences that we want to focus on:

* C standard library implementation
* the userspace utility implementation

Debian and Ubuntu(we are not forgetting that Ubuntu itself is a Debian
derivative) both use glibc, as in gnu’s
https://www.gnu.org/software/libc/[libc] implementation. Alpine uses
https://www.musl-libc.org/[musl-libc] as its C standard library
implementation. The major difference here which will come into play
later on again is glibc has been around for much longer, so it has to
keep backwards compatibility for a much longer period of time and for
far more many things. Also the general attitude with the glibc team is
that they have to support everything since if they don’t then who will?
Libmusl on the other hand, does not try to support everything under the
sun, a relatively newer project, comparatively, and, they keep their
codebase lean. As a result not all applications are supported by libmusl
but a good number of them are. In simpler terms, libmusl has a far
smaller attack surface compared to glibc.

On to our second point, which is the cli utilities’ implementation.
Debian and Ubuntu use gnu’s
https://www.gnu.org/software/coreutils/[Coreutils] while Alpine uses
https://busybox.net/[Busybox](remember, we are talking about the most
used application container bases. You can install a desktop version of
Alpine with GNU coreutils). Here we have the same situation as before,
The GNU coreutils are bigger, do more and have a larger attack surface.
Busybox is smaller, does not support as many features as GNU Coreutils
but does support enough of them to make them useful. Needless to say,
busybox is small and hence, it has a smaller attack surface.

To get a feel for how this plays out in the real world, you can look at
some of the popular images that come in both Debian and Alpine flavours
on dockerhub. Take a look at the number of reported vulnerabilities for
both bases. The theme we observe is simple. The bigger the attack
surface the bigger the number of vulnerabilities.

Alpine images are small, lean and functional, just like libmusl and
busybox but there are still quite a few things on an alpine image that
are extraneous. We can take them out and have a perfectly functioning
application container.

That’s how we get
https://github.com/GoogleContainerTools/distroless[distroless].
Distroless base images follow the same pattern as alpine base docker
images, as in, less functionality while still keeping enough
functionality to be able to do the job and minimize the attack surface.
Minimizing a base image like this means that the base images are very
specialized so we have base images for golang, python, java and the
like.

=== Dokcer Runtimes

By default docker uses containerd which in turn uses runc for the
runtime. There are two additional runtimes that we want to focus on who
try to provide a more secure runtime environment for docker.

* gvisor
* kata

==== gvisor

gVisor creates a sandbox environment. Containers interact with the host
through this sandboxed environment. gvisor has two components. Gofer and
Sentry. Sentry is a kernel that runs the containers and intercepts and
responds to system calls made by the application so as not to have an
application directly control the syscalls that it makes. Gofer handles
filesystem access(not /proc) for the application. The application is a
regular application. gVisor aims to provide an environment equivalent to
Linux 4.4. gvisor presently does not implement every system call,
`+/proc+` file or `+/sys+` file. Every sandbox environment gets its own
instance of Sentry. Every container in the sandbox gets its own instance
of Gofer. gVisor currently does not support all system calls. You can
find the list of supported system calls for amd64
https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/[here].

....
  -------------
  |Application|
  -------------
       |system calls
       |
    --------   9p    -------
    |Sentry|<------->|Gofer|
    --------         -------
         | limited    |system
         | syscalls   |calls
        ---------------
        | Host Kernel |
        ---------------
               |
               |hardware
....

==== kata

Kata creates a sandbox environment for containers to interact with as
proxy, not too dissimilar to gvisor but the main point of difference is
that kata uses a VM to achieve this.

gVisor and katacontainers allow us to implement defense in depth when it
comes to application containers and host system security.

=== Capabilites and Syscalls

Let’s talk about capabilities for a bit.

From
https://manpages.debian.org/bookworm/manpages/capabilities.7.en.html[man
7 capabilities]:

[source,txt]
----
For the purpose of performing permission checks, traditional UNIX implementations distinguish two
categories of processes: privileged processes (whose effective user ID is 0, referred to as
superuser or root), and unprivileged processes (whose effective UID is nonzero).  Privileged
processes bypass all kernel permission checks, while unprivileged processes are subject to full
permission checking based on the process's credentials (usually: effective UID, effective GID, and
supplementary group list).

Starting with Linux 2.2, Linux divides the privileges traditionally associated with superuser into
distinct units, known as capabilities, which can be independently enabled and disabled.
Capabilities are a per-thread attribute.
----

Capabilities give you a more granular control over which privileges to
give instead of just root and non-root. Docker let’s us choose which
capabilities to give to a container. So we can for example allow a
non-privileged process to bind to privileged ports using capabilities.
As an example, a simple application making calls to API endpoints and
writing results back to a database does not require any capabilities. It
can run under a non-privileged user with no capabilities and do all the
tasks that it needs to do. That being said, determining which
capabilities are required can be a bit challenging when it comes to
certain applications since there is no straightforward way of achieving
this. In certain cases we can get away with dropping all capabilities,
running our application and then trying to figure out, based on the
received error messages, which capability is missing and needs to be
given to the application. But in certain cases this may not be feasible
or practical.

From
https://manpages.debian.org/bookworm/manpages-dev/syscalls.2.en.html[man
2 sycalls]:

[source,txt]
----
The system call is the fundamental interface between an application and the Linux kernel.
----

The Linux kernel lets us choose which ones of these interface calls can
be allowed to be made by an application. We can essentially filter which
syscalls are allowed and which ones are not on a per application basis.
Docker enables this functionality with an arguably more friendly
approach. Capabilities and syscall filtering are tools to implement
principle of least privilege. Ideally, we would like to allow a
container to only have access to what it needs and just that. Not more,
and obviously not less.

==== capabilities in the wild

Capabilities are a Linux feature, docker allows us to use that with
application containers. We’ll look at a very simple example of how one
can set capabilities for a regular executable on Linux.
https://manpages.debian.org/bookworm/libcap2-bin/setcap.8.en.html[man 8
setcap] lets us set capabilities for a file.

==== syscall Filtering in the wild

As an example we will look at
https://manpages.debian.org/bookworm/bubblewrap/bwrap.1.en.html[man 1
bwrap]. https://github.com/containers/bubblewrap[Bubblewrap] allows us
to sandbox an application, not too dissimilar to docker. Flatpaks use
bubblewrap as part of their sandbox. Bubblewrap can optionally take in a
list of syscalls to
https://www.kernel.org/doc/html/v4.19/userspace-api/seccomp_filter.html[filter].
The filter is expressed as a BPF(Berkley Packet Filter program -
remember when I said docker gives you a
https://docs.docker.com/engine/security/seccomp/[friendlier] interface
to seccomp?) program. Below is a short program that defines a BPF
program that can be passed to an application using bwrap that lets us
log all the sycalls the application makes to syslog.

[source,c]
----
#include <fcntl.h>
#include <seccomp.h>
#include <stdbool.h>
#include <unistd.h>

void log_all_syscalls(void) {
  scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_LOG);
  seccomp_arch_add(ctx, SCMP_ARCH_X86_64);
  seccomp_export_bpf(ctx, 1);
  seccomp_export_pfc(ctx, 2);
  seccomp_release(ctx);
}

int main(int argc, char **argv) {
  log_all_syscalls();
}
----

Building is straightforward. Just remember to link against
`+libseccomp+` with `+-lseccomp+`.

[source,bash]
----
gcc main.c -lseccomp
----

Running the above code we get this:

[source,txt]
----
 > 5@#
# pseudo filter code start
#
# filter for arch x86_64 (3221225534)
if ($arch == 3221225534)
  # default action
  action LOG;
# invalid architecture action
action KILL;
#
# pseudo filter code end
#
----

[source,bash]
----
#!/usr/bin/dash

TEMP_LOG=/tmp/seccomp_logging_filter.bpf

./a.out > ${TEMP_LOG}

bwrap --seccomp 9 9<${TEMP_LOG} bash
----

Then we can go and see where the logs end up. On my host, they are
logged under `+/var/log/audit/audit.log+` and they look like this:

....
type=SECCOMP msg=audit(1716144132.339:4036728): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036729): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036730): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036731): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036732): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036733): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=14 compat=0 ip=0x7fa5859664f4 code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigprocmask
type=SECCOMP msg=audit(1716144132.339:4036734): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036735): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=1 compat=0 ip=0x7fa5859ce5d0 code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=write
type=SECCOMP msg=audit(1716144132.339:4036736): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=1 compat=0 ip=0x7fa5859ce5d0 code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=write
type=SECCOMP msg=audit(1716144132.339:4036737): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=270 compat=0 ip=0x7fa5859d77bc code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=pselect6
....

Docker allows us to do the
https://docs.docker.com/engine/security/seccomp/[same]. We can give
docker a seccomp profile to filter out the syscalls that are not
required for a specific container. You can find the default docker
seccomp profile
https://github.com/moby/moby/blob/master/profiles/seccomp/default.json[here].

=== Namespaces

....
A namespace wraps a global system resource in an abstraction that makes it appear to the processes
within the namespace that they have their own isolated instance of the global resource.  Changes
to the global resource are visible to other processes that are members of the namespace, but are
invisible to other processes.  One use of namespaces is to implement containers.
....

From
https://manpages.debian.org/bookworm/manpages/namespaces.7.en.html[man 7
namespaces]. You can think of namespaces as almost the same thing as a
namespace does in some programming languages. Docker uses its own
namespaces for the containers so as to further isolate the application
containers from the host system.

==== Namespaces in the Wild

As an example let’s look at the script provided below. Here we are
creating a new network namespace. The new interface is provided by
simply connecting an android phone for USB tethering. Depending on the
situation you have going on and the `+udev+` naming rules the interface
name will differ but the concept is the same. We are creating a new
network namespace for a second internet provider, which in this case, is
our android phone. We then use this network namespace to execute
commands in the context of this specific network namespace. Essentially,
we can choose which applications get to use our phone internet and which
ones use whatever it is we were previously connected to.

[source,sh]
----
#!/usr/bin/env sh
PHONE_NS=phone_ns
IF=enp0s20f0u6

sudo ip netns add ${PHONE_NS}
sudo ip link set ${IF} netns ${PHONE_NS}
sudo ip netns exec ${PHONE_NS} ip link set ${IF} up
sudo ip netns exec ${PHONE_NS} ip link set dev lo up
sudo ip netns exec ${PHONE_NS} dhclient ${IF}
----

[source,sh]
----
$ sudo ip netns exec home_ns curl -4 icanhaveip.com
113.158.237.102
$ curl -4 icanhasip.com
114.201.132.98
----

*_HINT_*: The IP addresses are made up. The only thing that matters is
that they are different.

Since we have the android phone’s interface on another namespace the two
cannot interfere with each other. This is pretty much how docker uses
namespaces. Without a network namespace we would have to make a small
VM, run a VPN on the VM and then make a socks5 proxy to the VM from the
host and then have applications pass their traffic through a socks5
proxy with varying degrees of success. *_NOTE_*: since we are not
running the script on a hook, you might blow out your net having two
upstreams at the same time. In which case, run the script, then restart
NetworkManager or whatever you have.

=== SBOM and Provenance Attestation

What is SBOM? NIST defines SBOM as a "`formal record containing the
details and supply chain relationships of various components used in
building software.`". It contains details about the components used to
create a certain piece of software. SBOM is meant to help mitigate the
threat of supply chain attacks(remember xz?).

What is provenance?

....
The provenance attestations include facts about the build process, including details such as:

    Build timestamps
    Build parameters and environment
    Version control metadata
    Source code details
    Materials (files, scripts) consumed during the build
....

https://docs.docker.com/build/attestations/sbom/[source]

==== Example

Let’s review all that we learned about in the form of a light exercise.

For the first build, we use a non-vendored version. Vendoring means that
you store your dependencies locally. This means you are in control of
your dependencies. You don’t need to pull them from a remote. Even if
one or more of your dependencies One of the more famous examples is Lua.
The Lua foundation actually recommend vendoring your Lua dependency.
Vendoring helps with build reproducability.

We will use https://github.com/terminaldweller/milla[milla] as an
exmaple. It’s a simple go codebase.

[source,dockerfile]
----
FROM alpine:3.19 as builder
RUN apk update && \
      apk upgrade && \
      apk add go git
WORKDIR /milla
COPY go.sum go.mod /milla/
RUN go mod download
COPY *.go /milla/
RUN go build

FROM alpine:3.19
ENV HOME /home/user
RUN set -eux; \
  adduser -u 1001 -D -h "$HOME" user; \
  mkdir "$HOME/.irssi"; \
  chown -R user:user "$HOME"
COPY --from=builder /milla/milla "$HOME/milla"
RUN chown user:user "$HOME/milla"
ENTRYPOINT ["home/user/milla"]
----

The first docker image build is fairly simple. We copy the source code
in, get our dependencies and build a static executable. As for the
second stage of the build, we simply put the executable into a new base
image and we are done.

The second build which is a vendored build with a golang distroless
base. We copy over the source code for the project and all its
dependencies and then do the same as before.

[source,dockerfile]
----
FROM golang:1.21 as builder
WORKDIR /milla
COPY go.sum go.mod /milla/
RUN go mod download
COPY *.go /milla/
RUN CGO_ENABLED=0 go build

FROM gcr.io/distroless/static-debian12
COPY --from=builder /milla/milla "/usr/bin/milla"
ENTRYPOINT ["milla"]
----

Below You can see an example docker compose file. Milla can optionally
use a postgres database to store messages. We also include a pgadmin
instance. Now let’s talk about the docker compose file.

[source,yaml]
----
services:
  terra:
    image: milla_distroless_vendored
    build:
      context: .
      dockerfile: ./Dockerfile_distroless_vendored
    deploy:
      resources:
        limits:
          memory: 128M
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
    networks:
      - terranet
    user: 1000:1000
    restart: unless-stopped
    entrypoint: ["/usr/bin/milla"]
    command: ["--config", "/config.toml"]
    volumes:
      - ./config.toml:/config.toml
      - /etc/localtime:/etc/localtime:ro
    cap_drop:
      - ALL
    runtime: runsc
  postgres:
    image: postgres:16-alpine3.19
    deploy:
      resources:
        limits:
          memory: 4096M
    logging:
      driver: "json-file"
      options:
        max-size: "200m"
    restart: unless-stopped
    ports:
      - "127.0.0.1:5455:5432/tcp"
    volumes:
      - terra_postgres_vault:/var/lib/postgresql/data
      - ./scripts/:/docker-entrypoint-initdb.d/:ro
    environment:
      - POSTGRES_PASSWORD_FILE=/run/secrets/pg_pass_secret
      - POSTGRES_USER_FILE=/run/secrets/pg_user_secret
      - POSTGRES_INITDB_ARGS_FILE=/run/secrets/pg_initdb_args_secret
      - POSTGRES_DB_FILE=/run/secrets/pg_db_secret
    networks:
      - terranet
      - dbnet
    secrets:
      - pg_pass_secret
      - pg_user_secret
      - pg_initdb_args_secret
      - pg_db_secret
    runtime: runsc
  pgadmin:
    image: dpage/pgadmin4:8.6
    deploy:
      resources:
        limits:
          memory: 1024M
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
    environment:
      - PGADMIN_LISTEN_PORT=${PGADMIN_LISTEN_PORT:-5050}
      - PGADMIN_DEFAULT_EMAIL=${PGADMIN_DEFAULT_EMAIL:-devi@terminaldweller.com}
      - PGADMIN_DEFAULT_PASSWORD_FILE=/run/secrets/pgadmin_pass
      - PGADMIN_DISABLE_POSTFIX=${PGADMIN_DISABLE_POSTFIX:-YES}
    ports:
      - "127.0.0.1:5050:5050/tcp"
    restart: unless-stopped
    volumes:
      - terra_pgadmin_vault:/var/lib/pgadmin
    networks:
      - dbnet
    secrets:
      - pgadmin_pass
networks:
  terranet:
    driver: bridge
  dbnet:
volumes:
  terra_postgres_vault:
  terra_pgadmin_vault:
secrets:
  pg_pass_secret:
    file: ./pg/pg_pass_secret
  pg_user_secret:
    file: ./pg/pg_user_secret
  pg_initdb_args_secret:
    file: ./pg/pg_initdb_args_secret
  pg_db_secret:
    file: ./pg/pg_db_secret
  pgadmin_pass:
    file: ./pgadmin/pgadmin_pass
----

We are assigning memory usage limits for the containers. We are also
limiting the size of the logs we are keeping on disk. One thing that we
did not talk about before is the networking side of compose. As can be
seen, the postgres and pgadmin container share one network while the
postgres container and milla share another network. This makes it so
that milla and pgadmin do not have access to each other. This is inline
with principle of least privilege. Milla and pgadmin don’t need to talk
to each other so they can’t do that. Also we refrain from using host
networking. We are also binding the open ports to the host’s localhost
interface. This does not let us connect to the endpoints directly. In
our example we don’t need the ports to be exposed to the internet but we
will need access to them. What we can do is bind the open ports to the
host’s localhost and then use ssh to forward the ports onto our own
machine, assuming the docker host is a remote.

[source,sh]
----
ssh -L 127.0.0.1:5460:127.0.0.1:5455 user@remotehost
----

While building milla, for the second stage of the build, we made a
non-privileged user and our now mapping a non-privileged user on the
host to that user. We are removing all capabilities from milla since
milla will be making requests and has no server functionality. Milla
will only need to bind to high-numbered ports which does not require a
special privileges. We run both postgres and milla with gvisor’s runsc
runtime since it’s possible to do so. Finally we use docker secrets to
put the secrets into the container’s runtime environment.

Now onto the attestations. In order to view the SBOM for the image we
will use docker https://docs.docker.com/scout/install/[scout].

[source,sh]
----
docker scout sbom milla
docker scout sbom milla_distroless_vendored
----

The SBOMs can be viewed
https://gist.github.com/terminaldweller/8e8ecdcb68d4052aecb6804823648b4d[here]
and
https://gist.github.com/terminaldweller/f4ede7122f159506f8e6e6be2bfd6a8b[here]
respectively.

Now lets look at the provenance attestations.

[source,sh]
----
docker buildx imagetools inspect terminaldweller/milla:main --format "{{ json .Provenance.SLSA }}"
----

And
https://gist.github.com/terminaldweller/033ae07a9e685db85b18eb822dea4be3[here]
you can look at the result.

=== Further Reading

* https://manpages.debian.org/bookworm/manpages/cgroups.7.en.html[man 7
cgroups]
* system containers using https://github.com/lxc/incus[lxc/incus]
* https://katacontainers.io/[katacontainers]
+
+
timestamp:1716163133
+
version:1.0.0
+
https://blog.terminaldweller.com/rss/feed
+
https://raw.githubusercontent.com/terminaldweller/blog/main/mds/securedocker.md
+
== Docker, Linux, Security. Kinda.

We will be exploring some Linux features in the context of a docker
application container. Another way of explaining it would be to say we
will talk about how to make more secure application containers. We will
not talk about firewall and apparmor because they are tools that enhance
security on the host in general and not specific to a docker application
container. A secure host means a more secure application container but
that is discussion for another post. We will focus on Linux containers
since FreeBSD containers are still experimental(see
https://wiki.freebsd.org/Docker[here] and
https://github.com/samuelkarp/runj[here]). Yes, windows containers
exist. We will not discuss performance. Here be performance penalties,
but again that is not the focus of this post.

Before we begin, Linux docker containers are Linux. They are using most
of the functionality that existed before application containers in the
form of docker were a thing. Knowing Linux better means you know Linux
Docker containers(application containers is a more correct term) better.
We will see this point throughout this post.

=== Base Image

We start with the first building block of a new docker image, The base
image. By far the most used base images are the Alpine docker base
image, followed by Debian and Ubuntu docker base images. These distros
have two major differences that we want to focus on:

* C standard library implementation
* the userspace utility implementation

Debian and Ubuntu(we are not forgetting that Ubuntu itself is a Debian
derivative) both use glibc, as in gnu’s
https://www.gnu.org/software/libc/[libc] implementation. Alpine uses
https://www.musl-libc.org/[musl-libc] as its C standard library
implementation. The major difference here which will come into play
later on again is glibc has been around for much longer, so it has to
keep backwards compatibility for a much longer period of time and for
far more many things. Also the general attitude with the glibc team is
that they have to support everything since if they don’t then who will?
Libmusl on the other hand, does not try to support everything under the
sun, a relatively newer project, comparatively, and, they keep their
codebase lean. As a result not all applications are supported by libmusl
but a good number of them are. In simpler terms, libmusl has a far
smaller attack surface compared to glibc.

On to our second point, which is the cli utilities’ implementation.
Debian and Ubuntu use gnu’s
https://www.gnu.org/software/coreutils/[Coreutils] while Alpine uses
https://busybox.net/[Busybox](remember, we are talking about the most
used application container bases. You can install a desktop version of
Alpine with GNU coreutils). Here we have the same situation as before,
The GNU coreutils are bigger, do more and have a larger attack surface.
Busybox is smaller, does not support as many features as GNU Coreutils
but does support enough of them to make them useful. Needless to say,
busybox is small and hence, it has a smaller attack surface.

To get a feel for how this plays out in the real world, you can look at
some of the popular images that come in both Debian and Alpine flavours
on dockerhub. Take a look at the number of reported vulnerabilities for
both bases. The theme we observe is simple. The bigger the attack
surface the bigger the number of vulnerabilities.

Alpine images are small, lean and functional, just like libmusl and
busybox but there are still quite a few things on an alpine image that
are extraneous. We can take them out and have a perfectly functioning
application container.

That’s how we get
https://github.com/GoogleContainerTools/distroless[distroless].
Distroless base images follow the same pattern as alpine base docker
images, as in, less functionality while still keeping enough
functionality to be able to do the job and minimize the attack surface.
Minimizing a base image like this means that the base images are very
specialized so we have base images for golang, python, java and the
like.

=== Dokcer Runtimes

By default docker uses containerd which in turn uses runc for the
runtime. There are two additional runtimes that we want to focus on who
try to provide a more secure runtime environment for docker.

* gvisor
* kata

==== gvisor

gVisor creates a sandbox environment. Containers interact with the host
through this sandboxed environment. gvisor has two components. Gofer and
Sentry. Sentry is a kernel that runs the containers and intercepts and
responds to system calls made by the application so as not to have an
application directly control the syscalls that it makes. Gofer handles
filesystem access(not /proc) for the application. The application is a
regular application. gVisor aims to provide an environment equivalent to
Linux 4.4. gvisor presently does not implement every system call,
`+/proc+` file or `+/sys+` file. Every sandbox environment gets its own
instance of Sentry. Every container in the sandbox gets its own instance
of Gofer. gVisor currently does not support all system calls. You can
find the list of supported system calls for amd64
https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/[here].

....
  -------------
  |Application|
  -------------
       |system calls
       |
    --------   9p    -------
    |Sentry|<------->|Gofer|
    --------         -------
         | limited    |system
         | syscalls   |calls
        ---------------
        | Host Kernel |
        ---------------
               |
               |hardware
....

==== kata

Kata creates a sandbox environment for containers to interact with as
proxy, not too dissimilar to gvisor but the main point of difference is
that kata uses a VM to achieve this.

gVisor and katacontainers allow us to implement defense in depth when it
comes to application containers and host system security.

=== Capabilites and Syscalls

Let’s talk about capabilities for a bit.

From
https://manpages.debian.org/bookworm/manpages/capabilities.7.en.html[man
7 capabilities]:

[source,txt]
----
For the purpose of performing permission checks, traditional UNIX implementations distinguish two
categories of processes: privileged processes (whose effective user ID is 0, referred to as
superuser or root), and unprivileged processes (whose effective UID is nonzero).  Privileged
processes bypass all kernel permission checks, while unprivileged processes are subject to full
permission checking based on the process's credentials (usually: effective UID, effective GID, and
supplementary group list).

Starting with Linux 2.2, Linux divides the privileges traditionally associated with superuser into
distinct units, known as capabilities, which can be independently enabled and disabled.
Capabilities are a per-thread attribute.
----

Capabilities give you a more granular control over which privileges to
give instead of just root and non-root. Docker let’s us choose which
capabilities to give to a container. So we can for example allow a
non-privileged process to bind to privileged ports using capabilities.
As an example, a simple application making calls to API endpoints and
writing results back to a database does not require any capabilities. It
can run under a non-privileged user with no capabilities and do all the
tasks that it needs to do. That being said, determining which
capabilities are required can be a bit challenging when it comes to
certain applications since there is no straightforward way of achieving
this. In certain cases we can get away with dropping all capabilities,
running our application and then trying to figure out, based on the
received error messages, which capability is missing and needs to be
given to the application. But in certain cases this may not be feasible
or practical.

From
https://manpages.debian.org/bookworm/manpages-dev/syscalls.2.en.html[man
2 sycalls]:

[source,txt]
----
The system call is the fundamental interface between an application and the Linux kernel.
----

The Linux kernel lets us choose which ones of these interface calls can
be allowed to be made by an application. We can essentially filter which
syscalls are allowed and which ones are not on a per application basis.
Docker enables this functionality with an arguably more friendly
approach. Capabilities and syscall filtering are tools to implement
principle of least privilege. Ideally, we would like to allow a
container to only have access to what it needs and just that. Not more,
and obviously not less.

==== capabilities in the wild

Capabilities are a Linux feature, docker allows us to use that with
application containers. We’ll look at a very simple example of how one
can set capabilities for a regular executable on Linux.
https://manpages.debian.org/bookworm/libcap2-bin/setcap.8.en.html[man 8
setcap] lets us set capabilities for a file.

==== syscall Filtering in the wild

As an example we will look at
https://manpages.debian.org/bookworm/bubblewrap/bwrap.1.en.html[man 1
bwrap]. https://github.com/containers/bubblewrap[Bubblewrap] allows us
to sandbox an application, not too dissimilar to docker. Flatpaks use
bubblewrap as part of their sandbox. Bubblewrap can optionally take in a
list of syscalls to
https://www.kernel.org/doc/html/v4.19/userspace-api/seccomp_filter.html[filter].
The filter is expressed as a BPF(Berkley Packet Filter program -
remember when I said docker gives you a
https://docs.docker.com/engine/security/seccomp/[friendlier] interface
to seccomp?) program. Below is a short program that defines a BPF
program that can be passed to an application using bwrap that lets us
log all the sycalls the application makes to syslog.

[source,c]
----
#include <fcntl.h>
#include <seccomp.h>
#include <stdbool.h>
#include <unistd.h>

void log_all_syscalls(void) {
  scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_LOG);
  seccomp_arch_add(ctx, SCMP_ARCH_X86_64);
  seccomp_export_bpf(ctx, 1);
  seccomp_export_pfc(ctx, 2);
  seccomp_release(ctx);
}

int main(int argc, char **argv) {
  log_all_syscalls();
}
----

Building is straightforward. Just remember to link against
`+libseccomp+` with `+-lseccomp+`.

[source,bash]
----
gcc main.c -lseccomp
----

Running the above code we get this:

[source,txt]
----
 > 5@#
# pseudo filter code start
#
# filter for arch x86_64 (3221225534)
if ($arch == 3221225534)
  # default action
  action LOG;
# invalid architecture action
action KILL;
#
# pseudo filter code end
#
----

[source,bash]
----
#!/usr/bin/dash

TEMP_LOG=/tmp/seccomp_logging_filter.bpf

./a.out > ${TEMP_LOG}

bwrap --seccomp 9 9<${TEMP_LOG} bash
----

Then we can go and see where the logs end up. On my host, they are
logged under `+/var/log/audit/audit.log+` and they look like this:

....
type=SECCOMP msg=audit(1716144132.339:4036728): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036729): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036730): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036731): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036732): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036733): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=14 compat=0 ip=0x7fa5859664f4 code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigprocmask
type=SECCOMP msg=audit(1716144132.339:4036734): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=13 compat=0 ip=0x7fa58591298f code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=rt_sigaction
type=SECCOMP msg=audit(1716144132.339:4036735): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=1 compat=0 ip=0x7fa5859ce5d0 code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=write
type=SECCOMP msg=audit(1716144132.339:4036736): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=1 compat=0 ip=0x7fa5859ce5d0 code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=write
type=SECCOMP msg=audit(1716144132.339:4036737): auid=1000 uid=1000 gid=1000 ses=1 subj=unconfined pid=19633 comm="bash" exe="/usr/bin/bash" sig=0 arch=c000003e syscall=270 compat=0 ip=0x7fa5859d77bc code=0x7ffc0000AUID="devi" UID="devi" GID="devi" ARCH=x86_64 SYSCALL=pselect6
....

Docker allows us to do the
https://docs.docker.com/engine/security/seccomp/[same]. We can give
docker a seccomp profile to filter out the syscalls that are not
required for a specific container. You can find the default docker
seccomp profile
https://github.com/moby/moby/blob/master/profiles/seccomp/default.json[here].

=== Namespaces

....
A namespace wraps a global system resource in an abstraction that makes it appear to the processes
within the namespace that they have their own isolated instance of the global resource.  Changes
to the global resource are visible to other processes that are members of the namespace, but are
invisible to other processes.  One use of namespaces is to implement containers.
....

From
https://manpages.debian.org/bookworm/manpages/namespaces.7.en.html[man 7
namespaces]. You can think of namespaces as almost the same thing as a
namespace does in some programming languages. Docker uses its own
namespaces for the containers so as to further isolate the application
containers from the host system.

==== Namespaces in the Wild

As an example let’s look at the script provided below. Here we are
creating a new network namespace. The new interface is provided by
simply connecting an android phone for USB tethering. Depending on the
situation you have going on and the `+udev+` naming rules the interface
name will differ but the concept is the same. We are creating a new
network namespace for a second internet provider, which in this case, is
our android phone. We then use this network namespace to execute
commands in the context of this specific network namespace. Essentially,
we can choose which applications get to use our phone internet and which
ones use whatever it is we were previously connected to.

[source,sh]
----
#!/usr/bin/env sh
PHONE_NS=phone_ns
IF=enp0s20f0u6

sudo ip netns add ${PHONE_NS}
sudo ip link set ${IF} netns ${PHONE_NS}
sudo ip netns exec ${PHONE_NS} ip link set ${IF} up
sudo ip netns exec ${PHONE_NS} ip link set dev lo up
sudo ip netns exec ${PHONE_NS} dhclient ${IF}
----

[source,sh]
----
$ sudo ip netns exec home_ns curl -4 icanhaveip.com
113.158.237.102
$ curl -4 icanhasip.com
114.201.132.98
----

*_HINT_*: The IP addresses are made up. The only thing that matters is
that they are different.

Since we have the android phone’s interface on another namespace the two
cannot interfere with each other. This is pretty much how docker uses
namespaces. Without a network namespace we would have to make a small
VM, run a VPN on the VM and then make a socks5 proxy to the VM from the
host and then have applications pass their traffic through a socks5
proxy with varying degrees of success. *_NOTE_*: since we are not
running the script on a hook, you might blow out your net having two
upstreams at the same time. In which case, run the script, then restart
NetworkManager or whatever you have.

=== SBOM and Provenance Attestation

What is SBOM? NIST defines SBOM as a "`formal record containing the
details and supply chain relationships of various components used in
building software.`". It contains details about the components used to
create a certain piece of software. SBOM is meant to help mitigate the
threat of supply chain attacks(remember xz?).

What is provenance?

....
The provenance attestations include facts about the build process, including details such as:

    Build timestamps
    Build parameters and environment
    Version control metadata
    Source code details
    Materials (files, scripts) consumed during the build
....

https://docs.docker.com/build/attestations/sbom/[source]

==== Example

Let’s review all that we learned about in the form of a light exercise.

For the first build, we use a non-vendored version. Vendoring means that
you store your dependencies locally. This means you are in control of
your dependencies. You don’t need to pull them from a remote. Even if
one or more of your dependencies One of the more famous examples is Lua.
The Lua foundation actually recommend vendoring your Lua dependency.
Vendoring helps with build reproducability.

We will use https://github.com/terminaldweller/milla[milla] as an
exmaple. It’s a simple go codebase.

[source,dockerfile]
----
FROM alpine:3.19 as builder
RUN apk update && \
      apk upgrade && \
      apk add go git
WORKDIR /milla
COPY go.sum go.mod /milla/
RUN go mod download
COPY *.go /milla/
RUN go build

FROM alpine:3.19
ENV HOME /home/user
RUN set -eux; \
  adduser -u 1001 -D -h "$HOME" user; \
  mkdir "$HOME/.irssi"; \
  chown -R user:user "$HOME"
COPY --from=builder /milla/milla "$HOME/milla"
RUN chown user:user "$HOME/milla"
ENTRYPOINT ["home/user/milla"]
----

The first docker image build is fairly simple. We copy the source code
in, get our dependencies and build a static executable. As for the
second stage of the build, we simply put the executable into a new base
image and we are done.

The second build which is a vendored build with a golang distroless
base. We copy over the source code for the project and all its
dependencies and then do the same as before.

[source,dockerfile]
----
FROM golang:1.21 as builder
WORKDIR /milla
COPY go.sum go.mod /milla/
RUN go mod download
COPY *.go /milla/
RUN CGO_ENABLED=0 go build

FROM gcr.io/distroless/static-debian12
COPY --from=builder /milla/milla "/usr/bin/milla"
ENTRYPOINT ["milla"]
----

Below You can see an example docker compose file. Milla can optionally
use a postgres database to store messages. We also include a pgadmin
instance. Now let’s talk about the docker compose file.

[source,yaml]
----
services:
  terra:
    image: milla_distroless_vendored
    build:
      context: .
      dockerfile: ./Dockerfile_distroless_vendored
    deploy:
      resources:
        limits:
          memory: 128M
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
    networks:
      - terranet
    user: 1000:1000
    restart: unless-stopped
    entrypoint: ["/usr/bin/milla"]
    command: ["--config", "/config.toml"]
    volumes:
      - ./config.toml:/config.toml
      - /etc/localtime:/etc/localtime:ro
    cap_drop:
      - ALL
    runtime: runsc
  postgres:
    image: postgres:16-alpine3.19
    deploy:
      resources:
        limits:
          memory: 4096M
    logging:
      driver: "json-file"
      options:
        max-size: "200m"
    restart: unless-stopped
    ports:
      - "127.0.0.1:5455:5432/tcp"
    volumes:
      - terra_postgres_vault:/var/lib/postgresql/data
      - ./scripts/:/docker-entrypoint-initdb.d/:ro
    environment:
      - POSTGRES_PASSWORD_FILE=/run/secrets/pg_pass_secret
      - POSTGRES_USER_FILE=/run/secrets/pg_user_secret
      - POSTGRES_INITDB_ARGS_FILE=/run/secrets/pg_initdb_args_secret
      - POSTGRES_DB_FILE=/run/secrets/pg_db_secret
    networks:
      - terranet
      - dbnet
    secrets:
      - pg_pass_secret
      - pg_user_secret
      - pg_initdb_args_secret
      - pg_db_secret
    runtime: runsc
  pgadmin:
    image: dpage/pgadmin4:8.6
    deploy:
      resources:
        limits:
          memory: 1024M
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
    environment:
      - PGADMIN_LISTEN_PORT=${PGADMIN_LISTEN_PORT:-5050}
      - PGADMIN_DEFAULT_EMAIL=${PGADMIN_DEFAULT_EMAIL:-devi@terminaldweller.com}
      - PGADMIN_DEFAULT_PASSWORD_FILE=/run/secrets/pgadmin_pass
      - PGADMIN_DISABLE_POSTFIX=${PGADMIN_DISABLE_POSTFIX:-YES}
    ports:
      - "127.0.0.1:5050:5050/tcp"
    restart: unless-stopped
    volumes:
      - terra_pgadmin_vault:/var/lib/pgadmin
    networks:
      - dbnet
    secrets:
      - pgadmin_pass
networks:
  terranet:
    driver: bridge
  dbnet:
volumes:
  terra_postgres_vault:
  terra_pgadmin_vault:
secrets:
  pg_pass_secret:
    file: ./pg/pg_pass_secret
  pg_user_secret:
    file: ./pg/pg_user_secret
  pg_initdb_args_secret:
    file: ./pg/pg_initdb_args_secret
  pg_db_secret:
    file: ./pg/pg_db_secret
  pgadmin_pass:
    file: ./pgadmin/pgadmin_pass
----

We are assigning memory usage limits for the containers. We are also
limiting the size of the logs we are keeping on disk. One thing that we
did not talk about before is the networking side of compose. As can be
seen, the postgres and pgadmin container share one network while the
postgres container and milla share another network. This makes it so
that milla and pgadmin do not have access to each other. This is inline
with principle of least privilege. Milla and pgadmin don’t need to talk
to each other so they can’t do that. Also we refrain from using host
networking. We are also binding the open ports to the host’s localhost
interface. This does not let us connect to the endpoints directly. In
our example we don’t need the ports to be exposed to the internet but we
will need access to them. What we can do is bind the open ports to the
host’s localhost and then use ssh to forward the ports onto our own
machine, assuming the docker host is a remote.

[source,sh]
----
ssh -L 127.0.0.1:5460:127.0.0.1:5455 user@remotehost
----

While building milla, for the second stage of the build, we made a
non-privileged user and our now mapping a non-privileged user on the
host to that user. We are removing all capabilities from milla since
milla will be making requests and has no server functionality. Milla
will only need to bind to high-numbered ports which does not require a
special privileges. We run both postgres and milla with gvisor’s runsc
runtime since it’s possible to do so. Finally we use docker secrets to
put the secrets into the container’s runtime environment.

Now onto the attestations. In order to view the SBOM for the image we
will use docker https://docs.docker.com/scout/install/[scout].

[source,sh]
----
docker scout sbom milla
docker scout sbom milla_distroless_vendored
----

The SBOMs can be viewed
https://gist.github.com/terminaldweller/8e8ecdcb68d4052aecb6804823648b4d[here]
and
https://gist.github.com/terminaldweller/f4ede7122f159506f8e6e6be2bfd6a8b[here]
respectively.

Now lets look at the provenance attestations.

[source,sh]
----
docker buildx imagetools inspect terminaldweller/milla:main --format "{{ json .Provenance.SLSA }}"
----

And
https://gist.github.com/terminaldweller/033ae07a9e685db85b18eb822dea4be3[here]
you can look at the result.

=== Further Reading

* https://manpages.debian.org/bookworm/manpages/cgroups.7.en.html[man 7
cgroups]
* system containers using https://github.com/lxc/incus[lxc/incus]
* https://katacontainers.io/[katacontainers]
+
+
timestamp:1716163133
+
version:1.0.0
+
https://blog.terminaldweller.com/rss/feed
+
https://raw.githubusercontent.com/terminaldweller/blog/main/mds/securedocker.md
+