# Docker Containers, Linux Features and Security

OK. Let's take it from the top.<br/>

We will be exploring the ways in which we can make an application container, more specifically a Docker container, more secure.<br/>
We will not talk about firewalls and AppArmor because they are tools that enhance security on the host in general and are not specific to a Docker application container. Be that as it may, a secure host is always better than a non-secure host.<br/>
We will focus on Linux containers since FreeBSD containers are still experimental (see [here](https://wiki.freebsd.org/Docker) and [here](https://github.com/samuelkarp/runj)). Yes, Windows containers exist.<br/>

Before we begin, Linux Docker containers are Linux. They use mostly functionality that existed before application containers in the form of Docker were a thing. Knowing Linux better means you know Linux Docker containers better. We will reference this fact a couple of times later on.<br/>

## Base Image

We start with the first building block of a new Docker image, the base image. By far the most used base image is Alpine, followed by Debian and Ubuntu.
These distros have two major differences that we want to focus on:

- C standard library implementation
- the userspace utility implementation

Debian and Ubuntu (we are not forgetting that Ubuntu itself is a Debian derivative) both use glibc, as in GNU's [libc](https://www.gnu.org/software/libc/) implementation. Alpine uses [musl-libc](https://www.musl-libc.org/) as its C standard library implementation.<br/>
The major difference here, which will come into play again later on, is that glibc has been around for much longer, so it has to keep backwards compatibility for a much longer period of time and for far more things. Also, the general attitude of the glibc team is that they have to support everything, since if they don't, then who will?<br/>
musl, on the other hand, does not try to support everything under the sun. It is a comparatively newer project and keeps its codebase lean.<br/>
As a result, not all applications are supported by musl, but a good number of them are.<br/>
In simpler terms, musl has a far smaller attack surface compared to glibc.<br/>
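
If you are not sure which libc an image ships, you can ask it. The following is a minimal sketch, assuming an `ldd` command is available in the image (usually the case on stock Debian, Ubuntu and Alpine images, not on distroless ones). The function name `detect_libc` is made up for this example.

```sh
#!/bin/sh
# detect_libc: print "glibc", "musl" or "unknown" for the current system.
detect_libc() {
  # glibc's ldd reports e.g. "ldd (Debian GLIBC ...)" or "ldd (GNU libc) ...".
  # musl's ldd prints "musl libc (...)" on stderr and exits non-zero,
  # so we merge stderr into stdout and ignore the exit status.
  out=$(ldd --version 2>&1) || true
  case "$out" in
    *musl*) echo musl ;;
    *GLIBC* | *"GNU libc"*) echo glibc ;;
    *) echo unknown ;;
  esac
}

detect_libc
```

Running it inside a Debian or Ubuntu based container should print `glibc`, and inside an Alpine based one `musl`.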

On to our second point, which is the implementation of the CLI utilities. Debian and Ubuntu use GNU's [Coreutils](https://www.gnu.org/software/coreutils/) while Alpine uses [Busybox](https://busybox.net/).<br/>
Here we have the same situation as before. GNU Coreutils is bigger, does more and has a larger attack surface. Busybox is smaller and does not support as many features as GNU Coreutils, but it supports enough of them to be useful. Needless to say, Busybox is small and lean, hence it has a smaller attack surface.<br/>
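
To see which of the two provides a given utility, we can check what the command resolves to. This is only a sketch: it assumes Busybox applets are installed as links to a binary called `busybox`, which is how Alpine does it. The function name `provider_of` is hypothetical.

```sh
#!/bin/sh
# provider_of CMD: print "busybox" if CMD resolves to a Busybox binary,
# "standalone" if it resolves to anything else, "missing" if it is not found.
provider_of() {
  path=$(command -v "$1" 2>/dev/null) || { echo missing; return 0; }
  # Follow symlinks; fall back to the unresolved path if that fails.
  real=$(readlink -f "$path" 2>/dev/null) || real=$path
  case "$real" in
    */busybox) echo busybox ;;
    *) echo standalone ;;
  esac
}

for cmd in ls cat grep; do
  printf '%s: %s\n' "$cmd" "$(provider_of "$cmd")"
done
```

On an Alpine image all three should come back as `busybox`, while on Debian and Ubuntu they are separate executables.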

For some intuitive observation, you can look at some popular images that come in both Debian and Alpine flavours on Docker Hub. Take a look at the number of reported vulnerabilities for both bases. The theme we observe is simple: the bigger the attack surface, the bigger the number of vulnerabilities.<br/>

Alpine images are small, lean and functional, just like musl and Busybox, but there are still quite a few things on an Alpine image that are extraneous. We can take them out and still have a perfectly functioning application container.<br/>

That's how we get [distroless](https://github.com/GoogleContainerTools/distroless).<br/>
Distroless base images follow the same pattern as Alpine base images: less functionality, while keeping enough of it to do the job, which minimizes the attack surface.
Minimizing a base image like this means that the base images are very specialized, so we have base images for Golang, Python, Java and the like.<br/>

## Docker Runtimes

What is a docker runtime?

- runc
- nvidia
- gvisor

### gVisor's runsc

## Capabilities and Syscalls

[man 7 capabilities](https://manpages.debian.org/bookworm/manpages/capabilities.7.en.html)<br/>
[man 2 syscalls](https://manpages.debian.org/bookworm/manpages-dev/syscalls.2.en.html)

### capabilities in the wild

[man 8 setcap](https://manpages.debian.org/bookworm/libcap2-bin/setcap.8.en.html)
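
A quick way to see capabilities in the wild is the `CapEff` line of `/proc/self/status`, which holds the effective capability set of a process as a hex bitmask. The sketch below decodes a few bits of it. The helper name `has_cap` is made up for this example. The bit numbers (10, 13 and 21) are the values defined in `linux/capability.h`.

```sh
#!/bin/sh
# has_cap MASK BIT: print "yes" if bit number BIT is set in the hex MASK,
# otherwise print "no".
has_cap() {
  if [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]; then echo yes; else echo no; fi
}

# Effective capability mask of this shell (awk inherits it from us).
cap_eff=$(awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null)
# Fall back to an empty set if /proc could not be read.
cap_eff=${cap_eff:-0}

for entry in 10:CAP_NET_BIND_SERVICE 13:CAP_NET_RAW 21:CAP_SYS_ADMIN; do
  # entry is "BIT:NAME"; split it with parameter expansion.
  printf '%s: %s\n' "${entry#*:}" "$(has_cap "$cap_eff" "${entry%%:*}")"
done
```

Inside a container started with all capabilities dropped, all three lines should read `no`.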

### syscall Filtering in the wild

[Bubblewrap](https://github.com/containers/bubblewrap)

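Let's see how we can build a seccomp filter with libseccomp and export it, both as raw BPF and in a human-readable form (PFC), so that a sandboxing tool such as Bubblewrap can load it: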

```c
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <seccomp.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// https://blog.mnus.de/2020/05/sandboxing-soldatserver-with-bubblewrap-and-seccomp/

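// Build a filter whose default action is to allow and log every syscall,
// then dump it: raw BPF to stdout (fd 1), human-readable PFC to stderr (fd 2).
void log_all_syscalls(void) {
  scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_LOG);
  if (ctx == NULL)
    return;
  // Returns -EEXIST when x86_64 is already the native architecture,
  // which is safe to ignore here.
  seccomp_arch_add(ctx, SCMP_ARCH_X86_64);
  seccomp_export_bpf(ctx, 1);
  seccomp_export_pfc(ctx, 2);
  seccomp_release(ctx);
}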

int log_current_seccomp(void) {
  int rc = -1;
  scmp_filter_ctx ctx;
  int filter_fd;

  ctx = seccomp_init(SCMP_ACT_KILL);
  if (ctx == NULL)
    goto out;

  filter_fd = open("/tmp/seccomp_filter.bpf",
                   O_CREAT | O_WRONLY | O_NOFOLLOW | O_TRUNC, S_IRWXU);
  if (filter_fd == -1) {
    rc = -errno;
    goto out;
  }

  rc = seccomp_export_bpf(ctx, filter_fd);
  if (rc < 0) {
    close(filter_fd);
    goto out;
  }
  close(filter_fd);

  filter_fd = open("/tmp/seccomp_filter.pfc",
                   O_CREAT | O_WRONLY | O_NOFOLLOW | O_TRUNC, S_IRWXU);
  if (filter_fd == -1) {
    rc = -errno;
    goto out;
  }

  rc = seccomp_export_pfc(ctx, filter_fd);
  if (rc < 0) {
    close(filter_fd);
    goto out;
  }
  close(filter_fd);

out:
  seccomp_release(ctx);
  return -rc;
}

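int main(int argc, char **argv) {
  if (argc == 3) {
    if (!strcmp("--filter", argv[1])) {
      if (!strcmp("current", argv[2])) {
        // Writes a kill-by-default filter to /tmp/seccomp_filter.{bpf,pfc}.
        return log_current_seccomp();
      } else if (!strcmp("logging", argv[2])) {
        log_all_syscalls();
        return 0;
      }
    }
    fprintf(stderr, "usage: %s [--filter current|logging]\n", argv[0]);
    return 1;
  }
  // stderr keeps this message out of the BPF stream written to stdout.
  fprintf(stderr, "going with the default filter kind which is logging.\n");
  log_all_syscalls();
  return 0;
}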
```

```bash
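# main.c is a placeholder name for the file that holds the code above.
# The library goes after the source file so the linker can resolve it.
gcc main.c -o seccomp_filter -lseccomp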
```
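
Once a filter is loaded, the kernel reports that in the `Seccomp` field of `/proc/PID/status`, so we can check whether a process really is running under one. The following is a small sketch of that check. The function names are made up for this example, and the meaning of the numeric values comes from `man 5 proc`.

```sh
#!/bin/sh
# seccomp_mode_name N: map the numeric "Seccomp:" field to a label.
seccomp_mode_name() {
  case "$1" in
    0) echo disabled ;;
    1) echo strict ;;
    2) echo filter ;;
    *) echo unknown ;;
  esac
}

# seccomp_mode_of PID: print the seccomp mode of PID ("self" works too,
# since the awk child inherits the filter of the calling shell).
seccomp_mode_of() {
  mode=$(awk '/^Seccomp:/ { print $2 }' "/proc/$1/status" 2>/dev/null)
  seccomp_mode_name "$mode"
}

seccomp_mode_of self
```

A shell on the host will typically print `disabled`, while a shell inside a container running under Docker's default seccomp profile should print `filter`.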

### Namespaces in the Wild

```sh
#!/usr/bin/dash

NS=home_ns
IF=wlp0s20f3
PHY=phy0

sudo ip netns add ${NS} || true
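# iw wants the PID of a process living inside the target namespace,
# so we start a short-lived sleep there and hand over its PID.
sudo iw phy ${PHY} set netns "$(sudo ip netns exec ${NS} sh -c 'sleep 1 >&- & echo "$!"')"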
# sudo ip link set ${IF} netns ${NS}
sudo ip netns exec ${NS} ip link set ${IF} up
sudo ip netns exec ${NS} ip link set dev lo up
sudo ip netns exec ${NS} dhclient ${IF}

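# -c 3 makes ping stop on its own so the script can reach the next checks.
sudo ip netns exec ${NS} ping -4 -c 3 9.9.9.9
sudo ip netns exec ${NS} ping -4 -c 3 google.com
sudo ip netns exec ${NS} curl -4 icanhazip.com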
```

```sh
sudo ip netns exec home_ns curl -4 icanhazip.com
```
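
To confirm that a command really runs in a different network namespace, we can compare the namespace identifiers that the kernel exposes under `/proc/PID/ns/net`. This is a minimal sketch, and the function names `netns_id` and `same_netns` are made up for this example.

```sh
#!/bin/sh
# netns_id PID: print the network namespace identifier of PID,
# e.g. "net:[4026531840]", or nothing if it cannot be read.
netns_id() {
  readlink "/proc/$1/ns/net" 2>/dev/null || true
}

# same_netns PID_A PID_B: print "same", "different" or "unknown".
same_netns() {
  a=$(netns_id "$1")
  b=$(netns_id "$2")
  if [ -z "$a" ] || [ -z "$b" ]; then
    echo unknown
  elif [ "$a" = "$b" ]; then
    echo same
  else
    echo different
  fi
}

# The shell and its children live in the same namespace.
same_netns "$$" self
```

Running `sudo ip netns exec home_ns readlink /proc/self/ns/net` should print a different identifier than running the same `readlink` outside of the namespace.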

### Docker syscall filtering

### BPF

## SBOM and Provenance Attestation

### Conclusion

```Dockerfile
FROM alpine:3.19 AS builder
RUN apk update && \
      apk upgrade && \
      apk add go git
WORKDIR /milla
COPY go.sum go.mod /milla/
RUN go mod download
COPY *.go /milla/
RUN go build

FROM alpine:3.19
ENV HOME=/home/user
RUN set -eux; \
  adduser -u 1001 -D -h "$HOME" user; \
  mkdir "$HOME/.irssi"; \
  chown -R user:user "$HOME"
COPY --from=builder /milla/milla "$HOME/milla"
RUN chown user:user "$HOME/milla"
# Run as the unprivileged user created above instead of root.
USER user
ENTRYPOINT ["/home/user/milla"]
```

```Dockerfile
FROM golang:1.21 AS builder
WORKDIR /milla
COPY go.sum go.mod /milla/
RUN go mod download
COPY *.go /milla/
RUN CGO_ENABLED=0 go build

FROM gcr.io/distroless/static-debian12
COPY --from=builder /milla/milla "/usr/bin/milla"
ENTRYPOINT ["milla"]
```

```yaml
services:
  terra:
    image: milla_distroless_vendored
    build:
      context: .
      dockerfile: ./Dockerfile_distroless_vendored
    deploy:
      resources:
        limits:
          memory: 128M
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
    networks:
      - terranet
    user: 1000:1000
    restart: unless-stopped
    entrypoint: ["/usr/bin/milla"]
    command: ["--config", "/config.toml"]
    volumes:
      - ./config.toml:/config.toml
      - /etc/localtime:/etc/localtime:ro
    cap_drop:
      - ALL
    environment:
      - HTTPS_PROXY=http://172.17.0.1:8120
      - https_proxy=http://172.17.0.1:8120
      - HTTP_PROXY=http://172.17.0.1:8120
      - http_proxy=http://172.17.0.1:8120
  postgres:
    image: postgres:16-alpine3.19
    deploy:
      resources:
        limits:
          memory: 4096M
    logging:
      driver: "json-file"
      options:
        max-size: "200m"
    restart: unless-stopped
    ports:
      - "127.0.0.1:5455:5432/tcp"
    volumes:
      - terra_postgres_vault:/var/lib/postgresql/data
      - ./scripts/:/docker-entrypoint-initdb.d/:ro
    environment:
      - POSTGRES_PASSWORD_FILE=/run/secrets/pg_pass_secret
      - POSTGRES_USER_FILE=/run/secrets/pg_user_secret
      - POSTGRES_INITDB_ARGS_FILE=/run/secrets/pg_initdb_args_secret
      - POSTGRES_DB_FILE=/run/secrets/pg_db_secret
    networks:
      - terranet
      - dbnet
    secrets:
      - pg_pass_secret
      - pg_user_secret
      - pg_initdb_args_secret
      - pg_db_secret
    runtime: runsc
  pgadmin:
    image: dpage/pgadmin4:8.6
    deploy:
      resources:
        limits:
          memory: 1024M
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
    environment:
      - PGADMIN_LISTEN_PORT=${PGADMIN_LISTEN_PORT:-5050}
      - PGADMIN_DEFAULT_EMAIL=${PGADMIN_DEFAULT_EMAIL:-devi@terminaldweller.com}
      - PGADMIN_DEFAULT_PASSWORD_FILE=/run/secrets/pgadmin_pass
      - PGADMIN_DISABLE_POSTFIX=${PGADMIN_DISABLE_POSTFIX:-YES}
    ports:
      - "127.0.0.1:5050:5050/tcp"
    restart: unless-stopped
    volumes:
      - terra_pgadmin_vault:/var/lib/pgadmin
    networks:
      - dbnet
    secrets:
      - pgadmin_pass
networks:
  terranet:
    driver: bridge
  dbnet:
volumes:
  terra_postgres_vault:
  terra_pgadmin_vault:
secrets:
  pg_pass_secret:
    file: ./pg/pg_pass_secret
  pg_user_secret:
    file: ./pg/pg_user_secret
  pg_initdb_args_secret:
    file: ./pg/pg_initdb_args_secret
  pg_db_secret:
    file: ./pg/pg_db_secret
  pgadmin_pass:
    file: ./pgadmin/pgadmin_pass
```

## Further Reading

- [man 7 cgroups](https://manpages.debian.org/bookworm/manpages/cgroups.7.en.html)
- [man 7 namespaces](https://manpages.debian.org/bookworm/manpages/namespaces.7.en.html)
- system containers using [lxc/incus](https://github.com/lxc/incus)
- [katacontainers](https://katacontainers.io/)