Bug 2270717 (CVE-2024-3056) - CVE-2024-3056 podman: kernel: containers in shared IPC namespace are vulnerable to denial of service attack
Summary: CVE-2024-3056 podman: kernel: containers in shared IPC namespace are vulnerab...
Keywords:
Status: NEW
Alias: CVE-2024-3056
Product: Security Response
Classification: Other
Component: vulnerability
Version: unspecified
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Product Security
QA Contact:
URL:
Whiteboard:
Depends On: 2302003
Blocks: 2270713
Reported: 2024-03-21 14:56 UTC by Robb Gatica
Modified: 2025-06-03 19:05 UTC
CC List: 57 users

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Embargoed:


Attachments
report email (111.72 KB, application/pdf)
2025-02-15 04:04 UTC, Zhi

Description Robb Gatica 2024-03-21 14:56:42 UTC
Summary:
We received a report of an attack vector on containers that share an IPC namespace (definitely Podman, but likely also applicable to Kubernetes, Docker, and other container runtimes). At least two containers are configured with a shared IPC namespace and a cgroup limiting memory. One of those containers is malicious and contains a binary that creates a large number of IPC resources in /dev/shm, continuing until it is OOM killed. The malicious container is now dead and its cgroup removed, but the IPC resources it created remain: they are tied to the IPC namespace, which is not removed until all containers using it have stopped, and one non-malicious container is holding the namespace open. The malicious container is then restarted (either automatically or under attacker control), repeating the process and increasing the amount of memory consumed. With a container configured to always restart (e.g. `podman run --restart=always`), this results in a memory-based denial of service of the system.
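
To observe the growth described above from the host or from the surviving container, something like the following could be run before and after each restart. This is only a minimal sketch: the function name `shm_usage` is made up for illustration, it assumes the leaked resources are visible as files under /dev/shm, and it reports apparent file sizes rather than what the kernel actually charges.

```python
import os


def shm_usage(root="/dev/shm"):
    """Return (entry_count, total_bytes) for non-directory entries under root.

    Sizes are apparent sizes (st_size); sparse files may occupy less memory.
    """
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished between listing and stat
            count += 1
            total += st.st_size
    return count, total
```

If the numbers keep rising across restarts of the malicious container while the other container stays up, that matches the behaviour in the report.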

Podman Version:
Version 5.0.0-dev and earlier

Comment 11 Giuseppe Scrivano 2024-04-29 12:44:14 UTC
The issue is that the allocated memory is assigned to the container cgroup, which has a memory limit; if that limit is hit, the kernel refuses to allocate more.

But if the container exits, the shared memory is not freed and remains allocated to the first cgroup, which is no longer accessible from user space (but seems to still be referenced internally in the kernel).

When the container exits, we create a new one that uses the same cgroup name and also has a limit set.

In this way the container can be restarted multiple times, and each restart can leak memory, since the cgroup seems not to be freed internally.

I've not looked into the code, so this analysis is based only on my observations from user space; it might be completely wrong :-)

If I am not wrong, then I think the sensible thing to do in this case would be to migrate the memory allocated to the destroyed cgroup to another cgroup using the same IPC namespace. In this way, from user space we can avoid the issue by making sure each cgroup has a limit set and that there are no leaks once a cgroup is deleted.

I think we should add "Waiman Long <llong>" to the conversation. It seems I am not allowed to do it.
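
The accounting described in this comment can be illustrated with a toy model. It is not kernel code and is based only on the user-space observations above: `ToyHost`, `simulate`, and the unit-less numbers are all invented for illustration, and the assumption is that every restart gets a fresh cgroup with the same limit while the old charge stays stranded.

```python
class ToyHost:
    """Toy accounting model of the observed behaviour; not kernel code."""

    def __init__(self, limit):
        self.limit = limit        # per-cgroup memory limit (arbitrary units)
        self.live_charge = 0      # memory charged to the running container's cgroup
        self.stranded = []        # charges left behind in removed cgroups

    def allocate_shm(self, amount):
        """Charge shared memory to the live cgroup; False models hitting the limit."""
        if self.live_charge + amount > self.limit:
            return False
        self.live_charge += amount
        return True

    def restart_container(self):
        """Container exits and its cgroup is removed, but the shm stays charged to it."""
        self.stranded.append(self.live_charge)
        self.live_charge = 0      # new cgroup, same name, fresh limit

    def total_shm(self):
        return self.live_charge + sum(self.stranded)


def simulate(limit, chunk, restarts):
    """Fill the cgroup up to its limit, restart, and repeat; return total shm held."""
    host = ToyHost(limit)
    for _ in range(restarts):
        while host.allocate_shm(chunk):
            pass
        host.restart_container()
    return host.total_shm()
```

In this model the memory held on the host grows linearly with the number of restarts, even though no single cgroup ever exceeds its limit.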

Comment 12 Anten Skrabec 2024-05-02 19:28:21 UTC
Added llong to CC list

Comment 13 Waiman Long 2024-05-08 02:25:53 UTC
(In reply to Giuseppe Scrivano from comment #11)

There is actually an upstream discussion about this specific problem. In the case of shared memory, memory ownership is assigned to the memory cgroup of the first process that uses it. References to that shared memory can be present in other memory cgroups. When all processes in the owning cgroup have exited and the cgroup is to be destroyed, it remains in a zombie state because of the additional references to the shared memory.

AFAIK, there was no consensus on the best way forward the last time I checked. I need to check again to see whether there has been progress on this issue.
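
For anyone trying to confirm this from user space, cgroup v2 exposes a count of removed-but-not-yet-freed cgroups in `cgroup.stat` as `nr_dying_descendants`. The sketch below reads that counter; the helper names are hypothetical, it assumes cgroup v2 is mounted at /sys/fs/cgroup, and treating "dying" cgroups as the zombie state mentioned above is my assumption rather than something verified for this bug.

```python
import os


def parse_cgroup_stat(text):
    """Parse the 'key value' lines of a cgroup v2 cgroup.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def dying_descendants(cgroup_dir="/sys/fs/cgroup"):
    """Return nr_dying_descendants for the given cgroup directory (0 if absent)."""
    with open(os.path.join(cgroup_dir, "cgroup.stat")) as f:
        return parse_cgroup_stat(f.read()).get("nr_dying_descendants", 0)
```

A value that increases by one on every restart of the malicious container would be consistent with the explanation above.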

Comment 17 Alex 2024-07-31 10:34:31 UTC
Created podman tracking bugs for this issue:

Affects: fedora-all [bug 2302003]

Comment 19 Zhi 2025-02-15 04:04:46 UTC
Created attachment 2076557
report email

We are a security team from multiple universities. The earlier 'report' received by the Podman Security Team was from us. We would like to provide some supplementary information regarding additional attack vectors not mentioned in the original report.

- Network namespace sharing: If two malicious containers share a network namespace and are given the ‘net_admin’ capability, they can coordinate to bypass cgroup restrictions by reproducing the DoS attack steps, just as they would if they shared an IPC namespace. The difference is that when sharing a network namespace, the malicious containers consume memory by creating network devices.

- PID namespace sharing: By sharing a PID namespace, two malicious containers can bypass the cgroup limit on the number of processes and launch a DoS attack on the host system. Specifically, an attacker can create such containers with a non-functional 'init' process (specified using 'podman run --init-path=init_process') that cannot properly handle orphaned processes. This allows the container to generate a large number of zombie processes. If attack steps similar to those for IPC namespace sharing are repeated, one container can continuously generate zombie processes and restart each time it hits the cgroup PID limit. As a result, zombie processes accumulate within the shared namespace, eventually exhausting the host system's PID resources.
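
The zombie accumulation in the PID namespace case can be watched from the host by counting processes in state `Z`. The following is a rough sketch for that purpose only; `proc_state` and `count_zombies` are illustrative names, and it assumes a Linux /proc where the state is the first field after the parenthesised command name in /proc/<pid>/stat.

```python
import os


def proc_state(stat_line):
    """Extract the state character from the contents of /proc/<pid>/stat."""
    # The command name is in parentheses and may itself contain spaces or ')',
    # so split at the last ')' rather than on whitespace.
    _head, sep, rest = stat_line.rpartition(")")
    if not sep:
        return None
    fields = rest.split()
    return fields[0] if fields else None


def count_zombies(proc_root="/proc"):
    """Count processes whose state is 'Z' (zombie) under proc_root."""
    zombies = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                if proc_state(f.read()) == "Z":
                    zombies += 1
        except OSError:
            continue  # process exited while scanning
    return zombies
```

Comparing the result against the value in /proc/sys/kernel/pid_max gives a rough idea of how close the host is to running out of PIDs.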

Zhen Xu, Huazhong University of Science and Technology
Zhi Li, Huazhong University of Science and Technology
Weijie Liu, Nankai University
XiaoFeng Wang, Indiana University Bloomington

