Summary: We received a report of an attack vector on containers which share an IPC namespace (definitely Podman, but likely also applicable to Kubernetes, Docker, and other container runtimes). At least two containers are configured with a shared IPC namespace and a cgroup limiting memory. One of those containers is malicious and contains a binary which creates a large number of IPC resources in /dev/shm, and continues doing so until it is OOM killed. The malicious container is now dead and its cgroup removed, but the IPC resources it created are not removed; they are tied to the IPC namespace, which will not be removed until all containers using it are stopped, and one non-malicious container is holding the namespace open. The malicious container is restarted (either automatically or under attacker control), repeating the process and increasing the amount of memory consumed. With a container configured to always restart (e.g. `podman run --restart=always`), this results in a memory-based denial of service of the system.

Podman version: 5.0.0-dev and earlier
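The growth described in the summary can be watched from any container that shares the IPC namespace by sampling how much data sits in the shared tmpfs between restarts. The following is a minimal sketch, assuming the leaked IPC resources appear as regular files under /dev/shm; the function name `shm_usage_bytes` is hypothetical and is not part of Podman.

```python
import os
import stat


def shm_usage_bytes(root="/dev/shm"):
    """Sum the apparent size (st_size) of regular files under root.

    Sparse files can make this overstate the memory actually allocated.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if stat.S_ISREG(info.st_mode):
                total += info.st_size
    return total


if __name__ == "__main__":
    print(shm_usage_bytes())
```

If the value keeps rising across restarts of the other container, the resources are outliving the container that created them.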
The issue is that the allocated memory is assigned to the container cgroup, which has a memory limit; if that limit is hit, the kernel refuses to allocate more.

But if the container exits, the shared memory is not freed and is still allocated to the first cgroup, which is no longer accessible from user space (though it still seems to be referenced internally in the kernel).

When the container exits, we create a new one that uses the same cgroup name and also has a limit set.

In this way the container can be restarted multiple times, and each restart can leak memory, since the cgroup does not seem to be freed internally.

I've not looked into the code, so this analysis is based only on my observations from user space; it might be completely wrong :-)

If I am not wrong, then I think the sensible thing to do in this case would be to migrate the memory allocated to the destroyed cgroup to another cgroup using the same IPC namespace. That way, from user space we can avoid the issue by making sure each cgroup has a limit set, and there are no leaks once a cgroup is deleted.

I think we should add "Waiman Long <llong>" to the conversation. It seems I am not allowed to do it.
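The user-space observation above can be made concrete by reading the `shmem` counter from a cgroup's `memory.stat` file. This is a minimal sketch, assuming cgroup v2; the helper names and the directory passed in are hypothetical. The exited container's cgroup directory is gone from user space, so the path has to be a cgroup that still exists. Whether an ancestor's counters still include the orphaned memory is an assumption, not something established in this thread.

```python
import os


def parse_flat_keyed(text):
    """Parse 'key value' lines (the cgroup v2 flat-keyed format) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def shmem_bytes(cgroup_dir):
    """Return the 'shmem' value from memory.stat in cgroup_dir, or 0 if absent."""
    with open(os.path.join(cgroup_dir, "memory.stat")) as f:
        return parse_flat_keyed(f.read()).get("shmem", 0)


if __name__ == "__main__":
    # Assumes the unified hierarchy is mounted at /sys/fs/cgroup.
    print(shmem_bytes("/sys/fs/cgroup"))
```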
Added llong to CC list
(In reply to Giuseppe Scrivano from comment #11)
> the issue is that the allocated memory is assigned to the container cgroup
> that has a memory limit, if that limit is hit the kernel refuses to allocate
> more.
>
> But if the container exits, the shared memory is not freed and it is still
> allocated to the first cgroup, that is not accessible anymore from user
> space (but it seems to be still referenced internally in the kernel).
>
> When the container exits, we create a new one, that uses the same cgroup
> name and it also has a limit set.
>
> In this way the container can be restarted multiple times and each restart
> can leak memory since the cgroup seems to not be freed internally.
>
> I've not looked into the code, so this analysis is only based on my
> observations from user space, it might be completely wrong :-)
>
> If I am not wrong, then I think the sensible thing to do in this case would
> be to migrate the memory allocated to the cgroup that was destroyed to
> another cgroup using that same IPC. In this way from userspace we can avoid
> the issue by making sure each cgroup has a limit set, and that there are no
> leaks once a cgroup is deleted.
>
> I think we should add "Waiman Long <llong>" to the conversation.
> It seems I am not allowed to do it

There is actually an upstream discussion about this specific problem. In the case of shared memory, ownership of the memory is assigned to the memory cgroup of the first process that uses it. References to that shared memory can be present in other memory cgroups. When all processes in the owning cgroup have exited and the cgroup is to be destroyed, it remains in a zombie state because of the additional references to the shared memory. AFAIK, there was no consensus on the best way forward the last time I checked. I need to check again to see if there has been progress on this issue.
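Cgroups that have been deleted but not yet destroyed are counted by the kernel in the `nr_dying_descendants` field of `cgroup.stat` on cgroup v2, which gives a rough way to check for the zombie state described above. The following is a minimal sketch; the function name and the default mount point are assumptions, and a nonzero count does not by itself show that shared memory is the cause.

```python
import os


def dying_descendants(cgroup_dir="/sys/fs/cgroup"):
    """Return nr_dying_descendants from cgroup.stat, or None if not reported."""
    with open(os.path.join(cgroup_dir, "cgroup.stat")) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2 and fields[0] == "nr_dying_descendants":
                return int(fields[1])
    return None


if __name__ == "__main__":
    print(dying_descendants())
```

Sampling this value before and after each container restart would show whether the number of dying cgroups grows with every cycle.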
Created podman tracking bugs for this issue: Affects: fedora-all [bug 2302003]
Created attachment 2076557 [details]
report email

We are a security team from multiple universities. The earlier 'report' received by the Podman Security Team was from us. We would like to provide some supplementary information regarding additional attack vectors not mentioned in the original report.

- Network namespace sharing: If two malicious containers share a network namespace and are given the 'net_admin' privilege, they can coordinate to bypass cgroup restrictions by reproducing the DoS attack steps, just as they would if they shared an IPC namespace. The difference is that when sharing a network namespace, the malicious containers consume memory by creating network devices.

- PID namespace sharing: By sharing a PID namespace, two malicious containers can bypass the cgroup limit on the number of processes and launch a DoS attack on the host system. Specifically, an attacker can create such containers with a non-functional 'init' process (specified using 'podman run --init-path=init_process') that cannot properly handle orphaned processes. This allows the container to generate a large number of zombie processes. If steps similar to those of the IPC namespace attack are repeated, one container can continuously generate zombie processes and restart once it hits the cgroup PID limit. As a result, zombie processes will accumulate within the shared namespace, eventually exhausting the host system's PID resources.

Zhen Xu, Huazhong University of Science and Technology
Zhi Li, Huazhong University of Science and Technology
Weijie Liu, Nankai University
XiaoFeng Wang, Indiana University Bloomington
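For the PID namespace case, the accumulation of zombie processes can be observed by counting processes whose state in `/proc/<pid>/stat` is `Z`. The following is a minimal sketch for checking a host or a shared PID namespace; the function names are hypothetical, and the code only counts zombies and does not attribute them to a particular container.

```python
import os


def state_from_stat(stat_line):
    """Extract the process state letter from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain spaces
    or ')', so the state is the first field after the last ')'.
    """
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    return rest[0]


def count_zombies(proc_root="/proc"):
    """Count entries under proc_root whose state is 'Z' (zombie)."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if state_from_stat(f.read()) == "Z":
                    count += 1
        except (OSError, ValueError, IndexError):
            # The process exited while being read, or the line was malformed.
            continue
    return count


if __name__ == "__main__":
    print(count_zombies())
```

A count that climbs after each restart of the container, while the long-lived container keeps the namespace open, would match the accumulation described in the report.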