Bug 1991528

Summary: podman pod rm --force fails with "device or resource busy" when --cgroup-manager is set to cgroupfs
Product: Red Hat Enterprise Linux 9
Component: podman
Version: 9.0
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: unspecified
Reporter: Joy Pu <ypu>
Assignee: Matthew Heon <mheon>
QA Contact: atomic-bugs <atomic-bugs>
Docs Contact:
CC: bbaude, dwalsh, jligon, jnovy, lsm5, mheon, pthomas, tsweeney, umohnani
Keywords: Reopened
Target Milestone: beta
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Last Closed: 2023-07-10 19:28:50 UTC
Type: Bug

Description Joy Pu 2021-08-09 11:10:26 UTC
Description of problem:
podman pod rm --force fails with the error message:
Error: error removing pod ff010ae9ed3136dc32caf113d232496164a085aeb058529296e3bbc146d2d089 conmon cgroup: remove /sys/fs/cgroup/libpod_parent/ff010ae9ed3136dc32caf113d232496164a085aeb058529296e3bbc146d2d089/conmon: remove /sys/fs/cgroup/libpod_parent/ff010ae9ed3136dc32caf113d232496164a085aeb058529296e3bbc146d2d089/conmon: device or resource busy
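
For context on the error: a cgroup directory can only be removed once it has no member processes and no child cgroups, so "device or resource busy" means something is still attached to the conmon cgroup at removal time. A minimal diagnostic sketch (assuming the cgroup v2 unified hierarchy used on RHEL 9; the pod ID below is the one from the error above):

# List any PIDs still attached to the conmon cgroup that Podman failed to remove
cat /sys/fs/cgroup/libpod_parent/ff010ae9ed3136dc32caf113d232496164a085aeb058529296e3bbc146d2d089/conmon/cgroup.procs
# And check for leftover child cgroups under it
ls /sys/fs/cgroup/libpod_parent/ff010ae9ed3136dc32caf113d232496164a085aeb058529296e3bbc146d2d089/conmon/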


Version-Release number of selected component (if applicable):
podman-3.3.0-0.15.module+el9beta+12090+32d0f3c8.x86_64

How reproducible:
around 60%

Steps to Reproduce:
Found this while running the e2e test case "podman pod container --infra=false doesn't share SELinux labels". You can clone upstream podman locally and run that test case with CGROUP_MANAGER="cgroupfs" and PODMAN_BINARY=`which podman`. The steps performed by the test are listed below (a condensed reproducer sketch follows step 3):
1. Create a pod with --infra=false
#  /usr/bin/podman --storage-opt vfs.imagestore=/tmp/podman/imagecachedir --root /tmp/podman_test431871916/crio --runroot /tmp/podman_test431871916/crio-run --runtime crun --conmon /usr/bin/conmon --cni-config-dir /etc/cni/net.d --cgroup-manager cgroupfs --tmpdir /tmp/podman_test431871916 --events-backend file --storage-driver vfs pod create --infra=false

2. Run two containers in the pod and check that their SELinux labels differ
/usr/bin/podman --storage-opt vfs.imagestore=/tmp/podman/imagecachedir --root /tmp/podman_test431871916/crio --runroot /tmp/podman_test431871916/crio-run --runtime crun --conmon /usr/bin/conmon --cni-config-dir /etc/cni/net.d --cgroup-manager cgroupfs --tmpdir /tmp/podman_test431871916 --events-backend file --storage-driver vfs run --pod 5a218fbc12c7ecc5d8cda5b2531a7420688eb56bfe1ee421dda75766f1e295ec quay.io/libpod/alpine:latest cat /proc/self/attr/current

 /usr/bin/podman --storage-opt vfs.imagestore=/tmp/podman/imagecachedir --root /tmp/podman_test431871916/crio --runroot /tmp/podman_test431871916/crio-run --runtime crun --conmon /usr/bin/conmon --cni-config-dir /etc/cni/net.d --cgroup-manager cgroupfs --tmpdir /tmp/podman_test431871916 --events-backend file --storage-driver vfs run --pod 5a218fbc12c7ecc5d8cda5b2531a7420688eb56bfe1ee421dda75766f1e295ec quay.io/libpod/alpine:latest cat /proc/self/attr/current

3. Remove the pod with --force
Running: /usr/bin/podman --storage-opt vfs.imagestore=/tmp/podman/imagecachedir --root /tmp/podman_test431871916/crio --runroot /tmp/podman_test431871916/crio-run --runtime crun --conmon /usr/bin/conmon --cni-config-dir /etc/cni/net.d --cgroup-manager cgroupfs --tmpdir /tmp/podman_test431871916 --events-backend file --storage-driver vfs pod rm 5a218fbc12c7ecc5d8cda5b2531a7420688eb56bfe1ee421dda75766f1e295ec --force
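
For convenience, the three steps above can be condensed into a rough shell sketch. This is a simplification, not the exact e2e test: the --root/--runroot/--storage-opt/--tmpdir flags used by the test are dropped on the assumption that only --cgroup-manager cgroupfs and --infra=false matter, and it should be run as root like the test:

# 1. Create a pod without an infra container (the pod ID is printed on stdout)
POD_ID=$(podman --cgroup-manager cgroupfs --events-backend file pod create --infra=false)

# 2. Run two containers in the pod and print their SELinux labels
podman --cgroup-manager cgroupfs --events-backend file run --pod "$POD_ID" quay.io/libpod/alpine:latest cat /proc/self/attr/current
podman --cgroup-manager cgroupfs --events-backend file run --pod "$POD_ID" quay.io/libpod/alpine:latest cat /proc/self/attr/current

# 3. Force-remove the pod; with cgroupfs this fails around 60% of the time with
#    "remove .../conmon: device or resource busy"
podman --cgroup-manager cgroupfs --events-backend file pod rm --force "$POD_ID"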


Actual results:
The pod rm --force command fails with the "device or resource busy" error shown above.

Expected results:
The pod is removed as expected.

Additional info:

The same test passes 100% of the time with --cgroup-manager systemd.

Comment 1 Matthew Heon 2021-08-09 13:22:05 UTC
I'll take this one; I was just digging around in the code in that area. We have code that prevents this on cgroups v1 systems. The issue we encountered before, and which I'm almost certain is happening here, is that the cleanup process launches into and occupies the conmon cgroup, preventing its deletion; the fix was to set a PID limit on that cgroup before stopping the pod's containers, so the cleanup process could not be launched. I presume cgroupfs has changed enough between v1 and v2 that this code does not work on RHEL 9 (and likely will not work on RHEL 8 with cgroups v2 and cgroupfs either).
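
To make the described cgroups v1 workaround concrete, here is a rough cgroupfs-level sketch of what "set a PID limit on the conmon cgroup before stopping the pod's containers" amounts to. This only illustrates the idea, not Podman's actual implementation; <POD_ID> is a placeholder and the v1 path layout is an assumption:

# Clamp pids.max so no new process (e.g. a container cleanup helper) can be
# forked into the conmon cgroup while the pod is being torn down.
# cgroups v1 (pids controller mounted under its own hierarchy; assumed layout):
echo 1 > /sys/fs/cgroup/pids/libpod_parent/<POD_ID>/conmon/pids.max
# cgroups v2 (unified hierarchy, as on RHEL 9):
echo 1 > /sys/fs/cgroup/libpod_parent/<POD_ID>/conmon/pids.max
# Once the cgroup has no members left, its directory can actually be removed:
rmdir /sys/fs/cgroup/libpod_parent/<POD_ID>/conmon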

Comment 5 RHEL Program Management 2023-02-09 07:27:47 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 8 Tom Sweeney 2023-05-04 20:53:11 UTC
@mheon any progress on this one?

Comment 10 Tom Sweeney 2023-07-10 17:59:56 UTC
reping @mheon

Comment 11 Matthew Heon 2023-07-10 19:28:50 UTC
The code in question appears to have been entirely removed while I was not working on this bug (was replaced as part of the effort to add resource limits to pods), so I think we can call this done. Wish I could say this was intentional and I was waiting for the code to be refactored out of existence, but this just fell lower in priority than other bugs long enough that the code changed around it.

Going to CLOSED CURRENTRELEASE given the complete removal of affected codepaths.