Bug 1972211 - When running a lot of one-off containers, podman hangs forever
Summary: When running a lot of one-off containers, podman hangs forever
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: runc
Version: 8.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: beta
Target Release: ---
Assignee: Giuseppe Scrivano
QA Contact: Alex Jia
URL:
Whiteboard:
Depends On:
Blocks: 1976749
Reported: 2021-06-15 12:36 UTC by Shane McDonald
Modified: 2023-09-15 01:09 UTC (History)
CC List: 13 users

Fixed In Version: runc-1.0.0-rc95 or newer
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1976749 (view as bug list)
Environment:
Last Closed: 2021-11-09 17:38:22 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHSA-2021:4154  0        None      None    None     2021-11-09 17:39:24 UTC

Description Shane McDonald 2021-06-15 12:36:12 UTC
Description of problem:

Hello. Opening this bug per request from mheon in #podman.

The Ansible AWX/Tower team is seeing something concerning on RHEL 8.4 with Podman 3.0.2-dev. At some point during our integration tests, all Podman operations (run, ps, info) begin to hang for the user running the containers. Other users (both root and non-root) seem unaffected. There are quite a lot of /usr/bin/fuse-overlayfs processes running. Not sure if this is a red herring or not.

When the problem occurs, there are between 10 and 15 podman processes running that all look similar to this:

/usr/bin/podman run --rm --tty --interactive --workdir /runner/project -v /tmp/pdd_wrapper_1837_gzjymf96/awx_1837_xjrlo162/:/runner:Z -v /var/lib/awx/projects/_1874__project_shetraffic/:/var/lib/awx/projects/_1874__project_shetraffic:Z -v /var/lib/awx/projects/.__awx_cache/_1874__project_shetraffic/:/var/lib/awx/projects/.__awx_cache/_1874__project_shetraffic:Z --env-file /tmp/pdd_wrapper_1837_gzjymf96/awx_1837_xjrlo162/artifacts/1837/env.list --quiet --name ansible_runner_1837 --user=root --authfile=/tmp/pdd_wrapper_1837_gzjymf96/auth.json quay.io/aap/ansible-automation-platform-20-ee-supported-rhel8:latest ansible-playbook -t update_git,install_roles,install_collections -i /runner/inventory/hosts -e @/runner/env/extravars project_update.yml


Version-Release number of selected component (if applicable):

RHEL 8.4. Podman 3.0.2-dev.


How reproducible:

Our integration tests reliably hit this, but we are still working on coming up with an exact reproducer. We can hop on a call and do live debugging with someone if that would help.

Comment 1 Shane McDonald 2021-06-15 12:38:30 UTC
I should also mention that this just started happening sometime within the past week. We've been running these tests for 1-2 months now without issue.

Comment 2 Shane McDonald 2021-06-15 15:32:08 UTC
More info: We tried changing :Z to :z and it didn't help. We didn't think this was it, but wanted to rule it out.

Comment 3 Matthew Heon 2021-06-15 18:09:31 UTC
After initial debugging - looks like it's `runc` that's failing, not Podman. `podman run` is calling conmon, which is calling `runc`, which is freezing. We do have a timeout on container creation in runc, but it is *extremely* generous (240 seconds) by default, so Podman is stuck in a critical section, holding a container lock and preventing `podman ps` and anything else that wants a container lock, until said timeout happens. AWX appears to rapidly create new containers that similarly freeze, exacerbating the problem.

Initial workaround was to use `crun` instead of `runc` with `--runtime crun`. Going to dig further into why exactly runc is angry.
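
A minimal sketch of how a caller might detect the hang described in this comment and apply the `--runtime crun` workaround. The helper names `probe` and `with_runtime` and the 10-second default are illustrative assumptions, not part of Podman. The only Podman option used is the `--runtime` flag mentioned above, and placing it directly after `podman` assumes it is accepted as a global option.

```python
import subprocess
import sys


def with_runtime(argv, runtime="crun"):
    """Return a copy of a podman argv with `--runtime <runtime>` inserted
    right after the executable name (assumed to be argv[0])."""
    return [argv[0], "--runtime", runtime, *argv[1:]]


def probe(argv, timeout=10.0):
    """Run a command and report whether it finished within `timeout` seconds.

    Returns "ok", "failed" (non-zero exit), or "hung" (timed out).
    Raises FileNotFoundError if the executable is not installed.
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"
```

For example, `probe(["podman", "ps"])` returning "hung" for the affected user would match the reported symptom, and `with_runtime(["podman", "run", "--rm", "IMAGE", "true"])` builds the same command with crun selected (`IMAGE` is a placeholder).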

Comment 4 Matthew Heon 2021-06-16 13:32:09 UTC
Given observations here, I'm changing component to `runc`.

Comment 5 Tom Sweeney 2021-06-16 15:55:20 UTC
Giuseppe, could you take a look at this please?  I've also added Kir to the cc in case he has thoughts or wants to grab this.

Comment 6 Giuseppe Scrivano 2021-06-17 12:47:53 UTC
Matt, do you have a reproducer for the issue?

Comment 7 Matthew Heon 2021-06-17 14:30:36 UTC
Negative, but Shane was able to provide access to a VM that did reproduce.

Comment 8 Giuseppe Scrivano 2021-06-18 10:02:23 UTC
I haven't managed to reproduce it either.

Shane, what is the version of runc you are using (rpm -qi runc)?

Kir, have you seen this issue before?

Comment 9 Kir Kolyshkin 2021-06-18 19:08:23 UTC
> Kir, have you seen this issue before?

There is a seccomp-related issue that can cause such behavior. The bug appeared in rc93 and was fixed in rc94 by https://github.com/opencontainers/runc/pull/2871/commits/7b3e0bcf2907c29e67eb49fb7ef6c03ea6456d45

I don't remember anything else with respect to runc run.

Comment 10 Karel Klic 2021-06-20 14:43:56 UTC
runc rc93 also causes issues for Kubernetes (with Docker and containerd for local container management).

It is reliably reproducible on a system with ~120 running containers: new pods stay in ContainerCreating state, kubelet reports lots of "context deadline exceeded" and "PLEG is not healthy" messages.

Downgrading runc solved the issue.
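
One way to spot the symptoms listed in this comment is to count the two message strings in kubelet log output. The sketch below is an illustration only: the function name and the idea of feeding it log lines from a file or journal are assumptions, and the only literal values taken from the comment are the two quoted messages.

```python
# The two kubelet messages quoted in the comment above.
SYMPTOMS = ("context deadline exceeded", "PLEG is not healthy")


def count_symptoms(lines):
    """Count how many lines contain each symptom string (case-sensitive)."""
    counts = {symptom: 0 for symptom in SYMPTOMS}
    for line in lines:
        for symptom in SYMPTOMS:
            if symptom in line:
                counts[symptom] += 1
    return counts
```

Passing an open log file works directly, since iterating a text file yields its lines.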

Comment 11 Jindrich Novy 2021-06-21 08:37:33 UTC
Confirming that https://github.com/opencontainers/runc/pull/2871/commits/7b3e0bcf2907c29e67eb49fb7ef6c03ea6456d45 is applied in rc95.

Comment 18 Shane McDonald 2021-07-06 16:11:26 UTC
> Shane, what is the version of runc you are using (rpm -qi runc)?

[root@ip-10-0-7-105 ~]# rpm -qi runc
Name        : runc
Version     : 1.0.0
Release     : 73.rc93.module+el8.4.0+11311+9da8acfb
Architecture: x86_64
Install Date: Tue 06 Jul 2021 02:41:27 PM UTC
Group       : Unspecified
Size        : 12109371
License     : ASL 2.0
Signature   : RSA/SHA256, Wed 09 Jun 2021 06:46:24 AM UTC, Key ID 199e2f91fd431d51
Source RPM  : runc-1.0.0-73.rc93.module+el8.4.0+11311+9da8acfb.src.rpm
Build Date  : Tue 08 Jun 2021 07:53:21 AM UTC
Build Host  : x86-vm-09.build.eng.bos.redhat.com
Relocations : (not relocatable)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Vendor      : Red Hat, Inc.
URL         : https://github.com/opencontainers/runc
Summary     : CLI for running Open Containers
Description :
The runc command can be used to start containers which are packaged
in accordance with the Open Container Initiative's specifications,
and to manage containers running under runc.
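
Given the earlier note that the seccomp bug appeared in rc93 and was fixed in rc94, a Release string like the one above can be checked mechanically. The sketch below uses hypothetical helper names. The regex assumes the release candidate appears as an `rcNN` token delimited by dots or dashes, as in this package's Release field. It treats only rc93 as affected, which is what this thread reports and not a complete statement of affected versions.

```python
import re

# Assumption: the release-candidate number shows up as an "rcNN" token
# bounded by "." or "-" (or by the start/end of the Release string).
_RC_TOKEN = re.compile(r"(?:^|[.\-])rc(\d+)(?:[.\-]|$)")


def rc_number(release):
    """Extract the release-candidate number from an RPM Release string.

    Returns None when no rcNN token is present (for example a final build).
    """
    match = _RC_TOKEN.search(release)
    return int(match.group(1)) if match else None


def has_seccomp_hang(release):
    """True if the build is rc93, the only candidate this thread names as
    carrying the bug (it was fixed in rc94)."""
    return rc_number(release) == 93
```

With the Release value shown in this comment, `has_seccomp_hang("73.rc93.module+el8.4.0+11311+9da8acfb")` evaluates to `True`.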

Comment 20 Shane McDonald 2021-07-08 13:19:41 UTC
Alex - To the best of my knowledge we have *not* seen this error before. This is a stock EC2 instance so I can say with a reasonable level of confidence that we are not doing anything weird with the filesystem.

Comment 27 Alex Jia 2021-08-08 12:59:47 UTC
I can't test this with the latest runc on RHEL 8.5.0, so I verified it from the patch point of view:
the patch https://github.com/opencontainers/runc/pull/2871/commits/7b3e0bcf2907c29e67eb49fb7ef6c03ea6456d45
has been merged into runc-1.0.1-5.module+el8.5.0+12157+04f1d6be. I assume it is also fine on RHEL 8.5.0,
as it is on RHEL 8.4.0.

Comment 29 errata-xmlrpc 2021-11-09 17:38:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: container-tools:rhel8 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4154

Comment 30 Red Hat Bugzilla 2023-09-15 01:09:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

