Bug 1972211
| Summary: | When running a lot of one-off containers, podman hangs forever | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Shane McDonald <smcdonal> |
| Component: | runc | Assignee: | Giuseppe Scrivano <gscrivan> |
| Status: | CLOSED ERRATA | QA Contact: | Alex Jia <ajia> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.4 | CC: | bbaude, dornelas, dwalsh, jligon, jnovy, karel.klic, kir, lsm5, mheon, pthomas, tsweeney, umohnani, ypu |
| Target Milestone: | beta | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | runc-1.0.0-rc95 or newer | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1976749 | Environment: | |
| Last Closed: | 2021-11-09 17:38:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1976749 | | |
Description
Shane McDonald
2021-06-15 12:36:12 UTC
I should also mention that this just started happening sometime within the past week. We've been running these tests for 1-2 months now without issue.

More info: We tried changing :Z to :z and it didn't help. We didn't think this was it, but wanted to rule it out.

After initial debugging - looks like it's `runc` that's failing, not Podman. `podman run` is calling conmon, which is calling `runc`, which is freezing. We do have a timeout on container creation in runc, but it is *extremely* generous (240 seconds) by default, so Podman is stuck in a critical section, holding a container lock and preventing `podman ps` and anything else that wants a container lock, until said timeout happens. AWX appears to rapidly create new containers that similarly freeze, exacerbating the problem. Initial workaround was to use `crun` instead of `runc` with `--runtime crun`. Going to dig further into why exactly runc is angry.

Given observations here, I'm changing component to `runc`. Giuseppe, could you take a look at this please? I've also added Kir to the cc in case he has thoughts or wants to grab this.

Matt, do you have a reproducer for the issue?

Negative, but Shane was able to provide access to a VM that did reproduce.

I've not managed to reproduce as well. Shane, what is the version of runc you are using (rpm -qi runc)? Kir, have you seen this issue before?

> Kir, have you seen this issue before?

There is a seccomp-related issue that can cause such behavior. The bug appeared in rc93 and was fixed in rc94 by https://github.com/opencontainers/runc/pull/2871/commits/7b3e0bcf2907c29e67eb49fb7ef6c03ea6456d45

Do not remember anything else wrt runc run.

The rc93 runc causes issues also for Kubernetes (with Docker and containerd for local container management). It is reliably reproducible on a system with ~120 running containers: new pods stay in ContainerCreating state, kubelet reports lots of "context deadline exceeded" and "PLEG is not healthy" messages.
Downgrading runc solved the issue.

Confirming that https://github.com/opencontainers/runc/pull/2871/commits/7b3e0bcf2907c29e67eb49fb7ef6c03ea6456d45 is applied in rc95, which is the version currently present.

> Shane, what is the version of runc you are using (rpm -qi runc)?

    [root@ip-10-0-7-105 ~]# rpm -qi runc
    Name : runc
    Version : 1.0.0
    Release : 73.rc93.module+el8.4.0+11311+9da8acfb
    Architecture: x86_64
    Install Date: Tue 06 Jul 2021 02:41:27 PM UTC
    Group : Unspecified
    Size : 12109371
    License : ASL 2.0
    Signature : RSA/SHA256, Wed 09 Jun 2021 06:46:24 AM UTC, Key ID 199e2f91fd431d51
    Source RPM : runc-1.0.0-73.rc93.module+el8.4.0+11311+9da8acfb.src.rpm
    Build Date : Tue 08 Jun 2021 07:53:21 AM UTC
    Build Host : x86-vm-09.build.eng.bos.redhat.com
    Relocations : (not relocatable)
    Packager : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
    Vendor : Red Hat, Inc.
    URL : https://github.com/opencontainers/runc
    Summary : CLI for running Open Containers
    Description :
    The runc command can be used to start containers which are packaged
    in accordance with the Open Container Initiative's specifications,
    and to manage containers running under runc.

Alex - To the best of my knowledge we have *not* seen this error before. This is a stock EC2 instance so I can say with a reasonable level of confidence that we are not doing anything weird with the filesystem.

I can't test this with the latest runc on RHEL 8.5.0, so I verified it from the patch point of view: the patch https://github.com/opencontainers/runc/pull/2871/commits/7b3e0bcf2907c29e67eb49fb7ef6c03ea6456d45 has been merged into runc-1.0.1-5.module+el8.5.0+12157+04f1d6be, so I assume it's also fine on RHEL 8.5.0, as on RHEL 8.4.0.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: container-tools:rhel8 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4154

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
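For anyone triaging a similar hang, a quick check is whether the installed runc is an rc93 build, since the comments in this report tie the freeze to a seccomp-related bug that appeared in rc93 and was fixed in rc94. The following is a minimal shell sketch, not an official tool. The function names `runc_rc_number` and `runc_release_is_rc93` are hypothetical, and the sketch assumes the release-candidate tag appears as a dot-separated `rcNN` component of the RPM Release field, as it does in the `rpm -qi runc` output quoted in this report.

```shell
#!/bin/sh
# Sketch only: classify a runc RPM Release string by its rc tag.
# Assumption: the tag is a dot-separated "rcNN" component, as in the
# Release field quoted in this report.

# Print NN for the first "rcNN" component, or nothing if there is none.
runc_rc_number() {
    printf '%s\n' "$1" | tr '.' '\n' | sed -n 's/^rc\([0-9][0-9]*\)$/\1/p' | head -n 1
}

# Succeed (exit status 0) only when the Release string names rc93.
runc_release_is_rc93() {
    [ "$(runc_rc_number "$1")" = "93" ]
}

# Release value copied from the rpm -qi output in this report.
release='73.rc93.module+el8.4.0+11311+9da8acfb'
if runc_release_is_rc93 "$release"; then
    echo "release $release is an rc93 build; the hang described here may apply"
else
    echo "release $release is not an rc93 build"
fi
```

On a live host the Release string could be read with `rpm -q --queryformat '%{RELEASE}\n' runc` instead of being hard-coded. If the build turns out to be rc93, the workaround noted in this report was to use `crun` instead of `runc` with `--runtime crun` until an updated runc is installed.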