Description of problem:

When a pod is created and then deleted before it becomes ready, the pod gets stuck terminating.

Kubelet shows:

kubelet_pods.go:980] Pod "pod-submit-status-2-2_e2e-pods-5211(443f27c8-3654-402d-aed2-2383d120238c)" is terminated, but pod cgroup sandbox has not been cleaned up

Logging into the worker:

$ sudo su -
Last login: Tue Dec 1 15:44:35 UTC 2020 on pts/0
[systemd]
Failed Units: 1
  crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope

# systemctl status crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope
● crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope - libcontainer container aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c
    Loaded: loaded (/run/systemd/transient/crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope; transient)
 Transient: yes
   Drop-In: /run/systemd/transient/crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope.d
            └─50-CPUShares.conf, 50-DeviceAllow.conf, 50-DevicePolicy.conf
    Active: failed since Tue 2020-12-01 05:30:56 UTC; 10h ago
       CPU: 22ms
    CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod443f27c8_3654_402d_aed2_2383d120238c.slice/crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope
            └─1602268 /usr/bin/runc init

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

# ps -fe | grep 1602268
root     1602268       1  0 05:27 ?        00:00:00 /usr/bin/runc init

# cat /proc/1602268/status | grep State
State:  D (disk sleep)

# kill -9 1602268
# ps -fe | grep 1602268
root     1602268       1  0 05:27 ?        00:00:00 /usr/bin/runc init

The runc init process is in an unkillable sleep.

# cat /proc/1602268/stack
[<0>] __refrigerator+0x3f/0x160
[<0>] unix_stream_read_generic+0x7aa/0x8a0
[<0>] unix_stream_recvmsg+0x53/0x70
[<0>] sock_read_iter+0x94/0xf0
[<0>] new_sync_read+0x121/0x170
[<0>] vfs_read+0x91/0x140
[<0>] ksys_read+0x4f/0xb0
[<0>] do_syscall_64+0x5b/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca

The process is stuck in the kernel in the freezer.

# pwd
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod443f27c8_3654_402d_aed2_2383d120238c.slice/crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope
# cat freezer.state
FROZEN
# cat freezer.self_freezing
1

This confirms the runc init process is frozen (by itself, not by its parent). If I `echo THAWED > freezer.state`, the cgroups clean up and the pod is removed.

Version-Release number of selected component (if applicable):
4.6.6

How reproducible:
Sometimes

Steps to Reproduce:
Happened to me running this Kubernetes conformance test: https://github.com/openshift/origin/blob/62c5b679aa8618a4769365085a293480469b3d75/vendor/k8s.io/kubernetes/test/e2e/node/pods.go#L219

Actual results:
Pod is stuck terminating

Expected results:
Pod is removed cleanly

Additional info:
https://github.com/opencontainers/runc/search?q=frozen
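For reference, a minimal Go sketch of the manual workaround above (programmatically writing THAWED to the stuck scope's freezer.state; the scope path below is a placeholder, substitute the crio-<id>.scope from systemctl status):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// thawFreezerCgroup mirrors the manual `echo THAWED > freezer.state`
// workaround: it asks the kernel to unfreeze all tasks in the cgroup,
// which lets the stuck "runc init" exit so the pod cgroup can be removed.
func thawFreezerCgroup(scopePath string) error {
	stateFile := filepath.Join(scopePath, "freezer.state")
	return os.WriteFile(stateFile, []byte("THAWED"), 0644)
}

func main() {
	// Placeholder path for illustration only.
	scope := "/sys/fs/cgroup/freezer/kubepods.slice/.../crio-<container-id>.scope"
	if err := thawFreezerCgroup(scope); err != nil {
		fmt.Fprintln(os.Stderr, "thaw failed:", err)
		os.Exit(1)
	}
}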
Reproduced locally; working on a fix.
Proposed minimal fix: https://github.com/opencontainers/runc/pull/2774. The plan is to include it in rc93.
rc93 is out with the above minimal fix. Working on an additional fix to further improve the success rate (unfortunately it's still a game of chances).
The additional fix (https://github.com/opencontainers/runc/pull/2791) greatly improves the chances of success. It will be released as part of rc94; if needed I can also backport it.
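For context, here is a rough sketch of the general retry-and-verify approach for the cgroup v1 freezer described above. This is my own illustration, not the code from either PR: the point is that a single write of FROZEN is not trusted, because the kernel may keep reporting FREEZING.

package freezer

import (
	"errors"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// freezeWithRetry re-reads freezer.state after asking for FROZEN and,
// if the transition appears stuck in FREEZING, thaws the cgroup and
// tries again instead of leaving processes stranded in the freezer.
func freezeWithRetry(cgroupPath string) error {
	stateFile := filepath.Join(cgroupPath, "freezer.state")
	for attempt := 0; attempt < 1000; attempt++ {
		if err := os.WriteFile(stateFile, []byte("FROZEN"), 0644); err != nil {
			return err
		}
		data, err := os.ReadFile(stateFile)
		if err != nil {
			return err
		}
		switch strings.TrimSpace(string(data)) {
		case "FROZEN":
			return nil
		case "FREEZING":
			// Stuck mid-transition: thaw, wait a moment, and retry.
			if err := os.WriteFile(stateFile, []byte("THAWED"), 0644); err != nil {
				return err
			}
			time.Sleep(10 * time.Millisecond)
		}
	}
	return errors.New("unable to freeze " + cgroupPath)
}

Even with retries this stays probabilistic ("a game of chances"), since the kernel can land in the stuck state again on the next attempt.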
Backport to 4.6: https://github.com/projectatomic/runc/pull/40. Backport to 4.8 (which already has the first PR): https://github.com/projectatomic/runc/pull/41.
A runc build for 4.6 with the fixes is available from https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1521730 (el7) and https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1521735 (el8). @sjenning, can you give it a try?
Hey Kir, I'm not going to have time to confirm this fix. However, if you have a reproducer and that reproducer can't trigger the issue after the fix, that's good enough for me!
To reiterate, the fixes are https://github.com/opencontainers/runc/pull/2774 and https://github.com/opencontainers/runc/pull/2791 -- both included in https://github.com/projectatomic/runc/pull/40, which was included in runc-1.0.0-83.rhaos4.6.git8c2e7c8.el8.x86_64.rpm, available in v4.6.21+. rc94 is not yet released, but it does not contain any additional fixes on top of what is already included in the above runc/OCP builds.

@nkashyap As per the above, you don't need to wait for rc94 -- I suggest your customer upgrade to the latest version (at least v4.6.21).

@rkshirsa v4.6.21 already includes the fixes, but apparently they don't eliminate the issue entirely. I am also seeing this in runc tests; I opened https://github.com/opencontainers/runc/issues/2907 yesterday to track it. Overall I think this is most probably specific to the EL7 kernel, and userspace (runc) can only try to work around it, so it is still a game of chances. Alas, at the moment I cannot think of any additional fixes; maybe this needs to be fixed in the kernel.
At this point, unfortunately, I don't have anything to add. I am working on reproducing it locally, with no luck so far. I don't think a kernel bug has been filed -- I may file one later.

> Recently, they have shared an observation where pods were stuck and too many defunct processes were found on worker "srlp-on44".

This is a separate issue; I recommend you file a bug against cri-o, which is the parent of the conmon processes and should wait(2) on them.
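On that defunct-process side issue, here is a minimal generic Go sketch (not cri-o code) of what "should wait(2) on them" means in practice: the parent watches for SIGCHLD and reaps every exited child so it does not remain defunct.

package main

import (
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD)
	for range sigs {
		// Reap every child that has exited; WNOHANG keeps the loop non-blocking.
		for {
			var status syscall.WaitStatus
			pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
			if pid <= 0 || err != nil {
				break
			}
		}
	}
}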
conmon issue was filed as https://bugzilla.redhat.com/show_bug.cgi?id=1848524 and apparently fixed, but since we're seeing it again in a later version (4.6.21) I guess we need to revisit this.
I've got two theories.

1. EINTR. We did not port https://github.com/opencontainers/runc/pull/2258, so fscommon.WriteFile() returns EINTR and the whole (*Freezer).Set fails unexpectedly, leaving the cgroup in the FREEZING state. This is very promising, but it does not hold water, as apparently Go 1.15 is used, which already wraps reads and writes in a retry-on-EINTR loop. Can anyone please confirm this by running "runc version" on 4.6.21?

2. (*Freezer).Set fails somewhere and returns an error without thawing the cgroup, so it's stuck in the FREEZING or FROZEN state. This is purely theoretical, not confirmed by anything, but I have a proposed fix (https://github.com/opencontainers/runc/pull/2918) and its backport to the 4.6 branch (https://github.com/projectatomic/runc/pull/47). A rough sketch of that thaw-on-error idea follows below.
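The sketch below is my own illustration of theory 2's mitigation, not the actual patch; waitForState is a hypothetical helper introduced only for this example.

package freezer

import (
	"errors"
	"os"
	"strings"
	"time"
)

// setFrozen illustrates the thaw-on-error idea: if the freeze does not
// complete, the cgroup is thawed again instead of being left in a
// FREEZING/FROZEN state with "runc init" stuck inside it.
func setFrozen(stateFile string) (err error) {
	defer func() {
		if err != nil {
			// Best-effort rollback on the error path.
			_ = os.WriteFile(stateFile, []byte("THAWED"), 0644)
		}
	}()
	if err = os.WriteFile(stateFile, []byte("FROZEN"), 0644); err != nil {
		return err
	}
	return waitForState(stateFile, "FROZEN")
}

// waitForState polls freezer.state until it reaches the wanted value
// or gives up (hypothetical helper for this sketch).
func waitForState(stateFile, want string) error {
	for i := 0; i < 100; i++ {
		data, err := os.ReadFile(stateFile)
		if err != nil {
			return err
		}
		if strings.TrimSpace(string(data)) == want {
			return nil
		}
		time.Sleep(10 * time.Millisecond)
	}
	return errors.New("freezer did not reach state " + want)
}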
Hello Kir/Team, I got an update from the customer on case #02896371: even after upgrading the cluster to 4.6.21 they are still able to reproduce the issue. I also checked quickly in the latest sosreport and must-gather shared by the customer, and it is still the same pattern. I am attaching/sharing the latest must-gather and sosreport for your reference. Kindly check and update us. Thanks & Regards, Nirupma
@Kir: as you asked for, from 4.6.21:

# runc -version
runc version
spec: 1.0.2-dev
go: go1.15.5
libseccomp: 2.4.1
It would make sense to test whether my fix (upstream: https://github.com/opencontainers/runc/pull/2918, 4.6 backport: https://github.com/projectatomic/runc/pull/47) helps. I see that both PRs were merged, but I'm not sure if RPMs are available.
RPMs are available now
One more fix: https://github.com/projectatomic/runc/pull/52 (a backport of https://github.com/opencontainers/runc/pull/2941). I am not sure which version it went into. @pehunt, can you chime in?
updated with version it's fixed in
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-25-041803   True        False         21m     Cluster version is 4.8.0-0.nightly-2021-05-25-041803

sh-4.4# rpm -q runc
runc-1.0.0-96.rhaos4.8.gitcd80260.el8.x86_64

When deleting a pod before it gets ready, deletion succeeds every time, and I don't find the related kubelet log:

Pod XXX is terminated, but pod cgroup sandbox has not been cleaned up

Also, no such log was found in the last 2 days: https://search.ci.openshift.org/?search=pod+cgroup+sandbox+has+not+been+cleaned+up&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
It has already been backported to 4.6 (https://github.com/projectatomic/runc/pull/47, https://github.com/projectatomic/runc/pull/52) and both PRs were merged; I'm not sure if the fixed version has been released -- @pehunt, can you please tell us?
yup this was fixed in the attached version, which was released with 4.6.30
Thank you very much for the information, Kir and Peter.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
*** Bug 1943564 has been marked as a duplicate of this bug. ***
I have found a very similar problem with one of our partners. They are using OCP 4.7. Was this also backported to 4.7?
@peter.hunter do you know if this was backported to OCP 4.7 too? @jgato if Peter can't get back to you by early next week, it might be worthwhile to open a new BZ about 4.7; this BZ has been closed for quite a while now.
oh darn, we never forward ported this change. Doing so now.
OK, so this will be fixed in 4.7 and 4.8. Does that mean 4.9 and above are already fixed or not affected?
Hey Jose, sorry for the late reply. Yes, the 4.9 build is based on upstream runc 1.0.1, which has the corresponding upstream fix: https://github.com/opencontainers/runc/pull/2918/commits/fcd7fe85e1ea706fbdfb383824e9f390a53a34e9
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days