Bug 1903228 - Pod stuck in Terminating, runc init process frozen
Summary: Pod stuck in Terminating, runc init process frozen
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Kir Kolyshkin
QA Contact: MinLi
URL:
Whiteboard:
Duplicates: 1943564
Depends On:
Blocks: 2103215 2103217
 
Reported: 2020-12-01 16:45 UTC by Seth Jennings
Modified: 2024-10-01 17:09 UTC
CC: 20 users

Fixed In Version: runc-1.0.0-86.rhaos4.6.git23384e2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2103215
Environment:
Last Closed: 2021-07-27 22:34:25 UTC
Target Upstream Version:
Embargoed:
pehunt: needinfo+




Links:
Red Hat Product Errata RHSA-2021:2438 (Last Updated: 2021-07-27 22:34:49 UTC)

Description Seth Jennings 2020-12-01 16:45:10 UTC
Description of problem:
When a pod is created and then deleted before it becomes ready, the pod is stuck in Terminating.

Kubelet shows:
kubelet_pods.go:980] Pod "pod-submit-status-2-2_e2e-pods-5211(443f27c8-3654-402d-aed2-2383d120238c)" is terminated, but pod cgroup sandbox has not been cleaned up
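
To confirm a node is hitting this, that message can also be searched for in the kubelet journal on the worker (a sketch, not from the original report; assumes the kubelet runs as the systemd unit "kubelet"):

# journalctl -u kubelet | grep 'pod cgroup sandbox has not been cleaned up'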

Logging into the worker:
$ sudo su -
Last login: Tue Dec  1 15:44:35 UTC 2020 on pts/0
[systemd]
Failed Units: 1
  crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope

# systemctl status crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope
● crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope - libcontainer container aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c
   Loaded: loaded (/run/systemd/transient/crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope; transient)
Transient: yes
  Drop-In: /run/systemd/transient/crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope.d
           └─50-CPUShares.conf, 50-DeviceAllow.conf, 50-DevicePolicy.conf
   Active: failed since Tue 2020-12-01 05:30:56 UTC; 10h ago
      CPU: 22ms
   CGroup: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod443f27c8_3654_402d_aed2_2383d120238c.slice/crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope
           └─1602268 /usr/bin/runc init

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

# ps -fe | grep 1602268
root     1602268       1  0 05:27 ?        00:00:00 /usr/bin/runc init

# cat /proc/1602268/status | grep State
State:	D (disk sleep)

# kill -9 1602268

# ps -fe | grep 1602268
root     1602268       1  0 05:27 ?        00:00:00 /usr/bin/runc init

The runc init process is in an unkillable sleep.
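
To list any other runc init processes stuck in uninterruptible sleep on the node (a sketch, not from the original report):

# ps -eo pid,stat,args | grep '[r]unc init' | awk '$2 ~ /^D/'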

# cat /proc/1602268/stack 
[<0>] __refrigerator+0x3f/0x160
[<0>] unix_stream_read_generic+0x7aa/0x8a0
[<0>] unix_stream_recvmsg+0x53/0x70
[<0>] sock_read_iter+0x94/0xf0
[<0>] new_sync_read+0x121/0x170
[<0>] vfs_read+0x91/0x140
[<0>] ksys_read+0x4f/0xb0
[<0>] do_syscall_64+0x5b/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca

The process is stuck in the kernel, in the cgroup freezer (note __refrigerator in the stack above).
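
As a cross-check, the freezer cgroup a PID belongs to can be read from /proc (a sketch; same PID as above). It points at the crio-...scope directory used below:

# grep freezer /proc/1602268/cgroup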

# pwd
/sys/fs/cgroup/freezer/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod443f27c8_3654_402d_aed2_2383d120238c.slice/crio-aaa1f33cc29448b6593b6859efc163865ea8b64e4de12f83446f9c3a4463be7c.scope

# cat freezer.state 
FROZEN

# cat freezer.self_freezing   
1

Confirmed the runc init process is frozen (by its own cgroup, not by a parent).

If I `echo THAWED > freezer.state`, the cgroups clean up and the pod is removed.
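
For reference, a workaround sketch that thaws every frozen container scope under kubepods (assumes cgroup v1 with the freezer controller mounted at /sys/fs/cgroup/freezer; thawing a container that was intentionally paused is not desirable, so use with care):

# grep -rl FROZEN /sys/fs/cgroup/freezer/kubepods.slice --include=freezer.state | while read -r f; do echo THAWED > "$f"; done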

Version-Release number of selected component (if applicable):
4.6.6

How reproducible:
Sometimes

Steps to Reproduce:
This happened to me while running this Kubernetes conformance test:
https://github.com/openshift/origin/blob/62c5b679aa8618a4769365085a293480469b3d75/vendor/k8s.io/kubernetes/test/e2e/node/pods.go#L219
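
A rough manual approximation of what that test does (a sketch, not the exact e2e test; pod name and image are examples):

$ oc run sleeper --image=registry.access.redhat.com/ubi8/ubi --restart=Never -- sleep 300
$ oc delete pod sleeper --grace-period=1      # delete before the pod becomes Ready
$ oc get pod sleeper -w                       # occasionally the pod stays in Terminating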

Actual results:
Pod is stuck terminating

Expected results:
Pod is removed cleanly

Additional info:
https://github.com/opencontainers/runc/search?q=frozen

Comment 4 Kir Kolyshkin 2021-01-28 01:57:07 UTC
Reproduced locally; working on a fix.

Comment 5 Kir Kolyshkin 2021-01-30 03:39:57 UTC
Proposed minimal fix: https://github.com/opencontainers/runc/pull/2774

The plan is to include it into rc93.

Comment 6 Kir Kolyshkin 2021-02-04 17:24:01 UTC
rc93 is out with the above minimal fix. Working on an additional fix to further improve the success rate (unfortunately it's still a game of chances).

Comment 8 Kir Kolyshkin 2021-03-01 03:02:07 UTC
The additional fix (https://github.com/opencontainers/runc/pull/2791) greatly improves the chances of success. It will be released as part of rc94; if needed I can also backport it.

Comment 9 Kir Kolyshkin 2021-03-01 23:01:50 UTC
Backport to 4.6: https://github.com/projectatomic/runc/pull/40

Backport to 4.8 (which already has the first PR): https://github.com/projectatomic/runc/pull/41

Comment 10 Kir Kolyshkin 2021-03-02 19:59:49 UTC
runc build for 4.6 with the fixes is available from https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1521730 (el7) and https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1521735 (el8).

@sjenning can you give it a try?

Comment 11 Seth Jennings 2021-03-08 15:42:07 UTC
Hey Kir,

I'm not going to have time to confirm this fix. However, if you found a reproducer and that reproducer can't cause the issue after the fix, that's good enough for me!

Comment 18 Kir Kolyshkin 2021-04-14 19:31:24 UTC
To reiterate, the fixes are https://github.com/opencontainers/runc/pull/2774 and https://github.com/opencontainers/runc/pull/2791 -- both are part of https://github.com/projectatomic/runc/pull/40, which was included in runc-1.0.0-83.rhaos4.6.git8c2e7c8.el8.x86_64.rpm, available in v4.6.21+.

rc94 is not yet released, but it does not contain any additional fixes on top of what is already included in the runc/OCP versions above.

@nkashyap As per ^^^, you don't need to wait for rc94 -- I suggest your customer upgrade to the latest version (v4.6.21 or later).
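
To check which runc build a node is actually running (a sketch; the node name is a placeholder):

$ oc debug node/<node-name> -- chroot /host rpm -q runc

Per the above, runc-1.0.0-83.rhaos4.6.git8c2e7c8 or later carries both fixes.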

@rkshirsa v4.6.21 already includes the fixes, but apparently they don't eliminate the issue entirely. I am also seeing this in runc tests; I opened https://github.com/opencontainers/runc/issues/2907 yesterday to track it.

Overall I think this is most probably specific to the EL7 kernel, and userspace (runc) can only try to work around it, so it is still a game of chances. Alas, at the moment I cannot think of any additional fixes; maybe this needs to be fixed in the kernel.

Comment 23 Kir Kolyshkin 2021-04-20 17:53:30 UTC
At this point, unfortunately, I don't have anything to add. I am working on reproducing it locally, with no luck so far. I don't think a kernel bug has been filed -- I may file one later.

> Recently, they have shared an observation where pods were stuck and too many defunct processes were found on worker "srlp-on44".

This is a separate issue; I recommend filing a bug against CRI-O, which is the parent of the conmon processes and should wait(2) for them.

Comment 24 Kir Kolyshkin 2021-04-20 17:56:35 UTC
A conmon issue was filed as https://bugzilla.redhat.com/show_bug.cgi?id=1848524 and apparently fixed, but since we're seeing it again in a later version (4.6.21), I guess we need to revisit it.

Comment 25 Kir Kolyshkin 2021-04-21 00:10:31 UTC
Got two theories.

1. EINTR

We have not ported https://github.com/opencontainers/runc/pull/2258, so fscommon.WriteFile() can return EINTR, and thus the whole (*Freezer).Set fails unexpectedly, leaving the cgroup in the FREEZING state.

This is very promising, but it does not hold water: apparently Go 1.15 is used, which already wraps reads and writes in a retry-on-EINTR loop.

Can anyone please confirm this by running "runc version" on 4.6.21?

2. (*Freezer).Set fails somewhere and returns an error without thawing the cgroup, so it's stuck in the FREEZING or FROZEN state.

This is purely theoretical, not confirmed by anything, but I have a proposed fix (https://github.com/opencontainers/runc/pull/2918) and its backport to the 4.6 branch (https://github.com/projectatomic/runc/pull/47).
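
A quick way to check for this on an affected node is to look for pod cgroups left in a non-thawed state (a sketch, cgroup v1 freezer):

# grep -rE 'FROZEN|FREEZING' /sys/fs/cgroup/freezer/kubepods.slice --include=freezer.state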

Comment 27 Nirupma Kashyap 2021-04-22 10:55:25 UTC
Hello Kir/Team,

I got an update from the customer on case #02896371: even after upgrading the cluster to 4.6.21, they are still able to reproduce the issue. I also checked quickly in the latest sosreport and the must-gather shared by the customer; it is still the same pattern.

I am attaching/sharing the latest must-gather and sos report for your further reference. Kindly check and update us.

Thanks & Regards,
Nirupma

Comment 30 Olimp Bockowski 2021-04-22 11:35:17 UTC
@Kir: as you asked for:

runc -version
runc version spec: 1.0.2-dev
go: go1.15.5
libseccomp: 2.4.1

it's from 4.6.21 as you asked for

Comment 35 Kir Kolyshkin 2021-04-27 22:45:49 UTC
It would make sense to test if my fix (upstream: https://github.com/opencontainers/runc/pull/2918, 4.6 backport: https://github.com/projectatomic/runc/pull/47) helps or not. I see that both PRs were merged, but I'm not sure if RPMs are available.

Comment 37 Peter Hunt 2021-04-28 15:39:42 UTC
RPMs are available now

Comment 40 Kir Kolyshkin 2021-05-21 19:03:39 UTC
One more fix: https://github.com/projectatomic/runc/pull/52 (backport of https://github.com/opencontainers/runc/pull/2941)

I am not sure which version it went into. @pehunt can you chime in?

Comment 41 Peter Hunt 2021-05-21 20:09:04 UTC
updated with version it's fixed in

Comment 43 MinLi 2021-05-25 10:28:14 UTC
$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-25-041803   True        False         21m     Cluster version is 4.8.0-0.nightly-2021-05-25-041803

sh-4.4# rpm -q runc 
runc-1.0.0-96.rhaos4.8.gitcd80260.el8.x86_64

When deleting a pod before it gets ready, the deletion succeeds every time, and I don't find the related kubelet log message: Pod XXX is terminated, but pod cgroup sandbox has not been cleaned up

Also, no matching results found in the last 2 days:
https://search.ci.openshift.org/?search=pod+cgroup+sandbox+has+not+been+cleaned+up&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 46 Kir Kolyshkin 2021-06-07 18:34:31 UTC
It has already been backported to 4.6 (https://github.com/projectatomic/runc/pull/47, https://github.com/projectatomic/runc/pull/52) and both PRs were merged; I'm not sure whether the fixed version has been released -- @pehunt can you please tell us?

Comment 47 Peter Hunt 2021-06-07 18:48:57 UTC
yup this was fixed in the attached version, which was released with 4.6.30

Comment 48 Lucas López Montero 2021-06-07 18:54:21 UTC
Thank you very much for the information, Kir and Peter.

Comment 55 errata-xmlrpc 2021-07-27 22:34:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 56 Yu Qi Zhang 2021-11-01 16:19:56 UTC
*** Bug 1943564 has been marked as a duplicate of this bug. ***

Comment 57 Jose Gato 2022-07-01 08:37:27 UTC
I have found a very similar problem with one of our partners. They are using OCP 4.7. Was this also backported to 4.7?

Comment 58 Tom Sweeney 2022-07-01 15:26:11 UTC
@peter.hunter do you know if this was backported to OCP 4.7 too? @jgato if Peter can't get back to you by early next week, it might be worthwhile to open a new BZ for 4.7, as this BZ has been closed for quite a while now.

Comment 59 Peter Hunt 2022-07-01 16:50:15 UTC
Oh darn, we never forward-ported this change. Doing so now.

Comment 60 Jose Gato 2022-07-04 07:04:15 UTC
OK, so this will be fixed in 4.7 and 4.8. Does that mean 4.9 and above are already fixed or not affected?

Comment 62 Peter Hunt 2022-07-21 20:41:09 UTC
Hey Jose, sorry for the late reply. Yeah, the 4.9 build is based on upstream 1.0.1, which has the corresponding upstream fix https://github.com/opencontainers/runc/pull/2918/commits/fcd7fe85e1ea706fbdfb383824e9f390a53a34e9

Comment 63 Red Hat Bugzilla 2023-09-15 01:31:28 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

