Bug 1761936 - Race condition in kubelet cgroup destroy process
Summary: Race condition in kubelet cgroup destroy process
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1817576
Blocks: 1186913
 
Reported: 2019-10-15 16:06 UTC by Matthew Robson
Modified: 2023-10-06 18:40 UTC
CC List: 12 users

Fixed In Version: docker-1.13.1-202.git0be3e21.el7_9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1766665
Environment:
Last Closed: 2020-11-12 10:08:21 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4430 0 None None None 2020-11-12 10:08:55 UTC

Description Matthew Robson 2019-10-15 16:06:22 UTC
Description of problem:

We are seeing an issue in which the removal of docker scopes times out, potentially due to incomplete cleanup of the pod slice by the kubelet.

Oct 03 07:56:57 server systemd[1]: docker-ae80d3d45a56fb3ac230aa8be8a1b02338fc1717a482645bbea7e693cb235126.scope stopping timed out. Killing.

This results in significant numbers of orphaned scopes and slices, which can accumulate to the point where systemd and docker stop functioning, causing the node to go into a NotReady state. The customer was seeing timeouts for all systemd and docker commands with over 2200 orphaned /run/systemd/system/kubepods-burstable-*/ entries.
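
For reference, a quick way to gauge the buildup on an affected node could look like the following shell sketch (the path pattern is taken from above; the failed-unit check is an assumption about how the leftovers surface in systemctl):

# Count leftover pod-slice entries under /run/systemd/system.
ls -d /run/systemd/system/kubepods-burstable-*/ 2>/dev/null | wc -l

# Count docker scopes that systemd currently considers failed.
systemctl list-units --state=failed 'docker-*.scope' --no-legend | wc -l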

Creating artificial scopes shows the impact on systemd performance as the number grows (a reproduction sketch follows the table):

 # docker-CONTAINERID.scope.d   |    total execution time of
    under /run/systemd/system    |   systemctl list-unit-files
  -------------------------------|----------------------------
                  0              |          0.056 sec
                250              |          0.85  sec
                500              |          2.56  sec
                750              |          5.3   sec
               1000              |          8.9   sec
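
A minimal sketch of how such artificial scope drop-ins could be generated for this measurement (the drop-in contents and daemon-reload are placeholders; the original experiment's exact files are not recorded here):

# Create 250 dummy docker-<id>.scope.d drop-in directories under
# /run/systemd/system, then time unit-file enumeration.
for i in $(seq 1 250); do
    d="/run/systemd/system/docker-$(uuidgen | tr -d '-').scope.d"
    mkdir -p "$d"
    printf '[Scope]\n' > "$d/50-dummy.conf"
done
systemctl daemon-reload
time systemctl list-unit-files > /dev/null
# Repeat with larger counts (500, 750, 1000) to fill in the rest of the table.

# Clean up afterwards.
rm -rf /run/systemd/system/docker-*.scope.d
systemctl daemon-reload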


The thinking is that the kubelet sends a premature destroy for the pod slice, making it impossible to stop the scope because of its dependency chain.

The tryKillingCgroupProcesses()[1] function fetches the list of existing processes once and iterates through that same list multiple times. Unlike the container runtime, the kubelet does not freeze the existing processes, so in theory the kubelet could miss a newly spawned or forked process when it issues the Destroy for the pod's cgroup[2] (a shell illustration of this window follows the reference links below).

[1] https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/kubelet/cm/pod_container_manager_linux.go#L148

[2] https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/kubelet/cm/pod_container_manager_linux.go#L191
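
Roughly, the pattern looks like this shell sketch against an arbitrary cgroup. This is an illustration of the window only, not the kubelet's Go code, and the cgroup path is hypothetical:

CG=/sys/fs/cgroup/systemd/kubepods.slice/example-pod.slice   # hypothetical path

# 1. Snapshot the pid list once, as tryKillingCgroupProcesses() does.
PIDS=$(cat "$CG/cgroup.procs")

# 2. Kill everything in the snapshot. Nothing freezes the cgroup, so any
#    process forked between steps 1 and 3 is not in $PIDS.
for p in $PIDS; do kill -9 "$p" 2>/dev/null; done

# 3. Destroy the cgroup. If a missed child is still attached, the removal
#    fails with EBUSY, which is the window in which the orphaned
#    scope/slice state described above can arise.
rmdir "$CG" || echo "cgroup not empty: a late fork was missed"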


The latest version of systemd changes how the dependency tree is handled, so we tested how that may impact the parent / child relationship:

Experiment with systemd version 219-62.el7_6.9 versus 219-67.el7_7.1
-------------------------------------------------------------------------------
- with systemd version 219-62.el7_6.9

In this case, a parent cgroup is set. When the parent slice is stopped, it stops immediately, irrespective of the state of its child. As shown below, the docker scope remains intact and is cleaned up when the container stops.

# rpm -q systemd
systemd-219-62.el7_6.9.x86_64

# docker create --name=sf02456020 --cgroup-parent=sf02456020.slice registry.redhat.io/rhel7 sh -c "trap \"\" TERM ; sleep 365d"
e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68

# docker start sf02456020
sf02456020

# ls -ld /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
drwxr-xr-x. 2 root root 0 Sep 29 16:59 /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope

# systemctl list-units | grep -e sf02456020.slice -e docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope                         loaded active running   libcontainer container e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68
sf02456020.slice                                                                                      loaded active active    sf02456020.slice

# time systemctl stop sf02456020.slice

real    0m0.026s
user    0m0.005s
sys    0m0.012s

# systemctl list-units | grep -e sf02456020.slice -e docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope                         loaded active running   libcontainer container e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68

# ls -ld /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
drwxr-xr-x. 2 root root 0 Sep 29 16:59 /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope

# cat /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope/cgroup.procs
2582
2618

# /usr/libexec/docker/docker-runc-current ps e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68
UID        PID  PPID  C STIME TTY          TIME CMD
root      2582  2563  0 16:59 ?        00:00:00 sh -c trap "" TERM ; sleep 365d
root      2618  2582  0 16:59 ?        00:00:00 sleep 365d

# journalctl | grep "scope stopping timed out. Killing." -c
0


- with systemd version 219-67.el7_7.1, there is a tighter dependency hierarchy. Stopping the parent cgroup takes 90 seconds, which is the default timeout, and the scope then enters a failed state.

The inference from this test is that if the parent pod slice is destroyed by the kubelet while the dependent docker scope has not yet been fully stopped by the container runtime, you hit a race condition in which the removal times out, leaving the scopes and slices in a failed state.

# rpm -q systemd
systemd-219-67.el7_7.1.x86_64

# docker start sf02456020
sf02456020

# ls -ld /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
drwxr-xr-x. 2 root root 0 Sep 29 17:09 /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope

# systemctl list-units | grep -e sf02456020.slice -e docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope                         loaded active running   libcontainer container e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68
sf02456020.slice                                                                                      loaded active active    sf02456020.slice

# time systemctl stop sf02456020.slice

real    1m30.249s
user    0m0.005s
sys    0m0.018s

# systemctl list-units | grep -e sf02456020.slice -e docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
● docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope             loaded failed failed    libcontainer container e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68

# ls -ld /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
ls: cannot access /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope: No such file or directory

# journalctl | grep "scope stopping timed out. Killing." -c
1

# journalctl | grep "scope stopping timed out. Killing."
Sep 29 17:11:33 rhel7vm systemd[1]: docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope stopping timed out. Killing.

# docker ps -a
CONTAINER ID        IMAGE                      COMMAND                    CREATED             STATUS                       PORTS               NAMES
e6cbd12e2015        registry.redhat.io/rhel7   "sh -c 'trap \"\" TE..."   17 minutes ago      Exited (137) 4 minutes ago                       sf02456020

With an increased log level on the node, we see, with 100% consistency, pid-killing events 90 seconds before the timeout killing events. Unfortunately, nothing ties these events to the cgroup (I submitted a PR for this at https://github.com/kubernetes/kubernetes/pull/83625, which was merged). This reinforces the notion that pids are being killed and that a forked process may have been missed when the Destroy was sent. There is currently no mechanism available that would allow us to confirm this with 100% certainty.

Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.311748   48943 pod_container_manager_linux.go:164] Attempt to kill process with pid: 51517
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.311893   48943 pod_container_manager_linux.go:171] successfully killed all unwanted processes.
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.312427   48943 pod_container_manager_linux.go:164] Attempt to kill process with pid: 72030
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.312534   48943 pod_container_manager_linux.go:164] Attempt to kill process with pid: 72888
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.312539   48943 pod_container_manager_linux.go:171] successfully killed all unwanted processes.
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.312647   48943 pod_container_manager_linux.go:171] successfully killed all unwanted processes.
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.314103   48943 pod_container_manager_linux.go:164] Attempt to kill process with pid: 72476
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.314198   48943 pod_container_manager_linux.go:171] successfully killed all unwanted processes.
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.314731   48943 pod_container_manager_linux.go:164] Attempt to kill process with pid: 72182
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.314817   48943 pod_container_manager_linux.go:171] successfully killed all unwanted processes.

... 90 seconds later ...

Oct 02 16:18:55 server systemd[1]: docker-197c4a8f5f0815186a323d21779cc86e603a8beeec75345cadaceeaccdde6d1b.scope stopping timed out. Killing.
Oct 02 16:18:55 server systemd[1]: docker-1c8aa3b0ce61d307bcd256cd417bb1ea6912e3a130984afffadbf5ce4fd22c21.scope stopping timed out. Killing.

We tried to debug this from the systemd side by collecting systemd-cgls data, but even at 5-second intervals, we did not capture any relevant data.

For all of the occurrences we have analyzed, the issue always involves very short-lived containers lasting only a few seconds. Predominantly these are short cron jobs running normally, but we also see other instances related to containers stopping, failing, or dying quickly that create the same issue.

Version-Release number of selected component (if applicable):
3.11.z


How reproducible:
Very reproducible in the customer's environment.


Steps to Reproduce:
1.
2.
3.

Actual results:
Continual buildup of scopes and slices

Expected results:
These should be properly cleaned up and removed.

Additional info:
This can be worked around by periodically executing 'systemctl reset-failed' as a cron job on the server; a minimal sketch follows.
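
Example cron entry for that workaround (the file name and 5-minute interval are arbitrary choices):

# /etc/cron.d/reset-failed-workaround (hypothetical file)
*/5 * * * * root /usr/bin/systemctl reset-failed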

Comment 20 Fatima 2019-12-31 14:10:22 UTC
Hi team,

Any updates on this?

Thanks,
Fatima

Comment 27 Neelesh Agrawal 2020-09-03 02:29:55 UTC
Looks like the RHEL 7.9 release has been pushed out further. This will not be fixed in the current sprint.

Comment 28 Matthew Robson 2020-09-29 12:47:05 UTC
Mrunal / Neelesh / Derrick - Could the root of this actually be solved by where we catch the failures in StopUnit and clean up? See https://github.com/opencontainers/runc/commit/bfa1b2aab39f7735db9319478b8f63b9adca26e3#diff-252ac01f71681e9ed8e7ce0571cdb6ac

Comment 37 errata-xmlrpc 2020-11-12 10:08:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 3.11.317 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4430

Comment 38 Red Hat Bugzilla 2023-09-14 05:44:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

