Description of problem:

We are seeing an issue in which the removal of docker scopes is timing out, potentially due to incomplete cleanup of the pod slice by the kubelet.

Oct 03 07:56:57 server systemd[1]: docker-ae80d3d45a56fb3ac230aa8be8a1b02338fc1717a482645bbea7e693cb235126.scope stopping timed out. Killing.

This results in significant numbers of orphaned scopes and slices, which can accumulate to the point where systemd and docker stop functioning, causing the node to go into a NotReady state. The customer was seeing timeouts for all systemd and docker commands with over 2200 orphaned /run/systemd/system/kubepods-burstable-*/ entries.

Creating artificial scopes shows the impact on systemd performance as the number grows (a sketch of one way to generate such artificial entries is included under Additional info below):

  # of docker-CONTAINERID.scope.d   | total execution time of
  under /run/systemd/system         | systemctl list-unit-files
  ----------------------------------|---------------------------
     0                              | 0.056 sec
   250                              | 0.85 sec
   500                              | 2.56 sec
   750                              | 5.3 sec
  1000                              | 8.9 sec

The thinking is that a premature destroy is sent from the kubelet for the pod slice, resulting in the inability to stop the scope due to its dependency chain. The tryKillingCgroupProcesses()[1] function fetches the list of existing processes one time and iterates through that same list multiple times. Unlike the mechanism in the container runtime, the kubelet does not freeze the existing processes, so in theory the kubelet could miss a newly spawned/forked process when it issues the Destroy for the pod's cgroup[2]. (A simplified illustration of this race is sketched under Additional info below.)

[1] https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/kubelet/cm/pod_container_manager_linux.go#L148
[2] https://github.com/kubernetes/kubernetes/blob/release-1.11/pkg/kubelet/cm/pod_container_manager_linux.go#L191

The latest version of systemd changes how the dependency tree is handled, so we tested how that may impact the parent/child relationship.

Experiment with systemd version 219-62.el7_6.9 versus 219-67.el7_7.1
-------------------------------------------------------------------------------

- With systemd version 219-62.el7_6.9:

In this case a parent cgroup is set. When the parent slice is stopped, it stops immediately, irrespective of the state of the child. As you can see, the docker scope is still intact and is cleaned up later when the container stops.

# rpm -q systemd
systemd-219-62.el7_6.9.x86_64

# docker create --name=sf02456020 --cgroup-parent=sf02456020.slice registry.redhat.io/rhel7 sh -c "trap \"\" TERM ; sleep 365d"
e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68

# docker start sf02456020
sf02456020

# ls -ld /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
drwxr-xr-x. 2 root root 0 Sep 29 16:59 /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope

# systemctl list-units | grep -e sf02456020.slice -e docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope loaded active running libcontainer container e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68
sf02456020.slice loaded active active sf02456020.slice

# time systemctl stop sf02456020.slice
real    0m0.026s
user    0m0.005s
sys     0m0.012s

# systemctl list-units | grep -e sf02456020.slice -e docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope loaded active running libcontainer container e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68

# ls -ld /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
drwxr-xr-x. 2 root root 0 Sep 29 16:59 /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope

# cat /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope/cgroup.procs
2582
2618

# /usr/libexec/docker/docker-runc-current ps e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68
UID        PID   PPID  C STIME TTY          TIME CMD
root      2582   2563  0 16:59 ?        00:00:00 sh -c trap "" TERM ; sleep 365d
root      2618   2582  0 16:59 ?        00:00:00 sleep 365d

# journalctl | grep "scope stopping timed out. Killing." -c
0

- With systemd version 219-67.el7_7.1, there is a tighter dependency hierarchy. You can see that stopping the parent slice takes 90 seconds, which is the default timeout, and from there the scope enters a failed state. The inference from this test is that if the parent pod slice is stopped by the kubelet while the dependent docker scope has not yet been fully cleaned up by the container runtime, you hit a race condition in which the removal times out, leaving the scopes and slices in a failed state.

# rpm -q systemd
systemd-219-67.el7_7.1.x86_64

# docker start sf02456020
sf02456020

# ls -ld /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
drwxr-xr-x. 2 root root 0 Sep 29 17:09 /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope

# systemctl list-units | grep -e sf02456020.slice -e docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope loaded active running libcontainer container e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68
sf02456020.slice loaded active active sf02456020.slice

# time systemctl stop sf02456020.slice
real    1m30.249s
user    0m0.005s
sys     0m0.018s

# systemctl list-units | grep -e sf02456020.slice -e docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
● docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope loaded failed failed libcontainer container e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68

# ls -ld /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope
ls: cannot access /sys/fs/cgroup/systemd/sf02456020.slice/docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope: No such file or directory

# journalctl | grep "scope stopping timed out. Killing." -c
1

# journalctl | grep "scope stopping timed out. Killing."
Sep 29 17:11:33 rhel7vm systemd[1]: docker-e6cbd12e20156c0dc8bd0bc8ebdfc66117ece1d52b3834282aa09950fde0cd68.scope stopping timed out. Killing.

# docker ps -a
CONTAINER ID   IMAGE                      COMMAND                   CREATED          STATUS                       PORTS   NAMES
e6cbd12e2015   registry.redhat.io/rhel7   "sh -c 'trap \"\" TE..."  17 minutes ago   Exited (137) 4 minutes ago           sf02456020

Increasing the log level on the node, we see with 100% consistency pid-killing events 90 seconds before the timeout/killing events. Unfortunately, there is nothing that ties these events to the cgroup (I submitted a PR for this @ https://github.com/kubernetes/kubernetes/pull/83625, which was merged). This helps reinforce the notion that pids are being killed and that it is possible something was forked and missed when the Destroy is sent. There is no mechanism available that would allow us to confirm this with 100% certainty right now.

Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.311748 48943 pod_container_manager_linux.go:164] Attempt to kill process with pid: 51517
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.311893 48943 pod_container_manager_linux.go:171] successfully killed all unwanted processes.
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.312427 48943 pod_container_manager_linux.go:164] Attempt to kill process with pid: 72030
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.312534 48943 pod_container_manager_linux.go:164] Attempt to kill process with pid: 72888
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.312539 48943 pod_container_manager_linux.go:171] successfully killed all unwanted processes.
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.312647 48943 pod_container_manager_linux.go:171] successfully killed all unwanted processes.
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.314103 48943 pod_container_manager_linux.go:164] Attempt to kill process with pid: 72476
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.314198 48943 pod_container_manager_linux.go:171] successfully killed all unwanted processes.
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.314731 48943 pod_container_manager_linux.go:164] Attempt to kill process with pid: 72182
Oct 02 16:17:25 server atomic-openshift-node[48943]: I1002 16:17:25.314817 48943 pod_container_manager_linux.go:171] successfully killed all unwanted processes.

... 90 seconds later ...

Oct 02 16:18:55 server systemd[1]: docker-197c4a8f5f0815186a323d21779cc86e603a8beeec75345cadaceeaccdde6d1b.scope stopping timed out. Killing.
Oct 02 16:18:55 server systemd[1]: docker-1c8aa3b0ce61d307bcd256cd417bb1ea6912e3a130984afffadbf5ce4fd22c21.scope stopping timed out. Killing.

We tried to debug this from the systemd side by collecting systemd-cgls data, but even at 5-second intervals we did not capture any relevant data.

For all of the occurrences we have analyzed, this issue is always due to very short-lived containers that run for only a few seconds. Predominantly they are short cron jobs completing normally, but we also see instances related to containers stopping, failing, or dying quickly that create the same issue.

Version-Release number of selected component (if applicable):
3.11.z

How reproducible:
Very reproducible in the customer's environment.

Steps to Reproduce:
1.
2.
3.

Actual results:
Continual buildup of scopes and slices.

Expected results:
These should be properly cleaned up and removed.

Additional info:
This can be worked around by continually executing 'systemctl reset-failed' as a cron job on the server (an example cron entry is sketched below).
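A minimal sketch of the reset-failed workaround, assuming a cron.d-style entry; the 5-minute interval and the file location are assumptions, not values taken from this report:

# /etc/cron.d/reset-failed-scopes  (interval and path are assumptions; adjust as needed)
# Clears units stuck in the failed state so orphaned scopes/slices do not
# accumulate. This only masks the symptom; it does not address the underlying race.
*/5 * * * * root /usr/bin/systemctl reset-failed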
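For reference, the artificial-scope measurement behind the table in the description can be approximated along the following lines. This is a hedged sketch, not necessarily the exact procedure used to produce those numbers; the synthetic unit names and the daemon-reload step are assumptions. Run it only on a disposable test system.

# Create N synthetic docker-<id>.scope.d drop-in directories under
# /run/systemd/system, then time 'systemctl list-unit-files'.
N=250
for i in $(seq 1 "$N"); do
    id=$(printf '%064d' "$i")    # synthetic 64-character "container ID"
    mkdir -p "/run/systemd/system/docker-${id}.scope.d"
done
systemctl daemon-reload
time systemctl list-unit-files > /dev/null
# Cleanup when done:
#   rm -rf /run/systemd/system/docker-00000000000000000000*.scope.d
#   systemctl daemon-reload

Repeating this for N = 0, 250, 500, 750, and 1000 should reproduce the shape of the table above.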
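Finally, a deliberately simplified illustration of the suspected snapshot-then-kill race described above: the pid list is read once, so a pid that shows up in the cgroup afterwards is missed and the subsequent destroy fails. The cgroup name is made up, and the second manually started process stands in for a fork that happens after the snapshot; this is a sketch of the idea, not the kubelet's actual code path.

# Run as root on a disposable test system using cgroup v1 (as on RHEL 7).
CG=/sys/fs/cgroup/systemd/demo-race
mkdir -p "$CG"

# A long-running process inside the cgroup:
sh -c "echo \$\$ > $CG/cgroup.procs; exec sleep 300" &
sleep 1

# 1. Take the pid snapshot once, as tryKillingCgroupProcesses() does:
SNAPSHOT=$(cat "$CG/cgroup.procs")

# 2. Simulate a process forked into the cgroup after the snapshot was taken:
sh -c "echo \$\$ > $CG/cgroup.procs; exec sleep 300" &
LATE=$!

# 3. Kill only the snapshotted pids, then try to destroy the cgroup:
for p in $SNAPSHOT; do kill -9 "$p"; done
sleep 1
rmdir "$CG" || echo "destroy failed: pid $LATE was missed by the snapshot"

# Cleanup:
kill -9 "$LATE"; sleep 1; rmdir "$CG"

In the real failure, the analogous situation would be the kubelet issuing the Destroy for the pod slice while a missed process keeps the child docker scope busy, which is where the 90-second stop timeout and the failed scope/slice units come from.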
Hi team, Any updates on this? Thanks, Fatima
Looks like the RHEL 7.9 release has been pushed out further. This will not be fixed in the current sprint.
Mrunal / Neelesh / Derrick - Could the root of this actually be solved by this change, where we catch the failures in StopUnit and clean up? -> https://github.com/opencontainers/runc/commit/bfa1b2aab39f7735db9319478b8f63b9adca26e3#diff-252ac01f71681e9ed8e7ce0571cdb6ac ?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 3.11.317 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4430
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days