Description of problem:
The customer is reporting that system response is very slow. Looking into the journal, the kernel has logged many entries like the following:

Oct 28 16:08:33 wb08.ocp01.nfvdev.tlabs.ca kernel: Out of memory and no killable processes...
Oct 28 16:08:33 wb08.ocp01.nfvdev.tlabs.ca kernel: runc invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
Oct 28 16:08:33 wb08.ocp01.nfvdev.tlabs.ca kernel: runc cpuset=/ mems_allowed=0-1
Oct 28 16:08:33 wb08.ocp01.nfvdev.tlabs.ca kernel: CPU: 82 PID: 2043349 Comm: runc Tainted: G W L --------- - - 4.18.0-193.13.2.el8_2.x86_64 #1
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 02/02/2019
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: Call Trace:
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: dump_stack+0x5c/0x80
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: dump_header+0x6e/0x27a
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: out_of_memory.cold.31+0x39/0x87
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: mem_cgroup_out_of_memory+0x49/0x80
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: try_charge+0x58c/0x780
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: ? __alloc_pages_nodemask+0xef/0x280
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: mem_cgroup_try_charge+0x8b/0x1a0
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: mem_cgroup_try_charge_delay+0x1c/0x40
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: do_anonymous_page+0xb5/0x360
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: __handle_mm_fault+0x662/0x6a0
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: handle_mm_fault+0xda/0x200
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: __do_page_fault+0x22d/0x4e0
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: do_page_fault+0x32/0x110
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: ? page_fault+0x8/0x30
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: page_fault+0x1e/0x30
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: RIP: 0033:0x7fae9cdd6ed2
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: Code: ff ff 4c 01 e7 44 89 95 48 ff ff ff 48 89 ce e8 34 d7 ff ff 44 8b 95 48 ff ff ff 85 c0 0f 85 95 03 00 00 48 8b 85 50 ff ff ff <4c> 89 a3 a0 06 00 00 48 89 df c7 43 18 01 00 00 00 48 89 83 90 06
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: RSP: 002b:00007fae88ff8a40 EFLAGS: 00010246
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: RAX: 00007fae477ff000 RBX: 00007fae47fff700 RCX: 00007fae9c8bb99b
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: RDX: 0000000000000003 RSI: 0000000000800000 RDI: 00007fae47800000
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: RBP: 00007fae88ff8b10 R08: 00000000ffffffff R09: 0000000000000000
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000001000
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: R13: 00007fae88ff8b90 R14: 00007fae88ff8b90 R15: 0000000000801000
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: Memory limit reached of cgroup /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcc502bfc_6f99_4dfd_9915_0469c9c2b3f5.slice
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: memory: usage 567056kB, limit 563200kB, failcnt 198557
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: memory+swap: usage 567056kB, limit 9007199254740988kB, failcnt 0
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: kmem: usage 127092kB, limit 9007199254740988kB, failcnt 0

The system has 400GB of memory, of which only 50GB is in use. So this cgroup may simply be poorly configured for the pod; however, the OOM cleanup is stuck in a very tight infinite loop.
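To put the OOM report's numbers in perspective, here is a quick sanity-check sketch (the values are copied verbatim from the kernel log above; nothing here queries a live system): the pod's cgroup limit of 563200kB is exactly 550 MiB, and usage overshoots it by under 4 MiB, so the kills are purely a cgroup-limit event, not node-level memory pressure.

```python
# Sanity-check the figures from the kernel OOM report above.
# All values are taken verbatim from the log; nothing here
# inspects the live system.

KB = 1024  # the kernel reports these counters in KiB

cgroup_usage_kb = 567056   # "memory: usage 567056kB"
cgroup_limit_kb = 563200   # "memory: limit 563200kB"
node_memory_gb = 400       # total RAM on the node
node_in_use_gb = 50        # RAM actually in use node-wide

# The pod's cgroup limit is tiny compared to the node: 550 MiB.
limit_mib = cgroup_limit_kb * KB / (1024 * 1024)
print(f"cgroup limit: {limit_mib:.0f} MiB")

# Usage exceeds the limit by a few MiB, which is why kernel memory
# charges fail (failcnt 198557) and the cgroup OOM killer fires,
# even though the node itself has ~350 GB free.
overshoot_mib = (cgroup_usage_kb - cgroup_limit_kb) * KB / (1024 * 1024)
print(f"over limit by: {overshoot_mib:.2f} MiB")
print(f"node headroom: ~{node_memory_gb - node_in_use_gb} GB free")
```

This is consistent with the pod's memory limit being set far too low for its workload rather than with any shortage of memory on the node.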
Looking into the kubelet, it is logging the following for pod cc502bfc-6f99-4dfd-9915-0469c9c2b3f5:

Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109679 3055 volume_manager.go:372] Waiting for volumes to attach and mount for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)"
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109723 3055 volume_manager.go:403] All volumes are attached and mounted for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)"
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109748 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109762 3055 kuberuntime_manager.go:652] computePodActions got {KillPod:true CreateSandbox:true SandboxID:85c80a1ab6c2a865a6a03355d2c7ad28e464d60be3367399ec38cb723e56ce6e Attempt:1 NextInitContainerToStart:nil ContainersToStart:[0 1] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)"
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109876 3055 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"evnfmtr02", Name:"eric-data-visualizer-kb-7fb99cccbb-nq7tf", UID:"cc502bfc-6f99-4dfd-9915-0469c9c2b3f5", APIVersion:"v1", ResourceVersion:"121787564", FieldPath:""}): type: 'Normal' reason: 'SandboxChanged' Pod sandbox changed, it will be killed and re-created.
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: E1028 16:08:58.110700 3055 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = error reserving pod name k8s_eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02_cc502bfc-6f99-4dfd-9915-0469c9c2b3f5_1 for id b1dd0a01a092a1e23ca096784a41872ec7328a3d8df0353521a6aad9decdc899: name is reserved
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: E1028 16:08:58.110742 3055 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" failed: rpc error: code = Unknown desc = error reserving pod name k8s_eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02_cc502bfc-6f99-4dfd-9915-0469c9c2b3f5_1 for id b1dd0a01a092a1e23ca096784a41872ec7328a3d8df0353521a6aad9decdc899: name is reserved
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: E1028 16:08:58.110756 3055 kuberuntime_manager.go:729] createPodSandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" failed: rpc error: code = Unknown desc = error reserving pod name k8s_eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02_cc502bfc-6f99-4dfd-9915-0469c9c2b3f5_1 for id b1dd0a01a092a1e23ca096784a41872ec7328a3d8df0353521a6aad9decdc899: name is reserved
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: E1028 16:08:58.110792 3055 pod_workers.go:191] Error syncing pod cc502bfc-6f99-4dfd-9915-0469c9c2b3f5 ("eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)"), skipping: failed to "CreatePodSandbox" for "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" with CreatePodSandboxError: "CreatePodSandbox for pod \"eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)\" failed: rpc error: code = Unknown desc = error reserving pod name k8s_eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02_cc502bfc-6f99-4dfd-9915-0469c9c2b3f5_1 for id b1dd0a01a092a1e23ca096784a41872ec7328a3d8df0353521a6aad9decdc899: name is reserved"
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.110824 3055 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"evnfmtr02", Name:"eric-data-visualizer-kb-7fb99cccbb-nq7tf", UID:"cc502bfc-6f99-4dfd-9915-0469c9c2b3f5", APIVersion:"v1", ResourceVersion:"121787564", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02_cc502bfc-6f99-4dfd-9915-0469c9c2b3f5_1 for id b1dd0a01a092a1e23ca096784a41872ec7328a3d8df0353521a6aad9decdc899: name is reserved
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.228413 3055 configmap.go:188] Setting up volume dashboard-cfg for pod cc502bfc-6f99-4dfd-9915-0469c9c2b3f5 at /var/lib/kubelet/pods/cc502bfc-6f99-4dfd-9915-0469c9c2b3f5/volumes/kubernetes.io~configmap/dashboard-cfg
Oct 28 16:09:05 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:09:05.338652 3055 worker.go:215] Non-running container probed: eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5) - simple-ericsson-branding-server
Oct 28 16:09:07 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:09:07.091661 3055 worker.go:215] Non-running container probed: eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5) - kibana
Oct 28 16:09:10 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:09:10.109277 3055 kubelet_pods.go:1358] Generating status for "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)"
Oct 28 16:09:10 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:09:10.109470 3055 status_manager.go:434] Ignoring same status for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-28 04:39:32 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-28 09:31:37 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [simple-ericsson-branding-server kibana]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-28 09:31:37 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [simple-ericsson-branding-server kibana]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-28 04:39:32 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.21.105.36 PodIP:10.129.10.22 PodIPs:[{IP:10.129.10.22}] StartTime:2020-10-28 04:39:32 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:kibana State:{Waiting:&ContainerStateWaiting{Reason:CreateContainerError,Message:context deadline exceeded,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:137,Signal:0,Reason:Error,Message:,StartedAt:2020-10-28 07:32:30 +0000 UTC,FinishedAt:2020-10-28 09:30:53 +0000 UTC,ContainerID:cri-o://6cbc2504c09321791bf8b6dc6bd6e6d586b28761f504d59bcad7ad0b31ac90d4,}} Ready:false RestartCount:1 Image:jfrog.nfvdev.tlabs.ca/proj-eo/kb/eric-data-visualizer-kb:6.4.3-17 ImageID:jfrog.nfvdev.tlabs.ca/proj-eo/kb/eric-data-visualizer-kb@sha256:94c9279a07cfc94a18eb9b0d96b3cf4df70de6f066cc5001570f7c3d4059e04f ContainerID:cri-o://6cbc2504c09321791bf8b6dc6bd6e6d586b28761f504d59bcad7ad0b31ac90d4 Started:0xc0091a633a} {Name:simple-ericsson-branding-server State:{Waiting:&ContainerStateWaiting{Reason:CreateContainerError,Message:context deadline exceeded,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:137,Signal:0,Reason:OOMKilled,Message:,StartedAt:2020-10-28 04:39:38 +0000 UTC,FinishedAt:2020-10-28 10:56:44 +0000 UTC,ContainerID:cri-o://9b011c4b11f6f02cef5d04dd800ecbbb231fe51df83312b31af74dc42b134a4d,}} Ready:false RestartCount:0 Image:jfrog.nfvdev.tlabs.ca/proj-eo/kb/adp-gs-branding:1.4.0-35 ImageID:jfrog.nfvdev.tlabs.ca/proj-eo/kb/adp-gs-branding@sha256:9e07a2854a88c69c36bf3c97ec19dab0149153b65e5790e9e87e92632fe45b81 ContainerID:cri-o://9b011c4b11f6f02cef5d04dd800ecbbb231fe51df83312b31af74dc42b134a4d Started:0xc0091a633b}] QOSClass:Burstable EphemeralContainerStatuses:[]}

This happens over and over. Filtering the log down to just the retry messages:

Oct 28 16:06:45 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:06:45.110241 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:07:00 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:07:00.109911 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:07:12 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:07:12.110156 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:07:27 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:07:27.110035 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:07:39 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:07:39.109926 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:07:53 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:07:53.110087 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:08 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:08.110070 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:20 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:20.109898 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:31 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:31.111604 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:45 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:45.109918 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109748 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:09:10 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:09:10.109770 3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one

Version-Release number of selected component (if applicable):
OCP 4.4.15 Baremetal

How reproducible:
The node never seems to clean up from this condition.

Steps to Reproduce:
Unable to determine reproducing conditions.

Actual results:
After rebooting the node, the system starts most, but not all, pods and system performance is degraded.

Expected results:
The node starts up with all pods functioning and the system is responsive.

Additional info:
Many sosreports and must-gathers are attached to the customer ticket.
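One more observation on the retry cadence: computing the gaps between the "No ready sandbox" timestamps in the kubelet log above (a small sketch using only the timestamps from this ticket) shows the pod worker resyncing every 11-15 seconds with no backoff, which matches how tight the runc/OOM churn is.

```python
# Compute the intervals between the kubelet's "No ready sandbox"
# retries, using the timestamps from the log excerpt above.
from datetime import datetime

timestamps = [
    "16:06:45", "16:07:00", "16:07:12", "16:07:27",
    "16:07:39", "16:07:53", "16:08:08", "16:08:20",
    "16:08:31", "16:08:45", "16:08:58", "16:09:10",
]

times = [datetime.strptime(t, "%H:%M:%S") for t in timestamps]
intervals = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

print(intervals)
# Every retry lands 11-15 seconds after the previous one: a flat
# cadence with no exponential backoff, so each failed sandbox
# creation is followed almost immediately by another attempt.
print(f"min {min(intervals)}s, max {max(intervals)}s")
```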
I believe this ticket is related to https://bugzilla.redhat.com/show_bug.cgi?id=1857446 and have added that bug to my customer's case as well to track its progress.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days