Bug 1892909 - Kubelet / kernel OOM and no killable processes
Summary: Kubelet / kernel OOM and no killable processes
Keywords:
Status: CLOSED DUPLICATE of bug 1857446
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Harshal Patil
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-29 19:29 UTC by Neil Girard
Modified: 2023-09-15 00:50 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-05 03:02:12 UTC
Target Upstream Version:
Embargoed:



Description Neil Girard 2020-10-29 19:29:08 UTC
Description of problem:
The customer is complaining that system response is very slow.  Looking into the journal, the kernel has logged many entries like the following:

Oct 28 16:08:33 wb08.ocp01.nfvdev.tlabs.ca kernel: Out of memory and no killable processes...
Oct 28 16:08:33 wb08.ocp01.nfvdev.tlabs.ca kernel: runc invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=-1000
Oct 28 16:08:33 wb08.ocp01.nfvdev.tlabs.ca kernel: runc cpuset=/ mems_allowed=0-1
Oct 28 16:08:33 wb08.ocp01.nfvdev.tlabs.ca kernel: CPU: 82 PID: 2043349 Comm: runc Tainted: G        W    L   --------- -  - 4.18.0-193.13.2.el8_2.x86_64 #1
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 02/02/2019
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: Call Trace:
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  dump_stack+0x5c/0x80
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  dump_header+0x6e/0x27a
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  out_of_memory.cold.31+0x39/0x87
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  mem_cgroup_out_of_memory+0x49/0x80
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  try_charge+0x58c/0x780
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  ? __alloc_pages_nodemask+0xef/0x280
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  mem_cgroup_try_charge+0x8b/0x1a0
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  mem_cgroup_try_charge_delay+0x1c/0x40
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  do_anonymous_page+0xb5/0x360
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  __handle_mm_fault+0x662/0x6a0
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  handle_mm_fault+0xda/0x200
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  __do_page_fault+0x22d/0x4e0
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  do_page_fault+0x32/0x110
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  ? page_fault+0x8/0x30
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel:  page_fault+0x1e/0x30
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: RIP: 0033:0x7fae9cdd6ed2
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: Code: ff ff 4c 01 e7 44 89 95 48 ff ff ff 48 89 ce e8 34 d7 ff ff 44 8b 95 48 ff ff ff 85 c0 0f 85 95 03 00 00 48 8b 85 50 ff ff ff <4c> 89 a3 a0 06 00 00 48 89 df c7 43 18 01 00 00 00 48 89 83 90 06
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: RSP: 002b:00007fae88ff8a40 EFLAGS: 00010246
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: RAX: 00007fae477ff000 RBX: 00007fae47fff700 RCX: 00007fae9c8bb99b
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: RDX: 0000000000000003 RSI: 0000000000800000 RDI: 00007fae47800000
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: RBP: 00007fae88ff8b10 R08: 00000000ffffffff R09: 0000000000000000
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000001000
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: R13: 00007fae88ff8b90 R14: 00007fae88ff8b90 R15: 0000000000801000
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: Memory limit reached of cgroup /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcc502bfc_6f99_4dfd_9915_0469c9c2b3f5.slice
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: memory: usage 567056kB, limit 563200kB, failcnt 198557
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: memory+swap: usage 567056kB, limit 9007199254740988kB, failcnt 0
Oct 28 16:08:34 wb08.ocp01.nfvdev.tlabs.ca kernel: kmem: usage 127092kB, limit 9007199254740988kB, failcnt 0

The system has 400 GB of memory, of which only about 50 GB is in use, so this is not node-level memory pressure.  The memory limit on this pod's cgroup may simply be configured too low; however, the OOM cleanup appears to be stuck in a very tight infinite loop.
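
For reference, the cgroup numbers in the OOM report work out to a small per-pod limit rather than any node-level shortage.  A minimal sketch of the arithmetic in Go (assuming the kernel's "kB" means KiB and that the limit comes from the pod's resources.limits.memory, which is not shown in this report):

package main

import "fmt"

func main() {
    // Values taken from the kernel OOM report above ("kB" in that report is KiB).
    const limitKiB = 563200
    const usageKiB = 567056

    fmt.Printf("cgroup limit: %d MiB\n", limitKiB/1024) // 550 MiB
    fmt.Printf("cgroup usage: %d MiB\n", usageKiB/1024) // 553 MiB, just over the limit
}

In other words, the pod is capped at roughly 550Mi while the node has hundreds of gigabytes free.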

The kubelet, meanwhile, is logging the following for pod cc502bfc-6f99-4dfd-9915-0469c9c2b3f5:

Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109679    3055 volume_manager.go:372] Waiting for volumes to attach and mount for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)"
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109723    3055 volume_manager.go:403] All volumes are attached and mounted for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)"
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109748    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109762    3055 kuberuntime_manager.go:652] computePodActions got {KillPod:true CreateSandbox:true SandboxID:85c80a1ab6c2a865a6a03355d2c7ad28e464d60be3367399ec38cb723e56ce6e Attempt:1 NextInitContainerToStart:nil ContainersToStart:[0 1] ContainersToKill:map[] EphemeralContainersToStart:[]} for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)"
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109876    3055 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"evnfmtr02", Name:"eric-data-visualizer-kb-7fb99cccbb-nq7tf", UID:"cc502bfc-6f99-4dfd-9915-0469c9c2b3f5", APIVersion:"v1", ResourceVersion:"121787564", FieldPath:""}): type: 'Normal' reason: 'SandboxChanged' Pod sandbox changed, it will be killed and re-created.
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: E1028 16:08:58.110700    3055 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = error reserving pod name k8s_eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02_cc502bfc-6f99-4dfd-9915-0469c9c2b3f5_1 for id b1dd0a01a092a1e23ca096784a41872ec7328a3d8df0353521a6aad9decdc899: name is reserved
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: E1028 16:08:58.110742    3055 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" failed: rpc error: code = Unknown desc = error reserving pod name k8s_eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02_cc502bfc-6f99-4dfd-9915-0469c9c2b3f5_1 for id b1dd0a01a092a1e23ca096784a41872ec7328a3d8df0353521a6aad9decdc899: name is reserved
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: E1028 16:08:58.110756    3055 kuberuntime_manager.go:729] createPodSandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" failed: rpc error: code = Unknown desc = error reserving pod name k8s_eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02_cc502bfc-6f99-4dfd-9915-0469c9c2b3f5_1 for id b1dd0a01a092a1e23ca096784a41872ec7328a3d8df0353521a6aad9decdc899: name is reserved
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: E1028 16:08:58.110792    3055 pod_workers.go:191] Error syncing pod cc502bfc-6f99-4dfd-9915-0469c9c2b3f5 ("eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)"), skipping: failed to "CreatePodSandbox" for "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" with CreatePodSandboxError: "CreatePodSandbox for pod \"eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)\" failed: rpc error: code = Unknown desc = error reserving pod name k8s_eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02_cc502bfc-6f99-4dfd-9915-0469c9c2b3f5_1 for id b1dd0a01a092a1e23ca096784a41872ec7328a3d8df0353521a6aad9decdc899: name is reserved"
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.110824    3055 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"evnfmtr02", Name:"eric-data-visualizer-kb-7fb99cccbb-nq7tf", UID:"cc502bfc-6f99-4dfd-9915-0469c9c2b3f5", APIVersion:"v1", ResourceVersion:"121787564", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed to create pod sandbox: rpc error: code = Unknown desc = error reserving pod name k8s_eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02_cc502bfc-6f99-4dfd-9915-0469c9c2b3f5_1 for id b1dd0a01a092a1e23ca096784a41872ec7328a3d8df0353521a6aad9decdc899: name is reserved
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.228413    3055 configmap.go:188] Setting up volume dashboard-cfg for pod cc502bfc-6f99-4dfd-9915-0469c9c2b3f5 at /var/lib/kubelet/pods/cc502bfc-6f99-4dfd-9915-0469c9c2b3f5/volumes/kubernetes.io~configmap/dashboard-cfg
Oct 28 16:09:05 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:09:05.338652    3055 worker.go:215] Non-running container probed: eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5) - simple-ericsson-branding-server
Oct 28 16:09:07 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:09:07.091661    3055 worker.go:215] Non-running container probed: eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5) - kibana
Oct 28 16:09:10 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:09:10.109277    3055 kubelet_pods.go:1358] Generating status for "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)"
Oct 28 16:09:10 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:09:10.109470    3055 status_manager.go:434] Ignoring same status for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-28 04:39:32 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-28 09:31:37 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [simple-ericsson-branding-server kibana]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-28 09:31:37 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [simple-ericsson-branding-server kibana]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-10-28 04:39:32 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:172.21.105.36 PodIP:10.129.10.22 PodIPs:[{IP:10.129.10.22}] StartTime:2020-10-28 04:39:32 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:kibana State:{Waiting:&ContainerStateWaiting{Reason:CreateContainerError,Message:context deadline exceeded,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:137,Signal:0,Reason:Error,Message:,StartedAt:2020-10-28 07:32:30 +0000 UTC,FinishedAt:2020-10-28 09:30:53 +0000 UTC,ContainerID:cri-o://6cbc2504c09321791bf8b6dc6bd6e6d586b28761f504d59bcad7ad0b31ac90d4,}} Ready:false RestartCount:1 Image:jfrog.nfvdev.tlabs.ca/proj-eo/kb/eric-data-visualizer-kb:6.4.3-17 ImageID:jfrog.nfvdev.tlabs.ca/proj-eo/kb/eric-data-visualizer-kb@sha256:94c9279a07cfc94a18eb9b0d96b3cf4df70de6f066cc5001570f7c3d4059e04f ContainerID:cri-o://6cbc2504c09321791bf8b6dc6bd6e6d586b28761f504d59bcad7ad0b31ac90d4 Started:0xc0091a633a} {Name:simple-ericsson-branding-server State:{Waiting:&ContainerStateWaiting{Reason:CreateContainerError,Message:context deadline exceeded,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:137,Signal:0,Reason:OOMKilled,Message:,StartedAt:2020-10-28 04:39:38 +0000 UTC,FinishedAt:2020-10-28 10:56:44 +0000 UTC,ContainerID:cri-o://9b011c4b11f6f02cef5d04dd800ecbbb231fe51df83312b31af74dc42b134a4d,}} Ready:false RestartCount:0 Image:jfrog.nfvdev.tlabs.ca/proj-eo/kb/adp-gs-branding:1.4.0-35 ImageID:jfrog.nfvdev.tlabs.ca/proj-eo/kb/adp-gs-branding@sha256:9e07a2854a88c69c36bf3c97ec19dab0149153b65e5790e9e87e92632fe45b81 ContainerID:cri-o://9b011c4b11f6f02cef5d04dd800ecbbb231fe51df83312b31af74dc42b134a4d Started:0xc0091a633b}] QOSClass:Burstable EphemeralContainerStatuses:[]}
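
Each retry dies at the sandbox-creation step with the "name is reserved" error shown above.  The following is not CRI-O's actual code, just a minimal Go sketch of the name-reservation pattern (with a hypothetical registrar type) to illustrate why a create attempt that never completed, and therefore never released its reservation, blocks every later attempt that asks for the same k8s_<pod>_<namespace>_<uid>_<attempt> name:

package main

import (
    "errors"
    "fmt"
    "sync"
)

// registrar is a hypothetical stand-in for the runtime's sandbox/container
// name registry: a name can be held by only one ID at a time.
type registrar struct {
    mu    sync.Mutex
    names map[string]string // name -> ID currently holding it
}

var errNameReserved = errors.New("name is reserved")

func (r *registrar) Reserve(name, id string) error {
    r.mu.Lock()
    defer r.mu.Unlock()
    if holder, ok := r.names[name]; ok && holder != id {
        return fmt.Errorf("error reserving pod name %s for id %s: %w", name, id, errNameReserved)
    }
    r.names[name] = id
    return nil
}

func (r *registrar) Release(name string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    delete(r.names, name)
}

func main() {
    r := &registrar{names: map[string]string{}}
    name := "k8s_eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02_cc502bfc-6f99-4dfd-9915-0469c9c2b3f5_1"

    // The first create attempt reserves the name but never completes, so the
    // reservation is never released.
    _ = r.Reserve(name, "id-from-first-attempt")

    // Every subsequent kubelet sync asks for the same name (same attempt
    // number) with a new ID and fails.
    if err := r.Reserve(name, "id-from-retry"); err != nil {
        fmt.Println(err) // ...: name is reserved
    }
}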

This repeats continuously.  To show just the retry cadence, here are only the sandbox-recreation messages:

Oct 28 16:06:45 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:06:45.110241    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:07:00 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:07:00.109911    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:07:12 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:07:12.110156    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:07:27 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:07:27.110035    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:07:39 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:07:39.109926    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:07:53 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:07:53.110087    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:08 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:08.110070    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:20 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:20.109898    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:31 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:31.111604    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:45 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:45.109918    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:08:58 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:08:58.109748    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
Oct 28 16:09:10 wb08.ocp01.nfvdev.tlabs.ca hyperkube[3055]: I1028 16:09:10.109770    3055 kuberuntime_manager.go:442] No ready sandbox for pod "eric-data-visualizer-kb-7fb99cccbb-nq7tf_evnfmtr02(cc502bfc-6f99-4dfd-9915-0469c9c2b3f5)" can be found. Need to start a new one
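
The timestamps above show the failure recurring on essentially every pod sync.  A small sketch (using the journal times from the excerpt above) that prints the spacing between attempts, which comes out to roughly 11-15 seconds with no sign of backing off:

package main

import (
    "fmt"
    "time"
)

func main() {
    // Timestamps of the "Need to start a new one" messages above.
    stamps := []string{
        "16:06:45", "16:07:00", "16:07:12", "16:07:27", "16:07:39", "16:07:53",
        "16:08:08", "16:08:20", "16:08:31", "16:08:45", "16:08:58", "16:09:10",
    }
    for i := 1; i < len(stamps); i++ {
        prev, _ := time.Parse("15:04:05", stamps[i-1])
        cur, _ := time.Parse("15:04:05", stamps[i])
        fmt.Printf("%s -> %s: %v\n", stamps[i-1], stamps[i], cur.Sub(prev))
    }
}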

Version-Release number of selected component (if applicable):
OCP 4.4.15 Baremetal


How reproducible:
Consistently once it occurs; the node never seems to recover from this condition on its own.


Steps to Reproduce:
Unable to determine reproducing conditions

Actual results:
After rebooting the node, the system starts most, but not all, of the pods, and system performance remains degraded.

Expected results:
The node starts up with all pods functioning and the system is responsive.

Additional info:
Many sosreports and must-gathers are attached to the customer ticket.

Comment 7 Neil Girard 2020-12-03 18:50:10 UTC
I believe this ticket is related to: https://bugzilla.redhat.com/show_bug.cgi?id=1857446

I've added the other bug to my customer's case as well to track its progress.

Comment 10 Red Hat Bugzilla 2023-09-15 00:50:32 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

