Created attachment 1777815 [details]
crio log tgz'ed

Description of problem:
On a baremetal SNO with the realtime kernel installed, pods get stuck in the ContainerCreating state forever.

[root@e22-h20-b01-fc640 workloads]# oc get pods
NAME             READY   STATUS              RESTARTS   AGE
sno-churning-1   0/1     ContainerCreating   0          82m

The reason is this:

Normal   Scheduled               88m                 default-scheduler  Successfully assigned sno-churning-churn-1/sno-churning-1 to e22-h20-b02-fc640.rdu2.scalelab.redhat.com
Warning  FailedMount             73m (x2 over 73m)   kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
Normal   AddedInterface          72m                 multus             Add eth0 [10.128.0.234/23]
Warning  FailedCreatePodSandBox  69m                 kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Normal   AddedInterface          64m                 multus             Add eth0 [10.128.0.234/23]
Warning  FailedCreatePodSandBox  61m                 kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning  FailedMount             57m (x3 over 57m)   kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
Normal   AddedInterface          56m                 multus             Add eth0 [10.128.0.234/23]
Warning  FailedCreatePodSandBox  53m                 kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning  FailedMount             48m (x3 over 48m)   kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
Normal   AddedInterface          48m                 multus             Add eth0 [10.128.0.234/23]
Warning  FailedCreatePodSandBox  44m                 kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning  FailedMount             37m (x3 over 37m)   kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
Normal   AddedInterface          36m                 multus             Add eth0 [10.128.0.234/23]
Warning  FailedCreatePodSandBox  33m                 kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning  FailedMount             25m (x2 over 25m)   kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
Normal   AddedInterface          25m                 multus             Add eth0 [10.128.0.234/23]
Warning  FailedCreatePodSandBox  21m                 kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning  FailedMount             12m (x3 over 12m)   kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
Normal   AddedInterface          11m                 multus             Add eth0 [10.128.0.234/23]
Warning  FailedCreatePodSandBox  8m9s                kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Version-Release number of selected component (if applicable):
4.8.0-fc.1

How reproducible:
always

Steps to Reproduce:
1. Install SNO on bare metal, deploy the PAO (Performance Addon Operator) and the attached performance profile
2. Run a churn test that keeps creating and deleting projects (50 projects, 2 pods and 1 secret per pod); a rough sketch follows below
3. Once all of the initial projects are up, the first one gets deleted and recreated. As soon as that happens, the newly created pods get stuck

Actual results:
Pods stuck in the ContainerCreating state

Expected results:
Pods should reach the Running state eventually.
Additional info: I've attached the kubelet log, the crio log and the performance profile I've used. Let me know if you need anything else.
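In case it helps, step 2 of the reproducer boils down to roughly the following (a rough sketch only; the real workload is the attached kube-burner payload, and pod.yaml / secret.yaml here are just placeholders for the attached specs):

```
#!/bin/bash
# Approximate shape of the churn test: 50 projects, each with a secret and pods.
for i in $(seq 1 50); do
  oc new-project "sno-churning-${i}"
  oc -n "sno-churning-${i}" create -f secret.yaml
  oc -n "sno-churning-${i}" create -f pod.yaml   # the real test runs 2 pods per project
done

# Churn phase: once everything is up, the first project is deleted and recreated.
oc delete project sno-churning-1
oc new-project sno-churning-1
oc -n sno-churning-1 create -f secret.yaml
oc -n sno-churning-1 create -f pod.yaml
```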
Created attachment 1777817 [details] kubelet log tgz'ed
Created attachment 1777818 [details] performance profile
Oops, that's because cri-o segfaulted. A 1.21 version of the attached PR should fix it.
Hm, I don't see a PR - could you please point me to it? Also, how do I get it on the deployed machine? Thanks!
@pehunt Can you please help me?
Ok, so I can easily reproduce this one:

May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]: panic: runtime error: invalid memory address or nil pointer dereference
May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5562b14fb5df]
May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]: goroutine 25 [running]:
May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]: panic(0x5562b1d14540, 0x5562b2e8c4c0)
May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]:         /usr/lib/golang/src/runtime/panic.go:1065 +0x565 fp=0xc002351e30 sp=0xc002351d68 pc=0x5562afe61985
May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]: runtime.panicmem()

For me this is turning into a TestBlocker, so I'd appreciate any pointers towards the aforementioned PR and how to get an updated CRI-O image onto my SNO cluster, please.

Thanks
Marko
Hm, bz-- it was hiding the PR from me for some reason.
We spoke on Slack; I have set Marko up with an RPM to test (and maybe find more holes in this :)). The PR has merged, btw.
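For anyone else hitting this: getting a test cri-o RPM onto an RHCOS/SNO node generally looks something like the following (a sketch; the RPM file name is a placeholder, and the machine config pool may need to be paused so the MCO doesn't roll the override back):

```
# copy the test RPM to the node (e.g. with scp), then on the node:
rpm-ostree override replace ./cri-o-1.21.0-test.el8.x86_64.rpm   # placeholder file name
systemctl reboot
```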
@pehunt Ok, with the fixed crio, I got another crio segmentation fault. I'm attaching the crio logs to this bz. The timestamp for the latest segfault is May 06 13:19:46.

In addition to that, I got pods stuck with:

[root@e22-h20-b01-fc640 ~]# oc get pods
NAME             READY   STATUS              RESTARTS   AGE
sno-churning-1   0/1     ContainerCreating   0          15h
sno-churning-2   0/1     ContainerCreating   0          15h

Events:
Type     Reason                    Age                   From     Message
----     ------                    ----                  ----     -------
Warning  FailedCreatePodContainer  9m17s (x64 over 11h)  kubelet  unable to ensure pod container exists: failed to create container for [kubepods besteffort pod01954da8-f380-4e25-bfff-b6b5efc061b0] : Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Created attachment 1780609 [details] latest segfault in crio
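As a side note on the "Failed to activate service 'org.freedesktop.systemd1': timed out" error above: that message generally points at dbus/systemd on the node being too busy to answer, and something like the following can help confirm it (a sketch, not from the original report):

```
# check whether dbus itself is healthy and how loaded the node is
systemctl status dbus
journalctl -u dbus --since "1 hour ago" | tail -n 50
uptime
```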
Team, could this be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1939133? If the system is unresponsive, pod creation would be a problem. /KenY
Okay, good news:
```
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: Starting CRI-O Auto Update Script...
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[9207]: time="2021-05-06 13:38:33.938380361Z" level=info msg="Starting CRI-O, version: 1.21.0-88.rhaos4.8.gitfd485de.el8, git: ()"
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[9207]: time="2021-05-06 13:38:33.943053793Z" level=info msg="Checking whether cri-o should wipe containers: version file /var/run/crio/version not found: open /var/run/crio/>
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: crio-wipe.service: Succeeded.
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: Started CRI-O Auto Update Script.
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: crio-wipe.service: Consumed 72ms CPU time
May 06 13:40:40 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: Starting Open Container Initiative Daemon...
May 06 13:40:41 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[10654]: time="2021-05-06 13:40:41.000237707Z" level=info msg="Starting CRI-O, version: 1.21.0-88.rhaos4.8.gitfd485de.el8, git: ()"
...
May 06 13:40:41 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: Started Open Container Initiative Daemon.
May 06 13:40:52 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[10654]: panic: error opening "/var/lib/containers/storage/overlay-images/images.lock": no such file or directory
May 06 13:40:52 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[10654]: goroutine 134 [running]:
May 06 13:40:52 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[10654]: panic(0x55578c82d4a0, 0xc0006f00b0)
```
The latest segfault is not from the ResourceStore, but rather from a corrupted image directory. My suspicion is this comes from the previous crash (the segfault that was just fixed). I think clearing out the image directory (rm -rf /var/lib/containers/storage) after a reboot (before crio and kubelet start) will clean the node up.
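Concretely, the cleanup suggested above would be something like this on the node (a sketch; assumes root access and that losing the local image cache is acceptable):

```
# stop kubelet and crio first so nothing is using the storage directory,
# wipe it, then bring the services back up
systemctl stop kubelet crio
rm -rf /var/lib/containers/storage
systemctl start crio kubelet
```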
@keyoung I would say this is a duplicate if cri-o segfaulted in that one as well.
I have deleted the contents of /var/lib/containers/storage and restarted the test. Again I have pods stuck at ContainerCreating, this time without crio segfaulting:

Type     Reason          Age                 From               Message
----     ------          ----                ----               -------
Normal   Scheduled       125m                default-scheduler  Successfully assigned sno-churning-churn-1/sno-churning-1 to e22-h20-b02-fc640.rdu2.scalelab.redhat.com
Normal   AddedInterface  32m                 multus             Add eth0 [10.128.0.195/23]
Warning  Failed          42s (x9 over 29m)   kubelet            Error: context deadline exceeded
Normal   Pulled          16s (x10 over 31m)  kubelet            Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine

LAST SEEN  TYPE     REASON          OBJECT              MESSAGE
108m       Normal   AddedInterface  pod/sno-churning-1  Add eth0 [10.128.0.195/23]
4m58s      Normal   Pulled          pod/sno-churning-1  Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
12m        Warning  Failed          pod/sno-churning-1  Error: context deadline exceeded
106m       Normal   AddedInterface  pod/sno-churning-2  Add eth0 [10.128.0.196/23]
4m59s      Normal   Pulled          pod/sno-churning-2  Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
9m34s      Warning  Failed          pod/sno-churning-2  Error: context deadline exceeded
57m        Warning  Failed          pod/sno-churning-2  Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_sno-churning_sno-churning-2_sno-churning-churn-1_e2dfc6b7-d0e1-461e-b03b-10ce84a65100_0 for id 514133f1b4ef836c02e2a476f049149585c7e9a5912c9f5bff5301e00afbd567: name is reserved

System load is high since the number of cores for the mgmt plane is limited to 2 plus its HT siblings. I'm concerned about the "may be retrying" part of the events log. Is this behaviour expected and 'normal'? Thanks!
Yes, that is expected. That's cri-o saying "I'm working on creating that container, don't worry." There may be some issue causing container creation to take 10 minutes, but it shouldn't be because of cri-o.
Got you, the only issue here is that it takes hours and hours. I'll let it run to see if they eventually start. Thanks again!
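For watching whether things are actually progressing while it runs, a couple of node-side checks can help (a sketch using standard crictl/journalctl tooling, not from this thread):

```
# see how often CRI-O requests are timing out or being retried
journalctl -u crio --since "1 hour ago" | grep -iE "deadline|name is reserved" | tail -n 20
# list pod sandboxes and all containers, including ones still being created
crictl pods
crictl ps -a
```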
Hi Marko Karg, can you tell me how to do step 2 in your reproducer?

"2. Run a churn test that keeps on creating and deleting projects (50 projects, 2 pods and 1 secret per pod)"

Is there a script or yaml file?
Hi MinLi, I use a script; I'll attach it here along with its configuration yaml. Basically it is a payload for kube-burner. Let me know if you have any other questions!

Thanks
Marko
Created attachment 1783921 [details] workload script
Created attachment 1783922 [details] workload configuration
Created attachment 1784002 [details] secret spec
Created attachment 1784003 [details] pod spec
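For completeness, the workload is kicked off roughly like this (a sketch; the config file name is a placeholder for the attached workload configuration):

```
kube-burner init -c workload-config.yaml
```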
I can't reproduce this bug, and I filed a new bug: https://bugzilla.redhat.com/show_bug.cgi?id=1965983. I think this bz is blocked by it.
For the original bug, I can reproduce the steps, and though the pods went through the ContainerCreating status for a period (several minutes), they became Running eventually. Also during that period, some pods went through CreateContainerError, but they became Running in the end.

# oc get pod -A | grep -v Running | grep -v Completed
NAMESPACE              NAME                              READY   STATUS                 RESTARTS   AGE
default                rhel-ubi                          0/1     Pending                0          36m
openshift-console      console-7bc49988f-54bsc           0/1     CreateContainerError   4          25h
openshift-ingress      router-default-fbb46ddf8-vdhgf    0/1     CreateContainerError   6          26h
openshift-marketplace  certified-operators-58xl8         0/1     CrashLoopBackOff       6          9h
openshift-marketplace  certified-operators-hvjh4         0/1     Pending                0          40m
openshift-marketplace  community-operators-wbnsj         0/1     Pending                0          39m
openshift-marketplace  redhat-marketplace-ffdcn          0/1     Pending                0          37m
openshift-marketplace  redhat-operators-nvdlt            0/1     Pending                0          38m
openshift-monitoring   thanos-querier-6ddfb55cbc-tdwx2   4/5     CrashLoopBackOff       30         25h
sno-churning-1         sno-churning-1                    1/1     Terminating            0          41m
sno-churning-1         sno-churning-2                    1/1     Terminating            0          41m
sno-churning-44        sno-churning-1                    0/1     CreateContainerError   0          41m
sno-churning-44        sno-churning-2                    0/1     CreateContainerError   0          41m
sno-churning-45        sno-churning-1                    0/1     CreateContainerError   0          41m
sno-churning-45        sno-churning-2                    0/1     CreateContainerError   0          41m
sno-churning-46        sno-churning-1                    0/1     ContainerCreating      0          41m
sno-churning-46        sno-churning-2                    0/1     CreateContainerError   0          41m
sno-churning-47        sno-churning-1                    0/1     CreateContainerError   0          41m
sno-churning-47        sno-churning-2                    0/1     ContainerCreating      0          41m
sno-churning-48        sno-churning-1                    0/1     ContainerCreating      0          41m
sno-churning-48        sno-churning-2                    0/1     ContainerCreating      0          41m
sno-churning-49        sno-churning-1                    0/1     ContainerCreating      0          41m
sno-churning-49        sno-churning-2                    0/1     ContainerCreating      0          41m
sno-churning-50        sno-churning-1                    0/1     ContainerCreating      0          41m
sno-churning-50        sno-churning-2                    0/1     Pending                0          41m
sno-churning-churn-1   sno-churning-1                    0/1     Pending                0          40m
sno-churning-churn-1   sno-churning-2                    0/1     Pending                0          40m

# oc get events -n sno-churning-44
LAST SEEN   TYPE      REASON             OBJECT                        MESSAGE
50m         Normal    Scheduled          pod/sno-churning-1            Successfully assigned sno-churning-44/sno-churning-1 to sno-0-0
16m         Normal    AddedInterface     pod/sno-churning-1            Add eth0 [10.128.0.231/23] from openshift-sdn
2m20s       Normal    Pulled             pod/sno-churning-1            Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
2m36s       Warning   Failed             pod/sno-churning-1            Error: context deadline exceeded
34s         Normal    Created            pod/sno-churning-1            Created container sno-churning
33s         Normal    Started            pod/sno-churning-1            Started container sno-churning
50m         Normal    Scheduled          pod/sno-churning-2            Successfully assigned sno-churning-44/sno-churning-2 to sno-0-0
15m         Normal    AddedInterface     pod/sno-churning-2            Add eth0 [10.128.0.232/23] from openshift-sdn
4m36s       Normal    Pulled             pod/sno-churning-2            Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
5m35s       Warning   Failed             pod/sno-churning-2            Error: context deadline exceeded
2m42s       Normal    Created            pod/sno-churning-2            Created container sno-churning
2m41s       Normal    Started            pod/sno-churning-2            Started container sno-churning
50m         Normal    CreatedSCCRanges   namespace/sno-churning-44     created SCC ranges

# oc get events -n sno-churning-46
LAST SEEN   TYPE      REASON             OBJECT                        MESSAGE
50m         Normal    Scheduled          pod/sno-churning-1            Successfully assigned sno-churning-46/sno-churning-1 to sno-0-0
13m         Normal    AddedInterface     pod/sno-churning-1            Add eth0 [10.128.0.237/23] from openshift-sdn
5m2s        Normal    Pulled             pod/sno-churning-1            Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
6m1s        Warning   Failed             pod/sno-churning-1            Error: context deadline exceeded
8m44s       Warning   Failed             pod/sno-churning-1            Error: failed to add conmon to systemd sandbox cgroup: Message recipient disconnected from message bus without replying
3m5s        Normal    Created            pod/sno-churning-1            Created container sno-churning
3m4s        Normal    Started            pod/sno-churning-1            Started container sno-churning
50m         Normal    Scheduled          pod/sno-churning-2            Successfully assigned sno-churning-46/sno-churning-2 to sno-0-0
15m         Normal    AddedInterface     pod/sno-churning-2            Add eth0 [10.128.0.235/23] from openshift-sdn
11m         Normal    Pulled             pod/sno-churning-2            Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
12m         Warning   Failed             pod/sno-churning-2            Error: context deadline exceeded
10m         Normal    Created            pod/sno-churning-2            Created container sno-churning
10m         Normal    Started            pod/sno-churning-2            Started container sno-churning
50m         Normal    CreatedSCCRanges   namespace/sno-churning-46     created SCC ranges

Hi Peter Hunt, can you help check whether these are normal? Thx

version: 4.8.0-0.nightly-2021-06-19-005119
Hi Peter, thanks for checking. I filed the new bug below to track the issue in the comments (https://bugzilla.redhat.com/show_bug.cgi?id=1955596#c26): https://bugzilla.redhat.com/show_bug.cgi?id=1975097
Hey Min, if the pods eventually become Running, then I would say there is no bug.
Verified according to Comment 30!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438