Bug 1955596 - Pods stuck in creation phase on realtime kernel SNO
Summary: Pods stuck in creation phase on realtime kernel SNO
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Peter Hunt
QA Contact: MinLi
URL:
Whiteboard: TestBlocker
Depends On:
Blocks:
 
Reported: 2021-04-30 13:42 UTC by Marko Karg
Modified: 2021-07-27 23:05 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:05:17 UTC
Target Upstream Version:
Embargoed:


Attachments
crio log tgz'ed (2.06 MB, application/gzip), 2021-04-30 13:42 UTC, Marko Karg
kubelet log tgz'ed (7.12 MB, application/gzip), 2021-04-30 13:43 UTC, Marko Karg
performance profile (929 bytes, text/plain), 2021-04-30 13:45 UTC, Marko Karg
latest segfault in crio (5.40 MB, application/gzip), 2021-05-07 07:34 UTC, Marko Karg
workload script (1.62 KB, text/plain), 2021-05-17 06:11 UTC, Marko Karg
workload configuration (894 bytes, text/plain), 2021-05-17 06:12 UTC, Marko Karg
secret spec (125 bytes, text/plain), 2021-05-17 09:42 UTC, Marko Karg
pod spec (321 bytes, text/plain), 2021-05-17 09:43 UTC, Marko Karg


Links
GitHub cri-o/cri-o pull 4834 (open): resource store: prevent segfault on cleanup step, last updated 2021-05-03 19:10:17 UTC
Red Hat Product Errata RHSA-2021:2438, last updated 2021-07-27 23:05:39 UTC

Description Marko Karg 2021-04-30 13:42:32 UTC
Created attachment 1777815 [details]
crio log tgz'ed

Description of problem:

On a baremetal SNO with the realtime kernel installed, pods get stuck in ContainerCreating state forever. 

[root@e22-h20-b01-fc640 workloads]# oc get pods
NAME             READY   STATUS              RESTARTS   AGE
sno-churning-1   0/1     ContainerCreating   0          82m

The reason is this:

  Normal   Scheduled               88m                default-scheduler  Successfully assigned sno-churning-churn-1/sno-churning-1 to e22-h20-b02-fc640.rdu2.scalelab.redhat.com
  Warning  FailedMount             73m (x2 over 73m)  kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
  Normal   AddedInterface          72m                multus             Add eth0 [10.128.0.234/23]
  Warning  FailedCreatePodSandBox  69m                kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   AddedInterface          64m                multus             Add eth0 [10.128.0.234/23]
  Warning  FailedCreatePodSandBox  61m                kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             57m (x3 over 57m)  kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
  Normal   AddedInterface          56m                multus             Add eth0 [10.128.0.234/23]
  Warning  FailedCreatePodSandBox  53m                kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             48m (x3 over 48m)  kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
  Normal   AddedInterface          48m                multus             Add eth0 [10.128.0.234/23]
  Warning  FailedCreatePodSandBox  44m                kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             37m (x3 over 37m)  kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
  Normal   AddedInterface          36m                multus             Add eth0 [10.128.0.234/23]
  Warning  FailedCreatePodSandBox  33m                kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             25m (x2 over 25m)  kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
  Normal   AddedInterface          25m                multus             Add eth0 [10.128.0.234/23]
  Warning  FailedCreatePodSandBox  21m                kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             12m (x3 over 12m)  kubelet            MountVolume.SetUp failed for volume "default-token-bjkj8" : failed to sync secret cache: timed out waiting for the condition
  Normal   AddedInterface          11m                multus             Add eth0 [10.128.0.234/23]
  Warning  FailedCreatePodSandBox  8m9s               kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded


Version-Release number of selected component (if applicable):
4.8.0-fc.1

How reproducible:
always

Steps to Reproduce:
1. Install SNO on bare metal, deploy the Performance Addon Operator (PAO) and the attached performance profile
2. Run a churn test that keeps on creating and deleting projects (50 projects, 2 pods and 1 secret per pod)
3. Once all of the initial projects are up, the first one gets deleted and recreated. As soon as that happens, the newly created pods get stuck (a minimal sketch of the churn pattern follows below)
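
For reference, here is a minimal shell sketch of the churn pattern, not the attached kube-burner workload; the namespace names, image, and counts are illustrative placeholders:

```
#!/usr/bin/env bash
# Illustrative churn sketch: create N projects, each with one secret and two
# pause pods, then repeatedly delete and recreate the first project.
set -euo pipefail

PROJECTS=50
IMAGE="gcr.io/google_containers/pause-amd64:3.0"

create_project() {
  local ns="sno-churning-$1"
  oc create namespace "$ns"
  oc create secret generic churn-secret -n "$ns" --from-literal=key=value
  for p in 1 2; do
    oc run "sno-churning-$p" -n "$ns" --image="$IMAGE" --restart=Never
  done
}

for i in $(seq 1 "$PROJECTS"); do
  create_project "$i"
done

# Churn: keep deleting and recreating the first project.
while true; do
  oc delete namespace sno-churning-1 --wait=true
  create_project 1
done
```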

Actual results:

Pods stuck in ContainerCreating state

Expected results:

Pods should reach the Running state eventually.


Additional info:

I've attached the kubelet log, the CRI-O log, and the performance profile I used; let me know if you need anything else.
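
For reference, logs like these can be gathered on the node with journalctl, assuming SSH or `oc debug node` access:

```
# Dump the CRI-O and kubelet journals on the node and compress them.
journalctl -u crio --no-pager > crio.log
journalctl -u kubelet --no-pager > kubelet.log
tar czf node-logs.tgz crio.log kubelet.log
```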

Comment 1 Marko Karg 2021-04-30 13:43:42 UTC
Created attachment 1777817 [details]
kubelet log tgz'ed

Comment 2 Marko Karg 2021-04-30 13:45:53 UTC
Created attachment 1777818 [details]
performance profile

Comment 3 Peter Hunt 2021-05-03 19:10:21 UTC
oopsies, that's because cri-o segfaulted

a 1.21 version of the attached PR should fix it

Comment 4 Marko Karg 2021-05-04 06:55:44 UTC
Hm, I don't see a PR - could you please point me to it? Also, how do I get it on the deployed machine?
Thanks!

Comment 5 Marko Karg 2021-05-05 07:11:57 UTC
@pehunt Can you please help me?

Comment 6 Marko Karg 2021-05-05 13:12:11 UTC
Ok, so I can easily reproduce this one:

May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]: panic: runtime error: invalid memory address or nil pointer dereference
May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5562b14fb5df]
May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]: goroutine 25 [running]:
May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]: panic(0x5562b1d14540, 0x5562b2e8c4c0)
May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]:         /usr/lib/golang/src/runtime/panic.go:1065 +0x565 fp=0xc002351e30 sp=0xc002351d68 pc=0x5562afe61985
May 05 13:04:19 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[4904]: runtime.panicmem()

For me this is turning into a test blocker, so I'd appreciate a pointer to the aforementioned PR and a way to get an updated CRI-O build onto my SNO cluster, please.

Thanks

Marko

Comment 7 Marko Karg 2021-05-05 13:22:09 UTC
Hm, bz-- it was hiding the PR from me for some reason.

Comment 8 Peter Hunt 2021-05-05 13:55:25 UTC
we spoke on slack, I have set Marko up with an RPM to test--and maybe find more holes in this :) 
pr merged btw
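
(For reference, one common way to drop a test CRI-O RPM onto an RHCOS/SNO node is an rpm-ostree override; the RPM filename below is a placeholder:)

```
# On the node: replace the base cri-o package with the test build and reboot
# so the new binary takes effect. The filename is a placeholder.
rpm-ostree override replace ./cri-o-1.21.0-test.rpm
systemctl reboot
```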

Comment 10 Marko Karg 2021-05-07 07:32:48 UTC
@pehunt Ok, even with the fixed crio I got another segmentation fault from crio. I'm attaching the crio logs to this bz.
Timestamp for the latest segfault is May 06 13:19:46.

In addition to that, I got pods stuck with

[root@e22-h20-b01-fc640 ~]# oc get pods
NAME             READY   STATUS              RESTARTS   AGE
sno-churning-1   0/1     ContainerCreating   0          15h
sno-churning-2   0/1     ContainerCreating   0          15h

Events:
  Type     Reason                    Age                   From     Message
  ----     ------                    ----                  ----     -------
  Warning  FailedCreatePodContainer  9m17s (x64 over 11h)  kubelet  unable to ensure pod container exists: failed to create container for [kubepods besteffort pod01954da8-f380-4e25-bfff-b6b5efc061b0] : Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
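
For what it's worth, a quick sketch of checks one might run on the node to see whether systemd/D-Bus is overloaded (which would explain the activation timeout above); these are standard systemd tools, not steps taken from this bug:

```
# Verify that systemd's D-Bus endpoint still answers and look for dbus errors.
busctl status org.freedesktop.systemd1
systemctl status dbus.service
journalctl -u dbus --since "1 hour ago" --no-pager | tail -n 50
```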

Comment 11 Marko Karg 2021-05-07 07:34:16 UTC
Created attachment 1780609 [details]
latest segfault in crio

Comment 12 Ken Young 2021-05-07 16:24:30 UTC
Team,

Could this be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1939133?  If the system is unresponsive, pod creation would be a problem.

/KenY

Comment 13 Peter Hunt 2021-05-07 17:02:16 UTC
Okay good news:
```
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: Starting CRI-O Auto Update Script...
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[9207]: time="2021-05-06 13:38:33.938380361Z" level=info msg="Starting CRI-O, version: 1.21.0-88.rhaos4.8.gitfd485de.el8, git: ()"
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[9207]: time="2021-05-06 13:38:33.943053793Z" level=info msg="Checking whether cri-o should wipe containers: version file /var/run/crio/version not found: open /var/run/crio/>
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: crio-wipe.service: Succeeded.
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: Started CRI-O Auto Update Script.
May 06 13:38:33 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: crio-wipe.service: Consumed 72ms CPU time
May 06 13:40:40 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: Starting Open Container Initiative Daemon...
May 06 13:40:41 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[10654]: time="2021-05-06 13:40:41.000237707Z" level=info msg="Starting CRI-O, version: 1.21.0-88.rhaos4.8.gitfd485de.el8, git: ()"
...
May 06 13:40:41 e22-h20-b02-fc640.rdu2.scalelab.redhat.com systemd[1]: Started Open Container Initiative Daemon.
May 06 13:40:52 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[10654]: panic: error opening "/var/lib/containers/storage/overlay-images/images.lock": no such file or directory
May 06 13:40:52 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[10654]: goroutine 134 [running]:
May 06 13:40:52 e22-h20-b02-fc640.rdu2.scalelab.redhat.com crio[10654]: panic(0x55578c82d4a0, 0xc0006f00b0)

```
The latest segfault is not from the ResourceStore, but rather from a corrupted image directory. My suspicion is that this is fallout from the previous crash (the segfault that has since been fixed).

I think clearing out the image directory (rm -rf /var/lib/containers/storage) after a reboot (before crio and kubelet start) will clean the node up.
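
A sketch of that cleanup sequence, assuming shell access to the node (stopping the services explicitly instead of relying on reboot ordering):

```
# Wipe container storage while crio and kubelet are down so CRI-O recreates
# it from scratch on the next start.
systemctl stop kubelet crio
rm -rf /var/lib/containers/storage
systemctl start crio kubelet
```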

Comment 14 Peter Hunt 2021-05-07 17:02:46 UTC
@keyoung I would say this is a duplicate if cri-o segfaulted in that one as well.

Comment 15 Marko Karg 2021-05-10 12:12:55 UTC
I have deleted the contents of /var/lib/containers/storage and restarted the test. Again I have pods stuck at ContainerCreating, this time without crio segfaulting:

  Type     Reason          Age                 From               Message
  ----     ------          ----                ----               -------
  Normal   Scheduled       125m                default-scheduler  Successfully assigned sno-churning-churn-1/sno-churning-1 to e22-h20-b02-fc640.rdu2.scalelab.redhat.com
  Normal   AddedInterface  32m                 multus             Add eth0 [10.128.0.195/23]
  Warning  Failed          42s (x9 over 29m)   kubelet            Error: context deadline exceeded
  Normal   Pulled          16s (x10 over 31m)  kubelet            Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine


LAST SEEN   TYPE      REASON           OBJECT               MESSAGE
108m        Normal    AddedInterface   pod/sno-churning-1   Add eth0 [10.128.0.195/23]
4m58s       Normal    Pulled           pod/sno-churning-1   Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
12m         Warning   Failed           pod/sno-churning-1   Error: context deadline exceeded
106m        Normal    AddedInterface   pod/sno-churning-2   Add eth0 [10.128.0.196/23]
4m59s       Normal    Pulled           pod/sno-churning-2   Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
9m34s       Warning   Failed           pod/sno-churning-2   Error: context deadline exceeded
57m         Warning   Failed           pod/sno-churning-2   Error: Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving ctr name k8s_sno-churning_sno-churning-2_sno-churning-churn-1_e2dfc6b7-d0e1-461e-b03b-10ce84a65100_0 for id 514133f1b4ef836c02e2a476f049149585c7e9a5912c9f5bff5301e00afbd567: name is reserved

System load is high since the management plane is limited to 2 cores plus their HT siblings. I'm concerned about the "may be retrying" part of the events log. Is this behaviour expected and 'normal'?

Thanks!

Comment 16 Peter Hunt 2021-05-10 12:56:14 UTC
Yes, that is expected. That's cri-o saying "I'm working on creating that container, don't worry."
There may be some issue that is causing a container creation to take 10 minutes, but it shouldn't be because of cri-o.
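
A couple of ways to confirm on the node that CRI-O is still making progress rather than hung (standard crictl/journalctl usage; the grep pattern just matches the event text above):

```
# List the sandboxes and containers CRI-O knows about, including ones still
# being created, and watch for the "name is reserved" retry messages.
crictl pods
crictl ps -a
journalctl -u crio -f | grep -i "name is reserved"
```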

Comment 17 Marko Karg 2021-05-10 13:07:37 UTC
Got it, the only issue here is that it takes hours and hours. I'll let it run to see if the pods eventually start. Thanks again!

Comment 18 MinLi 2021-05-14 04:14:38 UTC
Hi, Marko Karg,
Can you tell me how to do step 2 in your reproducer?
2. Run a churn test that keeps on creating and deleting projects (50 projects, 2 pods and 1 secret per pod)

Is there a script or YAML file?

Comment 19 Marko Karg 2021-05-17 06:10:34 UTC
Hi MinLi,

I use a script; I'll attach it here along with its configuration YAML. Basically it is a payload for kube-burner.
Let me know if you have any other questions!

Thanks

Marko

Comment 20 Marko Karg 2021-05-17 06:11:40 UTC
Created attachment 1783921 [details]
workload script

Comment 21 Marko Karg 2021-05-17 06:12:06 UTC
Created attachment 1783922 [details]
workload configuration

Comment 22 Marko Karg 2021-05-17 09:42:42 UTC
Created attachment 1784002 [details]
secret spec

Comment 23 Marko Karg 2021-05-17 09:43:11 UTC
Created attachment 1784003 [details]
pod spec

Comment 24 MinLi 2021-05-31 10:13:10 UTC
I can't reproduce this bug, and I filed a new bug: https://bugzilla.redhat.com/show_bug.cgi?id=1965983.
I think this bz is blocked by it.

Comment 28 MinLi 2021-06-23 03:51:38 UTC
For the original bug, I can reproduce the steps, and though the pods went through the ContainerCreating status for a period (several minutes), they became Running eventually.

Also during that period, some pods went through CreateContainerError, but they became Running in the end.

# oc get pod -A | grep -v Running  | grep -v Completed 
NAMESPACE                                          NAME                                                              READY   STATUS                 RESTARTS   AGE
default                                            rhel-ubi                                                          0/1     Pending                0          36m
openshift-console                                  console-7bc49988f-54bsc                                           0/1     CreateContainerError   4          25h
openshift-ingress                                  router-default-fbb46ddf8-vdhgf                                    0/1     CreateContainerError   6          26h
openshift-marketplace                              certified-operators-58xl8                                         0/1     CrashLoopBackOff       6          9h
openshift-marketplace                              certified-operators-hvjh4                                         0/1     Pending                0          40m
openshift-marketplace                              community-operators-wbnsj                                         0/1     Pending                0          39m
openshift-marketplace                              redhat-marketplace-ffdcn                                          0/1     Pending                0          37m
openshift-marketplace                              redhat-operators-nvdlt                                            0/1     Pending                0          38m
openshift-monitoring                               thanos-querier-6ddfb55cbc-tdwx2                                   4/5     CrashLoopBackOff       30         25h
sno-churning-1                                     sno-churning-1                                                    1/1     Terminating            0          41m
sno-churning-1                                     sno-churning-2                                                    1/1     Terminating            0          41m
sno-churning-44                                    sno-churning-1                                                    0/1     CreateContainerError   0          41m
sno-churning-44                                    sno-churning-2                                                    0/1     CreateContainerError   0          41m
sno-churning-45                                    sno-churning-1                                                    0/1     CreateContainerError   0          41m
sno-churning-45                                    sno-churning-2                                                    0/1     CreateContainerError   0          41m
sno-churning-46                                    sno-churning-1                                                    0/1     ContainerCreating      0          41m
sno-churning-46                                    sno-churning-2                                                    0/1     CreateContainerError   0          41m
sno-churning-47                                    sno-churning-1                                                    0/1     CreateContainerError   0          41m
sno-churning-47                                    sno-churning-2                                                    0/1     ContainerCreating      0          41m
sno-churning-48                                    sno-churning-1                                                    0/1     ContainerCreating      0          41m
sno-churning-48                                    sno-churning-2                                                    0/1     ContainerCreating      0          41m
sno-churning-49                                    sno-churning-1                                                    0/1     ContainerCreating      0          41m
sno-churning-49                                    sno-churning-2                                                    0/1     ContainerCreating      0          41m
sno-churning-50                                    sno-churning-1                                                    0/1     ContainerCreating      0          41m
sno-churning-50                                    sno-churning-2                                                    0/1     Pending                0          41m
sno-churning-churn-1                               sno-churning-1                                                    0/1     Pending                0          40m
sno-churning-churn-1                               sno-churning-2                                                    0/1     Pending                0          40m

# oc get events -n sno-churning-44
LAST SEEN   TYPE      REASON             OBJECT                      MESSAGE
50m         Normal    Scheduled          pod/sno-churning-1          Successfully assigned sno-churning-44/sno-churning-1 to sno-0-0
16m         Normal    AddedInterface     pod/sno-churning-1          Add eth0 [10.128.0.231/23] from openshift-sdn
2m20s       Normal    Pulled             pod/sno-churning-1          Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
2m36s       Warning   Failed             pod/sno-churning-1          Error: context deadline exceeded
34s         Normal    Created            pod/sno-churning-1          Created container sno-churning
33s         Normal    Started            pod/sno-churning-1          Started container sno-churning
50m         Normal    Scheduled          pod/sno-churning-2          Successfully assigned sno-churning-44/sno-churning-2 to sno-0-0
15m         Normal    AddedInterface     pod/sno-churning-2          Add eth0 [10.128.0.232/23] from openshift-sdn
4m36s       Normal    Pulled             pod/sno-churning-2          Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
5m35s       Warning   Failed             pod/sno-churning-2          Error: context deadline exceeded
2m42s       Normal    Created            pod/sno-churning-2          Created container sno-churning
2m41s       Normal    Started            pod/sno-churning-2          Started container sno-churning
50m         Normal    CreatedSCCRanges   namespace/sno-churning-44   created SCC ranges

# oc get events -n sno-churning-46
LAST SEEN   TYPE      REASON             OBJECT                      MESSAGE
50m         Normal    Scheduled          pod/sno-churning-1          Successfully assigned sno-churning-46/sno-churning-1 to sno-0-0
13m         Normal    AddedInterface     pod/sno-churning-1          Add eth0 [10.128.0.237/23] from openshift-sdn
5m2s        Normal    Pulled             pod/sno-churning-1          Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
6m1s        Warning   Failed             pod/sno-churning-1          Error: context deadline exceeded
8m44s       Warning   Failed             pod/sno-churning-1          Error: failed to add conmon to systemd sandbox cgroup: Message recipient disconnected from message bus without replying
3m5s        Normal    Created            pod/sno-churning-1          Created container sno-churning
3m4s        Normal    Started            pod/sno-churning-1          Started container sno-churning
50m         Normal    Scheduled          pod/sno-churning-2          Successfully assigned sno-churning-46/sno-churning-2 to sno-0-0
15m         Normal    AddedInterface     pod/sno-churning-2          Add eth0 [10.128.0.235/23] from openshift-sdn
11m         Normal    Pulled             pod/sno-churning-2          Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
12m         Warning   Failed             pod/sno-churning-2          Error: context deadline exceeded
10m         Normal    Created            pod/sno-churning-2          Created container sno-churning
10m         Normal    Started            pod/sno-churning-2          Started container sno-churning
50m         Normal    CreatedSCCRanges   namespace/sno-churning-46   created SCC ranges

Hi, Peter Hunt,
Can you help check whether these are normal? Thanks!

version: 4.8.0-0.nightly-2021-06-19-005119

Comment 29 jima 2021-06-23 06:40:53 UTC
Hi Peter,

Thanks for checking. I filed the new bug below to track the issue from comment 26 (https://bugzilla.redhat.com/show_bug.cgi?id=1955596#c26):
https://bugzilla.redhat.com/show_bug.cgi?id=1975097

Comment 30 Peter Hunt 2021-06-24 19:38:23 UTC
Hey Min,

If the pods eventually become Running, then I would say there is no bug.

Comment 31 MinLi 2021-06-25 02:42:39 UTC
Verified according to comment 30!

Comment 33 errata-xmlrpc 2021-07-27 23:05:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

