Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1975097

Summary: one worker node is in NotReady status
Product: OpenShift Container Platform
Component: Node
Sub component: CRI-O
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: ---
Reporter: jima
Assignee: Peter Hunt <pehunt>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, dblack, rphillips
Type: Bug
Last Closed: 2021-07-26 14:05:39 UTC
Attachments:
kubelet and crio log on NotReady node

Description jima 2021-06-23 06:39:10 UTC
Created attachment 1793333 [details]
kubelet and crio log on NotReady node

Description of problem:
Deployed OCP IPI-on-vSphere with 4.8.0-0.nightly-2021-06-21-175537 in an embedded vSphere cluster on VMC, and found that one worker node is in "NotReady" status.

$ ../oc get nodes -o wide
NAME                              STATUS     ROLES    AGE   VERSION                INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
jstuevervcsa-mxggk-master-0       Ready      master   32m   v1.21.0-rc.0+120883f   192.168.1.109   192.168.1.109   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
jstuevervcsa-mxggk-master-1       Ready      master   32m   v1.21.0-rc.0+120883f   192.168.1.110   192.168.1.110   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
jstuevervcsa-mxggk-master-2       Ready      master   31m   v1.21.0-rc.0+120883f   192.168.1.108   192.168.1.108   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
jstuevervcsa-mxggk-worker-bn59w   Ready      worker   18m   v1.21.0-rc.0+120883f   192.168.1.112   192.168.1.112   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
jstuevervcsa-mxggk-worker-pfgtc   Ready      worker   17m   v1.21.0-rc.0+120883f   192.168.1.113   192.168.1.113   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
jstuevervcsa-mxggk-worker-x8jjj   NotReady   worker   17m   v1.21.0-rc.0+120883f   192.168.1.114   192.168.1.114   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
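
A quick way to see why the node reports NotReady is to pull the Ready condition message directly (a sketch; the node name is taken from the listing above):

$ oc get node jstuevervcsa-mxggk-worker-x8jjj -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'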

$ ../oc get pod -A -o wide | grep -i -vE -- "Running|Completed"
NAMESPACE                                          NAME                                                     READY   STATUS              RESTARTS   AGE     IP              NODE                              NOMINATED NODE   READINESS GATES
openshift-cluster-node-tuning-operator             tuned-mnww6                                              0/1     ContainerCreating   0          21m     192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-dns                                      node-resolver-tn2hk                                      0/1     ContainerCreating   0          21m     192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-image-registry                           node-ca-x774x                                            0/1     ContainerCreating   0          21m     192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-machine-config-operator                  machine-config-daemon-mmcg6                              0/2     ContainerCreating   0          21m     192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-monitoring                               node-exporter-4t6tv                                      0/2     Init:0/1            0          21m     192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-multus                                   multus-4kmkw                                             0/1     ContainerCreating   0          21m     192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-multus                                   multus-additional-cni-plugins-2xtv9                      0/1     Init:0/5            0          21m     192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-multus                                   network-metrics-daemon-q6bxh                             0/2     ContainerCreating   0          21m     <none>          jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-network-diagnostics                      network-check-target-pdg5c                               0/1     ContainerCreating   0          21m     <none>          jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-sdn                                      sdn-sjhqv                                                0/2     ContainerCreating   0          21m     192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-vsphere-infra                            coredns-jstuevervcsa-mxggk-worker-x8jjj                  0/2     Init:0/1            0          21m     192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-vsphere-infra                            keepalived-jstuevervcsa-mxggk-worker-x8jjj               0/2     Init:0/1            0          21m     192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>


Checked the NotReady node "jstuevervcsa-mxggk-worker-x8jjj": all pods scheduled on this node fail to create their pod sandbox, with the warning below:
 Warning  FailedCreatePodSandBox  52s (x84 over 19m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_tuned-mnww6_openshift-cluster-node-tuning-operator_d1d97028-1924-44d7-8328-827b0cb57f8b_0": error creating read-write layer with ID "734e98cada05c5134b2a59884321ee73d7823db84146eeeec435efa7ff30b1bf": Stat /var/lib/containers/storage/overlay/ff49d8a247b7b5aea6384dec94b154a21bafde2e0b903680304a58be22b8bb31: no such file or directory
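
One quick cross-check (a sketch, assuming the default containers/storage layout under /var/lib/containers/storage) is whether the parent layer named in the error still exists on disk and is still recorded in the overlay layer store:

$ ls -ld /var/lib/containers/storage/overlay/ff49d8a247b7b5aea6384dec94b154a21bafde2e0b903680304a58be22b8bb31
$ grep -o 'ff49d8a247b7[0-9a-f]*' /var/lib/containers/storage/overlay-layers/layers.json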

In the crio.service log, a panic was detected:
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.129320957Z" level=info msg="RunSandbox: releasing container name: k8s_POD_coredns-jstuevervcsa-7wjf9-worker-wjnz6_openshift-vsphere-infra_1bb8d80b9f0d5d2d7679a334fc031200_0" id=b7e3a9b8-bc8b-4e49-86f6-48817db4147b name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.129374595Z" level=info msg="RunSandbox: releasing container name: k8s_POD_coredns-jstuevervcsa-7wjf9-worker-wjnz6_openshift-vsphere-infra_1bb8d80b9f0d5d2d7679a334fc031200_0" id=b7e3a9b8-bc8b-4e49-86f6-48817db4147b name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.494993233Z" level=info msg="RunSandbox: releasing container name: k8s_POD_sdn-mwnpk_openshift-sdn_cdd93a87-0ab2-4879-a8ed-0ad9e75687bf_0" id=2c7180f2-da7c-4762-9d15-a83986440aa3 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.496250193Z" level=info msg="RunSandbox: releasing container name: k8s_POD_sdn-mwnpk_openshift-sdn_cdd93a87-0ab2-4879-a8ed-0ad9e75687bf_0" id=2c7180f2-da7c-4762-9d15-a83986440aa3 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.524650327Z" level=info msg="RunSandbox: releasing container name: k8s_POD_node-exporter-cgbvs_openshift-monitoring_60c85939-d698-4349-9e14-4125d8ecea3f_0" id=39dc0acf-9ec3-45a4-a89a-603cb956a352 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.524724411Z" level=info msg="RunSandbox: releasing container name: k8s_POD_node-exporter-cgbvs_openshift-monitoring_60c85939-d698-4349-9e14-4125d8ecea3f_0" id=39dc0acf-9ec3-45a4-a89a-603cb956a352 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: panic: runtime error: invalid memory address or nil pointer dereference
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: [signal SIGSEGV: segmentation violation code=0x1 addr=0xa0 pc=0x55577b906655]
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: goroutine 118 [running]:
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: panic(0x55577cdf6ae0, 0x55577df5d400)
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]:         /usr/lib/golang/src/runtime/panic.go:1065 +0x565 fp=0xc000c19980 sp=0xc000c198b8 pc=0x55577af88c85
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: runtime.panicmem()
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]:         /usr/lib/golang/src/runtime/panic.go:212 +0x5d fp=0xc000c199a0 sp=0xc000c19980 pc=0x55577af86c7d
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: runtime.sigpanic()
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]:         /usr/lib/golang/src/runtime/signal_unix.go:734 +0x185 fp=0xc000c199d8 sp=0xc000c199a0 pc=0x55577afa0c65
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: github.com/cri-o/cri-o/vendor/github.com/containers/storage.(*store).imageTopLayerForMapping.func1(0xc00103e9a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]:         /builddir/build/BUILD/cri-o-dfcd2b6a61cd5436e6beb54956c6ee4453104968/_output/src/github.com/cri-o/cri-o/vendor/github.com/containers/storage/store.go:1104 +0x35 fp=0xc000c19a18 sp=0xc000c199d8 pc=0x55577b906655
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: github.com/cri-o/cri-o/vendor/github.com/containers/storage.(*store).imageTopLayerForMapping(0xc0000de2c0, 0xc0004f6b40, 0x55577d125490, 0xc00010a900, 0xc00010a901, 0x55577d129a08, 0xc001147520, 0x0, 0x0, 0x0, ...)
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]:         /builddir/build/BUILD/cri-o-dfcd2b6a61cd5436e6beb54956c6ee4453104968/_output/src/github.com/cri-o/cri-o/vendor/github.com/containers/storage/store.go:1151 +0xdbd fp=0xc000c19cf8 sp=0xc000c19a18 pc=0x55577b8eab1d
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: github.com/cri-o/cri-o/vendor/github.com/containers/storage.(*store).CreateContainer(0xc0000de2c0, 0xc000855940, 0x40, 0xc0006858a0, 0x2, 0x2, 0xc00050edc7, 0x40, 0x0, 0x0, ...)
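
Since the panic kills the crio process, systemd would be restarting it in a loop, which matches the repeated sandbox-creation failures. A sketch of how to confirm that on the node (assuming the standard crio unit on RHCOS):

$ journalctl -u crio --no-pager | grep -c 'panic: runtime error'
$ systemctl show crio -p NRestarts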

I attached the kubelet and crio logs. If more information is needed, or access to that cluster, please let me know.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-21-175537

How reproducible:
Installed twice in the embedded vSphere cluster on VMC; reproduced 100% of the time.
The issue is not hit on VMC directly.

Steps to Reproduce:
1. Deploy IPI-on-vSphere with payload 4.8.0-0.nightly-2021-06-21-175537

Actual results:
Installation does not complete because one worker node is NotReady.

Expected results:
Installation completes successfully.

Additional info:

Comment 1 Peter Hunt 2021-06-23 15:27:19 UTC
What are the contents of /var/lib/containers/storage/overlay? It seems the image storage is corrupted; there are tons of "no such file or directory" errors.

Better yet, is there any chance I can have access to a node where this is happening?

Comment 2 jima 2021-06-24 01:09:50 UTC
The contents of that directory are:
[root@jstuevervcsa-mxggk-worker-x8jjj ~]# ls -ltr /var/lib/containers/storage/overlay
total 0
drwx------. 6 root root   69 Jun 22 07:36 98469092e6042f8c9cc81dcb1a710957fb5ef27817c9b178f7b71c4f242cb2ed
drwx------. 5 root root   69 Jun 22 07:36 b179ee5cfaf72dd478852044184dffa4acbcd76bde24975ad5a886eb96ed78bb
drwx------. 5 root root   69 Jun 22 07:36 fb0c869a2abe55f7f7c2f32be58715156c19c609544b0f04e1dfbf19416bec76
drwx------. 5 root root   69 Jun 22 07:36 f998664a7991707cedaf6f758814dc486545a2d7dca29d20edd646ad24c6e6b9
drwx------. 5 root root   69 Jun 22 07:36 ee9e0db041081439749975a4573bb64aeb0f1245844a0133635dab454efced92
drwx------. 5 root root   69 Jun 22 07:36 a52d80902d6c5b97b93ef7bbaf1b6db9158bf0978eee91221a266fc5d248c973
drwx------. 2 root root  210 Jun 24 00:38 l
brw-------. 1 root root 8, 4 Jun 24 00:38 backingFsBlockDev
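
For comparison, with the default overlay driver layout each layer directory should have a short-name symlink under l/ and a record in overlay-layers/layers.json; a sketch of how to cross-check (paths assume the default storage root):

$ ls -l /var/lib/containers/storage/overlay/l
$ grep -o '"id":"[0-9a-f]*"' /var/lib/containers/storage/overlay-layers/layers.json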

I sent you instructions for accessing the node via Slack; please check.

Comment 3 Peter Hunt 2021-07-02 18:50:10 UTC
Waiting on another reproducer, as the previous one expired while I was getting VMware creds.

Comment 4 Peter Hunt 2021-07-23 20:01:13 UTC
Is this still an issue?

Comment 5 jima 2021-07-26 08:01:41 UTC
I tried installing 4.8.2 and other nightly builds several times; the issue could not be reproduced any more.
We can close it now and reopen it if we hit it again in the future.

Comment 6 Peter Hunt 2021-07-26 14:05:39 UTC
Excellent to hear!