Bug 1975097
| Summary: | one worker node is in NotReady status | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | jima |
| Component: | Node | Assignee: | Peter Hunt <pehunt> |
| Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aos-bugs, dblack, rphillips |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-26 14:05:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
What are the contents of /var/lib/containers/storage/overlay? It seems the image storage is corrupted; there are tons of "no such file or directory" errors. Better yet, is there any chance I can have access to a node where this is happening?

The contents of that dir are:

```
[root@jstuevervcsa-mxggk-worker-x8jjj ~]# ls -ltr /var/lib/containers/storage/overlay
total 0
drwx------. 6 root root  69 Jun 22 07:36 98469092e6042f8c9cc81dcb1a710957fb5ef27817c9b178f7b71c4f242cb2ed
drwx------. 5 root root  69 Jun 22 07:36 b179ee5cfaf72dd478852044184dffa4acbcd76bde24975ad5a886eb96ed78bb
drwx------. 5 root root  69 Jun 22 07:36 fb0c869a2abe55f7f7c2f32be58715156c19c609544b0f04e1dfbf19416bec76
drwx------. 5 root root  69 Jun 22 07:36 f998664a7991707cedaf6f758814dc486545a2d7dca29d20edd646ad24c6e6b9
drwx------. 5 root root  69 Jun 22 07:36 ee9e0db041081439749975a4573bb64aeb0f1245844a0133635dab454efced92
drwx------. 5 root root  69 Jun 22 07:36 a52d80902d6c5b97b93ef7bbaf1b6db9158bf0978eee91221a266fc5d248c973
drwx------. 2 root root 210 Jun 24 00:38 l
brw-------. 1 root root 8, 4 Jun 24 00:38 backingFsBlockDev
```

I sent you the method for accessing the node via Slack, please check.

Waiting on another reproducer, as the other expired while I was getting VMware creds.

Is this still an issue?

I tried to install 4.8.2 and other nightly builds several times; the issue could not be reproduced any more. We can close it now and reopen it if we hit it again in the future.

Excellent to hear!
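As a side note, a minimal diagnostic sketch (not part of cri-o, and not a tool referenced in this bug) of how one could cross-check the layer IDs recorded in containers/storage metadata against the directories actually present under the overlay driver; the metadata path `overlay-layers/layers.json` and its layout are assumptions based on the default containers/storage storage root:

```go
// layercheck.go - a hypothetical sketch: list layers recorded in
// containers/storage metadata that have no backing directory under
// /var/lib/containers/storage/overlay. Such dangling entries would
// produce the "no such file or directory" errors seen above.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// layer mirrors only the "id" field of the entries in
// overlay-layers/layers.json; the real file carries more fields.
type layer struct {
	ID string `json:"id"`
}

func main() {
	root := "/var/lib/containers/storage" // assumed default storage root

	data, err := os.ReadFile(filepath.Join(root, "overlay-layers", "layers.json"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read layer metadata:", err)
		os.Exit(1)
	}

	var layers []layer
	if err := json.Unmarshal(data, &layers); err != nil {
		fmt.Fprintln(os.Stderr, "cannot parse layer metadata:", err)
		os.Exit(1)
	}

	missing := 0
	for _, l := range layers {
		dir := filepath.Join(root, "overlay", l.ID)
		if _, err := os.Stat(dir); os.IsNotExist(err) {
			// A layer recorded in metadata but missing on disk is the
			// kind of inconsistency suspected in this bug.
			fmt.Printf("layer %s has no directory under overlay/\n", l.ID)
			missing++
		}
	}
	fmt.Printf("%d of %d recorded layers are missing their overlay directory\n", missing, len(layers))
}
```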
Created attachment 1793333 [details]
kubelet and crio log on NotReady node

Description of problem:

Deployed OCP ipi-on-vsphere with 4.8.0-0.nightly-2021-06-21-175537 in an embedded vSphere cluster on VMC and found that one worker node is in "NotReady" status.

```
$ ../oc get nodes -o wide
NAME                              STATUS     ROLES    AGE   VERSION                INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                                                        KERNEL-VERSION                CONTAINER-RUNTIME
jstuevervcsa-mxggk-master-0       Ready      master   32m   v1.21.0-rc.0+120883f   192.168.1.109   192.168.1.109   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
jstuevervcsa-mxggk-master-1       Ready      master   32m   v1.21.0-rc.0+120883f   192.168.1.110   192.168.1.110   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
jstuevervcsa-mxggk-master-2       Ready      master   31m   v1.21.0-rc.0+120883f   192.168.1.108   192.168.1.108   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
jstuevervcsa-mxggk-worker-bn59w   Ready      worker   18m   v1.21.0-rc.0+120883f   192.168.1.112   192.168.1.112   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
jstuevervcsa-mxggk-worker-pfgtc   Ready      worker   17m   v1.21.0-rc.0+120883f   192.168.1.113   192.168.1.113   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
jstuevervcsa-mxggk-worker-x8jjj   NotReady   worker   17m   v1.21.0-rc.0+120883f   192.168.1.114   192.168.1.114   Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
```

```
$ ../oc get pod -A -o wide| grep -i -vE -- "Running|Completed"
NAMESPACE                                NAME                                          READY   STATUS              RESTARTS   AGE   IP              NODE                              NOMINATED NODE   READINESS GATES
openshift-cluster-node-tuning-operator   tuned-mnww6                                   0/1     ContainerCreating   0          21m   192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-dns                            node-resolver-tn2hk                           0/1     ContainerCreating   0          21m   192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-image-registry                 node-ca-x774x                                 0/1     ContainerCreating   0          21m   192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-machine-config-operator        machine-config-daemon-mmcg6                   0/2     ContainerCreating   0          21m   192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-monitoring                     node-exporter-4t6tv                           0/2     Init:0/1            0          21m   192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-multus                         multus-4kmkw                                  0/1     ContainerCreating   0          21m   192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-multus                         multus-additional-cni-plugins-2xtv9           0/1     Init:0/5            0          21m   192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-multus                         network-metrics-daemon-q6bxh                  0/2     ContainerCreating   0          21m   <none>          jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-network-diagnostics            network-check-target-pdg5c                    0/1     ContainerCreating   0          21m   <none>          jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-sdn                            sdn-sjhqv                                     0/2     ContainerCreating   0          21m   192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-vsphere-infra                  coredns-jstuevervcsa-mxggk-worker-x8jjj       0/2     Init:0/1            0          21m   192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
openshift-vsphere-infra                  keepalived-jstuevervcsa-mxggk-worker-x8jjj    0/2     Init:0/1            0          21m   192.168.1.114   jstuevervcsa-mxggk-worker-x8jjj   <none>           <none>
```

Checked the NotReady node "jstuevervcsa-mxggk-worker-x8jjj": all pods scheduled on this node failed
to create their pod sandbox, with the warning below:

```
Warning  FailedCreatePodSandBox  52s (x84 over 19m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_tuned-mnww6_openshift-cluster-node-tuning-operator_d1d97028-1924-44d7-8328-827b0cb57f8b_0": error creating read-write layer with ID "734e98cada05c5134b2a59884321ee73d7823db84146eeeec435efa7ff30b1bf": Stat /var/lib/containers/storage/overlay/ff49d8a247b7b5aea6384dec94b154a21bafde2e0b903680304a58be22b8bb31: no such file or directory
```

In the crio.service log, a panic was detected:

```
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.129320957Z" level=info msg="RunSandbox: releasing container name: k8s_POD_coredns-jstuevervcsa-7wjf9-worker-wjnz6_openshift-vsphere-infra_1bb8d80b9f0d5d2d7679a334fc031200_0" id=b7e3a9b8-bc8b-4e49-86f6-48817db4147b name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.129374595Z" level=info msg="RunSandbox: releasing container name: k8s_POD_coredns-jstuevervcsa-7wjf9-worker-wjnz6_openshift-vsphere-infra_1bb8d80b9f0d5d2d7679a334fc031200_0" id=b7e3a9b8-bc8b-4e49-86f6-48817db4147b name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.494993233Z" level=info msg="RunSandbox: releasing container name: k8s_POD_sdn-mwnpk_openshift-sdn_cdd93a87-0ab2-4879-a8ed-0ad9e75687bf_0" id=2c7180f2-da7c-4762-9d15-a83986440aa3 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.496250193Z" level=info msg="RunSandbox: releasing container name: k8s_POD_sdn-mwnpk_openshift-sdn_cdd93a87-0ab2-4879-a8ed-0ad9e75687bf_0" id=2c7180f2-da7c-4762-9d15-a83986440aa3 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.524650327Z" level=info msg="RunSandbox: releasing container name: k8s_POD_node-exporter-cgbvs_openshift-monitoring_60c85939-d698-4349-9e14-4125d8ecea3f_0" id=39dc0acf-9ec3-45a4-a89a-603cb956a352 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: time="2021-06-22 03:24:12.524724411Z" level=info msg="RunSandbox: releasing container name: k8s_POD_node-exporter-cgbvs_openshift-monitoring_60c85939-d698-4349-9e14-4125d8ecea3f_0" id=39dc0acf-9ec3-45a4-a89a-603cb956a352 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: panic: runtime error: invalid memory address or nil pointer dereference
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: [signal SIGSEGV: segmentation violation code=0x1 addr=0xa0 pc=0x55577b906655]
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: goroutine 118 [running]:
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: panic(0x55577cdf6ae0, 0x55577df5d400)
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: /usr/lib/golang/src/runtime/panic.go:1065 +0x565 fp=0xc000c19980 sp=0xc000c198b8 pc=0x55577af88c85
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: runtime.panicmem()
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: /usr/lib/golang/src/runtime/panic.go:212 +0x5d fp=0xc000c199a0 sp=0xc000c19980 pc=0x55577af86c7d
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: runtime.sigpanic()
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: /usr/lib/golang/src/runtime/signal_unix.go:734 +0x185 fp=0xc000c199d8 sp=0xc000c199a0 pc=0x55577afa0c65
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: github.com/cri-o/cri-o/vendor/github.com/containers/storage.(*store).imageTopLayerForMapping.func1(0xc00103e9a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: /builddir/build/BUILD/cri-o-dfcd2b6a61cd5436e6beb54956c6ee4453104968/_output/src/github.com/cri-o/cri-o/vendor/github.com/containers/storage/store.go:1104 +0x35 fp=0xc000c19a18 sp=0xc000c199d8 pc=0x55577b906655
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: github.com/cri-o/cri-o/vendor/github.com/containers/storage.(*store).imageTopLayerForMapping(0xc0000de2c0, 0xc0004f6b40, 0x55577d125490, 0xc00010a900, 0xc00010a901, 0x55577d129a08, 0xc001147520, 0x0, 0x0, 0x0, ...)
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: /builddir/build/BUILD/cri-o-dfcd2b6a61cd5436e6beb54956c6ee4453104968/_output/src/github.com/cri-o/cri-o/vendor/github.com/containers/storage/store.go:1151 +0xdbd fp=0xc000c19cf8 sp=0xc000c19a18 pc=0x55577b8eab1d
Jun 22 03:24:12 jstuevervcsa-7wjf9-worker-wjnz6 crio[1667]: github.com/cri-o/cri-o/vendor/github.com/containers/storage.(*store).CreateContainer(0xc0000de2c0, 0xc000855940, 0x40, 0xc0006858a0, 0x2, 0x2, 0xc00050edc7, 0x40, 0x0, 0x0, ...)
```

I attached the kubelet and crio logs; if more information or access to the cluster is needed, please let me know.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-21-175537

How reproducible:
Installed twice in embedded vSphere on VMC and reproduced 100% of the time. The issue was not hit on VMC.

Steps to Reproduce:
1. Deploy ipi-on-vsphere with payload 4.8.0-0.nightly-2021-06-21-175537
2.
3.

Actual results:
Installation does not complete because one worker node is NotReady.

Expected results:
Installation completes.

Additional info:
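For context on the traceback above: the SIGSEGV in `(*store).imageTopLayerForMapping.func1` is the classic shape of a lookup that can return nil being dereferenced without a check. The sketch below is a simplified illustration of that pattern only; it is not the actual containers/storage code, and the names (`Layer`, `lookupLayer`) are hypothetical.

```go
// A simplified illustration of a nil-pointer-dereference pattern: a layer
// lookup that returns nil when on-disk state is inconsistent, followed by
// an unguarded dereference. Not the actual cri-o / containers/storage code.
package main

import "fmt"

type Layer struct {
	ID     string
	Parent string
}

// lookupLayer stands in for a layer-store lookup; it returns nil when the
// requested layer is missing (e.g. its overlay directory was lost).
func lookupLayer(store map[string]*Layer, id string) *Layer {
	return store[id] // nil if absent
}

func main() {
	store := map[string]*Layer{
		"aaa": {ID: "aaa"},
	}

	// Buggy shape: assumes the lookup always succeeds.
	crash := func(id string) string {
		l := lookupLayer(store, id)
		return l.Parent // panics with a nil pointer dereference when l == nil
	}

	// Safer shape: check for nil before dereferencing.
	safe := func(id string) (string, error) {
		l := lookupLayer(store, id)
		if l == nil {
			return "", fmt.Errorf("layer %q not found", id)
		}
		return l.Parent, nil
	}

	if _, err := safe("missing"); err != nil {
		fmt.Println("handled:", err)
	}
	_ = crash // calling crash("missing") would reproduce the panic pattern
}
```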