Description of problem:

The master-api pod goes into CreateContainerError state after a hard reboot of the node. The customer is using CRI-O as the container runtime. The node log shows the following error:

~~~~~~
Feb 14 08:50:03 sdccsi-sesosm01 atomic-openshift-node[46337]: E0214 08:50:03.810919 46337 remote_runtime.go:187] CreateContainer in sandbox "11dfc686ca0151f6ede51f0d4a8f9a6373f7039a9a8a972d0d7b45bfd17478ce" from runtime service failed: rpc error: code = Unknown desc = that name is already in use "
~~~~~~

Actual results:

The master-api pod does not start after a hard reboot of the node.

Expected results:

Pods come back up and running after a hard reboot of a node using the CRI-O runtime.

Additional info:

The following are the CRI-O package versions:

cri-o-1.11.5-2.rhaos3.11.git1c8a4b1.el7.x86_64
cri-tools-1.11.1-1.rhaos3.11.gitedabfb5.el7_5.x86_64
criu-3.5-4.el7.x86_64

This matches a known upstream bug [1].

[1] https://github.com/kubernetes-sigs/cri-o/issues/1742
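The "that name is already in use" error points at the container name reservations CRI-O keeps in its on-disk storage: a name is reserved when the container record is created and released only when the record is deleted, so a record left half-written by a hard reboot keeps its name reserved even though the container itself can no longer be loaded (see the fix description in the next comment). Below is a minimal, self-contained Go sketch of that reservation pattern; every type, function, and the second ID are hypothetical, not taken from the CRI-O code:

~~~~~~
package main

import (
	"errors"
	"fmt"
)

// errNameReserved mirrors the error the kubelet surfaces as
// "that name is already in use".
var errNameReserved = errors.New("that name is already in use")

// nameRegistry is a toy model of on-disk name reservations:
// a name maps to the container ID that holds it.
type nameRegistry struct {
	reserved map[string]string
}

func (r *nameRegistry) reserve(name, id string) error {
	if _, taken := r.reserved[name]; taken {
		return errNameReserved
	}
	r.reserved[name] = id
	return nil
}

func main() {
	reg := &nameRegistry{reserved: map[string]string{}}

	// Before the hard reboot: the container record and its name
	// reservation are written to disk.
	_ = reg.reserve("k8s_api_master-api", "11dfc686")

	// After the reboot the container cannot be loaded, but the
	// reservation survived on disk, so the kubelet's retry fails:
	if err := reg.reserve("k8s_api_master-api", "5b9eafc1"); err != nil {
		fmt.Println("CreateContainer failed:", err)
	}
}
~~~~~~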
Opened a PR against master here: https://github.com/kubernetes-sigs/cri-o/pull/2083

There may still be other ways this can fail, but the PR fixes the cases I was able to reproduce locally. If the config.json file is missing or incomplete (i.e. its write was interrupted by the hard reboot), we would previously skip loading the container and not report it to the kubelet, so the next request to create it would fail because the name was still reserved. With the patch I've proposed, if we fail to load a container or pod, we also remove it from storage.
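For illustration only, here is a minimal, runnable Go sketch of the behavior the patch describes; every type and function name is hypothetical, not the actual CRI-O code. A container whose config.json cannot be read or parsed during restore is deleted from storage, freeing its name, instead of being silently skipped:

~~~~~~
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// storedContainer is a hypothetical stand-in for a container record
// found in the storage directory on startup.
type storedContainer struct {
	ID   string
	Name string
	Dir  string // directory expected to hold config.json
}

// loadConfig reads and parses the container's config.json; after a
// hard reboot the file may be missing or truncated.
func loadConfig(c storedContainer) (map[string]interface{}, error) {
	data, err := os.ReadFile(filepath.Join(c.Dir, "config.json"))
	if err != nil {
		return nil, fmt.Errorf("reading config.json: %w", err)
	}
	var cfg map[string]interface{}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("parsing config.json: %w", err)
	}
	return cfg, nil
}

// restoreContainers shows the fixed behavior: on a load failure the
// container is removed from storage (via the injected deleteFromStorage)
// so its name reservation does not outlive it.
func restoreContainers(found []storedContainer, deleteFromStorage func(id string) error) {
	for _, c := range found {
		if _, err := loadConfig(c); err != nil {
			fmt.Printf("cannot load %s (%s): %v; removing from storage\n", c.Name, c.ID, err)
			if derr := deleteFromStorage(c.ID); derr != nil {
				fmt.Printf("failed to remove %s: %v\n", c.ID, derr)
			}
			continue
		}
		// On success the container would be registered in the runtime
		// state and reported to the kubelet as before.
	}
}

func main() {
	// One container whose config.json never made it to disk.
	dir, _ := os.MkdirTemp("", "ctr")
	defer os.RemoveAll(dir)
	restoreContainers(
		[]storedContainer{{ID: "11dfc686", Name: "master-api", Dir: dir}},
		func(id string) error { return nil }, // stubbed storage deletion
	)
}
~~~~~~

The design point is that a container the kubelet can never be told about must not leave any state behind; otherwise every subsequent CreateContainer for the same name fails.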
Verified on:

[root@qe-wjiang311-crio-mrre-1 ~]# rpm -qa | grep -i cri-
cri-o-1.11.11-1.rhaos3.11.git474f73d.el7.x86_64
cri-tools-1.11.1-1.rhaos3.11.gitedabfb5.el7_5.x86_64

# journalctl -u atomic-openshift-node --since "20 minutes ago" | grep -i remote_runtime | grep -i createcontainer
<-----------EMPTY HERE-------->

And the cluster works well after a hard reboot.

[root@qe-wjiang311-crio-mrre-1 ~]# oc get nodes -o wide
NAME                       STATUS    ROLES            AGE       VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION               CONTAINER-RUNTIME
qe-wjiang311-crio-mrre-1   Ready     compute,master   1h        v1.11.0+d4cacc0   10.0.76.212   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.10.1.el7.x86_64   cri-o://1.11.11-1.rhaos3.11.git474f73d.el7

[root@qe-wjiang311-crio-mrre-1 ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                                          READY     STATUS      RESTARTS   AGE
default                             docker-registry-1-dk4fv                       1/1       Running     1          59m
default                             dockergc-jcspw                                1/1       Running     1          59m
default                             registry-console-1-smbbt                      1/1       Running     1          59m
default                             router-1-r5p4z                                1/1       Running     1          59m
install-test                        mongodb-1-692mp                               1/1       Running     1          53m
install-test                        nodejs-mongodb-example-1-build                0/1       Completed   0          53m
install-test                        nodejs-mongodb-example-1-p8whd                1/1       Running     1          52m
kube-service-catalog                apiserver-rtznh                               1/1       Running     2          57m
kube-service-catalog                controller-manager-sxgs4                      1/1       Running     6          57m
kube-system                         master-api-qe-wjiang311-crio-mrre-1           1/1       Running     1          1h
kube-system                         master-controllers-qe-wjiang311-crio-mrre-1   1/1       Running     0          1h
kube-system                         master-etcd-qe-wjiang311-crio-mrre-1          1/1       Running     1          1h
openshift-ansible-service-broker    asb-1-bgdpk                                   1/1       Running     3          55m
openshift-console                   console-5c75c46588-2mwhr                      1/1       Running     1          58m
openshift-node                      sync-trkqj                                    1/1       Running     1          1h
openshift-sdn                       ovs-dnlhm                                     1/1       Running     1          1h
openshift-sdn                       sdn-4r666                                     1/1       Running     1          1h
openshift-template-service-broker   apiserver-b8f2l                               1/1       Running     3          55m
openshift-web-console               webconsole-7b8f84dfd4-pn7w9                   1/1       Running     1          59m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0636