Description of problem:
After a host reboot, all pods on that host are lost when the cluster uses the cri-o runtime.

Version-Release number of selected component (if applicable):
openshift-ansible-3.11.0-0.11.0.git.0.3c66516None.noarch

# openshift version
openshift v3.11.0-0.11.0

# rpm -q cri-o
cri-o-1.11.1-2.rhaos3.11.git1759204.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster with the cri-o runtime enabled via the "openshift_use_crio=true" parameter of the openshift-ansible installer.
2. Run oc commands to make sure everything works.
3. Reboot a host, e.g. the master host.
4. After the reboot, all pods on the master are lost.

Actual results:
# oc get node
The connection to the server qe-jialiu-master-etcd-1:8443 was refused - did you specify the right host or port?

# crictl ps -a
CONTAINER ID        IMAGE               CREATED             STATE               NAME                ATTEMPT

Expected results:
After the host reboot, the cluster still works.

Additional info:
Running the following commands helps recover the lost pods on the node.

# rm -rf /var/{lib,run}/containers/storage/overlay-containers/*
# systemctl restart crio
# crictl ps
CONTAINER ID    IMAGE                                                              CREATED              STATE     NAME                 ATTEMPT
638fce66cef38   edb5b5536f11aebecbe3ecf50ba8900f076d229c568253be5a9657b11169d881   28 seconds ago       Running   apiserver            0
fc3830fb9b788   a87f6cd8f607f8b745d9ce8550ca356b986d4faf71b165e3d7b48da73b476026   28 seconds ago       Running   c                    0
0af1e964e0289   0800fea16b99796f9a9cb988256473d81f3427c4560eedc22d7df7cbaf709a34   28 seconds ago       Running   webconsole           0
5b66fdb736510   edb5b5536f11aebecbe3ecf50ba8900f076d229c568253be5a9657b11169d881   29 seconds ago       Running   controller-manager   0
767bc73d977a9   abfda950c7e7bdc5b714e08829e5cbc53658ca2bd7afa40ce5d6efe7ea07b471   30 seconds ago       Running   console              0
798f18b4c7e64   03a35919cfe09e5e254b9c2623e64f763ef39f744415d153c3173b512eccd0b5   38 seconds ago       Running   sdn                  0
4a6ad93ab0a96   03a35919cfe09e5e254b9c2623e64f763ef39f744415d153c3173b512eccd0b5   38 seconds ago       Running   sync                 0
1fbf5d1b9d075   03a35919cfe09e5e254b9c2623e64f763ef39f744415d153c3173b512eccd0b5   38 seconds ago       Running   openvswitch          0
ad3a056013361   9b87b1c25840d7707dea1d113701902d2c068ae085fdd71aa14a8f97158ae1b1   About a minute ago   Running   api                  0
c0183d8efeba7   9b87b1c25840d7707dea1d113701902d2c068ae085fdd71aa14a8f97158ae1b1   About a minute ago   Running   controllers          0
8f8e091d3a503   bb2f1d4dd3a7f57034630d630b3285093b799730aa624a87061ebc1150e62640   About a minute ago   Running   etcd                 0

# oc get po
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-87w7f    1/1       Running   0          6h
dockergc-mfjzn             1/1       Running   0          6h
dockergc-zg2mk             1/1       Running   0          6h
registry-console-1-g788f   1/1       Running   0          6h
router-1-bjbss             1/1       Running   0          6h
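For anyone who needs to apply the workaround above on several nodes, it could be wrapped in a small script along these lines. This is only a sketch of the two commands from the report, not an official recovery tool. The variable names STORAGE_ROOTS and DRY_RUN and the function names are made up for illustration, and the script defaults to a dry run so that nothing is deleted by accident.

```shell
#!/bin/sh
# Sketch of the manual workaround from this report:
#   rm -rf /var/{lib,run}/containers/storage/overlay-containers/*
#   systemctl restart crio
# STORAGE_ROOTS and DRY_RUN are hypothetical knobs added for illustration.

STORAGE_ROOTS="${STORAGE_ROOTS:-/var/lib/containers/storage /var/run/containers/storage}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print what would be done

# Remove the stale per-container state left behind by the reboot.
clear_stale_containers() {
    for root in $STORAGE_ROOTS; do
        dir="$root/overlay-containers"
        # Skip storage roots that do not exist on this host.
        [ -d "$dir" ] || continue
        if [ "$DRY_RUN" = "1" ]; then
            echo "would remove contents of $dir"
        else
            # Same glob as the manual command: the directory itself is kept.
            rm -rf "$dir"/*
        fi
    done
}

# Restart cri-o so that the kubelet can recreate the pods.
restart_crio() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: systemctl restart crio"
    else
        systemctl restart crio
    fi
}

clear_stale_containers
restart_crio
```

Run it once as-is to see what would be touched, then again as root with DRY_RUN=0 to apply the workaround. Afterwards "crictl ps" should list the containers again, as in the output above.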
Thanks, that helped a lot. PR here: https://github.com/kubernetes-incubator/cri-o/pull/1744
The fix is in cri-o 1.11.2.
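To see whether a given node already has a fixed build, the installed cri-o version can be compared against 1.11.2. A rough sketch, assuming an RPM-based host and GNU sort (for the -V option); the helper name version_ge is hypothetical:

```shell
#!/bin/sh
# Sketch: report whether the installed cri-o is at least the fixed version.
# FIXED_VERSION comes from this report; version_ge is an illustrative helper.

FIXED_VERSION="1.11.2"

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort).
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

installed=""
if command -v rpm >/dev/null 2>&1; then
    # rpm -q exits non-zero when the package is not installed.
    installed=$(rpm -q --queryformat '%{VERSION}' cri-o 2>/dev/null) || installed=""
fi

if [ -z "$installed" ]; then
    echo "cri-o does not appear to be installed via rpm"
elif version_ge "$installed" "$FIXED_VERSION"; then
    echo "cri-o $installed includes the fix"
else
    echo "cri-o $installed predates the fix ($FIXED_VERSION)"
fi
```

On the host from the original report (cri-o-1.11.1-2) this would take the last branch.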
Presently the cri-o version in the latest puddle (v3.11.0-0.20.0_2018-08-21.1) is cri-o-1.11.1-2.rhaos3.11.git1759204.el7.x86_64.rpm.
Checked with:

# oc version
oc v3.11.0-0.28.0
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-7-85.ec2.internal:8443
openshift v3.11.0-0.28.0
kubernetes v1.11.0+d4cacc0

# rpm -qa|grep -i cri-o
cri-o-1.11.2-1.rhaos3.11.git3eac3b2.el7.x86_64

cri-o works well after reboot.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652