Bug 1613938

Summary:	All pods are lost after a host reboot when the cri-o runtime is enabled.
Product:	OpenShift Container Platform	Reporter:	Johnny Liu <jialiu>
Component:	Containers	Assignee:	Giuseppe Scrivano <gscrivan>
Status:	CLOSED ERRATA	QA Contact:	Johnny Liu <jialiu>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.11.0	CC:	aos-bugs, jialiu, jokerman, lxia, mmccomas, mpatel, wjiang
Target Milestone:	---
Target Release:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-10-11 07:24:08 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Johnny Liu 2018-08-08 15:13:55 UTC

Description of problem:


Version-Release number of selected component (if applicable):
openshift-ansible-3.11.0-0.11.0.git.0.3c66516None.noarch
# openshift version
openshift v3.11.0-0.11.0
# rpm -q cri-o
cri-o-1.11.1-2.rhaos3.11.git1759204.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster with the cri-o runtime enabled via the "openshift_use_crio=true" parameter of the openshift-ansible installer.
2. Run oc commands to make sure everything works.
3. Reboot a host, e.g. the master host.
4. After the reboot, all pods on the master are lost.

Actual results:
# oc get node
The connection to the server qe-jialiu-master-etcd-1:8443 was refused - did you specify the right host or port?

# crictl ps -a
CONTAINER ID        IMAGE               CREATED             STATE               NAME                ATTEMPT


Expected results:
After the host reboots, the cluster keeps working.
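
The gap between the actual and expected results can be checked mechanically from the `crictl ps -a` table. This is a minimal sketch, not part of the original report: the function names `count_containers` and `node_lost_containers` are hypothetical, and the first line of input is assumed to be the column header shown above.

```shell
# count_containers
# Reads `crictl ps -a` table output on stdin and prints the number of
# container rows, i.e. every non-empty line after the header line.
count_containers() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# node_lost_containers
# Succeeds when the table on stdin lists no containers at all,
# which is the symptom described in this report.
node_lost_containers() {
    [ "$(count_containers)" -eq 0 ]
}
```

On an affected node, `crictl ps -a | node_lost_containers && echo "cri-o knows no containers"` would print the message after the reboot.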

Additional info:
Running the following commands helps recover the lost pods on the node.
# rm -rf /var/{lib,run}/containers/storage/overlay-containers/*
# systemctl restart crio

# crictl ps
CONTAINER ID        IMAGE                                                              CREATED              STATE               NAME                 ATTEMPT
638fce66cef38       edb5b5536f11aebecbe3ecf50ba8900f076d229c568253be5a9657b11169d881   28 seconds ago       Running             apiserver            0
fc3830fb9b788       a87f6cd8f607f8b745d9ce8550ca356b986d4faf71b165e3d7b48da73b476026   28 seconds ago       Running             c                    0
0af1e964e0289       0800fea16b99796f9a9cb988256473d81f3427c4560eedc22d7df7cbaf709a34   28 seconds ago       Running             webconsole           0
5b66fdb736510       edb5b5536f11aebecbe3ecf50ba8900f076d229c568253be5a9657b11169d881   29 seconds ago       Running             controller-manager   0
767bc73d977a9       abfda950c7e7bdc5b714e08829e5cbc53658ca2bd7afa40ce5d6efe7ea07b471   30 seconds ago       Running             console              0
798f18b4c7e64       03a35919cfe09e5e254b9c2623e64f763ef39f744415d153c3173b512eccd0b5   38 seconds ago       Running             sdn                  0
4a6ad93ab0a96       03a35919cfe09e5e254b9c2623e64f763ef39f744415d153c3173b512eccd0b5   38 seconds ago       Running             sync                 0
1fbf5d1b9d075       03a35919cfe09e5e254b9c2623e64f763ef39f744415d153c3173b512eccd0b5   38 seconds ago       Running             openvswitch          0
ad3a056013361       9b87b1c25840d7707dea1d113701902d2c068ae085fdd71aa14a8f97158ae1b1   About a minute ago   Running             api                  0
c0183d8efeba7       9b87b1c25840d7707dea1d113701902d2c068ae085fdd71aa14a8f97158ae1b1   About a minute ago   Running             controllers          0
8f8e091d3a503       bb2f1d4dd3a7f57034630d630b3285093b799730aa624a87061ebc1150e62640   About a minute ago   Running             etcd                 0


# oc get po
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-87w7f    1/1       Running   0          6h
dockergc-mfjzn             1/1       Running   0          6h
dockergc-zg2mk             1/1       Running   0          6h
registry-console-1-g788f   1/1       Running   0          6h
router-1-bjbss             1/1       Running   0          6h
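
The workaround from the additional info above can be wrapped in a small script. This is a sketch under stated assumptions: the function name `clear_overlay_containers` and the `RUN_WORKAROUND` guard variable are hypothetical and were added for illustration; the paths and the `systemctl restart crio` step are the ones from the report.

```shell
#!/bin/sh
# Sketch of the workaround from this report: remove the stale container
# records left behind by the reboot, then restart cri-o.

# clear_overlay_containers PARENT...
# Empties PARENT/containers/storage/overlay-containers/ for each PARENT,
# mirroring `rm -rf /var/{lib,run}/containers/storage/overlay-containers/*`.
clear_overlay_containers() {
    for parent in "$@"; do
        dir="$parent/containers/storage/overlay-containers"
        # Skip parents that have no such directory.
        [ -d "$dir" ] || continue
        # Remove the directory's contents but keep the directory itself.
        rm -rf -- "$dir"/*
    done
}

# RUN_WORKAROUND is a hypothetical guard so that sourcing this file
# never deletes anything by accident.
if [ "${RUN_WORKAROUND:-0}" = "1" ]; then
    clear_overlay_containers /var/lib /var/run
    systemctl restart crio
fi
```

Saved as a file, it would be run as root with `RUN_WORKAROUND=1 sh ./script.sh`; passing the parent directories as arguments keeps the function easy to try against a scratch directory first.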

Comment 3 Giuseppe Scrivano 2018-08-16 14:30:21 UTC

Thanks, that helped a lot.

PR here: https://github.com/kubernetes-incubator/cri-o/pull/1744

Comment 4 Mrunal Patel 2018-08-17 23:32:19 UTC

The fix will be in cri-o 1.11.2.

Comment 6 Johnny Liu 2018-08-23 06:49:48 UTC

Presently, the cri-o version in the latest puddle (v3.11.0-0.20.0_2018-08-21.1) is still cri-o-1.11.1-2.rhaos3.11.git1759204.el7.x86_64.rpm, so the fix from cri-o 1.11.2 is not yet available for testing.

Comment 8 weiwei jiang 2018-09-06 07:34:11 UTC

Checked with:

# oc version
oc v3.11.0-0.28.0
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-7-85.ec2.internal:8443
openshift v3.11.0-0.28.0
kubernetes v1.11.0+d4cacc0

# rpm -qa|grep -i cri-o
cri-o-1.11.2-1.rhaos3.11.git3eac3b2.el7.x86_64

And cri-o works well after reboot.
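
Whether a node carries the fixed package can be checked by comparing the installed cri-o version against 1.11.2, the release named in comment 4. The helpers below are a hedged sketch: all three function names are hypothetical, the parsing assumes rpm names of the form shown in this report, and the comparison relies on GNU `sort -V`.

```shell
# crio_version_from_rpm NAME
# Prints the upstream version field of an rpm name such as
# cri-o-1.11.2-1.rhaos3.11.git3eac3b2.el7.x86_64.
crio_version_from_rpm() {
    v=${1#cri-o-}             # drop the package-name prefix
    printf '%s\n' "${v%%-*}"  # keep everything before the release field
}

# version_at_least HAVE WANT
# Succeeds when HAVE >= WANT under GNU sort's version ordering (-V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# has_reboot_fix RPM_NAME
# Succeeds when the package is cri-o 1.11.2 or newer.
has_reboot_fix() {
    version_at_least "$(crio_version_from_rpm "$1")" 1.11.2
}
```

A typical call would be `has_reboot_fix "$(rpm -q cri-o)" && echo "fix present"`.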

Comment 10 errata-xmlrpc 2018-10-11 07:24:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652