Bug 1613938 - All pods are lost after a host reboot when the cri-o runtime is enabled.
Summary: All pods are lost after a host reboot when the cri-o runtime is enabled.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.0
Assignee: Giuseppe Scrivano
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-08-08 15:13 UTC by Johnny Liu
Modified: 2018-10-11 07:24 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-11 07:24:08 UTC
Target Upstream Version:
Embargoed:


Links
System                   ID               Private   Priority   Status   Summary   Last Updated
Red Hat Product Errata   RHBA-2018:2652   0         None       None     None      2018-10-11 07:24:26 UTC

Description Johnny Liu 2018-08-08 15:13:55 UTC
Description of problem:


Version-Release number of selected component (if applicable):
openshift-ansible-3.11.0-0.11.0.git.0.3c66516None.noarch
# openshift version
openshift v3.11.0-0.11.0
# rpm -q cri-o
cri-o-1.11.1-2.rhaos3.11.git1759204.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster with the cri-o runtime enabled by setting the "openshift_use_crio=true" parameter for the openshift-ansible installer.
2. Run oc commands to confirm that everything works.
3. Reboot a host, e.g. the master host.
4. After the reboot, all pods on the master are lost.

Actual results:
# oc get node
The connection to the server qe-jialiu-master-etcd-1:8443 was refused - did you specify the right host or port?

# crictl ps -a
CONTAINER ID        IMAGE               CREATED             STATE               NAME                ATTEMPT
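
A minimal sketch for spotting this symptom from a script, assuming the `crictl ps -a` table layout shown above (one header line, then one row per container). The function name count_crictl_rows is hypothetical and is not part of crictl.

```shell
# count_crictl_rows: read `crictl ps -a` style output on stdin and print the
# number of container rows, i.e. non-empty lines after the header line.
count_crictl_rows() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}
```

For example, `crictl ps -a | count_crictl_rows` would print 0 on a host in the state reported here.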


Expected results:
After a host reboot, the cluster continues to work.

Additional info:
Running the following commands recovers the lost pods on the node:
# rm -rf /var/{lib,run}/containers/storage/overlay-containers/*
# systemctl restart crio

# crictl ps
CONTAINER ID        IMAGE                                                              CREATED              STATE               NAME                 ATTEMPT
638fce66cef38       edb5b5536f11aebecbe3ecf50ba8900f076d229c568253be5a9657b11169d881   28 seconds ago       Running             apiserver            0
fc3830fb9b788       a87f6cd8f607f8b745d9ce8550ca356b986d4faf71b165e3d7b48da73b476026   28 seconds ago       Running             c                    0
0af1e964e0289       0800fea16b99796f9a9cb988256473d81f3427c4560eedc22d7df7cbaf709a34   28 seconds ago       Running             webconsole           0
5b66fdb736510       edb5b5536f11aebecbe3ecf50ba8900f076d229c568253be5a9657b11169d881   29 seconds ago       Running             controller-manager   0
767bc73d977a9       abfda950c7e7bdc5b714e08829e5cbc53658ca2bd7afa40ce5d6efe7ea07b471   30 seconds ago       Running             console              0
798f18b4c7e64       03a35919cfe09e5e254b9c2623e64f763ef39f744415d153c3173b512eccd0b5   38 seconds ago       Running             sdn                  0
4a6ad93ab0a96       03a35919cfe09e5e254b9c2623e64f763ef39f744415d153c3173b512eccd0b5   38 seconds ago       Running             sync                 0
1fbf5d1b9d075       03a35919cfe09e5e254b9c2623e64f763ef39f744415d153c3173b512eccd0b5   38 seconds ago       Running             openvswitch          0
ad3a056013361       9b87b1c25840d7707dea1d113701902d2c068ae085fdd71aa14a8f97158ae1b1   About a minute ago   Running             api                  0
c0183d8efeba7       9b87b1c25840d7707dea1d113701902d2c068ae085fdd71aa14a8f97158ae1b1   About a minute ago   Running             controllers          0
8f8e091d3a503       bb2f1d4dd3a7f57034630d630b3285093b799730aa624a87061ebc1150e62640   About a minute ago   Running             etcd                 0


# oc get po
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-87w7f    1/1       Running   0          6h
dockergc-mfjzn             1/1       Running   0          6h
dockergc-zg2mk             1/1       Running   0          6h
registry-console-1-g788f   1/1       Running   0          6h
router-1-bjbss             1/1       Running   0          6h
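
The workaround above could be wrapped in a small helper, sketched here under some assumptions: the function name crio_recover and the DRY_RUN variable are hypothetical, and the helper defaults to a dry run that only prints what it would do. The two paths are the expansion of /var/{lib,run}/containers/storage/overlay-containers from the commands above.

```shell
# crio_recover: apply the workaround from this report.
# With DRY_RUN=1 (the default) it only prints the commands; with DRY_RUN=0 it
# runs them, which requires root. Both names are illustrative, not part of cri-o.
crio_recover() {
    local dir
    for dir in /var/lib/containers/storage/overlay-containers \
               /var/run/containers/storage/overlay-containers; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: rm -rf ${dir}/*"
        else
            # Remove the contents of the directory, as in the report.
            # ${dir:?} aborts if dir is ever empty, so this can never become "rm -rf /*".
            rm -rf "${dir:?}"/*
        fi
    done
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: systemctl restart crio"
    else
        systemctl restart crio
    fi
}
```

Calling `crio_recover` prints the three commands; `DRY_RUN=0 crio_recover` executes them. This is only a workaround for affected versions; the fix itself is tracked in the comments below.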

Comment 3 Giuseppe Scrivano 2018-08-16 14:30:21 UTC
Thanks, that helped a lot.

PR here: https://github.com/kubernetes-incubator/cri-o/pull/1744

Comment 4 Mrunal Patel 2018-08-17 23:32:19 UTC
The fix is in cri-o 1.11.2.

Comment 6 Johnny Liu 2018-08-23 06:49:48 UTC
Presently the cri-o version in the latest puddle (v3.11.0-0.20.0_2018-08-21.1) is cri-o-1.11.1-2.rhaos3.11.git1759204.el7.x86_64.rpm.

Comment 8 weiwei jiang 2018-09-06 07:34:11 UTC
Checked with:

# oc version
oc v3.11.0-0.28.0
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-7-85.ec2.internal:8443
openshift v3.11.0-0.28.0
kubernetes v1.11.0+d4cacc0

# rpm -qa|grep -i cri-o
cri-o-1.11.2-1.rhaos3.11.git3eac3b2.el7.x86_64

cri-o works well after reboot.
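
A rough way to check whether a host already carries the fixed package is to compare the installed cri-o version against 1.11.2, the release named in comment 4. The sketch below assumes GNU sort (for the -V version ordering); the function names version_ge and crio_has_fix are hypothetical.

```shell
# version_ge A B: succeed when version A is the same as or newer than B,
# using GNU sort's version ordering (-V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# crio_has_fix VERSION: succeed when VERSION is at least 1.11.2.
crio_has_fix() {
    version_ge "$1" "1.11.2"
}
```

On an RPM-based host this could be used as `crio_has_fix "$(rpm -q --queryformat '%{VERSION}' cri-o)" && echo "fix present"`.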

Comment 10 errata-xmlrpc 2018-10-11 07:24:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

