Bug 1668649

Summary: [3.10]upgrade failed due to crio client and server mismatch
Product: OpenShift Container Platform Reporter: Weihua Meng <wmeng>
Component: Cluster Version Operator Assignee: Russell Teague <rteague>
Status: CLOSED ERRATA QA Contact: Weihua Meng <wmeng>
Severity: high Docs Contact:
Priority: high    
Version: 3.10.0 CC: aos-bugs, farandac, jokerman, mmccomas, rteague, vlaad, wmeng
Target Milestone: ---   
Target Release: 3.10.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Due to a breaking change (API endpoint updated) between crio 1.9 and 1.10, crictl 1.10 will not work with older versions of the crio service.
Consequence: During upgrades from openshift-ansible 3.9, the cri-tools package is updated/installed to 1.10 prior to the image pre-pull tasks, so those tasks fail against the not-yet-upgraded crio service and the upgrade stops.
Fix: The pre-pull tasks are not critical to the upgrade process, and errors from these tasks are now ignored, allowing the upgrade to progress. Images are pulled during the upgrade after the crio service is upgraded.
Result: Upgrades from 3.9 to 3.10 complete as expected.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-03-14 02:15:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Weihua Meng 2019-01-23 09:02:08 UTC
Description of problem:
upgrade failed due to crio client and server mismatch 

Version-Release number of the following components:
openshift-ansible-3.10.101-1.git.0.5f32198.el7.noarch


How reproducible:
Always

Steps to Reproduce:
1. install OCP v3.9 with cri-o container runtime.
2. upgrade to v3.10

Actual results:
upgrade failed.

TASK [openshift_node : Check that node image is present] ***********************
task path: /home/slave2/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/roles/openshift_node/tasks/prepull.yml:2
Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
<ec2-3-90-247-103.compute-1.amazonaws.com> ESTABLISH SSH CONNECTION FOR USER: root
<ec2-3-90-247-103.compute-1.amazonaws.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o 'IdentityFile="/home/slave2/workspace/Run-Ansible-Playbooks-Nextge/private/config/keys/libra.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/slave2/.ansible/cp/%C ec2-3-90-247-103.compute-1.amazonaws.com '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<ec2-3-90-247-103.compute-1.amazonaws.com> (1, '\n{"changed": true, "end": "2019-01-23 02:36:13.533830", "stdout": "", "cmd": ["crictl", "images", "-q", "registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10"], "failed": true, "delta": "0:00:00.016518", "stderr": "W0123 02:36:13.531278   67461 util_unix.go:75] Using \\"/var/run/crio/crio.sock\\" as endpoint is deprecated, please consider using full url format \\"unix:///var/run/crio/crio.sock\\".\\ntime=\\"2019-01-23T02:36:13-05:00\\" level=fatal msg=\\"listing images failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService\\" ", "rc": 1, "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": "crictl images -q registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10", "removes": null, "creates": null, "chdir": null, "stdin": null}}, "start": "2019-01-23 02:36:13.517312", "msg": "non-zero return code"}\n', '')
fatal: [ec2-3-90-247-103.compute-1.amazonaws.com]: FAILED! => {
    "changed": true, 
    "cmd": [
        "crictl", 
        "images", 
        "-q", 
        "registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10"
    ], 
    "delta": "0:00:00.016518", 
    "end": "2019-01-23 02:36:13.533830", 
    "failed": true, 
    "invocation": {
        "module_args": {
            "_raw_params": "crictl images -q registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "msg": "non-zero return code", 
    "rc": 1, 
    "start": "2019-01-23 02:36:13.517312", 
    "stderr": "W0123 02:36:13.531278   67461 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2019-01-23T02:36:13-05:00\" level=fatal msg=\"listing images failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService\" ", 
    "stderr_lines": [
        "W0123 02:36:13.531278   67461 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", 
        "time=\"2019-01-23T02:36:13-05:00\" level=fatal msg=\"listing images failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService\" "
    ], 
    "stdout": "", 
    "stdout_lines": []
}

info when upgrade failed:
[root@ip-172-18-31-212 ~]# crictl --version
crictl version 1.0.0-beta.0
[root@ip-172-18-31-212 ~]# rpm -q cri-o
cri-o-1.9.14-1.git4e220eb.el7.x86_64
[root@ip-172-18-31-212 ~]# rpm -q cri-tools 
cri-tools-1.0.0-5.rhaos3.10.git2e22a75.el7.x86_64
[root@ip-172-18-31-212 ~]# oc get node -owide
NAME                            STATUS    ROLES     AGE       VERSION             EXTERNAL-IP      OS-IMAGE                                      KERNEL-VERSION              CONTAINER-RUNTIME
ip-172-18-11-104.ec2.internal   Ready     master    1h        v1.9.1+a0ce1bc657   34.229.101.173   Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-12-197.ec2.internal   Ready     <none>    1h        v1.9.1+a0ce1bc657   54.166.154.56    Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-15-193.ec2.internal   Ready     master    1h        v1.9.1+a0ce1bc657   54.224.233.49    Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-17-45.ec2.internal    Ready     <none>    1h        v1.9.1+a0ce1bc657   3.90.205.150     Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-25-141.ec2.internal   Ready     <none>    1h        v1.9.1+a0ce1bc657   54.160.180.155   Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-3-134.ec2.internal    Ready     compute   1h        v1.9.1+a0ce1bc657   52.203.131.75    Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-30-239.ec2.internal   Ready     compute   1h        v1.9.1+a0ce1bc657   34.228.55.131    Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-31-212.ec2.internal   Ready     master    1h        v1.9.1+a0ce1bc657   3.90.247.103     Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
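
The failure signature in the task output above (non-zero rc plus an "Unimplemented ... unknown service runtime.v1alpha2" message on stderr) can be checked for mechanically when triaging upgrade logs. The following is only a rough sketch: the function name and the marker constant are made up for illustration and are not part of openshift-ansible; the marker text itself is copied from the stderr in this report.

```python
# Hypothetical triage helper; not part of openshift-ansible.
# Marker copied from the crictl stderr in this report: the older crio
# service does not expose the runtime.v1alpha2 services the newer crictl calls.
MISMATCH_MARKER = "code = Unimplemented desc = unknown service runtime.v1alpha2"


def is_cri_api_mismatch(task_result):
    """Return True if a failed command result looks like a crictl/crio API mismatch.

    task_result is the dict Ansible prints for a command task
    (only the "rc" and "stderr" keys are used here).
    """
    if task_result.get("rc", 0) == 0:
        return False
    return MISMATCH_MARKER in task_result.get("stderr", "")


# Trimmed copy of the failed task result from this report.
failed = {
    "rc": 1,
    "stderr": 'time="2019-01-23T02:36:13-05:00" level=fatal '
              'msg="listing images failed: rpc error: code = Unimplemented '
              'desc = unknown service runtime.v1alpha2.ImageService" ',
}
print(is_cri_api_mismatch(failed))
```

A result with rc 0, or a non-zero rc with an unrelated stderr, is not classified as a mismatch.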

Expected results:
upgrade succeeded.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 Scott Dodson 2019-01-23 20:16:35 UTC
What version of the installer was used to install 3.9? The upgrade playbooks only assert that cri-tools is installed. It should have been installed when 3.9 was installed with cri-o, but in your case it was installed during the upgrade, in this task:

TASK [openshift_control_plane : Ensure cri-tools installed] ********************

Comment 4 Weihua Meng 2019-01-24 01:18:02 UTC
OCP v3.9 was installed by 
openshift-ansible-3.9.65-1.git.0.a14009a.el7.noarch

I did not see this task during OCP v3.9 install.
TASK [openshift_control_plane : Ensure cri-tools installed]

Comment 5 Scott Dodson 2019-01-24 02:10:53 UTC
(In reply to Weihua Meng from comment #4)
> OCP v3.9 was installed by 
> openshift-ansible-3.9.65-1.git.0.a14009a.el7.noarch
> 
> I did not see this task during OCP v3.9 install.
> TASK [openshift_control_plane : Ensure cri-tools installed]

Sorry, I meant that this was the task in your upgrade log that installed cri-tools; it pulled the latest version because the package wasn't previously installed.

Taking another look at the 3.9 codebase, cri-tools would only have been installed in 3.9 if the cluster had been upgraded from a release prior to 3.9, which seems like a problem unto itself.

We'll have to look into possibly removing the dependency on cri-tools in the 3.9 to 3.10 upgrade codepath or some other way to make sure that we install a 3.9 version.

The workaround would be to install cri-tools while still running 3.9, before enabling the 3.10 repo.
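
To tell whether a given 3.9 host needs the workaround, one can look at what `rpm -q cri-tools` reports before the 3.10 repo is enabled. Below is a minimal sketch that only interprets the result of that query; the function name is hypothetical, and it assumes the usual `rpm -q` behaviour (exit status 0 with a name-version-release line when the package is installed, non-zero otherwise).

```python
# Hypothetical helper for illustration; it does not run rpm itself.
def cri_tools_installed(rpm_rc, rpm_stdout):
    """Interpret the exit status and stdout of `rpm -q cri-tools`."""
    # Installed packages are reported as cri-tools-<version>-<release>.<arch>.
    return rpm_rc == 0 and rpm_stdout.strip().startswith("cri-tools-")


# Output as shown in this report (taken after the upgrade had pulled the package in).
print(cri_tools_installed(0, "cri-tools-1.0.0-5.rhaos3.10.git2e22a75.el7.x86_64\n"))

# A host where the package is absent; per comment 5 this is the expected state
# on a cluster freshly installed with 3.9, and where the workaround applies.
print(cri_tools_installed(1, "package cri-tools is not installed\n"))
```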

Comment 6 Weihua Meng 2019-01-24 10:23:41 UTC
The workaround works.

The latest released openshift-ansible is openshift-ansible-3.10.89-1.git.0.14ed1cb.el7.noarch
It has the same issue, so this is not a regression.

Comment 7 Russell Teague 2019-01-31 21:08:27 UTC
Testing 3.9 crio cluster upgrades.

Comment 8 Russell Teague 2019-02-07 21:08:57 UTC
Proposed https://github.com/openshift/openshift-ansible/pull/11146

Comment 9 Weihua Meng 2019-02-11 09:58:00 UTC
Fixed.

openshift-ansible-3.10.112-1.git.0.7823ef0.el7.noarch

Comment 10 Scott Dodson 2019-02-25 13:36:21 UTC
*** Bug 1680278 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2019-03-14 02:15:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0405