Bug 1668649 - [3.10]upgrade failed due to crio client and server mismatch
Summary: [3.10]upgrade failed due to crio client and server mismatch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.10.z
Assignee: Russell Teague
QA Contact: Weihua Meng
URL:
Whiteboard:
Duplicates: 1680278
Depends On:
Blocks:
Reported: 2019-01-23 09:02 UTC by Weihua Meng
Modified: 2019-03-14 02:15 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Due to a breaking change (an updated API endpoint) between crio 1.9 and 1.10, crictl 1.10 does not work with older versions of the crio service.
Consequence: During upgrades from openshift-ansible 3.9, the cri-tools package is installed or updated to 1.10 before the image pre-pull tasks run, so those tasks fail against the not-yet-upgraded crio service.
Fix: The pre-pull tasks are not critical to the upgrade process, and errors from these tasks are now ignored, allowing the upgrade to progress. Images are pulled during the upgrade after the crio service is upgraded.
Result: Upgrades from 3.9 to 3.10 complete as expected.
Clone Of:
Environment:
Last Closed: 2019-03-14 02:15:34 UTC
Target Upstream Version:
Embargoed:


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2019:0405  0        None      None    None     2019-03-14 02:15:41 UTC

Description Weihua Meng 2019-01-23 09:02:08 UTC
Description of problem:
upgrade failed due to crio client and server mismatch 

Version-Release number of the following components:
openshift-ansible-3.10.101-1.git.0.5f32198.el7.noarch


How reproducible:
Always

Steps to Reproduce:
1. install OCP v3.9 with cri-o container runtime.
2. upgrade to v3.10

Actual results:
upgrade failed.

TASK [openshift_node : Check that node image is present] ***********************
task path: /home/slave2/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/roles/openshift_node/tasks/prepull.yml:2
Using module file /usr/lib/python2.7/site-packages/ansible/modules/commands/command.py
<ec2-3-90-247-103.compute-1.amazonaws.com> ESTABLISH SSH CONNECTION FOR USER: root
<ec2-3-90-247-103.compute-1.amazonaws.com> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o 'IdentityFile="/home/slave2/workspace/Run-Ansible-Playbooks-Nextge/private/config/keys/libra.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/home/slave2/.ansible/cp/%C ec2-3-90-247-103.compute-1.amazonaws.com '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
<ec2-3-90-247-103.compute-1.amazonaws.com> (1, '\n{"changed": true, "end": "2019-01-23 02:36:13.533830", "stdout": "", "cmd": ["crictl", "images", "-q", "registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10"], "failed": true, "delta": "0:00:00.016518", "stderr": "W0123 02:36:13.531278   67461 util_unix.go:75] Using \\"/var/run/crio/crio.sock\\" as endpoint is deprecated, please consider using full url format \\"unix:///var/run/crio/crio.sock\\".\\ntime=\\"2019-01-23T02:36:13-05:00\\" level=fatal msg=\\"listing images failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService\\" ", "rc": 1, "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "_raw_params": "crictl images -q registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10", "removes": null, "creates": null, "chdir": null, "stdin": null}}, "start": "2019-01-23 02:36:13.517312", "msg": "non-zero return code"}\n', '')
fatal: [ec2-3-90-247-103.compute-1.amazonaws.com]: FAILED! => {
    "changed": true, 
    "cmd": [
        "crictl", 
        "images", 
        "-q", 
        "registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10"
    ], 
    "delta": "0:00:00.016518", 
    "end": "2019-01-23 02:36:13.533830", 
    "failed": true, 
    "invocation": {
        "module_args": {
            "_raw_params": "crictl images -q registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "stdin": null, 
            "warn": true
        }
    }, 
    "msg": "non-zero return code", 
    "rc": 1, 
    "start": "2019-01-23 02:36:13.517312", 
    "stderr": "W0123 02:36:13.531278   67461 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".\ntime=\"2019-01-23T02:36:13-05:00\" level=fatal msg=\"listing images failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService\" ", 
    "stderr_lines": [
        "W0123 02:36:13.531278   67461 util_unix.go:75] Using \"/var/run/crio/crio.sock\" as endpoint is deprecated, please consider using full url format \"unix:///var/run/crio/crio.sock\".", 
        "time=\"2019-01-23T02:36:13-05:00\" level=fatal msg=\"listing images failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService\" "
    ], 
    "stdout": "", 
    "stdout_lines": []
}
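
The decisive part of the output above is the gRPC status in stderr: `Unimplemented` together with `unknown service runtime.v1alpha2.ImageService`, which suggests the running crio service does not expose the CRI API version that the installed crictl requests. A minimal Python sketch, assuming the task's stderr is available as a string, that separates this kind of mismatch from other crictl failures. The function name and the regular expression are illustrative only and are not part of openshift-ansible.

```python
import re

# Matches the gRPC error crictl prints when the server has not registered
# the CRI service the client asked for, for example:
#   rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService
_UNKNOWN_SERVICE = re.compile(
    r"code = Unimplemented desc = unknown service (runtime\.\w+\.\w+)"
)


def find_cri_api_mismatch(stderr):
    """Return the CRI service name the server rejected, or None.

    Hypothetical helper for illustration; it only inspects text and does
    not talk to crio.
    """
    match = _UNKNOWN_SERVICE.search(stderr)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        'time="2019-01-23T02:36:13-05:00" level=fatal msg="listing images '
        "failed: rpc error: code = Unimplemented desc = unknown service "
        'runtime.v1alpha2.ImageService" '
    )
    print(find_cri_api_mismatch(sample))
```

A non-None result points at a client/server version problem rather than a missing image, which is the distinction that matters for this bug.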

info when upgrade failed:
[root@ip-172-18-31-212 ~]# crictl --version
crictl version 1.0.0-beta.0
[root@ip-172-18-31-212 ~]# rpm -q cri-o
cri-o-1.9.14-1.git4e220eb.el7.x86_64
[root@ip-172-18-31-212 ~]# rpm -q cri-tools 
cri-tools-1.0.0-5.rhaos3.10.git2e22a75.el7.x86_64
[root@ip-172-18-31-212 ~]# oc get node -owide
NAME                            STATUS    ROLES     AGE       VERSION             EXTERNAL-IP      OS-IMAGE                                      KERNEL-VERSION              CONTAINER-RUNTIME
ip-172-18-11-104.ec2.internal   Ready     master    1h        v1.9.1+a0ce1bc657   34.229.101.173   Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-12-197.ec2.internal   Ready     <none>    1h        v1.9.1+a0ce1bc657   54.166.154.56    Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-15-193.ec2.internal   Ready     master    1h        v1.9.1+a0ce1bc657   54.224.233.49    Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-17-45.ec2.internal    Ready     <none>    1h        v1.9.1+a0ce1bc657   3.90.205.150     Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-25-141.ec2.internal   Ready     <none>    1h        v1.9.1+a0ce1bc657   54.160.180.155   Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-3-134.ec2.internal    Ready     compute   1h        v1.9.1+a0ce1bc657   52.203.131.75    Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-30-239.ec2.internal   Ready     compute   1h        v1.9.1+a0ce1bc657   34.228.55.131    Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14
ip-172-18-31-212.ec2.internal   Ready     master    1h        v1.9.1+a0ce1bc657   3.90.247.103     Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.1.3.el7.x86_64   cri-o://1.9.14

Expected results:
upgrade succeeded.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 Scott Dodson 2019-01-23 20:16:35 UTC
What version of the installer was used to install 3.9? The upgrade playbooks only assert that cri-tools is installed; it should have been installed when 3.9 was installed with cri-o, but in your case it was installed during the upgrade by this task:

TASK [openshift_control_plane : Ensure cri-tools installed] ********************

Comment 4 Weihua Meng 2019-01-24 01:18:02 UTC
OCP v3.9 was installed by 
openshift-ansible-3.9.65-1.git.0.a14009a.el7.noarch

I did not see this task during OCP v3.9 install.
TASK [openshift_control_plane : Ensure cri-tools installed]

Comment 5 Scott Dodson 2019-01-24 02:10:53 UTC
(In reply to Weihua Meng from comment #4)
> OCP v3.9 was installed by 
> openshift-ansible-3.9.65-1.git.0.a14009a.el7.noarch
> 
> I did not see this task during OCP v3.9 install.
> TASK [openshift_control_plane : Ensure cri-tools installed]

Sorry, I meant that this was the task from your upgrade log that installed cri-tools, and it pulled the latest version because the package wasn't previously installed.

Taking another look at the 3.9 codebase, cri-tools would've only been installed in 3.9 if the cluster had been upgraded from a release prior to 3.9, which seems like a problem unto itself.

We'll have to look into possibly removing the dependency on cri-tools in the 3.9 to 3.10 upgrade codepath, or finding some other way to make sure that we install a 3.9 version.

A workaround would be to install cri-tools while still running 3.9, before enabling the 3.10 repo.
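
A small pre-upgrade check can show which hosts need this workaround. The Python sketch below assumes it runs on a host where the `rpm` command is available and relies on `rpm -q <package>` exiting non-zero when the package is not installed. The function names are hypothetical, and the code only reports; it does not install anything.

```python
import subprocess


def is_installed(package, run=subprocess.run):
    """Return True if `rpm -q <package>` exits 0, meaning it is installed."""
    result = run(
        ["rpm", "-q", package],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0


def needs_cri_tools_workaround(run=subprocess.run):
    """True when cri-o is present but cri-tools is missing.

    That is the state described in this bug: the upgrade would then install
    cri-tools from the newer repo before the crio service is upgraded.
    The `run` parameter exists only so the check can be exercised without rpm.
    """
    return is_installed("cri-o", run) and not is_installed("cri-tools", run)
```

Calling `needs_cri_tools_workaround()` on each node while it is still on 3.9 indicates whether cri-tools should be installed before the 3.10 repo is enabled.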

Comment 6 Weihua Meng 2019-01-24 10:23:41 UTC
The workaround works.

The latest released openshift-ansible is openshift-ansible-3.10.89-1.git.0.14ed1cb.el7.noarch.
It has the same issue, so this is not a regression.

Comment 7 Russell Teague 2019-01-31 21:08:27 UTC
Testing 3.9 crio cluster upgrades.

Comment 8 Russell Teague 2019-02-07 21:08:57 UTC
Proposed https://github.com/openshift/openshift-ansible/pull/11146
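
Going by the Doc Text on this bug, the fix lets the upgrade continue when the non-critical image pre-pull check fails. The following Python sketch illustrates that control flow only; it is not the change made in the pull request, and the function name is hypothetical. The `crictl images -q <image>` command is the one shown in the failing task above.

```python
import subprocess


def image_present(image, run=subprocess.run):
    """Best-effort check that `image` is already in the crio image store.

    Returns True or False when crictl answers, and None when the check
    itself fails (for example a crictl/crio API mismatch), so the caller
    can carry on instead of aborting.
    """
    try:
        result = run(
            ["crictl", "images", "-q", image],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except OSError:
        # crictl is not installed or cannot be executed.
        return None
    if result.returncode != 0:
        # Treat a failed query as "unknown" rather than as a fatal error.
        return None
    # With -q, crictl prints image IDs only; empty output means no match.
    return bool(result.stdout.strip())
```

A caller would treat `None` the same as `False` and leave the pull to a later stage, after the crio service has been upgraded.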

Comment 9 Weihua Meng 2019-02-11 09:58:00 UTC
Fixed.

openshift-ansible-3.10.112-1.git.0.7823ef0.el7.noarch

Comment 10 Scott Dodson 2019-02-25 13:36:21 UTC
*** Bug 1680278 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2019-03-14 02:15:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0405

