Bug 1634004

Summary: Upgrade to OCP 3.9 fails at task "Fail when OpenShift is not installed"
Product: OpenShift Container Platform Reporter: Joel Rosental R. <jrosenta>
Component: Cluster Version Operator Assignee: Patrick Dillon <padillon>
Status: CLOSED ERRATA QA Contact: Johnny Liu <jialiu>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.9.0 CC: aos-bugs, jkaur, jokerman, jrosenta, mmccomas, padillon, wmeng
Target Milestone: ---   
Target Release: 3.9.z   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: openshift_facts uses the default ose images to determine the version in a containerized install. When a customer uses custom images and an internal registry, the default images cannot be pulled.
Consequence: openshift.common.version is left empty and the upgrade fails on the task "Fail when OpenShift is not installed".
Fix: If a custom image name is set, it is passed to openshift_facts to use.
Result: openshift.common.version is set in containerized installs with a registry mirror, and installation succeeds.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-12-13 19:27:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Joel Rosental R. 2018-09-28 12:00:57 UTC
Description of problem:

Customer reports that an upgrade of a containerized installation of OCP from version 3.7.57 to 3.9.41 fails at the task "Fail when OpenShift is not installed":

TASK [Fail when OpenShift is not installed] **********************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/pre/verify_upgrade_targets.yml:2
skipping: [master1.example.com] => {
    "changed": false, 
    "skip_reason": "Conditional result was False", 
    "skipped": true
}
skipping: [master2.example.com] => {
    "changed": false, 
    "skip_reason": "Conditional result was False", 
    "skipped": true
}
skipping: [master3.example.com] => {
    "changed": false, 
    "skip_reason": "Conditional result was False", 
    "skipped": true
}
fatal: [node1.example.com]: FAILED! => {
    "changed": false, 
    "failed": true, 
    "msg": "Verify OpenShift is already installed"
}
fatal: [node2.example.com]: FAILED! => {
    "changed": false, 
    "failed": true, 
    "msg": "Verify OpenShift is already installed"
}

[...]

Failure summary:


  1. Hosts:    node1.example.com, node2.example.com
     Play:     Verify upgrade targets
     Task:     Fail when OpenShift is not installed
     Message:  Verify OpenShift is already installed


According to https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/openshift_facts/library/openshift_facts.py#L903-L921, this appears to be where the fact is defined. However, /etc/sysconfig/atomic-openshift-node seems to have IMAGE_VERSION set properly:

# cat /etc/sysconfig/atomic-openshift-node
OPTIONS=--loglevel=0
CONFIG_FILE=/etc/origin/node/node-config.yaml
IMAGE_VERSION=v3.7.57
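
For illustration, the check above (reading IMAGE_VERSION from a sysconfig-style file) can be sketched in Python. This is a minimal sketch, not the code in openshift_facts.py: the helper name `parse_sysconfig` is hypothetical, and the sample text is the file content shown above.

```python
def parse_sysconfig(text):
    """Parse KEY=VALUE lines, skipping blank lines and comments.

    Hypothetical helper for illustration; not taken from openshift_facts.py.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values such as "--loglevel=0" survive.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


# Content of /etc/sysconfig/atomic-openshift-node as reported above.
SAMPLE = """\
OPTIONS=--loglevel=0
CONFIG_FILE=/etc/origin/node/node-config.yaml
IMAGE_VERSION=v3.7.57
"""

config = parse_sysconfig(SAMPLE)
image_version = config.get("IMAGE_VERSION", "")
print(image_version.lstrip("v"))
```

An empty result from a lookup like this would correspond to the "version not set" condition that the failing task guards against.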

Customer tried removing the facts from /etc/ansible/facts.d and retrying the upgrade procedure, in case a fact file was corrupted, but got the same outcome.

Version-Release number of the following components:
# rpm -qa | grep atomic
atomic-openshift-clients-3.9.41-1.git.0.67432b0.el7.x86_64
bash-4.2$ ansible --version
ansible 2.4.6.0
  config file = /opt/myuser/home/openshift-leun/.ansible.cfg
  configured module search path = [u'/opt/myuser/home/openshift-leun/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /bin/ansible
  python version = 2.7.5 (default, May 31 2018, 09:41:32) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]


How reproducible:
I could not reproduce it by myself.

Steps to Reproduce:
1. I could not reproduce it by myself.


Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:
The fact should be defined correctly and the upgrade should finish smoothly.

Comment 13 Johnny Liu 2018-10-26 16:26:11 UTC
The root cause is that openshift_facts gets the version from the hard-coded "openshift/ose" image.

I could reproduce this bug with openshift-ansible-3.9.43-1.git.0.d0bc600.el7.noarch when upgrading a containerized cluster from 3.7.64 to 3.9.41. The upgrade of the pure node failed at this step:

TASK [Fail when OpenShift is not installed] ************************************
skipping: [host-8-251-31.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "skip_reason": "Conditional result was False"}
fatal: [host-8-251-237.host.centralci.eng.rdu2.redhat.com]: FAILED! => {"changed": false, "msg": "Verify OpenShift is already installed"}
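
The root cause described in this comment, and the fix summarized in the Doc Text, amount to preferring a custom image name over the hard-coded default. A minimal sketch under stated assumptions: the names `DEFAULT_IMAGE` and `pick_version_image` are hypothetical and do not come from openshift-ansible, and the registry host is a placeholder.

```python
# Hard-coded image name as quoted in this comment.
DEFAULT_IMAGE = "openshift/ose"


def pick_version_image(custom_image=None, default_image=DEFAULT_IMAGE):
    """Return the image used for version detection.

    Hypothetical helper: prefer a custom image when one is configured,
    otherwise fall back to the default.
    """
    if custom_image:
        return custom_image
    return default_image


# Without a custom image the default is used, which cannot be pulled
# when only an internal registry is reachable.
print(pick_version_image())
# With a custom image (placeholder registry host) that image is used instead.
print(pick_version_image("registry.example.com:5000/testing/ocp3/ose"))
```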

Comment 14 Johnny Liu 2018-10-26 17:04:07 UTC
Verified this bug with openshift-ansible-3.9.48-1.git.0.09f6c01.el7.noarch: PASS.

openshift_image_tag=v3.9.41
openshift_release=v3.9
openshift_cockpit_deployer_prefix=host-8-241-45.host.centralci.eng.rdu2.redhat.com:5000/testing/ocp3/
oreg_url=host-8-241-45.host.centralci.eng.rdu2.redhat.com:5000/testing/ocp3/ose-${component}:${version}
openshift_docker_additional_registries=host-8-241-45.host.centralci.eng.rdu2.redhat.com:5000
openshift_docker_insecure_registries=host-8-241-45.host.centralci.eng.rdu2.redhat.com:5000
osm_etcd_image=host-8-241-45.host.centralci.eng.rdu2.redhat.com:5000/rhel7/etcd
osm_image=host-8-241-45.host.centralci.eng.rdu2.redhat.com:5000/testing/ocp3/ose
osn_image=host-8-241-45.host.centralci.eng.rdu2.redhat.com:5000/testing/ocp3/node
osn_ovs_image=host-8-241-45.host.centralci.eng.rdu2.redhat.com:5000/testing/ocp3/openvswitch
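
The oreg_url value above contains `${component}` and `${version}` placeholders. As a rough illustration of how such a template expands, here is a sketch using Python's `string.Template`. This is not how the installer performs the substitution; the registry host is a placeholder, and the component value "pod" is an example chosen for illustration.

```python
from string import Template

# Same shape as the oreg_url setting above, with a placeholder registry host.
OREG_URL = "registry.example.com:5000/testing/ocp3/ose-${component}:${version}"


def expand_image(template, component, version):
    """Fill in the ${component} and ${version} placeholders."""
    return Template(template).substitute(component=component, version=version)


# "pod" is an example component name; v3.9.41 is the tag from openshift_image_tag.
print(expand_image(OREG_URL, "pod", "v3.9.41"))
```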


PLAY [Verify upgrade targets] **************************************************

TASK [Gathering Facts] *********************************************************
ok: [host-8-251-237.host.centralci.eng.rdu2.redhat.com]
ok: [host-8-251-31.host.centralci.eng.rdu2.redhat.com]

TASK [include_tasks] ***********************************************************
included: /home/slave2/workspace/Run-Ansible-Playbooks-Nextge/private-openshift-ansible/playbooks/common/openshift-cluster/upgrades/pre/verify_upgrade_targets.yml for host-8-251-31.host.centralci.eng.rdu2.redhat.com, host-8-251-237.host.centralci.eng.rdu2.redhat.com

TASK [Fail when OpenShift is not installed] ************************************
skipping: [host-8-251-31.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "skip_reason": "Conditional result was False"}
skipping: [host-8-251-237.host.centralci.eng.rdu2.redhat.com] => {"changed": false, "skip_reason": "Conditional result was False"}

After upgrade, check:
[root@host-172-16-122-61 ~]# oc get node
NAME            STATUS    ROLES     AGE       VERSION
172.16.122.35   Ready     compute   32m       v1.9.1+a0ce1bc657
172.16.122.61   Ready     master    34m       v1.9.1+a0ce1bc657
[root@host-172-16-122-61 ~]# oc version
oc v3.9.41
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.16.122.61:8443
openshift v3.9.41
kubernetes v1.9.1+a0ce1bc657

Comment 17 errata-xmlrpc 2018-12-13 19:27:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3748