Bug 1649795

Summary: Upgrade to v3.10 fails with "Check for GlusterFS cluster health"
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 3.10.0
Target Release: 3.10.z
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: unspecified
Status: CLOSED ERRATA
Keywords: Reopened
Type: Bug
Reporter: Radomir Ludva <rludva>
Assignee: Michael Gugino <mgugino>
QA Contact: Rachael <rgeorge>
CC: aos-bugs, jialiu, jokerman, kramdoss, mgugino, mmccomas, pprakash, rgeorge, sdodson, vlaad
Last Closed: 2019-01-10 09:27:10 UTC

Description Radomir Ludva 2018-11-14 14:30:44 UTC
Created attachment 1505704 [details]
ansible-playbook playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_control_plane.yml -vvv

Description of problem:
A customer is trying to upgrade their OCP v3.9 cluster to OCP v3.10, and the upgrade fails during the GlusterFS health check.

Version-Release number of the following components:
Ansible-Version: ansible-2.5.3
openshift-ansible: openshift-ansible-3.10.47
Current-OpenShift-Version: 3.9.43

How reproducible:
I do not yet know the root cause of the problem. It happens in the customer's testing environment.


Actual results:
The upgrade fails at the "Check for GlusterFS cluster health" task. The full -vvv output of the playbook run is attached; the relevant log excerpt, including the module traceback, is under "Additional info" below.

Expected results:
The GlusterFS health check passes and the upgrade proceeds.

Additional info:
Gluster Volumes are up and healthy.

2018-11-13 13:14:00,292 p=42473 u=ansible |  Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
2018-11-13 13:14:00,347 p=42473 u=ansible |  Escalation succeeded
2018-11-13 13:14:00,554 p=42473 u=ansible |  FAILED - RETRYING: Check for GlusterFS cluster health (114 retries left).Result was: {
    "attempts": 7,
    "changed": false,
    "module_stderr": "Traceback (most recent call last):\n  File \"/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py\", line 187, in <module>\n    main()\n  File \"/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py\", line 183, in main\n    run_module()\n  File \"/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py\", line 170, in run_module\n    valid_nodes = get_valid_nodes(module, [oc_bin, oc_conf], exclude_node)\n  File \"/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py\", line 77, in get_valid_nodes\n    res = call_or_fail(module, call_args)\n  File \"/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py\", line 68, in call_or_fail\n    res = subprocess.check_output(call_args).decode('utf-8')\n  File \"/usr/lib64/python2.7/subprocess.py\", line 568, in check_output\n    process = Popen(stdout=PIPE, *popenargs, **kwargs)\n  File \"/usr/lib64/python2.7/subprocess.py\", line 711, in __init__\n    errread, errwrite)\n  File \"/usr/lib64/python2.7/subprocess.py\", line 1327, in _execute_child\n    raise child_exception\nOSError: [Errno 2] No such file or directory\n",
    "module_stdout": "",
    "msg": "MODULE FAILURE",
    "rc": 1,
    "retries": 121
}

Comment 1 Michael Gugino 2018-11-14 16:16:01 UTC
It appears that Ansible doesn't include the module invocation information when an unhandled exception is raised, which is unfortunate.

Most likely, glusterfs_check_containerized is being passed a bad value for the 'oc_bin' parameter for some reason. It seems 'first_master_client_binary' is set to 'oc', which should be in Ansible's PATH. Most likely they have some non-standard path set for root/sudo operations, and that is the problem.
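
For context, here is a rough sketch of how the health-check task presumably wires 'first_master_client_binary' into the module. Only the oc_bin, oc_conf, and exclude_node names are visible in the traceback; the kubeconfig path, the exclude_node variable, the delegate host, and the retry settings are assumptions, not the actual openshift-ansible task:

# Sketch only; parameter names beyond oc_bin/oc_conf/exclude_node are inferred.
- name: Check for GlusterFS cluster health
  glusterfs_check_containerized:
    oc_bin: "{{ first_master_client_binary }}"       # resolves to a bare 'oc' here
    oc_conf: "/etc/origin/master/admin.kubeconfig"   # path is an assumption
    exclude_node: "{{ openshift_hostname }}"         # variable name is an assumption
  delegate_to: "{{ groups.oo_first_master.0 }}"      # group name is an assumption
  register: glusterfs_check
  until: glusterfs_check is succeeded
  retries: 120
  delay: 10

If that bare 'oc' is not installed on the host the task runs against, subprocess.check_output() in the module raises OSError: [Errno 2] No such file or directory, which matches the traceback above.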

Comment 2 Michael Gugino 2018-11-14 17:03:38 UTC
This may be an artifact of switching from a containerized to an RPM-based install: we're not setting the correct first_master_client_binary for 3.9 to 3.10 upgrades.

Comment 3 Michael Gugino 2018-11-14 17:06:49 UTC
Potential workaround for now: install the atomic-openshift-clients package on the first master node; you should then be able to pass this step successfully.
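
For example, a minimal ad-hoc play that applies the workaround might look like the following; the 'masters' group name depends on your inventory, so treat it as an assumption:

- hosts: masters[0]        # first host of the (hypothetical) 'masters' inventory group
  become: true
  tasks:
    - name: Install the oc client so the GlusterFS health check can run
      package:
        name: atomic-openshift-clients
        state: present

Running 'yum install atomic-openshift-clients' directly on that node accomplishes the same thing.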

Comment 4 Radomir Ludva 2018-11-16 15:00:13 UTC
The installation of atomic-openshift-clients was successful and the problem is solved. Thank you.

Comment 5 Michael Gugino 2018-11-16 15:25:38 UTC
(In reply to Radomir Ludva from comment #4)
> The installation of atomic-openshift-clients was successful and the problem
> is solved. Thank you.

This actually is a bug in openshift-ansible, but thank you for confirming the workaround has the desired effect.

Comment 6 Jose A. Rivera 2018-11-27 17:04:12 UTC
Michael, what would be the preferred solution for this?

Comment 7 Michael Gugino 2018-11-27 17:11:26 UTC
(In reply to Jose A. Rivera from comment #6)
> Michael, what would be the preferred solution for this?

This is only going to affect 3.10, so I think adding something early in the upgrade (after versioning is done) to install the atomic-openshift-clients{{ openshift_pkg_version }} package just on the first master should suffice.

This assumes the 3.10 client package will work with a 3.9 cluster. @sdodson, thoughts?
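
Roughly, something like the following task (a sketch only, not the actual change; the group name and the handling of openshift_pkg_version are assumptions):

# Sketch of the proposed task, not the final PR content.
- name: Ensure the client package is installed on the first master
  package:
    name: "atomic-openshift-clients{{ openshift_pkg_version | default('') }}"
    state: present
  delegate_to: "{{ groups.oo_first_master.0 }}"   # group name is an assumption
  run_once: true

With openshift_pkg_version set to something like '-3.10.47', this resolves to the versioned package name; when it is unset, the latest available client package is installed.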

Comment 8 Scott Dodson 2018-11-27 17:58:50 UTC
The client should be forwards and backwards compatible for at least one version, so no worries there. I'd just make sure it's installed during this play/role.

Comment 9 Michael Gugino 2018-11-27 21:53:22 UTC
PR created in 3.10:

https://github.com/openshift/openshift-ansible/pull/10783

This only affects 3.9 to 3.10 upgrades.

Comment 16 errata-xmlrpc 2019-01-10 09:27:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0026