Bug 1649795
Summary: | Upgrade to v3.10 fails with "Check for GlusterFS cluster health" | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Radomir Ludva <rludva>
Component: | Cluster Version Operator | Assignee: | Michael Gugino <mgugino>
Status: | CLOSED ERRATA | QA Contact: | Rachael <rgeorge>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 3.10.0 | CC: | aos-bugs, jialiu, jokerman, kramdoss, mgugino, mmccomas, pprakash, rgeorge, sdodson, vlaad
Target Milestone: | --- | Keywords: | Reopened
Target Release: | 3.10.z | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-01-10 09:27:10 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
It appears that Ansible doesn't include the invocation information if an unhandled exception is raised, which is unfortunate. Most likely glusterfs_check_containerized is being passed a bad value for the 'oc_bin' parameter for some reason. It seems 'first_master_client_binary' is set to 'oc', which should be in Ansible's path. Most likely they have some non-standard path set for root / sudo operations, and that is the problem.

This may be an artifact of switching from a containerized to an RPM-based install: we're not setting the correct first_master_client_binary for 3.9 to 3.10 upgrades.

Potential workaround for now: install the atomic-openshift-clients package on the first master node; then you should be able to pass this step successfully.

The installation of atomic-openshift-clients was successful and the problem is solved. Thank you.

(In reply to Radomir Ludva from comment #4)
> The installation of atomic-openshift-clients was successful and the problem
> is solved. Thank you.

This actually is a bug in openshift-ansible, but thank you for confirming the workaround has the desired effect.

Michael, what would be the preferred solution for this?

(In reply to Jose A. Rivera from comment #6)
> Michael, what would be the preferred solution for this?

This is only going to affect 3.10, so I think adding something early in the upgrade (after versioning is done) to install the atomic-openshift-clients{{ openshift_pkg_version }} package just on the first master should suffice (a sketch of such a task follows this thread). This assumes the 3.10 client package will work with a 3.9 cluster. @sdodson, thoughts?

The client should be forwards and backwards compatible by at least one version, so no worries there. I'd just make sure it's installed during this play/role.

PR created in 3.10: https://github.com/openshift/openshift-ansible/pull/10783

This only affects 3.9 to 3.10 upgrades.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0026
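For reference, a minimal sketch of the kind of task Michael describes, assuming the generic `package` module and the `oo_first_master` host group; this is not the code merged in PR 10783, and the task placement and version-pinning details are assumptions:

```yaml
# Hedged sketch only: install the client package on the first master early in
# the 3.9 -> 3.10 upgrade, so 'oc' exists before the GlusterFS health check runs.
# The group name and pinning behavior are assumptions, not the verbatim fix.
- name: Ensure atomic-openshift-clients is installed on the first master
  package:
    # openshift_pkg_version, when set, is assumed to carry its leading
    # separator (e.g. '-3.10.47'), so plain concatenation pins the version.
    name: "atomic-openshift-clients{{ openshift_pkg_version | default('') }}"
    state: present
  delegate_to: "{{ groups.oo_first_master.0 }}"
  run_once: true
```

Pinning to openshift_pkg_version keeps the client at the level the upgrade is installing, and the one-version forwards/backwards compatibility sdodson notes is what makes it safe to run this against a cluster that is still at 3.9.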
Created attachment 1505704 [details]
ansible-playbook playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_control_plane.yml -vvv

Description of problem:
A customer is upgrading from OCP v3.9 to OCP v3.10 and the upgrade fails during the GlusterFS health check.

Version-Release number of the following components:
Ansible version: ansible-2.5.3
openshift-ansible: openshift-ansible-3.10.47
Current OpenShift version: 3.9.43

How reproducible:
Root cause not yet known; this happens in the customer's testing environment.

Actual results:
The "Check for GlusterFS cluster health" task fails repeatedly with MODULE FAILURE (see the log excerpt under Additional info).

Expected results:
The health check passes.

Additional info:
Gluster volumes are up and healthy.

```
2018-11-13 13:14:00,292 p=42473 u=ansible | Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
2018-11-13 13:14:00,347 p=42473 u=ansible | Escalation succeeded
2018-11-13 13:14:00,554 p=42473 u=ansible | FAILED - RETRYING: Check for GlusterFS cluster health (114 retries left). Result was:
{
    "attempts": 7,
    "changed": false,
    "module_stderr": (traceback, decoded below),
    "module_stdout": "",
    "msg": "MODULE FAILURE",
    "rc": 1,
    "retries": 121
}

Decoded module_stderr:
Traceback (most recent call last):
  File "/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py", line 187, in <module>
    main()
  File "/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py", line 183, in main
    run_module()
  File "/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py", line 170, in run_module
    valid_nodes = get_valid_nodes(module, [oc_bin, oc_conf], exclude_node)
  File "/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py", line 77, in get_valid_nodes
    res = call_or_fail(module, call_args)
  File "/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py", line 68, in call_or_fail
    res = subprocess.check_output(call_args).decode('utf-8')
  File "/usr/lib64/python2.7/subprocess.py", line 568, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
```
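To make the failure mode concrete: the traceback shows the module building a command line from its oc_bin and oc_conf parameters and handing it to subprocess.check_output, which raises OSError [Errno 2] when the binary cannot be found. Below is a hedged sketch of how the health-check task wires those parameters together; the oc_bin/oc_conf names come from the traceback above, while the kubeconfig path and the retry settings are illustrative assumptions, not the verbatim task from openshift-ansible:

```yaml
# Illustrative sketch, not the actual task from openshift-ansible.
# oc_bin and oc_conf are real module parameters (visible in the traceback);
# the other values here are assumptions for illustration.
- name: Check for GlusterFS cluster health
  glusterfs_check_containerized:
    # first_master_client_binary resolved to plain 'oc' here; with no
    # atomic-openshift-clients package installed there is no 'oc' on PATH,
    # so Popen raises OSError: [Errno 2] -> the MODULE FAILURE above.
    oc_bin: "{{ first_master_client_binary }}"
    oc_conf: "/etc/origin/master/admin.kubeconfig"  # assumed kubeconfig path
  register: glusterfs_health
  until: glusterfs_health is succeeded
  retries: 120   # the log shows roughly 120 retries; exact values assumed
  delay: 10
```

This also explains why the workaround unblocks the task: installing atomic-openshift-clients on the first master puts an 'oc' executable on the PATH that subprocess can actually find.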