Created attachment 1505704
ansible-playbook playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_control_plane.yml -vvv

Description of problem:
A customer is trying to upgrade OCP v3.9 to OCP v3.10, and the upgrade fails during the GlusterFS health check.

Version-Release number of the following components:
Ansible-Version: ansible-2.5.3
openshift-ansible: openshift-ansible-3.10.47
Current-OpenShift-Version: 3.9.43

How reproducible:
I do not yet know the root cause of the problem. This happens in the customer's testing environment.

Actual results:
The health check task fails with the traceback shown under Additional info.

Expected results:
The health check passes.

Additional info:
Gluster volumes are up and healthy.

2018-11-13 13:14:00,292 p=42473 u=ansible | Using module file /usr/share/ansible/openshift-ansible/roles/lib_utils/library/glusterfs_check_containerized.py
2018-11-13 13:14:00,347 p=42473 u=ansible | Escalation succeeded
2018-11-13 13:14:00,554 p=42473 u=ansible | FAILED - RETRYING: Check for GlusterFS cluster health (114 retries left).Result was: {
    "attempts": 7,
    "changed": false,
    "module_stderr": "Traceback (most recent call last):\n File \"/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py\", line 187, in <module>\n main()\n File \"/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py\", line 183, in main\n run_module()\n File \"/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py\", line 170, in run_module\n valid_nodes = get_valid_nodes(module, [oc_bin, oc_conf], exclude_node)\n File \"/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py\", line 77, in get_valid_nodes\n res = call_or_fail(module, call_args)\n File \"/tmp/ansible_wfi3mJ/ansible_module_glusterfs_check_containerized.py\", line 68, in call_or_fail\n res = subprocess.check_output(call_args).decode('utf-8')\n File \"/usr/lib64/python2.7/subprocess.py\", line 568, in check_output\n process = Popen(stdout=PIPE, *popenargs, **kwargs)\n File \"/usr/lib64/python2.7/subprocess.py\", line 711, in __init__\n errread, errwrite)\n File \"/usr/lib64/python2.7/subprocess.py\", line 1327, in _execute_child\n raise child_exception\nOSError: [Errno 2] No such file or directory\n",
    "module_stdout": "",
    "msg": "MODULE FAILURE",
    "rc": 1,
    "retries": 121
}
It appears that Ansible doesn't include the invocation information when an unhandled exception is raised, which is unfortunate. Most likely, glusterfs_check_containerized is being passed a bad value for the 'oc_bin' parameter. It seems 'first_master_client_binary' is set to 'oc', which should be in Ansible's path. Most likely, they have a non-standard path set for root / sudo operations, and that is the problem.
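For anyone trying to reproduce the symptom outside of the playbook, here is a minimal sketch of what the traceback suggests is happening: subprocess.check_output() raises OSError with errno 2 when the executable named in the first argument cannot be found. The helper names (call_or_explain, reproduce_missing_binary) and the placeholder binary name are hypothetical and are not part of openshift-ansible; the sketch assumes Python 3, whereas the module in the log runs under Python 2.7.

```python
import errno
import shutil
import subprocess


def call_or_explain(call_args):
    """Run a command and return its decoded stdout.

    Hypothetical helper, not the module's real call_or_fail: it reports a
    missing executable with a readable message instead of a bare traceback.
    """
    binary = call_args[0]
    # shutil.which() looks the name up on PATH, much as Popen would.
    if shutil.which(binary) is None:
        raise RuntimeError("executable not found on PATH: %s" % binary)
    return subprocess.check_output(call_args).decode("utf-8")


def reproduce_missing_binary(name="definitely-not-a-real-oc-binary"):
    """Return the errno raised when the named executable does not exist."""
    try:
        subprocess.check_output([name, "version"])
    except OSError as exc:
        # Same error class and errno as in the reported traceback.
        return exc.errno
    return None
```

If reproduce_missing_binary() returns errno.ENOENT for the value passed as 'oc_bin', the binary is not resolvable from the environment the module runs in, which would be consistent with the PATH theory above.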
This may be an artifact of switching from a containerized to an RPM-based install: we're not setting the correct first_master_client_binary for 3.9 to 3.10 upgrades.
Potential workaround for now: install the atomic-openshift-clients package on the first master node. You should then be able to pass this step successfully.
The installation of atomic-openshift-clients was successful and the problem is solved. Thank you.
(In reply to Radomir Ludva from comment #4)
> The installation of atomic-openshift-clients was successful and the problem
> is solved. Thank you.

This actually is a bug in openshift-ansible, but thank you for confirming that the workaround has the desired effect.
Michael, what would be the preferred solution for this?
(In reply to Jose A. Rivera from comment #6)
> Michael, what would be the preferred solution for this?

This is only going to affect 3.10, so I think adding something early in the upgrade (after versioning is done) to install the atomic-openshift-clients{{ openshift_pkg_version}} package just on the first master should suffice. This assumes the 3.10 client package will work with a 3.9 cluster.

@sdodson, thoughts?
The client should be forwards and backwards compatible at least one version. So no worries there, I'd just make sure it's installed during this play/role.
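As an illustration of the "at least one version" rule, here is a small sketch that checks whether a client and a cluster are within one minor release of each other. The function name within_one_minor is hypothetical, and the sketch assumes plain dotted version strings such as "3.10" or "3.9.43"; it is not how openshift-ansible itself compares versions.

```python
def within_one_minor(client, server):
    """Return True if both versions share a major release and their
    minor releases differ by at most one (illustrative rule only)."""
    # Only the first two dotted components (major, minor) are compared.
    c_major, c_minor = (int(part) for part in client.split(".")[:2])
    s_major, s_minor = (int(part) for part in server.split(".")[:2])
    return c_major == s_major and abs(c_minor - s_minor) <= 1
```

Under that rule, a 3.10 client against the 3.9.43 cluster from this report is within the supported skew, while a 3.11 client against the same cluster would not be.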
PR Created in 3.10. https://github.com/openshift/openshift-ansible/pull/10783 This only affects 3.9 to 3.10 upgrades.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0026