Bug 1891816 - [UPI] [OSP] control-plane.yml provisioning playbook fails on OSP 16.1
Summary: [UPI] [OSP] control-plane.yml provisioning playbook fails on OSP 16.1
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Matthew Booth
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On: 1899192
Blocks:
 
Reported: 2020-10-27 12:48 UTC by Jon Uriarte
Modified: 2021-02-24 15:29 UTC
CC: 9 users

Fixed In Version: python-openstacksdk-0.36.4-1.20201113235938.el8ost
Doc Type: Bug Fix
Doc Text:
Cause: A bug in openstacksdk caused a failure when requesting server groups on OSP 16. Consequence: The UPI playbook control-plane.yaml fails in the "Create the Control Plane servers" task with a stack trace. Fix: Update openstacksdk on the bastion host executing the UPI ansible tasks to at least python-openstacksdk-0.36.4-1.20201113235938.el8ost. Result: The UPI playbook succeeds. N.B. We didn't really fix this in OCP 4.7: we fixed it in OpenStack. The fixed openstacksdk package is not yet released, so customers hitting this will have to request a hotfix for now.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:28:35 UTC
Target Upstream Version:
Embargoed:
rlobillo: needinfo-




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 749381 0 None MERGED Don't set list_type to dict for server groups. 2021-02-19 19:01:14 UTC
OpenStack gerrit 763121 0 None MERGED Don't set list_type to dict for server groups. 2021-02-19 19:01:14 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:29:09 UTC

Description Jon Uriarte 2020-10-27 12:48:57 UTC
Description of problem:

The playbook 'control-plane.yml' for provisioning the master nodes in 4.5 and 4.6 UPI fails when the underlying OSP is 16.1.
It works fine in OSP 13.


Version-Release number of selected component (if applicable):
 OCP 4.5.0-0.nightly-2020-10-23-050031
 OSP RHOS-16.1-RHEL-8-20201021.n.0

The playbooks are being executed from a bastion host with:
 ansible 2.9.14
 python3-openstacksdk-0.36.3
 python3-openstackclient-4.0.0


How reproducible: always


Steps to Reproduce:
1. Install OSP 16.1 and create a bastion host (it's not a must for reproducing the issue, it can be run from the undercloud as well)
2. Run the provisioning playbooks for UPI as described in [1]
   ansible-playbook -i "/home/cloud-user/ostest/inventory.yaml" "/home/cloud-user/ostest/control-plane.yaml"

Actual results:

TASK [Create the Control Plane servers] ****************************************
failed: [localhost] (item=[0, 'ostest-vpwdz-master']) => {"ansible_loop_var": "item", "changed": false, "item": [0, "ostest-vpwdz-master"], "module_stderr": "/usr/lib/python3.6/site-packages/openstack/config/cloud_region.py:432: UserWarning: You have a configured API_VERSION with 'latest' in it. In the context of openstacksdk this doesn't make any sense.\n
      \"You have a configured API_VERSION with 'latest' in\"\n
    Traceback (most recent call last):\n
      File \"/home/cloud-user/.ansible/tmp/ansible-tmp-1603265045.369726-22447-253278130374381/AnsiballZ_os_server.py\", line 102, in <module>\n
        _ansiballz_main()\n
      File \"/home/cloud-user/.ansible/tmp/ansible-tmp-1603265045.369726-22447-253278130374381/AnsiballZ_os_server.py\", line 94, in _ansiballz_main\n
        invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\n
      File \"/home/cloud-user/.ansible/tmp/ansible-tmp-1603265045.369726-22447-253278130374381/AnsiballZ_os_server.py\", line 40, in invoke_module\n
        runpy.run_module(mod_name='ansible.modules.cloud.openstack.os_server', init_globals=None, run_name='__main__', alter_sys=True)\n
      File \"/usr/lib64/python3.6/runpy.py\", line 205, in run_module\n
        return _run_module_code(code, init_globals, run_name, mod_spec)\n
      File \"/usr/lib64/python3.6/runpy.py\", line 96, in _run_module_code\n
        mod_name, mod_spec, pkg_name, script_name)\n
      File \"/usr/lib64/python3.6/runpy.py\", line 85, in _run_code\n
        exec(code, run_globals)\n
      File \"/tmp/ansible_os_server_payload_r89pe0c8/ansible_os_server_payload.zip/ansible/modules/cloud/openstack/os_server.py\", line 759, in <module>\n
      File \"/tmp/ansible_os_server_payload_r89pe0c8/ansible_os_server_payload.zip/ansible/modules/cloud/openstack/os_server.py\", line 750, in main\n
      File \"/tmp/ansible_os_server_payload_r89pe0c8/ansible_os_server_payload.zip/ansible/modules/cloud/openstack/os_server.py\", line 547, in _create_server\n
      File \"/tmp/ansible_os_server_payload_r89pe0c8/ansible_os_server_payload.zip/ansible/modules/cloud/openstack/os_server.py\", line 417, in _exit_hostvars\n
      File \"/usr/lib/python3.6/site-packages/openstack/cloud/_compute.py\", line 1832, in get_openstack_vars\n
        return meta.get_hostvars_from_server(self, server)\n
      File \"/usr/lib/python3.6/site-packages/openstack/cloud/meta.py\", line 499, in get_hostvars_from_server\n
        expand_server_security_groups(cloud, server)\n
      File \"/usr/lib/python3.6/site-packages/openstack/cloud/meta.py\", line 471, in expand_server_security_groups\n
        groups = cloud.list_server_security_groups(server)\n
      File \"/usr/lib/python3.6/site-packages/openstack/cloud/_compute.py\", line 198, in list_server_security_groups\n
        server = self.compute.get_server(server)\n
      File \"/usr/lib/python3.6/site-packages/openstack/compute/v2/_proxy.py\", line 482, in get_server\n
        return self._get(_server.Server, server)\n
      File \"/usr/lib/python3.6/site-packages/openstack/proxy.py\", line 46, in check\n
        return method(self, expected, actual, *args, **kwargs)\n
      File \"/usr/lib/python3.6/site-packages/openstack/proxy.py\", line 447, in _get\n
        resource_type=resource_type.__name__, value=value))\n
      File \"/usr/lib/python3.6/site-packages/openstack/resource.py\", line 1321, in fetch\n
        self._translate_response(response, **kwargs)\n
      File \"/usr/lib/python3.6/site-packages/openstack/resource.py\", line 1134, in _translate_response\n
        dict.update(self, self.to_dict())\n
      File \"/usr/lib/python3.6/site-packages/openstack/resource.py\", line 969, in to_dict\n
        value = getattr(self, attr, None)\n
      File \"/usr/lib/python3.6/site-packages/openstack/resource.py\", line 580, in __getattribute__\n
        return object.__getattribute__(self, name)\n
      File \"/usr/lib/python3.6/site-packages/openstack/resource.py\", line 166, in __get__\n
        return _convert_type(value, self.type, self.list_type)\n
      File \"/usr/lib/python3.6/site-packages/openstack/resource.py\", line 66, in _convert_type\n
        ret.append(_convert_type(raw, list_type))\n
      File \"/usr/lib/python3.6/site-packages/openstack/resource.py\", line 82, in _convert_type\n
        return data_type(value)\n
    ValueError: dictionary update sequence element #0 has length 1; 2 is required\n
    ", "module_stdout": "", "msg": "MODULE FAILURE\n
    See stdout/stderr for the exact error", "rc": 1}
failed: [localhost] (item=[1, 'ostest-vpwdz-master']) => same error
failed: [localhost] (item=[2, 'ostest-vpwdz-master']) => same error

The task fails, but the VMs are deployed successfully.

Expected results: no errors


Additional info:

After commenting out the lines in [2], the task works fine.

Tried with ansible 2.8 and 2.9: same result.
Tried with python3-openstacksdk-0.36.3 (from the bastion host [3]) and python3-openstacksdk-0.36.4 (from the undercloud [4]): no difference.

Workaround: add 'ignore_errors: yes' to the 'Create the Control Plane servers' task
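Applied to the playbook, the workaround looks roughly like this (a trimmed sketch: the task name and loop variables follow the 4.5 UPI playbook, but the module parameters are abbreviated here; see [2] for the full task):

```yaml
# control-plane.yaml -- sketch of the workaround
- name: 'Create the Control Plane servers'
  os_server:
    name: "{{ item.1 }}-{{ item.0 }}"
    # ... remaining parameters unchanged ...
  ignore_errors: yes  # the servers are created successfully; only the
                      # post-create hostvars expansion crashes
```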

[1] https://docs.openshift.com/container-platform/4.5/installing/installing_openstack/installing-openstack-user.html
[2] https://github.com/openshift/installer/blob/release-4.5/upi/openstack/control-plane.yaml#L85-L86
[3] http://pulp.dist.prod.ext.phx2.redhat.com/content/dist/layered/rhel8/$basearch/openstack-tools/16/os/
[4] http://rhos-qe-mirror-tlv.usersys.redhat.com/rcm-guest/puddles/OpenStack/16.1-RHEL-8/RHOS-16.1-RHEL-8-20201021.n.0/compose/OpenStack/$basearch/os

Comment 2 Adolfo Duarte 2020-11-17 23:43:55 UTC
This seems to be an openstacksdk bug; here is a similar report:
https://storyboard.openstack.org/#!/story/2007710 (bug 39843)

It is addressed by this patch to openstacksdk:

https://review.opendev.org/#/c/749381/

which merged to master on Sep 16 2020 and is included in tag 0.51.0.
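For reference, the upstream change is small: it stops coercing server_groups entries to dict. A sketch of the kind of diff involved (paraphrased against openstack/compute/v2/server.py; see gerrit 749381 for the actual change):

```diff
-    server_groups = resource.Body('server_groups', type=list,
-                                  list_type=dict, min_microversion='2.71')
+    server_groups = resource.Body('server_groups', type=list,
+                                  min_microversion='2.71')
```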

Comment 3 Adolfo Duarte 2020-11-17 23:46:00 UTC
@Jon Uriarte, could you test with openstacksdk 0.51.0 or later (master)? It seems the problem might be fixed there.

Comment 4 Pierre Prinetti 2020-11-19 09:25:32 UTC
This could have been fixed upstream; can you please verify that the bug still exists now that we recommend[1] `ansible-galaxy` to fetch the dependencies?

[1]: https://github.com/openshift/installer/pull/4379

Comment 6 Martin André 2020-12-10 10:00:20 UTC
Possibly, we'll need to implement a temporary solution like https://github.com/openshift/installer/pull/4375 until the openstacksdk package containing the fix is more widespread.

Comment 7 Adolfo Duarte 2020-12-11 19:09:16 UTC
I think https://github.com/openshift/installer/pull/4375 is perhaps not so temporary. If we make the version variable, this would probably be a good thing to have, since this type of defect will likely pop up again when a newer version appears and brings problems. Pinning the playbooks to a particular API version seems like a good idea.

Comment 9 Matthew Booth 2021-01-18 13:24:55 UTC
Status:

Emilien posted a backport of the upstream fix here: https://review.opendev.org/c/openstack/openstacksdk/+/763121/ , which has a +2 +W. Unfortunately it failed to merge due to timeout errors in various tests, all of which seem unrelated to the backport. I have resubmitted and hit the same issue again. I will spend some time trying to improve the timeout situation; otherwise this seems unlikely to ever land.

Comment 12 Matthew Booth 2021-01-21 09:53:31 UTC
The openstacksdk backport has now landed.

Comment 14 Pierre Prinetti 2021-01-21 14:47:28 UTC
(In reply to rlobillo from comment #5)

> We also tried to run the control-plane.yaml playbook installing the
> collection through ansible-galaxy as mentioned in the documentation

We have reverted that change; that was my mistake. ansible-galaxy is not supported. If there's any reference to ansible-galaxy left in code or docs, then you can report it as a bug. Thanks!

Comment 16 weiwei jiang 2021-01-25 10:42:20 UTC
Checked with python3-openstacksdk-0.36.4-1.20201113235938.el8ost.noarch and could not reproduce this issue; moving to VERIFIED.


TASK [Create the Control Plane servers] ****************************************
task path: /root/jenkins/workspace/Launch Environment Flexy/private-templates/functionality-testing/aos-4_7/hosts/upi_on_openstack-scripts/04_control-plane.yaml:72
<localhost> ESTABLISH LOCAL CONNECTION FOR USER: root
<localhost> EXEC /bin/sh -c 'echo ~root && sleep 0'
<localhost> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo /root/.ansible/tmp `"&& mkdir "` echo /root/.ansible/tmp/ansible-tmp-1611570657.568383-4107914-230337991073578 `" && echo ansible-tmp-1611570657.568383-4107914-230337991073578="` echo /root/.ansible/tmp/ansible-tmp-1611570657.568383-4107914-230337991073578 `" ) && sleep 0'

Using module file /usr/lib/python3.6/site-packages/ansible/modules/cloud/openstack/os_server.py
<localhost> PUT /root/.ansible/tmp/ansible-local-410775579l6z6kd/tmpzm2x60rx TO /root/.ansible/tmp/ansible-tmp-1611570657.568383-4107914-230337991073578/AnsiballZ_os_server.py
<localhost> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1611570657.568383-4107914-230337991073578/ /root/.ansible/tmp/ansible-tmp-1611570657.568383-4107914-230337991073578/AnsiballZ_os_server.py && sleep 0'
<localhost> EXEC /bin/sh -c '/usr/libexec/platform-python /root/.ansible/tmp/ansible-tmp-1611570657.568383-4107914-230337991073578/AnsiballZ_os_server.py && sleep 0'

<localhost> EXEC /bin/sh -c 'rm -f -r /root/.ansible/tmp/ansible-tmp-1611570657.568383-4107914-230337991073578/ > /dev/null 2>&1 && sleep 0'
<localhost> EXEC /bin/sh -c 'echo ~root && sleep 0'
changed: [localhost] => (item=[0, 'wj47uos125ag-kgxfx-master']) => {
    "ansible_loop_var": "item",
    "changed": true,
    "id": "d4570999-a84a-4470-85da-e8f8f801482b",
......
        "server_groups": null,
        "status": "ACTIVE",
        "tags": [],
        "task_state": null,
        "tenant_id": "542c6ebd48bf40fa857fc245c7572e30",
        "terminated_at": null,
        "trusted_image_certificates": null,
        "updated": "2021-01-25T10:35:47Z",
        "user_data": null,
        "user_id": "b414646065ab99780ef1bbcba52c07d2033a6f99fd0b10a3b1b12fcb5e5275e1",
        "vm_state": "active",
        "volumes": []
    }
}
META: ran handlers
META: ran handlers

PLAY RECAP *********************************************************************
localhost                  : ok=7    changed=5    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0

Comment 19 errata-xmlrpc 2021-02-24 15:28:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

