Description of problem:
In RHOS-16.1-RHEL-8-20200428.n.0, `openstack overcloud node provision -o baremetal_environment.yaml baremetal_deployment.yaml` fails because a variable is missing from /usr/share/openstack-tripleo-common/workbooks/baremetal_deploy.yaml.

Version-Release number of selected component (if applicable):
RHOS 16.1

How reproducible:
Every time

Steps to Reproduce:
1. openstack overcloud node provision -o baremetal_environment.yaml baremetal_deployment.yaml

Actual results:
The provision fails

Expected results:
The provision succeeds

Additional info:
The problem is specifically in /usr/share/openstack-tripleo-common/workbooks/baremetal_deploy.yaml. In the deploy_roles workflow, the deploy_instances task references '<% $.connection_timeout %>', but connection_timeout does not exist in the deploy_roles input param list.

Fortunately, a patch that fixes this has been committed upstream to stable/train and just needs to be backported downstream to RHOS 16.1, specifically https://opendev.org/openstack/tripleo-common/commit/3d3afa62dc392236dd3191956ed2bf2f05f3b0e1

I've verified this fix by doing the following:
1. Add "- connection_timeout: 600" to the input: list in the deploy_roles workflow
2. openstack workbook delete tripleo.baremetal_deploy.v1
3. openstack workbook create /usr/share/openstack-tripleo-common/workbooks/baremetal_deploy.yaml
4. Run 'openstack overcloud node provision -o baremetal_environment.yaml baremetal_deployment.yaml'
5. See the provision succeed
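The failure described above (a task expression reading a `$.` variable that the workflow never declares) can be checked for mechanically. The following is a minimal sketch, not part of tripleo-common: it takes an already-parsed workflow as a plain dict and reports `$.name` references that are neither declared inputs nor published by a task. The `deploy_roles` dict is a trimmed, hypothetical stand-in for the real workbook (the `roles` input is assumed for illustration), and the regex is a heuristic that may miss or over-report in more complex YAQL expressions.

```python
import re

# Matches YAQL context references such as "$.connection_timeout",
# skipping method-style calls such as "$.get(...)".
REF = re.compile(r"\$\.([A-Za-z_]\w*)\b(?!\()")


def input_names(workflow):
    """Return the names declared in a workflow's input list."""
    names = set()
    for item in workflow.get("input", []):
        # Inputs are either bare names or single-key {name: default} mappings.
        names.update(item.keys() if isinstance(item, dict) else [item])
    return names


def strings(node):
    """Yield every string value found inside a nested dict/list."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from strings(value)
    elif isinstance(node, list):
        for value in node:
            yield from strings(value)


def undeclared_refs(workflow):
    """Return $.names used in tasks but not declared as input or published."""
    tasks = workflow.get("tasks", {})
    known = input_names(workflow)
    for task in tasks.values():
        known.update(task.get("publish", {}))  # variables published by tasks
    used = set()
    for text in strings(tasks):
        used.update(REF.findall(text))
    return used - known


# Hypothetical, trimmed stand-in for the deploy_roles workflow.
deploy_roles = {
    "input": ["roles", {"ssh_user_name": "heat-admin"}, {"timeout": 3600}],
    "tasks": {
        "deploy_instances": {
            "workflow": "deploy_instances",
            "input": {
                "ssh_user_name": "<% $.ssh_user_name %>",
                "connection_timeout": "<% $.connection_timeout %>",
                "timeout": "<% $.timeout %>",
            },
        }
    },
}

print(sorted(undeclared_refs(deploy_roles)))  # ['connection_timeout']
```

Adding `{"connection_timeout": 600}` to the `input` list of the sample dict makes the check return an empty set, which mirrors the manual workaround in the steps above.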
I would have expected that this fix would be available in the compose being tested as the tripleo-common package built for 16.1 on 4/13 has the fix - https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1164925.
I've just done a fresh deploy to verify, and I'm still seeing the issue. Here's the environment I used today:

RHEL: Red Hat Enterprise Linux release 8.2 (Ootpa)
RHOS: Red Hat OpenStack Platform release 16.1 (Train)
RHOS Puddle: 16.1-trunk -p RHOS-16.1-RHEL-8-20200428.n.0
Yum Repos: 16.1-trunk ceph-4 ceph-osd-4 rhel-8.2

rpm -qa | grep tripleo
python3-tripleoclient-12.3.2-0.20200424033448.b951192.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.2-0.20200311084936.a6fef08.el8ost.noarch
python3-tripleo-common-11.3.3-0.20200423204446.86569f2.el8ost.noarch
ansible-role-tripleo-modify-image-1.1.1-0.20200311081746.bb6f78d.el8ost.noarch
ansible-tripleo-ipa-0.1.2-0.20200427103432.f23f480.el8ost.noarch
openstack-tripleo-image-elements-10.6.2-0.20200313223428.8c91b46.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200423204446.86569f2.el8ost.noarch
ansible-tripleo-ipsec-9.2.1-0.20200311073016.0c8693c.el8ost.noarch
openstack-tripleo-common-containers-11.3.3-0.20200423204446.86569f2.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-0.20200424033448.b951192.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200428015016.d5442cd.el8ost.noarch
puppet-tripleo-11.4.1-0.20200420213421.cae687c.el8ost.noarch
openstack-tripleo-validations-11.3.2-0.20200415073428.7b94843.el8ost.noarch

And this is the patch I needed to see this work:

diff -u baremetal_deploy.yaml.orig baremetal_deploy.yaml
--- baremetal_deploy.yaml.orig 2020-05-04 04:08:07.682156095 +0000
+++ baremetal_deploy.yaml 2020-05-04 04:28:49.289445878 +0000
@@ -203,6 +203,7 @@
       - ctlplane_network: ctlplane
       - ssh_keys: []
       - ssh_user_name: heat-admin
+      - connection_timeout: 600
       - timeout: 3600
       - concurrency: 20
       - queue_name: tripleo

Followed by:

openstack workbook delete tripleo.baremetal_deploy.v1
openstack workbook create /usr/share/openstack-tripleo-common/workbooks/baremetal_deploy.yaml
@Bob: The build you point out isn't tagged as -pending, so we're getting: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1176328

@Michael: Can you install the RPMs from the build Bob points out before installing the undercloud, to verify that it fixes the problem? If it does, I assume we can mark this bug as MODIFIED and include the NEVRA, and it will get tagged correctly?
Hi Tony - I'm confused, as https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1176328 should have the fix too. See the changelog note from April 3: [Train only] Add ssh timeout for baremetal_deploy
Actually, I think the referenced change[1] caused this problem, since it:
- Adds a connection_timeout input to the deploy_instances workflow, but that workflow doesn't do anything with that input value
- Calls deploy_instances with connection_timeout from the deploy_roles workflow, but deploy_roles is missing a connection_timeout input

Also, I don't see a need for deploy_roles or deploy_instances to deal with ansible connection timeouts, because they don't call ansible and don't attempt to connect to any remote nodes.

I'm going to propose a revert of this change.

[1] https://opendev.org/openstack/tripleo-common/commit/3d3afa62dc392236dd3191956ed2bf2f05f3b0e1
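The first point above (an input that is declared but never consumed) is the mirror image of the original failure and can be spotted the same way. This is a rough sketch under stated assumptions, not code from tripleo-common: the workflow is given as a parsed dict, the task and action names in the sample are hypothetical, and `$.name` references are found with a simple regex over the serialised workflow body.

```python
import json
import re


def unused_inputs(workflow):
    """Return declared inputs that nothing else in the workflow references."""
    declared = set()
    for item in workflow.get("input", []):
        # Inputs are either bare names or single-key {name: default} mappings.
        declared.update(item if isinstance(item, dict) else [item])
    # Serialise everything except the top-level input list, then scan for "$.name".
    body = json.dumps({k: v for k, v in workflow.items() if k != "input"})
    used = set(re.findall(r"\$\.([A-Za-z_]\w*)", body))
    return declared - used


# Hypothetical, trimmed stand-in for the deploy_instances workflow.
deploy_instances = {
    "input": ["instances", {"ssh_user_name": "heat-admin"}, "connection_timeout"],
    "tasks": {
        "deploy": {  # hypothetical task name
            "action": "example.deploy",  # hypothetical action name
            "input": {
                "instances": "<% $.instances %>",
                "ssh_user_name": "<% $.ssh_user_name %>",
            },
        }
    },
}

print(sorted(unused_inputs(deploy_instances)))  # ['connection_timeout']
```

An unused input is harmless at runtime, which is why only the undeclared reference in deploy_roles surfaced as a provisioning failure.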
Thanks Steve - I misunderstood the fix needed because I misread the original diff. I was suggesting adding connection_timeout to the input params for deploy_roles, but your solution of removing connection_timeout, since it isn't required, is even better. Thanks for seeing through my mistake. I've reviewed the upstream patch (https://review.opendev.org/#/c/725426) and verified it in my local environment. It works as expected. Looking forward to this landing in RHOS 16.1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3148