Bug 1829707

Summary: Missing connection_timeout in deploy_roles
Product: Red Hat OpenStack Reporter: midavies
Component: openstack-tripleo-common Assignee: Steve Baker <sbaker>
Status: CLOSED ERRATA QA Contact: David Rosenfeld <drosenfe>
Severity: medium Docs Contact:
Priority: medium    
Version: 16.1 (Train) CC: bfournie, mburns, midavies, sbaker, slinaber, tonyb
Target Milestone: beta Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-common-11.3.3-0.20200525163439.20973e4.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-29 07:52:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description midavies 2020-04-30 07:21:27 UTC
Description of problem:

In RHOS-16.1-RHEL-8-20200428.n.0, `openstack overcloud node provision -o baremetal_environment.yaml baremetal_deployment.yaml` fails because a variable is missing from /usr/share/openstack-tripleo-common/workbooks/baremetal_deploy.yaml

Version-Release number of selected component (if applicable):

RHOS 16.1 

How reproducible:

Every time

Steps to Reproduce:
1. openstack overcloud node provision -o baremetal_environment.yaml baremetal_deployment.yaml


Actual results:

The provision fails

Expected results:

The provision succeeds


Additional info:

The problem is specifically in /usr/share/openstack-tripleo-common/workbooks/baremetal_deploy.yaml

In the deploy_roles workflow, in the deploy_instances task, '<% $.connection_timeout %>' is referenced but does not exist in the deploy_roles input param list.
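
This class of error can be caught by comparing the `$.name` references in a workflow's tasks against its declared inputs. The following is a minimal sketch, not part of tripleo-common: the `workflow` dict is a hand-written, heavily trimmed stand-in for deploy_roles (the `roles` input and the task body are illustrative assumptions), and the helper names are hypothetical. It ignores variables that tasks add via `publish`, so it may report false positives on real workbooks.

```python
import re

# Hand-written, trimmed stand-in for the deploy_roles workflow (assumption).
# Only the missing connection_timeout input mirrors this bug report.
workflow = {
    "input": [
        "roles",
        {"ssh_user_name": "heat-admin"},
        {"timeout": 3600},
        {"concurrency": 20},
    ],
    "tasks": {
        "deploy_instances": {
            "input": {
                "ssh_user_name": "<% $.ssh_user_name %>",
                "timeout": "<% $.timeout %>",
                "connection_timeout": "<% $.connection_timeout %>",
            }
        }
    },
}

# Matches context references such as $.connection_timeout.
REF = re.compile(r"\$\.([A-Za-z_]\w*)")


def declared_inputs(wf):
    """Return input names; entries are bare names or {name: default}."""
    names = set()
    for item in wf.get("input", []):
        if isinstance(item, dict):
            names.update(item)
        else:
            names.add(item)
    return names


def referenced_names(node):
    """Collect every $.name reference in a nested dict/list/str structure."""
    if isinstance(node, str):
        return set(REF.findall(node))
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            found |= referenced_names(key) | referenced_names(value)
    elif isinstance(node, list):
        for item in node:
            found |= referenced_names(item)
    return found


def undeclared_references(wf):
    """Names referenced in tasks but absent from the workflow inputs."""
    return referenced_names(wf.get("tasks", {})) - declared_inputs(wf)


print(sorted(undeclared_references(workflow)))  # prints ['connection_timeout']
```

To run this against the real workbook, the YAML would first have to be loaded into a dict with a YAML parser, which the sketch leaves out.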

Fortunately, a patch that fixes this has been committed upstream to stable/train, and just needs to be backported downstream to RHOS 16.1, specifically https://opendev.org/openstack/tripleo-common/commit/3d3afa62dc392236dd3191956ed2bf2f05f3b0e1

I've verified this fix by doing the following:

1. Add in "- connection_timeout: 600" to the input: list in the deploy_roles workflow
2. openstack workbook delete tripleo.baremetal_deploy.v1
3. openstack workbook create /usr/share/openstack-tripleo-common/workbooks/baremetal_deploy.yaml
4. Run 'openstack overcloud node provision -o baremetal_environment.yaml baremetal_deployment.yaml'
5. See the provision succeed

Comment 1 Bob Fournier 2020-05-04 01:04:16 UTC
I would have expected this fix to be in the compose being tested, since the tripleo-common package built for 16.1 on 4/13 has it: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1164925.

Comment 2 midavies 2020-05-04 04:57:54 UTC
I've just done a fresh deploy to verify, and I'm still seeing the issue.

Here's the environment I used today:
        RHEL: Red Hat Enterprise Linux release 8.2 (Ootpa)
        RHOS: Red Hat OpenStack Platform release 16.1 (Train)
 RHOS Puddle: 16.1-trunk  -p RHOS-16.1-RHEL-8-20200428.n.0
   Yum Repos: 16.1-trunk  ceph-4  ceph-osd-4  rhel-8.2

rpm -qa | grep tripleo
python3-tripleoclient-12.3.2-0.20200424033448.b951192.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.2-0.20200311084936.a6fef08.el8ost.noarch
python3-tripleo-common-11.3.3-0.20200423204446.86569f2.el8ost.noarch
ansible-role-tripleo-modify-image-1.1.1-0.20200311081746.bb6f78d.el8ost.noarch
ansible-tripleo-ipa-0.1.2-0.20200427103432.f23f480.el8ost.noarch
openstack-tripleo-image-elements-10.6.2-0.20200313223428.8c91b46.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200423204446.86569f2.el8ost.noarch
ansible-tripleo-ipsec-9.2.1-0.20200311073016.0c8693c.el8ost.noarch
openstack-tripleo-common-containers-11.3.3-0.20200423204446.86569f2.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-0.20200424033448.b951192.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200428015016.d5442cd.el8ost.noarch
puppet-tripleo-11.4.1-0.20200420213421.cae687c.el8ost.noarch
openstack-tripleo-validations-11.3.2-0.20200415073428.7b94843.el8ost.noarch

And this is the patch I needed to see this work:
diff -u baremetal_deploy.yaml.orig baremetal_deploy.yaml
--- baremetal_deploy.yaml.orig	2020-05-04 04:08:07.682156095 +0000
+++ baremetal_deploy.yaml	2020-05-04 04:28:49.289445878 +0000
@@ -203,6 +203,7 @@
       - ctlplane_network: ctlplane
       - ssh_keys: []
       - ssh_user_name: heat-admin
+      - connection_timeout: 600
       - timeout: 3600
       - concurrency: 20
       - queue_name: tripleo

Followed by:
openstack workbook delete tripleo.baremetal_deploy.v1
openstack workbook create /usr/share/openstack-tripleo-common/workbooks/baremetal_deploy.yaml

Comment 3 Tony Breeds 2020-05-04 06:41:56 UTC
@Bob: The build you point out isn't tagged as -pending, so we're getting: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1176328
@Michael: Can you install the RPMs from the build Bob points out before installing the undercloud, to verify that it fixes the problem?

If it does, I assume we can mark it as modified and include the NEVRA, and it will get tagged correctly?

Comment 4 Bob Fournier 2020-05-04 11:33:26 UTC
Hi Tony - I'm confused, as https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1176328 should have the fix too; see the note from April 3:
[Train only] Add ssh timeout for baremetal_deploy

Comment 5 Steve Baker 2020-05-04 22:38:11 UTC
Actually I think the referenced change[1] caused this problem, since it:

- Adds a connection_timeout input to the deploy_instances workflow, but that workflow doesn't do anything with that input value
- Calls deploy_instances with connection_timeout from the deploy_roles workflow, but deploy_roles is missing a connection_timeout input

Also, I don't see a need for deploy_roles or deploy_instances to deal with Ansible connection timeouts, because they don't call Ansible and don't attempt to connect to any remote nodes.

I'm going to propose a revert of this change.

[1] https://opendev.org/openstack/tripleo-common/commit/3d3afa62dc392236dd3191956ed2bf2f05f3b0e1
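
The first point above, an input that is declared but never used, can be checked mechanically. This is a rough sketch under stated assumptions, not code from tripleo-common: `deploy_instances` below is a hypothetical, trimmed stand-in for the real workflow (the `instances` input and the `deploy` task are made up for illustration), and only `tasks` is scanned, so inputs used solely in `output` or other sections would be misreported.

```python
import re

# Hypothetical, trimmed stand-in for deploy_instances after the referenced
# change; only the unused connection_timeout input mirrors the bug report.
deploy_instances = {
    "input": [
        "instances",
        {"ssh_user_name": "heat-admin"},
        {"connection_timeout": 600},
    ],
    "tasks": {
        "deploy": {
            "input": {
                "instances": "<% $.instances %>",
                "ssh_user_name": "<% $.ssh_user_name %>",
            }
        }
    },
}


def iter_strings(node):
    """Yield every string found in a nested dict/list structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(key)
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def unused_inputs(wf):
    """Return inputs that no task expression references as $.name."""
    declared = set()
    for item in wf.get("input", []):
        # Entries are either bare names or single-key {name: default} dicts.
        declared.update(item if isinstance(item, dict) else [item])
    used = set()
    for text in iter_strings(wf.get("tasks", {})):
        used.update(re.findall(r"\$\.([A-Za-z_]\w*)", text))
    return declared - used


print(sorted(unused_inputs(deploy_instances)))  # prints ['connection_timeout']
```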

Comment 6 midavies 2020-05-05 02:52:10 UTC
Thanks Steve - I misunderstood the fix needed as I misread the original diff.  I was suggesting to add in connection_timeout to the input params for deploy_roles, but your solution of removing the connection_timeout if it isn't required is even better.  Thanks for seeing through my mistake.

I've reviewed the upstream patch (https://review.opendev.org/#/c/725426), and I've verified it in my local environment.  It works as expected.  Looking forward to this landing in rhos16.1.

Comment 11 errata-xmlrpc 2020-07-29 07:52:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148