Bug 1805429

Summary: [osp16 hackfest] Enabling ssh admin (tripleo-admin) for hosts - Timed out waiting for port 22 (mix of VM and BM)
Product: Red Hat OpenStack Reporter: Chris Janiszewski <cjanisze>
Component: python-tripleoclient Assignee: Alex Schultz <aschultz>
Status: CLOSED ERRATA QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 16.0 (Train) CC: aschultz, augol, emacchi, hbrock, jschluet, jslagle, mburns
Target Milestone: beta Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-tripleoclient-12.3.2-0.20200229004913.6d57d68.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-29 07:50:34 UTC Type: Bug

Description Chris Janiszewski 2020-02-20 18:50:15 UTC
Description of problem:
Overcloud deployment fails with a timeout when the overcloud uses a mix of VM and bare-metal (BM) nodes.

Deploying overcloud configuration
Enabling ssh admin (tripleo-admin) for hosts:
172.31.0.43 172.31.0.33 172.31.0.35 172.31.0.47
Using ssh user heat-admin for initial connection.
Using ssh key at /home/stack/.ssh/id_rsa for initial connection.
Inserting TripleO short term key for 172.31.0.43
Warning: Permanently added '172.31.0.43' (ECDSA) to the list of known hosts.
Removing short term keys locally
Timed out waiting for port 22 from 172.31.0.33


In the output above, one node succeeds and the remaining three nodes hit the error/timeout. The node that succeeded is a VM and the three that failed are bare-metal nodes.
The bare-metal nodes are Supermicro systems with a relatively quick boot time.
About 1 minute after the timeout occurs, these Supermicro nodes are accessible via ssh.

The second consecutive deploy (without any changes) completes successfully.

The timeout period should be increased for bare-metal nodes.
I am not sure whether the problem occurs when only BM nodes are used; I only tested a mix of VM and BM.


Version-Release number of selected component (if applicable):
OSP16

How reproducible:
Every time

Steps to Reproduce:
1. Deploy OSP16 on a mix of VMs and BM nodes
2. The first deployment times out
3. The second consecutive deployment (update) succeeds even with no changes to the configuration

Actual results:
Time-out

Expected results:
The time for enabling ssh admin should be increased

Additional info:
sosreport - http://chrisj.cloud/sosreport-undercloud-osp16-2020-02-20-jclawlu.tar.xz

Comment 1 Alex Schultz 2020-02-20 22:18:27 UTC
The following option can be used to tune this timeout:

  --overcloud-ssh-port-timeout OVERCLOUD_SSH_PORT_TIMEOUT
                        Timeout for to wait for the ssh port to become active.
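
As a rough illustration of what this kind of timeout governs, the sketch below polls a TCP port until it accepts a connection or a deadline (in seconds) passes. It is not the tripleoclient implementation; `wait_for_port` and its parameters are hypothetical names chosen for this example.

```python
import socket
import time


def wait_for_port(host, port, timeout, interval=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper for illustration only; not taken from tripleoclient.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Cap each attempt so a hung connect cannot overrun the deadline by much.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Port closed, host unreachable, or attempt timed out: keep waiting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With a helper like this, `wait_for_port("172.31.0.33", 22, timeout=600)` would keep retrying for up to ten minutes before giving up, which is the behaviour a larger timeout value buys for slow-booting hardware.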

Comment 2 Chris Janiszewski 2020-02-21 15:00:52 UTC
Thanks for the update. I am glad we have that option. It's probably safe to assume that if I hit this on the relatively quick-to-boot Supermicro boards, our customers will also hit it on more traditional and slower OEM servers. I would say it's not uncommon for traditional servers to take ~15 minutes or more to boot. I would highly recommend changing the default to something higher.

Comment 3 Alex Schultz 2020-02-21 15:34:34 UTC
I want to say we already did raise it to 10 mins, but I'll have to check.

Comment 4 Alex Schultz 2020-02-21 15:37:24 UTC
https://review.opendev.org/#/c/620754/ so there are two values, but we did raise one to 10 mins. Usually we don't get to the ssh enable process until several minutes after the systems should already be up/deployed, so I'm not sure what specifically happened in this scenario. In our testing it's usually 10+ minutes after the nodes should already be up before the ssh enable process runs.

Comment 5 Chris Janiszewski 2020-02-21 16:43:39 UTC
The one thing that might be different is that my environment has a mix of VM and BM. The VM restarts in a matter of seconds, whereas the BM nodes typically take about 5 minutes to boot. Does the timer get reset after the first node comes up, or anything along those lines?

Comment 6 Alex Schultz 2020-02-21 17:45:50 UTC
No, it starts once it reaches the point where it needs to try and do the ssh key bits. The overall timeouts are global to the entire environment.

Comment 7 Chris Janiszewski 2020-02-25 20:36:12 UTC
Adding --overcloud-ssh-port-timeout 600 to my deployment script has fixed this problem for me. Again, I have relatively fast-POSTing hardware, so please consider changing the defaults. In any case, I'd like to leave this BZ as an artifact for others who hit the issue.
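
For anyone scripting this workaround, here is a minimal sketch of appending the flag to a deploy command. `build_deploy_command` is a hypothetical helper, and `my-environment.yaml` is a placeholder for whatever arguments an existing deployment script already passes; only the `--overcloud-ssh-port-timeout` flag and the value 600 come from this thread.

```python
import shlex


def build_deploy_command(extra_args, ssh_port_timeout=600):
    """Return an argv list for the deploy command with the timeout flag appended.

    Hypothetical helper; extra_args stands in for the arguments an existing
    deployment script already uses.
    """
    command = ["openstack", "overcloud", "deploy", "--templates"]
    command.extend(extra_args)
    command.extend(["--overcloud-ssh-port-timeout", str(ssh_port_timeout)])
    return command


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    # The environment file name below is a placeholder.
    print(shlex.join(build_deploy_command(["-e", "my-environment.yaml"])))
```

Building the command as a list keeps the timeout value in one place, which makes it easy to raise again for hardware that takes longer to boot.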

Comment 14 errata-xmlrpc 2020-07-29 07:50:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148