Description of problem:

Overcloud deployment fails with a timeout when using a mix of VM and BM nodes for the overcloud:

  Deploying overcloud configuration
  Enabling ssh admin (tripleo-admin) for hosts: 172.31.0.43 172.31.0.33 172.31.0.35 172.31.0.47
  Using ssh user heat-admin for initial connection.
  Using ssh key at /home/stack/.ssh/id_rsa for initial connection.
  Inserting TripleO short term key for 172.31.0.43
  Warning: Permanently added '172.31.0.43' (ECDSA) to the list of known hosts.
  Removing short term keys locally
  Timed out waiting for port 22 from 172.31.0.33

In the output above, one node succeeds and the error/timeout follows on the remaining three nodes. The node that succeeded is a VM; the three that failed are bare-metal nodes. The bare-metal nodes are Supermicro machines with a relatively quick boot time: about one minute after the timeout occurs, they become accessible via ssh. A second consecutive deploy (without any changes) completes successfully. The timeout period should be increased for bare-metal nodes. I am not sure whether the problem occurs when only BM nodes are used; I only tested a mix of VM and BM.

Version-Release number of selected component (if applicable):
OSP16

How reproducible:
Every time

Steps to Reproduce:
1. Deploy OSP16 on a mix of VMs and BM nodes.
2. The first deployment times out.
3. The second consecutive deployment (update) succeeds, even with no changes to the configuration.

Actual results:
Timeout

Expected results:
The timeout for enabling ssh admin should be increased.

Additional info:
sosreport - http://chrisj.cloud/sosreport-undercloud-osp16-2020-02-20-jclawlu.tar.xz
The following option can be used to tune this timeout:

  --overcloud-ssh-port-timeout OVERCLOUD_SSH_PORT_TIMEOUT
                        Timeout to wait for the ssh port to become active.
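For example, the option is passed to the overcloud deploy command (a minimal sketch; the template and environment-file paths here are illustrative placeholders, not taken from this report):

  openstack overcloud deploy \
    --templates \
    --overcloud-ssh-port-timeout 600 \
    -e /home/stack/templates/node-info.yaml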
Thanks for the update. I'm glad we have that option. It's probably safe to assume that if I hit this on relatively quick-booting Supermicro boards, our customers will also hit it on more traditional and slower OEM servers; it's not uncommon for those to take ~15 minutes or more to boot. I would highly recommend changing the default to something higher.
I want to say we already raised it to 10 minutes, but I'll have to check.
See https://review.opendev.org/#/c/620754/: there are two values, and we did raise one of them to 10 minutes. Usually we don't get to the ssh enable process until several minutes after the systems should already be up/deployed, so I'm not sure what specifically happened in this scenario. In our testing it's usually 10+ minutes before the ssh enable process runs, well after the nodes should already be up.
One thing that might be different is that my environment has a mix of VM and BM nodes. The VMs restart in a matter of seconds, whereas the BM nodes typically take about 5 minutes to boot. Does the timer get reset after the first node comes up, or anything along those lines?
No, the timer starts once deployment reaches the point where it needs to do the ssh key setup. The overall timeouts are global to the entire environment.
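To illustrate what a global timeout means here (a minimal sketch only, not TripleO's actual implementation; host addresses are taken from the log above, and this assumes an nc that supports -z/-w): with a single shared deadline, a slow-booting BM node late in the list only gets whatever budget the earlier hosts left over.

  #!/bin/sh
  # Sketch of a global port-22 wait: one deadline shared by every host.
  TIMEOUT=600
  deadline=$(( $(date +%s) + TIMEOUT ))
  for host in 172.31.0.43 172.31.0.33 172.31.0.35 172.31.0.47; do
    # Poll until sshd answers on port 22 or the shared deadline passes.
    until nc -z -w 5 "$host" 22; do
      if [ "$(date +%s)" -ge "$deadline" ]; then
        echo "Timed out waiting for port 22 from $host" >&2
        exit 1
      fi
      sleep 5
    done
  done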
Adding --overcloud-ssh-port-timeout 600 \ to my deployment script has fixed this problem for me. Again, I have relatively fast-POSTing hardware, so please consider changing the default. In any case, I'd like to leave this BZ as the artifact for others who hit the issue.
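As a quick manual check before re-running the deploy (an illustrative one-liner, again assuming an nc that supports -z/-w; the address is one of the BM nodes from the log above):

  nc -z -w 5 172.31.0.33 22 && echo "port 22 up"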
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3148