Bug 1730661

Summary: Splitstack deployment fails due to no ssh connection to overcloud hosts
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Keywords: Triaged
Reporter: Sasha Smolyak <ssmolyak>
Assignee: Cédric Jeanneret <cjeanner>
QA Contact: Sasha Smolyak <ssmolyak>
CC: cjeanner, emacchi, mburns
Doc Type: No Doc Update
Type: Bug
Last Closed: 2019-08-01 06:22:30 UTC
Attachments:
- sosreport
- overcloud install log

Description Sasha Smolyak 2019-07-17 09:52:17 UTC
Created attachment 1591370 [details]
sosreport

Description of problem:
Overcloud deployment fails on splitstack: no ssh connection to the overcloud nodes. Sosreport attached.
[stack@undercloud-0 ~]$ cat overcloud_deployment_12.log
2019-07-17 08:04:29.269 105641 WARNING tripleoclient.plugin [  admin] Waiting for messages on queue 'tripleo' with no timeout.
2019-07-17 08:43:26.226 105641 ERROR openstack [  admin] Overcloud configuration failed.

"TASK [register machine id] *****************************************************", "fatal: [ceph-1]: UNREACHABLE! => {\"changed\": false, \"msg\": \"Data could not be sent to remote host \\\"192.168.24.137\\\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.137 port 22: Connection timed out\\r\\n\", \"unreachable\": true}", "fatal: [controller-2]: UNREACHABLE! => {\"changed\": false, \"msg\": \"Data could not be sent to remote host \\\"192.168.24.139\\\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.139 port 22: Connection timed out\\r\\n\", \"unreachable\": true}", "fatal: [compute-1]: UNREACHABLE! => {\"changed\": false, \"msg\": \"Data could not be sent to remote host \\\"192.168.24.118\\\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.118 port 22: Connection timed out\\r\\n\", \"unreachable\": true}", "fatal: [compute-0]: UNREACHABLE! => {\"changed\": false, \"msg\": \"Data could not be sent to remote host \\\"192.168.24.144\\\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.144 port 22: Connection timed out\\r\\n\", \"unreachable\": true}", "fatal: [ceph-0]: UNREACHABLE! => {\"changed\": false, \"msg\": \"Data could not be sent to remote host \\\"192.168.24.148\\\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.148 port 22: Connection timed out\\r\\n\", \"unreachable\": true}", "fatal: [controller-1]: UNREACHABLE! => {\"changed\": false, \"msg\": \"Data could not be sent to remote host \\\"192.168.24.129\\\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.129 port 22: Connection timed out\\r\\n\", \"unreachable\": true}", "fatal: [controller-0]: UNREACHABLE! => {\"changed\": false, \"msg\": \"Data could not be sent to remote host \\\"192.168.24.108\\\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.108 port 22: Connection timed out\\r\\n\", \"unreachable\": true}", "fatal: [ceph-2]: UNREACHABLE! => {\"changed\": false, \"msg\": \"Data could not be sent to remote host \\\"192.168.24.149\\\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.149 port 22: Connection timed out\\r\\n\", \"unreachable\": true}", ""


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-10.6.1-0.20190713150434.2871ce0.el8ost.noarch
puppet-tripleo-10.4.2-0.20190701160408.ecbec17.el8ost.noarch
RHOS_TRUNK-15.0-RHEL-8-20190714.n.0

How reproducible:
100%

Steps to Reproduce:
1. Try to deploy 3 controllers, 2 computes and 3 ceph nodes with splitstack (see the deploy sketch below)
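
For reference, the general shape of such a split-stack (pre-provisioned servers) deploy is sketched below; this is only a sketch, the node-count file name and the exact set of -e files are placeholders for whatever the job actually uses:

# Hypothetical node-count environment file (standard THT count parameters)
cat > ~/node-counts.yaml <<'EOF'
parameter_defaults:
  ControllerCount: 3
  ComputeCount: 2
  CephStorageCount: 3
EOF

# Split-stack deploy: pre-provisioned servers via the deployed-server
# environment, plus whatever network/ceph environments the job normally passes
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/deployed-server-environment.yaml \
  -e ~/node-counts.yaml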

Actual results:
Deployment fails

Expected results:
Deployment passes

Additional info:

Comment 1 Sasha Smolyak 2019-07-17 09:53:35 UTC
Created attachment 1591371 [details]
overcloud install log

Comment 2 Cédric Jeanneret 2019-07-17 10:15:30 UTC
Sooo. Apparently, ssh access isn't opened as expected.

We can see a resource named "003 accept ssh from all ipv4" (same for ipv6), but that one actually *removes* the ssh access (ensure => absent).

What's missing in the overcloud_install_log are apparently the resources named "003 accept ssh from ctlplane subnet ...", which allow ssh access only from the ctlplane.
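
To double-check which of these puppet firewall resources actually ran, a quick grep over the deployment log should be enough (a sketch; the log name is the one from the description, the resource names are the ones quoted above):

# Only the "accept ssh from all ipv4/ipv6" removal rules are expected to show
# up if the ctlplane-scoped rules are indeed missing
grep -E '003 accept ssh from (all ipv4|all ipv6|ctlplane subnet)' \
  ~/overcloud_deployment_12.log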

So this is either linked to https://bugs.launchpad.net/tripleo/+bug/1836696 or it's a different issue. I tend to think it's a different one, but we would need to get a view on the iptables content on the unreachable hosts. This can be done as follows:

- deploy the undercloud
- introspect nodes
- run the deploy command adding the "--stack-only" parameter
- run tripleo-config-download --output-dir oc-config-download
- run tripleo-ansible-inventory --ansible_ssh_user heat-admin --static-yaml-inventory oc-inventory.yaml
- fetch the servers' IPs with "openstack server list", and connect to one of them (compute, controller, whatever)
- run ansible-playbook \
  -i oc-inventory.yaml \
  --private-key /home/stack/.ssh/id_rsa \
  --become \
  oc-config-download/deploy_steps_playbook.yaml

That last step will run the deploy as plain ansible. Doing so gives you time to connect to the node, and also lets you get the generated templates/configurations for the overcloud nodes.
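
Putting the sequence together, it looks roughly like this on the undercloud (a sketch only; the undercloud install/introspection commands and the deploy arguments depend on your environment, the rest reuses the commands from the list above):

# 1-3: deploy the undercloud, introspect the nodes, then create the stack
#      without running config-download (--stack-only)
openstack undercloud install
openstack overcloud node introspect --all-manageable --provide
openstack overcloud deploy --templates <your usual -e files> --stack-only

# 4-5: download the config and build a static inventory
tripleo-config-download --output-dir oc-config-download
tripleo-ansible-inventory --ansible_ssh_user heat-admin \
  --static-yaml-inventory oc-inventory.yaml

# 6-7: note the node IPs, then run the deploy as plain ansible
openstack server list
ansible-playbook \
  -i oc-inventory.yaml \
  --private-key /home/stack/.ssh/id_rsa \
  --become \
  oc-config-download/deploy_steps_playbook.yaml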

Once you have the failure, as you're connected to the overcloud node, you will be able to provide the iptables content as follows:
sudo iptables -vnL INPUT

You can enable SSH access with this simple command: sudo iptables -I INPUT -j ACCEPT (note: this will enable ALL connections, not only SSH).
That will let you provide an env we can use for debugging and research.
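
If opening everything is too broad, a narrower variant that only accepts SSH works as well (standard iptables syntax):

# Insert a rule at the top of INPUT that only accepts TCP port 22 (ssh)
sudo iptables -I INPUT -p tcp --dport 22 -j ACCEPT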

Care to provide that? Thanks!

Comment 7 Sasha Smolyak 2019-08-01 06:22:30 UTC

*** This bug has been marked as a duplicate of bug 1734172 ***