Bug 1497328 - create_admin_via_nova returns before the ssh key is installed on all nodes (was: ceph-ansible starts before hosts are ready)
Summary: create_admin_via_nova returns before the ssh key is installed on all nodes (was: ceph-ansible starts before hosts are ready)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 12.0 (Pike)
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: beta
Target Release: 12.0 (Pike)
Assignee: Giulio Fidente
QA Contact: Alexander Chuzhoy
URL:
Whiteboard: PerfScale
Depends On:
Blocks:
 
Reported: 2017-09-29 19:38 UTC by Joe Talerico
Modified: 2018-02-05 19:15 UTC
CC: 15 users

Fixed In Version: openstack-tripleo-common-7.6.3-0.20171010234828.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-13 22:11:53 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1720793 0 None None None 2017-10-02 13:27:25 UTC
OpenStack gerrit 510970 0 None None None 2017-10-16 09:24:03 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description Joe Talerico 2017-09-29 19:38:39 UTC
Description of problem:
When Mistral kicks off ceph-ansible, I am seeing issues like:

2017-09-29 15:38:10,768 p=19459 u=mistral |  TASK [ceph-defaults : is ceph running already?] ********************************
2017-09-29 15:38:10,780 p=19459 u=mistral |  [DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
2017-09-29 15:38:11,180 p=19459 u=mistral |  fatal: [192.168.24.56]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}
2017-09-29 15:38:11,181 p=19459 u=mistral |  fatal: [192.168.24.71]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}
2017-09-29 15:38:11,188 p=19459 u=mistral |  [DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..

This causes the deployment to fail due to the hosts being unreachable.

However, I am able to log in to the hosts that Ansible reports as unreachable=1.

For the full Ansible log (includes multiple deployments) :
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/logs/092917-ceph-ansible-mistral.log

This has only become a problem since growing the overcloud deployment to 3 controllers, 3 Ceph nodes, and 26 compute nodes (deployed at once).

Version-Release number of selected component (if applicable):
puppet-mistral-11.3.1-0.20170825184651.cf2e493.el7ost.noarch
python-mistral-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-engine-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
python-mistral-lib-0.2.0-0.20170821165722.bb1b87b.el7ost.noarch
openstack-mistral-common-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-api-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-executor-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
python-mistralclient-3.1.3-0.20170913011357.c33d39c.el7ost.noarch


How reproducible:
Seems to reproduce 100% of the time (the last two deploys have failed due to this).

Steps to Reproduce:
1. Deploy with 32 nodes, and some ceph

Actual results:
Failed deployment

Expected results:
Successful Deployment

Additional info:

Comment 1 John Fulton 2017-09-29 20:59:52 UTC
Jirka and I talked to Joe about this. He found that, on the node where the bug hit, the tripleo-admin user was configured and the SSH keys Ansible was using were the right ones.

We had asked for this information because, given the following in the Mistral workbook:

https://github.com/openstack/tripleo-common/blob/master/workbooks/ceph-ansible.yaml#L24-L26

the ceph-ansible playbook would run only after the tripleo-admin user was configured, and that should prevent this bug. We had seen the same error reported in a split-stack scenario in upstream CI, and Jirka resolved it by adding the following:

https://github.com/openstack/tripleo-common/commit/77dbe9295b282c54aab65c6b9815a575ce29a49c#diff-03b4bc9664d59568adabe645ea018e03

Assuming that os-collect-config was running without issue on the node and that the keys and account were set up correctly, is it possible that, with a large deployment, not all of the nodes were stood up yet, and therefore Ansible could not connect to all of them at that point in the deploy?

If so, could we have Mistral verify that something like: 

 ansible all -m ping 

returns 100% success before starting the ceph-ansible playbook? Perhaps the Mistral task could do a wait-until?
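
For illustration, such a check might be expressed as a Mistral workbook task along these lines (a rough sketch only; the task name, the tripleo.ansible action usage, and the follow-on task are assumptions, not an actual patch):

  verify_nodes_reachable:
    # keep re-running an ad-hoc Ansible ping until every overcloud node answers
    action: tripleo.ansible
    input:
      hosts: all
      module: ping
    retry:
      count: 30
      delay: 10
    on-success: ceph_install   # hypothetical next task in the ceph-ansible workflow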

Comment 2 Joe Talerico 2017-09-30 11:43:34 UTC
This is the third time this has bitten me.

2017-09-29 21:06:17,413 p=29967 u=mistral |  RUNNING HANDLER [ceph-defaults : restart ceph mdss] ****************************
2017-09-29 21:06:17,440 p=29967 u=mistral |  RUNNING HANDLER [ceph-defaults : restart ceph rgws] ****************************
2017-09-29 21:06:17,469 p=29967 u=mistral |  PLAY RECAP *********************************************************************
2017-09-29 21:06:17,469 p=29967 u=mistral |  192.168.24.52              : ok=49   changed=8    unreachable=0    failed=0
2017-09-29 21:06:17,469 p=29967 u=mistral |  192.168.24.53              : ok=3    changed=0    unreachable=1    failed=0
2017-09-29 21:06:17,469 p=29967 u=mistral |  192.168.24.54              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral |  192.168.24.55              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral |  192.168.24.56              : ok=3    changed=0    unreachable=1    failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral |  192.168.24.57              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral |  192.168.24.58              : ok=39   changed=7    unreachable=0    failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral |  192.168.24.59              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral |  192.168.24.61              : ok=39   changed=7    unreachable=0    failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral |  192.168.24.62              : ok=39   changed=4    unreachable=0    failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral |  192.168.24.63              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral |  192.168.24.64              : ok=41   changed=7    unreachable=0    failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral |  192.168.24.65              : ok=26   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral |  192.168.24.66              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral |  192.168.24.67              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral |  192.168.24.68              : ok=39   changed=5    unreachable=0    failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral |  192.168.24.69              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral |  192.168.24.70              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral |  192.168.24.72              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral |  192.168.24.73              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral |  192.168.24.74              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral |  192.168.24.75              : ok=3    changed=0    unreachable=1    failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral |  192.168.24.76              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral |  192.168.24.77              : ok=3    changed=0    unreachable=1    failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral |  192.168.24.78              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral |  192.168.24.80              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral |  192.168.24.83              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral |  192.168.24.84              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral |  192.168.24.87              : ok=3    changed=0    unreachable=1    failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral |  192.168.24.89              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral |  192.168.24.90              : ok=24   changed=6    unreachable=0    failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral |  192.168.24.92              : ok=24   changed=6    unreachable=0    failed=0

However, if I lower the compute count (to 16):

2017-09-30 02:32:28Z [overcloud.AllNodesDeploySteps.ObjectStoragePostConfig]: CREATE_COMPLETE  state changed
2017-09-30 02:32:28Z [overcloud.AllNodesDeploySteps.ComputePostConfig]: CREATE_COMPLETE  state changed
2017-09-30 02:32:28Z [overcloud.AllNodesDeploySteps]: CREATE_COMPLETE  Stack CREATE completed successfully
2017-09-30 02:32:28Z [overcloud.AllNodesDeploySteps]: CREATE_COMPLETE  state changed
2017-09-30 02:32:29Z [overcloud]: CREATE_COMPLETE  Stack CREATE completed successfully

 Stack overcloud CREATE_COMPLETE 

Internal Server Error (HTTP 500)

real    113m46.788s
user    0m8.993s
sys     0m0.540s


So, with 16 computes it succeeds, but jumping to 26, it fails.

Another important note: 16 computes did initially fail (same errors as this bug). Starting over, the deployment succeeded.

Comment 3 John Fulton 2017-09-30 17:59:19 UTC
Joe,

As a quick workaround do you want to try modifying: 

 /usr/share/ceph-ansible/ansible.cfg

to retry more often on an SSH connection failure? 

 https://stackoverflow.com/questions/40340761/is-it-possible-to-have-ansible-retry-on-connection-failure

  John

Comment 4 Joe Talerico 2017-10-01 00:08:48 UTC
(In reply to John Fulton from comment #3)
> Joe,
> 
> As a quick workaround do you want to try modifying: 
> 
>  /usr/share/ceph-ansible/ansible.cfg
> 
> to retry more often on an SSH connection failure? 
> 
>  https://stackoverflow.com/questions/40340761/is-it-possible-to-have-ansible-
> retry-on-connection-failure
> 
>   John

Hey John - As mentioned on IRC, I don't think that will help this issue. If I wanted to modify ceph-ansible, I would simply add a retry/delay to the initial task that always seems to fail, as a workaround.

Comment 5 Joe Talerico 2017-10-01 15:54:42 UTC
(In reply to John Fulton from comment #3)
> Joe,
> 
> As a quick workaround do you want to try modifying: 
> 
>  /usr/share/ceph-ansible/ansible.cfg
> 
> to retry more often on an SSH connection failure? 
> 
>  https://stackoverflow.com/questions/40340761/is-it-possible-to-have-ansible-
> retry-on-connection-failure
> 
>   John

I'll eat my own words here, John! I set retry = 5, and the ceph-ansible playbook completed with 26 compute nodes. This seems like a reasonable workaround until we get a Mistral task to check SSH connectivity before progressing to ceph-ansible.
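
For reference, the ansible.cfg change amounts to raising the SSH connection retry count, roughly as below (assuming the stock [ssh_connection] section; 5 is simply the value mentioned above, not a tuned number):

  [ssh_connection]
  retries = 5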

Comment 6 Jiri Stransky 2017-10-02 09:10:38 UTC
The public key authorization on the nodes is inserted via an os-collect-config software deployment (there's no other access to the nodes for Mistral at that point), and I think os-collect-config can have some delay when picking up the metadata. In other words, the public key insertion is asynchronous.

So indeed the best solution might be a follow-up task after this one:

https://github.com/openstack/tripleo-common/blob/e21f8e094f503b3a82a40d54d5459dd70ba4cbfa/workbooks/access.yaml#L72

which would try using the authorized key (with retries), to give the os-collect-config agents some time to pick up and apply the software deployment. (The same could be done in the ceph-ansible workflow, but it's better to solve it globally if we can.)
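
A minimal sketch of what that follow-up task could look like in access.yaml (illustrative assumptions throughout; the action, parameters, and expression are not the merged patch):

  wait_for_tripleo_admin_key:
    # try an SSH connection as tripleo-admin and keep retrying while
    # os-collect-config picks up and applies the key deployment
    action: tripleo.ansible
    input:
      hosts: <% $.ssh_servers %>   # hypothetical input listing the overcloud nodes
      module: ping
      remote_user: tripleo-admin
    retry:
      count: 30
      delay: 10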

Comment 7 Joe Talerico 2017-10-02 14:34:07 UTC
(In reply to Jiri Stransky from comment #6)
> The public key authorization on nodes is inserted via os-collect-config
> software deployment (there's no other access to the node for Mistral at that
> point), and i think os-collect-config can have some delay when picking up
> the metadata. IOW the public key insertion is asynchronous.
> 
> So indeed the best solution might be a follow up task after this one
> 
> https://github.com/openstack/tripleo-common/blob/
> e21f8e094f503b3a82a40d54d5459dd70ba4cbfa/workbooks/access.yaml#L72
> 
> which will try using the authorized key (with retries), to give the
> os-collect-config agents some time to pick up and apply the software
> deployment. (Same could be done in the ceph-ansible workflow, but better
> solve it globally if we can.)

I think the ceph-ansible workflow should double-check things prior to deploying. Even if the key is dropped in, something outside of the deployment tool's control could impact the deployment (especially when we are looking at many nodes).

In my opinion, the more checks we can put throughout the workflow, the better.

Comment 8 Giulio Fidente 2017-10-16 10:41:44 UTC
Landed in the master branch; updated the reference to point to the stable/pike port.

Comment 11 Alexander Chuzhoy 2017-10-26 21:41:19 UTC
Environment:
openstack-tripleo-common-7.6.3-0.20171010234828.el7ost.noarch

Was able to deploy successfully with 6 ceph nodes.
Is this sufficient to verify this issue?

Comment 12 Giulio Fidente 2017-10-27 12:10:09 UTC
(In reply to Alexander Chuzhoy from comment #11)
> Environment:
> openstack-tripleo-common-7.6.3-0.20171010234828.el7ost.noarch
> 
> Was able to deploy successfully with 6 ceph nodes.
> Is this sufficient to verify this issue?

Probably not; I think people have seen this happening with 24 nodes, never with fewer than 16 nodes.

Comment 13 Joe Talerico 2017-10-27 12:13:50 UTC
This is less about the Ceph nodes and more about the compute nodes. As mentioned in comment 2, 16 compute nodes did not see this issue.

Comment 14 Alexander Chuzhoy 2017-10-30 20:56:12 UTC
Verified.
Environment:
openstack-tripleo-common-7.6.3-0.20171022171808.el7ost.noarch

Successfully deployed and populated a setup with 24 compute nodes:

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks               |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| 4b445f0a-cb09-4944-87d1-1958c1d40114 | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.22 |
| 79b7e9a2-2fab-470a-8997-1c9d60584a99 | overcloud-cephstorage-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.34 |
| 9769ab90-7caf-4ccb-9df4-caef85c8b847 | overcloud-cephstorage-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.39 |
| ed67d6e7-8970-411e-ac5a-5e8f612537e5 | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.168.24.30 |
| cee5a230-59a8-470f-b418-f64e3e8b73f3 | overcloud-compute-1     | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 389c4280-0585-4455-ba50-0373968142b9 | overcloud-compute-10    | ACTIVE | -          | Running     | ctlplane=192.168.24.26 |
| 8bffccd3-9c72-4929-aea4-cad6d20aa926 | overcloud-compute-11    | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| 32c2a921-7263-4527-9ec8-1bcea5441cce | overcloud-compute-12    | ACTIVE | -          | Running     | ctlplane=192.168.24.16 |
| 60af7544-ffb7-469e-a81f-9b93145a62ae | overcloud-compute-13    | ACTIVE | -          | Running     | ctlplane=192.168.24.32 |
| d5eae938-ad4c-4be0-a9ea-0d95e217d87d | overcloud-compute-14    | ACTIVE | -          | Running     | ctlplane=192.168.24.36 |
| 0ebfe7d3-6226-464a-80cd-4c5dfa9f8740 | overcloud-compute-15    | ACTIVE | -          | Running     | ctlplane=192.168.24.25 |
| 3b9431c3-03b8-44a1-bf6c-cf16182602e0 | overcloud-compute-16    | ACTIVE | -          | Running     | ctlplane=192.168.24.29 |
| 3a4d47ba-82b4-4120-a1f9-e8aa954d1338 | overcloud-compute-17    | ACTIVE | -          | Running     | ctlplane=192.168.24.17 |
| d26473ad-6310-4470-8e12-10be6348bed6 | overcloud-compute-18    | ACTIVE | -          | Running     | ctlplane=192.168.24.14 |
| 9ceac8dd-e156-46fd-8ed3-ea8e97bdfdb9 | overcloud-compute-19    | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| 3d1f2584-05a4-4085-aedc-0c01b7fc9767 | overcloud-compute-2     | ACTIVE | -          | Running     | ctlplane=192.168.24.7  |
| 71c3784b-853d-43be-8d51-560e398676a9 | overcloud-compute-20    | ACTIVE | -          | Running     | ctlplane=192.168.24.6  |
| 8a37dbfa-6d14-4288-a281-978dfda62c29 | overcloud-compute-21    | ACTIVE | -          | Running     | ctlplane=192.168.24.27 |
| 1b8e7132-2c27-43ee-8c70-6622558cf3d7 | overcloud-compute-22    | ACTIVE | -          | Running     | ctlplane=192.168.24.21 |
| ea1ef19f-c99d-4251-a28d-4f6c01a4843b | overcloud-compute-23    | ACTIVE | -          | Running     | ctlplane=192.168.24.38 |
| 9aecd654-72e7-4bfc-85a2-fbbe73a2f2f1 | overcloud-compute-3     | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| f13d3c42-4c0a-4442-80c8-b25773f588dc | overcloud-compute-4     | ACTIVE | -          | Running     | ctlplane=192.168.24.19 |
| 88e6c282-01ea-429a-af53-6dfa0caa8971 | overcloud-compute-5     | ACTIVE | -          | Running     | ctlplane=192.168.24.24 |
| 4e9736af-cdb8-4ea9-afdd-6cef9e53e0cf | overcloud-compute-6     | ACTIVE | -          | Running     | ctlplane=192.168.24.28 |
| 14dfbdd3-fe8e-47e8-8173-7b828a13fc8f | overcloud-compute-7     | ACTIVE | -          | Running     | ctlplane=192.168.24.20 |
| d1654b43-7a9e-4038-9f78-250ced44fae9 | overcloud-compute-8     | ACTIVE | -          | Running     | ctlplane=192.168.24.43 |
| 3a937ab2-fe07-46f9-81c1-80bd9cf216fe | overcloud-compute-9     | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| 92a299ad-7b9e-4d8b-8abc-5395e481bf10 | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
| 90c834c9-8a48-4869-bc7a-c1d5408d5a08 | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.168.24.11 |
| 25d8236a-cc26-41ee-a678-bb06e7af3b10 | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.168.24.23 |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
(undercloud) [stack@undercloud-0 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| id                                   | stack_name | stack_status    | creation_time        | updated_time | project                          |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| 6ed57bb2-a782-45d4-b131-52f8615bf2ef | overcloud  | CREATE_COMPLETE | 2017-10-30T19:30:17Z | None         | 7c53c41d51d74361ac57676ad34a93af |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
(undercloud) [stack@undercloud-0 ~]$ . overcloudrc
(overcloud) [stack@undercloud-0 ~]$ nova list --all
+--------------------------------------+--------------+----------------------------------+--------+------------+-------------+--------------------------------------+
| ID                                   | Name         | Tenant ID                        | Status | Task State | Power State | Networks                             |
+--------------------------------------+--------------+----------------------------------+--------+------------+-------------+--------------------------------------+
| b263728b-3bb2-477d-aac6-548eba6a9202 | after_deploy | 014ebf1660824a57b7b9db69d31c5b27 | ACTIVE | -          | Running     | tenantvxlan=192.168.32.7, 10.0.0.190 |
+--------------------------------------+--------------+----------------------------------+--------+------------+-------------+--------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ ping -c1 10.0.0.190
PING 10.0.0.190 (10.0.0.190) 56(84) bytes of data.
64 bytes from 10.0.0.190: icmp_seq=1 ttl=63 time=2.44 ms

--- 10.0.0.190 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.445/2.445/2.445/0.000 ms

Comment 17 errata-xmlrpc 2017-12-13 22:11:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

