Description of problem:

When Mistral kicks off ceph-ansible, I am seeing issues like:

2017-09-29 15:38:10,768 p=19459 u=mistral | TASK [ceph-defaults : is ceph running already?] ********************************
2017-09-29 15:38:10,780 p=19459 u=mistral | [DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..
2017-09-29 15:38:11,180 p=19459 u=mistral | fatal: [192.168.24.56]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}
2017-09-29 15:38:11,181 p=19459 u=mistral | fatal: [192.168.24.71]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Could not create directory '/home/mistral/.ssh'.\r\nssh_exchange_identification: Connection closed by remote host\r\n", "unreachable": true}
2017-09-29 15:38:11,188 p=19459 u=mistral | [DEPRECATION WARNING]: always_run is deprecated. Use check_mode = no instead..

This causes the deployment to fail because the hosts are reported as unreachable. However, I am able to log in to the hosts that show unreachable=1.

Full Ansible log (includes multiple deployments):
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/logs/092917-ceph-ansible-mistral.log

This has only become a problem since growing the overcloud deployment to 3 controllers, 3 ceph nodes, and 26 compute nodes (deployed at once).

Version-Release number of selected component (if applicable):
puppet-mistral-11.3.1-0.20170825184651.cf2e493.el7ost.noarch
python-mistral-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-engine-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
python-mistral-lib-0.2.0-0.20170821165722.bb1b87b.el7ost.noarch
openstack-mistral-common-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-api-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-mistral-executor-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
python-mistralclient-3.1.3-0.20170913011357.c33d39c.el7ost.noarch

How reproducible:
Seems to reproduce 100% of the time (the last two deploys have failed because of this).

Steps to Reproduce:
1. Deploy with 32 nodes, including some ceph nodes.

Actual results:
Failed deployment

Expected results:
Successful deployment

Additional info:
Jirka and I talked to Joe about this. He confirmed that the tripleo-admin user was configured on the node that Ansible reported as unreachable, and that the SSH keys were the right ones. We had asked for this information because, given the following in the Mistral workbook:

https://github.com/openstack/tripleo-common/blob/master/workbooks/ceph-ansible.yaml#L24-L26

the ceph-ansible playbook would run only after the tripleo-admin user was configured, and that should prevent this bug. We had seen the same reported error in a split-stack scenario in upstream CI, and Jirka resolved it by adding the following:

https://github.com/openstack/tripleo-common/commit/77dbe9295b282c54aab65c6b9815a575ce29a49c#diff-03b4bc9664d59568adabe645ea018e03

Assuming that os-collect-config was running without issue on the node and that the keys and account were set up correctly, is it possible that, with a large deployment, not all of the nodes are stood up yet and therefore Ansible could not connect to all of them at that point in the deploy? If so, could we have Mistral verify that something like:

ansible all -m ping

returns 100% success before starting the ceph-ansible playbook? Perhaps the Mistral task could do a wait-until?
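To make that concrete, a gating task in the workbook could look roughly like the sketch below. This is only an illustration: the task name is made up, and the exact inputs accepted by the tripleo.ansible action are an assumption on my part, so treat it as pseudo-workflow rather than a proposed patch.

  verify_nodes_reachable:
    # Hypothetical task: keep re-running an "ansible all -m ping" style
    # connectivity check until every node answers, then hand off to the
    # task that actually runs the ceph-ansible playbook.
    action: tripleo.ansible
    input:
      hosts: all
      module: ping
      remote_user: tripleo-admin
    retry:
      count: 30   # roughly 5 minutes of waiting with a 10 second delay
      delay: 10
    on-success: ceph_install   # placeholder name for the existing playbook task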
This is the third time this has bitten me.

2017-09-29 21:06:17,413 p=29967 u=mistral | RUNNING HANDLER [ceph-defaults : restart ceph mdss] ****************************
2017-09-29 21:06:17,440 p=29967 u=mistral | RUNNING HANDLER [ceph-defaults : restart ceph rgws] ****************************
2017-09-29 21:06:17,469 p=29967 u=mistral | PLAY RECAP *********************************************************************
2017-09-29 21:06:17,469 p=29967 u=mistral | 192.168.24.52 : ok=49 changed=8 unreachable=0 failed=0
2017-09-29 21:06:17,469 p=29967 u=mistral | 192.168.24.53 : ok=3 changed=0 unreachable=1 failed=0
2017-09-29 21:06:17,469 p=29967 u=mistral | 192.168.24.54 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral | 192.168.24.55 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral | 192.168.24.56 : ok=3 changed=0 unreachable=1 failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral | 192.168.24.57 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral | 192.168.24.58 : ok=39 changed=7 unreachable=0 failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral | 192.168.24.59 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral | 192.168.24.61 : ok=39 changed=7 unreachable=0 failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral | 192.168.24.62 : ok=39 changed=4 unreachable=0 failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral | 192.168.24.63 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral | 192.168.24.64 : ok=41 changed=7 unreachable=0 failed=0
2017-09-29 21:06:17,470 p=29967 u=mistral | 192.168.24.65 : ok=26 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral | 192.168.24.66 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral | 192.168.24.67 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral | 192.168.24.68 : ok=39 changed=5 unreachable=0 failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral | 192.168.24.69 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral | 192.168.24.70 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral | 192.168.24.72 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral | 192.168.24.73 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral | 192.168.24.74 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral | 192.168.24.75 : ok=3 changed=0 unreachable=1 failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral | 192.168.24.76 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,471 p=29967 u=mistral | 192.168.24.77 : ok=3 changed=0 unreachable=1 failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral | 192.168.24.78 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral | 192.168.24.80 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral | 192.168.24.83 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral | 192.168.24.84 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral | 192.168.24.87 : ok=3 changed=0 unreachable=1 failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral | 192.168.24.89 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral | 192.168.24.90 : ok=24 changed=6 unreachable=0 failed=0
2017-09-29 21:06:17,472 p=29967 u=mistral | 192.168.24.92 : ok=24 changed=6 unreachable=0 failed=0

However, if I lower the compute count (to 16):

2017-09-30 02:32:28Z [overcloud.AllNodesDeploySteps.ObjectStoragePostConfig]: CREATE_COMPLETE state changed
2017-09-30 02:32:28Z [overcloud.AllNodesDeploySteps.ComputePostConfig]: CREATE_COMPLETE state changed
2017-09-30 02:32:28Z [overcloud.AllNodesDeploySteps]: CREATE_COMPLETE Stack CREATE completed successfully
2017-09-30 02:32:28Z [overcloud.AllNodesDeploySteps]: CREATE_COMPLETE state changed
2017-09-30 02:32:29Z [overcloud]: CREATE_COMPLETE Stack CREATE completed successfully

Stack overcloud CREATE_COMPLETE
Internal Server Error (HTTP 500)

real 113m46.788s
user 0m8.993s
sys 0m0.540s

So with 16 computes it succeeds, but jumping to 26 it fails. Another important note: the 16-compute deployment did initially fail with the same errors as this bug; starting over, the deployment succeeded.
Joe,

As a quick workaround do you want to try modifying:

/usr/share/ceph-ansible/ansible.cfg

to retry more often on an SSH connection failure?

https://stackoverflow.com/questions/40340761/is-it-possible-to-have-ansible-retry-on-connection-failure

John
(In reply to John Fulton from comment #3)
> Joe,
> 
> As a quick workaround do you want to try modifying:
> 
> /usr/share/ceph-ansible/ansible.cfg
> 
> to retry more often on an SSH connection failure?
> 
> https://stackoverflow.com/questions/40340761/is-it-possible-to-have-ansible-
> retry-on-connection-failure
> 
> John

Hey John - As mentioned on IRC, I don't think that will help this issue. I think if I wanted to modify ceph-ansible, I would simply add a retry/delay to the initial task that seems to always fail, as a workaround.
(In reply to John Fulton from comment #3)
> Joe,
> 
> As a quick workaround do you want to try modifying:
> 
> /usr/share/ceph-ansible/ansible.cfg
> 
> to retry more often on an SSH connection failure?
> 
> https://stackoverflow.com/questions/40340761/is-it-possible-to-have-ansible-
> retry-on-connection-failure
> 
> John

I'll eat my own words here, John! I set retry = 5, and the ceph-ansible playbook completed with 26 compute nodes. This seems like a reasonable workaround until we have a Mistral task that checks SSH connectivity before progressing to ceph-ansible.
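For anyone else needing the workaround: the knob lives in the [ssh_connection] section of ansible.cfg, and the documented option name is "retries" (environment variable ANSIBLE_SSH_RETRIES), so what was tested here presumably amounts to something like the following in /usr/share/ceph-ansible/ansible.cfg:

[ssh_connection]
# Re-attempt the SSH connection this many times before Ansible gives up
# and marks the host unreachable; 5 is the value reported to work above.
retries = 5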
The public key authorization on nodes is inserted via an os-collect-config software deployment (there's no other access to the node for Mistral at that point), and I think os-collect-config can have some delay when picking up the metadata. IOW, the public key insertion is asynchronous.

So indeed the best solution might be a follow-up task after this one:

https://github.com/openstack/tripleo-common/blob/e21f8e094f503b3a82a40d54d5459dd70ba4cbfa/workbooks/access.yaml#L72

which tries using the authorized key (with retries), to give the os-collect-config agents some time to pick up and apply the software deployment. (The same could be done in the ceph-ansible workflow, but it's better to solve it globally if we can.)
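As one possible shape for that follow-up step (just a sketch, not the actual tripleo-common change; the host pattern and timeouts are placeholders), an Ansible play using the wait_for_connection module would keep retrying the SSH login until os-collect-config has applied the authorized key:

- hosts: all
  gather_facts: false
  tasks:
    - name: wait until the tripleo-admin key is in place and SSH actually works
      wait_for_connection:
        delay: 5      # initial pause before the first attempt
        sleep: 10     # seconds between attempts
        timeout: 600  # overall budget for os-collect-config to catch up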
(In reply to Jiri Stransky from comment #6)
> The public key authorization on nodes is inserted via os-collect-config
> software deployment (there's no other access to the node for Mistral at that
> point), and i think os-collect-config can have some delay when picking up
> the metadata. IOW the public key insertion is asynchronous.
> 
> So indeed the best solution might be a follow up task after this one
> 
> https://github.com/openstack/tripleo-common/blob/
> e21f8e094f503b3a82a40d54d5459dd70ba4cbfa/workbooks/access.yaml#L72
> 
> which will try using the authorized key (with retries), to give the
> os-collect-config agents some time to pick up and apply the software
> deployment. (Same could be done in the ceph-ansible workflow, but better
> solve it globally if we can.)

I think the ceph-ansible workflow should double check things prior to deploying. Even if the key is dropped in, something outside of the deployment tool's control could impact the deployment (especially when we are looking at many nodes). In my opinion, the more checks we can put throughout the workflow, the better.
Landed in the master branch; updated the reference to the stable/pike port.
Environment:
openstack-tripleo-common-7.6.3-0.20171010234828.el7ost.noarch

Was able to deploy successfully with 6 ceph nodes.
Is this sufficient to verify this issue?
(In reply to Alexander Chuzhoy from comment #11)
> Environment:
> openstack-tripleo-common-7.6.3-0.20171010234828.el7ost.noarch
> 
> Was able to deploy successfully with 6 ceph nodes.
> Is this sufficient to verify this issue?

Probably not. I think people have seen this happening with 24 nodes, never with fewer than 16 nodes.
This is less about the ceph nodes, and more about the compute nodes. As mentioned in Comment 2 -- 16 compute nodes did not see this issue.
Verified.

Environment:
openstack-tripleo-common-7.6.3-0.20171022171808.el7ost.noarch

Successfully deployed and populated a setup with 24 compute nodes:

(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| 4b445f0a-cb09-4944-87d1-1958c1d40114 | overcloud-cephstorage-0 | ACTIVE | - | Running | ctlplane=192.168.24.22 |
| 79b7e9a2-2fab-470a-8997-1c9d60584a99 | overcloud-cephstorage-1 | ACTIVE | - | Running | ctlplane=192.168.24.34 |
| 9769ab90-7caf-4ccb-9df4-caef85c8b847 | overcloud-cephstorage-2 | ACTIVE | - | Running | ctlplane=192.168.24.39 |
| ed67d6e7-8970-411e-ac5a-5e8f612537e5 | overcloud-compute-0 | ACTIVE | - | Running | ctlplane=192.168.24.30 |
| cee5a230-59a8-470f-b418-f64e3e8b73f3 | overcloud-compute-1 | ACTIVE | - | Running | ctlplane=192.168.24.9 |
| 389c4280-0585-4455-ba50-0373968142b9 | overcloud-compute-10 | ACTIVE | - | Running | ctlplane=192.168.24.26 |
| 8bffccd3-9c72-4929-aea4-cad6d20aa926 | overcloud-compute-11 | ACTIVE | - | Running | ctlplane=192.168.24.15 |
| 32c2a921-7263-4527-9ec8-1bcea5441cce | overcloud-compute-12 | ACTIVE | - | Running | ctlplane=192.168.24.16 |
| 60af7544-ffb7-469e-a81f-9b93145a62ae | overcloud-compute-13 | ACTIVE | - | Running | ctlplane=192.168.24.32 |
| d5eae938-ad4c-4be0-a9ea-0d95e217d87d | overcloud-compute-14 | ACTIVE | - | Running | ctlplane=192.168.24.36 |
| 0ebfe7d3-6226-464a-80cd-4c5dfa9f8740 | overcloud-compute-15 | ACTIVE | - | Running | ctlplane=192.168.24.25 |
| 3b9431c3-03b8-44a1-bf6c-cf16182602e0 | overcloud-compute-16 | ACTIVE | - | Running | ctlplane=192.168.24.29 |
| 3a4d47ba-82b4-4120-a1f9-e8aa954d1338 | overcloud-compute-17 | ACTIVE | - | Running | ctlplane=192.168.24.17 |
| d26473ad-6310-4470-8e12-10be6348bed6 | overcloud-compute-18 | ACTIVE | - | Running | ctlplane=192.168.24.14 |
| 9ceac8dd-e156-46fd-8ed3-ea8e97bdfdb9 | overcloud-compute-19 | ACTIVE | - | Running | ctlplane=192.168.24.8 |
| 3d1f2584-05a4-4085-aedc-0c01b7fc9767 | overcloud-compute-2 | ACTIVE | - | Running | ctlplane=192.168.24.7 |
| 71c3784b-853d-43be-8d51-560e398676a9 | overcloud-compute-20 | ACTIVE | - | Running | ctlplane=192.168.24.6 |
| 8a37dbfa-6d14-4288-a281-978dfda62c29 | overcloud-compute-21 | ACTIVE | - | Running | ctlplane=192.168.24.27 |
| 1b8e7132-2c27-43ee-8c70-6622558cf3d7 | overcloud-compute-22 | ACTIVE | - | Running | ctlplane=192.168.24.21 |
| ea1ef19f-c99d-4251-a28d-4f6c01a4843b | overcloud-compute-23 | ACTIVE | - | Running | ctlplane=192.168.24.38 |
| 9aecd654-72e7-4bfc-85a2-fbbe73a2f2f1 | overcloud-compute-3 | ACTIVE | - | Running | ctlplane=192.168.24.12 |
| f13d3c42-4c0a-4442-80c8-b25773f588dc | overcloud-compute-4 | ACTIVE | - | Running | ctlplane=192.168.24.19 |
| 88e6c282-01ea-429a-af53-6dfa0caa8971 | overcloud-compute-5 | ACTIVE | - | Running | ctlplane=192.168.24.24 |
| 4e9736af-cdb8-4ea9-afdd-6cef9e53e0cf | overcloud-compute-6 | ACTIVE | - | Running | ctlplane=192.168.24.28 |
| 14dfbdd3-fe8e-47e8-8173-7b828a13fc8f | overcloud-compute-7 | ACTIVE | - | Running | ctlplane=192.168.24.20 |
| d1654b43-7a9e-4038-9f78-250ced44fae9 | overcloud-compute-8 | ACTIVE | - | Running | ctlplane=192.168.24.43 |
| 3a937ab2-fe07-46f9-81c1-80bd9cf216fe | overcloud-compute-9 | ACTIVE | - | Running | ctlplane=192.168.24.10 |
| 92a299ad-7b9e-4d8b-8abc-5395e481bf10 | overcloud-controller-0 | ACTIVE | - | Running | ctlplane=192.168.24.18 |
| 90c834c9-8a48-4869-bc7a-c1d5408d5a08 | overcloud-controller-1 | ACTIVE | - | Running | ctlplane=192.168.24.11 |
| 25d8236a-cc26-41ee-a678-bb06e7af3b10 | overcloud-controller-2 | ACTIVE | - | Running | ctlplane=192.168.24.23 |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+

(undercloud) [stack@undercloud-0 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| id | stack_name | stack_status | creation_time | updated_time | project |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| 6ed57bb2-a782-45d4-b131-52f8615bf2ef | overcloud | CREATE_COMPLETE | 2017-10-30T19:30:17Z | None | 7c53c41d51d74361ac57676ad34a93af |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+

(undercloud) [stack@undercloud-0 ~]$ . overcloudrc
(overcloud) [stack@undercloud-0 ~]$ nova list --all
+--------------------------------------+--------------+----------------------------------+--------+------------+-------------+--------------------------------------+
| ID | Name | Tenant ID | Status | Task State | Power State | Networks |
+--------------------------------------+--------------+----------------------------------+--------+------------+-------------+--------------------------------------+
| b263728b-3bb2-477d-aac6-548eba6a9202 | after_deploy | 014ebf1660824a57b7b9db69d31c5b27 | ACTIVE | - | Running | tenantvxlan=192.168.32.7, 10.0.0.190 |
+--------------------------------------+--------------+----------------------------------+--------+------------+-------------+--------------------------------------+

(overcloud) [stack@undercloud-0 ~]$ ping -c1 10.0.0.190
PING 10.0.0.190 (10.0.0.190) 56(84) bytes of data.
64 bytes from 10.0.0.190: icmp_seq=1 ttl=63 time=2.44 ms

--- 10.0.0.190 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.445/2.445/2.445/0.000 ms
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462