Description of problem:

I am trying to deploy OSP-13 and RHCS3 in an HCI environment. At the very last stage of the deployment, openstack stack deploy failed. After troubleshooting, it looks like stack creation failed because of a ceph-ansible error:

2018-08-20 08:29:10,611 p=27101 u=mistral | TASK [ceph-client : get client cephx keys] *************************************
2018-08-20 08:29:10,611 p=27101 u=mistral | Monday 20 August 2018 08:29:10 -0400 (0:00:00.083) 0:13:53.981 *********
2018-08-20 08:29:10,729 p=27101 u=mistral | fatal: [192.168.120.20]: FAILED! => {"msg": "'dict object' has no attribute 'slurp_client_keys'"}
2018-08-20 08:29:10,730 p=27101 u=mistral | fatal: [192.168.120.13]: FAILED! => {"msg": "'dict object' has no attribute 'slurp_client_keys'"}
2018-08-20 08:29:10,732 p=27101 u=mistral | fatal: [192.168.120.19]: FAILED! => {"msg": "'dict object' has no attribute 'slurp_client_keys'"}

More logs below.

openstack stack deploy logs
---------------------------
2018-08-20 12:06:21Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1.4]: SIGNAL_IN_PROGRESS Signal: deployment f376b56b-4a7b-4542-8254-6ec8fbb4b6fa succeeded
2018-08-20 12:06:21Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1.2]: SIGNAL_IN_PROGRESS Signal: deployment 6df7dd1d-ba64-40dc-b4a8-62778b3543b4 succeeded
2018-08-20 12:06:22Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1.4]: CREATE_COMPLETE state changed
2018-08-20 12:06:22Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1.2]: CREATE_COMPLETE state changed
2018-08-20 12:06:25Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1.1]: SIGNAL_IN_PROGRESS Signal: deployment b22c508e-3429-4186-bf5e-749d99e4259d succeeded
2018-08-20 12:06:25Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1.1]: CREATE_COMPLETE state changed
2018-08-20 12:06:30Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1.0]: SIGNAL_IN_PROGRESS Signal: deployment 04197e90-4fbf-4cfc-b1f0-9d89e9c6d290 succeeded
2018-08-20 12:06:31Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1.0]: CREATE_COMPLETE state changed
2018-08-20 12:06:41Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1.3]: SIGNAL_IN_PROGRESS Signal: deployment f11e36f1-781a-4ce1-bf42-b7e7faf32b99 succeeded
2018-08-20 12:06:42Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1.3]: CREATE_COMPLETE state changed
2018-08-20 12:06:42Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1]: CREATE_COMPLETE Stack CREATE completed successfully
2018-08-20 12:06:43Z [overcloud.AllNodesDeploySteps.ComputeHCIDeployment_Step1]: CREATE_COMPLETE state changed
2018-08-20 12:14:05Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0]: SIGNAL_IN_PROGRESS Signal: deployment c3c0bdd9-d5a9-41eb-b5a2-291ae2a2cab2 succeeded
2018-08-20 12:14:06Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0]: CREATE_COMPLETE state changed
2018-08-20 12:14:06Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step1]: CREATE_COMPLETE Stack CREATE completed successfully
2018-08-20 12:14:06Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step1]: CREATE_COMPLETE state changed
2018-08-20 12:14:07Z [overcloud.AllNodesDeploySteps.WorkflowTasks_Step2]: CREATE_IN_PROGRESS state changed
2018-08-20 12:14:08Z [overcloud.AllNodesDeploySteps.WorkflowTasks_Step2]: CREATE_COMPLETE state changed
2018-08-20 12:14:09Z [overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution]: CREATE_IN_PROGRESS state changed
2018-08-20 12:29:17Z [overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution]: CREATE_FAILED resources.WorkflowTasks_Step2_Execution: ERROR
2018-08-20 12:29:17Z [overcloud.AllNodesDeploySteps]: CREATE_FAILED Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
2018-08-20 12:29:17Z [overcloud.AllNodesDeploySteps]: CREATE_FAILED resources.AllNodesDeploySteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR
2018-08-20 12:29:18Z [overcloud]: CREATE_FAILED Resource CREATE failed: resources.AllNodesDeploySteps: Resource CREATE failed: resources.WorkflowTasks_Step2_Execution: ERROR

 Stack overcloud CREATE_FAILED

overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::TripleO::WorkflowSteps
  physical_resource_id: 5a971867-190a-4ab4-8eb7-2e0c50d836b9
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR
Heat Stack create failed.
Heat Stack create failed.
[root@refarch-r220-02 tmp]#

/var/log/mistral/ceph-install-workflow.log
-------------------------------------------
2018-08-20 08:29:10,365 p=27101 u=mistral | TASK [ceph-client : list existing pool(s)] *************************************
2018-08-20 08:29:10,365 p=27101 u=mistral | Monday 20 August 2018 08:29:10 -0400 (0:00:00.169) 0:13:53.735 *********
2018-08-20 08:29:10,444 p=27101 u=mistral | TASK [ceph-client : create ceph pool(s)] ***************************************
2018-08-20 08:29:10,445 p=27101 u=mistral | Monday 20 August 2018 08:29:10 -0400 (0:00:00.079) 0:13:53.814 *********
2018-08-20 08:29:10,527 p=27101 u=mistral | TASK [ceph-client : kill a dummy container that created pool(s)/key(s)] ********
2018-08-20 08:29:10,528 p=27101 u=mistral | Monday 20 August 2018 08:29:10 -0400 (0:00:00.082) 0:13:53.897 *********
2018-08-20 08:29:10,574 p=27101 u=mistral | skipping: [192.168.120.13] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-08-20 08:29:10,587 p=27101 u=mistral | skipping: [192.168.120.20] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-08-20 08:29:10,598 p=27101 u=mistral | skipping: [192.168.120.19] => {"changed": false, "skip_reason": "Conditional result was False"}
2018-08-20 08:29:10,611 p=27101 u=mistral | TASK [ceph-client : get client cephx keys] *************************************
2018-08-20 08:29:10,611 p=27101 u=mistral | Monday 20 August 2018 08:29:10 -0400 (0:00:00.083) 0:13:53.981 *********
2018-08-20 08:29:10,729 p=27101 u=mistral | fatal: [192.168.120.20]: FAILED! => {"msg": "'dict object' has no attribute 'slurp_client_keys'"}
2018-08-20 08:29:10,730 p=27101 u=mistral | fatal: [192.168.120.13]: FAILED! => {"msg": "'dict object' has no attribute 'slurp_client_keys'"}
2018-08-20 08:29:10,732 p=27101 u=mistral | fatal: [192.168.120.19]: FAILED! => {"msg": "'dict object' has no attribute 'slurp_client_keys'"}
2018-08-20 08:29:10,734 p=27101 u=mistral | PLAY RECAP *********************************************************************
2018-08-20 08:29:10,734 p=27101 u=mistral | 192.168.120.10 : ok=121 changed=20 unreachable=0 failed=0
2018-08-20 08:29:10,734 p=27101 u=mistral | 192.168.120.13 : ok=104 changed=13 unreachable=0 failed=1
2018-08-20 08:29:10,734 p=27101 u=mistral | 192.168.120.19 : ok=104 changed=13 unreachable=0 failed=1
2018-08-20 08:29:10,734 p=27101 u=mistral | 192.168.120.20 : ok=104 changed=13 unreachable=0 failed=1
2018-08-20 08:29:10,734 p=27101 u=mistral | 192.168.120.7 : ok=114 changed=13 unreachable=0 failed=1
2018-08-20 08:29:10,734 p=27101 u=mistral | 192.168.120.8 : ok=67 changed=12 unreachable=0 failed=1
2018-08-20 08:29:10,735 p=27101 u=mistral | INSTALLER STATUS ***************************************************************
2018-08-20 08:29:10,763 p=27101 u=mistral | Install Ceph Monitor : Complete (0:01:20)
2018-08-20 08:29:10,764 p=27101 u=mistral | Install Ceph Manager : Complete (0:00:29)
2018-08-20 08:29:10,764 p=27101 u=mistral | Install Ceph OSD : Complete (0:11:02)
2018-08-20 08:29:10,764 p=27101 u=mistral | Install Ceph Client : In Progress (0:00:41)
2018-08-20 08:29:10,764 p=27101 u=mistral | This phase can be restarted by running: roles/ceph-client/tasks/main.yml
2018-08-20 08:29:10,764 p=27101 u=mistral | Monday 20 August 2018 08:29:10 -0400 (0:00:00.153) 0:13:54.134 *********
2018-08-20 08:29:10,764 p=27101 u=mistral | ===============================================================================

Version-Release number of selected component (if applicable):

[root@refarch-r220-02 tmp]# rpm -qa | egrep -i "openstack|ceph"
openstack-nova-placement-api-17.0.3-0.20180420001141.el7ost.noarch
puppet-openstacklib-12.4.0-0.20180329042555.4b30e6f.el7ost.noarch
ceph-ansible-3.1.0-0.1.rc9.el7cp.noarch
openstack-glance-16.0.1-2.el7ost.noarch
openstack-tripleo-common-containers-8.6.1-23.el7ost.noarch
openstack-heat-api-10.0.1-0.20180411125640.el7ost.noarch
python2-openstackclient-3.14.1-1.el7ost.noarch
openstack-tempest-18.0.0-2.el7ost.noarch
openstack-mistral-api-6.0.2-1.el7ost.noarch
openstack-zaqar-6.0.1-1.el7ost.noarch
openstack-swift-container-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-neutron-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-ironic-api-10.1.2-4.el7ost.noarch
openstack-tripleo-ui-8.3.1-3.el7ost.noarch
python-openstackclient-lang-3.14.1-1.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.0-2.el7ost.noarch
openstack-nova-scheduler-17.0.3-0.20180420001141.el7ost.noarch
openstack-swift-object-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-neutron-common-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-neutron-ml2-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-engine-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-common-10.1.2-4.el7ost.noarch
openstack-mistral-executor-6.0.2-1.el7ost.noarch
puppet-ceph-2.5.0-1.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-43.el7ost.noarch
openstack-selinux-0.8.14-12.el7ost.noarch
python2-openstacksdk-0.11.3-1.el7ost.noarch
openstack-nova-api-17.0.3-0.20180420001141.el7ost.noarch
openstack-nova-compute-17.0.3-0.20180420001141.el7ost.noarch
openstack-keystone-13.0.1-0.20180420194847.7bd6454.el7ost.noarch
openstack-nova-common-17.0.3-0.20180420001141.el7ost.noarch
openstack-neutron-openvswitch-12.0.2-0.20180421011364.0ec54fd.el7ost.noarch
openstack-heat-api-cfn-10.0.1-0.20180411125640.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-4.el7ost.noarch
openstack-ironic-inspector-7.2.1-0.20180409163360.el7ost.noarch
openstack-mistral-engine-6.0.2-1.el7ost.noarch
puppet-openstack_extras-12.4.1-0.20180413042250.2634296.el7ost.noarch
openstack-swift-account-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-tripleo-common-8.6.1-23.el7ost.noarch
openstack-swift-proxy-2.17.1-0.20180314165245.caeeb54.el7ost.noarch
openstack-ironic-conductor-10.1.2-4.el7ost.noarch
openstack-tripleo-validations-8.4.1-5.el7ost.noarch
openstack-nova-conductor-17.0.3-0.20180420001141.el7ost.noarch
openstack-tripleo-image-elements-8.0.1-1.el7ost.noarch
openstack-heat-common-10.0.1-0.20180411125640.el7ost.noarch
openstack-mistral-common-6.0.2-1.el7ost.noarch
[root@refarch-r220-02 tmp]#

How reproducible:
100% of the time

Steps to Reproduce:
1.
2.
3.

Actual results:
Stack creation failed at the very last stage.

Expected results:
HCI stack creation should be successful.

Additional info:
I reproduced this using ceph-ansible-3.1.0-0.1.rc21.el7cp.noarch and rhceph/rhceph-3-rhel7:3-12 (fa3b551f0952).

Full ceph-ansible output: http://sprunge.us/oawDbF

So why does it fail on this task:

https://github.com/ceph/ceph-ansible/blob/v3.1.0rc21/roles/ceph-client/tasks/create_users_keys.yml#L62-L72

I observed during the deployment that all 60 OSDs were created but that none of them were brought up or in [1]. On investigating I saw the OSDs were flapping. Re-running ceph-ansible results in a failure because of missing containers [2] before it can fail on the "slurp client cephx key(s)" task. Maybe the root cause is that the containers never got in?

[1]
[root@controller-0 ~]# ceph -s
  cluster:
    id:     4ad54812-a703-11e8-916e-2047478ccfaa
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum controller-0
    mgr: controller-0(active)
    osd: 60 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:

[root@controller-0 ~]#

[2]
TASK [ceph-docker-common : inspect ceph osd container] ***************************************************
Thursday 23 August 2018 15:46:42 -0400 (0:00:00.160) 0:01:53.832 *******
fatal: [192.168.120.17]: FAILED! => {"changed": false, "cmd": ["docker", "inspect", "c7fcbec7e947", "97f2c38eeabc", "e656eb8666b7"], "delta": "0:00:00.036323", "end": "2018-08-23 19:46:43.454781", "msg": "non-zero return code", "rc": 1, "start": "2018-08-23 19:46:43.418458", "stderr": "Error: No such object: c7fcbec7e947", "stderr_lines": ["Error: No such object: c7fcbec7e947"], "stdout": "[]", "stdout_lines": ["[]"]}
...
fatal: [192.168.120.6]: FAILED! => {"changed": false, "cmd": ["docker", "inspect", "eab7f7f02601", "76089d529f46", "d19b177dd469"], "delta": "0:00:00.036426", "end": "2018-08-23 19:46:43.693686", "msg": "non-zero return code", "rc": 1, "start": "2018-08-23 19:46:43.657260", "stderr": "Error: No such object: eab7f7f02601", "stderr_lines": ["Error: No such object: eab7f7f02601"], "stdout": "[]", "stdout_lines": ["[]"]}

PLAY RECAP ***********************************************************************************************
192.168.120.11 : ok=112 changed=12 unreachable=0 failed=0
192.168.120.15 : ok=33 changed=0 unreachable=0 failed=1
192.168.120.16 : ok=33 changed=0 unreachable=0 failed=1
192.168.120.17 : ok=36 changed=0 unreachable=0 failed=1
192.168.120.6 : ok=33 changed=0 unreachable=0 failed=1
192.168.120.9 : ok=33 changed=0 unreachable=0 failed=1
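The ceph -s output in [1] reports HEALTH_OK even though no OSD is up or in, so the health field alone would not have flagged this state. A minimal Python sketch of a check on the osd summary line follows. The function name osd_counts and the regular expression are illustrative assumptions based only on the output format quoted above; they are not part of ceph-ansible or TripleO, and other Ceph releases may print this line differently.

```python
import re

# Assumed format, taken from the `ceph -s` output quoted above:
#   osd: 60 osds: 0 up, 0 in
OSD_LINE = re.compile(r"osd:\s*(\d+)\s+osds:\s*(\d+)\s+up,\s*(\d+)\s+in")


def osd_counts(status_text):
    """Return (total, up, in) parsed from `ceph -s` text, or None if absent."""
    match = OSD_LINE.search(status_text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


# Sample line copied from the output in [1].
sample = "mgr: controller-0(active)\nosd: 60 osds: 0 up, 0 in\n"
total, up, in_count = osd_counts(sample)
if up < total or in_count < total:
    print(f"only {up}/{total} OSDs up and {in_count}/{total} in")
```

Running a check like this between the OSD and client phases would surface the problem before the ceph-client tasks are reached.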
The error:

2018-08-24 11:05:36.950684 7fcca1e67d80 -1 unable to find any IP address in networks '172.17.4.0/24' interfaces ''

The ceph.conf is misconfigured: that network does not exist on the box.

[root@osd-compute-4 ~]# ip a | grep 172.17.4
[root@osd-compute-4 ~]# ip a | grep 172.17.3
    inet 172.17.3.225/24 brd 172.17.3.255 scope global vlan170

Only .3 is present.

Closing this, feel free to re-open if you have any more concerns.
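The same comparison can be sketched in Python with the standard ipaddress module: given the addresses configured on a host, check whether any of them falls inside the network that the OSD is told to use. The helper name addresses_in_network is a hypothetical one introduced for illustration; the address and networks are the ones shown above.

```python
import ipaddress


def addresses_in_network(interface_addrs, network):
    """Return the interface addresses that fall inside the given CIDR network."""
    net = ipaddress.ip_network(network)
    return [
        addr for addr in interface_addrs
        if ipaddress.ip_interface(addr).ip in net
    ]


# Address reported by `ip a` on osd-compute-4 in the comment above.
host_addrs = ["172.17.3.225/24"]

# Network named in the OSD error message: nothing on the host matches it.
print(addresses_in_network(host_addrs, "172.17.4.0/24"))
# The 172.17.3.0/24 network is the one the host actually has an address in.
print(addresses_in_network(host_addrs, "172.17.3.0/24"))
```

An empty result for the configured network corresponds to the "unable to find any IP address in networks" message.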
Root cause: network misconfiguration; the OSDs were not connected to the StorageMgmt network.

The network-environment.yaml file was updated with something like the following to fix it:

right:
  OS::TripleO::ComputeHCI::Ports::StorageMgmtPort: network/ports/storage_mgmt_from_pool.yaml
wrong:
  OS::TripleO::ComputeHCI::Ports::StorageMgmtPort: network/ports/noop.yaml

Full diff below:

(undercloud) [stack@refarch-r220-02 templates]$ diff -u network-environment.yaml~ network-environment.yaml
--- network-environment.yaml~ 2018-08-20 06:22:41.350436109 -0400
+++ network-environment.yaml 2018-08-24 08:39:46.249534353 -0400
@@ -5,13 +5,13 @@
   OS::TripleO::Controller::Ports::ExternalPort: /usr/share/openstack-tripleo-heat-templates/network/ports/external_from_pool.yaml
   OS::TripleO::Controller::Ports::InternalApiPort: /usr/share/openstack-tripleo-heat-templates/network/ports/internal_api_from_pool.yaml
   OS::TripleO::Controller::Ports::StoragePort: /usr/share/openstack-tripleo-heat-templates/network/ports/storage_from_pool.yaml
-  OS::TripleO::Controller::Ports::StorageMgmtPort: /usr/share/openstack-tripleo-heat-templates/network/ports/storage_mgmt_from_pool.yaml
+  OS::TripleO::Controller::Ports::StorageMgmtPort: /usr/share/openstack-tripleo-heat-templates/network/ports/noop.yaml
   OS::TripleO::Controller::Ports::TenantPort: /usr/share/openstack-tripleo-heat-templates/network/ports/tenant_from_pool.yaml
   OS::TripleO::ComputeHCI::Ports::ExternalPort: /usr/share/openstack-tripleo-heat-templates/network/ports/noop.yaml
   OS::TripleO::ComputeHCI::Ports::InternalApiPort: /usr/share/openstack-tripleo-heat-templates/network/ports/internal_api_from_pool.yaml
   OS::TripleO::ComputeHCI::Ports::StoragePort: /usr/share/openstack-tripleo-heat-templates/network/ports/storage_from_pool.yaml
-  OS::TripleO::ComputeHCI::Ports::StorageMgmtPort: /usr/share/openstack-tripleo-heat-templates/network/ports/noop.yaml
+  OS::TripleO::ComputeHCI::Ports::StorageMgmtPort: /usr/share/openstack-tripleo-heat-templates/network/ports/storage_mgmt_from_pool.yaml
   OS::TripleO::ComputeHCI::Ports::TenantPort: /usr/share/openstack-tripleo-heat-templates/network/ports/tenant_from_pool.yaml
 parameter_defaults:
(undercloud) [stack@refarch-r220-02 templates]$
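A mapping like the "wrong" one above could be caught before deploying with a small pre-flight check. The sketch below is only an illustration: the function roles_without_storage_mgmt is hypothetical, it scans the file as plain text rather than parsing YAML, and which roles host OSDs is an assumption the caller has to supply.

```python
def roles_without_storage_mgmt(env_text, roles):
    """Return the roles whose StorageMgmtPort is mapped to noop.yaml."""
    flagged = []
    for line in env_text.splitlines():
        # Split on the first ": " so the "::" inside the resource name survives.
        key, sep, value = line.strip().partition(": ")
        if not sep:
            continue
        for role in roles:
            wanted = f"OS::TripleO::{role}::Ports::StorageMgmtPort"
            if key == wanted and value.strip().endswith("noop.yaml"):
                flagged.append(role)
    return flagged


# The two ComputeHCI mappings from the diff above, before and after the fix.
before = (
    "  OS::TripleO::ComputeHCI::Ports::StorageMgmtPort: "
    "/usr/share/openstack-tripleo-heat-templates/network/ports/noop.yaml\n"
)
after = (
    "  OS::TripleO::ComputeHCI::Ports::StorageMgmtPort: "
    "/usr/share/openstack-tripleo-heat-templates/network/ports/storage_mgmt_from_pool.yaml\n"
)

# Assumption: ComputeHCI is the only role that runs OSDs in this deployment.
print(roles_without_storage_mgmt(before, ["ComputeHCI"]))
print(roles_without_storage_mgmt(after, ["ComputeHCI"]))
```

Only roles passed in the list are checked, so a role that legitimately uses noop.yaml for this port is not reported unless it is named explicitly.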
John / Seb: Thanks for your help so far on this BZ, but unfortunately the overcloud deployment never completed successfully. I am still getting errors related to the ceph_install tasks (however, there are no errors in the ceph-mistral logs and ceph -s is fine too). I have created a new BZ, 1623417, for that.
should we clo