Bug 1553196 - osp12 deployment with external ceph fails
Summary: osp12 deployment with external ceph fails
Keywords:
Status: CLOSED DUPLICATE of bug 1552327
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ceph
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: John Fulton
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-03-08 13:35 UTC by pkomarov
Modified: 2022-03-13 15:00 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-13 23:09:02 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker OSP-13575 (last updated 2022-03-13 15:00:04 UTC)

Description pkomarov 2018-03-08 13:35:21 UTC
Description of problem:

OSP12 deployment with external Ceph fails on a baremetal environment with 3 controllers and 2 computes.

The same deployment succeeds without the external Ceph.

Version-Release number of selected component (if applicable):

$ rhos-release -L
Installed repositories (rhel-7.4):
  12
  ceph-2
  ceph-osd-2
  rhel-7.4

Steps to Reproduce:
1. Deploy OSP12 on baremetal with an external Ceph cluster (see the sketch below).
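
The exact deploy command and environment files used are in the linked Google Drive folder below; the following is only an illustrative sketch, assuming the standard tripleo ceph-ansible external environment file and placeholder values for the pre-existing cluster:

$ openstack overcloud deploy --templates \
    -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible-external.yaml \
    -e ceph-external-params.yaml

where ceph-external-params.yaml contains something like:

parameter_defaults:
  # Placeholder values describing the existing external Ceph cluster
  CephClusterFSID: '11111111-2222-3333-4444-555555555555'
  CephClientKey: '<client key of the external cluster>'
  CephExternalMonHost: '10.0.0.1,10.0.0.2,10.0.0.3'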


Actual results:

Deployment fails with the following message in overcloud_install.log:

 Stack overcloud CREATE_FAILED

overcloud.AllNodesDeploySteps.WorkflowTasks_Step2_Execution:
  resource_type: OS::Mistral::ExternalResource
  physical_resource_id: 548f9fc0-9ee3-4853-b155-80f25f6a93df
  status: CREATE_FAILED
  status_reason: |
    resources.WorkflowTasks_Step2_Execution: ERROR


Additional info:

From mistral/ceph-install-workflow.log

2018-03-07 19:11:23,896 p=13129 u=mistral |  TASK [ceph-defaults : set_fact docker_exec_cmd] ********************************
2018-03-07 19:11:23,932 p=13129 u=mistral |  fatal: [192.168.24.11]: FAILED! => {"msg": "list object has no element 0"}
2018-03-07 19:11:23,963 p=13129 u=mistral |  fatal: [192.168.24.8]: FAILED! => {"msg": "list object has no element 0"}
2018-03-07 19:11:23,992 p=13129 u=mistral |  fatal: [192.168.24.12]: FAILED! => {"msg": "list object has no element 0"}
2018-03-07 19:11:24,025 p=13129 u=mistral |  fatal: [192.168.24.15]: FAILED! => {"msg": "list object has no element 0"}
2018-03-07 19:11:24,039 p=13129 u=mistral |  fatal: [192.168.24.7]: FAILED! => {"msg": "list object has no element 0"}
2018-03-07 19:11:24,041 p=13129 u=mistral |  PLAY RECAP *********************************************************************



$ mistral execution-list |grep -v SUCCESS
+--------------------------------------+--------------------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+------------------------------+---------------------+---------------------+
| ID                                   | Workflow ID                          | Workflow name                                                          | Description                                                                                                                                                                                                                       | Task Execution ID                    | State   | State info                   | Created at          | Updated at          |
+--------------------------------------+--------------------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+------------------------------+---------------------+---------------------+
| 0a25644c-3384-41bc-b49e-eebe0a3a8b9d | 3258fa8f-fa27-4f07-bcbc-01c68ffb28d8 | tripleo.baremetal.v1.cellv2_discovery                                  | sub-workflow execution                                                                                                                                                                                                            | 4b8fd697-d07b-4be5-af65-bec9267a775a | ERROR   | None                         | 2018-03-07 15:11:39 | 2018-03-07 15:11:49 |
| 548f9fc0-9ee3-4853-b155-80f25f6a93df | 9a67b540-1761-493d-93a5-79a6fe4fcb2a | tripleo.overcloud.workflow_tasks.step2                                 | Heat managed                                                                                                                                                                                                                      | <none>                               | ERROR   | Failure caused by error i... | 2018-03-07 17:09:25 | 2018-03-07 17:11:27 |
| 7e715b7f-4bba-4e77-8b72-260b7517b5ee | 799d3307-76a5-4982-9568-d9cba8fde8cb | tripleo.storage.v1.ceph-install                                        | sub-workflow execution                                                                                                                                                                                                            | 27d1e34d-03c0-4354-a5ac-8558c66f331e | ERROR   | Failure caused by error i... | 2018-03-07 17:09:26 | 2018-03-07 17:11:25 |

SOS reports from the undercloud, controller, and compute nodes, along with the deployment files used, are available at the following link:

https://drive.google.com/drive/folders/14SGfFF9NDVGxEB7CIuYX2BWwkmwEdnm3?usp=sharing

Comment 1 John Fulton 2018-03-14 14:57:41 UTC
Your overcloud controller node seems to have more than Ceph problems; for example, all of the containers are down except memcached [0].

Per ceph-install-workflow.log [1], the deployment failed on the following task (confirmed against ceph-ansible-3.0.26-1.el7cp):

 https://github.com/ceph/ceph-ansible/blob/v3.0.26/roles/ceph-defaults/tasks/facts.yml#L14-L19

It may be that the following Ansible variable could not be resolved:

 hostvars[groups[mon_group_name][0]]['ansible_hostname']
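
For reference, a minimal sketch of the failing lookup pattern (not the actual ceph-ansible task; it assumes an inventory where the monitor group named by mon_group_name, e.g. 'mons', exists but has no members, as it would for an external cluster):

# inventory: the mons group is defined but empty
[mons]

# repro.yml
- hosts: localhost
  gather_facts: false
  tasks:
    - name: set_fact docker_exec_cmd (sketch of the failing lookup)
      set_fact:
        # groups['mons'][0] on an empty group raises
        # "list object has no element 0" while templating
        docker_exec_cmd: "docker exec ceph-mon-{{ hostvars[groups['mons'][0]]['ansible_hostname'] }}"

Running 'ansible-playbook -i inventory repro.yml' produces the same error message seen above, which is why an empty or unresolved mon group is the suspect here.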

[fultonj@skagra sosreport-pkomarov-20180308090054]$ cat hostname 
controller-0
[fultonj@skagra sosreport-pkomarov-20180308090054]$ 


Please re-run the deployment, adding '-e debug.yaml' to your 'openstack overcloud deploy ...' command, where debug.yaml contains the following:

parameter_defaults:
  CephAnsiblePlaybookVerbosity: 3
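
For example (only the last -e is new; everything else stays exactly as you already run it):

$ openstack overcloud deploy --templates \
    -e <your existing environment files> \
    -e debug.yaml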

Then, after the deployment runs, update this bugzilla with:

- /var/log/mistral/ceph-install-workflow.log from your undercloud
- a tarball containing /tmp/ansible-mistral-action* from your undercloud
- the exact 'openstack overcloud deploy ...' command you ran
- the output of `ansible -m setup localhost` when run on your overcloud controller

Thanks,
  John

[0] All containers died on the overcloud controller node except memcached:

[fultonj@skagra docker]$ cat docker_ps_-a 
CONTAINER ID        IMAGE                                                        COMMAND                  CREATED             STATUS                    PORTS               NAMES
77a8fd42873d        192.168.24.1:8787/rhosp12/openstack-mariadb:2018-02-27.4     "/bin/bash -c '/usr/b"   15 hours ago        Exited (0) 15 hours ago                       mysql_image_tag
35eb34d3a61f        192.168.24.1:8787/rhosp12/openstack-memcached:2018-02-27.4   "/bin/bash -c 'source"   15 hours ago        Up 15 hours                                   memcached
2d7ddbf5f9bd        192.168.24.1:8787/rhosp12/openstack-haproxy:2018-02-27.4     "/bin/bash -c '/usr/b"   15 hours ago        Exited (0) 15 hours ago                       haproxy_image_tag
05936166dd67        192.168.24.1:8787/rhosp12/openstack-mariadb:2018-02-27.4     "bash -ecx 'if [ -e /"   15 hours ago        Exited (0) 15 hours ago                       mysql_bootstrap
81a520488572        192.168.24.1:8787/rhosp12/openstack-redis:2018-02-27.4       "/bin/bash -c '/usr/b"   15 hours ago        Exited (0) 15 hours ago                       redis_image_tag
1783e86df69e        192.168.24.1:8787/rhosp12/openstack-rabbitmq:2018-02-27.4    "/bin/bash -c '/usr/b"   15 hours ago        Exited (0) 15 hours ago                       rabbitmq_image_tag
1bfc375b1c1b        192.168.24.1:8787/rhosp12/openstack-rabbitmq:2018-02-27.4    "kolla_start"            15 hours ago        Exited (0) 15 hours ago                       rabbitmq_bootstrap
d0aa0a689d67        192.168.24.1:8787/rhosp12/openstack-memcached:2018-02-27.4   "/bin/bash -c 'source"   15 hours ago        Exited (0) 15 hours ago                       memcached_init_logs
d77cec9718ef        192.168.24.1:8787/rhosp12/openstack-mariadb:2018-02-27.4     "chown -R mysql: /var"   15 hours ago        Exited (0) 15 hours ago                       mysql_data_ownership
[fultonj@skagra docker]$ 


[1] 

[fultonj@skagra mistral]$ tail -30 ceph-install-workflow.log 
2018-03-07 19:11:23,334 p=13129 u=mistral |  TASK [ceph-defaults : remove ceph nfs ganesha socket if exists and not used by a process] ***
2018-03-07 19:11:23,360 p=13129 u=mistral |  skipping: [192.168.24.11]
2018-03-07 19:11:23,405 p=13129 u=mistral |  skipping: [192.168.24.8]
2018-03-07 19:11:23,428 p=13129 u=mistral |  skipping: [192.168.24.12]
2018-03-07 19:11:23,429 p=13129 u=mistral |  skipping: [192.168.24.15]
2018-03-07 19:11:23,446 p=13129 u=mistral |  skipping: [192.168.24.7]
2018-03-07 19:11:23,478 p=13129 u=mistral |  TASK [ceph-defaults : set_fact monitor_name ansible_hostname] ******************
2018-03-07 19:11:23,654 p=13129 u=mistral |  ok: [192.168.24.11]
2018-03-07 19:11:23,683 p=13129 u=mistral |  ok: [192.168.24.8]
2018-03-07 19:11:23,705 p=13129 u=mistral |  ok: [192.168.24.12]
2018-03-07 19:11:23,736 p=13129 u=mistral |  ok: [192.168.24.15]
2018-03-07 19:11:23,751 p=13129 u=mistral |  ok: [192.168.24.7]
2018-03-07 19:11:23,767 p=13129 u=mistral |  TASK [ceph-defaults : set_fact monitor_name ansible_fqdn] **********************
2018-03-07 19:11:23,793 p=13129 u=mistral |  skipping: [192.168.24.11]
2018-03-07 19:11:23,815 p=13129 u=mistral |  skipping: [192.168.24.8]
2018-03-07 19:11:23,836 p=13129 u=mistral |  skipping: [192.168.24.12]
2018-03-07 19:11:23,859 p=13129 u=mistral |  skipping: [192.168.24.15]
2018-03-07 19:11:23,872 p=13129 u=mistral |  skipping: [192.168.24.7]
2018-03-07 19:11:23,896 p=13129 u=mistral |  TASK [ceph-defaults : set_fact docker_exec_cmd] ********************************
2018-03-07 19:11:23,932 p=13129 u=mistral |  fatal: [192.168.24.11]: FAILED! => {"msg": "list object has no element 0"}
2018-03-07 19:11:23,963 p=13129 u=mistral |  fatal: [192.168.24.8]: FAILED! => {"msg": "list object has no element 0"}
2018-03-07 19:11:23,992 p=13129 u=mistral |  fatal: [192.168.24.12]: FAILED! => {"msg": "list object has no element 0"}
2018-03-07 19:11:24,025 p=13129 u=mistral |  fatal: [192.168.24.15]: FAILED! => {"msg": "list object has no element 0"}
2018-03-07 19:11:24,039 p=13129 u=mistral |  fatal: [192.168.24.7]: FAILED! => {"msg": "list object has no element 0"}
2018-03-07 19:11:24,041 p=13129 u=mistral |  PLAY RECAP *********************************************************************
2018-03-07 19:11:24,041 p=13129 u=mistral |  192.168.24.11              : ok=2    changed=0    unreachable=0    failed=1   
2018-03-07 19:11:24,041 p=13129 u=mistral |  192.168.24.12              : ok=2    changed=0    unreachable=0    failed=1   
2018-03-07 19:11:24,041 p=13129 u=mistral |  192.168.24.15              : ok=2    changed=0    unreachable=0    failed=1   
2018-03-07 19:11:24,042 p=13129 u=mistral |  192.168.24.7               : ok=2    changed=0    unreachable=0    failed=1   
2018-03-07 19:11:24,042 p=13129 u=mistral |  192.168.24.8               : ok=2    changed=0    unreachable=0    failed=1   
[fultonj@skagra mistral]$

Comment 2 John Fulton 2018-03-21 14:08:10 UTC
I have not received the needinfo requested two weeks ago. It looks like a local environment issue, but I asked for that info to be sure. Closing for now; re-open if you have the requested data or can reproduce the issue and provide it.

Comment 7 Giulio Fidente 2018-08-13 23:09:02 UTC

*** This bug has been marked as a duplicate of bug 1552327 ***

Comment 8 Giulio Fidente 2018-08-13 23:14:53 UTC
Until a version of ceph-ansible > 3.0.29 becomes available, the workaround is to deploy using environments/puppet-ceph-external.yaml.
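
For example, swapping the ceph-ansible external environment file for the puppet one in the deploy command (file names other than puppet-ceph-external.yaml are placeholders for whatever you already pass):

$ openstack overcloud deploy --templates \
    -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-ceph-external.yaml \
    -e ceph-external-params.yaml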

