Created attachment 1485633 [details]
/home/stack/overcloud_install.log

Description of problem:

Overcloud deploy is still stuck 100 minutes after deployment started, at which point it is killed by "--timeout 100".

Last entries in /home/stack/overcloud_install.log are:

...
ok: [ceph-0] => {
    "failed_when_result": false,
    "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
        "stdout: ce811316fb3f82468d810fa1199e2d430376ee3e6307b844a4bd270b45b26442",
        "",
        "stderr: "
    ]
}

TASK [Check if /var/lib/docker-puppet/docker-puppet-tasks4.json exists] ********
Friday 21 September 2018 08:43:44 -0400 (0:00:00.217) 0:27:06.634 ******
ok: [ceph-0] => {"changed": false, "stat": {"exists": false}}
ok: [compute-0] => {"changed": false, "stat": {"exists": false}}
ok: [controller-0] => {"changed": false, "stat": {"atime": 1537532467.8883862, "attr_flags": "", "attributes": [], "block_size": 4096, "blocks": 8, "charset": "us-ascii", "checksum": "fc092b06b8d6fdc6d18320ab604ab9a6ebc1e1ae", "ctime": 1537532467.8933864, "dev": 64514, "device_type": 0, "executable": false, "exists": true, "gid": 0, "gr_name": "root", "inode": 41943523, "isblk": false, "ischr": false, "isdir": false, "isfifo": false, "isgid": false, "islnk": false, "isreg": true, "issock": false, "isuid": false, "mimetype": "text/plain", "mode": "0600", "mtime": 1537532467.7203863, "nlink": 1, "path": "/var/lib/docker-puppet/docker-puppet-tasks4.json", "pw_name": "root", "readable": true, "rgrp": false, "roth": false, "rusr": true, "size": 321, "uid": 0, "version": "18446744073615036390", "wgrp": false, "woth": false, "writeable": true, "wusr": true, "xgrp": false, "xoth": false, "xusr": false}}

TASK [Run docker-puppet tasks (bootstrap tasks) for step 4] ********************
Friday 21 September 2018 08:43:44 -0400 (0:00:00.456) 0:27:07.090 ******
Connection is already closed.
skipping: [compute-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
skipping: [ceph-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
ok: [controller-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}

Last entries in /var/lib/mistral/overcloud/ansible.log are:

...
2018-09-21 08:44:33,941 p=891 u=mistral | TASK [Run docker-puppet tasks (generate config) during step 5] *****************
2018-09-21 08:44:33,941 p=891 u=mistral | Friday 21 September 2018 08:44:33 -0400 (0:00:00.162) 0:27:56.428 ******
2018-09-21 08:44:33,977 p=891 u=mistral | skipping: [controller-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2018-09-21 08:44:34,012 p=891 u=mistral | skipping: [compute-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2018-09-21 08:44:34,028 p=891 u=mistral | skipping: [ceph-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2018-09-21 08:44:34,058 p=891 u=mistral | TASK [Debug output for task which failed: Run docker-puppet tasks (generate config) during step 5] ***
2018-09-21 08:44:34,058 p=891 u=mistral | Friday 21 September 2018 08:44:34 -0400 (0:00:00.116) 0:27:56.544 ******
2018-09-21 08:44:34,094 p=891 u=mistral | skipping: [controller-0] => {}
2018-09-21 08:44:34,129 p=891 u=mistral | skipping: [compute-0] => {}
2018-09-21 08:44:34,145 p=891 u=mistral | skipping: [ceph-0] => {}
2018-09-21 08:44:34,175 p=891 u=mistral | TASK [Start containers for step 5] *********************************************
2018-09-21 08:44:34,176 p=891 u=mistral | Friday 21 September 2018 08:44:34 -0400 (0:00:00.117) 0:27:56.662 ******
2018-09-21 08:44:34,756 p=891 u=mistral | ok: [compute-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2018-09-21 08:44:34,786 p=891 u=mistral | ok: [ceph-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}

Version-Release number of selected component (if applicable):

OSP14, puddle 2018-09-20.1, topology 1uc:1contr:1comp:1ceph

How reproducible:

100%

Steps to Reproduce:
1. Deploy the mentioned topology using InfraRed
2. OC deployment times out

Additional info:

heat-cfntools.noarch                          1.3.0-2.el7ost
openstack-heat-agents.noarch                  1.7.1-0.20180907213355.476aae2.el7ost
openstack-heat-api.noarch                     1:12.0.0-0.20180604085325.7d878a8.el7ost
openstack-heat-common.noarch                  1:12.0.0-0.20180604085325.7d878a8.el7ost
openstack-heat-engine.noarch                  1:12.0.0-0.20180604085325.7d878a8.el7ost
openstack-heat-monolith.noarch                1:12.0.0-0.20180604085325.7d878a8.el7ost
openstack-tripleo-heat-templates.noarch       9.0.0-0.20180906145841.66804ff.0rc1.0rc1.el7ost
puppet-heat.noarch                            13.3.1-0.20180831195745.28088f9.el7ost
python-tripleoclient-heat-installer.noarch    10.5.1-0.20180906012840.5b00837.el7ost
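For reference, the "--timeout 100" that kills the run is the standard tripleoclient deploy timeout (in minutes); the exact invocation is in the attached overcloud_deploy.sh and is not reproduced here, but it is along these lines (all arguments except --timeout are illustrative placeholders):

openstack overcloud deploy --templates \
    --timeout 100 \
    -e <environment files from the attached script> ...

Since the deployment appears genuinely hung on the docker-puppet step rather than merely slow, raising --timeout would presumably just postpone the kill rather than let the deploy finish.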
Created attachment 1485634 [details]
/var/lib/mistral/overcloud/ansible.log
Created attachment 1485635 [details]
/home/stack/overcloud_deploy.sh
Created attachment 1485648 [details]
gnocchi_db_sync stdout.log
Apart from the gnocchi issue on the controller that Alex mentioned, I can also see several tracebacks in /var/log/containers/mistral (tar.gz attached). E.g.:

2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base [-] Failed to create action: qinling.jobs_list: AttributeError: 'NoneType' object has no attribute 'Client'
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base Traceback (most recent call last):
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/action_generator/base.py", line 143, in create_actions
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base     client_method = class_.get_fake_client_method()
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 75, in get_fake_client_method
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base     return cls._get_client_method(cls._get_fake_client())
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 999, in _get_fake_client
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base     return cls._get_client_class()(
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 987, in _get_client_class
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base     return qinlingclient.Client
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base AttributeError: 'NoneType' object has no attribute 'Client'
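For context, this traceback is the signature of an optional client package simply not being installed, not a logic error: mistral imports its OpenStack clients leniently, so a missing python-qinlingclient leaves the module-level name bound to None and the later attribute access blows up. A minimal sketch of the pattern, assuming the usual oslo try_import idiom (the real code is in mistral/actions/openstack/actions.py):

from oslo_utils import importutils

# try_import() returns None instead of raising when the package is absent.
qinlingclient = importutils.try_import('qinlingclient')

def _get_client_class():
    # With python-qinlingclient missing, qinlingclient is None, so this
    # raises: AttributeError: 'NoneType' object has no attribute 'Client'
    return qinlingclient.Client

So these ERROR lines are noise from missing optional clients and are likely unrelated to the deploy hang itself.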
Created attachment 1485661 [details]
var/log/containers/mistral dir
gnocchi_db_sync seems to be timing out when connecting to Ceph [1].

The Ceph cluster seems in good shape [2], and I was able to run 'rbd --user openstack --pool metrics ls' from within the ceph container as root. As the 'gnocchi' user it failed because of permission errors on the ceph-client asok file; I believe this was introduced by the changes for BZ #1610220 in ceph-ansible.

1. http://cougar11.scl.lab.tlv.redhat.com/phase1-14_director-rhel-7.5-virthost-1cont_1comp_1ceph-ipv4-vxlan-ceph-containers/159/controller-0.tar.gz?controller-0/var/log/extra/docker/containers/gnocchi_db_sync/stdout.log
2. http://cougar11.scl.lab.tlv.redhat.com/phase1-14_director-rhel-7.5-virthost-1cont_1comp_1ceph-ipv4-vxlan-ceph-containers/159/controller-0.tar.gz?controller-0/var/log/extra/ceph
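For reference, the check described above amounts to roughly the following on the controller (the container name and the user switch are illustrative assumptions; the actual asok path comes from the ceph-ansible client configuration):

# succeeds as root inside the ceph container:
$ sudo docker exec -u root ceph-mon-controller-0 rbd --user openstack --pool metrics ls

# fails for the gnocchi user with a permission error on the ceph-client admin socket:
$ sudo docker exec -u gnocchi ceph-mon-controller-0 rbd --user openstack --pool metrics ls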
FWIW, the ceph-ansible change looks valid but did not fix the issue in comment #0. I am making a clone of this bug to keep track of the ceph-ansible change, and moving this bug back to OpenStack to continue the investigation from the Gnocchi side.
I second that; I got the same results on a live deployment using both possible fixes:

a) patching ceph-ansible in the mistral-executor container on the UC directly using https://github.com/ceph/ceph-ansible/commit/7d2a13f8c7324c0faf5bc40ff75a179bf1dec364#diff-535f2327a9e3f01dd3ee37ba5fec518b (and redeploying)

b) installing the latest CentOS build of the ceph-ansible package in mistral-executor, http://cbs.centos.org/kojifiles/packages/ceph-ansible/3.1.6/1.el7/noarch/ceph-ansible-3.1.6-1.el7.noarch.rpm (and redeploying)

Neither fixed this issue; the OC deployment still times out after 100 minutes.
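For anyone retrying approach (b) above, it boils down to something like the following on the undercloud, followed by a redeploy (the container name mistral_executor is the usual TripleO default, assumed here; adjust to the local naming):

$ sudo docker exec -u root mistral_executor \
    rpm -Uvh http://cbs.centos.org/kojifiles/packages/ceph-ansible/3.1.6/1.el7/noarch/ceph-ansible-3.1.6-1.el7.noarch.rpm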
Created attachment 1487374 [details]
connection_test.py

We are continuing to troubleshoot with Mehdi. There seems to be an issue connecting to the cluster from within the gnocchi_db_sync container: a simple connection test using cradox works when executed outside the container but fails when launched within it.
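The attached connection_test.py is not reproduced inline; a minimal equivalent of the kind of check it performs, assuming the standard cradox/python-rados API, would be:

import cradox as rados

# cradox mirrors the python-rados API, so the same test works with either.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', rados_id='openstack')
# connect() fails (or hangs until the timeout) when the handshake with the
# monitors cannot complete -- which is what we see inside the container.
cluster.connect(timeout=5)
print('connected, fsid: %s' % cluster.get_fsid())
cluster.shutdown()

Running this on the host succeeds while running it inside the gnocchi_db_sync container does not, which points at the container contents rather than the cluster or the network.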
I have done some testing with a working container (the cinder-api one). TCP connectivity was OK, but strace shows that in the non-working container the Ceph monitor sends a message that makes the client cleanly shut down the connection and retry later.

Then I compared both containers and found that the librados2 versions are different:

gnocchi: librados2-0.94.5-2.el7.x86_64
cinder:  librados2-12.2.4-42.el7cp.x86_64

Then I checked the minimum version required by the server side:

# ceph osd crush dump |grep required
    "minimum_required_version": "jewel",

Jewel == 10.2.0, so Gnocchi is not able to connect to Ceph because the librados version in the Gnocchi container is 0.94. Now we need to figure out why the container was generated with such an old librados2 version.
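For reference, the comparison above can be reproduced roughly as follows (container names are the TripleO defaults and may differ; if gnocchi_db_sync has already exited, run rpm -q against its image with 'docker run --rm' instead):

$ sudo docker exec gnocchi_db_sync rpm -q librados2
librados2-0.94.5-2.el7.x86_64
$ sudo docker exec cinder_api rpm -q librados2
librados2-12.2.4-42.el7cp.x86_64
$ sudo docker exec ceph-mon-controller-0 ceph osd crush dump | grep required
    "minimum_required_version": "jewel",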
In today's puddle, both containers have librados2-12.2.5-42.el7cp.x86_64.
*** This bug has been marked as a duplicate of bug 1633288 ***