Bug 1631789 - admin_socket is always set in ceph.conf, including when deploying jewel/kraken/luminous
Keywords:
Status: CLOSED DUPLICATE of bug 1633288
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: ceph
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 14.0 (Rocky)
Assignee: Ric Wheeler
QA Contact: Warren
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-21 14:55 UTC by Filip Hubík
Modified: 2018-10-08 11:13 UTC
CC: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1633288
Environment:
Last Closed: 2018-10-03 10:53:54 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
/home/stack/overcloud_install.log (6.42 MB, text/plain)
2018-09-21 14:55 UTC, Filip Hubík
/var/lib/mistral/overcloud/ansible.log (6.77 MB, text/plain)
2018-09-21 14:58 UTC, Filip Hubík
/home/stack/overcloud_deploy.sh (883 bytes, application/x-shellscript)
2018-09-21 14:59 UTC, Filip Hubík
gnocchi_db_sync stdout.log (61.94 KB, text/plain)
2018-09-21 15:05 UTC, Alex Schultz
var/log/containers/mistral dir (5.05 MB, application/x-gzip)
2018-09-21 15:14 UTC, Filip Hubík
connection_test.py (344 bytes, text/plain)
2018-09-26 16:46 UTC, Giulio Fidente

Description Filip Hubík 2018-09-21 14:55:27 UTC
Created attachment 1485633 [details]
/home/stack/overcloud_install.log

Description of problem:

Overcloud deploy is still stuck 100 minutes after the deployment started; it is then killed by "--timeout 100".

Last entries in /home/stack/overcloud_install.log are:
...
ok: [ceph-0] => {
    "failed_when_result": false, 
    "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
        "stdout: ce811316fb3f82468d810fa1199e2d430376ee3e6307b844a4bd270b45b26442", 
        "", 
        "stderr: "
    ]
}

TASK [Check if /var/lib/docker-puppet/docker-puppet-tasks4.json exists] ********
Friday 21 September 2018  08:43:44 -0400 (0:00:00.217)       0:27:06.634 ****** 
ok: [ceph-0] => {"changed": false, "stat": {"exists": false}}
ok: [compute-0] => {"changed": false, "stat": {"exists": false}}
ok: [controller-0] => {"changed": false, "stat": {"atime": 1537532467.8883862, "attr_flags": "", "attributes": [], "block_size": 4096, "blocks": 8, "charset": "us-ascii", "checksum": "fc092b06b8d6fdc6d18320ab604ab9a6ebc1e1ae", "ctime": 1537532467.8933864, "dev": 64514, "device_type": 0, "executable": false, "exists": true, "gid": 0, "gr_name": "root", "inode": 41943523, "isblk": false, "ischr": false, "isdir": false, "isfifo": false, "isgid": false, "islnk": false, "isreg": true, "issock": false, "isuid": false, "mimetype": "text/plain", "mode": "0600", "mtime": 1537532467.7203863, "nlink": 1, "path": "/var/lib/docker-puppet/docker-puppet-tasks4.json", "pw_name": "root", "readable": true, "rgrp": false, "roth": false, "rusr": true, "size": 321, "uid": 0, "version": "18446744073615036390", "wgrp": false, "woth": false, "writeable": true, "wusr": true, "xgrp": false, "xoth": false, "xusr": false}}

TASK [Run docker-puppet tasks (bootstrap tasks) for step 4] ********************
Friday 21 September 2018  08:43:44 -0400 (0:00:00.456)       Connection is already closed.
0:27:07.090 ****** 
skipping: [compute-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
skipping: [ceph-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
ok: [controller-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}



Last entries in /var/lib/mistral/overcloud/ansible.log are:
...
2018-09-21 08:44:33,941 p=891 u=mistral |  TASK [Run docker-puppet tasks (generate config) during step 5] *****************
2018-09-21 08:44:33,941 p=891 u=mistral |  Friday 21 September 2018  08:44:33 -0400 (0:00:00.162)       0:27:56.428 ****** 
2018-09-21 08:44:33,977 p=891 u=mistral |  skipping: [controller-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2018-09-21 08:44:34,012 p=891 u=mistral |  skipping: [compute-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2018-09-21 08:44:34,028 p=891 u=mistral |  skipping: [ceph-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2018-09-21 08:44:34,058 p=891 u=mistral |  TASK [Debug output for task which failed: Run docker-puppet tasks (generate config) during step 5] ***
2018-09-21 08:44:34,058 p=891 u=mistral |  Friday 21 September 2018  08:44:34 -0400 (0:00:00.116)       0:27:56.544 ****** 
2018-09-21 08:44:34,094 p=891 u=mistral |  skipping: [controller-0] => {}
2018-09-21 08:44:34,129 p=891 u=mistral |  skipping: [compute-0] => {}
2018-09-21 08:44:34,145 p=891 u=mistral |  skipping: [ceph-0] => {}
2018-09-21 08:44:34,175 p=891 u=mistral |  TASK [Start containers for step 5] *********************************************
2018-09-21 08:44:34,176 p=891 u=mistral |  Friday 21 September 2018  08:44:34 -0400 (0:00:00.117)       0:27:56.662 ****** 
2018-09-21 08:44:34,756 p=891 u=mistral |  ok: [compute-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
2018-09-21 08:44:34,786 p=891 u=mistral |  ok: [ceph-0] => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}



Version-Release number of selected component (if applicable):
OSP14, puddle 2018-09-20.1, topology 1uc:1contr:1comp:1ceph

How reproducible:
100%

Steps to Reproduce:
1. Deploy mentioned topology using InfraRed
2. OC deployment times out

Additional info:
heat-cfntools.noarch            1.3.0-2.el7ost
openstack-heat-agents.noarch    1.7.1-0.20180907213355.476aae2.el7ost
openstack-heat-api.noarch       1:12.0.0-0.20180604085325.7d878a8.el7ost
openstack-heat-common.noarch    1:12.0.0-0.20180604085325.7d878a8.el7ost
openstack-heat-engine.noarch    1:12.0.0-0.20180604085325.7d878a8.el7ost
openstack-heat-monolith.noarch  1:12.0.0-0.20180604085325.7d878a8.el7ost
openstack-tripleo-heat-templates.noarch 9.0.0-0.20180906145841.66804ff.0rc1.0rc1.el7ost
puppet-heat.noarch              13.3.1-0.20180831195745.28088f9.el7ost
python-tripleoclient-heat-installer.noarch 10.5.1-0.20180906012840.5b00837.el7ost

Comment 1 Filip Hubík 2018-09-21 14:58:36 UTC
Created attachment 1485634 [details]
/var/lib/mistral/overcloud/ansible.log

Comment 2 Filip Hubík 2018-09-21 14:59:13 UTC
Created attachment 1485635 [details]
/home/stack/overcloud_deploy.sh

Comment 4 Alex Schultz 2018-09-21 15:05:56 UTC
Created attachment 1485648 [details]
gnocchi_db_sync stdout.log

Comment 5 Filip Hubík 2018-09-21 15:13:46 UTC
Apart from the gnocchi issue on the controller that Alex mentioned, I can also see several tracebacks in /var/log/containers/mistral (tar.gz attached). E.g.:

2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base [-] Failed to create action: qinling.jobs_list: AttributeError: 'NoneType' object has no attribute 'Client'
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base Traceback (most recent call last):
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/action_generator/base.py", line 143, in create_actions
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base     client_method = class_.get_fake_client_method()
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/base.py", line 75, in get_fake_client_method
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base     return cls._get_client_method(cls._get_fake_client())
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 999, in _get_fake_client
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base     return cls._get_client_class()(
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base   File "/usr/lib/python2.7/site-packages/mistral/actions/openstack/actions.py", line 987, in _get_client_class
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base     return qinlingclient.Client
2018-09-21 07:37:16.021 10 ERROR mistral.actions.openstack.action_generator.base AttributeError: 'NoneType' object has no attribute 'Client'

Comment 6 Filip Hubík 2018-09-21 15:14:37 UTC
Created attachment 1485661 [details]
var/log/containers/mistral dir

Comment 7 Giulio Fidente 2018-09-21 16:00:53 UTC
gnocchi_db_sync seems to be timing out when connecting to ceph [1]

The Ceph cluster seems in good shape [2] and I was able to run 'rbd --user openstack --pool metrics ls' from within the ceph container, as root.

As the 'gnocchi' user it failed because of permission errors on the ceph-client asok file; I believe this was introduced with the changes for BZ #1610220 in ceph-ansible.

1. http://cougar11.scl.lab.tlv.redhat.com/phase1-14_director-rhel-7.5-virthost-1cont_1comp_1ceph-ipv4-vxlan-ceph-containers/159/controller-0.tar.gz?controller-0/var/log/extra/docker/containers/gnocchi_db_sync/stdout.log

2. http://cougar11.scl.lab.tlv.redhat.com/phase1-14_director-rhel-7.5-virthost-1cont_1comp_1ceph-ipv4-vxlan-ceph-containers/159/controller-0.tar.gz?controller-0/var/log/extra/ceph
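The permission failure described in this comment can be sketched as follows. This is purely illustrative: on a real node the admin socket lives under /var/run/ceph/ and is created root-owned by the daemon; the demo below stands it in with a temporary file to show why a non-root client user gets EACCES.

```python
import os
import stat
import tempfile

# Stand-in for the ceph-client .asok file (path is illustrative; a real one
# would be e.g. /var/run/ceph/ceph-client.openstack.*.asok, owned by root).
fd, asok = tempfile.mkstemp(suffix=".asok")
os.close(fd)
os.chmod(asok, 0o600)  # owner-only, as written by the root-run process

mode = stat.S_IMODE(os.stat(asok).st_mode)
# With no group/other bits set, any non-owner (e.g. the 'gnocchi' user)
# is denied access to the socket path.
group_or_other_access = bool(mode & (stat.S_IRWXG | stat.S_IRWXO))
print(oct(mode), group_or_other_access)  # 0o600 False
os.unlink(asok)
```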

Comment 15 Giulio Fidente 2018-09-26 15:27:19 UTC
FWIW, the ceph-ansible change looks valid but did not fix the issue in comment #0; I am making a clone of this bug to keep track of the ceph-ansible change, and moving this bug back to OpenStack to continue the investigation from the Gnocchi side.

Comment 19 Filip Hubík 2018-09-26 16:00:35 UTC
I second that; I got the same results on a live deployment, using both approaches to a possible fix:

a) patching ceph-ansible in mistral-executor container on UC directly using https://github.com/ceph/ceph-ansible/commit/7d2a13f8c7324c0faf5bc40ff75a179bf1dec364#diff-535f2327a9e3f01dd3ee37ba5fec518b (and redeployment)

b) installing latest centos build of package ceph-ansible in mistral-executor http://cbs.centos.org/kojifiles/packages/ceph-ansible/3.1.6/1.el7/noarch/ceph-ansible-3.1.6-1.el7.noarch.rpm (and redeployment)

Neither fixed this issue; the OC deployment still times out after 100 minutes.

Comment 20 Giulio Fidente 2018-09-26 16:46:07 UTC
Created attachment 1487374 [details]
connection_test.py

We are continuing to troubleshoot with Mehdi.

There seems to be an issue connecting to the cluster from within the gnocchi_db_sync container; a simple connection test using cradox works when executed outside the container but fails when launched within the container.
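The test is along these lines (a hypothetical sketch, not the attached connection_test.py, whose contents are not reproduced here; cradox is an API-compatible reimplementation of python-rados, so the `rados` module is used below):

```python
def can_connect(conffile="/etc/ceph/ceph.conf", user="openstack", timeout_s=5):
    """Return True if a rados client can reach the cluster monitors."""
    import rados  # python-rados; cradox exposes the same Rados API

    cluster = rados.Rados(conffile=conffile, rados_id=user)
    # Bound the monitor handshake so a hang shows up as a failure, not a stall.
    cluster.conf_set("client_mount_timeout", str(timeout_s))
    try:
        cluster.connect()
        return True
    except rados.Error:
        return False
    finally:
        cluster.shutdown()
```

Run outside the container it returns True; inside the gnocchi_db_sync container it hangs or fails.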

Comment 21 Mehdi ABAAKOUK 2018-09-27 10:09:48 UTC
I have done some testing with a working container (the cinder-api one).

TCP connectivity was OK, but strace shows that in the non-working container the Ceph monitor sends a message that makes the client shut down the connection cleanly and retry later.

Then I did some comparison of the two containers, and found that the librados2 versions are different:

 gnocchi: librados2-0.94.5-2.el7.x86_64
 cinder: librados2-12.2.4-42.el7cp.x86_64

Then I checked the minimum version required by the server side:

# ceph osd crush  dump |grep required
        "minimum_required_version": "jewel",


Jewel == 10.2.0, so Gnocchi is not able to connect to Ceph because Gnocchi's librados version is only 0.94.

Now we need to figure out why the container was built with such an old librados2 version.
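The version mismatch above can be checked mechanically. Ceph release names map to version numbers (hammer == 0.94.x, jewel == 10.2.0), so a hammer-era librados2, as shipped in the gnocchi container, falls below the cluster's required minimum while the cinder one does not:

```python
def vtuple(version):
    """Parse 'X.Y.Z' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))

JEWEL = vtuple("10.2.0")              # minimum_required_version: "jewel"
gnocchi_librados = vtuple("0.94.5")   # librados2-0.94.5-2.el7 (hammer era)
cinder_librados = vtuple("12.2.4")    # librados2-12.2.4-42.el7cp (luminous)

print(gnocchi_librados >= JEWEL)  # False -> gnocchi client refused
print(cinder_librados >= JEWEL)   # True  -> cinder client accepted
```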

Comment 22 Thierry Vignaud 2018-09-27 12:05:48 UTC
In today's puddle, both have librados2-12.2.5-42.el7cp.x86_64.

Comment 23 Giulio Fidente 2018-10-03 10:53:54 UTC

*** This bug has been marked as a duplicate of bug 1633288 ***

