Bug 1547671 - Ceph containerized OSDs fail to start with a deployment of dedicated monitor nodes
Summary: Ceph containerized OSDs fail to start with a deployment of dedicated monitor ...
Keywords:
Status: CLOSED DUPLICATE of bug 1541152
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: James Slagle
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-21 16:48 UTC by Yogev Rabl
Modified: 2018-02-22 10:10 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-22 10:10:20 UTC
Target Upstream Version:
Embargoed:


Attachments
ceph-install-workflow log (1.10 MB, text/plain), 2018-02-21 16:48 UTC, Yogev Rabl

Description Yogev Rabl 2018-02-21 16:48:03 UTC
Created attachment 1398911 [details]
ceph-install-workflow log

Description of problem:
ceph-ansible failed to report failures in the deployment and startup of the OSDs in the Ceph cluster.
The summary of the cluster deployment in /var/log/mistral/ceph-install-workflow.log is:

2018-02-21 11:14:37,429 p=7359 u=mistral |  192.168.24.10              : ok=109  changed=17   unreachable=0    failed=0
2018-02-21 11:14:37,429 p=7359 u=mistral |  192.168.24.14              : ok=58   changed=6    unreachable=0    failed=0
2018-02-21 11:14:37,429 p=7359 u=mistral |  192.168.24.15              : ok=37   changed=3    unreachable=0    failed=0
2018-02-21 11:14:37,429 p=7359 u=mistral |  192.168.24.17              : ok=56   changed=6    unreachable=0    failed=0
2018-02-21 11:14:37,429 p=7359 u=mistral |  192.168.24.6               : ok=56   changed=6    unreachable=0    failed=0

Of these nodes, 192.168.24.6, 192.168.24.14, and 192.168.24.17 are the Ceph storage nodes.
The state of the OSD disks on those nodes is:

$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    252:0    0   20G  0 disk
├─vda1 252:1    0    1M  0 part
└─vda2 252:2    0   20G  0 part /
vdb    252:16   0   40G  0 disk
├─vdb1 252:17   0 39.5G  0 part
└─vdb2 252:18   0  512M  0 part
vdc    252:32   0   40G  0 disk
├─vdc1 252:33   0 39.5G  0 part
└─vdc2 252:34   0  512M  0 part

(The disks were partitioned, but no OSD data partitions are mounted.)
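
For reference, a hedged way to confirm the OSD state on a storage node; the unit filter and container name pattern below are assumptions and may vary per deployment:

$ sudo systemctl --failed | grep ceph-osd
$ sudo docker ps -a --filter name=ceph-osd
$ sudo docker logs ceph-osd-ceph-0-vdb    # container name assumed to follow ceph-osd-<hostname>-<device>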

From journalctl on one of the Ceph storage nodes (ceph-0):

Feb 21 16:33:41 ceph-0 systemd[1]: ceph-osd failed.
Feb 21 16:33:41 ceph-0 ceph-osd-run.sh[156961]: command: Running command: /usr/bin/ceph-detect-init --default sysvinit
Feb 21 16:33:41 ceph-0 dockerd-current[14904]: command: Running command: /usr/bin/ceph-detect-init --default sysvinit
Feb 21 16:33:41 ceph-0 dockerd-current[14904]: activate: Marking with init system none
Feb 21 16:33:41 ceph-0 ceph-osd-run.sh[156961]: activate: Marking with init system none
Feb 21 16:33:41 ceph-0 ceph-osd-run.sh[156961]: command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.NdxHnQ/none
Feb 21 16:33:41 ceph-0 dockerd-current[14904]: command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.NdxHnQ/none
Feb 21 16:33:41 ceph-0 dockerd-current[14904]: command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.NdxHnQ/none
Feb 21 16:33:41 ceph-0 ceph-osd-run.sh[156961]: command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.NdxHnQ/none
Feb 21 16:33:41 ceph-0 ceph-osd-run.sh[156961]: activate: ceph osd.1 data dir is ready at /var/lib/ceph/tmp/mnt.NdxHnQ
Feb 21 16:33:41 ceph-0 ceph-osd-run.sh[156961]: move_mount: Moving mount to final location...
Feb 21 16:33:41 ceph-0 ceph-osd-run.sh[156961]: command_check_call: Running command: /bin/mount -o noatime,largeio,inode64,swalloc -- /dev/vdb1 /var/lib/ceph/osd/ceph-1
Feb 21 16:33:41 ceph-0 dockerd-current[14904]: activate: ceph osd.1 data dir is ready at /var/lib/ceph/tmp/mnt.NdxHnQ
Feb 21 16:33:41 ceph-0 dockerd-current[14904]: move_mount: Moving mount to final location...
Feb 21 16:33:41 ceph-0 dockerd-current[14904]: command_check_call: Running command: /bin/mount -o noatime,largeio,inode64,swalloc -- /dev/vdb1 /var/lib/ceph/osd/ceph-1
Feb 21 16:33:41 ceph-0 ceph-osd-run.sh[156961]: command_check_call: Running command: /bin/umount -l -- /var/lib/ceph/tmp/mnt.NdxHnQ
Feb 21 16:33:41 ceph-0 dockerd-current[14904]: command_check_call: Running command: /bin/umount -l -- /var/lib/ceph/tmp/mnt.NdxHnQ
Feb 21 16:33:41 ceph-0 ceph-osd-run.sh[156961]: 2018-02-21 16:33:41  /entrypoint.sh: SUCCESS
Feb 21 16:33:41 ceph-0 dockerd-current[14904]: 2018-02-21 16:33:41  /entrypoint.sh: SUCCESS
Feb 21 16:33:42 ceph-0 ceph-osd-run.sh[156961]: starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
Feb 21 16:33:42 ceph-0 dockerd-current[14904]: starting osd.1 at - osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: 2018-02-21 16:33:43.263267 7f6516cd3d00 -1 osd.1 28 log_to_monitors {default=true}
Feb 21 16:33:43 ceph-0 ceph-osd-run.sh[156961]: 2018-02-21 16:33:43.263267 7f6516cd3d00 -1 osd.1 28 log_to_monitors {default=true}
Feb 21 16:33:43 ceph-0 ceph-osd-run.sh[156961]: 2018-02-21 16:33:43.272350 7f6516cd3d00 -1 osd.1 28 init authentication failed: (1) Operation not permitted
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: 2018-02-21 16:33:43.272350 7f6516cd3d00 -1 osd.1 28 init authentication failed: (1) Operation not permitted
Feb 21 16:33:43 ceph-0 kernel: XFS (vdb1): Unmounting Filesystem
Feb 21 16:33:43 ceph-0 oci-systemd-hook[158052]: systemdhook <debug>: 13fa4832bdd2: Skipping as container command is /entrypoint.sh, not init or systemd
Feb 21 16:33:43 ceph-0 oci-umount[158055]: umounthook <debug>: 13fa4832bdd2: only runs in prestart stage, ignoring
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.435399957-05:00" level=debug msg="containerd: process exited" id=13fa4832bdd2aa52d4eca7e52bf5ba487d10157ed77c4063f519d
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.443381118-05:00" level=error msg="containerd: deleting container" error="exit status 1: \"container 13fa4832bdd2aa52d4
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.444025747-05:00" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"exit\", Id:\"13fa4832b
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.444272070-05:00" level=debug msg="attach: stdout: end"
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.444288848-05:00" level=debug msg="attach: stderr: end"
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.444361565-05:00" level=debug msg="AuthZ response using plugin rhel-push-plugin"
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.479011114-05:00" level=debug msg="Calling GET /_ping"
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.479252948-05:00" level=debug msg="{Action=_ping, Username=heat-admin, LoginUID=1000, PID=158062}"
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.479472039-05:00" level=debug msg="AuthZ request using plugin rhel-push-plugin"
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.480531419-05:00" level=debug msg="AuthZ response using plugin rhel-push-plugin"
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.481473631-05:00" level=debug msg="Calling GET /v1.26/containers/json"
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.481648277-05:00" level=debug msg="{Action=json, Username=heat-admin, LoginUID=1000, PID=158062}"
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.481677925-05:00" level=debug msg="AuthZ request using plugin rhel-push-plugin"
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.484338982-05:00" level=warning msg="13fa4832bdd2aa52d4eca7e52bf5ba487d10157ed77c4063f519d2c3fc3a966c cleanup: failed t
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.538930367-05:00" level=debug msg="AuthZ response using plugin rhel-push-plugin"
Feb 21 16:33:43 ceph-0 dockerd-current[14904]: time="2018-02-21T11:33:43.544054834-05:00" level=debug msg="Removing volume reference: driver local, name 245738349e633015d779c1d85fa78b5998e6f3
Feb 21 16:33:43 ceph-0 systemd[1]: ceph-osd: main process exited, code=exited, status=1/FAILURE
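
For context, a general Ceph note rather than something taken from this log: "init authentication failed: (1) Operation not permitted" usually means the OSD's cephx key is rejected by the monitors, i.e. the key in the OSD data directory does not match, or is missing from, the monitors' auth database. A hedged way to compare the two, with container and mount-point names assumed:

# On the dedicated monitor node (mon container name assumed to be ceph-mon-<short hostname>):
$ sudo docker exec ceph-mon-$(hostname -s) ceph auth get osd.1

# On the failing storage node; the data partition is unmounted after the failure,
# so mount it temporarily (mount point /mnt is arbitrary):
$ sudo mount /dev/vdb1 /mnt && sudo cat /mnt/keyring; sudo umount /mnt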

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.25-1.el7cp.noarch
python2-mistral-lib-0.3.3-0.20180109062152.8986ce9.el7ost.noarch
python2-mistralclient-3.1.4-0.20171117092239.291501a.el7ost.noarch
openstack-mistral-engine-6.0.0-0.20180122153726.ae7950e.el7ost.noarch
python-mistral-6.0.0-0.20180122153726.ae7950e.el7ost.noarch
openstack-mistral-common-6.0.0-0.20180122153726.ae7950e.el7ost.noarch
puppet-mistral-12.2.0-0.20180119074354.379b7ce.el7ost.noarch
openstack-mistral-api-6.0.0-0.20180122153726.ae7950e.el7ost.noarch
openstack-mistral-executor-6.0.0-0.20180122153726.ae7950e.el7ost.noarch
puppet-tripleo-8.2.0-0.20180122224520.el7ost.noarch
openstack-tripleo-image-elements-8.0.0-0.20180117094122.02d0985.el7ost.noarch
openstack-tripleo-ui-8.1.1-0.20180122135122.aef02d8.el7ost.noarch
openstack-tripleo-puppet-elements-8.0.0-0.20180117092204.120eca8.el7ost.noarch
openstack-tripleo-heat-templates-8.0.0-0.20180122224017.el7ost.noarch
openstack-tripleo-validations-8.1.1-0.20180119231917.2ff3c79.el7ost.noarch
openstack-tripleo-common-8.3.1-0.20180123050219.el7ost.noarch
ansible-tripleo-ipsec-0.0.1-0.20180119094817.5e80d4f.el7ost.noarch
python-tripleoclient-9.0.1-0.20180119233147.el7ost.noarch
openstack-tripleo-common-containers-8.3.1-0.20180123050219.el7ost.noarch

How reproducible:
unknown

Steps to Reproduce:
1. Deploy an overcloud with 1 dedicated node running the Ceph monitor and mgr, 3 controller nodes, 1 compute node, and 3 Ceph storage nodes with 2 OSDs each (a hedged sketch of the deploy command follows below).
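
A hedged sketch of a deploy command for this topology, not the exact command used here; the custom role name "CephMon", the file names, and the counts file are assumptions:

# roles_data_cephmon.yaml: the default roles_data.yaml plus a custom "CephMon" role that
# carries OS::TripleO::Services::CephMon and OS::TripleO::Services::CephMgr.
# node-counts.yaml: ControllerCount: 3, ComputeCount: 1, CephStorageCount: 3, CephMonCount: 1.
$ openstack overcloud deploy --templates \
    -r ~/roles_data_cephmon.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
    -e ~/node-counts.yaml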

Actual results:
The Ceph cluster is not functional, yet the deployment does not report the failure.

Expected results:
1) The OSDs are running and the Ceph cluster is functional (a hedged verification check follows below).
2) In case of an error, the deployment should fail and report the Ceph cluster error.
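
For expected result 1, a hedged check run from the dedicated monitor node (mon container name assumed to be ceph-mon-<short hostname>); with 3 storage nodes of 2 OSDs each, a healthy cluster should report 6 OSDs up and in:

$ sudo docker exec ceph-mon-$(hostname -s) ceph -s
$ sudo docker exec ceph-mon-$(hostname -s) ceph osd tree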

Additional info:

