Description of problem:
An overcloud deployment ran and completed without any error even though Ceph's OSDs are not running on the Ceph storage nodes.

The error messages on the Ceph storage node are:

Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command: Running command: /usr/sbin/blkid -o udev -p /dev/vdb1
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/vdb1
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/vdb1
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: mount: Mounting /dev/vdb1 on /var/lib/ceph/tmp/mnt.ZlKvTe with options noatime,largeio,inode64,swalloc
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: mount: Mounting /dev/vdb1 on /var/lib/ceph/tmp/mnt.ZlKvTe with options noatime,largeio,inode64,swalloc
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command_check_call: Running command: /usr/bin/mount -t xfs -o noatime,largeio,inode64,swalloc -- /dev/vdb1 /var/lib/ceph/tmp/mnt.ZlKvTe
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: command_check_call: Running command: /usr/bin/mount -t xfs -o noatime,largeio,inode64,swalloc -- /dev/vdb1 /var/lib/ceph/tmp/mnt.ZlKvTe
Sep 11 08:53:10 ceph-0 kernel: XFS (vdb1): Mounting V5 Filesystem
Sep 11 08:53:10 ceph-0 kernel: XFS (vdb1): Ending clean mount
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: command: Running command: /usr/sbin/restorecon /var/lib/ceph/tmp/mnt.ZlKvTe
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command: Running command: /usr/sbin/restorecon /var/lib/ceph/tmp/mnt.ZlKvTe
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: activate: Cluster uuid is 2a57f5e2-94d2-11e7-897c-52540015ce25
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: activate: Cluster uuid is 2a57f5e2-94d2-11e7-897c-52540015ce25
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: activate: Cluster name is ceph
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: activate: Cluster name is ceph
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: activate: OSD uuid is 9aba92c8-102f-4c62-8dfa-9bc393b35f50
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: activate: OSD uuid is 9aba92c8-102f-4c62-8dfa-9bc393b35f50
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: activate: OSD id is 0
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: activate: OSD id is 0
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup init
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup init
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: command: Running command: /usr/bin/ceph-detect-init --default sysvinit
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command: Running command: /usr/bin/ceph-detect-init --default sysvinit
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: activate: Marking with init system systemd
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: activate: Marking with init system systemd
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.ZlKvTe/systemd
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.ZlKvTe/systemd
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.ZlKvTe/systemd
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.ZlKvTe/systemd
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: activate: ceph osd.0 data dir is ready at /var/lib/ceph/tmp/mnt.ZlKvTe
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: move_mount: Moving mount to final location...
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: command_check_call: Running command: /bin/mount -o noatime,largeio,inode64,swalloc -- /dev/vdb1 /var/lib/ceph/osd/ceph-0
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: activate: ceph osd.0 data dir is ready at /var/lib/ceph/tmp/mnt.ZlKvTe
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: move_mount: Moving mount to final location...
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command_check_call: Running command: /bin/mount -o noatime,largeio,inode64,swalloc -- /dev/vdb1 /var/lib/ceph/osd/ceph-0
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: command_check_call: Running command: /bin/umount -l -- /var/lib/ceph/tmp/mnt.ZlKvTe
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: command_check_call: Running command: /bin/umount -l -- /var/lib/ceph/tmp/mnt.ZlKvTe
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: 2017-09-11 08:53:10.671539 7f0563b82700 0 librados: osd.0 authentication error (1) Operation not permitted
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: 2017-09-11 08:53:10.671539 7f0563b82700 0 librados: osd.0 authentication error (1) Operation not permitted
Sep 11 08:53:10 ceph-0 ceph-osd-run.sh[58017]: Error connecting to cluster: PermissionError
Sep 11 08:53:10 ceph-0 dockerd-current[15732]: Error connecting to cluster: PermissionError

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.0-0.1.rc4.el7cp.noarch
openstack-tripleo-image-elements-7.0.0-0.20170830150703.526772d.el7ost.noarch
puppet-tripleo-7.3.1-0.20170831100515.0457aa1.el7ost.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170901051303.0rc1.el7ost.noarch
python-tripleoclient-7.2.1-0.20170831202445.0bd00bb.el7ost.noarch
openstack-tripleo-common-containers-7.5.1-0.20170831015949.2517e1e.el7ost.1.noarch
openstack-tripleo-puppet-elements-7.0.0-0.20170831100659.2094778.el7ost.noarch
openstack-tripleo-common-7.5.1-0.20170831015949.2517e1e.el7ost.1.noarch
openstack-tripleo-validations-7.3.1-0.20170831052729.67faa39.el7ost.noarch

How reproducible:
unknown

Steps to Reproduce:
1. Run a deployment of an overcloud with containerized Ceph

Actual results:
The Ceph OSDs are not running in the cluster

Expected results:
The Ceph OSDs are running inside containers

Additional info:
Yogev, what version of ceph-ansible are you using on the undercloud?

Are you sure the disks are cleared from previous deployments and there are no pre-existing Ceph partitions on them? You can zap the disks manually with sgdisk -Z or with Ironic.

If they are clean, can you see if adding 'osd_objectstore: filestore' and 'osd_scenario: collocated' to CephAnsibleDisksConfig helps?
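For example, an environment file along these lines should set those two variables. This is only a sketch: the device list and journal_size below mirror the defaults shown in docker/services/ceph-ansible/ceph-osd.yaml and need to match the actual hardware.

parameter_defaults:
  CephAnsibleDisksConfig:
    devices:
      - /dev/vdb               # OSD data disk; example value, match your nodes
    journal_size: 512
    osd_scenario: collocated   # journal shares the data disk
    osd_objectstore: filestore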
(In reply to Yogev Rabl from comment #4)
> (In reply to Giulio Fidente from comment #2)
> > Yogev, what version of ceph-ansible are you using on the undercloud?
> >
> > Are you sure the disks are cleared from previous deployments and there are
> > no pre-existing ceph partitions on them? You can zap the disks manually with
> > sgdisk -Z or with ironic.
> >
> > If they are clean, can you see if adding 'osd_objectstore: filestore' and
> > 'osd_scenario: collocated' to CephAnsibleDisksConfig helps?
>
> Giulio, I think that the disks were not clean.

I removed the blocker flag from this bug. It is still a bug, but with a much lower priority.
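For the record, before the next attempt the disks can be cleared up front with a small ad-hoc play along these lines. Only a sketch: the 'ceph_storage' group name and /dev/vdb are placeholders, not values taken from this environment.

# Sketch only: wipe the partition table on the candidate OSD disks
# before redeploying. Host group and device path are assumptions.
- hosts: ceph_storage
  become: true
  tasks:
    - name: zap pre-existing partitions on the OSD disk
      command: sgdisk -Z /dev/vdb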
Yogev,

Would you please share your heat templates? Can you confirm that you intended to deploy with the disk /dev/vdb and with no separate journal disk, i.e. with a collocated journal?

FWIW: I reproduced an error like this in a quickstart env using:
- site-docker-tripleo.yml.sample (from today)
- docker/services/ceph-ansible/ceph-osd.yaml [1], which has the params from comment #2 (our docs need an update for this)

However, I was using a separate journal disk and not collocating, so the default in docker/services/ceph-ansible/ceph-osd.yaml was wrong. I then updated one of my env files to override the default with the following:

+ CephAnsibleExtraConfig:
+   osd_objectstore: filestore
+   osd_scenario: non-collocated

I then re-ran my openstack overcloud deploy command to update. The playbook re-ran and I got a HEALTH_OK ceph cluster. I agree, however, that the deploy should have failed in the first place if the OSDs were not running [2].

John

[1]
(undercloud) [stack@undercloud templates]$ grep -B 4 -A 4 osd_ docker/services/ceph-ansible/ceph-osd.yaml
      devices:
      - /dev/vdb
      journal_size: 512
      journal_collocation: true
      osd_scenario: collocated

resources:
  CephBase:
    type: ./ceph-base.yaml
--
        - tripleo.ceph_osd.firewall_rules:
            '111 ceph_osd':
              dport:
              - '6800-7300'
        - ceph_osd_ansible_vars:
            map_merge:
            - {get_attr: [CephBase, role_data, config_settings, ceph_common_ansible_vars]}
            - osd_objectstore: filestore
            - {get_param: CephAnsibleDisksConfig}
(undercloud) [stack@undercloud templates]$

[2]
[root@overcloud-cephstorage-0 ~]# docker ps -a
CONTAINER ID   IMAGE                                                    COMMAND                  CREATED        STATUS                     PORTS   NAMES
9a0f4d156e8e   tripleoupstream/centos-binary-cron:latest                "kolla_start"            42 hours ago   Up 42 hours                        logrotate_crond
fd2811f3020d   docker.io/ceph/daemon:tag-build-master-jewel-centos-7    "/usr/bin/ceph --vers"   43 hours ago   Exited (0) 43 hours ago            stoic_archimedes
[root@overcloud-cephstorage-0 ~]#
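For anyone hitting this with a dedicated journal disk, the full override would look roughly like the snippet below. This is only a sketch: /dev/vdb, /dev/vdc and the dedicated_devices list are my assumptions about the disk layout, not values taken from Yogev's templates.

parameter_defaults:
  CephAnsibleExtraConfig:
    osd_objectstore: filestore
    osd_scenario: non-collocated
  CephAnsibleDisksConfig:
    devices:
      - /dev/vdb             # OSD data disk (assumption)
    dedicated_devices:
      - /dev/vdc             # journal disk (assumption)
    journal_size: 512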
I am adjusting upstream docs to cover the params for different journal scenarios:

https://review.openstack.org/#/c/502557

Also, you need the following:

https://review.openstack.org/#/c/501983/

I am re-testing both scenarios (shared journal and separate journal).
Update on re-testing: 1. shared journal disk scenario using tht from doc: new deploy succeeds and ceph HEALTH_OK 2. separate journal disk scenario using tht from doc: new deploy failed on the following ceph-ansible task: https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-osd/tasks/check_devices.yml#L50-L60 2017-09-11 22:00:34,924 p=24485 u=mistral | TASK [ceph-osd : create gpt disk label of the journal device(s)] *************** 2017-09-11 22:00:35,215 p=24485 u=mistral | failed: [192.168.24.11] (item=[{'_ansible_parsed': True, 'stderr_lines': [], '_ansible_item_result': True, u'end': u'2017-09-11 22:00:33.913658', '_ansible_no_log': False, u'stdout': u'', u'cmd': u'parted --script /dev/vdb print > /dev/null 2>&1', u'rc': 1, 'item': [{'_ansible_parsed': True, 'stderr_lines': [], '_ansible_item_result': True, u'end': u'2017-09-11 22:00:33.333649', '_ansible_no_log': False, u'stdout': u'', u'cmd': u"readlink -f /dev/vdb | egrep '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p)[0-9]{1,2}|fio[a-z]{1,2}[0-9]{1,2}$'", u'rc': 1, 'item': u'/dev/vdb', u'delta': u'0:00:00.004784', u'stderr': u'', u'change None, u'_uses_shell': True, u'_raw_params': u"readlink -f /dev/vdb | egrep '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p)[0-9]{1,2}|fio[a-z]{1,2}[0-9]{1,2}$'", u'removes': None, u'creates': None, u'chdir': None}}, 'stdout_lines': [], 'failed_when_result': False, u'start': u'2017-09-11 22:00:33.328865', 'failed': False}, u'/dev/vdb'], u'delta': u'0:00:00.008560', u'stderr': u'', u'changed': False, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u'parted --script /dev/vdb print > /dev/null 2>&1', u'removes': None, u'creates': None, u'chdir': None}}, 'stdout_lines': [], 'failed_when_result': False, u'start': u'2017-09-11 22:00:33.905098', 'failed': False}, None]) => {"changed": false, "cmd": ["parted", "--script", "mklabel", "gpt"], "delta": "0:00:00.003180", "end": "2017-09-11 22:00:35.951058", "failed": true, "item": [{"_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "cmd": "parted --script /dev/vdb print > /dev/null 2>&1", "delta": "0:00:00.008560", "end": "2017-09-11 22:00:33.913658", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "parted --script /dev/vdb print > /dev/null 2>&1", "_uses_shell": true, "chdir": null, "creates": null, "executable": null, "removes": null, "warn": true}}, "item": [{"_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "cmd": "readlink -f /dev/vdb | egrep '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p)[0-9]{1,2}|fio[a-z]{1,2}[0-9]{1,2}$'", "delta": "0:00:00.004784", "end": "2017-09-11 22:00:33.333649", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "readlink -f /dev/vdb | egrep '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p)[0-9]{1,2}|fio[a-z]{1,2}[0-9]{1,2}$'", "_uses_shell": true, "chdir": null, "creates": null, "executable": null, "removes": null, "warn": true}}, "item": "/dev/vdb", "rc": 1, "start": "2017-09-11 22:00:33.328865", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, "/dev/vdb"], "rc": 1, "start": "2017-09-11 22:00:33.905098", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, null], "rc": 1, "start": "2017-09-11 22:00:35.947878", "stderr": "Error: Could not stat device mklabel - No such file or directory.", 
"stderr_lines": ["Error: Could not stat device mklabel - No such file or directory."], "stdout": "", "stdout_lines": []} ... 2017-09-11 22:00:35,524 p=24485 u=mistral | failed: [192.168.24.18] (item=[{'_ansible_parsed': True, 'stderr_lines': [], '_ansible_item_result': True, u'end': u'2017-09-11 22:00:34.313839', '_ansible_no_log': False, u'stdout': u'', u'cmd': u'parted --script /dev/vdc print > /dev/null 2>&1', u'rc': 1, 'item': [{'_ansible_parsed': True, 'stderr_lines': [], '_ansible_item_result': True, u'end': u'2017-09-11 22:00:33.696647', '_ansible_no_log': False, u'stdout': u'', u'cmd': u"readlink -f /dev/vdc | egrep '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p)[0-9]{1,2}|fio[a-z]{1,2}[0-9]{1,2}$'", u'rc': 1, 'item': u'/dev/vdc', u'delta': u'0:00:00.005318', u'stderr': u'', u'changed': False, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u"readlink -f /dev/vdc | egrep '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p)[0-9]{1,2}|fio[a-z]{1,2}[0-9]{1,2}$'", u'removes': None, u'creates': None, u'chdir': None}}, 'stdout_lines': [], 'failed_when_result': False, u'start': u'2017-09-11 22:00:33.691329', 'failed': False}, u'/dev/vdc'], u'delta': u'0:00:00.008033', u'stderr': u'', u'changed': False, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u'parted --script /dev/vdc print > /dev/null 2>&1', u'removes': None, u'creates': None, u'chdir': None}}, 'stdout_lines': [], 'failed_when_result': False, u'start': u'2017-09-11 22:00:34.305806', 'failed': False}, None]) => {"changed": false, "cmd": ["parted", "--script", "mklabel", "gpt"], "delta": "0:00:00.004037", "end": "2017-09-11 22:00:36.341092", "failed": true, "item": [{"_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "cmd": "parted --script /dev/vdc print > /dev/null 2>&1", "delta": "0:00:00.008033", "end": "2017-09-11 22:00:34.313839", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "parted --script /dev/vdc print > /dev/null 2>&1", "_uses_shell": true, "chdir": null, "creates": null, "executable": null, "removes": null, "warn": true}}, "item": [{"_ansible_item_result": true, "_ansible_no_log": false, "_ansible_parsed": true, "changed": false, "cmd": "readlink -f /dev/vdc | egrep '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p)[0-9]{1,2}|fio[a-z]{1,2}[0-9]{1,2}$'", "delta": "0:00:00.005318", "end": "2017-09-11 22:00:33.696647", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "readlink -f /dev/vdc | egrep '/dev/([hsv]d[a-z]{1,2}|cciss/c[0-9]d[0-9]p|nvme[0-9]n[0-9]p)[0-9]{1,2}|fio[a-z]{1,2}[0-9]{1,2}$'", "_uses_shell": true, "chdir": null, "creates": null, "executable": null, "removes": null, "warn": true}}, "item": "/dev/vdc", "rc": 1, "start": "2017-09-11 22:00:33.691329", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, "/dev/vdc"], "rc": 1, "start": "2017-09-11 22:00:34.305806", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, null], "rc": 1, "start": "2017-09-11 22:00:36.337055", "stderr": "Error: Could not stat device mklabel - No such file or directory.", "stderr_lines": ["Error: Could not stat device mklabel - No such file or directory."], "stdout": "", "stdout_lines": []} 2017-09-11 22:00:35,525 p=24485 u=mistral | RUNNING HANDLER [ceph-defaults : copy mon restart script] ********************** 
2017-09-11 22:00:35,525 p=24485 u=mistral | RUNNING HANDLER [ceph-defaults : restart ceph mon daemon(s)] *******************
2017-09-11 22:00:35,526 p=24485 u=mistral | RUNNING HANDLER [ceph-defaults : copy osd restart script] **********************
2017-09-11 22:00:35,526 p=24485 u=mistral | RUNNING HANDLER [ceph-defaults : restart containerized ceph osds daemon(s)] ****
2017-09-11 22:00:35,526 p=24485 u=mistral | RUNNING HANDLER [ceph-defaults : restart non-containerized ceph osds daemon(s)] ***
2017-09-11 22:00:35,526 p=24485 u=mistral | RUNNING HANDLER [ceph-defaults : copy mds restart script] **********************
2017-09-11 22:00:35,526 p=24485 u=mistral | RUNNING HANDLER [ceph-defaults : debug socket mds] *****************************
2017-09-11 22:00:35,527 p=24485 u=mistral | RUNNING HANDLER [ceph-defaults : restart ceph mds daemon(s)] *******************
2017-09-11 22:00:35,527 p=24485 u=mistral | RUNNING HANDLER [ceph-defaults : copy rgw restart script] **********************
2017-09-11 22:00:35,527 p=24485 u=mistral | RUNNING HANDLER [ceph-defaults : restart ceph rgw daemon(s)] *******************
2017-09-11 22:00:35,528 p=24485 u=mistral | PLAY RECAP *********************************************************************
2017-09-11 22:00:35,528 p=24485 u=mistral | 192.168.24.11 : ok=45 changed=4 unreachable=0 failed=1
2017-09-11 22:00:35,528 p=24485 u=mistral | 192.168.24.17 : ok=43 changed=3 unreachable=0 failed=1
2017-09-11 22:00:35,528 p=24485 u=mistral | 192.168.24.18 : ok=43 changed=3 unreachable=0 failed=1
2017-09-11 22:00:35,528 p=24485 u=mistral | 192.168.24.20 : ok=56 changed=8 unreachable=0 failed=0
2017-09-11 22:00:35,528 p=24485 u=mistral | 192.168.24.9 : ok=1 changed=0 unreachable=0 failed=0
The dedicated journal scenario has been fixed by the following: https://github.com/ceph/ceph-ansible/pull/1882
Environment: openstack-tripleo-heat-templates-7.0.0-0.20170913050524.0rc2.el7ost.noarch ceph-common-10.2.7-32.el7cp.x86_64 ceph-mon-10.2.7-32.el7cp.x86_64 libcephfs1-10.2.7-32.el7cp.x86_64 python-cephfs-10.2.7-32.el7cp.x86_64 ceph-base-10.2.7-32.el7cp.x86_64 ceph-radosgw-10.2.7-32.el7cp.x86_64 puppet-ceph-2.4.1-0.20170911230204.ebea4b7.el7ost.noarch ceph-mds-10.2.7-32.el7cp.x86_64 ceph-selinux-10.2.7-32.el7cp.x86_64 A deployment failed: Looking for errors in mistral, found these: ceph_disk.main.Error: Error: partition 2 for /dev/vdb does not appear to exist\", \"stderr_lines\": [\"command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid\", \"command: Running command: /usr/bin/ceph-osd --check-allows-journal -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph --setuser ceph --setgroup ceph\", \"command: Running command: /usr/bin/ceph-osd --check-wants-journal -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph --setuser ceph --setgroup ceph\", \"command: Running command: /usr/bin/ceph-osd --check-needs-journal -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph --setuser ceph --setgroup ceph\", \"get_dm_uuid: get_dm_uuid /dev/vdb uuid path is /sys/dev/block/252:16/dm/uuid\", \"set_type: Will colocate journal with data on /dev/vdb\", \"command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size\", \"get_dm_uuid: get_dm_uuid /dev/vdb uuid path is /sys/dev/block/252:16/dm/uuid\", \"get_dm_uuid: get_dm_uuid /dev/vdb uuid path is /sys/dev/block/252:16/dm/uuid\", \"get_dm_uuid: get_dm_uuid /dev/vdb uuid path is /sys/dev/block/252:16/dm/uuid\", \"command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_type\", \"command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_type\", \"command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs\", \"command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs\", \"command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs\", \"command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
--lookup osd_fs_mount_options_xfs\", \"get_dm_uuid: get_dm_uuid /dev/vdb uuid path is /sys/dev/block/252:16/dm/uuid\", \"get_dm_uuid: get_dm_uuid /dev/vdb uuid path is /sys/dev/block/252:16/dm/uuid\", \"ptype_tobe_for_name: name = journal\", \"get_dm_uuid: get_dm_uuid /dev/vdb uuid path is /sys/dev/block/252:16/dm/uuid\", \"create_partition: Creating journal partition num 2 size 5120 on /dev/vdb\", \"command_check_call: Running command: /usr/sbin/sgdisk --new=2:0:+5120M --change-name=2:ceph journal --partition-guid=2:6644fd1a-3cf4-4768-b01e-20bab7ece783 --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/vdb\", \"update_partition: Calling partprobe on created device /dev/vdb\", \"command_check_call: Running command: /usr/bin/udevadm settle --timeout=600\", \"command: Running command: /usr/bin/flock -s /dev/vdb /usr/sbin/partprobe /dev/vdb\", \"command_check_call: Running command: /usr/bin/udevadm settle --timeout=600\", \"get_dm_uuid: get_dm_uuid /dev/vdb uuid path is /sys/dev/block/252:16/dm/uuid\", \"Traceback (most recent call last):\", \" File \\\\\"/usr/sbin/ceph-disk\\\\\", line 9, in <module>\", \" load_entry_point(\\'ceph-disk==1.0.0\\', \\'console_scripts\\', \\'ceph-disk\\')()\", \" File \\\\\"/usr/lib/python2.7/site-packages/ceph_disk/main.py\\\\\", line 5326, in run\", \" main(sys.argv[1:])\", \" File \\\\\"/usr/lib/python2.7/site-packages/ceph_disk/main.py\\\\\", line 5277, in main\", \" args.func(args)\", \" File \\\\\"/usr/lib/python2.7/site-packages/ceph_disk/main.py\\\\\", line 1879, in main\", \" Prepare.factory(args).prepare()\", \" File \\\\\"/usr/lib/python2.7/site-packages/ceph_disk/main.py\\\\\", line 1868, in prepare\", \" self.prepare_locked()\", \" File \\\\\"/usr/lib/python2.7/site-packages/ceph_disk/main.py\\\\\", line 1899, in prepare_locked\", \" self.data.prepare(self.journal)\", \" File \\\\\"/usr/lib/python2.7/site-packages/ceph_disk/main.py\\\\\", line 2566, in prepare\", \" self.prepare_device(*to_prepare_list)\", \" File \\\\\"/usr/lib/python2.7/site-packages/ceph_disk/main.py\\\\\", line 2728, in prepare_device\", \" to_prepare.prepare()\", \" File \\\\\"/usr/lib/python2.7/site-packages/ceph_disk/main.py\\\\\", line 2070, in prepare\", \" self.prepare_device()\", \" File \\\\\"/usr/lib/python2.7/site-packages/ceph_disk/main.py\\\\\", line 2164, in prepare_device\", \" partition = device.get_partition(num)\", \" File \\\\\"/usr/lib/python2.7/site-packages/ceph_disk/main.py\\\\\", line 1644, in get_partition\", \" dev = get_partition_dev(self.path, num)\", \" File \\\\\"/usr/lib/python2.7/site-packages/ceph_disk/main.py\\\\\", line 670, in get_partition_dev\", \" (pnum, dev))\", \"ceph_disk.main.Error: Error: partition 2 for /dev/vdb does not appear to exist\"], \"stdout\": \"2017-09-19 20:15:02 /entrypoint.sh: static: does not generate config\\\ \'Error: /dev/vdb: unrecognised disk label\\', \'Error ENOENT: failed to find client.manila in keyring\\', \'Error ENOENT: failed to find client.openstack in keyring\\'], \'Error ENOENT: failed to find client.radosgw in keyring\\', \'Error response from daemon: No such container: ceph-osd-overcloud-cephstorage-1-devvdb\\'], \'Error response from daemon: No such container: ceph-osd-overcloud-cephstorage-2-devvdb\\'],
(In reply to Alexander Chuzhoy from comment #10)
...
> 670, in get_partition_dev\", \" (pnum, dev))\", \"ceph_disk.main.Error:
> Error: partition 2 for /dev/vdb does not appear to exist\"], \"stdout\":
> \"2017-09-19 20:15:02 /entrypoint.sh: static: does not generate config\\\

This might be https://bugzilla.redhat.com/show_bug.cgi?id=1491780

http://tracker.ceph.com/issues/19428

"At the first iteration, the sdb2 is missing while at the second one (1 sec after) the sdb2 showed up."

> \'Error: /dev/vdb: unrecognised disk label\\',
> \'Error ENOENT: failed to find client.manila in keyring\\',
> \'Error ENOENT: failed to find client.openstack in keyring\\'],
> \'Error ENOENT: failed to find client.radosgw in keyring\\',
> \'Error response from daemon: No such container:
> ceph-osd-overcloud-cephstorage-1-devvdb\\'],
> \'Error response from daemon: No such container:
> ceph-osd-overcloud-cephstorage-2-devvdb\\'],

However, the "/dev/vdb: unrecognised disk label" seems to be the other issue under this bug.

I did further testing using ceph-ansible directly from the master branch on the same system (sealusa3). I didn't hit the unrecognised disk label this time, but I hit the race condition again...

Ansible command: http://ix.io/A2
Ansible run: http://ix.io/A2R

It failed on osd node .12 on task:

2017-09-19 23:09:40,888 p=25857 u=mistral | TASK [ceph-osd : prepare ceph containerized osd disk collocated] ***************

On: Error: partition 2 for /dev/vdb does not appear to exist
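The timing can be seen by hand right after the journal partition is created; a rough sketch follows. The 'ceph_storage' group and /dev/vdb2 are placeholders, and this only demonstrates the delay ceph-disk trips over (the partition node showing up a moment after partprobe) -- it is not the fix.

# Sketch: after partprobe, the new partition node can take a moment to
# appear, which is what ceph-disk's get_partition_dev races against.
# Group name and device path are assumptions.
- hosts: ceph_storage
  become: true
  tasks:
    - name: re-read the partition table on the OSD disk
      command: partprobe /dev/vdb
    - name: wait for the journal partition node to appear
      wait_for:
        path: /dev/vdb2
        timeout: 10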
(In reply to John Fulton from comment #12)
> I did further testing using ceph-ansible directly from the master branch on
> the same system (sealusa3). I didn't hit the unrecognised disk label this
> time, but I hit the race condition again...
>
> Ansible command: http://ix.io/A2
> Ansible run: http://ix.io/A2R

Correction, that command is at:

Ansible command: http://ix.io/A2Q
I see two different issues in this PR:

Error: Could not stat device mklabel - No such file or directory.

Which makes me believe the device passed doesn't exist.

Then I see:

ceph_disk.main.Error: Error: partition 2 for /dev/vdb does not appear to exist

Which appears to be the race condition.

So either it's a NOTABUG or DUP. Please update the status, thanks.
The root cause of this seems to be a ceph-disk race condition as described in the following bug, so I am closing this bug as a duplicate of it:

https://bugzilla.redhat.com/show_bug.cgi?id=1494543

*** This bug has been marked as a duplicate of bug 1494543 ***
*** Bug 1507823 has been marked as a duplicate of this bug. ***