Bug 1422191 - OSPD doesn't notify when it fails to create OSDs due to lack of disks in Ceph storage node
Summary: OSPD doesn't notify when it fails to create OSDs due to lack of disks in Ceph storage node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-ceph
Version: 11.0 (Ocata)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: Upstream M3
Target Release: 11.0 (Ocata)
Assignee: John Fulton
QA Contact: Yogev Rabl
Docs Contact: Derek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-14 16:55 UTC by Yogev Rabl
Modified: 2017-05-17 19:59 UTC
CC: 9 users

Fixed In Version: puppet-ceph-2.3.0-2.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-17 19:59:41 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1245 normal SHIPPED_LIVE Red Hat OpenStack Platform 11.0 Bug Fix and Enhancement Advisory 2017-05-17 23:01:50 UTC
Ceph Project Bug Tracker 18976 None None None 2017-02-17 18:11:58 UTC
OpenStack gerrit 435618 None None None 2017-02-17 22:36:14 UTC
Launchpad 1665697 None None None 2017-02-17 16:16:30 UTC

Description Yogev Rabl 2017-02-14 16:55:46 UTC
Description of problem:
OSPD didn't raise any error or warning when updating an Overcloud to increase the number of OSDs from 3 per node to 11, even though each Ceph storage node had only 9 disks available to run OSDs on.
The update ended successfully, though not all of the OSDs that were set in the environment file were active.

The environment file was set with 11 OSDs per node: 
  ExtraConfig:
    ceph::profile::params::osds:
     '/dev/vdb':
       journal:
     '/dev/vdc':
       journal:
     '/dev/vdd':
       journal:
     '/dev/vde':
       journal:
     '/dev/vdf':
       journal:
     '/dev/vdg':
       journal:
     '/dev/vdh':
       journal:
     '/dev/vdi':
       journal:
     '/dev/vdj':
       journal:
     '/dev/vdk':
       journal:
     '/dev/vdl':
       journal:
while only 9 disks (/dev/vdb through /dev/vdj) were available for the OSDs.

Version-Release number of selected component (if applicable):

openstack-tripleo-validations-5.3.1-0.20170125194508.6b928f1.el7ost.noarch
openstack-tripleo-common-5.7.1-0.20170126235054.c75d3c6.el7ost.noarch
puppet-tripleo-6.1.0-0.20170127040716.d427c2a.el7ost.noarch
openstack-tripleo-puppet-elements-6.0.0-0.20170126053436.688584c.el7ost.noarch
openstack-tripleo-0.0.8-0.2.4de13b3git.el7ost.noarch
openstack-tripleo-heat-templates-6.0.0-0.20170127041112.ce54697.el7ost.1.noarch
openstack-tripleo-ui-2.0.1-0.20170126144317.f3bd97e.el7ost.noarch
python-tripleoclient-6.0.1-0.20170127055753.8ea289c.el7ost.noarch
openstack-tripleo-image-elements-6.0.0-0.20170126135810.00b9869.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy an Overcloud with 3 OSDs on each Ceph storage node
2. Update the Overcloud with a new storage environment file that sets more OSDs than there are disks in the Ceph storage nodes.


Actual results:
The update of the Overcloud finished successfully.

Expected results:
The update should fail with an error stating that not all of the OSDs were initialized.

Additional info:

Comment 1 John Fulton 2017-02-17 14:23:37 UTC
We can add a test in puppet-ceph's osd.pp to make it fail if any of the OSDs on the list fail to be activated. Here's an example from another tool: 

https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-osd/tasks/activate_osds.yml#L61-L66

Users should specify an accurate list of the disks they want.

They can use something like the following:

 http://tripleo.org/advanced_deployment/node_specific_hieradata.html

or, if they have heterogeneous hardware:

 https://github.com/RHsyseng/hci/tree/master/other-scenarios/mixed-nodes

So the next step is to look at how this scenario is slipping by the following conditionals: 

https://github.com/openstack/puppet-ceph/blob/master/manifests/osd.pp#L201-L206
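The check described above (failing if a listed device cannot be activated) amounts to verifying that each configured path is really a block device before an OSD is created on it. A minimal sketch of that idea in Python — this is a hypothetical illustration, not the actual puppet-ceph code:

```python
import os
import stat

def check_osd_devices(configured):
    """Return the configured paths that are NOT usable block devices.

    A path that is absent, or that exists as a plain file or directory,
    causes ceph-disk to silently fall back to a directory-based OSD
    instead of the block-based OSD the user intended.
    """
    bad = []
    for dev in configured:
        try:
            mode = os.stat(dev).st_mode
        except FileNotFoundError:
            bad.append(dev)
            continue
        if not stat.S_ISBLK(mode):
            bad.append(dev)
    return bad
```

With the environment file from this bug, a check like this would flag /dev/vdk and /dev/vdl on a node that only has /dev/vdb through /dev/vdj, allowing the deployment to fail early instead of finishing "successfully" with missing OSDs.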

Comment 3 John Fulton 2017-02-17 16:26:40 UTC
What you get in this scenario is a working directory-based OSD, not the block-based OSD the user intended (and they did intend it if they passed /dev/foo along with a list of other block devices).

[root@osd ~]# ls -laF /dev/sdq
total 28
drwxr-xr-x.  3 ceph ceph  220 Feb 17 10:10 ./
drwxr-xr-x. 22 root root 3180 Feb 17 10:10 ../
-rw-r--r--.  1 root root  189 Feb 17 10:10 activate.monmap
-rw-r--r--.  1 ceph ceph   37 Feb 17 10:10 ceph_fsid
drwxr-xr-x.  3 ceph ceph   80 Feb 17 10:10 current/
-rw-r--r--.  1 ceph ceph   37 Feb 17 10:10 fsid
-rw-r--r--.  1 ceph ceph    0 Feb 17 10:10 journal
-rw-r--r--.  1 ceph ceph   21 Feb 17 10:10 magic
-rw-r--r--.  1 ceph ceph    4 Feb 17 10:10 store_version
-rw-r--r--.  1 ceph ceph   53 Feb 17 10:10 superblock
-rw-r--r--.  1 ceph ceph    2 Feb 17 10:10 whoami
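The telltale sign in the listing above is that /dev/sdq is a directory containing Ceph OSD metadata (ceph_fsid, whoami, superblock) rather than a device node. A hypothetical sketch of how one could scan a node for this failure mode (the function name and approach are illustrative, not part of any tool):

```python
import os

def find_directory_osds(dev_dir="/dev"):
    """List entries under dev_dir that are directories holding a Ceph
    OSD 'whoami' file -- the sign that ceph-disk created a
    directory-based OSD where a block device was expected.

    Legitimate directories under /dev (block/, disk/, mapper/, ...)
    are skipped because they contain no OSD metadata.
    """
    hits = []
    for name in sorted(os.listdir(dev_dir)):
        path = os.path.join(dev_dir, name)
        if os.path.isdir(path) and os.path.isfile(os.path.join(path, "whoami")):
            hits.append(path)
    return hits
```

Running something like this on the node from Comment 3 would report /dev/sdq, matching the `ls -laF` output shown above.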

Comment 4 John Fulton 2017-03-01 16:36:34 UTC
There was an update requested on this: 

- I have a proposed fix https://review.openstack.org/#/c/435618
- I just need to update the unit test so it can pass CI and merge
- I will get this done before the end of March so I can focus on some higher priority items.

Comment 5 John Fulton 2017-03-18 16:08:46 UTC
Update: The proposed upstream fix [1] passed CI and has received positive reviews so far.

[1]  https://review.openstack.org/#/c/435618/

Comment 6 John Fulton 2017-03-20 17:25:49 UTC
https://review.openstack.org/#/c/435618 has merged upstream.

Comment 10 Yogev Rabl 2017-04-18 15:14:53 UTC
Verified on puppet-ceph-2.3.0-4.el7ost.noarch

Comment 11 errata-xmlrpc 2017-05-17 19:59:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245

