Bug 1422191

Summary: OSPD doesn't notify when it fails to create OSDs due to lack of disks in Ceph storage node
Product: Red Hat OpenStack
Reporter: Yogev Rabl <yrabl>
Component: puppet-ceph
Assignee: John Fulton <johfulto>
Status: CLOSED ERRATA
QA Contact: Yogev Rabl <yrabl>
Severity: high
Docs Contact: Derek <dcadzow>
Priority: medium
Version: 11.0 (Ocata)
CC: gfidente, jjoyce, johfulto, jomurphy, jschluet, mburns, rhel-osp-director-maint, slinaber, tvignaud
Target Milestone: Upstream M3
Target Release: 11.0 (Ocata)
Hardware: x86_64
OS: Linux
Fixed In Version: puppet-ceph-2.3.0-2.el7ost
Doc Type: If docs needed, set a value
Last Closed: 2017-05-17 19:59:41 UTC
Type: Bug

Description Yogev Rabl 2017-02-14 16:55:46 UTC
Description of problem:
OSPD didn't raise any error or warning when updating an Overcloud to increase the number of OSDs on each Ceph storage node from 3 to 11, even though each node had only 9 disks available to run OSDs on.
The update ended successfully, although not all of the OSDs defined in the environment file were active.

The environment file was set with 11 OSDs per node: 
  ExtraConfig:
    ceph::profile::params::osds:
     '/dev/vdb':
       journal:
     '/dev/vdc':
       journal:
     '/dev/vdd':
       journal:
     '/dev/vde':
       journal:
     '/dev/vdf':
       journal:
     '/dev/vdg':
       journal:
     '/dev/vdh':
       journal:
     '/dev/vdi':
       journal:
     '/dev/vdj':
       journal:
     '/dev/vdk':
       journal:
     '/dev/vdl':
       journal:
while only 9 disks (/dev/vdb through /dev/vdj) were available for OSDs on each node.

Version-Release number of selected component (if applicable):

openstack-tripleo-validations-5.3.1-0.20170125194508.6b928f1.el7ost.noarch
openstack-tripleo-common-5.7.1-0.20170126235054.c75d3c6.el7ost.noarch
puppet-tripleo-6.1.0-0.20170127040716.d427c2a.el7ost.noarch
openstack-tripleo-puppet-elements-6.0.0-0.20170126053436.688584c.el7ost.noarch
openstack-tripleo-0.0.8-0.2.4de13b3git.el7ost.noarch
openstack-tripleo-heat-templates-6.0.0-0.20170127041112.ce54697.el7ost.1.noarch
openstack-tripleo-ui-2.0.1-0.20170126144317.f3bd97e.el7ost.noarch
python-tripleoclient-6.0.1-0.20170127055753.8ea289c.el7ost.noarch
openstack-tripleo-image-elements-6.0.0-0.20170126135810.00b9869.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy an Overcloud with 3 OSDs on each Ceph storage node
2. Update the Overcloud with a new storage environment file that sets more OSDs than there are disks in the Ceph storage nodes.


Actual results:
The update of the Overcloud finished successfully.

Expected results:
The update fails with an error indicating that not all of the OSDs were initialized.

Additional info:

Comment 1 John Fulton 2017-02-17 14:23:37 UTC
We can add a check to puppet-ceph's osd.pp so that it fails if any of the OSDs in the list fails to be activated. Here's an example from another tool: 

https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-osd/tasks/activate_osds.yml#L61-L66

Users should specify an accurate list of the disks they want. If they have heterogeneous hardware, they can use something like the following:

 http://tripleo.org/advanced_deployment/node_specific_hieradata.html

or even:

 https://github.com/RHsyseng/hci/tree/master/other-scenarios/mixed-nodes

So the next step is to look at how this scenario is slipping by the following conditionals: 

https://github.com/openstack/puppet-ceph/blob/master/manifests/osd.pp#L201-L206
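
For illustration only, the kind of guard described above could look roughly like the sketch below; the defined type name (osd_device_guard) and the resource titles are hypothetical, not actual puppet-ceph code or the eventual fix:

 # Sketch only: fail the Puppet run when an entry that was given as a
 # /dev/... path is not actually a block device on the node, instead of
 # letting a directory-based OSD be created there silently.
 define osd_device_guard () {
   # Only guard entries given as device paths; directory-backed OSDs
   # configured on purpose are left alone.
   if $name =~ /^\/dev\// {
     exec { "assert ${name} is a block device":
       # 'test -b' exits non-zero when the path is missing or is not a
       # block device, which fails the Puppet run and, with it, the
       # overcloud update.
       command => "test -b ${name}",
       path    => ['/bin', '/usr/bin'],
     }
   }
 }

 # Example usage with the devices from this report:
 # osd_device_guard { ['/dev/vdj', '/dev/vdk', '/dev/vdl']: }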

Comment 3 John Fulton 2017-02-17 16:26:40 UTC
What you get in this scenario is a working directory-based OSD, not the block-device-based OSD the user intended (and they did intend a block device if they passed /dev/foo along with a list of other block devices). 

[root@osd ~]# ls -laF /dev/sdq
total 28
drwxr-xr-x.  3 ceph ceph  220 Feb 17 10:10 ./
drwxr-xr-x. 22 root root 3180 Feb 17 10:10 ../
-rw-r--r--.  1 root root  189 Feb 17 10:10 activate.monmap
-rw-r--r--.  1 ceph ceph   37 Feb 17 10:10 ceph_fsid
drwxr-xr-x.  3 ceph ceph   80 Feb 17 10:10 current/
-rw-r--r--.  1 ceph ceph   37 Feb 17 10:10 fsid
-rw-r--r--.  1 ceph ceph    0 Feb 17 10:10 journal
-rw-r--r--.  1 ceph ceph   21 Feb 17 10:10 magic
-rw-r--r--.  1 ceph ceph    4 Feb 17 10:10 store_version
-rw-r--r--.  1 ceph ceph   53 Feb 17 10:10 superblock
-rw-r--r--.  1 ceph ceph    2 Feb 17 10:10 whoami

Comment 4 John Fulton 2017-03-01 16:36:34 UTC
There was an update requested on this: 

- I have a proposed fix: https://review.openstack.org/#/c/435618
- I just need to update the unit test so it can pass CI and merge.
- I will get this done before the end of March so I can focus on some higher-priority items.

Comment 5 John Fulton 2017-03-18 16:08:46 UTC
Update: the proposed upstream fix [1] has passed CI and has received positive reviews so far. 

[1]  https://review.openstack.org/#/c/435618/

Comment 6 John Fulton 2017-03-20 17:25:49 UTC
https://review.openstack.org/#/c/435618 has merged upstream.

Comment 10 Yogev Rabl 2017-04-18 15:14:53 UTC
Verified on puppet-ceph-2.3.0-4.el7ost.noarch.

Comment 11 errata-xmlrpc 2017-05-17 19:59:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245