Description of problem:
OSPD didn't raise any error or warning when updating an Overcloud to increase the number of OSDs from 3 per node to 11, while each Ceph storage node had only 9 disks available to run OSDs on. The update ended successfully, even though not all of the OSDs that were set in the environment file were active.

The environment file was set with 11 OSDs per node:

ExtraConfig:
  ceph::profile::params::osds:
    '/dev/vdb':
      journal:
    '/dev/vdc':
      journal:
    '/dev/vdd':
      journal:
    '/dev/vde':
      journal:
    '/dev/vdf':
      journal:
    '/dev/vdg':
      journal:
    '/dev/vdh':
      journal:
    '/dev/vdi':
      journal:
    '/dev/vdj':
      journal:
    '/dev/vdk':
      journal:
    '/dev/vdl':
      journal:

while only 9 disks (/dev/vdb-/dev/vdj) were available for the OSDs.

Version-Release number of selected component (if applicable):
openstack-tripleo-validations-5.3.1-0.20170125194508.6b928f1.el7ost.noarch
openstack-tripleo-common-5.7.1-0.20170126235054.c75d3c6.el7ost.noarch
puppet-tripleo-6.1.0-0.20170127040716.d427c2a.el7ost.noarch
openstack-tripleo-puppet-elements-6.0.0-0.20170126053436.688584c.el7ost.noarch
openstack-tripleo-0.0.8-0.2.4de13b3git.el7ost.noarch
openstack-tripleo-heat-templates-6.0.0-0.20170127041112.ce54697.el7ost.1.noarch
openstack-tripleo-ui-2.0.1-0.20170126144317.f3bd97e.el7ost.noarch
python-tripleoclient-6.0.1-0.20170127055753.8ea289c.el7ost.noarch
openstack-tripleo-image-elements-6.0.0-0.20170126135810.00b9869.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an Overcloud with 3 OSDs on each Ceph storage node
2. Update the Overcloud with a new storage environment file that sets more OSDs than there are disks in the Ceph storage nodes

Actual results:
The update of the Overcloud finished successfully.

Expected results:
The update fails with an error that not all of the OSDs were initialized.

Additional info:
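The expected behavior above could be enforced with a post-update check that compares the configured OSD count against what the cluster reports. A minimal sketch, assuming a `ceph osd stat` output line shaped like "11 osds: 9 up, 9 in" (the exact format varies across Ceph releases); `check_osd_count` is a hypothetical helper name, not anything in tripleo:

```shell
# Hedged sketch: fail when fewer OSDs are up than the deployment configured.
# usage: check_osd_count 11 "$(ceph osd stat)"
check_osd_count() {
  expected=$1
  stat_line=$2                      # e.g. "e42: 11 osds: 9 up, 9 in"
  up=$(printf '%s\n' "$stat_line" | sed -n 's/.* \([0-9][0-9]*\) up.*/\1/p')
  if [ -z "$up" ]; then
    echo "ERROR: could not parse OSD stat line: $stat_line" >&2
    return 2
  fi
  if [ "$up" -lt "$expected" ]; then
    echo "ERROR: only $up of $expected configured OSDs are up" >&2
    return 1
  fi
  return 0
}
```

In this bug's scenario the check would have reported 9 of 11 OSDs up and failed the update instead of letting it finish successfully.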
We can add a test in puppet-ceph's osd.pp to make it fail if any of the OSDs on the list fails to be activated. Here's an example from another tool:
https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-osd/tasks/activate_osds.yml#L61-L66

Users should specify an accurate list of the disks they want. They can use something like the following:
http://tripleo.org/advanced_deployment/node_specific_hieradata.html
or even:
https://github.com/RHsyseng/hci/tree/master/other-scenarios/mixed-nodes
if they have heterogeneous hardware.

So the next step is to look at how this scenario is slipping by the following conditionals:
https://github.com/openstack/puppet-ceph/blob/master/manifests/osd.pp#L201-L206
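The kind of guard osd.pp is missing can be sketched in shell (this is an illustration, not the actual puppet-ceph code): before preparing each configured device, verify it is a real block device and fail hard otherwise, rather than letting ceph-disk silently fall through. `check_osd_devices` is a hypothetical helper name:

```shell
# Hedged sketch: hard-fail when a configured OSD entry that looks like a
# device path is not actually an existing block device on this node.
# usage: check_osd_devices /dev/vdb /dev/vdc ... /dev/vdl
check_osd_devices() {
  for dev in "$@"; do
    if [ ! -b "$dev" ]; then
      echo "ERROR: configured OSD device $dev is not a block device" >&2
      return 1
    fi
  done
  return 0
}
```

With the environment file from this report, such a check would have rejected /dev/vdk and /dev/vdl on nodes that only carry disks up to /dev/vdj.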
What you get in this scenario is a working directory-based OSD, not the block-based OSD the user intended (and they did intend it, since they passed /dev/foo along with a list of other block devices).

[root@osd ~]# ls -laF /dev/sdq
total 28
drwxr-xr-x.  3 ceph ceph  220 Feb 17 10:10 ./
drwxr-xr-x. 22 root root 3180 Feb 17 10:10 ../
-rw-r--r--.  1 root root  189 Feb 17 10:10 activate.monmap
-rw-r--r--.  1 ceph ceph   37 Feb 17 10:10 ceph_fsid
drwxr-xr-x.  3 ceph ceph   80 Feb 17 10:10 current/
-rw-r--r--.  1 ceph ceph   37 Feb 17 10:10 fsid
-rw-r--r--.  1 ceph ceph    0 Feb 17 10:10 journal
-rw-r--r--.  1 ceph ceph   21 Feb 17 10:10 magic
-rw-r--r--.  1 ceph ceph    4 Feb 17 10:10 store_version
-rw-r--r--.  1 ceph ceph   53 Feb 17 10:10 superblock
-rw-r--r--.  1 ceph ceph    2 Feb 17 10:10 whoami
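The failure mode above is easy to detect after the fact: the configured path ends up as a plain directory instead of a block device. A minimal diagnostic sketch (assumed helper name `osd_path_type`, not part of any shipped tool):

```shell
# Hedged sketch: classify what actually exists at a configured OSD path.
osd_path_type() {
  if [ -b "$1" ]; then
    echo "block"        # a real block device, as the user intended
  elif [ -d "$1" ]; then
    echo "directory"    # the bug: a directory-based OSD created at a /dev path
  else
    echo "missing"      # the device never existed on this node
  fi
}
# usage: osd_path_type /dev/sdq   -> prints "directory" in the broken state above
```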
There was an update requested on this:
- I have a proposed fix: https://review.openstack.org/#/c/435618
- I just need to update the unit test so it can pass CI and merge
- I will get this done before the end of March so I can focus on some higher-priority items
Update: the proposed upstream fix [1] passed CI and has received positive reviews so far.

[1] https://review.openstack.org/#/c/435618/
https://review.openstack.org/#/c/435618 has merged upstream.
verified on puppet-ceph-2.3.0-4.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1245