I. Description of problem:
When deploying a large number of OSDs (e.g. 12 per node across 4 nodes), all OSDs are prepared but not all of them are activated.
II. Version-Release number of selected component (if applicable):
overcloud image built with recent git checkout from review.openstack.org/openstack/puppet-ceph
[stack@hci-director puppet-ceph]$ git log | head
Merge: aa06a86 f49341b
Author: Jenkins <email@example.com>
Date: Tue Aug 23 14:36:41 2016 +0000
Merge "Manage all OSDs before managing pools."
Merge: b1af406 76b48dc
Author: Jenkins <firstname.lastname@example.org>
III. How reproducible:
I've reproduced this 6 times in a row on bare metal servers.
IV. Steps to Reproduce:
1. Establish working overcloud as described in tripleo.sh doc (or similar):
2. Modify overcloud nodes to have multiple block devices (e.g. 15)
3. Create Heat template environment assigning said block devices to OSDs (e.g. twelve for OSDs and three for OSD journals).
4. Deploy overcloud and include this Heat environment template
V. Actual results:
Not all OSDs listed in Heat under ceph::profile::params::osds: are active at the end of the deployment.
VI. Expected results:
All OSDs listed in Heat under ceph::profile::params::osds: are active at the end of the deployment
VII. Additional info:
1. When reproducing this it's possible to run into a separate ceph-disk race condition with partprobe described in https://github.com/ceph/ceph/commit/3d6d36a12bd4823352dc58e2135d03f261d18dbe, though fixing this problem does not eliminate this bug (I will follow up on this separate issue in a separate BZ for a separate project)
2. I conjecture that puppet-ceph may need a retry or it might need to wait before calling `ceph-disk activate` to ensure `ceph-disk prepare` is complete; possibly wait for udevadm to settle
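The retry-and-settle idea in item 2 could look roughly like the sketch below. The `retry` helper is hypothetical (it is not part of puppet-ceph or ceph-disk), and the attempt count, delay, timeout, and device name in the commented usage are placeholders, not values taken from this deployment.

```shell
# Hypothetical helper: retry <attempts> <delay_seconds> <command> [args...]
# Runs the command until it succeeds or the attempts are used up.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i/$attempts failed: $*" >&2
        # Do not sleep after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Intended use on an OSD node (assumed values, not executed here):
#   udevadm settle --timeout=60              # wait for pending udev events
#   retry 5 10 ceph-disk activate /dev/sdb1  # then retry activation
```

Whether the wait, the retry, or both are needed is exactly what is still unknown at this point, so the sketch keeps the two steps separate.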
Created attachment 1195418 [details]
TripleO preboot script to workaround ceph-disk partprobe race condition
Workaround for https://github.com/ceph/ceph/commit/3d6d36a12bd4823352dc58e2135d03f261d18dbe for those using http://buildlogs.centos.org/centos/7/storage/x86_64/ceph-jewel/
This is a backport of an existing fix
Created attachment 1195420 [details]
patch containing backport of existing fix from https://github.com/ceph/ceph/commit/3d6d36a12bd4823352dc58e2135d03f261d18dbe
NOT a patch for _this_ bug but for a separate bug. This patch is provided to help in reproducing this bug so that a separate issue is not conflated with this one.
I made this by updating /usr/lib/python2.7/site-packages/ceph_disk/main.py as provided by the following RPMs:
[root@overcloud-novacompute-2 ~]# rpm -qa | grep ceph
To have the changes described in the following:
(In reply to John Fulton from comment #0)
> VII. Additional info:
> 1. When reproducing this it's possible to run into a separate ceph-disk race
> condition with partprobe described in
> though fixing this problem does not eliminate this bug (I will follow up on
> this separate issue in a separate BZ for a separate project)
The first two attachments and comments 1 and 2 above are exclusively about the issue quoted above.
A patch for _this_ bug is still needed.
Created attachment 1197477 [details]
patch to osd.pp to fix reported problem when combined with other patch
This patch to osd.pp from puppet-ceph solves the problem in my env provided that I combine it with the patch containing the backport to ceph-disk I posted earlier.
I am going to hold off on sending this to review as I'd like to see if I can also work around the backport not being there; i.e. make osd.pp detect a `ceph-disk prepare` failure and handle it.
Created attachment 1198340 [details]
update to osd.pp.patch to not use udev for any Jewel until fix version is known
There are two bugs. The following patch to ceph-disk will solve ONE of them.
The other bug needs to be filed (I will do that next). Until that other bug is fixed, have osd.pp tell the install to not use udev. Thus, the osd.pp patch disables udev for 10.2.0 <= $version < X. X may not be 10.2.3. When the version with both fixes is known, osd.pp should be updated with a value for X.
todo: verify second bug with non-OpenStack Ceph install.
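As an illustration of the 10.2.0 <= $version < X test, here is a small shell sketch. It is not the osd.pp patch itself (that is Puppet code in the attachment); the function names are hypothetical, it relies on `sort -V` (available in GNU coreutils), and the upper bound 10.2.5 is only a stand-in for the still-unknown X.

```shell
# version_lt A B: succeeds when version A sorts strictly before version B.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# in_affected_range VERSION LOWER UPPER: succeeds when LOWER <= VERSION < UPPER.
in_affected_range() {
    ! version_lt "$1" "$2" && version_lt "$1" "$3"
}

# 10.2.5 is a placeholder for X; the real fix version is not known yet.
if in_affected_range "10.2.2" "10.2.0" "10.2.5"; then
    echo "affected version: do not rely on udev"
fi
```

A version-aware sort matters here because a plain string comparison would order 10.2.10 before 10.2.5.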
This has been hard to reproduce. Recent testing in the scale lab has shown similar symptoms when deploying a node with 36 OSDs. I will re-review logs from this testing after Ocata M3.
The problem we seem to have is that the execs in osd.pp go into the background, causing puppet to move on to the next resource when it should not.
I am attaching a reproducer which executes the same resource 10 times, dumping the execution timestamp into a set of files in /tmp. Each resource sleeps 2 seconds. The files are named date_background when the sleep goes into the background and date_nobackground when it does not. The execution times for the _nobackground resources advance by 2 seconds each, as expected, while the execution times for all 10 _background resources fall within the same second.
Created attachment 1239574 [details]
parallelism.pp reproducer demonstrating the background issue
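The same effect can be shown in plain shell. This is only an analogy, not the contents of parallelism.pp: `run_batch` is a hypothetical name, and it uses three 1-second sleeps instead of ten 2-second ones to keep the run short.

```shell
# run_batch MODE: run three 1-second sleeps, either blocking or backgrounded,
# and store the elapsed whole seconds in $elapsed.
run_batch() {
    start=$(date +%s)
    for n in 1 2 3; do
        if [ "$1" = "background" ]; then
            sleep 1 &   # returns at once; the caller moves on immediately
        else
            sleep 1     # blocks until the sleep has finished
        fi
    done
    end=$(date +%s)
    elapsed=$((end - start))
}

run_batch foreground
fg_elapsed=$elapsed
run_batch background
bg_elapsed=$elapsed
wait   # reap the backgrounded sleeps before exiting

echo "foreground: ${fg_elapsed}s, background: ${bg_elapsed}s"
```

The foreground batch takes about as long as the sleeps add up to, while the backgrounded batch reports almost no elapsed time even though its work is still running, which is the situation puppet is in when an exec returns early.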
I'm attempting a change in puppet-ceph which could mitigate this: https://review.openstack.org/#/c/434330/
Would be nice to test it at scale, using DeployArtifacts, to see if / how much it helps.
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.
Giulio, do you have the change in a puddle somewhere that could be used to deploy in the scale lab? Or would we have to patch the undercloud to try this? This patch looks like it might apply cleanly because it is so small. Does it apply cleanly to OSP10 by any chance?
Tim Wilkinson is working on a deployment right now to get HCI working on the supermicro 6048R servers in the scale lab, on a smaller scale than we did before. If this goes well, it would be at a much greater scale than what's in the original post, up to 36 drives/server x 9 servers = 324 OSDs. In theory the scale lab can deploy up to 30 servers x 36 drives, i.e. more than 1000 OSDs, this way (but someone has to put their request in the queue). For now, Tim's current cloud09 deployment would be enough to test this patch, right?
Yes, I want to try this patch with Tim. I'll jump on his undercloud to help set it up when the deploy is ready. I propose the following:
1. reproduce the bug (this should happen)
2. ping me and I'll SSH into your undercloud to set up deploy artifacts (or even first-boot if necessary) to have the patch applied before puppet-ceph runs
3. we'll observe the results of the patch and update the bug
There was an update requested on this:
- We have one change to test, which was merged for Ocata: https://review.openstack.org/#/c/434330/1/manifests/osd.pp
- A similar test could be conducted with backgrounding as per #8 but only if the first point doesn't help
- We can put time into reproducing with the scale lab team when they're ready
- We can put time into reproducing in a virtual env too
I am focussing on higher priorities this week but can return to this the week of 3/13 unless asked to do so earlier.
Testing in scale lab yesterday with OSP10 but also with deploy artifacts to use a newer version of puppet-ceph which included the following:
We were able to deploy without any issue 3 times in a row using 8 Ceph storage servers with 34 OSDs each.
I'm getting the same error when deploying the latest OSP10 repos with a hyperconverged setup (compute+ceph) with only one OSD per compute node:
Thanks, though this doesn't look like the same error. puppet-ceph exec'd the following shell commands:
Error: /bin/true # comment to satisfy puppet syntax requirements
if ! test -b /dev/nvme0n1 ; then
mkdir -p /dev/nvme0n1
if getent passwd ceph >/dev/null 2>&1; then
chown -h ceph:ceph /dev/nvme0n1
ceph-disk prepare --cluster-uuid d203beee-2208-11e7-9a51-525400fe01b8 /dev/nvme0n1
returned 1 instead of one of 
Does the device /dev/nvme0n1 exist on your system? If not then it shouldn't be in the Heat templates. The above looks a little more like this issue:
FYI: Neither this fix (no fixed-in flag yet) nor the above has landed in OSP10.
Do you want to open a new bug for this and provide your Heat templates and the output of lsblk on your overcloud nodes?
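Since the exec'd script above falls through to `mkdir -p` when the path is not a block device, it may help to check the device list on each node before deploying. A minimal sketch follows; `missing_block_devices` is a hypothetical helper and the device names in the usage comment are examples only.

```shell
# missing_block_devices PATH...: print every path that is not a block device,
# using the same `test -b` check that the exec'd script uses.
missing_block_devices() {
    for dev in "$@"; do
        if ! test -b "$dev"; then
            echo "$dev"
        fi
    done
}

# Example use on an overcloud node (device names are illustrative):
#   missing_block_devices /dev/nvme0n1 /dev/sdb
```

Any path it prints would be treated as a directory by that script, so it is probably either a typo in the Heat templates or a device that is absent on that node.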
verified in the scale lab
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.