Bug 1371218 - [puppet-ceph] When deploying a large number of OSDs not all OSDs are activated but all are prepared
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-ceph
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 11.0 (Ocata)
Assignee: Giulio Fidente
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks: 1413723
Reported: 2016-08-29 15:24 UTC by John Fulton
Modified: 2017-05-17 19:32 UTC (History)

Fixed In Version: puppet-ceph-2.3.0-2.el7ost.noarch
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-17 19:32:37 UTC


Attachments
TripleO preboot script to workaround ceph-disk partprobe race condition (1.04 KB, text/plain)
2016-08-29 15:34 UTC, John Fulton
patch containing backport of existing fix from https://github.com/ceph/ceph/commit/3d6d36a12bd4823352dc58e2135d03f261d18dbe (804 bytes, patch)
2016-08-29 15:41 UTC, John Fulton
patch to osd.pp to fix reported problem when combined with other patch (1.99 KB, patch)
2016-09-03 17:32 UTC, John Fulton
update to osd.pp.patch to not use udev for any Jewel until fix version is known (1.95 KB, patch)
2016-09-06 15:59 UTC, John Fulton
parallelism.pp reproduces the issue demonstrating the background issue (643 bytes, text/plain)
2017-01-11 18:40 UTC, Giulio Fidente


Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2017:1245 | normal | SHIPPED_LIVE | Red Hat OpenStack Platform 11.0 Bug Fix and Enhancement Advisory | 2017-05-17 23:01:50 UTC
OpenStack gerrit | 434330 | None | None | None | 2017-03-31 12:55:35 UTC

Description John Fulton 2016-08-29 15:24:44 UTC
I. Description of problem:

When deploying a large number of OSDs (e.g. 12 per node across 4 nodes), not all OSDs are activated, although all are prepared.

II. Version-Release number of selected component (if applicable):

overcloud image built with recent git checkout from review.openstack.org/openstack/puppet-ceph

[stack@hci-director puppet-ceph]$ git log | head 
commit 4e36628a02f0bce66aba3f1b97843f0cc44664ca
Merge: aa06a86 f49341b
Author: Jenkins <jenkins@review.openstack.org>
Date:   Tue Aug 23 14:36:41 2016 +0000

    Merge "Manage all OSDs before managing pools."

commit aa06a86361275eb2f4e3ce3d0e0cf6afe80b3f12
Merge: b1af406 76b48dc
Author: Jenkins <jenkins@review.openstack.org>
[stack@hci-director puppet-ceph]$ 

III. How reproducible:

I've reproduced this 6 times in a row on bare metal servers. 

IV. Steps to Reproduce:

1. Establish working overcloud as described in tripleo.sh doc (or similar): 
http://docs.openstack.org/developer/tripleo-docs/advanced_deployment/tripleo.sh.html
2. Modify overcloud nodes to have multiple block devices (e.g. 15)
3. Create a Heat environment template assigning these block devices to OSDs (e.g. twelve for OSDs and three for OSD journals) [1].
4. Deploy overcloud and include this Heat environment template

V. Actual results:

Not all OSDs listed in Heat under ceph::profile::params::osds: are active at the end of the deployment. 

VI. Expected results:

All OSDs listed in Heat under ceph::profile::params::osds: are active at the end of the deployment.

VII. Additional info:

1. When reproducing this it's possible to run into a separate ceph-disk race condition with partprobe described in https://github.com/ceph/ceph/commit/3d6d36a12bd4823352dc58e2135d03f261d18dbe, though fixing this problem does not eliminate this bug (I will follow up on this separate issue in a separate BZ for a separate project) 

2. I conjecture that puppet-ceph may need a retry or it might need to wait before calling `ceph-disk activate` to ensure `ceph-disk prepare` is complete; possibly wait for udevadm to settle
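
The conjecture in item 2 could be sketched roughly as follows. This is only an illustration in Python, not the puppet-ceph implementation: the helper names `run_with_retry` and `activate_after_settle`, the attempt count, and the delay are all assumptions. The only external commands used are `udevadm settle` and `ceph-disk activate`, both of which are named above.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=5.0, runner=subprocess.call):
    """Run cmd until it exits 0 or the attempts run out; return the last exit code."""
    rc = 1
    for attempt in range(1, attempts + 1):
        rc = runner(cmd)
        if rc == 0:
            return rc
        if attempt < attempts:
            # Give the preparation step and udev some time before retrying.
            time.sleep(delay)
    return rc


def activate_after_settle(data_partition, runner=subprocess.call):
    """Hypothetical helper: wait for udev, then activate with a few retries."""
    # Block until the udev event queue is empty (the "wait" in the conjecture).
    runner(["udevadm", "settle"])
    return run_with_retry(["ceph-disk", "activate", data_partition], runner=runner)
```

The `runner` parameter exists only so the retry logic can be exercised without a real Ceph node; on a real node the default `subprocess.call` would be used.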

Footnote:

[1]
~~~
parameter_defaults:
  CephStorageExtraConfig:
    ceph::profile::params::osds:
      '/dev/sda':
        journal: '/dev/sdm'
      '/dev/sdb':
        journal: '/dev/sdm'  
      '/dev/sdc':
        journal: '/dev/sdm'
      '/dev/sdd':
        journal: '/dev/sdm'
      '/dev/sde':
        journal: '/dev/sdn'
      '/dev/sdf':
        journal: '/dev/sdn'
      '/dev/sdg':
        journal: '/dev/sdn'
      '/dev/sdh':
        journal: '/dev/sdn'
      '/dev/sdi':
        journal: '/dev/sdo'
      '/dev/sdj':
        journal: '/dev/sdo'
      '/dev/sdk':
        journal: '/dev/sdo'
      '/dev/sdl':
        journal: '/dev/sdo'
~~~
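
For larger layouts, a mapping like the one in footnote [1] can be generated rather than typed by hand. The following is a minimal sketch, assuming the same layout as above (data disks lettered first, journal disks immediately after, data disks split evenly across journals); the function name `osd_map` is hypothetical.

```python
import string


def osd_map(data_count=12, journal_count=3):
    """Build {data_device: {"journal": journal_device}} as in footnote [1]."""
    if data_count % journal_count != 0:
        raise ValueError("data disks must divide evenly across journals")
    if data_count + journal_count > len(string.ascii_lowercase):
        raise ValueError("this sketch only handles /dev/sda../dev/sdz")
    letters = string.ascii_lowercase
    per_journal = data_count // journal_count
    # Journal disks follow the data disks: with 12 data disks they are sdm, sdn, sdo.
    journals = [f"/dev/sd{letters[data_count + j]}" for j in range(journal_count)]
    return {
        f"/dev/sd{letters[i]}": {"journal": journals[i // per_journal]}
        for i in range(data_count)
    }


if __name__ == "__main__":
    for device, settings in osd_map().items():
        print(device, "->", settings["journal"])
```

The resulting dictionary has the same shape as the value of ceph::profile::params::osds and could be serialized into the environment file.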

Comment 1 John Fulton 2016-08-29 15:34:48 UTC
Created attachment 1195418 [details]
TripleO preboot script to workaround ceph-disk partprobe race condition

Workaround for https://github.com/ceph/ceph/commit/3d6d36a12bd4823352dc58e2135d03f261d18dbe for those using http://buildlogs.centos.org/centos/7/storage/x86_64/ceph-jewel/

This is a backport of an existing fix

Comment 2 John Fulton 2016-08-29 15:41:25 UTC
Created attachment 1195420 [details]
patch containing backport of existing fix from https://github.com/ceph/ceph/commit/3d6d36a12bd4823352dc58e2135d03f261d18dbe

NOT a patch for _this_ bug but for a separate bug. This patch is provided to help in reproducing this bug so that a separate issue is not conflated with this one. 

I made this by updating /usr/lib/python2.7/site-packages/ceph_disk/main.py as provided by the following RPMs: 
 
[root@overcloud-novacompute-2 ~]# rpm -qa | grep ceph
ceph-base-10.2.2-0.el7.x86_64
ceph-mds-10.2.2-0.el7.x86_64
ceph-common-10.2.2-0.el7.x86_64
ceph-mon-10.2.2-0.el7.x86_64
ceph-10.2.2-0.el7.x86_64
python-cephfs-10.2.2-0.el7.x86_64
libcephfs1-10.2.2-0.el7.x86_64
ceph-selinux-10.2.2-0.el7.x86_64
ceph-osd-10.2.2-0.el7.x86_64
[root@overcloud-novacompute-2 ~]# 

To have the changes described in the following: 

https://github.com/ceph/ceph/commit/3d6d36a12bd4823352dc58e2135d03f261d18dbe

Comment 3 John Fulton 2016-08-29 15:43:49 UTC
(In reply to John Fulton from comment #0)
> VII. Additional info:
> 
> 1. When reproducing this it's possible to run into a separate ceph-disk race
> condition with partprobe described in
> https://github.com/ceph/ceph/commit/3d6d36a12bd4823352dc58e2135d03f261d18dbe,
> though fixing this problem does not eliminate this bug (I will follow up on
> this separate issue in a separate BZ for a separate project) 

The first two attachments and comments 1 and 2 above are exclusively about the issue quoted above. 

A patch for _this_ bug is still needed.

Comment 5 John Fulton 2016-09-03 17:32:26 UTC
Created attachment 1197477 [details]
patch to osd.pp to fix reported problem when combined with other patch

This patch to osd.pp from puppet-ceph solves the problem in my env provided that I combine it with the patch containing the backport to ceph-disk I posted earlier. 

I am going to hold off on sending this to review as I'd like to see if I can also work around the absence of the backport, i.e. make osd.pp detect a `ceph-disk prepare` failure and handle it.

Comment 6 John Fulton 2016-09-06 15:59:18 UTC
Created attachment 1198340 [details]
update to osd.pp.patch to not use udev for any Jewel until fix version is known

There are two bugs. The following patch to ceph-disk will solve ONE of them. 

 https://github.com/ceph/ceph/commit/3d6d36a12bd4823352dc58e2135d03f261d18dbe

The other bug needs to be filed (I will do that next). Until that other bug is fixed, have osd.pp tell the install not to use udev. Thus, the osd.pp patch disables udev for 10.2.0 <= $version < X. X may not be 10.2.3. When the version with both fixes is known, osd.pp should be updated with a value for X.
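
The version window described here could be expressed as in the sketch below. It is written in Python for illustration only (the real check lives in osd.pp), and the names `parse_version`, `should_disable_udev`, and `fixed_in` are assumptions. Since X is not yet known, `fixed_in` defaults to `None`, meaning "no upper bound yet".

```python
def parse_version(text):
    """Turn '10.2.2' into (10, 2, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def should_disable_udev(version, fixed_in=None):
    """True when 10.2.0 <= version < X, where X is fixed_in.

    fixed_in=None models "X is not known yet": every version from
    10.2.0 upward is then treated as affected.
    """
    parsed = parse_version(version)
    if parsed < (10, 2, 0):
        return False
    if fixed_in is None:
        return True
    return parsed < parse_version(fixed_in)
```

Comparing integer tuples rather than strings avoids ordering mistakes such as "10.2.10" sorting before "10.2.9".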

todo: verify second bug with non-OpenStack Ceph install.

Comment 7 John Fulton 2017-01-11 14:01:32 UTC
This has been hard to reproduce. Recent testing in the scale lab has shown similar symptoms when deploying a node with 36 OSDs. I will re-review logs from this testing after Ocata M3.

Comment 8 Giulio Fidente 2017-01-11 18:39:05 UTC
The problem we seem to have is that the execs in osd.pp go into the background, causing puppet to move on to the next resource when it should not.

I am attaching a reproducer which executes the same resource 10 times, dumping the execution timestamp into a set of files in /tmp. Each resource sleeps 2 seconds; the files are named date_background when the sleep goes into the background and date_nobackground when it does not. The execution times of the _nobackground resources are 2 seconds apart, as expected, while the execution times of the _background resources all fall within the same second.
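
The same effect can be shown outside Puppet. The sketch below is not the attached parallelism.pp; it is a hypothetical Python analogue that runs a shell command several times in sequence and records when each run started. The function name `start_times`, the 1-second sleep, and the run count are assumptions chosen to keep the example short.

```python
import subprocess
import time


def start_times(command, runs=3):
    """Run `command` via sh `runs` times in sequence; return each start time."""
    started = []
    for _ in range(runs):
        started.append(time.monotonic())
        # run() returns as soon as the shell exits. If the command was
        # backgrounded with '&', that happens before the sleep finishes.
        subprocess.run(["sh", "-c", command], check=True)
    return started


foreground = start_times("sleep 1")
background = start_times("sleep 1 >/dev/null 2>&1 &")

fg_spread = foreground[-1] - foreground[0]
bg_spread = background[-1] - background[0]
print(f"foreground runs started over {fg_spread:.2f}s")
print(f"background runs started over {bg_spread:.2f}s")
```

The foreground runs start about one second apart, while the backgrounded runs all start almost immediately after one another, because the caller only waits for the shell and not for the work the shell left behind.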

Comment 9 Giulio Fidente 2017-01-11 18:40:34 UTC
Created attachment 1239574 [details]
parallelism.pp reproduces the issue demonstrating the background issue

Comment 15 Giulio Fidente 2017-02-15 16:02:32 UTC
I'm attempting a change in puppet-ceph which could mitigate this: https://review.openstack.org/#/c/434330/

Would be nice to test it at scale, using DeployArtifacts, to see if / how much it helps.

Comment 16 Red Hat Bugzilla Rules Engine 2017-02-15 16:02:37 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 17 Ben England 2017-02-15 16:20:16 UTC
Giulio, do you have the change in a puddle somewhere that could be used to deploy in the scale lab?  Or would we have to patch the undercloud to try this?  This patch looks like it might apply cleanly because it is so small.  Does it apply cleanly to OSP10 by any chance?

Tim Wilkinson is working on a deployment right now to get HCI working on the Supermicro 6048R servers in the scale lab, on a smaller scale than we did before.  If this goes well, it would be at a much greater scale than what's in the original post, up to 36 drives/server x 9 servers = 324 OSDs.  The scale lab can deploy up to 30 x 36 drives = 1080 OSDs (> 1000) in theory this way (but someone has to put their request in the queue).  For now, Tim's current cloud09 deployment would be enough to test this patch, right?

Comment 18 John Fulton 2017-02-15 16:24:38 UTC
Ben,

Yes, I want to try this patch with Tim. I'll jump on his undercloud to help set it up when the deploy is ready. I propose the following: 

1. Reproduce the bug (this should happen)
2. Ping me and I'll SSH into your undercloud to set up deploy artifacts (or even first-boot if necessary) so that the patch is applied before puppet-ceph runs
3. We'll observe the results of the patch and update the bug

  John

Comment 19 John Fulton 2017-03-01 16:45:48 UTC
There was an update requested on this: 

- We have one change to test, which was merged for Ocata: https://review.openstack.org/#/c/434330/1/manifests/osd.pp
- A similar test could be conducted with backgrounding as per comment #8, but only if the first point doesn't help
- We can put time into reproducing with the scale lab team when they're ready
- We can put time into reproducing in a virtual env too

I am focussing on higher priorities this week but can return to this the week of 3/13 unless asked to do so earlier.

Comment 22 John Fulton 2017-03-31 13:15:56 UTC
We tested in the scale lab yesterday with OSP10, using deploy artifacts to pull in a newer version of puppet-ceph which included the following:

 https://review.openstack.org/#/c/434330/

We were able to deploy without any issue 3 times in a row using 8 Ceph storage servers with 34 OSDs each.

Comment 23 Pablo Sanchez 2017-04-25 15:41:08 UTC
I'm getting the same error deploying the latest OSP10 repos with a hyperconverged setup (compute+ceph) with only one OSD per compute node:

http://chunk.io/f/86ce7a96161443dfb97d541edc0a62f5

Comment 24 John Fulton 2017-04-25 16:01:13 UTC
Pablo,

Thanks, though this doesn't look like the same error. puppet-ceph exec'd the following shell commands: 

   Error: /bin/true # comment to satisfy puppet syntax requirements
    set -ex
    if ! test -b /dev/nvme0n1 ; then
        mkdir -p /dev/nvme0n1
        if getent passwd ceph >/dev/null 2>&1; then
            chown -h ceph:ceph /dev/nvme0n1
        fi
    fi
    ceph-disk prepare  --cluster-uuid d203beee-2208-11e7-9a51-525400fe01b8 /dev/nvme0n1 
    udevadm settle
     returned 1 instead of one of [0]

Does the device /dev/nvme0n1 exist on your system? If not then it shouldn't be in the Heat templates. The above looks a little more like this issue: 

 https://bugzilla.redhat.com/show_bug.cgi?id=1422191

FYI: Neither this fix (no fixed-in flag yet) nor the above has landed in OSP10.

Do you want to open a new bug for this and provide your Heat templates and the output of lsblk on your overcloud nodes?
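
A quick preflight check along the lines of the `test -b` in the script above can show whether each configured OSD path is really a block device before deploying. This is a minimal sketch meant to be run on an overcloud node; the function names `classify` and `preflight` are hypothetical and it is not part of puppet-ceph.

```python
import os
import stat


def classify(path):
    """Return 'block', 'directory', 'missing', or 'other' for a path."""
    try:
        mode = os.stat(path).st_mode
    except (FileNotFoundError, NotADirectoryError):
        return "missing"
    if stat.S_ISBLK(mode):
        return "block"
    if stat.S_ISDIR(mode):
        return "directory"
    return "other"


def preflight(osd_paths):
    """Map every configured OSD path to what actually exists on this node."""
    return {path: classify(path) for path in osd_paths}


if __name__ == "__main__":
    # /dev/nvme0n1 is the device from the error output above.
    for path, kind in preflight(["/dev/nvme0n1"]).items():
        print(f"{path}: {kind}")
```

A path reported as "missing" would take the `mkdir -p` branch of the script shown above, which is worth ruling out before looking for a race condition.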

Comment 27 Yogev Rabl 2017-05-12 14:56:54 UTC
Verified in the scale lab.

Comment 28 errata-xmlrpc 2017-05-17 19:32:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245

