Bug 1491780 - ceph: Racing between partition creation & device node creation
Status: CLOSED ERRATA
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Disk
Version: 2.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 2.4
Assigned To: Kefu Chai
QA Contact: Warren
Docs Contact: Bara Ancincova
Keywords: Triaged
Depends On:
Blocks: 1479701 1496985
 
Reported: 2017-09-14 12:08 EDT by Alexander Chuzhoy
Modified: 2018-06-26 19:46 EDT
CC List: 14 users

See Also:
Fixed In Version: RHEL: ceph-10.2.7-34.el7cp Ubuntu: ceph_10.2.7-35redhat1
Doc Type: Bug Fix
Doc Text:
.`ceph-disk` retries up to ten times to find the files that represent newly created OSD partitions
When deploying a new OSD with the `ceph-ansible` playbook, the file under the `/sys/` directory that represents a newly created OSD partition could fail to show up right after the `partprobe` utility returned. Consequently, the `ceph-disk` utility failed to activate the OSD, and `ceph-ansible` could not deploy the OSD successfully. With this update, if `ceph-disk` cannot find the file, it retries up to ten times to find it before terminating. As a result, `ceph-disk` can activate the newly prepared OSD as expected. (See the retry sketch after the metadata fields below.)
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-10-17 14:12:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
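A minimal sketch of the retry pattern described in the Doc Text above. The helper name, example path, retry delay, and error message are illustrative assumptions, not the actual ceph-disk code:

import os
import time

def wait_for_sys_entry(sys_path, retries=10, delay=1.0):
    # Poll for a /sys entry that udev creates asynchronously after
    # partprobe returns; give up after `retries` attempts.
    for _ in range(retries):
        if os.path.exists(sys_path):
            return True
        time.sleep(delay)
    return False

# Hypothetical usage: wait for the first partition of /dev/vdb to appear.
if not wait_for_sys_entry('/sys/block/vdb/vdb1'):
    raise RuntimeError('partition entry never appeared; cannot activate the OSD')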


Attachments


External Trackers
Ceph Project Bug Tracker 19428 (priority: None, status: None, last updated 2017-09-14 12:09 EDT)
Red Hat Product Errata RHBA-2017:2903 (priority: normal, status: SHIPPED_LIVE): Red Hat Ceph Storage 2.4 enhancement and bug fix update (last updated 2017-10-17 18:12:30 EDT)

Description Alexander Chuzhoy 2017-09-14 12:08:27 EDT
ceph: Racing between partition creation & device node creation

Environment:
python-cephfs-10.2.7-32.el7cp.x86_64
libcephfs1-10.2.7-32.el7cp.x86_64
ceph-common-10.2.7-32.el7cp.x86_64
ceph-mon-10.2.7-32.el7cp.x86_64
ceph-radosgw-10.2.7-32.el7cp.x86_64
puppet-ceph-2.4.1-0.20170831071705.df3ed30.el7ost.noarch
ceph-selinux-10.2.7-32.el7cp.x86_64
ceph-mds-10.2.7-32.el7cp.x86_64
ceph-base-10.2.7-32.el7cp.x86_64

instack-undercloud-7.3.1-0.20170830213703.el7ost.noarch
puppet-ceph-2.4.1-0.20170831071705.df3ed30.el7ost.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170901051303.0rc1.el7ost.noarch
ceph-ansible-3.0.0-0.1.rc6.el7cp.noarch
openstack-puppet-modules-11.0.0-0.20170828113154.el7ost.noarch


Not all partitions are created on the OSD nodes during overcloud (OC) deployment with Ceph, due to a race between partition creation and device node creation.


Steps to reproduce:
1. Apply the missing patch https://review.openstack.org/#/c/501983/3/docker/services/ceph-ansible/ceph-osd.yaml

2. Deploy OC

Result:
The deployment fails during the ceph-ansible phase.

Workaround:
Re-run the deployment command over the failed one.
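To confirm the race on a suspect node, a small diagnostic along these lines can measure how long a partition's device node takes to appear after partprobe returns. The disk name and the 10-second cutoff are assumptions for illustration; run it as root against a scratch disk that already has a first partition:

import os
import subprocess
import time

DISK = '/dev/vdb'   # assumed scratch disk; adjust for the node under test
PART = DISK + '1'

# Re-read the partition table; ceph-disk expected the partition's device
# node to exist as soon as this call returns.
subprocess.check_call(['partprobe', DISK])

start = time.time()
while not os.path.exists(PART):
    if time.time() - start > 10:
        print('device node still missing after 10s')
        break
    time.sleep(0.1)
else:
    print('device node appeared after %.2fs' % (time.time() - start))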
Comment 1 John Fulton 2017-09-14 12:18:59 EDT
Background on this is below. A question we should verify:

 Does the ceph-disk that is being used in the puddle have the fix in tracker.ceph.com/issues/19428? 


---------- Forwarded message ----------
From: John Fulton <johfulto@redhat.com>
Date: Wed, Sep 13, 2017 at 6:09 PM
Subject: ceph-disk race condition
To: Sasha Chuzhoy <sasha@redhat.com>
Cc: Giulio Fidente <gfidente@redhat.com>, Sebastien Han <shan@redhat.com>

Hi Sasha,

Following up our conversation today.

You used ceph-ansible-3.0.0-0.1.rc6.el7cp.noarch and manually applied
the following patch to your THT, which is the right thing to do, since
that ceph-ansible version requires it.

https://review.openstack.org/#/c/501983/3/docker/services/ceph-ansible/ceph-osd.yaml

You then deployed 3 ceph-storage nodes in your overcloud. 2 succeeded and 1
failed. The one that failed had the following error:

 http://sprunge.us/fMGj

The above error is a ceph-disk race condition which was fixed in
a newer version of Ceph:

 http://tracker.ceph.com/issues/19428
Comment 2 John Fulton 2017-09-14 12:21:40 EDT
Sasha,

Which version of ceph-disk are you using, as provided by the ceph-osd package on your Ceph Storage node? Here is an example of it on my upstream system:

[root@overcloud-cephstorage-0 ~]# yum whatprovides */ceph-disk
Loaded plugins: fastestmirror, priorities
Loading mirror speeds from cached hostfile
1020 packages excluded due to repository priority protections
1:ceph-osd-10.2.7-0.el7.x86_64 : Ceph Object Storage Daemon
Repo        : quickstart-centos-ceph-jewel
Matched from:
Filename    : /usr/sbin/ceph-disk



1:ceph-osd-10.2.7-0.el7.x86_64 : Ceph Object Storage Daemon
Repo        : @quickstart-centos-ceph-jewel
Matched from:
Filename    : /usr/sbin/ceph-disk



[root@overcloud-cephstorage-0 ~]# 

Thanks
  John
Comment 3 Alexander Chuzhoy 2017-09-14 13:29:59 EDT
So there's no ceph-disk on the ceph storage node. It's in the respective container (ceph-osd-overcloud-cephstorage-0-devvdb):

[root@overcloud-cephstorage-0 /]# rpm -qf `which ceph-disk`
ceph-osd-10.2.7-28.el7cp.x86_64
Comment 6 Kefu Chai 2017-09-18 01:44:25 EDT
>  Does the ceph-disk that is being used in the puddle have the fix in tracker.ceph.com/issues/19428? 


No, the upstream backport PR[0] targeting jewel was included in 10.2.8[1] and up. I also cross-checked using the git CLI by checking the two commits in the backport PR:

$ git tag --contains a20d2b89ee13e311cf1038c54ecadae79b68abd5
v10.2.8
v10.2.9

$ git tag --contains 2d5d0aec60ec9689d44a53233268e9b9dd25df95
v10.2.8
v10.2.9

Since RHCS 3.0 will be based on luminous, we can consider this issue addressed.

--
[0] https://github.com/ceph/ceph/pull/14329
[1] http://tracker.ceph.com/issues/19493, see target version.
Comment 11 John Fulton 2017-09-20 13:06:46 EDT
On a downstream system (sealusa3) with this issue, ceph -v returns the following:

[root@20eceada4d09 /]# ceph -v
ceph version 10.2.7-28.el7cp (216cda64fd9a9b43c4b0c2f8c402d36753ee35f7)
[root@20eceada4d09 /]#
Comment 17 Kefu Chai 2017-10-12 09:55:46 EDT
Without this fix, users might not be able to deploy OSDs successfully, so I updated the Doc Text field.
Comment 21 Warren 2017-10-16 16:53:16 EDT
Automated tests have not produced a problem. Marking as verified.
Comment 23 errata-xmlrpc 2017-10-17 14:12:51 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2903
