ceph: Race between partition creation and device node creation

Environment:
python-cephfs-10.2.7-32.el7cp.x86_64
libcephfs1-10.2.7-32.el7cp.x86_64
ceph-common-10.2.7-32.el7cp.x86_64
ceph-mon-10.2.7-32.el7cp.x86_64
ceph-radosgw-10.2.7-32.el7cp.x86_64
puppet-ceph-2.4.1-0.20170831071705.df3ed30.el7ost.noarch
ceph-selinux-10.2.7-32.el7cp.x86_64
ceph-mds-10.2.7-32.el7cp.x86_64
ceph-base-10.2.7-32.el7cp.x86_64
instack-undercloud-7.3.1-0.20170830213703.el7ost.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170901051303.0rc1.el7ost.noarch
ceph-ansible-3.0.0-0.1.rc6.el7cp.noarch
openstack-puppet-modules-11.0.0-0.20170828113154.el7ost.noarch

Due to a race, not all partitions are created on the OSD nodes during overcloud deployment with Ceph.

Steps to reproduce:
1. Apply the missing patch https://review.openstack.org/#/c/501983/3/docker/services/ceph-ansible/ceph-osd.yaml
2. Deploy the overcloud.

Result: the deployment fails during the ceph-ansible phase.

Workaround: re-run the deployment command over the failed one.
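For anyone hitting this, a quick way to see the symptom on an affected OSD node is to compare what ceph-disk has prepared against the device nodes udev has actually created. A minimal sketch, assuming /dev/vdb is the OSD data disk (the device name is only an example):

lsblk /dev/vdb                   # partitions the kernel currently knows about
ceph-disk list                   # what ceph-disk believes has been prepared
udevadm settle --timeout=30      # wait for any pending udev events
ls -l /dev/disk/by-partuuid/     # device nodes udev has created so far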
Background on this is below. The question we should verify: does the ceph-disk that is being used in the puddle have the fix from tracker.ceph.com/issues/19428?

---------- Forwarded message ----------
From: John Fulton <johfulto>
Date: Wed, Sep 13, 2017 at 6:09 PM
Subject: ceph-disk race condition
To: Sasha Chuzhoy <sasha>
Cc: Giulio Fidente <gfidente>, Sebastien Han <shan>

Hi Sasha,

Following up on our conversation today. You used ceph-ansible-3.0.0-0.1.rc6.el7cp.noarch and manually applied the following patch to your THT, which you should do since that ceph-ansible version requires it:

https://review.openstack.org/#/c/501983/3/docker/services/ceph-ansible/ceph-osd.yaml

You then deployed 3 ceph-storage nodes in your overcloud. 2 succeeded and 1 failed. The one that failed had the following error:

http://sprunge.us/fMGj

The error above is a ceph-disk race condition which was fixed in a newer version of ceph:

http://tracker.ceph.com/issues/19428
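To illustrate the failure mode (a sketch only, not taken from the failed node's log): ceph-disk creates a GPT partition and then immediately uses its device node, which udev may not have created yet. With a hypothetical disk /dev/vdb:

sgdisk --new=1:0:+100M --change-name=1:'ceph journal' /dev/vdb
ls /dev/vdb1     # can fail here: the partition exists in GPT but udev has not created the node yet
udevadm settle   # waiting for udev here is only for illustration
ls /dev/vdb1     # node is present now

The upstream fix referenced in the tracker handles this race inside ceph-disk itself.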
Sasha,

Which version of ceph-disk are you using, as provided by the ceph-osd package on your Ceph Storage node? Here is an example of it on my upstream system:

[root@overcloud-cephstorage-0 ~]# yum whatprovides */ceph-disk
Loaded plugins: fastestmirror, priorities
Loading mirror speeds from cached hostfile
1020 packages excluded due to repository priority protections
1:ceph-osd-10.2.7-0.el7.x86_64 : Ceph Object Storage Daemon
Repo        : quickstart-centos-ceph-jewel
Matched from:
Filename    : /usr/sbin/ceph-disk

1:ceph-osd-10.2.7-0.el7.x86_64 : Ceph Object Storage Daemon
Repo        : @quickstart-centos-ceph-jewel
Matched from:
Filename    : /usr/sbin/ceph-disk

[root@overcloud-cephstorage-0 ~]#

Thanks,
John
So there's no ceph-disk on the ceph storage node itself; it's in the respective container (ceph-osd-overcloud-cephstorage-0-devvdb):

[root@overcloud-cephstorage-0 /]# rpm -qf `which ceph-disk`
ceph-osd-10.2.7-28.el7cp.x86_64
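For anyone repeating this check, the same query can be run from the host without opening an interactive shell in the container; a sketch, assuming the container name above and that docker is the container runtime on the node:

docker exec ceph-osd-overcloud-cephstorage-0-devvdb rpm -qf /usr/sbin/ceph-disk
docker exec ceph-osd-overcloud-cephstorage-0-devvdb ceph -v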
> Does the ceph-disk that is being used in the puddle have the fix in tracker.ceph.com/issues/19428?

No, the upstream backport PR [0] targeting jewel was included in 10.2.8 [1] and up. I also cross-checked using the git CLI by checking the two commits in the backport PR:

$ git tag --contains a20d2b89ee13e311cf1038c54ecadae79b68abd5
v10.2.8
v10.2.9
$ git tag --contains 2d5d0aec60ec9689d44a53233268e9b9dd25df95
v10.2.8
v10.2.9

Since RHCS 3.0 will be based on Luminous, we can consider this issue addressed.

--
[0] https://github.com/ceph/ceph/pull/14329
[1] http://tracker.ceph.com/issues/19493, see target version.
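A complementary check that works directly on a deployed node, without a git checkout, is to grep the package changelog for the tracker or backport references; a sketch, assuming the downstream packagers recorded those IDs (they may not have, so an empty result is only a hint):

rpm -q ceph-osd
rpm -q --changelog ceph-osd | grep -iE '19428|14329'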
On a downstream system (sealusa3) with this issue, ceph -v returns the following:

[root@20eceada4d09 /]# ceph -v
ceph version 10.2.7-28.el7cp (216cda64fd9a9b43c4b0c2f8c402d36753ee35f7)
[root@20eceada4d09 /]#
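Since the jewel backport first shipped in 10.2.8, a simple version comparison shows this build predates the fix. A sketch using rpmdev-vercmp from the rpmdevtools package (the 10.2.8 EVR below is only a placeholder for comparison):

rpmdev-vercmp 10.2.7-28.el7cp 10.2.8-0.el7cp
# expected to report 10.2.7-28.el7cp < 10.2.8-0.el7cp, i.e. older than the first fixed release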
Without this fix, users might not be able to deploy OSDs successfully, so I updated the Doc Text field.
Automated tests have not reproduced the problem. Marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2903