Bug 1749097

Summary: ceph-ansible filestore fails to start containerized OSD when using block device like /dev/loop3
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: John Fulton <johfulto>
Component: Ceph-Ansible
Assignee: Dimitri Savineau <dsavinea>
Status: CLOSED ERRATA
QA Contact: Vasishta <vashastr>
Severity: low
Docs Contact:
Priority: low
Version: 3.2
CC: aschoen, bniver, ceph-eng-bugs, ceph-qe-bugs, gabrioux, gfidente, gmeno, nthomas, pasik, sasha, tchandra, tserlin, ykaul
Target Milestone: z2
Target Release: 3.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.2.30-1.el7cp Ubuntu: ceph-ansible_3.2.30-2redhat1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-12-19 17:59:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1578730

Description John Fulton 2019-09-04 21:15:06 UTC
When using filestore and the following devices list with ceph-ansible 3.2.24 and the rhceph-3-rhel7:3-32 container (78e0950c3de6), a Ceph deployment fails while waiting for all of the OSDs to start.

devices:
 - /dev/loop3
 - /dev/loop4
 - /dev/loop5
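
These partitioned loop devices can be set up by attaching a backing file with partition scanning enabled; a minimal sketch (the backing-file path and size below are assumptions, not taken from this deployment):

# create a sparse backing file and attach it so that its partitions
# show up as /dev/loop3p1, /dev/loop3p2, etc. (see the lsblk output in [3])
truncate -s 20G /var/tmp/osd3.img
losetup --partscan /dev/loop3 /var/tmp/osd3.img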

The OSDs are not starting because the ceph-osd-run.sh script [1] trims only one character from the device name, which results in an invalid device and the following error when attempting to manually start the OSD:

[root@overcloud-ceph-leaf1-0 ~]# /usr/share/ceph-osd-run.sh 1
2019-09-04 20:11:24  /entrypoint.sh: static: does not generate config
2019-09-04 20:11:24  /entrypoint.sh: ERROR: you either provided a non-existing device or no device at all.
2019-09-04 20:11:24  /entrypoint.sh: You must provide a device to build your OSD ie: /dev/sdb
[root@overcloud-ceph-leaf1-0 ~]# 

If I modify the deployed script to trim not one character but two [2], then the OSD starts fine.

When running the commands directly you can see what's happening for the given block devices [3]: we end up with a non-existent OSD_DEVICE of /dev/loop3p:

[root@overcloud-ceph-leaf1-0 ~]# DATA_PART=$(docker run --rm --ulimit nofile=1024:1024 --privileged=true -v /dev/:/dev/ -v /etc/ceph:/etc/ceph:z --entrypoint ceph-disk 10.37.168.131:8787/rhceph/rhceph-3-rhel7:3-32 list | grep ", osd\.1," | awk '{ print $1 }')
[root@overcloud-ceph-leaf1-0 ~]# OSD_DEVICE=${DATA_PART:0:-1}
[root@overcloud-ceph-leaf1-0 ~]# echo $OSD_DEVICE
/dev/loop3p
[root@overcloud-ceph-leaf1-0 ~]# 

I assume that the block devices you tested with, e.g. /dev/vdb, didn't have this issue, but the loopback devices do because their partitions carry a "p" before the partition number (e.g. /dev/loop3p1). Trimming two characters instead gives the correct device:

[root@overcloud-ceph-leaf1-0 ~]# echo $OSD_DEVICE
/dev/loop3p
[root@overcloud-ceph-leaf1-0 ~]# OSD_DEVICE=${DATA_PART:0:-2}
[root@overcloud-ceph-leaf1-0 ~]# echo $OSD_DEVICE
/dev/loop3
[root@overcloud-ceph-leaf1-0 ~]# 
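
To illustrate the difference (the /dev/vdb1 value is a hypothetical example; only the loop devices are from this system):

DATA_PART=/dev/vdb1;    echo ${DATA_PART:0:-1}   # -> /dev/vdb     parent device exists
DATA_PART=/dev/loop3p1; echo ${DATA_PART:0:-1}   # -> /dev/loop3p  no such device
DATA_PART=/dev/loop3p1; echo ${DATA_PART:0:-2}   # -> /dev/loop3   parent device exists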

Adding an additional check in the shell script for devices whose partition names carry a "p" before the partition number (like the loop devices above) would solve this bug for devices that match that naming scheme; one possible approach is sketched below.
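
Something along these lines could work; this is only a sketch with a hypothetical function name, stripping the whole partition suffix rather than a fixed number of characters, and is not necessarily the fix that went into ceph-ansible:

# hypothetical helper; the function name and regex in the real template may differ
get_osd_device() {
  local data_part=$1
  if [[ "${data_part}" =~ ^/dev/(cciss|nvme|loop|mmcblk) ]]; then
    # these drivers insert a "p" before the partition number, e.g. /dev/loop3p1
    OSD_DEVICE=${data_part%%p[0-9]*}
  else
    # plain disks simply append the partition number, e.g. /dev/sdb1
    OSD_DEVICE=${data_part%%[0-9]*}
  fi
}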

We're working around it simply by switching the deployment from filestore to bluestore. Just wanted to report the bug for completeness.


[1] https://github.com/ceph/ceph-ansible/blob/v3.2.24/roles/ceph-osd/templates/ceph-osd-run.sh.j2#L23

[2] 
[fultonj@skagra tmp]$ diff -u old new
--- old	2019-09-04 16:54:23.085337059 -0400
+++ new	2019-09-04 16:54:31.829142391 -0400
@@ -15,7 +15,7 @@
   if [[ "${DATA_PART}" =~ ^/dev/(cciss|nvme) ]]; then
     OSD_DEVICE=${DATA_PART:0:-2}
   else
-    OSD_DEVICE=${DATA_PART:0:-1}
+    OSD_DEVICE=${DATA_PART:0:-2}
   fi
 }
 
[fultonj@skagra tmp]$ 

[3] 
[root@overcloud-ceph-leaf1-0 ~]# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda         8:0    0 931.5G  0 disk 
├─sda1      8:1    0     1M  0 part 
└─sda2      8:2    0 931.5G  0 part /
loop3       7:3    0    20G  0 loop 
├─loop3p1 259:1    0    15G  0 loop 
└─loop3p2 259:0    0     5G  0 loop 
loop4       7:4    0    20G  0 loop 
├─loop4p1 259:3    0    15G  0 loop 
└─loop4p2 259:2    0     5G  0 loop 
loop5       7:5    0    20G  0 loop 
├─loop5p1 259:5    0    15G  0 loop 
└─loop5p2 259:4    0     5G  0 loop 
loop6       7:6    0    20G  0 loop 
├─loop6p1 259:7    0    15G  0 loop 
└─loop6p2 259:6    0     5G  0 loop 
[root@overcloud-ceph-leaf1-0 ~]#

Comment 6 errata-xmlrpc 2019-12-19 17:59:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:4353