Description of problem:
When a Ceph OSD node on RHCS 2.0 boots, UDEV does not mount the OSD block devices. If partprobe is run after the boot, the rules trigger properly, all the OSD block devices are mounted and the OSD daemons start. I would expect UDEV to trigger the rules and mount the OSD block devices at boot time. Debug was enabled for UDEV and shows that UDEV is doing its job; control then passes to ceph-disk, which has to complete the mount, and that step fails sporadically. This is likely a timing issue, as it does not occur on every reboot. The Ceph processes run as ceph:ceph.

Version-Release number of selected component (if applicable):
RHCS 2.0 10.2.2-41.el7cp.x86_64
RHEL 7.2 3.10.0-327.36.3.el7.x86_64

Logs from the issue, with debug enabled on UDEV, are located at: https://api.access.redhat.com/rs/cases/01705490/attachments/1d24f222-b432-4fb4-9275-69eec89b4eab
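To see which OSDs were left unmounted after a given boot, the OSD data directories can be compared against the mount table. The following is only a minimal sketch, not something taken from the report: the helper name `unmounted_osd_dirs` is hypothetical, and it assumes the default `/var/lib/ceph/osd/ceph-<id>` layout that appears in the OSD logs on this bug.

```python
def unmounted_osd_dirs(osd_dirs, mounts_text):
    """Return the OSD data directories that are not mount points.

    osd_dirs    -- paths such as /var/lib/ceph/osd/ceph-363
                   (assumed default layout)
    mounts_text -- the contents of /proc/mounts as one string
    """
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # The second field of each /proc/mounts line is the mount point.
            mounted.add(fields[1])
    return sorted(d for d in osd_dirs if d.rstrip("/") not in mounted)
```

On an OSD node it would be fed the contents of `/proc/mounts` and the result of `glob.glob("/var/lib/ceph/osd/ceph-*")`. A non-empty result right after boot that becomes empty after running partprobe would match the behaviour described above.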
I would need to be able to reproduce the problem to figure out what's going on. Would you be so kind as to detail the steps I should follow to do that ?
> This is likely timing related but the customer seems to be able to trigger this behaviour on demand.

This is good :-) What about asking him to provide a sosreport ? It would be most useful in combination with the date/time of the reboot that failed to bring all OSDs up and running.
I can't access the api.access link using the login that works for access.redhat.com. I guess not enough permissions ? In any case it would be good to attach the report to this bz for posterity. If that's not possible for some reason I'll need to investigate why I can't access this URL. Thanks for your patience.
@Tim I have the file, thanks for the guidance.
Just to confirm this is indeed a race at boot time, can you confirm that manually activating the devices with ceph-disk activate works ? It would also help me understand what's going on if you could tell me a little more about how the OSDs were configured. How were the devices prepared with ceph-disk (i.e. what options were used) ? I'm guessing devices /dev/sdb to /dev/sdy have been prepared with a simple ceph-disk prepare /dev/sdb, which would explain why they all have two partitions (one for the data, the other for the journal). How far am I from what's really there ?
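The manual check asked for above could be scripted roughly as follows. This is a sketch under assumptions: it takes the data partition to be partition 1 of each disk (consistent with the XFS mount of sde1 in the logs, but not confirmed for every device), the helper names are hypothetical, and it only prints the commands unless `dry_run=False` is passed. `ceph-disk activate` needs to be run as root.

```python
import subprocess


def activation_commands(disks, data_partition=1):
    """Build one `ceph-disk activate` argv per disk.

    Assumption: the data partition is partition 1 and the journal is
    on the other partition of the same disk.
    """
    return [["ceph-disk", "activate", "{}{}".format(disk, data_partition)]
            for disk in disks]


def activate(disks, dry_run=True):
    """Print each command; run it only when dry_run is False.

    Returns the partitions whose activation exited with a non-zero status.
    """
    failed = []
    for argv in activation_commands(disks):
        print(" ".join(argv))
        if not dry_run and subprocess.call(argv) != 0:
            failed.append(argv[-1])
    return failed
```

For example, `activate(["/dev/sdb", "/dev/sde"])` prints the two commands without touching the devices. If running them by hand after a failed boot brings the OSDs up, that supports the boot-time race theory.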
In /var/log/messages I looked at the last two reboots. The last one does not have much, as if it had been interrupted. The second-to-last one shows the following sequence, similar for each disk:

Nov 3 14:59:23 ttbcosd0015 kernel: [ 5.008708] sde: sde1 sde2
...
Nov 3 14:59:23 ttbcosd0015 logger: RHDEBUG:Running-ceph-disk
...
Nov 3 14:59:33 ttbcosd0015 kernel: [ 16.693962] XFS (sde1): Mounting V4 Filesystem
Nov 3 14:59:33 ttbcosd0015 kernel: [ 16.693962] XFS (sde1): Mounting V4 Filesystem
Nov 3 14:59:33 ttbcosd0015 kernel: [ 16.709829] XFS (sde1): Ending clean mount
Nov 3 14:59:33 ttbcosd0015 kernel: [ 16.709829] XFS (sde1): Ending clean mount

and in var/log/ceph/ceph-osd.363.log:

2016-11-03 14:59:28.897645 7f97ccd2a800 0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 3874
2016-11-03 14:59:28.897855 7f97ccd2a800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-363: (2) No such file or directory
...
2016-11-03 14:59:37.914801 7fe4ea312800 0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 25115
2016-11-03 14:59:37.916534 7fe4ea312800 0 pidfile_write: ignore empty --pid-file
2016-11-03 14:59:37.944380 7fe4ea312800 0 filestore(/var/lib/ceph/osd/ceph-363) backend xfs (magic 0x58465342)
2016-11-03 14:59:37.945254 7fe4ea312800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-363) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-11-03 14:59:37.945263 7fe4ea312800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-363) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
...
2016-11-03 15:02:14.749603 7fe4ea312800 10 -- :/25115 wait: waiting for pipes to close
2016-11-03 15:02:14.749605 7fe4ea312800 10 -- :/25115 wait: done.
2016-11-03 15:02:14.749606 7fe4ea312800 1 -- :/25115 shutdown complete.

At 15:02 the machine was rebooted, but it looks like the OSD booted OK at 14:59.
After it rebooted at 15:07 nothing good happens. All OSD logs exhibit something like:

2016-11-03 15:07:14.858399 7f17dc5e9800 0 set uid:gid to 167:167 (ceph:ceph)
2016-11-03 15:07:14.858425 7f17dc5e9800 0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 3789
2016-11-03 15:07:14.858571 7f17dc5e9800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-363: (2) No such file or directory

and var/log/messages only has

Nov 3 15:07:09 ttbcosd0015 kernel: [ 4.962805] sde: sde1 sde2
Nov 3 15:07:09 ttbcosd0015 kernel: [ 4.964710] sd 0:2:3:0: [sde] Attached SCSI disk

but no ceph-disk entry and no series of lines like

Nov 3 15:01:04 ttbcosd0015 root: os-prober: debug: running /usr/libexec/os-probes/mounted/90linux-distro on mounted /dev/sde1

which seems to indicate that something is not happening as it should, and not just regarding the Ceph OSDs. It is as if even /var/lib/ceph is not available.
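The comparison between the two boots can be repeated across all OSD logs with a small script. This is only a sketch: the function name `scan_osd_log` is hypothetical, and the patterns are based solely on the log lines quoted in this bug, so other ceph-osd versions may format them differently. It lists every ceph-osd start found in a log and whether that start hit the superblock error.

```python
import re

# Matches the "ceph version ..., process ceph-osd, pid N" line that
# opens each ceph-osd start in the excerpts quoted on this bug.
START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\S* .*process ceph-osd, pid (\d+)"
)
SUPERBLOCK_ERROR = "unable to open OSD superblock"


def scan_osd_log(lines):
    """Return (timestamp, pid, superblock_ok) for each ceph-osd start."""
    starts = []
    for line in lines:
        match = START.match(line)
        if match:
            starts.append([match.group(1), int(match.group(2)), True])
        elif SUPERBLOCK_ERROR in line and starts:
            # Attribute the error to the most recent start.
            starts[-1][2] = False
    return [tuple(start) for start in starts]
```

Applied to the ceph-osd.363.log excerpt from the 14:59 boot, this reports a failed start (pid 3874, before the XFS mount at 14:59:33) followed by a successful one (pid 25115). For the 15:07 boot it reports only the failed start.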