Bug 1392455 - The ceph-disk process is not reliably mounting OSD's on boot
Summary: The ceph-disk process is not reliably mounting OSD's on boot
Keywords:
Status: CLOSED DUPLICATE of bug 1391197
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Disk
Version: 2.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 2.2
Assignee: Loic Dachary
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-11-07 14:32 UTC by Mike Hackett
Modified: 2017-07-30 14:58 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-21 17:06:52 UTC
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 17889 0 None None None 2016-11-14 07:21:20 UTC

Description Mike Hackett 2016-11-07 14:32:05 UTC
Description of problem:
When a Ceph OSD node on RHCS 2.0 boots, UDEV is not mounting the OSD block devices. If partprobe is run after the boot, it triggers the rules properly, mounts the OSD block devices, and starts all the OSD daemons. I would expect UDEV to trigger the rules properly and mount the OSD block devices at boot time.

Debug logging was enabled for UDEV and shows that UDEV is doing its job. Control then passes to ceph-disk, which has to complete the mount, and that step fails sporadically.
This is likely a timing issue, as it does not occur on every reboot.
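
One way to confirm the symptom after a reboot is to list which OSD data directories are actually mount points. The following is a minimal sketch, assuming the default /var/lib/ceph/osd/ceph-<id> layout that appears in the OSD logs; the helper name unmounted_osd_dirs is hypothetical and not part of Ceph.

```python
import os


def unmounted_osd_dirs(base="/var/lib/ceph/osd"):
    """Return OSD data directories under `base` that are not mount points.

    Hypothetical helper, not part of Ceph. Assumes each OSD has a
    directory named ceph-<id> that should have a filesystem mounted on it.
    """
    if not os.path.isdir(base):
        # The parent directory itself is missing or not yet available.
        return []
    missing = []
    for name in sorted(os.listdir(base)):
        path = os.path.join(base, name)
        if name.startswith("ceph-") and os.path.isdir(path):
            if not os.path.ismount(path):
                missing.append(path)
    return missing


if __name__ == "__main__":
    for path in unmounted_osd_dirs():
        print("not mounted:", path)
```

Running this right after a boot and again after partprobe should show whether the list shrinks to empty once the udev rules have been re-triggered.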

The Ceph process user and group are ceph:ceph.


Version-Release number of selected component (if applicable):
RHCS 2.0 
10.2.2-41.el7cp.x86_64 

RHEL 7.2
3.10.0-327.36.3.el7.x86_64

Logs from the issue, with debug enabled on UDEV, are located at:

https://api.access.redhat.com/rs/cases/01705490/attachments/1d24f222-b432-4fb4-9275-69eec89b4eab

Comment 5 Loic Dachary 2016-11-07 17:27:23 UTC
I would need to be able to reproduce the problem to figure out what's going on. Would you be so kind as to detail the steps I should follow to do that?

Comment 7 Loic Dachary 2016-11-07 17:38:06 UTC
> This is likely timing related but the customer seems to be able to trigger this behaviour on demand.

This is good :-) What about asking him to provide a sosreport? It would be most useful in combination with the date/time of the reboot that failed to bring all OSDs up and running.

Comment 10 Loic Dachary 2016-11-07 18:01:23 UTC
I can't access the api.access link using the login that works for access.redhat.com. I guess I do not have enough permissions? In any case it would be good to attach the report to this bz for posterity. If that's not possible for some reason, I'll need to investigate why I can't access this URL. Thanks for your patience.

Comment 18 Loic Dachary 2016-11-07 22:01:58 UTC
@Tim I have the file, thanks for the guidance.

Comment 19 Loic Dachary 2016-11-07 22:25:58 UTC
Just to confirm this is indeed a race at boot time, can you confirm that manually activating the devices with ceph-disk activate works? It would also help me understand what's going on if you could tell me a little more about how the OSDs were configured. How were the devices prepared with ceph-disk (i.e. what options were used)? I'm guessing devices /dev/sdb to /dev/sdy have been prepared with a simple ceph-disk prepare /dev/sdb, which explains why they all have two partitions (one for the data, the other for the journal). How far am I from what's really there?

Comment 20 Loic Dachary 2016-11-07 23:17:17 UTC
In /var/log/messages I looked at the last two reboots. The last one does not have much, as if it had been interrupted. The second-to-last one has the following sequence, which is similar for each disk:

Nov  3 14:59:23 ttbcosd0015 kernel: [    5.008708]  sde: sde1 sde2
...
Nov  3 14:59:23 ttbcosd0015 logger: RHDEBUG:Running-ceph-disk
...
Nov  3 14:59:33 ttbcosd0015 kernel: [   16.693962] XFS (sde1): Mounting V4 Filesystem
Nov  3 14:59:33 ttbcosd0015 kernel: [   16.693962] XFS (sde1): Mounting V4 Filesystem
Nov  3 14:59:33 ttbcosd0015 kernel: [   16.709829] XFS (sde1): Ending clean mount
Nov  3 14:59:33 ttbcosd0015 kernel: [   16.709829] XFS (sde1): Ending clean mount

and in var/log/ceph/ceph-osd.363.log

2016-11-03 14:59:28.897645 7f97ccd2a800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 3874
2016-11-03 14:59:28.897855 7f97ccd2a800 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-363: (2) No such file or directory
...
2016-11-03 14:59:37.914801 7fe4ea312800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 25115
2016-11-03 14:59:37.916534 7fe4ea312800  0 pidfile_write: ignore empty --pid-file
2016-11-03 14:59:37.944380 7fe4ea312800  0 filestore(/var/lib/ceph/osd/ceph-363) backend xfs (magic 0x58465342)
2016-11-03 14:59:37.945254 7fe4ea312800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-363) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-11-03 14:59:37.945263 7fe4ea312800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-363) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
...
2016-11-03 15:02:14.749603 7fe4ea312800 10 -- :/25115 wait: waiting for pipes  to close
2016-11-03 15:02:14.749605 7fe4ea312800 10 -- :/25115 wait: done.
2016-11-03 15:02:14.749606 7fe4ea312800  1 -- :/25115 shutdown complete.

At 15:02 the machine was rebooted, but it looks like the OSD booted OK at 14:59. After it rebooted at 15:07, nothing good happens. All OSD logs exhibit something like:

2016-11-03 15:07:14.858399 7f17dc5e9800  0 set uid:gid to 167:167 (ceph:ceph)
2016-11-03 15:07:14.858425 7f17dc5e9800  0 ceph version 10.2.2-41.el7cp (1ac1c364ca12fa985072174e75339bfb1f50e9ee), process ceph-osd, pid 3789
2016-11-03 15:07:14.858571 7f17dc5e9800 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-363: (2) No such file or directory
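
To check every OSD log on the node for the same failure, the superblock errors can be extracted with a short script. This is only a sketch, assuming the log line layout shown in the excerpts above; the names SUPERBLOCK_ERROR and superblock_errors are hypothetical, not Ceph tooling.

```python
import re

# Matches ceph-osd log lines that report a failure to open the OSD
# superblock, capturing the timestamp and the OSD data path.
# The pattern is an assumption based on the excerpts in this report.
SUPERBLOCK_ERROR = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"unable to open OSD superblock on (?P<path>[^:]+):"
)


def superblock_errors(lines):
    """Return (timestamp, osd_path) for each superblock-open failure."""
    found = []
    for line in lines:
        match = SUPERBLOCK_ERROR.search(line)
        if match:
            found.append((match.group("ts"), match.group("path")))
    return found


# Example use against one log file (path taken from this report):
# with open("/var/log/ceph/ceph-osd.363.log") as log:
#     for ts, path in superblock_errors(log):
#         print(ts, path)
```

Comparing the returned timestamps with the XFS mount times in /var/log/messages shows whether an OSD was started before its data partition was mounted.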


and var/log/messages only has

Nov  3 15:07:09 ttbcosd0015 kernel: [    4.962805]  sde: sde1 sde2
Nov  3 15:07:09 ttbcosd0015 kernel: [    4.964710] sd 0:2:3:0: [sde] Attached SCSI disk

but no ceph-disk and no series of lines like

Nov  3 15:01:04 ttbcosd0015 root: os-prober: debug: running /usr/libexec/os-probes/mounted/90linux-distro on mounted /dev/sde1

which seems to indicate that something's not happening as it should and not just regarding ceph osds. It is as if even /var/lib/ceph is not available.
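
The comparison between the two boots can be automated: for the syslog lines of a single boot, report disks whose partitions were detected by the kernel but for which no XFS mount was logged. This is a rough sketch, assuming the kernel message formats shown above and XFS data partitions; the function name disks_without_mount is hypothetical, and a system disk without XFS partitions would also be reported.

```python
import re

# Kernel partition scan line, e.g. "kernel: [    5.008708]  sde: sde1 sde2"
PARTITION_SCAN = re.compile(
    r"kernel: \[\s*\d+\.\d+\]\s+(?P<disk>sd[a-z]+):(?P<parts>(?: sd[a-z]+\d+)+)\s*$"
)
# XFS mount line, e.g. "XFS (sde1): Mounting V4 Filesystem"
XFS_MOUNT = re.compile(r"XFS \((?P<part>sd[a-z]+\d+)\): Mounting")


def disks_without_mount(lines):
    """Return disks that were scanned but had no partition mounted.

    `lines` is assumed to hold the syslog lines of one boot only.
    """
    detected = {}   # disk name -> set of its partition names
    mounted = set() # partitions with an XFS "Mounting" message
    for line in lines:
        scan = PARTITION_SCAN.search(line)
        if scan:
            detected[scan.group("disk")] = set(scan.group("parts").split())
            continue
        mount = XFS_MOUNT.search(line)
        if mount:
            mounted.add(mount.group("part"))
    return sorted(d for d, parts in detected.items() if not parts & mounted)
```

Applied to the excerpts in this comment, the 14:59 boot would report nothing for sde, while the 15:07 boot would report sde.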

