Bug 1486099 - [RFE] Boot time Ceph OSD's activation using partprobe in rc.local or any other pre-processing script
Summary: [RFE] Boot time Ceph OSD's activation using partprobe in rc.local or any other pre-processing script
Keywords:
Status: CLOSED DUPLICATE of bug 1458007
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Disk
Version: 2.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 2.5
Assignee: Loic Dachary
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-08-29 02:59 UTC by Vikhyat Umrao
Modified: 2021-03-11 15:40 UTC
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-13 14:54:31 UTC
Embargoed:




Links
Red Hat Bugzilla 1458007 (urgent, CLOSED): timeout during ceph-disk trigger due to /var/lock/ceph-disk flock contention (last updated 2021-09-09 12:24:46 UTC)

Internal Links: 1458007

Description Vikhyat Umrao 2017-08-29 02:59:35 UTC
Description of problem:
[RFE] Boot-time Ceph OSD activation by adding partprobe in place of the udev rules [1].

This will help avoid race conditions between Ceph udev rules and ceph-disk.

We have tested adding partprobe to rc.local as follows, and it has helped multiple customers.

---
Below are the steps we have been using to implement the workaround. 

1. Edit /etc/rc.local and append the command:
partprobe

2. Set /etc/rc.d/rc.local as executable:
# chmod 755 /etc/rc.d/rc.local

3. Enable the rc.local service:
# systemctl enable rc-local.service

4. Reboot the system.
---

If we do not want to use rc.local, we can run partprobe from some other pre-processing script; one option is sketched below.

[1] /usr/lib/udev/rules.d/95-ceph-osd.rules
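
For example, a dedicated oneshot unit could run partprobe at boot instead of rc.local. This is only a sketch, not something we have deployed; the unit name partprobe-ceph-osd.service is made up for illustration:

---
# /etc/systemd/system/partprobe-ceph-osd.service (hypothetical name)
[Unit]
Description=Re-read partition tables so the Ceph OSD udev rules fire
Wants=systemd-udev-settle.service
After=systemd-udev-settle.service local-fs.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/partprobe

[Install]
WantedBy=multi-user.target
---

Enable it with:
# systemctl daemon-reload
# systemctl enable partprobe-ceph-osd.service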


Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 2.3

Comment 3 Loic Dachary 2017-08-29 08:57:21 UTC
Please provide an explanation why this helps. Running partprobe cannot replace udev events because it relies on them. When partprobe runs against a device, it will trigger udev events that will start the OSD. At boot time it will race against the udev events generated by the kernel when the machine comes up and the disks appear.

It is my understanding that it helped in some cases but I lack information to understand why. If the machines in question boot without the kernel firing any udev event, that will help indeed. Is that the case ?

Comment 4 Vikhyat Umrao 2017-08-29 12:01:26 UTC
(In reply to Loic Dachary from comment #3)
> Please provide an explanation why this helps. Running partprobe cannot
> replace udev events because it relies on them. When partprobe runs against a
> device, it will trigger udev events that will start the OSD. 

Thanks for the explanation and for correcting me. I misunderstood it because I thought partprobe bypasses the ceph udev rules and directly triggers activation.

I have changed the bug title to fix my misunderstanding. 

> At boot time it will race against the udev events generated by the kernel 
> when the machine comes up and the disks appear.

I think here you meant that at boot time ceph-disk activate triggers the udev event for OSD activation, and it races against the udev events generated by the kernel.

or

do we already use partprobe in ceph-disk activate to activate the devices at boot time, and are the udev events generated by partprobe for OSD disk activation racing against the udev events generated by the kernel?

> 
> It is my understanding that it helped in some cases but I lack information
> to understand why. If the machines in question boot without the kernel
> firing any udev event, that will help indeed. Is that the case ?


We have suggested this to 4 customers and it has helped them. The only reason I can see for why it helps is that by the time rc.local gets executed, all the resources are available in the system and all the kernel udev events have already completed, so there is no race against the kernel udev events.
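
If that is the reason, one way to make the ordering explicit (just a sketch, we have not rolled this out anywhere) would be to drain the udev queue before running partprobe in rc.local:

---
# appended to /etc/rc.d/rc.local
# wait for all pending kernel udev events to be processed, then
# re-read the partition tables so 95-ceph-osd.rules fires again
/usr/bin/udevadm settle --timeout=180
/usr/sbin/partprobe
---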

Comment 5 Loic Dachary 2017-08-29 15:26:59 UTC
> I think here you meant that at boot time ceph-disk activate triggers the udev event for OSD activation, and it races against the udev events generated by the kernel.
> 
> or
> 
> do we already use partprobe in ceph-disk activate to activate the devices at boot time, and are the udev events generated by partprobe for OSD disk activation racing against the udev events generated by the kernel?

partprobe is not used at all during the boot phase. What it does is essentially tell the kernel to forget everything it knows about a device and then inspect the device again, firing udev events as if the device had just been plugged in. This is quite useful when the device is modified, and it is only in those situations that ceph-disk calls partprobe.

At boot time the kernel is supposed to send udev events for every disk it has, and these udev events will then call ceph-disk activate to run the OSD daemon.

Since you had good experience using partprobe at boot time and saw no race of any kind, either you got lucky (but since you did that at 4 customers it does not sound like luck), or for some reason the udev events are *not* fired when the kernel boots (that sounds more likely).

Can you figure that out by generating a full sos report right before rc.local tries to run partprobe ?

Note: I don't think the context in which rc.local runs provides any guarantee that each and every udev event was fired and acted upon by the time it completes. And even then, the udev event running ceph-disk activate actually delegates the bulk of the action to a systemd service working asynchronously with a global lock.
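
For what it's worth, something like the following could be collected on an affected host to see whether the kernel events reached the Ceph rule and whether ceph-disk trigger units are queued behind the lock (a sketch only; /dev/sdb1 is just an example OSD partition):

# udevadm info --query=all --name=/dev/sdb1
# udevadm test /sys/class/block/sdb1 2>&1 | grep -i ceph
# systemctl list-units 'ceph-disk@*' --all
# ls -l /var/lock/ceph-disk*

The first two show whether udev processed an event for the partition and which rules (including 95-ceph-osd.rules) would match; the last two show the state of the ceph-disk trigger units and the lock files they contend on.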

Comment 6 Stephen Smith 2017-09-06 12:54:17 UTC
The workaround here doesn't appear to have resolved the issue for us unfortunately. Is there any indication as to what the upstream issue is?

Comment 7 Loic Dachary 2017-09-20 12:37:17 UTC
@Stephen did you try the upstream fix for https://bugzilla.redhat.com/show_bug.cgi?id=1472409 ? It will be in the next release, but you can test it by manually changing the timeout from 300 to 10000 in the ceph-disk@.service file (see https://github.com/ceph/ceph/pull/17133/files).
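
If you prefer not to edit the packaged unit file for the test, a drop-in override should also work, assuming the unit already reads CEPH_DISK_TIMEOUT as in that pull request (sketch only; with the older hard-coded timeout you would have to override ExecStart instead):

# systemctl edit ceph-disk@.service

and put the following in the generated override file:

[Service]
Environment=CEPH_DISK_TIMEOUT=10000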

Comment 8 Stephen Smith 2017-09-20 13:33:08 UTC
@loic - we found that it ends up bleeding into the default "down / out" interval. Is there a recommended amount by which we should also increase the "mon osd down out interval", or do you perhaps have any insight into why it takes so long? I assume it directly relates to either the size or number of disks and perhaps is a factor of both?

Comment 9 Loic Dachary 2017-09-21 13:32:28 UTC
> we found that it ends up bleeding into the default "down / out" interval.

This is an interesting idea. Could you explain in detail what you mean by that ?

> any insight into why it takes so long?

I don't know for what reason the initialization of a single OSD would take more than a few seconds on a given platform. It would be interesting to get detailed information about what happens on a machine where it is the case. I.e. how much time is spent doing what.
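
As a starting point, the journal of the last boot should give a rough breakdown per unit (a sketch; whether -u accepts globs depends on the systemd version):

# journalctl -b -o short-monotonic -u 'ceph-disk@*' -u 'ceph-osd@*'
# systemd-analyze blame | grep -i ceph

The first command prints monotonic timestamps for every message from the disk activation and OSD units; the second shows how long each unit took to start.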

Comment 10 Stephen Smith 2017-09-21 15:02:34 UTC
(In reply to Loic Dachary from comment #9)
> > we found that it ends up bleeding into the default "down / out" interval.
> 
> This is an interesting idea. Could you explain in detail what you mean by
> that ?

We have the following in our ceph.conf:

mon_osd_auto_mark_new_in = false
mon_osd_auto_mark_auto_out_in = false
mon_osd_down_out_interval = 600

The default for mon_osd_down_out_interval is 300, so we've raised it from 5 minutes to 10 minutes. With mon_osd_auto_mark_auto_out_in = false we also ask Ceph not to auto mark in disks that were auto marked out, to control rebalancing to some extent.
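
For reference, the values the monitors are actually running with can be confirmed through the admin socket on a monitor host (mon.a below is just an example ID):

# ceph daemon mon.a config get mon_osd_down_out_interval
# ceph daemon mon.a config get mon_osd_auto_mark_auto_out_in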

When OSDs take longer than 10 minutes to come up (and if we set CEPH_DISK_TIMEOUT = 10000, they can take up to ~3 hours), they will be auto marked out and won't necessarily come back in. Even if we didn't have mon_osd_auto_mark_auto_out_in = false, you run the risk of:

1) Some number of OSDs being auto marked out if they do not come up in 10 minutes
2) Rebalancing / recovery begins for those marked auto out
3) Now OSDs come up and get marked auto in
4) Rebalancing / recovery begins again

Even if there isn't a large amount of rebalancing, it's possible OSDs can flap under the load of figuring out what should be in / out and what has / has not been rebalanced.

In a cluster with 10TB OSDs, for instance, this can be fairly painful. The strange thing is that this happens with or without data in the cluster, and seemingly under zero or high load. We see this more with larger numbers of OSDs in a single host. We also observe that one OSD seems to come up every 20 seconds; with 60 OSDs this can add up to upwards of 20 minutes.

> > any insight into why it takes so long?
> 
> I don't know for what reason the initialization of a single OSD would take
> more than a few seconds on a given platform. It would be interesting to get
> detailed information about what happens on a machine where it is the case.
> I.e. how much time is spent doing what.

I'm happy to provide any details I can.

Comment 11 Vikhyat Umrao 2017-10-13 14:54:31 UTC

*** This bug has been marked as a duplicate of bug 1458007 ***

