Bug 1458007

Summary: timeout during ceph-disk trigger due to /var/lock/ceph-disk flock contention
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Justin Bautista <jbautist>
Component: Ceph-Disk
Assignee: Kefu Chai <kchai>
Status: CLOSED ERRATA
QA Contact: Manohar Murthy <mmurthy>
Severity: urgent
Docs Contact: Aron Gunn <agunn>
Priority: urgent
Version: 2.2
CC: abustosp, agunn, ealcaniz, edonnell, gmeno, hklein, hnallurv, jcall, jniu, jquinn, kchai, kdreyer, k-matsuzaki, knoha, linuxkidd, mhackett, mv3, saime, shmohan, sinan.polat, stephen_e_smith, taichi.kageyama, tpetr, uboppana, vumrao, ykawada
Target Milestone: rc
Target Release: 2.5
Hardware: x86_64
OS: Linux
Fixed In Version: RHEL: ceph-10.2.10-1.el7cp Ubuntu: ceph_10.2.10-2redhat1
Doc Type: Bug Fix
Doc Text:
.Some OSDs fail to come up after reboot
Previously, on a machine with more than five OSDs, some OSDs failed to come up after a reboot because the `systemd` unit for the `ceph-disk` utility timed out after 120 seconds. With this release, the `ceph-disk` code no longer fails if the OSD `udev` rule triggers before the `/var/` directory is mounted.
Last Closed: 2018-02-21 19:43:32 UTC
Type: Bug
Bug Blocks: 1437916, 1536401    

Description Justin Bautista 2017-06-01 18:17:31 UTC
Description of problem: 

Customer is reporting that whenever OSD hosts are rebooted, the drives aren't all mounted and the OSDs don't all start.

Per Loic, this should already be fixed upstream via: 
  http://tracker.ceph.com/issues/18060.

Version-Release number of selected component (if applicable):

$ grep -i ceph installed-rpms 
ceph-base-10.2.5-37.el7cp.x86_64                            Fri Mar 31 23:50:55 2017
ceph-common-10.2.5-37.el7cp.x86_64                          Fri Mar 31 23:50:50 2017
ceph-osd-10.2.5-37.el7cp.x86_64                             Fri Mar 31 23:50:56 2017
ceph-selinux-10.2.5-37.el7cp.x86_64                         Fri Mar 31 23:50:50 2017
libcephfs1-10.2.5-37.el7cp.x86_64                           Fri Mar 31 23:50:47 2017
python-cephfs-10.2.5-37.el7cp.x86_64                        Fri Mar 31 23:50:47 2017

Comment 2 Justin Bautista 2017-06-01 18:19:01 UTC
Fixed upstream in pull request:

  https://github.com/ceph/ceph/pull/12200

Comment 4 Ian Colle 2017-06-05 17:33:03 UTC
Already in latest 2.3 compose

Comment 6 Loic Dachary 2017-06-06 14:26:09 UTC
Hi,

I think a good way to reliably reproduce this problem would be to set up a virtual machine and attach 50 disks to it. It should experience the same kind of delays during reboot, don't you think?
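
As an illustration only, a rough sketch of what such a setup could look like with libvirt. The guest name, image directory, disk size, and target names below are placeholder assumptions, not details from this bug:

  # Hypothetical reproducer setup: create and attach 50 qcow2 disks to an
  # existing libvirt guest so a reboot triggers ~50 ceph-disk activations at once.
  DOM=osd-test                                # placeholder guest name
  IMGDIR=/var/lib/libvirt/images              # placeholder image directory
  targets=( vd{b..z} vda{a..z} )              # vdb..vdz, vdaa..vdaz (51 names)

  for i in $(seq 0 49); do
      img="$IMGDIR/${DOM}-osd${i}.qcow2"
      qemu-img create -f qcow2 "$img" 20G
      virsh attach-disk "$DOM" "$img" "${targets[$i]}" \
          --driver qemu --subdriver qcow2 --persistent
  done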

Cheers

Comment 8 Loic Dachary 2017-06-08 19:17:27 UTC
This is great! Investigating.

Comment 11 Loic Dachary 2017-06-08 20:23:20 UTC
Here is what happens:

- 60 systemctl units for ceph-disk run at the same time
- each has a different lock, so they are not blocked by each other (that is what http://tracker.ceph.com/issues/18060 is for)
- all units run ceph-disk activate, which takes a global lock, so they effectively all wait for each other; there is lock contention anyway (meaning my earlier suggestion to upgrade does not actually improve things)
- it takes a lot longer than 2 minutes (the default timeout) to activate all 60 OSDs, therefore some of them time out and never start
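
To make that serialization concrete, here is a toy sketch; the lock path and the 5-second "activation" are made up, and only the flock-plus-timeout interplay mirrors the real units:

  # Toy model of 60 ceph-disk units queueing on a single flock.
  LOCK=/tmp/ceph-disk-demo.lock

  for i in $(seq 1 60); do
      # Each simulated unit gets 120 seconds, but all 60 wait on the same
      # lock, so the tail of the queue needs ~60 * 5 s, is killed by
      # timeout, and exits with status 124 -- that OSD never starts.
      timeout 120 flock "$LOCK" sh -c "sleep 5; echo unit $i activated" &
  done
  wait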

The workaround is simple and can be documented as suggested in the Documentation field (Doc Text). I applied it to the OSD machine you provided and all OSDs come up after a reboot.

http://tracker.ceph.com/issues/18740 is a better workaround since it works around the problem without modifying an installed file that will be overridden at the next installation. But it is not a good solution either, and it has not been backported either.

I think the right fix is to simply set a very large timeout (2 hours) and let the OSDs come up one after the other. I will propose a fix upstream in that direction.

Comment 12 shylesh 2017-06-09 08:35:55 UTC
@Loic,

Please move the bug once you have the fix. For now, I will move this to ASSIGNED.

Thanks,
Shylesh

Comment 13 Loic Dachary 2017-06-09 13:36:42 UTC
Thanks!

Comment 15 Loic Dachary 2017-06-12 15:13:43 UTC
Looks good to me!

Comment 17 Stephen Smith 2017-09-06 12:36:52 UTC
We're currently bumping into this and have also backported https://github.com/ceph/ceph/pull/13197, with no success (we have also increased the timeout to 7200 seconds, with no success). Is there any insight into what the root cause of the issue is, or perhaps a better workaround for it? It seems clear that it happens with larger numbers of OSDs in a given host.

Comment 18 Stephen Smith 2017-09-06 12:49:30 UTC
(In reply to Stephen Smith from comment #17)
> We're currently bumping into this and have also backported
> https://github.com/ceph/ceph/pull/13197 with no success (Also have increased
> to 7200 with no success). Is there any insight into what the root cause of
> the issue is or perhaps a better workaround for it? It seems clear that it
> happens with larger numbers of OSDs in a given host.

Just saw https://bugzilla.redhat.com/show_bug.cgi?id=1486099

Testing to see if this resolves the issue for us.

Comment 19 Stephen Smith 2017-09-06 12:55:04 UTC
Applying the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1486099 doesn't appear to have worked for us.

Comment 20 Stephen Smith 2017-09-06 13:09:01 UTC
Some additional information for our cluster: we have 60 x 8 TB OSDs per host, and this seems to happen only on the first one or two reboots of each host on a completely empty test cluster. After the first couple of reboots the OSDs appear to come up at the proper rate.

Comment 24 Loic Dachary 2017-09-25 15:39:45 UTC
For the record, I'm spending time on this issue, trying to find a new angle to diagnose the problem. We don't have more information, but it's worth revisiting the "cold case" ;-)

Comment 30 Matthew Vernon 2017-10-02 10:59:47 UTC
There was some discussion of this issue on ceph-users recently (we encountered this problem, but didn't know about this bugzilla entry). On a storage node with 60 x 6 TB disks, total startup time was around 18 minutes if all the filesystems were dirty (clean XFS startup is much quicker).

The problem (as you note in comment 11) is that there's still a single global lock for disk activation. Is this really necessary, or could you have a per-device lock there as well? That would let startup be parallelised.

Other problems:

The rationale for having a timeout at all was that this would result in useful error messages if an OSD startup wedges entirely. Actually, though, the timeout makes it much harder to diagnose startup issues: ceph-disk trigger calls ceph-disk activate with the "command" function, which uses subprocess.communicate(), so stdout and stderr from ceph-disk activate end up in buffers in the ceph-disk trigger process. When the timeout fires, they are entirely lost.

Relatedly, the systemd service file doesn't really tell you that the timeout firing is what killed the job - you have to intuit this fact from the 124 exit code. It would be better if the service file handled the timeout firing in a manner that produced explicit logging (probably some sort of shell runery in the ExecStart?).
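
One possible shape for that shell runery, sketched here as a standalone wrapper rather than the shipped ceph-disk@.service; the 7200-second value, the lock path, and the wrapper itself are illustrative assumptions, not the actual unit:

  #!/bin/sh
  # Hypothetical wrapper around "ceph-disk trigger" that logs explicitly when
  # the timeout is what killed the job, instead of leaving only exit status 124.
  dev="$1"

  timeout 7200 flock /var/lock/ceph-disk \
      /usr/sbin/ceph-disk --verbose trigger --sync "$dev"
  status=$?

  if [ "$status" -eq 124 ]; then
      # timeout(1) exits with 124 when it had to kill the command.
      logger -t ceph-disk-trigger "trigger for $dev was killed by the 7200s timeout"
  fi
  exit "$status"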

Regards,

Matthew
[we have RH support for our ceph install, if that helps you resource fixing on it :-) ]

Comment 31 Loic Dachary 2017-10-02 13:53:55 UTC
> Is this really necessary, or could you have a per-device lock there as well? That would let startup be parallelised

It would be a non-trivial change to make in ceph-disk. Fortunately, ceph-volume is going to deprecate ceph-disk and resolve this problem.

> The rationale for having a timeout at all was that this would result in useful error messages if an osd startup wedges entirely.

It was mainly to avoid blocking forever, which happened in the past.

> Actually, though, the timeout makes it much harder to diagnose startup issues...

I agree it is difficult to diagnose, because logs are scattered in various places depending on where they originate from (udev/systemd/ceph-disk/ceph-osd). The upcoming ceph-volume will greatly simplify this.
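
For reference, a few illustrative starting points while ceph-disk is still in use; the unit names and log path below are common defaults, not anything specific to this bug:

  journalctl -u 'ceph-disk@*'                       # ceph-disk trigger/activate output captured by systemd
  journalctl -u 'ceph-osd@*'                        # the OSD daemon units themselves
  udevadm monitor --udev --subsystem-match=block    # watch udev block-device events live
  less /var/log/ceph/ceph-osd.0.log                 # per-OSD daemon log (default path, OSD id 0 as an example)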

Comment 53 Vikhyat Umrao 2017-10-13 14:49:26 UTC
*** Bug 1490716 has been marked as a duplicate of this bug. ***

Comment 54 Vikhyat Umrao 2017-10-13 14:51:02 UTC
*** Bug 1439210 has been marked as a duplicate of this bug. ***

Comment 55 Vikhyat Umrao 2017-10-13 14:54:31 UTC
*** Bug 1486099 has been marked as a duplicate of this bug. ***

Comment 82 Loic Dachary 2017-10-23 15:38:20 UTC
I'm glad this was the right fix and we did *not* need a workaround!

Comment 83 Vikhyat Umrao 2017-10-23 15:43:37 UTC
(In reply to Loic Dachary from comment #82)
> I'm glad this was the right fix and we did *not* need a workaround!

Thanks Loic. Yes, it is the right fix. 

Adding the fix in a public comment for visibility: https://github.com/ceph/ceph/pull/15504/files, from the upstream jewel backport tracker http://tracker.ceph.com/issues/20151.

This is being backported on top of the 2.4 async release, 10.2.7-48.el7cp.

Comment 92 errata-xmlrpc 2018-02-21 19:43:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0340