Description of problem:
Customer is reporting that whenever OSD hosts are rebooted, the drives aren't all mounted and the OSDs don't all start.
Per Loic, this should already be fixed upstream via:
Version-Release number of selected component (if applicable):
$ grep -i ceph installed-rpms
ceph-base-10.2.5-37.el7cp.x86_64 Fri Mar 31 23:50:55 2017
ceph-common-10.2.5-37.el7cp.x86_64 Fri Mar 31 23:50:50 2017
ceph-osd-10.2.5-37.el7cp.x86_64 Fri Mar 31 23:50:56 2017
ceph-selinux-10.2.5-37.el7cp.x86_64 Fri Mar 31 23:50:50 2017
libcephfs1-10.2.5-37.el7cp.x86_64 Fri Mar 31 23:50:47 2017
python-cephfs-10.2.5-37.el7cp.x86_64 Fri Mar 31 23:50:47 2017
Fixed upstream in pull request:
Already in latest 2.3 compose
I think a good way to reliably reproduce this problem would be to set up a virtual machine and attach 50 disks to it. It should experience the same kind of delays during reboot, don't you think?
This is great! Investigating.
Here is what happens:
- 60 systemctl units for ceph-disk run at the same time
- each has a different lock so they are not blocked by each other (that is what http://tracker.ceph.com/issues/18060 is for)
- all units run a ceph-disk activate, which takes a global lock, so they effectively all wait for each other; there is lock contention anyway (meaning my suggestion of an upgrade is actually not a good suggestion, it does not improve things)
- it takes a lot longer than 2 minutes (the default timeout) to activate all 60 OSDs, so some of them time out and never launch
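The arithmetic behind that last point can be sketched with hypothetical numbers (the 10 s per-OSD figure is an assumption for illustration, not a measurement from the customer machine):

```shell
# Back-of-the-envelope: with a global activation lock, OSD starts serialize,
# so the last OSD waits for all the others. All numbers here are assumptions.
osds=60
per_osd=10      # assumed seconds per 'ceph-disk activate'
limit=120       # the default 2-minute timeout
total=$((osds * per_osd))
echo "last OSD activates after ${total}s"
if [ "$total" -gt "$limit" ]; then
    echo "activations past ${limit}s are killed by the timeout"
fi
```

Even with these conservative numbers, the last OSDs in line are well past the 2-minute limit.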
The workaround is simple and can be documented as suggested in the Documentation field (Doc Text). I applied it to the OSD machine you provided and all OSDs come up after a reboot.
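For reference, the workaround boils down to raising the `timeout` value wrapped around the trigger command in the unit file. A systemd drop-in along these lines (a sketch only; the ExecStart line below is illustrative, copy the exact one from your installed ceph-disk@.service) would keep the change out of the packaged file:

```ini
# /etc/systemd/system/ceph-disk@.service.d/50-timeout.conf (hypothetical path)
[Service]
# Clear the packaged ExecStart, then re-declare it with a much larger timeout
# (10800 s = 3 hours) so many serialized activations can finish.
ExecStart=
ExecStart=/bin/sh -c 'timeout 10800 /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
```

followed by a `systemctl daemon-reload`.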
http://tracker.ceph.com/issues/18740 is a better workaround since it can work around the problem without modifying an installed file that will be overridden at the next installation. But it is not a good solution either, and it has not been backported.
I think the right fix is to just set a very large timeout (2 hours) and let the OSD come up one after the other. I will propose a fix upstream in that direction.
Please move the bug once you have the fix. As of now I will move this to ASSIGNED.
Looks good to me!
We're currently bumping into this and have also backported https://github.com/ceph/ceph/pull/13197 with no success (we have also increased the timeout to 7200 with no success). Is there any insight into what the root cause of the issue is, or perhaps a better workaround for it? It seems clear that it happens with larger numbers of OSDs in a given host.
(In reply to Stephen Smith from comment #17)
> We're currently bumping into this and have also backported
> https://github.com/ceph/ceph/pull/13197 with no success (Also have increased
> to 7200 with no success). Is there any insight into what the root cause of
> the issue is or perhaps a better workaround for it? It seems clear that it
> happens with larger numbers of OSDs in a given host.
Just saw https://bugzilla.redhat.com/show_bug.cgi?id=1486099
Testing to see if this resolves the issue for us.
Applying the workaround from https://bugzilla.redhat.com/show_bug.cgi?id=1486099 doesn't appear to have worked for us.
Some additional information for our cluster: we have 60 8TB OSDs per host, and this seems to happen only on the first 1 or 2 reboots of each host on a completely empty test cluster. After the first couple of reboots the OSDs appear to come up at the proper rate.
For information, I'm spending time on this issue, trying to find a new angle to diagnose the problem. We don't have more information, but it's worth revisiting the "cold case" ;-)
There was some discussion of this issue on ceph-users recently (we encountered this problem, but didn't know about this bugzilla entry). On a storage node with 60 6TB disks, total startup time was around 18 minutes if all the filesystems were dirty (clean XFS startup is much quicker).
The problem (as you note in comment 11) is that there's still a single global lock for disk activation. Is this really necessary, or could you have a per-device lock there as well? That would let startup be parallelized.
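A per-device lock is easy to sketch with flock(1) (hypothetical, not what ceph-disk ships: the lock is keyed on the device basename instead of one global lock file, so distinct devices no longer serialize behind each other):

```shell
activate() {
    dev=$1
    # Hypothetical per-device lock: callers racing on the *same* device still
    # serialize, but activations of different devices proceed in parallel.
    lock="/tmp/ceph-disk-$(basename "$dev").lock"
    flock "$lock" sh -c "echo activated $dev"   # stand-in for 'ceph-disk activate'
}
activate /dev/sdb1
activate /dev/sdc1
```

With a scheme like this, 60 devices would contend only on the per-device locks rather than one global one.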
The rationale for having a timeout at all was that it would produce useful error messages if an OSD startup wedges entirely. Actually, though, the timeout makes it much harder to diagnose startup issues: ceph-disk trigger calls ceph-disk activate via the "command" function, which uses subprocess.communicate(), so stdout and stderr from ceph-disk activate end up in buffers in the ceph-disk trigger process. When the timeout fires, they are entirely lost.
Relatedly, the systemd service file doesn't really tell you that the timeout is what killed the job; you have to infer this from the 124 exit code. It would be better if the service file handled the timeout firing in a manner that produced explicit logging (probably some sort of shell trickery in the ExecStart?).
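A wrapper along these lines (a sketch, not the shipped unit) would make the kill explicit in the journal instead of leaving only the bare 124 status:

```shell
run_with_timeout() {
    # Hypothetical ExecStart wrapper: run the command under timeout(1) and
    # emit an explicit message when the timeout (exit status 124) fires.
    limit=$1; shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "trigger timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Demo: a 1-second limit on a 5-second sleep trips the timeout path.
run_with_timeout 1 sleep 5 || echo "exit=$?"
```

Because the message is printed by the wrapper itself, it lands in the unit's journal even though the wrapped command's buffered output is lost.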
[we have RH support for our ceph install, if that helps you resource fixing on it :-) ]
> Is this really necessary, or could you have a per-device lock there as well? That would let startup be parallelized
It would be a non-trivial change to make in ceph-disk. Fortunately, ceph-volume is going to deprecate ceph-disk and resolve this problem.
> The rationale for having a timeout at all was that this would result in useful error messages if an osd startup wedges entirely.
It was mainly to avoid blocking forever, which happened in the past.
> Actually, though, timeout makes it much harder to diagnose startup issues...
I agree it is difficult to diagnose, because logs are scattered in various places depending on where they originate from (udev/systemd/ceph-disk/ceph-osd). The upcoming ceph-volume will greatly simplify this.
*** Bug 1490716 has been marked as a duplicate of this bug. ***
*** Bug 1439210 has been marked as a duplicate of this bug. ***
*** Bug 1486099 has been marked as a duplicate of this bug. ***
I'm glad this was the right fix and we did *not* need a workaround!
(In reply to Loic Dachary from comment #82)
> I'm glad this was the right fix and we did *not* need a workaround!
Thanks Loic. Yes, it is the right fix.
Adding fix in public comment for visibility - https://github.com/ceph/ceph/pull/15504/files from upstream jewel backport tracker - http://tracker.ceph.com/issues/20151.
This is getting backported on top of 2.4 async - 10.2.7-48.el7cp.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.