Bug 1326740
Summary: | ceph-disk@dev-sd<>2.service is created on all OSD nodes, and it is in a failed state | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Tejas <tchandra>
Component: | Ceph-Disk | Assignee: | Boris Ranto <branto>
Status: | CLOSED ERRATA | QA Contact: | Tejas <tchandra>
Severity: | medium | Docs Contact: | Bara Ancincova <bancinco>
Priority: | high | |
Version: | 2.0 | CC: | adeza, branto, ceph-eng-bugs, hnallurv, icolle, kdreyer, ldachary, mbukatov, nthomas, sankarshan, tchandra, uboppana, vakulkar, vashastr
Target Milestone: | rc | |
Target Release: | 2.1 | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | RHEL: ceph-10.2.3-2.el7cp; Ubuntu: ceph_10.2.3-3redhat1xenial | Doc Type: | Bug Fix
Doc Text: | The `ceph-disk` service no longer fails because of limited resources. After installation of a new Ceph storage cluster, failed instances of the `ceph-disk` service appeared because the service was started twice: once to activate the data partition, and once to activate the journal partition. After the disk activation, one of these instances failed because of limited resources. With this update, the instance without the resources terminates with the `0` exit status and an informative message is returned. | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2016-11-22 19:25:15 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1322504, 1383917 | |
Description (Tejas, 2016-04-13 11:59:27 UTC)
Loic, can you comment on what is going on here? Thank you, G.

Is it still an issue with 10.2.0?

Yes, we are still seeing this issue in ceph 10.2.0. Loic, what would cause this?

> Is this service required?

Yes.

> If so, why is it in a failed state?

Here is what happens at boot time (or after a new OSD is prepared and activated):

Scenario a)

* udev runs the service to activate /dev/sda1, which is the data partition of the OSD, but /dev/sdb1, which is the journal partition, is not up yet, and the service fails.
* udev runs the service to activate /dev/sdb1, which is the journal partition of the OSD, and since the data partition is already up, it succeeds and the OSD is ready to be used.

Scenario b)

It is the same as Scenario a), except the journal partition shows up first and the data partition after it.

Either way, one of the two services is going to fail because the resources are only partially available (one disk out of two). The more general case is a little more involved, but it is not substantially different. Please let me know if that explanation makes sense to you. I thought at first that it was really failing and looked hard to find the bug causing it :-)

Expected behavior, so moving out of 2.0. However, we should plan on a better way to handle this scenario for the user in a future 2.z release.

Loic, on many setups I consistently see only scenario a) that you mentioned in comment 9. It happens on both Ubuntu and RHEL. Is there anything we can do to fix this in 2.0? It looks pretty bad even though it is display-only, and it makes the user think that disk activation has failed.

I created an issue upstream to work on this.

This could be fixed if we added 1 as an acceptable exit code for the ceph-disk command, e.g. by adding this line to the service file for ceph-disk:

SuccessExitStatus=1

We would just be ignoring the error, though.

As mentioned by Vasu in comment 11, it gives a bad user experience. Request this to be fixed in 2.0. Adding it to "Bug 1343229 - [Tracker] engineering work required for 2.0 GA".

Boris, SuccessExitStatus=1 would do the trick, but it would also hide legitimate errors. It is a judgement call. Cheers.

It looks good to me. Upstream PR: https://github.com/ceph/ceph/pull/9943

We should be able to fix this in the next async errata.

*** Bug 1346308 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2815.html
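For readers who hit the same symptom, the leftover failed instances described in scenarios a) and b) above can be observed and cleared with standard `systemctl` commands. This is only a minimal sketch; the device name `dev-sdb2` is an illustrative placeholder, not taken from this report.

```
# List any failed ceph-disk activation instances on an OSD node
systemctl list-units --failed 'ceph-disk@*'

# Inspect one specific instance (device name is a placeholder example)
systemctl status 'ceph-disk@dev-sdb2.service'

# Per the discussion above the failure is cosmetic; the state can be cleared with
systemctl reset-failed 'ceph-disk@*'
```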
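The `SuccessExitStatus=1` idea discussed in the comments could be applied locally as a systemd drop-in, as sketched below. This is not the fix that shipped in ceph-10.2.3-2.el7cp (which, per the Doc Text, makes the losing instance exit with status `0` and an informative message); the drop-in file name is arbitrary, and as noted above this approach also hides legitimate errors that exit with status 1.

```
# Sketch of the workaround only: a drop-in applying to every ceph-disk@<device>
# instance on this node (file name is arbitrary)
mkdir -p /etc/systemd/system/ceph-disk@.service.d
cat > /etc/systemd/system/ceph-disk@.service.d/50-success-exit.conf <<'EOF'
[Service]
# Treat exit status 1 as success so the "losing" activation instance is not
# reported as failed. Caveat: real errors that exit with 1 are hidden too.
SuccessExitStatus=1
EOF
systemctl daemon-reload
```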