Bug 1273127
Summary: Backport tracker 12738 - OSD reboots every few minutes with FAILED assert(clone_size.count(clone))

Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Mike Hackett <mhackett>
Component: RADOS
Assignee: David Zafman <dzafman>
Status: CLOSED ERRATA
QA Contact: shylesh <shmohan>
Severity: medium
Docs Contact: Bara Ancincova <bancinco>
Priority: medium
Version: 1.2.3
CC: ceph-eng-bugs, dzafman, flucifre, hnallurv, jdillama, kchai, kdreyer, shmohan, sjust, sweil, tserlin, vumrao
Target Milestone: rc
Target Release: 1.3.3
Hardware: All
OS: Linux
Fixed In Version: RHEL: ceph-0.94.7-5.el7cp; Ubuntu: ceph_0.94.7-3redhat1trusty
Doc Type: Bug Fix
Doc Text:
.OSDs no longer reboot when corrupted snapsets are found during scrubbing
Previously, Ceph incorrectly handled corrupted snapsets found during scrubbing, which caused the affected OSD daemons to terminate unexpectedly every time such a snapset was detected. As a consequence, the OSDs rebooted every few minutes. With this update, the underlying source code has been modified, and OSDs no longer reboot in the described situation.
Last Closed: 2016-09-29 12:54:51 UTC
Type: Bug
Bug Depends On: 1335269
Bug Blocks: 1348597, 1372735
Description
Mike Hackett 2015-10-19 16:53:12 UTC

David, would you please provide the reproduction steps for QE to verify the fix here? To trigger the crash, it looks like you have to corrupt a snapset in a particular way and then issue a scrub command to the OSD?

Once the fix is available, you can use ceph-objectstore-tool to test. This undocumented feature of the tool is part of the same code that fixes the problem.

Create an object with one or more snapshots, then list it and take the JSON for the head object, which has "snapid": -2:

    ceph-objectstore-tool --data-path XXXX --journal-path XXXX --op list name-of-object

Clear the clone_size map from the head object's snapset:

    ceph-objectstore-tool --data-path XXXX --journal-path XXXX 'JSON' clear-snapset clone_size

To reproduce without the fix, you can use get-xattr snapset on an object without any snapshots and then set-xattr snapset to corrupt the head object of an object that does have snapshots:

    ceph-objectstore-tool --data-path XXXX --journal-path XXXX 'JSON-NOSNAPSOBJ' get-xattr snapset > saved.snapset
    ceph-objectstore-tool --data-path XXXX --journal-path XXXX 'JSON' set-xattr snapset saved.snapset

(Rough end-to-end sketches of both procedures appear at the end of this report.)

This is fixed in Infernalis, but by a long series of scrub changes that we'd prefer not to backport. https://github.com/ceph/ceph/pull/7702 is the backport to Hammer that has passed my testing.

Hammer backport tracker: http://tracker.ceph.com/issues/14077

The fix is in 0.94.7.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1972.html
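A rough end-to-end sketch of the "reproduce without the fix" procedure above. The pool name (rbd), object names (withsnaps, nosnaps), OSD id (0), filesystem paths, and pg id are all illustrative placeholders, not values from this bug:

    #!/bin/bash
    # Sketch only: placeholder pool/object/osd names throughout.
    set -e

    OSD_DATA=/var/lib/ceph/osd/ceph-0        # assumed OSD data path
    OSD_JOURNAL=$OSD_DATA/journal            # assumed journal path

    # ceph-objectstore-tool needs the OSD offline.
    service ceph stop osd.0                  # or however osd.0 is managed

    # Grab the head-object JSON ("snapid": -2) for an object WITH snapshots
    # and for one WITHOUT any (exact JSON spacing can vary by version).
    SNAP_JSON=$(ceph-objectstore-tool --data-path "$OSD_DATA" \
        --journal-path "$OSD_JOURNAL" --op list withsnaps | grep '"snapid": *-2')
    NOSNAP_JSON=$(ceph-objectstore-tool --data-path "$OSD_DATA" \
        --journal-path "$OSD_JOURNAL" --op list nosnaps | grep '"snapid": *-2')

    # Save the snapset of the snapshot-free object and write it over the
    # snapset of the object that has snapshots, so its clones no longer
    # have clone_size entries.
    ceph-objectstore-tool --data-path "$OSD_DATA" --journal-path "$OSD_JOURNAL" \
        "$NOSNAP_JSON" get-xattr snapset > saved.snapset
    ceph-objectstore-tool --data-path "$OSD_DATA" --journal-path "$OSD_JOURNAL" \
        "$SNAP_JSON" set-xattr snapset saved.snapset

    # Bring the OSD back and scrub the object's PG; without the fix the
    # primary OSD hits FAILED assert(clone_size.count(clone)) and aborts.
    service ceph start osd.0
    ceph osd map rbd withsnaps               # read the pg id off this output
    ceph pg scrub 0.35                       # placeholder pg id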
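And a similar sketch of the post-fix verification path using the undocumented clear-snapset operation, under the same placeholder assumptions. Based on the doc text above, with the fixed packages the scrub should report the corruption rather than crash the OSD:

    #!/bin/bash
    # Sketch only: inject the corruption with clear-snapset on a fixed
    # build and confirm the scrub survives it. Same placeholders as above.
    set -e

    OSD_DATA=/var/lib/ceph/osd/ceph-0
    OSD_JOURNAL=$OSD_DATA/journal

    service ceph stop osd.0                  # tool needs the OSD offline

    # Head-object JSON ("snapid": -2) for an object with snapshots.
    JSON=$(ceph-objectstore-tool --data-path "$OSD_DATA" \
        --journal-path "$OSD_JOURNAL" --op list withsnaps | grep '"snapid": *-2')

    # Drop the clone_size map from the head object's snapset.
    ceph-objectstore-tool --data-path "$OSD_DATA" --journal-path "$OSD_JOURNAL" \
        "$JSON" clear-snapset clone_size

    # With the fix, expect scrub errors in "ceph -s" and the OSD log
    # instead of a FAILED assert(clone_size.count(clone)) crash.
    service ceph start osd.0
    ceph osd map rbd withsnaps               # read the pg id off this output
    ceph pg scrub 0.35                       # placeholder pg id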