Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
This project is now read‑only. Starting Monday, February 2, please use https://ibm-ceph.atlassian.net/ for all bug tracking management.

Bug 1351320

Summary: OSD's assert during snap trim osd/ReplicatedPG.cc: 2655: FAILED assert(0)
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Mike Hackett <mhackett>
Component: RADOSAssignee: David Zafman <dzafman>
Status: CLOSED UPSTREAM QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 1.3.2CC: ceph-eng-bugs, dzafman, flucifre, kchai, kdreyer, nlevine, olim, vumrao
Target Milestone: rc   
Target Release: 1.3.3   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-07-18 14:20:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Mike Hackett 2016-06-29 17:48:20 UTC
Description of problem:

Was running Ceph 94.3 previously and was encountering issue with OSDs asserting due to snapset corruption during scrubbing (https://bugzilla.redhat.com/show_bug.cgi?id=1273127). Updated to Ceph 94.7 as belief was snapset corruption was caused by creating and/or deleting rbd snapshots during pg splitting. This use model creates and deletes thousands of rbd snapshots per day and they had very recently split pgs when this snapset corruption originally started happening.

The 0.94.7 upgrade allowed scrubbing to happen and marked the pgs inconsistent instead. (https://github.com/ceph/ceph/pull/7702) was then able to track down the inconsistencies and resolve them, so all of the pgs are now consistent and scrubbable. The issue is now seeing OSD's segfault during snap trimming.

OSD Assert for OSD.234:

2016-06-27 08:08:16.909337 7f19777c0700 -1 osd/ReplicatedPG.cc: In function 'ReplicatedPG::RepGather* ReplicatedPG::trim_object(const hobject_t&)' thread 7f19777c0700 time 2016-06-27 08:08:16.903355
osd/ReplicatedPG.cc: 2655: FAILED assert(0)

ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbb1fab]
 2: (ReplicatedPG::trim_object(hobject_t const&)+0x1e4) [0x85bb64]
 3: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x427) [0x85e287]
 4: (boost::statechart::simple_state&lt;ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list&lt;mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na&gt;, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb4) [0x8bf1f4]
 5: (boost::statechart::state_machine&lt;ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator&lt;void&gt;, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5f) [0x8ab92f]
 6: (ReplicatedPG::snap_trimmer()+0x52c) [0x82f7fc]
 7: (OSD::SnapTrimWQ::_process(PG*)+0x1a) [0x6c43aa]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xba2a0e]
 9: (ThreadPool::WorkThread::entry()+0x10) [0xba3ab0]
 10: (()+0x8182) [0x7f199ee58182]
 11: (clone()+0x6d) [0x7f199d3c347d]
 NOTE: a copy of the executable, or `objdump -rdS &lt;executable&gt;` is needed to interpret this.

--- begin dump of recent events ---

Logs are located: https://api.access.redhat.com/rs/cases/01658829/attachments/aa33247b-c123-4085-a276-f9b81c3e83a7


Version-Release number of selected component (if applicable):
Ceph 94.7 
But also confirmed at RHCS 1.3.2

Comment 4 Mike Hackett 2016-06-29 18:02:05 UTC
Snaptrimq for pg 0.1ef1 reports:

trim_objectcould not find coid 0/bddc9ef/rbd_data.b77eb164a531e5.0000000000004fdf/1e73e

Comment 5 Mike Hackett 2016-06-29 19:49:55 UTC
Reviewing the provided OSD log for OSD.234 we can see that we are actively trying to trim the snapshots and it cannot because there is an object that is corrupted for some reason which leads to the OSD taking an assert.

***Review***

In the OSD logs we can see the following occurring:

On PG 0.1ef1 we start snaptrimming for object rbd_data.b77eb164a531e5.0000000000004fdf:

SnapTrimmer state<Trimming/TrimmingObjects>: TrimmingObjects react trimming 0/bddc9ef1/rbd_data.b77eb164a531e5.0000000000004fdf/1e73e
    -3> 2016-06-27 08:11:06.781905 7f08f1268700 10 osd.234 pg_epoch: 547937 pg[0.1ef1( v 547936'1181836 (543861'1178763,547936'1181836] local-les=547936 n=71899 ec=1 les/c 547936/547936 547935/547935/547935) [234,259,19] r=0 lpr=547935 crt=547932'1181831 lcod 547932'1181834 mlcod 547932'1181834 active+clean

This PG has an incredibly large snapq so I have not listed this (just a note).

When trimming we are attempting to obtain the object info and are unable to find, It's trying to getattr OI_ATTR which is the object info attr.:

get_object_context: obc NOT found in cache: 0/bddc9ef1/rbd_data.b77eb164a531e5.0000000000004fdf/1e73e
    -2> 2016-06-27 08:11:06.784323 7f08f1268700 10 osd.234 pg_epoch: 547937 pg[0.1ef1( v 547936'1181836 (543861'1178763,547936'1181836] local-les=547936 n=71899 ec=1 les/c 547936/547936 547935/547935/547935) [234,259,19] r=0 lpr=547935 crt=547932'1181831 lcod 547932'1181834 mlcod 547932'1181834 active+clean

We report that no object info can be obtained. 

get_object_context: no obc for soid 0/bddc9ef1/rbd_data.b77eb164a531e5.0000000000004fdf/1e73e and !can_create
    -1> 2016-06-27 08:11:06.786646 7f08f1268700 -1 osd.234 pg_epoch: 547937 pg[0.1ef1( v 547936'1181836 (543861'1178763,547936'1181836] local-les=547936 n=71899 ec=1 les/c 547936/547936 547935/547935/547935) [234,259,19] r=0 lpr=547935 crt=547932'1181831 lcod 547932'1181834 mlcod 547932'1181834 active+clean

snaptrim on object fails:

trim_objectcould not find coid 0/bddc9ef1/rbd_data.b77eb164a531e5.0000000000004fdf/1e73e

And we assert the OSD.

***Recovery***

Our recovery actions after discussion with engineering for this issue would be the following steps:

1. OSD.234 is the acting primary OSD for PG 0.1ef1, on OSD node hosting OSD.234

2. Look in /var/lib/ceph/osd/ceph-234/current/ for 0.1ef1 to validate object is present in filesystem.

'sudo find /var/lib/ceph/osd/ceph-234/current/0.1ef1_head/ -name 'rbd\\udata.b77eb164a531e5.0000000000004fdf*' -ls'

***NOTE***
IF object is present continue on with action plan (steps 3-5), if it is not then please update as then we have an issue with the snapmapper pointing to an invalid object and NOT an issue with a corrupt/missing object info.

3. Look into a window when snaptrimming is not occurring or turn off snaptrimming

4. On OSD.234 run the following:

'ceph pg repair 0.1ef1'

validate we repair successfully with pg repair command, the repair should tell us.

5. Turn on snaptrimming (if it was turned off previously)

Comment 6 Mike Hackett 2016-06-29 21:06:56 UTC
setting config value osd_pg_max_concurrent_snap_trims to 0 will disable snap trims.  The default is 2

Comment 8 Mike Hackett 2016-06-30 13:49:55 UTC
Issue looks to be: http://tracker.ceph.com/issues/13837