Bug 1404419

Summary: map gap causes inconsistent pool.cached_removed_snaps structure
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Samuel Just <sjust>
Component: RADOS
Assignee: Samuel Just <sjust>
Status: CLOSED ERRATA
QA Contact: shylesh <shmohan>
Severity: medium
Docs Contact:
Priority: low
Version: 2.1
CC: ceph-eng-bugs, dzafman, hnallurv, jdurgin, kchai, kdreyer, sjust, uboppana
Target Milestone: rc   
Target Release: 2.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: RHEL: ceph-10.2.5-7.el7cp Ubuntu: ceph_10.2.5-3redhat1xenial
Doc Type: No Doc Update
Doc Text:
Fixes a bug that could crash an OSD when it is restarted after being stopped for a very long time.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-14 15:47:41 UTC
Type: Bug

Description Samuel Just 2016-12-13 19:46:59 UTC
Description of problem:

Restarting an OSD after it has been down for a long period can leave the pool.cached_removed_snaps structure in an inconsistent state.

Version-Release number of selected component (if applicable):


How reproducible:

Not very

Steps to Reproduce:
1. Create a cluster with 4 OSDs
2. Create a pool
3. Create an rbd image on that pool
4. Create and remove a snapshot 1000 times
5. Stop one of the OSDs and mark it out
6. Let the cluster go clean
7. Create and remove a snapshot 2000 times
8. Bring the OSD up
9. Mark the OSD in
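
The steps above could be scripted roughly as follows. This is a minimal sketch, not the reporter's actual procedure: the pool name, image name, PG count, image size, OSD id, snapshot names, and the `ceph-osd@<id>` systemd unit are all assumptions, and step 1 (a running cluster with 4 OSDs) is taken as already done. The functions only build command lists; nothing touches a cluster unless `run()` is called.

```python
import subprocess

# All of these names and values are hypothetical placeholders.
POOL = "snapgap"
IMAGE = "img"
OSD_ID = 3
PG_NUM = 64


def snap_churn(count, start=0):
    """Commands that create and then remove `count` snapshots of the image."""
    cmds = []
    for i in range(start, start + count):
        spec = f"{POOL}/{IMAGE}@s{i}"
        cmds.append(["rbd", "snap", "create", spec])
        cmds.append(["rbd", "snap", "rm", spec])
    return cmds


def before_outage():
    """Steps 2-5: pool, image, 1000 snapshot cycles, then stop and out one OSD."""
    return [
        ["ceph", "osd", "pool", "create", POOL, str(PG_NUM)],
        ["rbd", "create", "--size", "1024", f"{POOL}/{IMAGE}"],
    ] + snap_churn(1000) + [
        ["systemctl", "stop", f"ceph-osd@{OSD_ID}"],  # assumes a systemd host
        ["ceph", "osd", "out", str(OSD_ID)],
    ]


def after_clean():
    """Steps 7-9: 2000 more snapshot cycles, then bring the OSD up and in."""
    return snap_churn(2000, start=1000) + [
        ["systemctl", "start", f"ceph-osd@{OSD_ID}"],
        ["ceph", "osd", "in", str(OSD_ID)],
    ]


def run(cmds):
    """Execute each command in order, stopping at the first failure."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)
```

Step 6 is deliberately left manual in this sketch: after `run(before_outage())`, wait until the cluster reports all PGs clean (for example by watching `ceph -s`) before calling `run(after_clean())`.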

Actual results:

The OSD is likely to crash with a backtrace in interval_map.
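
As a toy illustration of how an inconsistent cached set can end in an abort rather than a recoverable error, here is a minimal sketch. It assumes the crash comes from a containment assertion hit while subtracting one interval set from another that does not fully contain it. The class, the `newly_removed` helper, and the ranges used with them are hypothetical Python, not Ceph's implementation.

```python
class IntervalSet:
    """Toy set of integers stored as disjoint [start, end) ranges."""

    def __init__(self, ranges=()):
        self.ranges = sorted(ranges)

    def contains(self, start, end):
        """True if [start, end) lies entirely inside one stored range."""
        return any(s <= start and end <= e for s, e in self.ranges)

    def erase(self, start, end):
        """Strict erase: refuses ranges that are not fully present."""
        if not self.contains(start, end):
            raise AssertionError(f"[{start},{end}) not contained")
        out = []
        for s, e in self.ranges:
            if s <= start and end <= e:
                # Keep whatever is left on either side of the erased span.
                if s < start:
                    out.append((s, start))
                if end < e:
                    out.append((end, e))
            else:
                out.append((s, e))
        self.ranges = out

    def subtract(self, other):
        """Erase every range of `other`; asserts if any is missing."""
        for s, e in other.ranges:
            self.erase(s, e)


def newly_removed(current, cached):
    """Ranges in `current` but not in `cached`; cached must be a subset."""
    diff = IntervalSet(current.ranges)
    diff.subtract(cached)
    return diff.ranges
```

With a cached set that is a subset of the current one, for example `newly_removed(IntervalSet([(1, 3001)]), IntervalSet([(1, 1001)]))`, the helper returns the remaining range. If the cached set holds a range the current set lacks, the strict erase raises `AssertionError`, which is the toy analogue of the abort in the backtrace.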

Expected results:

There is no crash.

Additional info:

Restarting the OSD works around the issue.

-16> 2016-05-19 14:51:36.794141 7fbb03926700 10 filestore(/var/lib/ceph/osd/ceph-5) _do_transaction on 0x7fbb1e61ee00
-15> 2016-05-19 14:51:36.794139 7fbb03125700 10 filestore oid: #1:88000000::::head# not skipping op, spos 3617.0.0
-14> 2016-05-19 14:51:36.794146 7fbb03125700 10 filestore > header.spos 0.0.0
-13> 2016-05-19 14:51:36.794139 7fbaf3f7d700 20 osd.5 pg_epoch: 290 pg[1.15( v 196'73 (0'0,196'73] local-les=288 n=1 ec=22 les/c/f 288/288/0 287/287/277) [1,5] r=1 lpr=287 pi=8-286/5 luod=0'0 crt=196'73 lcod 0'0 active NIBBLEWISE] agent_stop
-12> 2016-05-19 14:51:36.794121 7fbaee772700 -1 ** Caught signal (Aborted) **
in thread 7fbaee772700 thread_name:tp_osd_tp

ceph version 10.2.0-1069-g3362c8d (3362c8dd2718b1ff61a18bc7f49474e6808c2fc7)
1: (()+0x904ca2) [0x7fbb120eaca2]
2: (()+0x10340) [0x7fbb1048f340]
3: (gsignal()+0x39) [0x7fbb0e4f1cc9]
4: (abort()+0x148) [0x7fbb0e4f50d8]
5: (()+0x2fb86) [0x7fbb0e4eab86]
6: (()+0x2fc32) [0x7fbb0e4eac32]
7: (ReplicatedPG::WaitingOnReplicas::react(ReplicatedPG::SnapTrim const&)+0xf79) [0x7fbb11d04ce9]
8: (boost::statechart::simple_state<ReplicatedPG::WaitingOnReplicas, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb4) [0x7fbb11d34264]
9: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x12b) [0x7fbb11d20afb]
10: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x84) [0x7fbb11d20cc4]
11: (ReplicatedPG::snap_trimmer(unsigned int)+0x46b) [0x7fbb11c9f73b]
12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8e3) [0x7fbb11b7c1b3]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x7fbb121d68f7]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fbb121d8820]
15: (()+0x8182) [0x7fbb10487182]
16: (clone()+0x6d) [0x7fbb0e5b547d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

10.2.4 has an incorrect fix for this bug (though it also has some other patches that make the bug very unlikely to cause a crash).

Comment 2 Samuel Just 2016-12-13 19:47:30 UTC
original incorrect fix: https://github.com/ceph/ceph/pull/9236

Comment 3 Samuel Just 2017-01-04 21:47:54 UTC
https://github.com/ceph/ceph/pull/12791

Comment 4 Samuel Just 2017-01-11 21:58:06 UTC
in ceph-2-rhel-patches

Comment 12 errata-xmlrpc 2017-03-14 15:47:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html