Description of problem:

Restarting an OSD after a long period of being down can cause an inconsistent pool.cached_removed_snaps structure.

Version-Release number of selected component (if applicable):

How reproducible:
Not very

Steps to Reproduce:
1. Create a cluster with 4 OSDs
2. Create a pool
3. Create an RBD image in that pool
4. Create and remove a snapshot 1000 times
5. Stop one of the OSDs and mark it out
6. Let the cluster go clean
7. Create and remove a snapshot 2000 times
8. Bring the OSD back up
9. Mark the OSD in

(A scripted sketch of these steps is included below, after the backtrace.)

Actual results:
The OSD is likely to crash with a backtrace in interval_map.

Expected results:
There is no crash.

Additional info:
Restarting the OSD works around the issue.

   -16> 2016-05-19 14:51:36.794141 7fbb03926700 10 filestore(/var/lib/ceph/osd/ceph-5) _do_transaction on 0x7fbb1e61ee00
   -15> 2016-05-19 14:51:36.794139 7fbb03125700 10 filestore oid: #1:88000000::::head# not skipping op, spos 3617.0.0
   -14> 2016-05-19 14:51:36.794146 7fbb03125700 10 filestore > header.spos 0.0.0
   -13> 2016-05-19 14:51:36.794139 7fbaf3f7d700 20 osd.5 pg_epoch: 290 pg[1.15( v 196'73 (0'0,196'73] local-les=288 n=1 ec=22 les/c/f 288/288/0 287/287/277) [1,5] r=1 lpr=287 pi=8-286/5 luod=0'0 crt=196'73 lcod 0'0 active NIBBLEWISE] agent_stop
   -12> 2016-05-19 14:51:36.794121 7fbaee772700 -1 ** Caught signal (Aborted) **
 in thread 7fbaee772700 thread_name:tp_osd_tp

 ceph version 10.2.0-1069-g3362c8d (3362c8dd2718b1ff61a18bc7f49474e6808c2fc7)
 1: (()+0x904ca2) [0x7fbb120eaca2]
 2: (()+0x10340) [0x7fbb1048f340]
 3: (gsignal()+0x39) [0x7fbb0e4f1cc9]
 4: (abort()+0x148) [0x7fbb0e4f50d8]
 5: (()+0x2fb86) [0x7fbb0e4eab86]
 6: (()+0x2fc32) [0x7fbb0e4eac32]
 7: (ReplicatedPG::WaitingOnReplicas::react(ReplicatedPG::SnapTrim const&)+0xf79) [0x7fbb11d04ce9]
 8: (boost::statechart::simple_state<ReplicatedPG::WaitingOnReplicas, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb4) [0x7fbb11d34264]
 9: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x12b) [0x7fbb11d20afb]
 10: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x84) [0x7fbb11d20cc4]
 11: (ReplicatedPG::snap_trimmer(unsigned int)+0x46b) [0x7fbb11c9f73b]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x8e3) [0x7fbb11b7c1b3]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x877) [0x7fbb121d68f7]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fbb121d8820]
 15: (()+0x8182) [0x7fbb10487182]
 16: (clone()+0x6d) [0x7fbb0e5b547d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

10.2.4 has an incorrect fix for this bug (though it also has some other patches that make the bug very unlikely to cause a crash).
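For convenience, a rough scripted version of the reproduction steps. This is only a sketch: the pool name (testpool), image name (img), snapshot prefix, OSD id (1), pg count, and the systemd unit names are assumptions and will need to be adapted to the actual cluster.

#!/bin/bash
set -e

# Steps 1-3: assumes the 4-OSD cluster already exists; create a pool and an RBD image
ceph osd pool create testpool 64
rbd create testpool/img --size 1024

# Step 4: create and remove a snapshot 1000 times
for i in $(seq 1 1000); do
    rbd snap create testpool/img@s$i
    rbd snap rm testpool/img@s$i
done

# Steps 5-6: stop one OSD, mark it out, and wait (roughly) for the cluster to go clean
systemctl stop ceph-osd@1
ceph osd out 1
until ceph health | grep -q HEALTH_OK; do sleep 10; done

# Step 7: create and remove a snapshot 2000 more times while the OSD is down
for i in $(seq 1001 3000); do
    rbd snap create testpool/img@s$i
    rbd snap rm testpool/img@s$i
done

# Steps 8-9: bring the OSD back up and mark it in; the snap-trim crash should follow shortly
systemctl start ceph-osd@1
ceph osd in 1

The workaround noted under "Additional info" amounts to simply starting the crashed OSD again (e.g. systemctl restart ceph-osd@1).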
Original incorrect fix: https://github.com/ceph/ceph/pull/9236
Replacement fix: https://github.com/ceph/ceph/pull/12791 (now in ceph-2-rhel-patches)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0514.html