Bug 1398867

Summary: [Escalation] Cache-tiering, OSDs in hot-storage asserting with osd/ReplicatedPG.cc: 10521: FAILED assert(obc) in hit_set_trim
Product: Red Hat Ceph Storage
Component: RADOS
Version: 1.3.3
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: urgent
Priority: urgent
Reporter: Kyle Squizzato <ksquizza>
Assignee: Samuel Just <sjust>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: ceph-eng-bugs, dzafman, kchai, skinjo, vumrao
Target Milestone: rc
Target Release: 1.3.4
Last Closed: 2016-11-27 18:36:04 UTC
Type: Bug

Description Kyle Squizzato 2016-11-26 20:02:16 UTC
Description of problem:
The customer recently started experiencing OSD failures/flapping on all of the OSDs in their hot-storage pool.

The assertions appear to have begun after backfill was restarted. The customer states the issue started after they increased pg_num/pgp_num, and that they increased these counts while the cluster was in recovery, not active+clean.
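For reference, the pg count increase the customer describes is presumably something along these lines (a sketch only; the pool name and target count are placeholders, not the customer's actual values):

  # substitute the affected pool and the new placement group count
  ceph osd pool set <pool> pg_num <new-count>
  ceph osd pool set <pool> pgp_num <new-count>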

The hot-storage chassis are as follows (ceph osd tree excerpt): 

-50   4.96991     chassis CH2
63   0.70999         osd.63                    up  1.00000          1.00000
65   0.70999         osd.65                    up  1.00000          1.00000
66   0.70999         osd.66                    up  1.00000          1.00000
67   0.70999         osd.67                  down        0          1.00000
47   0.70999         osd.47                    up  1.00000          1.00000
92   0.70998         osd.92                    up  1.00000          1.00000
93   0.70998         osd.93                    up  1.00000          1.00000
-51   4.96991     chassis CH3
54   0.70999         osd.54                    up  1.00000          1.00000
43   0.70999         osd.43                  down        0          1.00000
44   0.70999         osd.44                    up  1.00000          1.00000
69   0.70999         osd.69                    up  1.00000          1.00000
46   0.70999         osd.46                    up  1.00000          1.00000
91   0.70998         osd.91                    up  1.00000          1.00000
94   0.70998         osd.94                    up  1.00000          1.00000
Today alone, the following OSDs have failed: 

osd.44
osd.46
osd.66
osd.67
osd.69
osd.91
osd.92
osd.93
osd.94
During the failures, an OSD will flap, hitting the same assertion repeatedly in quick succession; for example, osd.94: 

2016-11-26 05:38:48.103200 mon.0 10.24.19.25:6789/0 3180044 : cluster [INF] osd.94 10.24.19.45:6800/13817 failed (6 reports from 6 peers after 21.000236 >= grace 20.000000)
2016-11-26 05:39:55.460343 mon.0 10.24.19.25:6789/0 3180306 : cluster [INF] osd.94 10.24.19.45:6802/27971 failed (7 reports from 6 peers after 22.000328 >= grace 20.000000)
2016-11-26 05:41:37.640188 mon.0 10.24.19.25:6789/0 3180749 : cluster [INF] osd.94 10.24.19.45:6800/28930 failed (8 reports from 6 peers after 20.000126 >= grace 20.000000)
2016-11-26 05:48:22.906236 mon.0 10.24.19.25:6789/0 3182655 : cluster [INF] osd.94 10.24.19.45:6800/3444 failed (6 reports from 6 peers after 22.413912 >= grace 20.000000)
The assertion in greater detail (a full recent events dump is attached to this BZ, and the OSD logs are attached to the case):
2016-11-26 05:39:32.987897 7fe446ec7700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)' thread 7fe446ec7700 time 2016-11-26 05:39:32.985793
osd/ReplicatedPG.cc: 10521: FAILED assert(obc)

 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbde2c5]
 2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e89f]
 3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a11aa]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83c37a]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69af05]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69b473]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd9cf]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
 10: (()+0x7dc5) [0x7fe47d877dc5]
 11: (clone()+0x6d) [0x7fe47c358ced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I've been unable to find a log of this, but in the case description the customer also stated they saw the following on an OSD at some point: 
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: /usr/bin/ceph-osd() [0xadaf02]
2: (()+0xf100) [0x7f31f0f6c100]
3: (gsignal()+0x37) [0x7f31ef9845f7]
4: (abort()+0x148) [0x7f31ef985ce8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f31f02899d5]
6: (()+0x5e946) [0x7f31f0287946]
7: (()+0x5e973) [0x7f31f0287973]
8: (()+0x5eb93) [0x7f31f0287b93]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbde4ba]
10: (ReplicatedPG::trim_object(hobject_t const&)+0x202) [0x86e0d2]
11: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x43f) [0x87080f]
12: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd4) [0x8d38b4]
13: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x137) [0x8bf547]
14: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x94) [0x8bf734]
15: (ReplicatedPG::snap_trimmer()+0x53b) [0x840b6b]
16: (OSD::SnapTrimWQ::_process(PG*)+0x2a) [0x6c7aea]
17: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbcea96]
18: (ThreadPool::WorkThread::entry()+0x10) [0xbcfb20]
19: (()+0x7dc5) [0x7f31f0f64dc5]
20: (clone()+0x6d) [0x7f31efa45ced]
So far the customer has been able to work around the issue by increasing hit_set_count which, per my guess, just delays the inevitable by preventing hit_set_trim() from attempting to trim the HitSets.
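
The workaround was presumably applied with something along these lines (a sketch; the cache pool name and the new count are placeholders, not the values the customer actually used):

  # raise hit_set_count on the cache/hot pool so trimming no longer kicks in
  ceph osd pool set <hot-pool> hit_set_count <higher-count>
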
The customer is also having issues flushing/evicting the cache tier, but that may be unrelated to the assertions, or a by-product of the I/O generated by the flapping OSDs. They're seeing: 
rbd_header.1907517e0c57b1 failed to evict /rbd_header.1907517e0c57b1: (16) Device or resource busy
which matches http://tracker.ceph.com/issues/12659
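
For reference, flush/evict errors like the one above typically come from the rados cache maintenance commands, roughly as follows (the pool name is a placeholder; the exact invocation the customer used is in the case):

  # flush dirty objects and evict clean ones from the cache tier
  rados -p <hot-pool> cache-try-flush-evict-all
  # blocking variant; "(16) Device or resource busy" is what it reports when an
  # object cannot be evicted, e.g. an rbd_header that still has watchers
  rados -p <hot-pool> cache-flush-evict-all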


Version-Release number of selected component (if applicable):
ceph-0.94.9-0.el7.x86_64 on hot-storage
v0.94.6-254-ge219e85-1 on cold-storage

How reproducible:
Consistent

Steps to Reproduce:
I have not reproduced this myself, but the customer stated the following in the case: 
"I have managed to reproduce a similar situation in my test lab by changing osd_hit_set_namespace to a different value than default ".ceph-internal" and bumping the OSDs. It went fine until deep-scrub started and OSDs started to flap the same way as they do on our production cluster. I have not managed to cleanly get out of that situation. It feels like they are on each hot tier OSD's leveldb written in some weird binary format, that none of the regular leveldb cli tools are able to extract. I managed to extract some strings via ldbdump go tool, but obviously not able to modify the key/value (nor do i know if that would actually remove them cleanly or if it's a way to go)."

Actual results:
OSDs assert during HitSet trimming

Expected results:
OSDs do not assert during HitSet trimming 

Additional Info: 
**Note**: This customer is utilizing cache tiering on community bits (CentOS 7 and Ubuntu); they're aware it's a tech preview.  Sales has sold them a subscription stating Red Hat could help them get their cluster back to a healthy state, and then we can work with them to move them off community bits and CentOS 7 onto RHEL/RHCS.

* A close match to this issue, apparently the same code path; the user stated he fixed it: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/009173.html
* The customer has stated any down OSDs on OSD2-int-fr were taken down for maintenance and should be ignored.  (I've been focusing my investigation on the hot-storage buckets as these appear to be encountering the most recent issues).