Bug 1398867 - [Escalation] Cache-tiering, OSDs in hot-storage asserting with osd/ReplicatedPG.cc: 10521: FAILED assert(obc) in hit_set_trim
Summary: [Escalation] Cache-tiering, OSDs in hot-storage asserting with osd/Replicat...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.3.3
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 1.3.4
Assignee: Samuel Just
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-26 20:02 UTC by Kyle Squizzato
Modified: 2020-01-17 16:15 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-27 18:36:04 UTC
Target Upstream Version:



Description Kyle Squizzato 2016-11-26 20:02:16 UTC
Description of problem:
Customer recently started experiencing OSD failure/flaps on all of the OSDs in their hot-storage pool.

The assertions appear to have begun after backfill was restarted.  The customer states the issue started after they increased pg_num/pgp_num; they made these increases while the cluster was in recovery, not active+clean (illustrative commands below).
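For illustration only (pool name and values are placeholders, not the customer's actual settings), that change amounts to the following; the usual guidance is to make it only while the cluster is active+clean:

# Placeholder pool/values -- not taken from the case.
ceph osd pool set <pool> pg_num 2048
ceph osd pool set <pool> pgp_num 2048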

The hot-storage chassis are laid out as follows: 

-50   4.96991     chassis CH2
63   0.70999         osd.63                    up  1.00000          1.00000
65   0.70999         osd.65                    up  1.00000          1.00000
66   0.70999         osd.66                    up  1.00000          1.00000
67   0.70999         osd.67                  down        0          1.00000
47   0.70999         osd.47                    up  1.00000          1.00000
92   0.70998         osd.92                    up  1.00000          1.00000
93   0.70998         osd.93                    up  1.00000          1.00000
-51   4.96991     chassis CH3
54   0.70999         osd.54                    up  1.00000          1.00000
43   0.70999         osd.43                  down        0          1.00000
44   0.70999         osd.44                    up  1.00000          1.00000
69   0.70999         osd.69                    up  1.00000          1.00000
46   0.70999         osd.46                    up  1.00000          1.00000
91   0.70998         osd.91                    up  1.00000          1.00000
94   0.70998         osd.94                    up  1.00000          1.00000
Today alone, the following OSDs have failed: 

osd.44
osd.46
osd.66
osd.67
osd.69
osd.91
osd.92
osd.93
osd.94
During the failures, an OSD will flap, hitting the same assertion repeatedly in quick succession; for example, osd.94: 

2016-11-26 05:38:48.103200 mon.0 10.24.19.25:6789/0 3180044 : cluster [INF] osd.94 10.24.19.45:6800/13817 failed (6 reports from 6 peers after 21.000236 >= grace 20.000000)
2016-11-26 05:39:55.460343 mon.0 10.24.19.25:6789/0 3180306 : cluster [INF] osd.94 10.24.19.45:6802/27971 failed (7 reports from 6 peers after 22.000328 >= grace 20.000000)
2016-11-26 05:41:37.640188 mon.0 10.24.19.25:6789/0 3180749 : cluster [INF] osd.94 10.24.19.45:6800/28930 failed (8 reports from 6 peers after 20.000126 >= grace 20.000000)
2016-11-26 05:48:22.906236 mon.0 10.24.19.25:6789/0 3182655 : cluster [INF] osd.94 10.24.19.45:6800/3444 failed (6 reports from 6 peers after 22.413912 >= grace 20.000000)
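The customer has not done this per the case notes, but for reference, flapping like this is commonly damped while investigating by telling the monitors not to mark OSDs out/down, then clearing the flags afterwards:

ceph osd set noout
ceph osd set nodown
# ...investigate, then:
ceph osd unset nodown
ceph osd unset noout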
The assertion in greater detail (the full recent event dump is in the BZ, and the OSD logs are in the case):
2016-11-26 05:39:32.987897 7fe446ec7700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)' thread 7fe446ec7700 time 2016-11-26 05:39:32.985793
osd/ReplicatedPG.cc: 10521: FAILED assert(obc)

 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbde2c5]
 2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e89f]
 3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a11aa]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83c37a]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69af05]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69b473]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd9cf]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
 10: (()+0x7dc5) [0x7fe47d877dc5]
 11: (clone()+0x6d) [0x7fe47c358ced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
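The assert fires when hit_set_trim() cannot get an object context (obc) for a hit set archive object it expects to find. As a sketch (pool name is a placeholder; exact flag syntax may vary by version), the archive objects and the pool's hit set settings can be inspected with:

# Hit set archive objects live in the ".ceph-internal" namespace of the cache pool.
rados -p <cache-pool> -N ".ceph-internal" ls | grep hit_set
# Compare against the pool's configured hit set parameters:
ceph osd pool get <cache-pool> hit_set_count
ceph osd pool get <cache-pool> hit_set_period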
I've been unable to find a log containing this, but in the case description the customer also stated they saw the following on an OSD at some point: 
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: /usr/bin/ceph-osd() [0xadaf02]
2: (()+0xf100) [0x7f31f0f6c100]
3: (gsignal()+0x37) [0x7f31ef9845f7]
4: (abort()+0x148) [0x7f31ef985ce8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f31f02899d5]
6: (()+0x5e946) [0x7f31f0287946]
7: (()+0x5e973) [0x7f31f0287973]
8: (()+0x5eb93) [0x7f31f0287b93]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbde4ba]
10: (ReplicatedPG::trim_object(hobject_t const&)+0x202) [0x86e0d2]
11: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x43f) [0x87080f]
12: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd4) [0x8d38b4]
13: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x137) [0x8bf547]
14: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x94) [0x8bf734]
15: (ReplicatedPG::snap_trimmer()+0x53b) [0x840b6b]
16: (OSD::SnapTrimWQ::_process(PG*)+0x2a) [0x6c7aea]
17: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbcea96]
18: (ThreadPool::WorkThread::entry()+0x10) [0xbcfb20]
19: (()+0x7dc5) [0x7f31f0f64dc5]
20: (clone()+0x6d) [0x7f31efa45ced]
So far the customer has been able to work around the issue by increasing hit_set_count, which I suspect just delays the inevitable by keeping hit_set_trim() from attempting to trim the older HitSets (workaround sketch below).
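For reference, that workaround amounts to something like the following (pool name and value are placeholders; the customer's actual settings are in the case):

ceph osd pool get <cache-pool> hit_set_count
ceph osd pool set <cache-pool> hit_set_count <larger value>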
The customer is also having issues flushing/evicting the cache tier, but that may be unrelated to the assertions, or a by-product of the I/O generated by the flapping OSDs.  They're seeing: 
rbd_header.1907517e0c57b1 failed to evict /rbd_header.1907517e0c57b1: (16) Device or resource busy
which matches http://tracker.ceph.com/issues/12659
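For context, the flush/evict they are attempting looks roughly like this (pool name is a placeholder). The (16) Device or resource busy on the rbd_header object often means a client still holds a watch on it, which can be checked with listwatchers:

rados -p <cache-pool> cache-flush-evict-all
# For an object that returns (16) Device or resource busy:
rados -p <cache-pool> listwatchers rbd_header.1907517e0c57b1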


Version-Release number of selected component (if applicable):
ceph-0.94.9-0.el7.x86_64 on hot-storage
v0.94.6-254-ge219e85-1 on cold-storage

How reproducible:
Consistent

Steps to Reproduce:
I have not reproduced this myself, but the customer stated the following in the case: 
"I have managed to reproduce a similar situation in my test lab by changing osd_hit_set_namespace to a different value than default ".ceph-internal" and bumping the OSDs. It went fine until deep-scrub started and OSDs started to flap the same way as they do on our production cluster. I have not managed to cleanly get out of that situation. It feels like they are on each hot tier OSD's leveldb written in some weird binary format, that none of the regular leveldb cli tools are able to extract. I managed to extract some strings via ldbdump go tool, but obviously not able to modify the key/value (nor do i know if that would actually remove them cleanly or if it's a way to go)."

Actual results:
OSDs assert during HitSet trimming

Expected results:
OSDs do not assert during HitSet trimming 

Additional Info: 
**Note**: This customer is utilizing cache tiering on community bits (CentOS 7 and Ubuntu); they're aware it's tech preview.  Sales has sold them a subscription stating that Red Hat could help them get their cluster back to a healthy state, and then we can work with them to move them off of community bits and CentOS 7 onto RHEL/RHCS.

* Close match to this issue, apparently the same code path; the user stated he fixed it: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/009173.html
* The customer has stated that any down OSDs on OSD2-int-fr were taken down for maintenance and should be ignored.  (I've been focusing my investigation on the hot-storage buckets, as these appear to be encountering the most recent issues.)

