Description of problem:

The customer recently started experiencing OSD failures/flaps on all of the OSDs in their hot-storage pool. The assertions appear to have begun after backfill was restarted. The customer states the issue started after they increased pg_num/pgp_num; they increased these counts while the cluster was in recovery, not active+clean.

The hot-storage chassis are as follows:

-50 4.96991     chassis CH2
 63 0.70999         osd.63   up 1.00000 1.00000
 65 0.70999         osd.65   up 1.00000 1.00000
 66 0.70999         osd.66   up 1.00000 1.00000
 67 0.70999         osd.67 down       0 1.00000
 47 0.70999         osd.47   up 1.00000 1.00000
 92 0.70998         osd.92   up 1.00000 1.00000
 93 0.70998         osd.93   up 1.00000 1.00000
-51 4.96991     chassis CH3
 54 0.70999         osd.54   up 1.00000 1.00000
 43 0.70999         osd.43 down       0 1.00000
 44 0.70999         osd.44   up 1.00000 1.00000
 69 0.70999         osd.69   up 1.00000 1.00000
 46 0.70999         osd.46   up 1.00000 1.00000
 91 0.70998         osd.91   up 1.00000 1.00000
 94 0.70998         osd.94   up 1.00000 1.00000

Today alone, the following OSDs have failed: osd.44, osd.46, osd.66, osd.67, osd.69, osd.91, osd.92, osd.93, osd.94.

During the failures an OSD flaps by hitting the same assertion in quick succession, for example osd.94:

2016-11-26 05:38:48.103200 mon.0 10.24.19.25:6789/0 3180044 : cluster [INF] osd.94 10.24.19.45:6800/13817 failed (6 reports from 6 peers after 21.000236 >= grace 20.000000)
2016-11-26 05:39:55.460343 mon.0 10.24.19.25:6789/0 3180306 : cluster [INF] osd.94 10.24.19.45:6802/27971 failed (7 reports from 6 peers after 22.000328 >= grace 20.000000)
2016-11-26 05:41:37.640188 mon.0 10.24.19.25:6789/0 3180749 : cluster [INF] osd.94 10.24.19.45:6800/28930 failed (8 reports from 6 peers after 20.000126 >= grace 20.000000)
2016-11-26 05:48:22.906236 mon.0 10.24.19.25:6789/0 3182655 : cluster [INF] osd.94 10.24.19.45:6800/3444 failed (6 reports from 6 peers after 22.413912 >= grace 20.000000)

The assertion in greater detail (full recent event dump attached to the BZ, OSD logs in the case):

2016-11-26 05:39:32.987897 7fe446ec7700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)' thread 7fe446ec7700 time 2016-11-26 05:39:32.985793
osd/ReplicatedPG.cc: 10521: FAILED assert(obc)

 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbde2c5]
 2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e89f]
 3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a11aa]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83c37a]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69af05]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69b473]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd9cf]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
 10: (()+0x7dc5) [0x7fe47d877dc5]
 11: (clone()+0x6d) [0x7fe47c358ced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
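For context: as far as I can tell, the assert fires inside hit_set_trim() when the OSD cannot load an object context for a hit set archive object that is still listed in the PG's HitSet history, i.e. the on-disk archive object is missing (or not visible, e.g. written under a non-default namespace) while the metadata still references it. A rough way to confirm which pools carry hit sets and to pull the PG/object the crashing thread was handling out of the log; this is only a sketch, and "hot-storage" is an assumed cache-pool name taken from the bucket names above:

# Assumed pool name and default log path; substitute the real ones.
# Which pools have hit sets configured (cache pools normally show their
# hit_set parameters and cache_mode in their pool line):
ceph osd dump | grep -E 'hit_set|cache_mode'

# Context immediately before the assert on one of the flapping OSDs; the op
# lines just ahead of it usually name the PG and the hit_set_* archive object:
grep -B 40 'FAILED assert(obc)' /var/log/ceph/ceph-osd.94.log | tail -60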
I've been unable to find a log of this, but in the description the customer also stated they saw the following on an OSD at some point:

 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
 1: /usr/bin/ceph-osd() [0xadaf02]
 2: (()+0xf100) [0x7f31f0f6c100]
 3: (gsignal()+0x37) [0x7f31ef9845f7]
 4: (abort()+0x148) [0x7f31ef985ce8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f31f02899d5]
 6: (()+0x5e946) [0x7f31f0287946]
 7: (()+0x5e973) [0x7f31f0287973]
 8: (()+0x5eb93) [0x7f31f0287b93]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbde4ba]
 10: (ReplicatedPG::trim_object(hobject_t const&)+0x202) [0x86e0d2]
 11: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x43f) [0x87080f]
 12: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd4) [0x8d38b4]
 13: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x137) [0x8bf547]
 14: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x94) [0x8bf734]
 15: (ReplicatedPG::snap_trimmer()+0x53b) [0x840b6b]
 16: (OSD::SnapTrimWQ::_process(PG*)+0x2a) [0x6c7aea]
 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbcea96]
 18: (ThreadPool::WorkThread::entry()+0x10) [0xbcfb20]
 19: (()+0x7dc5) [0x7f31f0f64dc5]
 20: (clone()+0x6d) [0x7f31efa45ced]

So far the customer has been able to work around the issue by increasing hit_set_count, which, I suspect, is just delaying the inevitable by preventing hit_set_trim() from attempting to trim the HitSets (see the note below the reproduction steps).

The customer is also having trouble flushing/evicting the cache tier, but that may be unrelated to the assertions, or a by-product of the IO generated by the flapping OSDs. They're seeing:

rbd_header.1907517e0c57b1
failed to evict /rbd_header.1907517e0c57b1: (16) Device or resource busy

which matches http://tracker.ceph.com/issues/12659

Version-Release number of selected component (if applicable):

ceph-0.94.9-0.el7.x86_64 on hot-storage
v0.94.6-254-ge219e85-1 on cold-storage

How reproducible:

Consistent

Steps to Reproduce:

I have not reproduced this myself, but the customer stated the following in the case:

"I have managed to reproduce a similar situation in my test lab by changing osd_hit_set_namespace to a different value than default ".ceph-internal" and bumping the OSDs. It went fine until deep-scrub started and OSDs started to flap the same way as they do on our production cluster. I have not managed to cleanly get out of that situation. It feels like they are on each hot tier OSD's leveldb written in some weird binary format, that none of the regular leveldb cli tools are able to extract. I managed to extract some strings via ldbdump go tool, but obviously not able to modify the key/value (nor do i know if that would actually remove them cleanly or if it's a way to go)."
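Two notes on the above, again assuming "hot-storage" is the cache pool name. The customer's workaround maps to a single pool setting: raising hit_set_count keeps a longer HitSet history, so trimming never reaches the entries whose archive objects are missing; it does not repair those entries. The "Device or resource busy" on the rbd_header object is usually a live client watch on that object (which is what tracker #12659 describes) and can be checked with listwatchers:

# Assumed pool name; a stopgap that postpones trimming past the damaged
# HitSet history entries, not a fix:
ceph osd pool get hot-storage hit_set_count
ceph osd pool set hot-storage hit_set_count 32

# Check for a client watch on the header object that refuses to evict:
rados -p hot-storage listwatchers rbd_header.1907517e0c57b1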
Actual results:

OSDs assert during HitSet trimming.

Expected results:

OSDs do not assert during HitSet trimming.

Additional Info:

**Note**: This customer is using cache tiering on community bits (CentOS 7 and Ubuntu); they're aware it's a tech preview. Sales has sold them a subscription stating Red Hat could help them get their cluster back to a healthy state, and then we can work with them on moving off community bits and CentOS 7 onto RHEL/RHCS.

* Close match to this issue, apparently the same code path, and the user stated he fixed it: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-April/009173.html

* The customer has stated that any down OSDs on OSD2-int-fr were taken down for maintenance and should be ignored. (I've been focusing my investigation on the hot-storage buckets, as these appear to be encountering the most recent issues.)