Bug 1400285

Summary: RHCS - 0.94.3-3.el7cp - ./include/interval_set.h: 386: FAILED assert(_size >= 0)
Product: Red Hat Ceph Storage
Component: RADOS
Version: 1.3.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Status: CLOSED CURRENTRELEASE
Reporter: Vikhyat Umrao <vumrao>
Assignee: Brad Hubbard <bhubbard>
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
CC: bhubbard, ceph-eng-bugs, dzafman, kchai, mhackett
Target Milestone: rc
Target Release: 2.1
Fixed In Version: ceph-10.2.3-13.el7cp
Type: Bug
Last Closed: 2016-12-07 05:06:01 UTC

Description Vikhyat Umrao 2016-11-30 19:27:53 UTC
Description of problem:

RHCS - 0.94.3-3.el7cp - ./include/interval_set.h: 386: FAILED assert(_size >= 0)

- The customer was removing a couple of OSD nodes. While those OSDs were being removed, recovery was running, and during that recovery multiple OSDs crashed with the assert shown below.

  0> 2016-11-29 04:09:15.149619 7fe52ddbe700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7fe52ddbe700 time 2016-11-29 04:09:15.130676
./include/interval_set.h: 386: FAILED assert(_size >= 0)

 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xb20ed5]
 2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xc0) [0x7fa940]
 3: (PG::activate(ObjectStore::Transaction&, unsigned int, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >, std::less<int>, std::allocator<std::pair<int const, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > > > > >&, std::map<int, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >, std::less<int>, std::allocator<std::pair<int const, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > > > > >*, PG::RecoveryCtx*)+0x703) [0x7cf883]
4: (PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::my_context)+0x3ff) [0x7d230f]
5: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Active, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb8) [0x805708]
 6: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x15a) [0x805bba]
 7: (boost::statechart::simple_state<PG::RecoveryState::GetMissing, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd0) [0x806920]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x7edd5b]
 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd4) [0x7edef4]
 10: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1fe) [0x799bae]
 11: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2d0) [0x6720c0]
 12: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28) [0x6ccaf8]
 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xb11736]
 14: (ThreadPool::WorkThread::entry()+0x10) [0xb127c0]
 15: (()+0x7dc5) [0x7fe54b471dc5]
 16: (clone()+0x6d) [0x7fe549f5228d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3.1 -  0.94.3-3.el7cp

Comment 4 Vikhyat Umrao 2016-11-30 19:30:47 UTC
- Then I started looking into the code for more hints, as we do not have debug logs from the time of the issue and it is currently not reproducible in the customer environment.
- I found this in the changelog:

doc/changelog/v0.94.7.txt:2543:    hammer: Unable to bring up OSD's after dealing with FULL cluster (OSD assert with /include/interval_set.h: 386: FAILED assert(_size >= 0))
doc/release-notes.rst:115:* osd: Unable to bring up OSD's after dealing with FULL cluster (OSD assert with /include/interval_set.h: 386: FAILED assert(_size >= 0)) (`issue#14428 <http://tracker.ceph.com/issues/14428>`_, `pr#7415 <http://github.com/ceph/ceph/pull/7415>`_, Alexey Sheplyakov)


- and this tracker: http://tracker.ceph.com/issues/14428, which was created by Red Hat support for another upstream customer issue and was fixed in hammer from v0.94.7 onwards.

$ git tag --contains 3d84420847642e1cb4212f943ecf3ed479112d96
v0.94.7
v0.94.8
v0.94.9

commit 3d84420847642e1cb4212f943ecf3ed479112d96
Author: Alexey Sheplyakov <asheplyakov>
Date:   Thu Jan 21 09:42:09 2016 +0300

    PG::activate(): handle unexpected cached_removed_snaps more gracefully
    
    PGPool::update(): ditto
    
    Fixes: #14428
    Backport: infernalis, hammer, firefly
    
    Signed-off-by: Alexey Sheplyakov <asheplyakov>
    (cherry picked from commit aba6746b850e9397ff25570f08d0af8847a7162c)

diff --git a/src/osd/PG.cc b/src/osd/PG.cc
index c6b1d27..c08ad5d 100644
--- a/src/osd/PG.cc
+++ b/src/osd/PG.cc
@@ -158,8 +158,18 @@ void PGPool::update(OSDMapRef map)
   name = map->get_pool_name(id);
   if (pi->get_snap_epoch() == map->get_epoch()) {
     pi->build_removed_snaps(newly_removed_snaps);
-    newly_removed_snaps.subtract(cached_removed_snaps);
-    cached_removed_snaps.union_of(newly_removed_snaps);
+    interval_set<snapid_t> intersection;
+    intersection.intersection_of(newly_removed_snaps, cached_removed_snaps);
+    if (intersection == cached_removed_snaps) {
+        newly_removed_snaps.subtract(cached_removed_snaps);
+        cached_removed_snaps.union_of(newly_removed_snaps);
+    } else {
+        lgeneric_subdout(g_ceph_context, osd, 0) << __func__
+          << " cached_removed_snaps shrank from " << cached_removed_snaps
+          << " to " << newly_removed_snaps << dendl;
+        cached_removed_snaps = newly_removed_snaps;
+        newly_removed_snaps.clear();
+    }
     snapc = pi->get_snap_context();
   } else {
     newly_removed_snaps.clear();
@@ -1550,7 +1560,16 @@ void PG::activate(ObjectStore::Transaction& t,
     dout(20) << "activate - purged_snaps " << info.purged_snaps
             << " cached_removed_snaps " << pool.cached_removed_snaps << dendl;
     snap_trimq = pool.cached_removed_snaps;
-    snap_trimq.subtract(info.purged_snaps);
+    interval_set<snapid_t> intersection;
+    intersection.intersection_of(snap_trimq, info.purged_snaps);
+    if (intersection == info.purged_snaps) {
+      snap_trimq.subtract(info.purged_snaps);
+    } else {
+        dout(0) << "warning: info.purged_snaps (" << info.purged_snaps
+                << ") is not a subset of pool.cached_removed_snaps ("
+                << pool.cached_removed_snaps << ")" << dendl;
+        snap_trimq.subtract(intersection);
+    }

Comment 6 Vikhyat Umrao 2016-11-30 19:31:52 UTC
- To me this looks similar.
- We need engineering help to confirm whether my understanding is correct.
- If so, we can advise the customer to upgrade to Red Hat Ceph Storage 1.3.3 - 0.94.9-3.el7cp.

Thank you,
Vikhyat

Comment 8 Brad Hubbard 2016-12-01 09:10:55 UTC
snap_trimq = pool.cached_removed_snaps;
snap_trimq.subtract(info.purged_snaps);
 
interval_set<snapid_t> cached_removed_snaps;      // current removed_snaps set
interval_set<snapid_t> snap_trimq;
interval_set<snapid_t> purged_snaps;
 
So all of the data types involved are interval_set<snapid_t>, and snapid_t is
basically a wrapper around a uint64_t (a 64-bit unsigned integer).
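
For reference, the wrapper looks roughly like this (paraphrased from memory of
the hammer-era include/types.h, not copied from it, and given a hypothetical
name here so it is clearly just an illustration):

  // Simplified stand-in for Ceph's snapid_t: a value type holding a uint64_t
  // with an implicit conversion back to uint64_t, so interval_set<snapid_t>
  // is effectively a set of 64-bit snapshot ids.
  #include <cstdint>

  struct snapid_like {
    uint64_t val;
    snapid_like(uint64_t v = 0) : val(v) {}
    operator uint64_t() const { return val; }
  };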
 
So it appears that snap_trimq (initialised from the cached_removed_snaps field
of the current pool) did not contain some of the snapids in purged_snaps, and
this is indeed unexpected. Restricting the interval_set passed to subtract so
that it contains only the intersection (the elements present in both) of the
two sets should resolve this, and that is part of what the patch pointed out
by Vikhyat does. It looks like we populate cached_removed_snaps in
PGPool::update based on an osdmap and its epoch, so maybe there was a problem
with map updates? The second part of the http://tracker.ceph.com/issues/14428
patch sanity-checks this process and tries to guard (and warn) against it.
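
To make the failure mode concrete, here is a minimal, self-contained sketch (a
toy model, not Ceph's actual interval_set from include/interval_set.h). The
assert message in the backtrace implies the real class keeps a running _size
that must never go negative; the toy below reproduces just that bookkeeping,
so subtracting a set that is not a subset aborts the same way:

  // Toy model only -- the interval layout and erase() semantics here are
  // assumptions, kept just detailed enough to show why assert(_size >= 0)
  // can fire.
  #include <cassert>
  #include <cstdint>
  #include <map>

  struct toy_interval_set {
    std::map<uint64_t, uint64_t> m;  // start -> length
    int64_t _size = 0;               // running element count

    void insert(uint64_t start, uint64_t len) {
      m[start] = len;
      _size += static_cast<int64_t>(len);
    }

    // Only handles exact-interval removal; enough for this illustration.
    void erase(uint64_t start, uint64_t len) {
      auto it = m.find(start);
      if (it != m.end() && it->second == len)
        m.erase(it);
      _size -= static_cast<int64_t>(len);
      assert(_size >= 0);  // the check that fires at interval_set.h:386
    }

    void subtract(const toy_interval_set& o) {
      for (const auto& p : o.m)
        erase(p.first, p.second);
    }
  };

  int main() {
    toy_interval_set snap_trimq;   // stands in for pool.cached_removed_snaps
    snap_trimq.insert(10, 5);      // [10~5]

    toy_interval_set purged;       // stands in for info.purged_snaps
    purged.insert(10, 5);          // [10~5]
    purged.insert(20, 3);          // [20~3] -- not present in snap_trimq

    snap_trimq.subtract(purged);   // _size goes 5 -> 0 -> -3: assert aborts here
  }

The patch quoted in comment 4 avoids exactly this by first computing the
intersection of the two sets and only subtracting what is actually present.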
 
To me the patch suggested by Vikhyat seems to offer good protection against
this kind of issue, and an upgrade is therefore indicated; however, I'd like
to get a second opinion on this.

Comment 9 Vikhyat Umrao 2016-12-01 15:10:47 UTC
Thanks Brad for the analysis; it looks right to me.

We do not have debug logs from this customer (Cisco Systems), but if you look at the logs attached to the upstream tracker http://tracker.ceph.com/issues/14428 from another customer (ceph-osd.44.log: https://api.access.redhat.com/rs/cases/01568711/attachments/b2c52f28-eb4b-4734-9f80-c3f879bb465a), we do have the following message logged (the code that prints it is quoted first):

1533   // initialize snap_trimq
1534   if (is_primary()) {
1535     dout(20) << "activate - purged_snaps " << info.purged_snaps
1536              << " cached_removed_snaps " << pool.cached_removed_snaps << dendl; 

  -19> 2016-01-19 18:58:09.417532 7f87ed883700 20 osd.44 pg_epoch: 2155 pg[8.8e( v 1874'658759 (1866'655730,1874'658759] local-les=2155 n=5623 ec=350 les/c 2120/2152 2154/2154/2154) [44,52]/[44,48] r=0 lpr=2154 pi=1948-2153/37 bft=52 crt=1874'658757 lcod 0'0 mlcod 0'0 remapped] activate - purged_snaps [] cached_removed_snaps []

^^ and as you can see in these old logs, there were no snap ids; both purged_snaps and cached_removed_snaps were blank.

Comment 10 Brad Hubbard 2016-12-02 00:06:30 UTC
After further discussion we determined that the line below is more likely the relevant one from the upstream tracker case.

  -424> 2016-01-19 18:58:09.332150 7f87ed082700 20 osd.44 pg_epoch: 2155 pg[10.33( v 1868'3546 (441'872,1868'3546] local-les=2155 n=847 ec=360 les/c 1987/2152 2154/2154/2154) [44,53,59]/[44,48] r=0 lpr=2154 pi=1948-2153/34 bft=53,59 crt=1868'3544 lcod 0'0 mlcod 0'0 remapped] activate - purged_snaps [1~3,6~4,b~4,12~5,19~4,1e~c,2b~2,31~2,36~2,3a~2,4f~2,52~2,56~1] cached_removed_snaps [1~3]

This demonstrates that purged_snaps and cached_removed_snaps were not empty, and that purged_snaps contained multiple entries that were not present in cached_removed_snaps. This is indicative of what must have happened in the case of this Bugzilla, but the reason it happened must have been different (in the upstream case the user ran --force-nonempty).
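
For illustration, the state in that log line maps onto the failing calls roughly as follows (a sketch only: it assumes it is compiled inside the Ceph source tree against include/interval_set.h, assumes insert(start, len) for populating the sets, reads the log notation as [start~length], and shows only the first two intervals of purged_snaps):

  #include "include/interval_set.h"
  #include "include/types.h"   // snapid_t

  void sketch_bz1400285_upstream_state() {
    // cached_removed_snaps from the log line: [1~3]
    interval_set<snapid_t> snap_trimq;
    snap_trimq.insert(1, 3);

    // info.purged_snaps from the log line, truncated: [1~3,6~4,...]
    interval_set<snapid_t> purged_snaps;
    purged_snaps.insert(1, 3);
    purged_snaps.insert(6, 4);

    // PG::activate() copies cached_removed_snaps into snap_trimq and then does:
    snap_trimq.subtract(purged_snaps);
    // [6~4] is not contained in [1~3], so the erase done by subtract() drives
    // _size negative: ./include/interval_set.h: 386: FAILED assert(_size >= 0).
  }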

Posting this now to avoid any confusion going forward.

Comment 11 Brad Hubbard 2016-12-02 06:06:24 UTC
There is compelling evidence to suggest the assert is linked to running "osd crush create-or-move" on these OSDs: the list of down OSDs for the most part correlates with the list of OSDs that create-or-move was run on, and at least some of the discrepancies can be explained by other failures.

$ zgrep create-or-move ceph.audit.*|grep osd.229
ceph.audit.log:2016-11-29 03:53:31.869501 mon.0 X.X.X.10:6789/0 36543098 : audit [INF] from='client.? X.X.X.36:0/2628049' entity='osd.229' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=boxen-027"], "id": 229, "weight": 2.72}]: dispatch

$ grep interval_set ceph-osd.229.log |head -1
2016-11-29 04:09:15.149619 7fe52ddbe700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7fe52ddbe700 time 2016-11-29 04:09:15.130676

Comment 13 Brad Hubbard 2016-12-07 05:06:01 UTC
After a complete review of the details of this bug and the code involved, it is clear the issue was caused by populating cached_removed_snaps from the osdmap stored on disk in a function called from load_pgs. This data was then further manipulated via a call to PG::update using a current map (across a large map gap). The combination of data from the two maps resulted in an inconsistency, causing us to assert in PG::activate. Because this assert occurred in PG::activate, we had already called write_if_dirty on the pg and persisted a new osdmap to the pg's filestore. The next time we boot we start successfully, as we read a current osdmap off the disk and do not calculate an inconsistent cached_removed_snaps. Sam's patch (5fb8fb4e450949ebc4c724c102a5b2ae6f1d06c8) skips modification of cached_removed_snaps if the map gap is greater than 1 and would directly address this issue.
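
As a conceptual sketch of that guard (this only restates the rule described above with hypothetical names; it is not the contents of commit 5fb8fb4e450949ebc4c724c102a5b2ae6f1d06c8):

  #include <cstdint>

  // Hypothetical helper: decide whether a removed-snaps delta derived from the
  // current osdmap may be folded into cached_removed_snaps. last_epoch is the
  // epoch the cached set was built against, cur_epoch the epoch now being applied.
  bool safe_to_apply_removed_snaps_delta(uint32_t last_epoch, uint32_t cur_epoch) {
    // A gap of more than one map means the cached set may have been built from
    // a stale on-disk map (the load_pgs case above), so skip the modification
    // rather than risk an inconsistent set that later makes
    // interval_set::subtract() underflow in PG::activate().
    return cur_epoch - last_epoch <= 1;
  }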

5fb8fb4e450949ebc4c724c102a5b2ae6f1d06c8 is not in our ceph-1.3-rhel-7 branch, so it is not in any version of 1.3 we have shipped, and it currently will not be in the next version since it is not yet in our tree. It is, however, in v10.2.3-13.el7 (2.1), so the recommendation would be to upgrade to that version to avoid this issue.

Note that the patch from comment #4 would not resolve this issue but would reduce the impact from an assert to a logged warning. Also note that this issue is not generally seen, as the conditions for its occurrence are difficult to reproduce.