Bug 2228546

Summary: OSD crash (SIGSEGV) in BlueStore::Onode::put called by PG::update_snap_map
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Mudit Agarwal <muagarwa>
Component: RADOS
Assignee: Radoslaw Zarzynski <rzarzyns>
Status: NEW ---
QA Contact: Pawan <pdhiran>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.2
CC: akupczyk, bhubbard, bniver, ceph-eng-bugs, cephqe-warriors, ebenahar, jpeyrard, muagarwa, nojha, rzarzyns, sheggodu, sostapov, vumrao
Target Milestone: ---
Flags: rzarzyns: needinfo? (akupczyk)
Target Release: 7.1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2186234
Environment:
Last Closed:
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2186234
Bug Blocks:

Description Mudit Agarwal 2023-08-02 15:45:01 UTC
+++ This bug was initially created as a clone of Bug #2186234 +++

Description of problem (please be as detailed as possible and provide log
snippets):

ceph osd.1, running on an OCP/ODF 4.10.10 cluster, crashed and restarted automatically.
`ceph status` reports a WARNING about one recently crashed daemon, but it looks like a one-shot event.


  cluster:
    id:     7ec4a1a3-5211-4e4a-9bf6-c9e01f39a474
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 7w)
    mgr: a(active, since 5h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 5h), 3 in (since 2M)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 643.32k objects, 318 GiB
    usage:   960 GiB used, 540 GiB / 1.5 TiB avail
    pgs:     177 active+clean

  io:
    client:   9.4 MiB/s rd, 13 MiB/s wr, 14 op/s rd, 122 op/s wr



Version of all relevant components (if applicable):

    "osd": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 3
    },

The ceph container runs the following image (tag 5-235):
registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6

It can be downloaded from the following site:
https://catalog.redhat.com/software/containers/rhceph/rhceph-5-rhel8/60ec72a74a6a2c7844abe5fb?tag=5-235&push_date=1656624434000



Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

No impact on production so far


Is there any workaround available to the best of your knowledge?

No workaround needed for now, since the cluster is working.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

No idea

Is this issue reproducible?

For now, it is not.

Can this issue be reproduced from the UI?

No

If this is a regression, please provide more details to justify this:

No regression.

Steps to Reproduce:

Not reproducible so far.

Actual results:

N/A

Expected results:


Additional info:

Here is the full stack trace of the crash we observed:

 ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f705ce60ce0]
 2: (BlueStore::Onode::put()+0x2cb) [0x5634a77de74b]
 3: (BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::ba
sic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std:
:__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x1a5) [0x5634a7825c15]
 4: (MapCacher::MapCacher<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>::get_keys(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator
<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocat
or<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x3e3) [0x5634a7506393]
 5: (SnapMapper::get_snaps(hobject_t const&, SnapMapper::object_snaps*)+0xec) [0x5634a74fc80c]
 6: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> > const&, MapCacher::Transaction<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>*)+0xcb) [0x5634a750015b]
 7: (PG::update_snap_map(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, ceph::os::Transaction&)+0x814) [0x5634a73b2394]
 8: (non-virtual thunk to PrimaryLogPG::log_operation(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t> const&, eversion_t const&, eversion_t const&, eversion_t const&, bool, ceph::os::Transaction&, bool)+0x290) [0x5634a74c73f0]
 9: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xd0e) [0x5634a769a48e]
 10: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x267) [0x5634a76aa997]
 11: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x5634a74dd192]
 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x5634a74802ee]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5634a7308aa9]
 14: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5634a7567218]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5634a7328f18]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5634a799c1d4]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5634a799ee74]
 18: /lib64/libpthread.so.0(+0x81cf) [0x7f705ce561cf]
 19: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


This stack trace was captured by ceph-crash running on ODF.
I am attaching the whole file captured by OCS Mustgather.
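
For reference, frame 2 of the trace (BlueStore::Onode::put) fires inside frame 3 (omap_get_values), most likely when a local onode reference taken inside omap_get_values() goes out of scope. Below is a minimal, self-contained sketch of that pattern with hypothetical Node/NodeRef types (not the actual Ceph classes), just to show why a reference-count bug in put() surfaces in callers such as SnapMapper::get_snaps and PG::update_snap_map that never touch the onode directly:

~~~~~
// Hypothetical sketch: an intrusively ref-counted object and a tiny handle
// type in the spirit of boost::intrusive_ptr. NOT BlueStore code.
#include <atomic>

struct Node {
  std::atomic<int> nref{0};

  void get() { ++nref; }
  void put() {
    // If the counter ever disagrees with the real number of holders (for
    // example because of a pin/unpin adjustment done elsewhere), the object
    // can be freed while still in use, and the fault lands right here.
    if (--nref == 0) {
      delete this;
    }
  }
};

// Minimal handle that takes a reference on construction and drops it on
// destruction, so put() runs automatically when the handle leaves scope.
struct NodeRef {
  Node* p = nullptr;
  explicit NodeRef(Node* n) : p(n) { if (p) p->get(); }
  ~NodeRef() { if (p) p->put(); }   // <- this is where the crash frame sits
};

int main() {
  NodeRef ref(new Node());  // e.g. a lookup inside omap_get_values()
  // ... read omap keys through ref.p ...
  return 0;
}                           // ref goes out of scope -> Onode-style put()
~~~~~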

--- Additional comment from RHEL Program Management on 2023-04-12 14:02:23 UTC ---

Since this bug previously had no release flag set, the release flag 'odf-4.13.0' has now been set to '?', proposing it to be fixed in the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were set while the release flag was missing, have now been reset, since Acks are to be set against a release flag.

--- Additional comment from  on 2023-04-12 14:21:00 UTC ---

I suspect the following patch, which does not appear to be in 16.2.7-126.

~~~~~
commit 0960f981f72009086317710f51c5acefe1bd7232
Author: Igor Fedotov <ifedotov>
Date:   Tue Dec 14 17:56:37 2021 +0300

    os/bluestore: get rid of fake onode nref increment for pinned entry
    
    Looks like this isn't necessary any more after fixing
    https://tracker.ceph.com/issues/53002
    
    Signed-off-by: Igor Fedotov <igor.fedotov>

diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
index 4d8b7f3f3e1..78709eb5dc8 100644
--- a/src/os/bluestore/BlueStore.cc
+++ b/src/os/bluestore/BlueStore.cc
@@ -3623,14 +3623,7 @@ void BlueStore::Onode::calc_omap_tail(
   _key_encode_u64(o->onode.nid, out);
   out->push_back('~');
 }
-//
-// A tricky thing about Onode's ref counter is that we do an additional
-// increment when newly pinned instance is detected. And -1 on unpin.
-// This prevents from a conflict with a delete call (when nref == 0).
-// The latter might happen while the thread is in unpin() function
-// (and e.g. waiting for lock acquisition) since nref is already
-// decremented. And another 'putting' thread on the instance will release it.
-//
+
 void BlueStore::Onode::get() {
   if (++nref >= 2 && !pinned) {
     OnodeCacheShard* ocs = c->get_onode_cache();
@@ -3643,11 +3636,7 @@ void BlueStore::Onode::get() {
     }
     bool was_pinned = pinned;
     pinned = nref >= 2;
-    // additional increment for newly pinned instance
     bool r = !was_pinned && pinned;
-    if (r) {
-      ++nref;
-    }
     if (cached && r) {
       ocs->_pin(this);
     }
@@ -3657,7 +3646,7 @@ void BlueStore::Onode::get() {
 void BlueStore::Onode::put() {
   ++put_nref;
   int n = --nref;
-  if (n == 2) {
+  if (n == 1) {
     OnodeCacheShard* ocs = c->get_onode_cache();
     ocs->lock.lock();
     // It is possible that during waiting split_cache moved us to different OnodeCacheShard.
@@ -3667,8 +3656,7 @@ void BlueStore::Onode::put() {
       ocs->lock.lock();
     }
     bool need_unpin = pinned;
-    pinned = pinned && nref > 2; // intentionally use > not >= as we have
-                                 // +1 due to pinned state
+    pinned = pinned && nref >= 2;
     need_unpin = need_unpin && !pinned;
     if (cached && need_unpin) {
       if (exists) {
@@ -3679,10 +3667,6 @@ void BlueStore::Onode::put() {
         c->onode_map._remove(oid);
       }
     }
-    // additional decrement for newly unpinned instance
-    if (need_unpin) {
-      --nref;
-    }
     ocs->lock.unlock();
   }
   auto pn = --put_nref;
~~~~~
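
To make the suspicion easier to follow, here is a simplified, illustrative-only sketch of the two counting schemes touched by the patch (a hand-written PinnedObj type, not the actual BlueStore::Onode code; cache-shard locking and the deletion path are omitted). Before the patch a pinned entry carries an extra "fake" reference, so put() has to test for the count dropping to 2 and compensate with an extra decrement on unpin; after the patch the count tracks only real references and the threshold is 1.

~~~~~
// Illustrative-only comparison of the two schemes; NOT the actual
// BlueStore::Onode::put() (locking and object deletion are left out).
#include <atomic>

struct PinnedObj {
  std::atomic<int> nref{0};
  bool pinned = false;

  // Scheme A (pre-patch): a pinned object holds one extra "fake" reference.
  void put_with_fake_ref() {
    int n = --nref;
    if (n == 2) {                          // 1 real ref left + 1 fake pin ref
      bool need_unpin = pinned && !(nref > 2);
      if (need_unpin) {
        pinned = false;
        --nref;                            // drop the fake reference as well
      }
    }
  }

  // Scheme B (the patch): nref counts only real references.
  void put_without_fake_ref() {
    int n = --nref;
    if (n == 1) {                          // last external reference going away
      if (pinned && !(nref >= 2)) {
        pinned = false;                    // unpin; nothing to compensate
      }
    }
  }
};

int main() {
  PinnedObj a;
  a.nref = 3; a.pinned = true;   // two real refs + the fake pin ref
  a.put_with_fake_ref();         // -> nref 1, unpinned

  PinnedObj b;
  b.nref = 2; b.pinned = true;   // the same situation without the fake ref
  b.put_without_fake_ref();      // -> nref 1, unpinned
  return 0;
}
~~~~~

The commit's direction is the simpler one: once the fake increment is gone, the pin state no longer feeds back into the counter, and get()/put() only have to agree on the real number of holders.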

--- Additional comment from Sunil Kumar Acharya on 2023-04-17 15:43:34 UTC ---

Moving this out of 4.13 as the issue still needs to be picked up for investigation. Please feel free to propose it back to 4.13, with a comment, if this turns out to be a blocker BZ.