Bug 2186234

Summary: [ODF 4.10.10][Tracker for BZ #2228546] OSD crash (SIGSEGV) in BlueStore::Onode::put called by PG::update_snap_map
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: jpeyrard
Component: ceph
Sub component: RADOS
Assignee: Adam Kupczyk <akupczyk>
QA Contact: Elad <ebenahar>
Docs Contact:
Status: NEW
Severity: medium
Priority: unspecified
CC: bniver, muagarwa, odf-bz-bot, rzarzyns, sheggodu, sostapov
Version: 4.10
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2228546
Environment:
Last Closed:
Type: Bug
Regression: ---
Bug Depends On:
Bug Blocks: 2228546

Description jpeyrard 2023-04-12 14:02:12 UTC
Created attachment 1957179 [details]
Crash of the OSD captured by ceph-crash, included in the ODF must-gather

Description of problem (please be as detailed as possible and provide log snippets):

ceph osd.1, running on an OCP/ODF 4.10.10 cluster, crashed and restarted automatically.
ceph status reports HEALTH_WARN for one recently crashed daemon, but it looks like a one-shot event.


  cluster:
    id:     7ec4a1a3-5211-4e4a-9bf6-c9e01f39a474
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 7w)
    mgr: a(active, since 5h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 5h), 3 in (since 2M)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 643.32k objects, 318 GiB
    usage:   960 GiB used, 540 GiB / 1.5 TiB avail
    pgs:     177 active+clean

  io:
    client:   9.4 MiB/s rd, 13 MiB/s wr, 14 op/s rd, 122 op/s wr
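
The crash record behind the HEALTH_WARN can be listed and inspected with the ceph crash commands. A minimal example, assuming the standard rook-ceph-tools toolbox deployment in the openshift-storage namespace (names may differ on other clusters):

  # open a shell in the toolbox pod
  oc -n openshift-storage rsh deployment/rook-ceph-tools

  # list recorded crashes and dump the metadata/backtrace of a specific one
  ceph crash ls
  ceph crash info <crash-id>

  # once analysed, archive the crash so the health warning clears
  ceph crash archive <crash-id>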



Version of all relevant components (if applicable):

    "osd": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 3
    },

The Ceph containers run the following image (tag 5-235):
registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6

This image is available from the Red Hat container catalog:
https://catalog.redhat.com/software/containers/rhceph/rhceph-5-rhel8/60ec72a74a6a2c7844abe5fb?tag=5-235&push_date=1656624434000
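
The per-daemon summary above is in the format printed by ceph versions, and the image actually used by the OSD pods can be cross-checked from OpenShift. A sketch, assuming the usual rook-ceph labels and the openshift-storage namespace:

  # per-daemon ceph version counts (source of the JSON snippet above)
  ceph versions

  # image referenced by the OSD pods
  oc -n openshift-storage get pods -l app=rook-ceph-osd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'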



Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

No impact on production so far


Is there any workaround available to the best of your knowledge?

No workaround is needed for now, since the cluster keeps working.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

No idea

Is this issue reproducible?

For now, it is not.

Can this issue be reproduced from the UI?

No

If this is a regression, please provide more details to justify this:

No regression.

Steps to Reproduce:

Not reproducible so far.

Actual results:

N/A

Expected results:


Additional info:

Here is the full stack trace of the crash:

 ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f705ce60ce0]
 2: (BlueStore::Onode::put()+0x2cb) [0x5634a77de74b]
 3: (BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x1a5) [0x5634a7825c15]
 4: (MapCacher::MapCacher<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>::get_keys(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x3e3) [0x5634a7506393]
 5: (SnapMapper::get_snaps(hobject_t const&, SnapMapper::object_snaps*)+0xec) [0x5634a74fc80c]
 6: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> > const&, MapCacher::Transaction<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>*)+0xcb) [0x5634a750015b]
 7: (PG::update_snap_map(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, ceph::os::Transaction&)+0x814) [0x5634a73b2394]
 8: (non-virtual thunk to PrimaryLogPG::log_operation(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t> const&, eversion_t const&, eversion_t const&, eversion_t const&, bool, ceph::os::Transaction&, bool)+0x290) [0x5634a74c73f0]
 9: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xd0e) [0x5634a769a48e]
 10: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x267) [0x5634a76aa997]
 11: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x5634a74dd192]
 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x5634a74802ee]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5634a7308aa9]
 14: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5634a7567218]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5634a7328f18]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5634a799c1d4]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5634a799ee74]
 18: /lib64/libpthread.so.0(+0x81cf) [0x7f705ce561cf]
 19: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


This stack trace was captured by ceph-crash running on ODF.
I am attaching the full file captured by the OCS must-gather.
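
To interpret frames such as BlueStore::Onode::put()+0x2cb (see the NOTE in the trace), the offsets need to be resolved against a ceph-osd binary with debug symbols, e.g. in a container started from the exact rhceph-5-rhel8 image referenced above. A rough sketch, assuming debuginfo repositories are reachable (package and repo setup may differ):

  # install gdb and the matching ceph debug symbols inside the container
  dnf install -y gdb
  dnf debuginfo-install -y ceph-osd

  # map the faulting frame back to a source line
  gdb -batch -ex "list *('BlueStore::Onode::put()'+0x2cb)" /usr/bin/ceph-osd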