Created attachment 1957179 [details]
Crash of the OSD captured by ceph-crash in the ODF must-gather

Description of problem (please be detailed as possible and provide log snippets):

ceph osd.1, running on an OCP/ODF 4.10.10 cluster, crashed and restarted automatically. "ceph status" reports a HEALTH_WARN related to one daemon crash, but it looks like a one-shot event.

  cluster:
    id:     7ec4a1a3-5211-4e4a-9bf6-c9e01f39a474
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 7w)
    mgr: a(active, since 5h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 5h), 3 in (since 2M)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 643.32k objects, 318 GiB
    usage:   960 GiB used, 540 GiB / 1.5 TiB avail
    pgs:     177 active+clean

  io:
    client: 9.4 MiB/s rd, 13 MiB/s wr, 14 op/s rd, 122 op/s wr

Version of all relevant components (if applicable):

    "osd": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 3
    },

The ceph containers run the following image (tag 5-235):
registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
which is available at:
https://catalog.redhat.com/software/containers/rhceph/rhceph-5-rhel8/60ec72a74a6a2c7844abe5fb?tag=5-235&push_date=1656624434000

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No impact on production so far.

Is there any workaround available to the best of your knowledge?
No workaround needed for now; the cluster keeps working.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
No idea.

Is this issue reproducible?
Not so far.

Can this issue be reproduced from the UI?
No.

If this is a regression, please provide more details to justify this:
No regression.
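For reference, the crash report behind the HEALTH_WARN can be inspected and, once triaged, archived with the standard "ceph crash" commands. This is only a sketch of the usual workflow; <crash-id> is a placeholder, not the actual ID from this cluster:

    # list recent crash reports and show the full metadata/backtrace for one of them
    ceph crash ls
    ceph crash info <crash-id>
    # after triage, archive it so the "recently crashed" health warning clears
    ceph crash archive <crash-id>
    # or archive everything at once
    ceph crash archive-all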
Steps to Reproduce:
Not reproducible so far.

Actual results:
N/A

Expected results:

Additional info:

Here is the full stack trace of the crash:

 ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f705ce60ce0]
 2: (BlueStore::Onode::put()+0x2cb) [0x5634a77de74b]
 3: (BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x1a5) [0x5634a7825c15]
 4: (MapCacher::MapCacher<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>::get_keys(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x3e3) [0x5634a7506393]
 5: (SnapMapper::get_snaps(hobject_t const&, SnapMapper::object_snaps*)+0xec) [0x5634a74fc80c]
 6: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> > const&, MapCacher::Transaction<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>*)+0xcb) [0x5634a750015b]
 7: (PG::update_snap_map(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, ceph::os::Transaction&)+0x814) [0x5634a73b2394]
 8: (non-virtual thunk to PrimaryLogPG::log_operation(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t> const&, eversion_t const&, eversion_t const&, eversion_t const&, bool, ceph::os::Transaction&, bool)+0x290) [0x5634a74c73f0]
 9: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xd0e) [0x5634a769a48e]
 10: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x267) [0x5634a76aa997]
 11: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x5634a74dd192]
 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x5634a74802ee]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5634a7308aa9]
 14: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5634a7567218]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5634a7328f18]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5634a799c1d4]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5634a799ee74]
 18: /lib64/libpthread.so.0(+0x81cf) [0x7f705ce561cf]
 19: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This stack trace was captured by ceph-crash running on ODF. The complete file captured by the OCS must-gather is attached.
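For completeness, the ceph CLI on an ODF cluster is usually reached through the rook-ceph toolbox pod. A minimal sketch, assuming the default openshift-storage namespace and that enabling the tools pod is acceptable on this cluster:

    # enable the toolbox pod (if not already enabled)
    oc patch OCSInitialization ocsinit -n openshift-storage --type json \
      --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
    # open a shell in the toolbox and run the usual ceph commands from there
    oc -n openshift-storage rsh deployment/rook-ceph-tools
    ceph status
    ceph crash ls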