Created attachment 1957179 [details]
Crash of the OSD captured by ceph-crash in the ODF must-gather

Description of problem (please be detailed as possible and provide log snippets):

ceph osd.1, running on an OCP/ODF 4.10.10 cluster, crashed and restarted automatically. "ceph status" reports a HEALTH_WARN related to one daemon crash, but it looks like a one-shot event.

  cluster:
    id:     7ec4a1a3-5211-4e4a-9bf6-c9e01f39a474
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum a,b,c (age 7w)
    mgr: a(active, since 5h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 5h), 3 in (since 2M)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 643.32k objects, 318 GiB
    usage:   960 GiB used, 540 GiB / 1.5 TiB avail
    pgs:     177 active+clean

  io:
    client: 9.4 MiB/s rd, 13 MiB/s wr, 14 op/s rd, 122 op/s wr

Version of all relevant components (if applicable):

    "osd": {
        "ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)": 3
    },

The ceph containers run the following image (tag 5-235):
registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:fc25524ccb0ea78526257778ab54bfb1a25772b75fcc97df98eb06a0e67e1bf6
which is available at:
https://catalog.redhat.com/software/containers/rhceph/rhceph-5-rhel8/60ec72a74a6a2c7844abe5fb?tag=5-235&push_date=1656624434000

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No impact on production so far.

Is there any workaround available to the best of your knowledge?
No workaround needed for now; the cluster keeps working.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
No idea.

Is this issue reproducible?
Not so far.

Can this issue be reproduced from the UI?
No.

If this is a regression, please provide more details to justify this:
No regression.
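For reference, the crash report behind the HEALTH_WARN can be inspected and, once triaged, archived with the standard "ceph crash" commands. This is only a sketch of the usual workflow; <crash-id> is a placeholder, not the actual ID from this cluster:

    # list recent crash reports and show the full metadata/backtrace for one of them
    ceph crash ls
    ceph crash info <crash-id>
    # after triage, archive it so the "recently crashed" health warning clears
    ceph crash archive <crash-id>
    # or archive everything at once
    ceph crash archive-all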
Steps to Reproduce:
Not reproducible so far.

Actual results:
N/A

Expected results:

Additional info:

Here is the full stack trace of the crash:

 ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f705ce60ce0]
 2: (BlueStore::Onode::put()+0x2cb) [0x5634a77de74b]
 3: (BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x1a5) [0x5634a7825c15]
 4: (MapCacher::MapCacher<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>::get_keys(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x3e3) [0x5634a7506393]
 5: (SnapMapper::get_snaps(hobject_t const&, SnapMapper::object_snaps*)+0xec) [0x5634a74fc80c]
 6: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> > const&, MapCacher::Transaction<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list>*)+0xcb) [0x5634a750015b]
 7: (PG::update_snap_map(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, ceph::os::Transaction&)+0x814) [0x5634a73b2394]
 8: (non-virtual thunk to PrimaryLogPG::log_operation(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t> const&, eversion_t const&, eversion_t const&, eversion_t const&, bool, ceph::os::Transaction&, bool)+0x290) [0x5634a74c73f0]
 9: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xd0e) [0x5634a769a48e]
 10: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x267) [0x5634a76aa997]
 11: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x5634a74dd192]
 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x5634a74802ee]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5634a7308aa9]
 14: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5634a7567218]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5634a7328f18]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5634a799c1d4]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5634a799ee74]
 18: /lib64/libpthread.so.0(+0x81cf) [0x7f705ce561cf]
 19: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This stack trace was captured by ceph-crash running on ODF. The complete file captured by the OCS must-gather is attached.
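For completeness, the ceph CLI on an ODF cluster is usually reached through the rook-ceph toolbox pod. A minimal sketch, assuming the default openshift-storage namespace and that enabling the tools pod is acceptable on this cluster:

    # enable the toolbox pod (if not already enabled)
    oc patch OCSInitialization ocsinit -n openshift-storage --type json \
      --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
    # open a shell in the toolbox and run the usual ceph commands from there
    oc -n openshift-storage rsh deployment/rook-ceph-tools
    ceph status
    ceph crash ls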