Bug 2219223

Summary: OSDs crash with error: Message::encode
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Pawan <pdhiran>
Component: RADOS
Assignee: Radoslaw Zarzynski <rzarzyns>
Status: NEW ---
QA Contact: Pawan <pdhiran>
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.1
CC: bhubbard, ceph-eng-bugs, cephqe-warriors, kseeger, nojha, vumrao
Target Milestone: ---
Target Release: 6.1z2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Pawan 2023-07-03 03:21:47 UTC
Description of problem:
Observing crashes on OSDs with the following crash trace:

# ceph crash info 2023-06-05T21:44:47.304778Z_c34bb028-bacb-4453-b7a4-f8e78bb45931
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f9173ec9df0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f9173f1654c]",
        "(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x27b) [0x56060edb1f6b]",
        "(ceph::HeartbeatMap::clear_timeout(ceph::heartbeat_handle_d*)+0x5f) [0x56060edb221f]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0xa04) [0x56060eccf244]",
        "/usr/bin/ceph-osd(+0x4dc688) [0x56060e819688]",
        "(OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x115) [0x56060e883f25]",
        "(OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2c5) [0x56060e88d525]",
        "(ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x56060ea6a9b1]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf7) [0x56060e8a1b37]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x2a3) [0x56060edc5993]",
        "/usr/bin/ceph-osd(+0xa88f34) [0x56060edc5f34]",
        "/lib64/libc.so.6(+0x9f802) [0x7f9173f14802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f9173eb4450]"
    ],
    "ceph_version": "17.2.6-58.0.TEST.bz2119217.el9cp",
    "crash_id": "2023-06-05T21:44:47.304778Z_c34bb028-bacb-4453-b7a4-f8e78bb45931",
    "entity_name": "osd.19",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-osd",
    "stack_sig": "b58d85471dd847a9e99e6e29cf54d1baed4337a44d0a6088f729bbde8329cfe0",
    "timestamp": "2023-06-05T21:44:47.304778Z",
    "utsname_hostname": "ceph-pdhiran-spoc3h-node4",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.11.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 12 10:45:03 EDT 2023"
}
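The crash record above is plain JSON, so the key fields (affected daemon, timestamp, first named stack frame) can be pulled out with a short script when triaging many such records. A minimal sketch, assuming a dict shaped like the `ceph crash info` output shown above (the backtrace here is abridged for illustration):

```python
def summarize_crash(crash: dict) -> str:
    """Condense a `ceph crash info` JSON record into a one-line summary."""
    # Keep only demangled frames like "(Func(...)+0x27b) [0x...]";
    # skip raw address-only entries such as "/lib64/libc.so.6(+0x54df0) [...]".
    named_frames = [f for f in crash["backtrace"] if f.startswith("(")]
    top = named_frames[0] if named_frames else "<no named frame>"
    return (f"{crash['entity_name']} @ {crash['timestamp']} "
            f"(ceph {crash['ceph_version']}): {top.split('+')[0]}")

# Example using the record from this report (backtrace abridged).
crash = {
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f9173ec9df0]",
        "(ceph::HeartbeatMap::_check(...)+0x27b) [0x56060edb1f6b]",
    ],
    "ceph_version": "17.2.6-58.0.TEST.bz2119217.el9cp",
    "entity_name": "osd.19",
    "timestamp": "2023-06-05T21:44:47.304778Z",
}
print(summarize_crash(crash))
```

In practice the dict would come from `json.loads()` over the output of `ceph crash info <id> --format json-pretty`.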


According to Nitzan in this comment: https://bugzilla.redhat.com/show_bug.cgi?id=2209274#c13

Frame 7, according to the objdump, is probably encode_payload(features), and looks related to https://tracker.ceph.com/issues/52657

Tests that had been performed on the cluster:
1. OSD daemon & OSD host reboots.
2. OSD daemon & OSD host removal.
3. OSD daemon & OSD host addition.
4. Creating multiple new pools and writing, reading, and deleting data.

Version-Release number of selected component (if applicable):
# ceph version
ceph version 17.2.6-58.0.TEST.bz2119217.el9cp (7da3e6ae59de2dacd4d7dc88c7421d9016259fea) quincy (stable)

How reproducible:
1/1

Steps to Reproduce:
1. Deploy an RHCS 6.1 cluster.
2. Perform the OSD replacement tests mentioned above.
3. While the cluster was left idle, and while data was being written into the pools, observe OSDs crashing randomly.
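Since the crashes appear random, one way to tell whether they share a single root cause is to group the crash records by their stack signature. A minimal sketch, assuming records shaped like the list produced by `ceph crash ls --format json` (the entries and the second signature below are hypothetical examples):

```python
from collections import Counter

def crashes_by_signature(crashes: list) -> Counter:
    """Count crash records per stack_sig to spot a recurring crash."""
    return Counter(c.get("stack_sig", "<none>") for c in crashes)

# Hypothetical records; the first signature is the one from this report.
crashes = [
    {"entity_name": "osd.19",
     "stack_sig": "b58d85471dd847a9e99e6e29cf54d1baed4337a44d0a6088f729bbde8329cfe0"},
    {"entity_name": "osd.7",
     "stack_sig": "b58d85471dd847a9e99e6e29cf54d1baed4337a44d0a6088f729bbde8329cfe0"},
    {"entity_name": "osd.3",
     "stack_sig": "deadbeef"},
]
for sig, n in crashes_by_signature(crashes).most_common():
    print(f"{n}x {sig[:16]}")
```

If most records collapse onto one signature, as `osd.19`'s does here, the "random" crashes are likely one bug rather than several.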


Actual results:
Observing OSD crashes

Expected results:
No crashes observed on the cluster.

Additional info:
Raising a new bug as requested here: https://bugzilla.redhat.com/show_bug.cgi?id=2209274#c14