Bug 2219223 - OSDs crash with error: Message::encode
Summary: OSDs crash with error: Message::encode
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 6.1z2
Assignee: Radoslaw Zarzynski
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-03 03:21 UTC by Pawan
Modified: 2023-07-11 20:11 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Attachments


Links
Red Hat Issue Tracker RHCEPH-6947 (last updated 2023-07-03 03:22:05 UTC)

Description Pawan 2023-07-03 03:21:47 UTC
Description of problem:
Observing crashes on OSDs with the following crash trace:

# ceph crash info 2023-06-05T21:44:47.304778Z_c34bb028-bacb-4453-b7a4-f8e78bb45931
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f9173ec9df0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f9173f1654c]",
        "(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x27b) [0x56060edb1f6b]",
        "(ceph::HeartbeatMap::clear_timeout(ceph::heartbeat_handle_d*)+0x5f) [0x56060edb221f]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0xa04) [0x56060eccf244]",
        "/usr/bin/ceph-osd(+0x4dc688) [0x56060e819688]",
        "(OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x115) [0x56060e883f25]",
        "(OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2c5) [0x56060e88d525]",
        "(ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x56060ea6a9b1]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf7) [0x56060e8a1b37]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x2a3) [0x56060edc5993]",
        "/usr/bin/ceph-osd(+0xa88f34) [0x56060edc5f34]",
        "/lib64/libc.so.6(+0x9f802) [0x7f9173f14802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f9173eb4450]"
    ],
    "ceph_version": "17.2.6-58.0.TEST.bz2119217.el9cp",
    "crash_id": "2023-06-05T21:44:47.304778Z_c34bb028-bacb-4453-b7a4-f8e78bb45931",
    "entity_name": "osd.19",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-osd",
    "stack_sig": "b58d85471dd847a9e99e6e29cf54d1baed4337a44d0a6088f729bbde8329cfe0",
    "timestamp": "2023-06-05T21:44:47.304778Z",
    "utsname_hostname": "ceph-pdhiran-spoc3h-node4",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.11.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 12 10:45:03 EDT 2023"
}
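For triage it can help to separate the frames in the backtrace above that are already symbolized from the raw binary+offset entries such as `/usr/bin/ceph-osd(+0x4dc688)`, which still need address resolution. A minimal sketch, assuming the crash JSON's "backtrace" list as input (the helper name is hypothetical):

```python
import re

def classify_frames(backtrace):
    """Split crash frames into already-symbolized entries and raw
    (binary, offset) pairs that still need address resolution."""
    symbolized, raw = [], []
    for frame in backtrace:
        # Raw frames look like "<binary>(+0x<offset>) [0x<addr>]";
        # symbolized ones start with a demangled "(Function::name(...)+0x...)".
        m = re.match(r'^(\S+)\(\+(0x[0-9a-f]+)\)', frame)
        if m:
            raw.append((m.group(1), m.group(2)))
        else:
            symbolized.append(frame)
    return symbolized, raw
```

Run over the backtrace above, this would leave the libc and ceph-osd offset-only frames in the raw list for follow-up with addr2line or objdump.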


According to Nitzan in comment https://bugzilla.redhat.com/show_bug.cgi?id=2209274#c13:

frame 7, according to the objdump, is probably encode_payload(features), and looks related to the upstream bug https://tracker.ceph.com/issues/52657
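The raw `/usr/bin/ceph-osd(+0x4dc688)` frame referenced there can, in principle, be mapped back to a function name with addr2line against a ceph-osd binary that has debug symbols installed. A small sketch that only builds the command line (whether matching debuginfo is available on the host is an assumption):

```python
def addr2line_cmd(frame):
    """Build an addr2line argv for a raw backtrace frame such as
    "/usr/bin/ceph-osd(+0x4dc688) [0x56060e819688]"."""
    binary, rest = frame.split("(+", 1)
    offset = rest.split(")", 1)[0]
    # -C demangles C++ names, -f prints the enclosing function,
    # -e selects the binary in which to look up the offset
    return ["addr2line", "-C", "-f", "-e", binary, offset]
```

For the frame above this yields the argv for `addr2line -C -f -e /usr/bin/ceph-osd 0x4dc688`.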

Tests that were performed on the cluster:
1. OSD daemon & OSD host reboots.
2. OSD daemon & OSD host removal.
3. OSD daemon & OSD host addition.
4. Creating multiple new pools and writing, reading, and deleting data.

Version-Release number of selected component (if applicable):
# ceph version
ceph version 17.2.6-58.0.TEST.bz2119217.el9cp (7da3e6ae59de2dacd4d7dc88c7421d9016259fea) quincy (stable)

How reproducible:
1/1

Steps to Reproduce:
1. Deploy RHCS 6.1 cluster
2. Perform OSD replacement tests mentioned above
3. With the cluster otherwise left idle while data was being written into the pools, observe OSDs crashing randomly


Actual results:
Observing OSD crashes

Expected results:
No crashes should be observed on the cluster

Additional info:
Raising a new bug as requested here: https://bugzilla.redhat.com/show_bug.cgi?id=2209274#c14

