Bug 2219223

Summary: OSDs crash with error: Message::encode
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Pawan <pdhiran>
Component: RADOS
Assignee: Radoslaw Zarzynski <rzarzyns>
Status: NEW ---
QA Contact: Pawan <pdhiran>
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.1
CC: bhubbard, ceph-eng-bugs, cephqe-warriors, kseeger, nojha, vumrao
Target Milestone: ---
Target Release: 6.1z2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Pawan 2023-07-03 03:21:47 UTC
Description of problem:
Observing crashes on OSDs with the following crash trace:

# ceph crash info 2023-06-05T21:44:47.304778Z_c34bb028-bacb-4453-b7a4-f8e78bb45931
{
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f9173ec9df0]",
        "/lib64/libc.so.6(+0xa154c) [0x7f9173f1654c]",
        "(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x27b) [0x56060edb1f6b]",
        "(ceph::HeartbeatMap::clear_timeout(ceph::heartbeat_handle_d*)+0x5f) [0x56060edb221f]",
        "(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0xa04) [0x56060eccf244]",
        "/usr/bin/ceph-osd(+0x4dc688) [0x56060e819688]",
        "(OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x115) [0x56060e883f25]",
        "(OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2c5) [0x56060e88d525]",
        "(ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x51) [0x56060ea6a9b1]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf7) [0x56060e8a1b37]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x2a3) [0x56060edc5993]",
        "/usr/bin/ceph-osd(+0xa88f34) [0x56060edc5f34]",
        "/lib64/libc.so.6(+0x9f802) [0x7f9173f14802]",
        "/lib64/libc.so.6(+0x3f450) [0x7f9173eb4450]"
    ],
    "ceph_version": "17.2.6-58.0.TEST.bz2119217.el9cp",
    "crash_id": "2023-06-05T21:44:47.304778Z_c34bb028-bacb-4453-b7a4-f8e78bb45931",
    "entity_name": "osd.19",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-osd",
    "stack_sig": "b58d85471dd847a9e99e6e29cf54d1baed4337a44d0a6088f729bbde8329cfe0",
    "timestamp": "2023-06-05T21:44:47.304778Z",
    "utsname_hostname": "ceph-pdhiran-spoc3h-node4",
    "utsname_machine": "x86_64",
    "utsname_release": "5.14.0-284.11.1.el9_2.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Wed Apr 12 10:45:03 EDT 2023"
}
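The crash record above is plain JSON, so the key fields (affected daemon, timestamp, first named stack frame) can be pulled out with a short script when triaging many such records. A minimal sketch, assuming a dict shaped like the `ceph crash info` output shown above (the backtrace here is abridged for illustration):

```python
def summarize_crash(crash: dict) -> str:
    """Condense a `ceph crash info` JSON record into a one-line summary."""
    # Keep only demangled frames like "(Func(...)+0x27b) [0x...]";
    # skip raw address-only entries such as "/lib64/libc.so.6(+0x54df0) [...]".
    named_frames = [f for f in crash["backtrace"] if f.startswith("(")]
    top = named_frames[0] if named_frames else "<no named frame>"
    return (f"{crash['entity_name']} @ {crash['timestamp']} "
            f"(ceph {crash['ceph_version']}): {top.split('+')[0]}")

# Example using the record from this report (backtrace abridged).
crash = {
    "backtrace": [
        "/lib64/libc.so.6(+0x54df0) [0x7f9173ec9df0]",
        "(ceph::HeartbeatMap::_check(...)+0x27b) [0x56060edb1f6b]",
    ],
    "ceph_version": "17.2.6-58.0.TEST.bz2119217.el9cp",
    "entity_name": "osd.19",
    "timestamp": "2023-06-05T21:44:47.304778Z",
}
print(summarize_crash(crash))
```

In practice the dict would come from `json.loads()` over the output of `ceph crash info <id> --format json-pretty`.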


According to Nitzan in this comment: https://bugzilla.redhat.com/show_bug.cgi?id=2209274#c13

Frame 7, according to the objdump, is probably encode_payload(features), and looks related to https://tracker.ceph.com/issues/52657

Tests that had been performed on the cluster:
1. OSD daemon & OSD host reboots.
2. OSD daemon & OSD host removal.
3. OSD daemon & OSD host addition.
4. Creating multiple new pools and writing, reading, and deleting data.

Version-Release number of selected component (if applicable):
# ceph version
ceph version 17.2.6-58.0.TEST.bz2119217.el9cp (7da3e6ae59de2dacd4d7dc88c7421d9016259fea) quincy (stable)

How reproducible:
1/1

Steps to Reproduce:
1. Deploy an RHCS 6.1 cluster.
2. Perform the OSD replacement tests mentioned above.
3. While the cluster was left idle, and while data was being written into the pools, observe OSDs crashing randomly.
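Since the crashes appear random, one way to tell whether they share a single root cause is to group the crash records by their stack signature. A minimal sketch, assuming records shaped like the list produced by `ceph crash ls --format json` (the entries and the second signature below are hypothetical examples):

```python
from collections import Counter

def crashes_by_signature(crashes: list) -> Counter:
    """Count crash records per stack_sig to spot a recurring crash."""
    return Counter(c.get("stack_sig", "<none>") for c in crashes)

# Hypothetical records; the first signature is the one from this report.
crashes = [
    {"entity_name": "osd.19",
     "stack_sig": "b58d85471dd847a9e99e6e29cf54d1baed4337a44d0a6088f729bbde8329cfe0"},
    {"entity_name": "osd.7",
     "stack_sig": "b58d85471dd847a9e99e6e29cf54d1baed4337a44d0a6088f729bbde8329cfe0"},
    {"entity_name": "osd.3",
     "stack_sig": "deadbeef"},
]
for sig, n in crashes_by_signature(crashes).most_common():
    print(f"{n}x {sig[:16]}")
```

If most records collapse onto one signature, as `osd.19`'s does here, the "random" crashes are likely one bug rather than several.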


Actual results:
Observing OSD crashes

Expected results:
No crashes observed on the cluster.

Additional info:
Raising a new bug as requested here: https://bugzilla.redhat.com/show_bug.cgi?id=2209274#c14