Bug 2152986

Summary:	OSDs stuck with "failed to load OSD map for epoch 585379, got 0 bytes" after node reboot followed by service restart
Product:	[Red Hat Storage] Red Hat Ceph Storage
Reporter:	Vasishta <vashastr>
Component:	RADOS
Assignee:	Prashant Dhange <pdhange>
Status:	ASSIGNED ---
QA Contact:	Pawan <pdhiran>
Severity:	medium
Docs Contact:
Priority:	medium
Version:	5.3
CC:	adking, akupczyk, amathuri, amk, bhubbard, bhull, ceph-eng-bugs, cephqe-warriors, choffman, hyelloji, ksirivad, lflores, nojha, pdhange, rfriedma, rzarzyns, skanta, sseshasa, vumrao
Target Milestone:	---
Flags:	pdhange: needinfo? (rzarzyns)
Target Release:	9.1
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Story Points:	---
Clone Of:
Environment:
Last Closed:
Type:	Bug
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Category:	---
oVirt Team:	---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2154752

Description Vasishta 2022-12-13 17:32:48 UTC

Description of problem:
In a cluster with separate public and cluster networks, one node had some stale CephFS mounts. After clearing them, the node had to be rebooted to unblock ceph-volume (BZ 2152053), which was stuck on the inventory list, so that the Ceph orchestrator could proceed with upgrading the cluster from one build to another.

After the reboot, none of the OSDs on that node came up (the services were down), with the message **state<Start>: transitioning to Stray**. The Ceph orchestrator had tried upgrading them from one build of 5.3 to 16.2.10-82.el8cp.

After waiting for a day, all Ceph daemons on that node were restarted using ceph.target.

The OSDs are now stuck/flapping with the following log messages:
+0000 7f3bb637f700 20 osd.73 633905 get_map 580410 - loading and decoding 0x56344e6f0000
Dec 13 17:22:01 f12-h09-000-1029u.rdu2.scalelab.redhat.com conmon[628485]: debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 -1 osd.73 633905 failed to load OSD map for epoch 580410, got 0 bytes
Dec 13 17:22:01 f12-h09-000-1029u.rdu2.scalelab.redhat.com conmon[628485]: debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 20 osd.73 633905 advance_pg missing map 580410


Version-Release number of selected component (if applicable):
16.2.10-82.el8cp

How reproducible:
Seen once.
After the sequence described above, a restart of a single OSD through the orchestrator was also tried; the OSD restarted but ended up in the same state.

Steps to Reproduce: see the description above.

Actual results:
OSDs are not up and in

Expected results:
OSDs come up after node reboot.

Additional info:
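To see which epochs each OSD fails to load, the journal output can be scanned for the failure line quoted in the description. The following is a minimal sketch in Python, assuming log lines in exactly that format. The names `FAILED_MAP_RE`, `missing_epochs`, and `SAMPLE` are illustrative and are not part of any Ceph tooling.

```python
import re
from collections import defaultdict

# Matches the failure line format quoted in this report, for example:
# "... -1 osd.73 633905 failed to load OSD map for epoch 580410, got 0 bytes"
# (assumption: other builds may word this message differently).
FAILED_MAP_RE = re.compile(
    r"osd\.(?P<osd>\d+) (?P<current>\d+) failed to load OSD map "
    r"for epoch (?P<epoch>\d+), got (?P<size>\d+) bytes"
)


def missing_epochs(lines):
    """Return {osd_id: sorted list of epochs whose OSD map failed to load}."""
    found = defaultdict(set)
    for line in lines:
        match = FAILED_MAP_RE.search(line)
        if match:
            found[int(match.group("osd"))].add(int(match.group("epoch")))
    return {osd: sorted(epochs) for osd, epochs in found.items()}


# Sample input: the two complete log lines from the description above.
SAMPLE = [
    "Dec 13 17:22:01 f12-h09-000-1029u.rdu2.scalelab.redhat.com conmon[628485]: "
    "debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 -1 osd.73 633905 "
    "failed to load OSD map for epoch 580410, got 0 bytes",
    "Dec 13 17:22:01 f12-h09-000-1029u.rdu2.scalelab.redhat.com conmon[628485]: "
    "debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 20 osd.73 633905 "
    "advance_pg missing map 580410",
]

print(missing_epochs(SAMPLE))
```

In practice the lines would come from a saved journal file rather than a hard-coded list. Grouping by OSD makes it easier to tell whether all OSDs on the node are missing the same epoch range or only some of them.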