Bug 2152986

Summary: OSDs stuck with "failed to load OSD map for epoch 585379, got 0 bytes" after node reboot followed by service restart
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Vasishta <vashastr>
Component: RADOS Assignee: Prashant Dhange <pdhange>
Status: ASSIGNED --- QA Contact: Pawan <pdhiran>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.3 CC: adking, akupczyk, amathuri, amk, bhubbard, bhull, ceph-eng-bugs, cephqe-warriors, choffman, hyelloji, ksirivad, lflores, lithomas, nojha, pdhange, rfriedma, rzarzyns, skanta, sseshasa, vumrao
Target Milestone: ---   
Target Release: 8.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2154752    

Description Vasishta 2022-12-13 17:32:48 UTC
Description of problem:
In a cluster with different public and cluster networks, one node had some stale CephFS mounts. After they were cleared, the node had to be rebooted to unblock ceph-volume (BZ 2152053), which was stuck on the inventory list, so that the Ceph orchestrator could proceed with upgrading the cluster from one build to another.

After the reboot, none of the OSDs on that node came up (the services were down), with the message **state<Start>: transitioning to Stray**. The Ceph orchestrator had actually tried upgrading them from one build of 5.3 to 16.2.10-82.el8cp.

After waiting for a day, all Ceph daemons on that node were restarted using ceph.target.

The OSDs are now stuck/flapping and log the following:
+0000 7f3bb637f700 20 osd.73 633905 get_map 580410 - loading and decoding 0x56344e6f0000
Dec 13 17:22:01 f12-h09-000-1029u.rdu2.scalelab.redhat.com conmon[628485]: debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 -1 osd.73 633905 failed to load OSD map for epoch 580410, got 0 bytes
Dec 13 17:22:01 f12-h09-000-1029u.rdu2.scalelab.redhat.com conmon[628485]: debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 20 osd.73 633905 advance_pg missing map 580410
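
For triage, the affected OSDs and the epochs whose maps came back empty can be collected from the journal output. The following is a minimal sketch that assumes the message format is exactly as in the excerpt above; the names `FAILED_MAP_RE` and `missing_epochs` are hypothetical helpers, not part of Ceph.

```python
import re
from collections import defaultdict

# Assumption: the line format matches the "failed to load OSD map" message
# shown in the log excerpt of this report.
FAILED_MAP_RE = re.compile(
    r"osd\.(?P<osd>\d+) (?P<current>\d+) failed to load OSD map for epoch "
    r"(?P<epoch>\d+), got (?P<size>\d+) bytes"
)


def missing_epochs(lines):
    """Return {osd_id: sorted list of epochs whose map failed to load}."""
    found = defaultdict(set)
    for line in lines:
        match = FAILED_MAP_RE.search(line)
        if match:
            found[int(match.group("osd"))].add(int(match.group("epoch")))
    return {osd: sorted(epochs) for osd, epochs in found.items()}


# Two lines taken from the excerpt above (host and conmon prefix dropped).
sample = [
    "debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 -1 osd.73 633905 "
    "failed to load OSD map for epoch 580410, got 0 bytes",
    "debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 20 osd.73 633905 "
    "advance_pg missing map 580410",
]

print(missing_epochs(sample))  # {73: [580410]}
```

The same function could be fed the journal lines of every OSD on the node (for example, read from standard input) to see whether all OSDs are missing the same range of epochs.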


Version-Release number of selected component (if applicable):
16.2.10-82.el8cp

How reproducible:
Faced once.
After the sequence described above, a restart of a single OSD was tried through the orchestrator; the OSD restarted but ended up in the same state.

Steps to Reproduce: <Mentioned above>

Actual results:
OSDs are not up and in

Expected results:
OSDs should come up after the node reboot.

Additional info: