Bug 2152986 - OSDs stuck with "failed to load OSD map for epoch 585379, got 0 bytes" after node reboot followed by service restart
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 5.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 8.0
Assignee: Prashant Dhange
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks: 2154752
Reported: 2022-12-13 17:32 UTC by Vasishta
Modified: 2023-07-21 23:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:


Links
System: Red Hat Issue Tracker
ID: RHCEPH-5774
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2022-12-13 17:39:40 UTC

Description Vasishta 2022-12-13 17:32:48 UTC
Description of problem:
The cluster uses separate public and cluster networks.

One node had some stale CephFS mounts. After they were cleared, the node had to be rebooted to unblock ceph-volume (BZ 2152053), which was stuck on the inventory list and was in turn blocking the Ceph orchestrator from upgrading the cluster from one build to another.

After the reboot, none of the OSDs on that node came up (the services were down), with the message **state<Start>: transitioning to Stray**. The Ceph orchestrator had in fact tried to upgrade them from one build of 5.3 to 16.2.10-82.el8cp.

After waiting for a day, all Ceph daemons on that node were restarted using ceph.target.

The OSDs are now stuck/flapping with the following log messages:
+0000 7f3bb637f700 20 osd.73 633905 get_map 580410 - loading and decoding 0x56344e6f0000
Dec 13 17:22:01 f12-h09-000-1029u.rdu2.scalelab.redhat.com conmon[628485]: debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 -1 osd.73 633905 failed to load OSD map for epoch 580410, got 0 bytes
Dec 13 17:22:01 f12-h09-000-1029u.rdu2.scalelab.redhat.com conmon[628485]: debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 20 osd.73 633905 advance_pg missing map 580410
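
For triage, the log excerpt above can be scanned for the affected OSDs and the map epochs they could not load. The following is only a minimal sketch: the regular expression is derived solely from the message format quoted in this report, and the names FAILED_MAP_RE, find_missing_epochs and SAMPLE are hypothetical helpers, not part of Ceph.

```python
import re
from collections import defaultdict

# Assumption: the message format matches the lines quoted in this report:
#   "osd.<id> <current epoch> failed to load OSD map for epoch <N>, got <M> bytes"
FAILED_MAP_RE = re.compile(
    r"osd\.(?P<osd>\d+) (?P<current>\d+) failed to load OSD map "
    r"for epoch (?P<missing>\d+), got (?P<size>\d+) bytes"
)


def find_missing_epochs(lines):
    """Return {osd_id: sorted list of map epochs that failed to load}."""
    missing = defaultdict(set)
    for line in lines:
        match = FAILED_MAP_RE.search(line)
        if match:
            missing[int(match.group("osd"))].add(int(match.group("missing")))
    return {osd: sorted(epochs) for osd, epochs in missing.items()}


# Log lines copied from the description above.
SAMPLE = [
    "+0000 7f3bb637f700 20 osd.73 633905 get_map 580410 - loading and decoding 0x56344e6f0000",
    "Dec 13 17:22:01 f12-h09-000-1029u.rdu2.scalelab.redhat.com conmon[628485]: "
    "debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 -1 osd.73 633905 "
    "failed to load OSD map for epoch 580410, got 0 bytes",
    "Dec 13 17:22:01 f12-h09-000-1029u.rdu2.scalelab.redhat.com conmon[628485]: "
    "debug 2022-12-13T17:22:00.685+0000 7f3bb637f700 20 osd.73 633905 "
    "advance_pg missing map 580410",
]

if __name__ == "__main__":
    for osd, epochs in find_missing_epochs(SAMPLE).items():
        print(f"osd.{osd}: missing map epochs {epochs}")
```

The same function could be given the lines of a saved journal dump from the affected node to list every OSD that reports the error, assuming the log format does not differ from the excerpt.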


Version-Release number of selected component (if applicable):
16.2.10-82.el8cp

How reproducible:
Seen once.
After the sequence described above, a single OSD was restarted through the orchestrator; the OSD restarted but hit the same failure.

Steps to Reproduce: See the description above.

Actual results:
OSDs are not up and in

Expected results:
OSDs come up after the node reboot.

Additional info:

