Bug 2175570

Summary: [RADOS] After rebooting osd nodes, 20 SSD backed BlueStore OSDs are crashing
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: James Biao <jbiao>
Component: RADOS
Assignee: Neha Ojha <nojha>
Status: CLOSED COMPLETED
QA Contact: Pawan <pdhiran>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.3
CC: bhubbard, ceph-eng-bugs, cephqe-warriors, pdhange, rzarzyns, vumrao
Target Milestone: ---
Flags: pdhange: needinfo? (jbiao)
Target Release: 6.1z1
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-07-10 14:39:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description James Biao 2023-03-05 23:26:52 UTC
Description of problem:
After rebooting the OSD nodes, 20 of the 21 SSD-backed BlueStore OSDs are crashing, while the HDD and NVMe OSDs on the same node are running properly. The crashed OSDs cannot be started again.

2023-02-21 08:55:01.431536 7f7bf8c23d80  1 bluefs _init_alloc id 1 alloc_size 0x10000 size 0x2e932400000
2023-02-21 09:07:34.809028 7f7bf8c23d80 -1 /builddir/build/BUILD/ceph-12.2.12/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_replay(bool)' thread 7f7bf8c23d80 time 2023-02-21 09:07:34.806474
/builddir/build/BUILD/ceph-12.2.12/src/os/bluestore/BlueFS.cc: 729: FAILED assert(r == (int)more)

 ceph version 12.2.12-127.el7cp (149c9c8a16ac33a42231ce4145067d3ceec16ac7) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x562c4117e030]
 2: (BlueFS::_replay(bool)+0x2b47) [0x562c41108257]
 3: (BlueFS::mount()+0x1d4) [0x562c411096d4]
 4: (BlueStore::_open_db(bool)+0x1857) [0x562c41015677]
 5: (BlueStore::_mount(bool)+0x40e) [0x562c4104aa7e]
 6: (OSD::init()+0x3bd) [0x562c40bef73d]
 7: (main()+0x2d07) [0x562c40af2c37]
 8: (__libc_start_main()+0xf5) [0x7f7bf52dc555]
 9: (()+0x4c2b03) [0x562c40b92b03]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
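
For triage, the two timestamped lines above can be decoded with a short script. The following is only a minimal sketch, not a Ceph tool: the function names and the output keys are made up for illustration, and the regular expressions assume the log layout matches the excerpt in this report. It converts the hex alloc_size and size fields from the _init_alloc line to bytes and measures how long the OSD spent between that line and the failure in BlueFS::_replay.

```python
import re
from datetime import datetime

# The two timestamped lines copied from the OSD log in this report.
LOG = """\
2023-02-21 08:55:01.431536 7f7bf8c23d80  1 bluefs _init_alloc id 1 alloc_size 0x10000 size 0x2e932400000
2023-02-21 09:07:34.809028 7f7bf8c23d80 -1 /builddir/build/BUILD/ceph-12.2.12/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_replay(bool)' thread 7f7bf8c23d80 time 2023-02-21 09:07:34.806474
"""

# Assumed layout: every log entry starts with a microsecond timestamp.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) ")
ALLOC = re.compile(r"_init_alloc id (\d+) alloc_size (0x[0-9a-f]+) size (0x[0-9a-f]+)")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    m = TS.match(line)
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f") if m else None


def summarize(log):
    """Decode the _init_alloc fields and the time until the _replay failure."""
    lines = [line for line in log.splitlines() if parse_ts(line)]
    alloc_line = next(line for line in lines if "_init_alloc" in line)
    fail_line = next(line for line in lines if "BlueFS::_replay" in line)
    m = ALLOC.search(alloc_line)
    elapsed = parse_ts(fail_line) - parse_ts(alloc_line)
    return {
        "alloc_size_bytes": int(m.group(2), 16),
        "device_size_bytes": int(m.group(3), 16),
        "seconds_to_failure": elapsed.total_seconds(),
    }


summary = summarize(LOG)
print(summary)
```

On the excerpt above this reports a 64 KiB allocation unit, a size of roughly 3.2 TB, and about 12.5 minutes between _init_alloc and the assert in _replay.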


Version-Release number of selected component (if applicable):
RHCS 3.3

How reproducible:
Once at customer site. 

Steps to Reproduce:
1. Build a cluster with colocated SSD OSDs
2. Reboot the OSD node

Actual results:
SSD OSDs are down and keep crashing

Expected results:
SSD OSDs start properly

Additional info: