Bug 2175570 - [RADOS] After rebooting osd nodes, 20 SSD backed BlueStore OSDs are crashing [NEEDINFO]
Summary: [RADOS] After rebooting osd nodes, 20 SSD backed BlueStore OSDs are crashing
Keywords:
Status: CLOSED COMPLETED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 3.3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 6.1z1
Assignee: Neha Ojha
QA Contact: Pawan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-03-05 23:26 UTC by James Biao
Modified: 2023-07-10 14:39 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-10 14:39:12 UTC
Embargoed:
pdhange: needinfo? (jbiao)




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-6233 0 None None None 2023-03-05 23:28:25 UTC

Description James Biao 2023-03-05 23:26:52 UTC
Description of problem:
After rebooting the OSD nodes, 20 of the 21 SSD-backed BlueStore OSDs are crashing, while the HDD and NVMe OSDs on the same node are running properly. The crashed OSDs cannot be started again.

2023-02-21 08:55:01.431536 7f7bf8c23d80  1 bluefs _init_alloc id 1 alloc_size 0x10000 size 0x2e932400000
2023-02-21 09:07:34.809028 7f7bf8c23d80 -1 /builddir/build/BUILD/ceph-12.2.12/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_replay(bool)' thread 7f7bf8c23d80 time 2023-02-21 09:07:34.806474
/builddir/build/BUILD/ceph-12.2.12/src/os/bluestore/BlueFS.cc: 729: FAILED assert(r == (int)more)

 ceph version 12.2.12-127.el7cp (149c9c8a16ac33a42231ce4145067d3ceec16ac7) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x562c4117e030]
 2: (BlueFS::_replay(bool)+0x2b47) [0x562c41108257]
 3: (BlueFS::mount()+0x1d4) [0x562c411096d4]
 4: (BlueStore::_open_db(bool)+0x1857) [0x562c41015677]
 5: (BlueStore::_mount(bool)+0x40e) [0x562c4104aa7e]
 6: (OSD::init()+0x3bd) [0x562c40bef73d]
 7: (main()+0x2d07) [0x562c40af2c37]
 8: (__libc_start_main()+0xf5) [0x7f7bf52dc555]
 9: (()+0x4c2b03) [0x562c40b92b03]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
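
For triage across several OSD logs, the two relevant lines in the excerpt above (the `bluefs _init_alloc` line and the `FAILED assert` line) can be pulled out mechanically. The following is only a minimal sketch: the regular expressions are assumptions derived from the two log shapes shown in this report, and `summarize_osd_log` is a hypothetical helper name, not part of Ceph.

```python
import re

# Assumed pattern: matches lines shaped like the "bluefs _init_alloc" line above.
INIT_ALLOC_RE = re.compile(
    r"bluefs _init_alloc id (?P<id>\d+) "
    r"alloc_size (?P<alloc>0x[0-9a-fA-F]+) size (?P<size>0x[0-9a-fA-F]+)"
)

# Assumed pattern: matches lines shaped like "<file>: <line>: FAILED assert(<expr>)".
ASSERT_RE = re.compile(
    r"(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<expr>.*)\)\s*$"
)


def summarize_osd_log(lines):
    """Return (devices, asserts) found in an iterable of OSD log lines.

    devices: list of dicts with the BlueFS device id, alloc_size and size (bytes).
    asserts: list of dicts with the source file, line number and asserted expression.
    """
    devices = []
    asserts = []
    for line in lines:
        m = INIT_ALLOC_RE.search(line)
        if m:
            devices.append({
                "id": int(m.group("id")),
                "alloc_size": int(m.group("alloc"), 16),  # hex string to bytes
                "size": int(m.group("size"), 16),
            })
            continue
        m = ASSERT_RE.search(line)
        if m:
            asserts.append({
                "file": m.group("file"),
                "line": int(m.group("line")),
                "expr": m.group("expr"),
            })
    return devices, asserts
```

Run against the excerpt above, this would report one BlueFS device of 0x2e932400000 bytes (about 3.2 TB) with a 0x10000-byte allocation unit, and one failed assertion at BlueFS.cc line 729.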


Version-Release number of selected component (if applicable):
RHCS 3.3

How reproducible:
Once at customer site. 

Steps to Reproduce:
1. Build a cluster with colocated SSD OSDs.
2. Reboot the OSD node.

Actual results:
SSD OSDs are down and keep crashing.

Expected results:
SSD OSDs start properly.

Additional info:

