Bug 2183996

Summary: [GSS][RADOS] OSDs in CLBO state with error "FAILED ceph_assert(r >= 0 && r <= (int)tail_read)"
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Sub component: RADOS
Reporter: Karun Josy <kjosy>
Assignee: Neha Ojha <nojha>
QA Contact: Elad <ebenahar>
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Version: 4.10
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2023-04-11 05:50:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
CC: akupczyk, bhubbard, bkunal, bniver, hnallurv, kelwhite, muagarwa, nojha, ocs-bugs, odf-bz-bot, pdhange, sostapov

Description Karun Josy 2023-04-03 11:21:59 UTC
* Description of problem (please be as detailed as possible and provide log
snippets):


+ 2 OSDs are in CLBO (CrashLoopBackOff) state, possibly due to RocksDB corruption
+ This is the assert message:
----------------------
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.7/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_do_write_small(BlueStore::TransContext*, BlueStore::CollectionRef&, BlueStore::OnodeRef, uint64_t, uint64_t, ceph::buffer::v15_2_0::list::iterator&, BlueStore::WriteContext*)' thread 7f07ca3c4700 time 2023-04-02T19:06:58.679984+0000\n/builddir/build/BUILD/ceph-16.2.7/src/os/bluestore/BlueStore.cc: 13534: FAILED ceph_assert(r >= 0 && r <= (int)tail_read)\n",
----------------------
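
For context on what the assert checks: in the small-write path, BlueStore reads back the tail of an existing blob and expects that read to succeed and to return no more than the number of bytes requested; a failed read (for example, one caused by a checksum error on the underlying data) trips the assert. Below is a minimal, self-contained sketch of that invariant only; the function and variable names are illustrative stand-ins, not the actual BlueStore code.
----------------------
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-in for the blob/device read done by the small-write
// path; returns the number of bytes read, or a negative error code (which
// is typically what a read/checksum failure on the underlying data would
// surface as).
static int read_tail(uint64_t offset, uint64_t tail_read, std::vector<char>& buf) {
    (void)offset;                       // ignored in this sketch
    buf.assign(tail_read, 0);           // pretend the read succeeded in full
    return static_cast<int>(tail_read);
}

// Sketch of the invariant behind the failed assert: the tail read must not
// error out and must not return more bytes than were requested.
static void do_write_small_sketch(uint64_t offset, uint64_t tail_read) {
    std::vector<char> buf;
    int r = read_tail(offset, tail_read, buf);
    assert(r >= 0 && r <= (int)tail_read);   // the check that fired at BlueStore.cc:13534
}

int main() {
    do_write_small_sketch(/*offset=*/0, /*tail_read=*/0x1000);
    return 0;
}
----------------------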

+ There is an open upstream tracker that looks similar: https://tracker.ceph.com/issues/51900




* Version of all relevant components (if applicable):
ODF 4.10
ceph version 16.2.7-126.el8cp 

* Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
2 out of 3 OSDs are down, PGs inactive, production impacted

* Is there any workaround available to the best of your knowledge?
No

Comment 6 Prashant Dhange 2023-04-04 06:31:36 UTC
osd.0 and osd.2 are failing because they are unable to read the OSD superblock:

debug 2023-04-03T09:19:45.201+0000 7fdca2ce5080 -1 bluestore(/var/lib/ceph/osd/ceph-2) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x659b92dc, expected 0x48b54be2, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
debug 2023-04-03T09:19:45.202+0000 7fdca2ce5080 -1 bluestore(/var/lib/ceph/osd/ceph-2) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x659b92dc, expected 0x48b54be2, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
debug 2023-04-03T09:19:45.202+0000 7fdca2ce5080 -1 bluestore(/var/lib/ceph/osd/ceph-2) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x659b92dc, expected 0x48b54be2, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
debug 2023-04-03T09:19:45.203+0000 7fdca2ce5080 -1 bluestore(/var/lib/ceph/osd/ceph-2) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x659b92dc, expected 0x48b54be2, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
debug 2023-04-03T09:19:45.203+0000 7fdca2ce5080 -1 osd.2 0 OSD::init() : unable to read osd superblock

It looks like the OSD superblock got corrupted, most likely because multiple OSD daemons tried to access the same device.
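
As a rough illustration of why a stray write from a second daemon shows up this way: BlueStore keeps a crc32c checksum per csum block (0x1000 bytes in the log above) and compares it against the checksum of the data it actually reads back, so any out-of-band overwrite of that 4 KiB region makes the stored and computed values disagree. The sketch below shows that kind of per-block verification in a self-contained form; it is not the actual _verify_csum code, and the crc32c routine is a plain bitwise implementation written only for this example.
----------------------
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78) -- the same
// checksum family named in the log lines; implemented here only for the example.
static uint32_t crc32c(const uint8_t* data, size_t len, uint32_t crc = 0) {
    crc = ~crc;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Per-block verification: compare the checksum stored in the blob metadata
// against the checksum of the block that was just read from the device.
static bool verify_csum_block(const std::vector<uint8_t>& block, uint32_t expected) {
    uint32_t got = crc32c(block.data(), block.size());
    if (got != expected) {
        std::fprintf(stderr, "bad crc32c/0x%zx checksum, got 0x%08x, expected 0x%08x\n",
                     block.size(), got, expected);
        return false;
    }
    return true;
}

int main() {
    std::vector<uint8_t> block(0x1000, 0);      // a clean 4 KiB csum block
    uint32_t stored = crc32c(block.data(), block.size());
    block[0] ^= 0xff;                           // simulate an out-of-band overwrite
    verify_csum_block(block, stored);           // now reports a mismatch, as in the logs
    return 0;
}
----------------------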