Bug 2183996 - [GSS][RADOS] OSDs in CLBO state with error "FAILED ceph_assert(r >= 0 && r <= (int)tail_read)"
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Neha Ojha
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-03 11:21 UTC by Karun Josy
Modified: 2023-08-09 16:37 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-04-11 05:50:28 UTC
Embargoed:



Description Karun Josy 2023-04-03 11:21:59 UTC
* Description of problem (please be as detailed as possible and provide log
snippets):


+ 2 OSDs are in CLBO (CrashLoopBackOff) state, possibly due to RocksDB corruption
+ This is the assert message:
----------------------
    "assert_msg": "/builddir/build/BUILD/ceph-16.2.7/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_do_write_small(BlueStore::TransContext*, BlueStore::CollectionRef&, BlueStore::OnodeRef, uint64_t, uint64_t, ceph::buffer::v15_2_0::list::iterator&, BlueStore::WriteContext*)' thread 7f07ca3c4700 time 2023-04-02T19:06:58.679984+0000\n/builddir/build/BUILD/ceph-16.2.7/src/os/bluestore/BlueStore.cc: 13534: FAILED ceph_assert(r >= 0 && r <= (int)tail_read)\n",
----------------------

+ There is an open upstream tracker that looks similar: https://tracker.ceph.com/issues/51900
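
+ For context, here is a minimal, self-contained C++ sketch of the read-modify-write pattern this assert guards. This is an assumption for illustration, not the actual Ceph source: do_read() and the alignment math are simplified stand-ins. A small unaligned write first reads back the existing tail bytes up to the checksum-chunk boundary; the assert fires when that read-back fails, e.g. returns a negative errno after a checksum mismatch on the existing data.
----------------------
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for BlueStore's _do_read(): returns bytes read,
// or a negative errno when the read fails (e.g. checksum verification
// rejects the on-disk data).
static int do_read(uint64_t off, uint64_t len, std::vector<uint8_t>& out) {
  (void)off;
  out.assign(len, 0);  // pretend the read succeeded;
  return (int)len;     // a corrupted extent would return -EIO here instead
}

// Simplified shape of the small-write path: pad the write out to the
// blob's checksum-chunk boundary by reading back the existing tail.
static void write_small(uint64_t offset, uint64_t length, uint64_t chunk) {
  uint64_t end = offset + length;
  uint64_t tail_read = (end % chunk) ? chunk - (end % chunk) : 0;
  if (tail_read > 0) {
    std::vector<uint8_t> tail;
    int r = do_read(end, tail_read, tail);
    // The assertion from the crash: it fails when the read-back of
    // existing data returns an error (r < 0) or too many bytes.
    assert(r >= 0 && r <= (int)tail_read);
    // ... merge `tail` into the write buffer and continue ...
  }
}

int main() {
  write_small(4096, 100, 4096);  // unaligned write -> 0xF9C-byte tail read
  std::puts("small write padded; tail read-back passed the assert");
  return 0;
}
----------------------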




* Version of all relevant components (if applicable):
ODF 4.10
ceph version 16.2.7-126.el8cp 

* Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
2 out of 3 OSDs are down, PGs are inactive, and production is impacted

* Is there any workaround available to the best of your knowledge?
No

Comment 6 Prashant Dhange 2023-04-04 06:31:36 UTC
osd.0 and osd.2 are failing because they are unable to read the OSD superblock:

debug 2023-04-03T09:19:45.201+0000 7fdca2ce5080 -1 bluestore(/var/lib/ceph/osd/ceph-2) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x659b92dc, expected 0x48b54be2, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
debug 2023-04-03T09:19:45.202+0000 7fdca2ce5080 -1 bluestore(/var/lib/ceph/osd/ceph-2) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x659b92dc, expected 0x48b54be2, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
debug 2023-04-03T09:19:45.202+0000 7fdca2ce5080 -1 bluestore(/var/lib/ceph/osd/ceph-2) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x659b92dc, expected 0x48b54be2, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
debug 2023-04-03T09:19:45.203+0000 7fdca2ce5080 -1 bluestore(/var/lib/ceph/osd/ceph-2) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x659b92dc, expected 0x48b54be2, device location [0x2000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
debug 2023-04-03T09:19:45.203+0000 7fdca2ce5080 -1 osd.2 0 OSD::init() : unable to read osd superblock

It looks like the OSD superblock got corrupted, likely because multiple OSD daemons tried to access the same device.
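
For illustration, a minimal sketch of this kind of checksum verification (an assumption, not the actual _verify_csum code; the function and variable names here are made up): crc32c is recomputed over each 0x1000-byte checksum block of the data read back and compared against the value stored at write time. A mismatch means the bytes on disk are not the bytes that were written, so the read fails with -EIO, which is why OSD::init() cannot load the superblock.
----------------------
#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <vector>

// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78), the
// checksum algorithm named in the "bad crc32c/0x1000" log lines.
static uint32_t crc32c(uint32_t crc, const uint8_t* p, size_t len) {
  crc = ~crc;
  for (size_t i = 0; i < len; ++i) {
    crc ^= p[i];
    for (int k = 0; k < 8; ++k)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
  }
  return ~crc;
}

// Per-chunk verification: compare the checksum stored at write time
// against one recomputed from the data just read. csum_block mirrors
// the 0x1000 block size seen in the log.
static int verify_csum(const std::vector<uint8_t>& buf,
                       const std::vector<uint32_t>& stored,
                       size_t csum_block) {
  for (size_t i = 0; i * csum_block < buf.size(); ++i) {
    uint32_t got = crc32c(0, buf.data() + i * csum_block, csum_block);
    if (got != stored[i]) {
      std::fprintf(stderr,
                   "bad crc32c/0x%zx at blob offset 0x%zx, "
                   "got 0x%08x, expected 0x%08x\n",
                   csum_block, i * csum_block, got, stored[i]);
      return -EIO;  // the read fails, so the superblock cannot be loaded
    }
  }
  return 0;
}

int main() {
  const size_t block = 0x1000;
  std::vector<uint8_t> data(block, 0xab);
  std::vector<uint32_t> stored = { crc32c(0, data.data(), block) };
  data[42] ^= 0x01;  // simulate another daemon scribbling on the device
  return verify_csum(data, stored, block) == 0 ? 0 : 1;
}
----------------------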

