Bug 2151762

Summary: ceph osd pod failed to start up with error: rocksdb: Corruption: Bad table magic number
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Sub component: RADOS
Version: 4.10
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Reporter: jiaxl <jthunder1005>
Assignee: Neha Ojha <nojha>
QA Contact: Elad <ebenahar>
CC: bniver, muagarwa, nojha, ocs-bugs, odf-bz-bot, pdhiran, rzarzyns
Flags: sheggodu: needinfo? (nojha)
Hardware: x86_64   
OS: Linux   
Type: Bug
Last Closed: 2023-04-05 18:34:37 UTC

Description jiaxl 2022-12-08 05:19:56 UTC
Description of problem (please be as detailed as possible and provide log snippets):

After installing ODF 4.10 on OCP 4.10, one of the Ceph OSD pods kept crashing.
Logs:

debug 2022-12-07T12:18:23.249+0000 7f2a7013d080 4 rocksdb: Options.optimize_filters_for_hits: 0
debug 2022-12-07T12:18:23.249+0000 7f2a7013d080 4 rocksdb: Options.paranoid_file_checks: 0
debug 2022-12-07T12:18:23.249+0000 7f2a7013d080 4 rocksdb: Options.force_consistency_checks: 0
debug 2022-12-07T12:18:23.249+0000 7f2a7013d080 4 rocksdb: Options.report_bg_io_stats: 0
debug 2022-12-07T12:18:23.249+0000 7f2a7013d080 4 rocksdb: Options.ttl: 2592000
debug 2022-12-07T12:18:23.249+0000 7f2a7013d080 4 rocksdb: Options.periodic_compaction_seconds: 0
debug 2022-12-07T12:18:23.250+0000 7f2a7013d080 4 rocksdb: [column_family.cc:555] (skipping printing options)
debug 2022-12-07T12:18:23.250+0000 7f2a7013d080 4 rocksdb: [column_family.cc:555] (skipping printing options)
debug 2022-12-07T12:18:23.264+0000 7f2a7013d080 4 rocksdb: [db_impl/db_impl.cc:397] Shutdown: canceling all background work
debug 2022-12-07T12:18:23.265+0000 7f2a7013d080 4 rocksdb: [db_impl/db_impl.cc:573] Shutdown complete
debug 2022-12-07T12:18:23.267+0000 7f2a7013d080 -1 rocksdb: Corruption: Bad table magic number: expected 9863518390377041911, found 0 in db/000094.sst
debug 2022-12-07T12:18:23.268+0000 7f2a7013d080 -1 bluestore(/var/lib/ceph/osd/ceph-2) _open_db erroring opening db:
debug 2022-12-07T12:18:23.268+0000 7f2a7013d080 1 bluefs umount
debug 2022-12-07T12:18:23.269+0000 7f2a7013d080 1 bdev(0x557c4f81d000 /var/lib/ceph/osd/ceph-2/block) close
debug 2022-12-07T12:18:23.361+0000 7f2a7013d080 1 bdev(0x557c4f81cc00 /var/lib/ceph/osd/ceph-2/block) close
debug 2022-12-07T12:18:23.619+0000 7f2a7013d080 -1 osd.2 0 OSD:init: unable to mount object store
debug 2022-12-07T12:18:23.619+0000 7f2a7013d080 -1 ** ERROR: osd init failed: (5) Input/output error
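
For context on the error itself: the "Bad table magic number" check reads the last 8 bytes of the .sst file (the RocksDB block-based table footer) and compares them with 0x88e241b785f4cff7, which is the decimal 9863518390377041911 printed above; "found 0" means that footer came back as all zeros, i.e. the tail of db/000094.sst is blank or truncated on disk. A minimal sketch of that check in Python, assuming the file has first been copied out of BlueFS (for example with ceph-bluestore-tool bluefs-export, since BlueStore keeps the RocksDB files inside BlueFS rather than on a regular filesystem):

import struct
import sys

# RocksDB's block-based table magic number; 0x88e241b785f4cff7 is the decimal
# 9863518390377041911 that appears in the OSD log above.
EXPECTED_MAGIC = 0x88E241B785F4CFF7

def read_footer_magic(path):
    """Return the 64-bit magic stored in the last 8 bytes of an .sst file."""
    with open(path, "rb") as f:
        f.seek(-8, 2)  # the footer magic occupies the final 8 bytes, little-endian
        (magic,) = struct.unpack("<Q", f.read(8))
    return magic

if __name__ == "__main__":
    # e.g. a copy of db/000094.sst exported out of BlueFS
    magic = read_footer_magic(sys.argv[1])
    print(f"found 0x{magic:016x}, expected 0x{EXPECTED_MAGIC:016x}")
    print("magic OK" if magic == EXPECTED_MAGIC else "bad table magic (zeroed or truncated tail)")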

Version of all relevant components (if applicable):

ODF version: 4.10.8
Local Storage operator: 4.10.0
Ceph version: 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
Searching turned up the following upstream Ceph tracker issue, which appears to be the most relevant match:
https://tracker.ceph.com/issues/54547
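
For diagnosis (not a fix), a consistency check of the OSD's BlueStore data could confirm whether the corruption is really on disk. A minimal sketch, assuming the affected OSD is stopped and ceph-bluestore-tool is available inside the OSD pod or a debug container; driving it from Python is purely illustrative:

import subprocess

def bluestore_fsck(osd_path="/var/lib/ceph/osd/ceph-2"):
    """Run a BlueStore consistency check and return True if it comes back clean."""
    # The tool needs exclusive access to the OSD data, so the OSD must be stopped
    # first (in Rook/ODF: scale the corresponding rook-ceph-osd deployment to 0).
    result = subprocess.run(
        ["ceph-bluestore-tool", "fsck", "--path", osd_path],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    print(result.stderr)
    return result.returncode == 0

if __name__ == "__main__":
    print("fsck clean" if bluestore_fsck() else "fsck reported errors")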

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
From the logs, this is likely an occasional, intermittent issue rather than one that reproduces consistently.

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info: