Description of problem (please be as detailed as possible and provide log
snippets):
- The OSD pods are in CrashLoopBackOff (CLBO) state, with several container restarts.
-----------------------------------------------
rook-ceph-osd-0-588b7db67b-r9hnr 1/2 CrashLoopBackOff 329 (2m2s ago) 5d14h 10.130.2.33 dvtslocnw03-data.nbsdev.co.uk <none> <none>
rook-ceph-osd-2-767b5c54f5-rdwrj 1/2 CrashLoopBackOff 330 (88s ago) 5d14h 10.128.4.40 dvtslocnw01-data.nbsdev.co.uk <none> <none>
-----------------------------------------------
- The devices are attached to the nodes and there are no issues with the disks.
- The Ceph OSD pods crash with the error below (the hex figures from the first log line are decoded in the sketch after the snippet):
-----------------------------------------------
2022-05-17T08:22:47.459126416Z debug -3> 2022-05-17T08:22:47.416+0000 7f10dcdd6080 1 bluefs _allocate unable to allocate 0x400000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x4b00000000, block size 0x1000, free 0xd8e089000, fragmentation 0.586552, allocated 0x0
2022-05-17T08:22:47.459126416Z debug -2> 2022-05-17T08:22:47.416+0000 7f10dcdd6080 -1 bluefs _allocate allocation failed, needed 0x400000
2022-05-17T08:22:47.459139711Z debug -1> 2022-05-17T08:22:47.424+0000 7f10dcdd6080 -1 /builddir/build/BUILD/ceph-16.2.7/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7f10dcdd6080 time 2022-05-17T08:22:47.417513+0000
2022-05-17T08:22:47.459139711Z /builddir/build/BUILD/ceph-16.2.7/src/os/bluestore/BlueFS.cc: 2554: FAILED ceph_assert(r == 0)
2022-05-17T08:22:47.459139711Z
2022-05-17T08:22:47.459139711Z ceph version 16.2.7-98.el8cp (b20d33c3b301e005bed203d3cad7245da3549f80) pacific (stable)
2022-05-17T08:22:47.459139711Z 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x55cd61fc2e3c]
2022-05-17T08:22:47.459139711Z 2: ceph-osd(+0x56b056) [0x55cd61fc3056]
2022-05-17T08:22:47.459139711Z 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x1c93) [0x55cd626c24f3]
2022-05-17T08:22:47.459139711Z 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x322) [0x55cd626c2f22]
2022-05-17T08:22:47.459139711Z 5: (BlueRocksWritableFile::Sync()+0x6c) [0x55cd626ea79c]
2022-05-17T08:22:47.459139711Z 6: (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55cd62b83aef]
2022-05-17T08:22:47.459139711Z 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x402) [0x55cd62c95262]
2022-05-17T08:22:47.459139711Z 8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x55cd62c968a8]
2022-05-17T08:22:47.459139711Z 9: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0x2ddb) [0x55cd62d63a0b]
-----------------------------------------------
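- For quick reference, below is a minimal sketch (plain Python, written for this report, not taken from the cluster) that decodes the hex figures from the "bluefs _allocate" line above into human-readable sizes; the helper name "human" is illustrative only, and the values are copied verbatim from the log:
-----------------------------------------------
# Decode the bluefs _allocate figures from the -3> log line above.
# The numbers are taken verbatim from the log; the conversion is plain arithmetic.

def human(n: float) -> str:
    """Format a byte count as B/KiB/MiB/GiB/TiB for readability."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.2f} {unit}"
        n /= 1024

needed        = 0x400000        # allocation request that failed (4 MiB)
capacity      = 0x4b00000000    # bdev 1 capacity (~300 GiB)
block_size    = 0x1000          # allocator block size (4 KiB)
free          = 0xd8e089000     # free space reported by the allocator (~54 GiB)
fragmentation = 0.586552        # fragmentation score reported by the hybrid allocator

print(f"needed:        {human(needed)}")
print(f"capacity:      {human(capacity)}")
print(f"free:          {human(free)} ({free / capacity:.1%} of capacity)")
print(f"block size:    {human(block_size)}")
print(f"fragmentation: {fragmentation}")
-----------------------------------------------
- In other words, the allocator still reports roughly 54 GiB free out of ~300 GiB but returns "allocated 0x0" for a 4 MiB request, so the assert in BlueFS::_flush_and_sync_log appears to be triggered by the failed BlueFS allocation rather than by a simply full device.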
Version of all relevant components (if applicable):
- v4.10
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
- Two OSD pods are continuously restarting, which makes the cluster unstable.
Is there any workaround available to the best of your knowledge?
N/A
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
N/A
Is this issue reproducible?
No, this is specific to the customer environment.
Can this issue be reproduced from the UI?
N/A
If this is a regression, please provide more details to justify this:
N/A
Steps to Reproduce:
N/A
Actual results:
- The OSD pods are in CLBO state.
Expected results:
- The OSD pods should be running without crashes or restarts.
Additional info:
In the next comments