Bug 2089155 - [GSS] OSD pods are not running and the OSD daemon is crashed [NEEDINFO]
Summary: [GSS] OSD pods are not running and the OSD daemon is crashed
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Neha Ojha
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-23 06:27 UTC by Priya Pandey
Modified: 2023-08-09 16:37 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-04-05 19:18:59 UTC
Embargoed:
sheggodu: needinfo? (nojha)



Description Priya Pandey 2022-05-23 06:27:27 UTC
Description of problem (please be as detailed as possible and provide log snippets):


- The OSD pods are in CrashLoopBackOff (CLBO) state with several container restarts.


-----------------------------------------------

rook-ceph-osd-0-588b7db67b-r9hnr                                  1/2     CrashLoopBackOff       329 (2m2s ago)   5d14h   10.130.2.33      dvtslocnw03-data.nbsdev.co.uk   <none>           <none>

rook-ceph-osd-2-767b5c54f5-rdwrj                                  1/2     CrashLoopBackOff       330 (88s ago)    5d14h   10.128.4.40      dvtslocnw01-data.nbsdev.co.uk   <none>           <none>

-----------------------------------------------
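For reference, this pod status and the previous OSD container logs can be collected with commands along these lines (a minimal sketch, assuming the default openshift-storage namespace and the standard Rook label and container names):

-----------------------------------------------

oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide

oc -n openshift-storage logs rook-ceph-osd-0-588b7db67b-r9hnr -c osd --previous

-----------------------------------------------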

- The devices are attached to the nodes and there are no issues with the disks.
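One way to sanity-check a device from the node itself is via a debug pod (a sketch only; the node name is taken from the pod listing above, and device names vary per environment):

-----------------------------------------------

oc debug node/dvtslocnw03-data.nbsdev.co.uk -- chroot /host lsblk

-----------------------------------------------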


- The Ceph OSD pods crash with the following error:

-----------------------------------------------

2022-05-17T08:22:47.459126416Z debug     -3> 2022-05-17T08:22:47.416+0000 7f10dcdd6080  1 bluefs _allocate unable to allocate 0x400000 on bdev 1, allocator name block, allocator type hybrid, capacity 0x4b00000000, block size 0x1000, free 0xd8e089000, fragmentation 0.586552, allocated 0x0
2022-05-17T08:22:47.459126416Z debug     -2> 2022-05-17T08:22:47.416+0000 7f10dcdd6080 -1 bluefs _allocate allocation failed, needed 0x400000
2022-05-17T08:22:47.459139711Z debug     -1> 2022-05-17T08:22:47.424+0000 7f10dcdd6080 -1 /builddir/build/BUILD/ceph-16.2.7/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, uint64_t, uint64_t)' thread 7f10dcdd6080 time 2022-05-17T08:22:47.417513+0000
2022-05-17T08:22:47.459139711Z /builddir/build/BUILD/ceph-16.2.7/src/os/bluestore/BlueFS.cc: 2554: FAILED ceph_assert(r == 0)
2022-05-17T08:22:47.459139711Z 
2022-05-17T08:22:47.459139711Z  ceph version 16.2.7-98.el8cp (b20d33c3b301e005bed203d3cad7245da3549f80) pacific (stable)
2022-05-17T08:22:47.459139711Z  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x55cd61fc2e3c]
2022-05-17T08:22:47.459139711Z  2: ceph-osd(+0x56b056) [0x55cd61fc3056]
2022-05-17T08:22:47.459139711Z  3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x1c93) [0x55cd626c24f3]
2022-05-17T08:22:47.459139711Z  4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x322) [0x55cd626c2f22]
2022-05-17T08:22:47.459139711Z  5: (BlueRocksWritableFile::Sync()+0x6c) [0x55cd626ea79c]
2022-05-17T08:22:47.459139711Z  6: (rocksdb::LegacyWritableFileWrapper::Sync(rocksdb::IOOptions const&, rocksdb::IODebugContext*)+0x1f) [0x55cd62b83aef]
2022-05-17T08:22:47.459139711Z  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x402) [0x55cd62c95262]
2022-05-17T08:22:47.459139711Z  8: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x55cd62c968a8]
2022-05-17T08:22:47.459139711Z  9: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::FileSystem*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::vector<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> >, std::allocator<std::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::default_delete<rocksdb::FragmentedRangeTombstoneIterator> > > >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::CompressionType, unsigned long, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb::Env::WriteLifeTimeHint, unsigned long)+0x2ddb) [0x55cd62d63a0b]

-----------------------------------------------
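The assert appears to fire because BlueFS could not allocate a contiguous 0x400000 (4 MiB) extent from the main block device even though roughly 0xd8e089000 (~54 GiB) of the ~300 GiB capacity is reported free, which points at free-space fragmentation rather than a full device (note "fragmentation 0.586552" in the log). If this needs confirmation, capacity and fragmentation could be inspected along these lines (a sketch only; it assumes toolbox access and, for the offline check, that the crashed OSD is stopped and its data directory is available at the conventional path):

-----------------------------------------------

# From the toolbox pod: per-OSD capacity and usage
ceph osd df tree

# Offline, against the crashed OSD's data directory (conventional path shown)
ceph-bluestore-tool free-score --path /var/lib/ceph/osd/ceph-0 --allocator block

-----------------------------------------------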



Version of all relevant components (if applicable):

- ODF 4.10 (Ceph 16.2.7-98.el8cp, per the OSD log above)

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

- Two OSD pods are restarting, which makes the cluster unstable.
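Overall cluster health during the incident can be checked from the toolbox, assuming the rook-ceph-tools deployment is enabled in the cluster:

-----------------------------------------------

oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status

oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph health detail

-----------------------------------------------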


Is there any workaround available to the best of your knowledge?

N/A

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
N/A


Is this issue reproducible?

No, it is specific to the customer environment.

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:
N/A

Steps to Reproduce:
N/A

Actual results:

- The OSD pods are in CLBO state.

Expected results:

- The OSD pods should be running fine.

Additional info:

In the next comments

