Bug 2037279 - [Azure] OSDs go into CLBO state while mounting an RBD PVC
Summary: [Azure] OSDs go into CLBO state while mounting an RBD PVC
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.10.0
Assignee: Subham Rai
QA Contact: Rachael
URL:
Whiteboard:
Depends On:
Blocks: 2041964 2043179 2043185 2043186
 
Reported: 2022-01-05 11:18 UTC by Rachael
Modified: 2023-08-09 17:03 UTC (History)
16 users

Fixed In Version: 4.10.0-115
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2041964 2043179 2043185 2043186
Environment:
Last Closed: 2022-04-13 18:51:24 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github rook rook pull 9582 0 None Merged osd: correct bluestore compression min blob size for ssd 2022-01-18 16:03:52 UTC
Red Hat Product Errata RHSA-2022:1372 0 None None None 2022-04-13 18:51:43 UTC

Description Rachael 2022-01-05 11:18:15 UTC
Description of problem (please be as detailed as possible and provide log snippets):

The test linked below uses a single CephBlockPool with replica 2 and compression enabled, shared by two StorageClasses. The test creates PVCs from these StorageClasses, mounts them on app pods, and runs IO on them. While running this test, the OSD pods start going into CrashLoopBackOff state.

https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/storageclass/test_create_2_sc_with_1_pool_comp_rep2.py

$ oc get pods |grep rook-ceph-osd-
rook-ceph-osd-0-6fb4bf9944-7snwv                                  2/2     Running            6 (4m ago)     53m
rook-ceph-osd-1-84df69484f-29m5f                                  1/2     CrashLoopBackOff   5 (115s ago)   52m
rook-ceph-osd-2-6c5d8987d7-hmk6z                                  1/2     CrashLoopBackOff   5 (48s ago)    52m

This leaves the cluster in an unhealthy state. The OSD states cycle through Running -> Error -> CrashLoopBackOff. So far, this behavior has been seen only on clusters on the Azure platform, for both upgraded and new deployments. The same test was run on an AWS cluster with the same builds, but the issue was not observed there.
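For reference, a minimal way to observe the restart loop and capture the crashing container's log (a sketch: `openshift-storage` is the default ODF namespace, `app=rook-ceph-osd` is the label Rook puts on OSD pods, and the pod name below is taken from the output above):

```shell
# Watch the OSD pods cycle through Running -> Error -> CrashLoopBackOff
oc get pods -n openshift-storage -l app=rook-ceph-osd -w

# Pull the log of the previously crashed container from one of the CLBO pods
oc logs -n openshift-storage rook-ceph-osd-1-84df69484f-29m5f -c osd --previous
```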

Version of all relevant components (if applicable):
OCP: 4.10.0-0.nightly-2021-12-23-153012
ODF: 4.9.1-258.ci, 4.9.1-257.ci, 4.9.0-251.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, the cluster is left in an unhealthy state.

Is there any workaround available to the best of your knowledge?
Not that I am aware of

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:

Yes, there have been successful runs of this test in previous releases. The most recent successful run was with the following versions:
ODF: 4.9.0-251.ci
OCP: 4.9.0-0.nightly-2021-12-08-092803


Steps to Reproduce:
------------------

OCS-CI test: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/storageclass/test_create_2_sc_with_1_pool_comp_rep2.py

The test above consists of the following steps:
1. Create two StorageClasses that share a single RBD pool
2. Create PVCs using the new StorageClasses
3. Mount the PVCs on app pods


Actual results:
---------------
After the PVCs were created and mounted on the app pods, the OSDs started going into CrashLoopBackOff state, which left the cluster unhealthy.


Expected results:
-----------------

The test should pass and the cluster should be in a healthy state throughout the test

Comment 4 Travis Nielsen 2022-01-05 21:49:03 UTC
The osd.2 log [1] shows the following BlueStore assert. Neha, who could take a look at this?

2021-12-28T18:56:21.713686028Z debug 2021-12-28T18:56:21.712+0000 7f4e5efec700  1 osd.2 pg_epoch: 475 pg[2.8( v 443'248574 (290'246134,443'248574] local-lis/les=442/443 n=527 ec=12/12 lis/c=462/239 les/c/f=463/240/0 sis=475) [1,2,0] r=1 lpr=475 pi=[239,475)/4 crt=443'248574 lcod 0'0 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2021-12-28T18:56:22.766209808Z /builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/HybridAllocator.cc: In function 'virtual int64_t HybridAllocator::allocate(uint64_t, uint64_t, uint64_t, int64_t, PExtentVector*)' thread 7f4e5ffee700 time 2021-12-28T18:56:22.766157+0000
2021-12-28T18:56:22.766209808Z /builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/HybridAllocator.cc: 31: FAILED ceph_assert(want % unit == 0)
2021-12-28T18:56:22.766415811Z /builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/HybridAllocator.cc: In function 'virtual int64_t HybridAllocator::allocate(uint64_t, uint64_t, uint64_t, int64_t, PExtentVector*)' thread 7f4e5b7e5700 time 2021-12-28T18:56:22.766379+0000
2021-12-28T18:56:22.766415811Z /builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/HybridAllocator.cc: 31: FAILED ceph_assert(want % unit == 0)
2021-12-28T18:56:22.767038720Z /builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/HybridAllocator.cc: In function 'virtual int64_t HybridAllocator::allocate(uint64_t, uint64_t, uint64_t, int64_t, PExtentVector*)' thread 7f4e5dfea700 time 2021-12-28T18:56:22.766990+0000
2021-12-28T18:56:22.767038720Z /builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/HybridAllocator.cc: 31: FAILED ceph_assert(want % unit == 0)
2021-12-28T18:56:22.767359025Z /builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/HybridAllocator.cc: In function 'virtual int64_t HybridAllocator::allocate(uint64_t, uint64_t, uint64_t, int64_t, PExtentVector*)' thread 7f4e607ef700 time 2021-12-28T18:56:22.767327+0000
2021-12-28T18:56:22.767359025Z /builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/HybridAllocator.cc: 31: FAILED ceph_assert(want % unit == 0)
2021-12-28T18:56:22.772643302Z  ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
2021-12-28T18:56:22.772643302Z  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x5639c40de27a]
2021-12-28T18:56:22.772643302Z  2: ceph-osd(+0x567494) [0x5639c40de494]
2021-12-28T18:56:22.772643302Z  3: (HybridAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)5, bluestore_pextent_t> >*)+0x97f) [0x5639c481663f]
2021-12-28T18:56:22.772643302Z  4: (BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)+0x70e) [0x5639c46e8a8e]
2021-12-28T18:56:22.772643302Z  5: (BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0x694) [0x5639c4721b04]
2021-12-28T18:56:22.772643302Z  6: (BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0xd1) [0x5639c4722c71]
2021-12-28T18:56:22.772643302Z  7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x2077) [0x5639c4726cb7]
2021-12-28T18:56:22.772643302Z  8: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x5639c4727ff6]
2021-12-28T18:56:22.772643302Z  9: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x58) [0x5639c4384058]
2021-12-28T18:56:22.772643302Z  10: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xc9d) [0x5639c456dbdd]
2021-12-28T18:56:22.772643302Z  11: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xce0) [0x5639c42ef570]
2021-12-28T18:56:22.772643302Z  12: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d) [0x5639c434b45d]
2021-12-28T18:56:22.772643302Z  13: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2de2) [0x5639c4353fd2]
2021-12-28T18:56:22.772643302Z  14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xd1c) [0x5639c435b31c]
2021-12-28T18:56:22.772643302Z  15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5639c41e4789]
2021-12-28T18:56:22.772643302Z  16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5639c4441968]
2021-12-28T18:56:22.772643302Z  17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x5639c4204738]
2021-12-28T18:56:22.772643302Z  18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5639c486bc74]
2021-12-28T18:56:22.772643302Z  19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5639c486e914]
2021-12-28T18:56:22.772643302Z  20: /lib64/libpthread.so.0(+0x817a) [0x7f4e838e317a]
2021-12-28T18:56:22.772643302Z  21: clone()
2021-12-28T18:56:22.772676103Z debug 2021-12-28T18:56:22.771+0000 7f4e5ffee700 -1 /builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/HybridAllocator.cc: In function 'virtual int64_t HybridAllocator::allocate(uint64_t, uint64_t, uint64_t, int64_t, PExtentVector*)' thread 7f4e607ef700 time 2021-12-28T18:56:22.767327+0000
2021-12-28T18:56:22.772676103Z /builddir/build/BUILD/ceph-16.2.0/src/os/bluestore/HybridAllocator.cc: 31: FAILED ceph_assert(want % unit == 0)
2021-12-28T18:56:22.772676103Z 
2021-12-28T18:56:22.772676103Z  ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
2021-12-28T18:56:22.772676103Z  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x5639c40de27a]
2021-12-28T18:56:22.772676103Z  2: ceph-osd(+0x567494) [0x5639c40de494]
2021-12-28T18:56:22.772676103Z  3: (HybridAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)5, bluestore_pextent_t> >*)+0x97f) [0x5639c481663f]
2021-12-28T18:56:22.772676103Z  4: (BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)+0x70e) [0x5639c46e8a8e]
2021-12-28T18:56:22.772676103Z  5: (BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0x694) [0x5639c4721b04]
2021-12-28T18:56:22.772676103Z  6: (BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0xd1) [0x5639c4722c71]
2021-12-28T18:56:22.772676103Z  7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x2077) [0x5639c4726cb7]
2021-12-28T18:56:22.772676103Z  8: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x5639c4727ff6]
2021-12-28T18:56:22.772676103Z  9: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x58) [0x5639c4384058]
2021-12-28T18:56:22.772676103Z  10: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xc9d) [0x5639c456dbdd]
2021-12-28T18:56:22.772676103Z  11: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xce0) [0x5639c42ef570]
2021-12-28T18:56:22.772676103Z  12: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d) [0x5639c434b45d]
2021-12-28T18:56:22.772676103Z  13: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2de2) [0x5639c4353fd2]
2021-12-28T18:56:22.772676103Z  14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xd1c) [0x5639c435b31c]
2021-12-28T18:56:22.772676103Z  15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5639c41e4789]
2021-12-28T18:56:22.772676103Z  16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5639c4441968]
2021-12-28T18:56:22.772676103Z  17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x5639c4204738]
2021-12-28T18:56:22.772676103Z  18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5639c486bc74]
2021-12-28T18:56:22.772676103Z  19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5639c486e914]
2021-12-28T18:56:22.772676103Z  20: /lib64/libpthread.so.0(+0x817a) [0x7f4e838e317a]
2021-12-28T18:56:22.772676103Z  21: clone()
2021-12-28T18:56:22.772676103Z 
2021-12-28T18:56:22.772692403Z *** Caught signal (Aborted) **
2021-12-28T18:56:22.772692403Z  in thread 7f4e5ffee700 thread_name:tp_osd_tp
2021-12-28T18:56:22.779175698Z  ceph version 16.2.0-146.el8cp (56f5e9cfe88a08b6899327eca5166ca1c4a392aa) pacific (stable)
2021-12-28T18:56:22.779175698Z  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x5639c40de27a]
2021-12-28T18:56:22.779175698Z  2: ceph-osd(+0x567494) [0x5639c40de494]
2021-12-28T18:56:22.779175698Z  3: (HybridAllocator::allocate(unsigned long, unsigned long, unsigned long, long, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)5, bluestore_pextent_t> >*)+0x97f) [0x5639c481663f]
2021-12-28T18:56:22.779175698Z  4: (BlueStore::_do_alloc_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*)+0x70e) [0x5639c46e8a8e]
2021-12-28T18:56:22.779175698Z  5: (BlueStore::_do_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0x694) [0x5639c4721b04]
2021-12-28T18:56:22.779175698Z  6: (BlueStore::_write(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&, unsigned int)+0xd1) [0x5639c4722c71]
2021-12-28T18:56:22.779175698Z  7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x2077) [0x5639c4726cb7]
2021-12-28T18:56:22.779175698Z  8: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x5639c4727ff6]
2021-12-28T18:56:22.779175698Z  9: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x58) [0x5639c4384058]
2021-12-28T18:56:22.779175698Z  10: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xc9d) [0x5639c456dbdd]
2021-12-28T18:56:22.779175698Z  11: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xce0) [0x5639c42ef570]
2021-12-28T18:56:22.779175698Z  12: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x115d) [0x5639c434b45d]
2021-12-28T18:56:22.779175698Z  13: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2de2) [0x5639c4353fd2]
2021-12-28T18:56:22.779175698Z  14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xd1c) [0x5639c435b31c]
2021-12-28T18:56:22.779175698Z  15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5639c41e4789]
2021-12-28T18:56:22.779175698Z  16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5639c4441968]
2021-12-28T18:56:22.779175698Z  17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x5639c4204738]
2021-12-28T18:56:22.779175698Z  18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5639c486bc74]
2021-12-28T18:56:22.779175698Z  19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5639c486e914]
2021-12-28T18:56:22.779175698Z  20: /lib64/libpthread.so.0(+0x817a) [0x7f4e838e317a]
2021-12-28T18:56:22.779175698Z  21: clone()
2021-12-28T18:56:22.779214398Z debug 


[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-067zi3c33-uo/j-067zi3c33-uo_20211228T093525/logs/failed_testcase_ocs_logs_1640694899/test_multiple_sc_one_pool_rep2_comp_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-71a98052bdd86ccc10967af82739b04d6b2305eb6721cb0744470de2c284fd15/namespaces/openshift-storage/pods/rook-ceph-osd-2-c95644c47-sz89g/osd/osd/logs/current.log

Comment 5 Neha Ojha 2022-01-06 00:00:57 UTC
(In reply to Travis Nielsen from comment #4)
> The OSD.2 log [1] shows the following bluestore assert. Neha, who could take
> a look?

This assert does not ring a bell for me, but @akupczyk is the best person to take a look at it. How reproducible is this? Can we capture logs with higher debug levels, like debug_bluestore=20?

> [full stack trace quoted from comment #4 trimmed; see above]

Comment 10 Sébastien Han 2022-01-11 16:20:16 UTC
Rachael, please go into the toolbox and run

ceph config set osd debug_bluestore 20

Thanks
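A sketch of the full sequence for anyone collecting these logs (`rook-ceph-tools` is the usual toolbox deployment name; adjust if yours differs):

```shell
# Open a shell in the toolbox pod
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Raise BlueStore debug logging on all OSDs
ceph config set osd debug_bluestore 20

# ... reproduce the crash and collect the OSD pod logs ...

# Revert afterwards (level 20 logging is very verbose)
ceph config rm osd debug_bluestore
```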

Comment 13 Mudit Agarwal 2022-01-14 10:24:14 UTC
Rachael, please try again and reopen if you still see the issue.

Comment 15 Mudit Agarwal 2022-01-17 09:24:58 UTC
Yeah, it looks like we need a fix in Rook.
Travis, please take a look.

Also, I can see that the same value is present in 4.6/4.7/4.8 as well, so this is not a regression. Probably something we never tested?
https://github.com/red-hat-storage/rook/blob/d1788e1616734109b104bfae687833a7ee3ca747/pkg/operator/ceph/cluster/osd/spec.go#L158

Comment 16 Subham Rai 2022-01-17 09:40:55 UTC
(In reply to Mudit Agarwal from comment #15)
> Yeah, looks like we need a fix in rook.
> Travis, please take  a look.
> 
> Also, I can see that the same is present in 4.6/4.7/4.8 as well so this is
> not a regression. Probably something which we never tested?
> https://github.com/red-hat-storage/rook/blob/
> d1788e1616734109b104bfae687833a7ee3ca747/pkg/operator/ceph/cluster/osd/spec.
> go#L158

I think it was me who introduced this typo while making changes in Rook a long time back. I'll change it to 8192. Assigning it to myself.
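For context on why a wrong value trips the assert: the allocator requires the requested size to be a multiple of the allocation unit, which is exactly the `ceph_assert(want % unit == 0)` seen in the OSD logs. A quick sketch (the 4096-byte unit and the transposed-digit value 8912 are illustrative assumptions, not taken from the logs):

```shell
# Hypothetical values for illustration: unit = a 4K allocation size,
# and a min blob size with a transposed-digit typo (8912 instead of 8192).
unit=4096
echo $((8192 % unit))   # 0   -> aligned, allocation proceeds
echo $((8912 % unit))   # 720 -> not aligned, would trip the assert
```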

Comment 23 Travis Nielsen 2022-01-24 16:38:14 UTC
Already in the latest sync of 4.10.

Comment 29 errata-xmlrpc 2022-04-13 18:51:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372

