Description of problem:
The progress section in the ceph status output stays stuck indefinitely.

Version-Release number of selected component (if applicable):
14.2.8-35.el8cp

How reproducible:
Also seen in an earlier Nautilus build (built for downstream 4.0).

Steps followed:
1. Upgraded from Luminous to Nautilus (registry.access.redhat.com/rhceph/rhceph-3-rhel7 to ceph-4.1-rhel-8-containers-candidate-37018-20200413024316) using ceph-ansible.
2. Migrated FileStore (FS) OSDs to BlueStore (BS).

Actual results:
  data:
    pools:   8 pools, 560 pgs
    objects: 50.24k objects, 117 GiB
    usage:   2.4 TiB used, 18 TiB / 20 TiB avail
    pgs:     560 active+clean

  progress:
    Rebalancing after osd.16 marked out
      [========================......]
    Rebalancing after osd.20 marked out
      [==================............]
    Rebalancing after osd.27 marked out
      [=======================.......]
    Rebalancing after osd.24 marked out
      [============================..]
    Rebalancing after osd.26 marked out
      [========================......]
    Rebalancing after osd.18 marked out
      [===================...........]
    Rebalancing after osd.14 marked out
      [==========================....]
    Rebalancing after osd.22 marked out
      [=================.............]
    Rebalancing after osd.12 marked out
      [======================........]
    Rebalancing after osd.28 marked out
      [=====================.........]
    Rebalancing after osd.10 marked out
      [===========================...]

Expected results:
The progress section must be updated and accurate.

Additional info:
Please let us know if any log is needed in particular.
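For anyone trying to confirm the same symptom on another cluster, here is a rough sketch of a check that flags it from saved ceph status text. It is not part of Ceph: the function name stale_progress_events and the SAMPLE text are made up for illustration, and it assumes the plain-text layout shown above, with progress: as the last section of the output. It treats any progress event still listed while every PG is active+clean as stale.

```python
import re


def stale_progress_events(status_text):
    """Return progress event titles that remain while all PGs are active+clean.

    Hypothetical helper: assumes the plain-text `ceph status` layout from this
    report and that `progress:` is the last section in the output.
    """
    total = re.search(r"pools:\s+\d+ pools, (\d+) pgs", status_text)
    clean = re.search(r"^\s*pgs:\s+(\d+) active\+clean\s*$",
                      status_text, re.MULTILINE)
    # If not every PG is active+clean, rebalancing may still be running,
    # so the listed events are not considered stale.
    if not total or not clean or total.group(1) != clean.group(1):
        return []

    events = []
    in_progress = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped == "progress:":
            in_progress = True
            continue
        # Inside the progress section, skip blank lines and the bar lines.
        if in_progress and stripped and not stripped.startswith("["):
            events.append(stripped)
    return events


# Shortened copy of the output from this report, for illustration only.
SAMPLE = """\
  data:
    pools:   8 pools, 560 pgs
    objects: 50.24k objects, 117 GiB
    usage:   2.4 TiB used, 18 TiB / 20 TiB avail
    pgs:     560 active+clean

  progress:
    Rebalancing after osd.16 marked out
      [========================......]
    Rebalancing after osd.20 marked out
      [==================............]
"""

if __name__ == "__main__":
    for title in stale_progress_events(SAMPLE):
        print("stale:", title)
```

A non-empty result only says the display is inconsistent with the PG state. It does not say why the events were never completed.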
Adding NEEDINFO for QE to reproduce.
Solution that seems to fix the problem:
- Recreated the problem with https://github.com/ceph/ceph/tree/v14.2.8
- Tested by marking 2 of 3 OSDs out. The result suggests that the progress bar gets stuck forever (waited 5 mins).
- Applied my patch, which is everything up to https://github.com/ceph/ceph/commit/93d4d9d7044e991a7bbdb70b0aef02284e6eda22#diff-e6c8e5b8f137e32891a6ad184d076415
- The problem seems to be fixed.

This is the list of commits I am trying to backport to Nautilus before patching downstream, to prevent rebasing issues:
93d4d9d7044e991a7bbdb70b0aef02284e6eda22
901a37f436143a2525d6063f64942019cc888229
2046c25362a69b1ed2c0009e9ef6a944f0d9e621
dd2c3f66a1dbd9582b7cd695efff66317b730c8a
d37e8a4d84d873b7df264c63077805be8618ad7a
f618e56c93ad82a20ab844fddc3d2ded42f2a48e
21e1caba6df9d591ebff54939d020ce0a3e57efe
https://github.com/ceph/ceph/pull/37589
This is the pull request for the Nautilus backport.
(In reply to ksirivad from comment #16)
> https://github.com/ceph/ceph/pull/37589
> This is the pull request for back porting Nautilus

Excellent, how about a devel-ack then?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix Update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2445