Bug 1826224

Summary: progress section in ceph status stuck for indefinite time
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: RADOS
Assignee: Kamoltat (Junior) Sirivadhna <ksirivad>
Status: CLOSED ERRATA
QA Contact: Pawan <pdhiran>
Severity: low
Docs Contact: Amrita <asakthiv>
Priority: low
Version: 4.1
CC: akupczyk, asakthiv, bhubbard, ceph-eng-bugs, dzafman, jbiao, kchai, ksirivad, nojha, pdhiran, rzarzyns, sseshasa, tserlin
Target Release: 4.2z2
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-14.2.11-157.el8cp, ceph-14.2.11-157.el7cp
Doc Type: Bug Fix
Doc Text:
.The Progress module is no longer stuck for an indefinite time
Previously, progress events in the Ceph status output were stuck for an indefinite time. This was caused by the Progress module checking the PG state too early, before syncing with the epoch of the OSDMap. With this release, progress events advance and complete as expected.
Last Closed: 2021-06-15 17:13:06 UTC
Type: Bug
Bug Blocks: 1890121    

Description Vasishta 2020-04-21 09:21:15 UTC
Description of problem:
Progress section in ceph status is stuck for an indefinite time.

Version-Release number of selected component (if applicable):
14.2.8-35.el8cp

How reproducible:
Also encountered in an earlier version of Nautilus (built for downstream 4.0).

Steps followed:
1. Upgraded from luminous to nautilus (registry.access.redhat.com/rhceph/rhceph-3-rhel7 to ceph-4.1-rhel-8-containers-candidate-37018-20200413024316) using ceph-ansible.
2. Migrated FileStore OSDs to BlueStore (see the sketch below for a minimal way to trigger the same events).
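
For reference, marking a few OSDs out is enough to create the same "Rebalancing after osd.N marked out" events; this is only an illustrative sketch assuming a test cluster reachable through the ceph CLI with admin access, not the exact commands we ran:

import subprocess

def mark_out(osd_ids):
    # Each 'ceph osd out N' starts a "Rebalancing after osd.N marked out"
    # event in the mgr progress module.
    for osd in osd_ids:
        subprocess.run(["ceph", "osd", "out", str(osd)], check=True)

def show_status():
    # Print `ceph status`, including the progress section.
    result = subprocess.run(["ceph", "status"],
                            check=True, capture_output=True, text=True)
    print(result.stdout)

mark_out([10, 12])   # with this bug, these events can hang indefinitely
show_status()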


Actual results:
 data:
    pools:   8 pools, 560 pgs
    objects: 50.24k objects, 117 GiB
    usage:   2.4 TiB used, 18 TiB / 20 TiB avail
    pgs:     560 active+clean
 
  progress:
    Rebalancing after osd.16 marked out
      [========================......]
    Rebalancing after osd.20 marked out
      [==================............]
    Rebalancing after osd.27 marked out
      [=======================.......]
    Rebalancing after osd.24 marked out
      [============================..]
    Rebalancing after osd.26 marked out
      [========================......]
    Rebalancing after osd.18 marked out
      [===================...........]
    Rebalancing after osd.14 marked out
      [==========================....]
    Rebalancing after osd.22 marked out
      [=================.............]
    Rebalancing after osd.12 marked out
      [======================........]
    Rebalancing after osd.28 marked out
      [=====================.........]
    Rebalancing after osd.10 marked out
      [===========================...]


Expected results:
The progress section must update continuously and report accurate values.

Additional info:
Please let us know if any particular logs are needed.
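
If it helps reproduction, the events can also be watched programmatically instead of eyeballing ceph status. A small polling sketch, assuming the build's `ceph status -f json` exposes a "progress_events" map whose values carry "message" and "progress" (a 0.0-1.0 float); worth verifying on the build under test:

import json
import subprocess
import time

while True:
    out = subprocess.run(["ceph", "status", "-f", "json"],
                         check=True, capture_output=True, text=True).stdout
    events = json.loads(out).get("progress_events", {})
    if not events:
        print("no progress events -- all cleared")
        break
    for ev in events.values():
        print("%s: %.0f%%" % (ev["message"], 100 * ev["progress"]))
    time.sleep(10)  # with this bug, the same percentages repeat forever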

Comment 10 Yaniv Kaul 2020-05-07 13:21:45 UTC
Adding NEEDINFO for QE to reproduce.

Comment 15 Kamoltat (Junior) Sirivadhna 2020-10-07 05:21:04 UTC

Solution that appears to fix the problem:

- Recreated the problem with https://github.com/ceph/ceph/tree/v14.2.8
- Tested by marking 2/3 OSDs out. The result suggests the progress bar got stuck forever (waited 5 minutes).
- Applied my patch, which is everything up to https://github.com/ceph/ceph/commit/93d4d9d7044e991a7bbdb70b0aef02284e6eda22#diff-e6c8e5b8f137e32891a6ad184d076415
- The problem appears to be fixed.
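
As illustration only, a rough sketch of the idea behind the fix (the names Event.start_epoch, pg_stats_epoch, and on_pg_stats are made up for this example; see the commits below for the real implementation): the module must not evaluate PG state against stats older than the OSDMap epoch recorded when the event was created, otherwise the computed ratio can freeze.

class Event:
    def __init__(self, message, start_epoch, pg_ids):
        self.message = message          # e.g. "Rebalancing after osd.16 marked out"
        self.start_epoch = start_epoch  # OSDMap epoch when the OSD was marked out
        self.pg_ids = pg_ids            # PGs affected by the remapping
        self.progress = 0.0

class ProgressTracker:
    def __init__(self):
        self.events = []
        self.pg_stats_epoch = 0         # epoch of the newest PG stats seen

    def on_pg_stats(self, pg_stats_epoch, pg_states):
        # Recompute event progress from fresh PG stats.
        self.pg_stats_epoch = pg_stats_epoch
        for ev in list(self.events):
            # The bug: evaluating PG state from stats *older* than the
            # event's OSDMap epoch can miss the remapped PGs entirely,
            # freezing the ratio forever. So wait until stats catch up.
            if self.pg_stats_epoch < ev.start_epoch:
                continue
            done = sum(1 for pg in ev.pg_ids
                       if pg_states.get(pg) == "active+clean")
            ev.progress = (done / len(ev.pg_ids)) if ev.pg_ids else 1.0
            if ev.progress >= 1.0:
                self.events.remove(ev)  # finished events disappear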



This is the list of commits I am trying to backport to Nautilus before patching downstream, to prevent rebasing issues:


93d4d9d7044e991a7bbdb70b0aef02284e6eda22
901a37f436143a2525d6063f64942019cc888229
2046c25362a69b1ed2c0009e9ef6a944f0d9e621
dd2c3f66a1dbd9582b7cd695efff66317b730c8a
d37e8a4d84d873b7df264c63077805be8618ad7a
f618e56c93ad82a20ab844fddc3d2ded42f2a48e
21e1caba6df9d591ebff54939d020ce0a3e57efe

Comment 16 Kamoltat (Junior) Sirivadhna 2020-10-08 12:23:09 UTC
https://github.com/ceph/ceph/pull/37589
This is the pull request for backporting to Nautilus.

Comment 17 Yaniv Kaul 2020-11-25 08:17:46 UTC
(In reply to ksirivad from comment #16)
> https://github.com/ceph/ceph/pull/37589
> This is the pull request for back porting Nautilus

Excellent, how about a devel-ack then?

Comment 30 errata-xmlrpc 2021-06-15 17:13:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2445