Bug 1956601

Summary: [RADOS]: Global Recovery Event is running continuously with 0 objects in the cluster
Product: Red Hat Ceph Storage
Reporter: skanta
Component: RADOS
Assignee: ksirivad
Status: ASSIGNED
QA Contact: Manohar Murthy <mmurthy>
Severity: medium
Priority: medium
Version: 5.0
CC: akupczyk, bhubbard, ceph-eng-bugs, dzafman, jdurgin, kchai, nojha, rzarzyns, sseshasa, vashastr, vereddy, vumrao
Target Release: 5.1
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description skanta 2021-05-04 03:48:29 UTC
Description of problem:

   Global Recovery Event is running continuously with 0 objects in the cluster.


Version-Release number of selected component (if applicable):

[ceph: root@magna048 ceph]# ceph -v
ceph version 16.2.0-26.el8cp (26d0f1958ee507a7c6dd31af72106c7006a4d0b7) pacific (stable)
[ceph: root@magna048 ceph]# 


How reproducible:

Steps to Reproduce:
1. Configure a cluster.
2. Create a pool (here: ceph osd pool create testbench 100 100).


Actual results:

1. Cluster configuration output:
   [ceph: root@magna048 /]# ceph -s
     cluster:
       id:     ee3257e8-ac73-11eb-b907-002590fbc71c
       health: HEALTH_OK
 
     services:
       mon: 3 daemons, quorum magna048,magna049,magna050 (age 11m)
       mgr: magna048.rffxzv(active, since 25m), standbys: magna049.htxdoz
       osd: 23 osds: 23 up (since 6m), 23 in (since 6m)
 
     data:
       pools:   1 pools, 1 pgs
       objects: 0 objects, 0 B
       usage:   127 MiB used, 66 TiB / 66 TiB avail
       pgs:     1 active+clean

2. Created pool:
   [ceph: root@magna048 /]# ceph osd pool create testbench 100 100
   pool 'testbench' created
   [ceph: root@magna048 /]#

3. Output after pool creation:

  [ceph: root@magna048 /]# ceph -s
  cluster:
    id:     ee3257e8-ac73-11eb-b907-002590fbc71c
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum magna048,magna049,magna050 (age 12m)
    mgr: magna048.rffxzv(active, since 27m), standbys: magna049.htxdoz
    osd: 23 osds: 23 up (since 7m), 23 in (since 7m)
 
  data:
    pools:   2 pools, 101 pgs
    objects: 0 objects, 0 B
    usage:   133 MiB used, 66 TiB / 66 TiB avail
    pgs:     101 active+clean
 
  progress:
    Global Recovery Event (1s)
      [............................] (remaining: 64s)


After a few hours:

   [ceph: root@magna048 ceph]# ceph -s
  cluster:
    id:     ee3257e8-ac73-11eb-b907-002590fbc71c
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum magna048,magna049,magna050 (age 2h)
    mgr: magna048.rffxzv(active, since 2h), standbys: magna049.htxdoz
    osd: 23 osds: 23 up (since 2h), 23 in (since 2h)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   424 MiB used, 66 TiB / 66 TiB avail
    pgs:     33 active+clean
 
  progress:
    Global Recovery Event (2h)
      [===========================.] (remaining: 4m)
 
[ceph: root@magna048 ceph]# 


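Expected results:

   The Global Recovery Event completes and is removed from the progress section once all PGs are active+clean.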

Additional info:

Comment 1 skanta 2021-05-04 05:53:21 UTC
After four hours:

[ceph: root@magna048 ceph]# ceph -s
  cluster:
    id:     ee3257e8-ac73-11eb-b907-002590fbc71c
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum magna048,magna049,magna050 (age 4h)
    mgr: magna048.rffxzv(active, since 4h), standbys: magna049.htxdoz
    osd: 23 osds: 23 up (since 4h), 23 in (since 4h)
 
  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   424 MiB used, 66 TiB / 66 TiB avail
    pgs:     33 active+clean
 
  progress:
    Global Recovery Event (4h)
      [===========================.] (remaining: 8m)
 
[ceph: root@magna048 ceph]#
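
For comparing the 2h and 4h snapshots, here is a minimal sketch that reads the filled fraction off the progress bar line and checks it against the displayed "remaining" value. The helper names (bar_fraction, linear_remaining_seconds) are hypothetical, and the formula remaining = elapsed * (1 - p) / p is only an assumption inferred from the numbers in this report, not a confirmed description of the progress module. The bar is also quantized to 28 characters, so the fraction is a lower bound on the real progress value.

```python
import re

# Matches a bar such as "[=====.....]" as printed in the progress section.
BAR_RE = re.compile(r"\[([=.]+)\]")


def bar_fraction(line: str) -> float:
    """Return the filled fraction of a progress bar found in `line`."""
    match = BAR_RE.search(line)
    if match is None:
        raise ValueError("no progress bar found in line")
    bar = match.group(1)
    return bar.count("=") / len(bar)


def linear_remaining_seconds(elapsed_s: float, fraction: float) -> float:
    """Assumed model (not confirmed): plain linear extrapolation of the time left."""
    if fraction <= 0:
        raise ValueError("cannot extrapolate from zero progress")
    return elapsed_s * (1.0 - fraction) / fraction


# Bar lines copied from the 2h and 4h outputs in this report.
bar_2h = "      [===========================.] (remaining: 4m)"
bar_4h = "      [===========================.] (remaining: 8m)"

p_2h = bar_fraction(bar_2h)
p_4h = bar_fraction(bar_4h)

# Same fraction in both snapshots: the event made no visible progress in two hours.
print(p_2h == p_4h)

# Under the assumed model the estimate doubles along with the elapsed time.
print(round(linear_remaining_seconds(2 * 3600, p_2h) / 60, 1))  # 4.4 (minutes)
print(round(linear_remaining_seconds(4 * 3600, p_4h) / 60, 1))  # 8.9 (minutes)
```

Under that assumption, the displayed 4m and 8m line up with a progress value stuck just below 1.0, so the growing "remaining" figure reflects elapsed time only, not any recovery work.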

Comment 2 Josh Durgin 2021-05-04 14:47:04 UTC
This looks like the expected behavior of the pg_autoscaler - it's reducing the number of pgs from the initial 100 to the minimum for the pool.

That the progress event isn't going away is a presentation bug - since there are 0 objects and no pgs in recovery, it is purely a display issue.

Thus lowering the severity and moving to 5.1.

Comment 3 ksirivad 2021-05-05 14:30:07 UTC
Because this is 5.0, which is equivalent to Pacific upstream, I don't think it is related to the new pg_autoscaler behavior that scales down the PGs, since that feature was reverted before Pacific was branched off. I have an idea of what the problem is: it has to do with how the progress module ignores PGs that are not being reported by the OSD. I'm working on the fix and will patch this once it is backported upstream.
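
To illustrate why ignoring unreported PGs could leave an event stuck below 100%, here is a toy model. It is not the progress module's code: the function names, the PG ids, and the state dictionary are all hypothetical, and the "tolerant" variant is only one conceivable handling, not necessarily the fix referred to above.

```python
from typing import Dict, List


def stalled_progress(tracked_pgs: List[str], reported: Dict[str, str]) -> float:
    """Toy model: tracked PGs missing from the report are skipped entirely."""
    if not tracked_pgs:
        return 1.0
    done = 0
    for pgid in tracked_pgs:
        state = reported.get(pgid)
        if state is None:
            continue  # ignored: never counted as complete
        if state == "active+clean":
            done += 1
    return done / len(tracked_pgs)


def tolerant_progress(tracked_pgs: List[str], reported: Dict[str, str]) -> float:
    """Alternative toy model: a tracked PG that is no longer reported counts as done."""
    if not tracked_pgs:
        return 1.0
    done = sum(
        1 for pgid in tracked_pgs
        if reported.get(pgid, "active+clean") == "active+clean"
    )
    return done / len(tracked_pgs)


# Eight hypothetical PGs were tracked when the event started;
# two of them later stop appearing in the reports.
tracked = [f"2.{i:x}" for i in range(8)]
reported = {pgid: "active+clean" for pgid in tracked[:6]}

print(stalled_progress(tracked, reported))   # 0.75, can never reach 1.0
print(tolerant_progress(tracked, reported))  # 1.0
```

In the first variant the event can never complete while any tracked PG stays unreported, even though every reported PG is active+clean, which would match the symptom of a bar that stops short of full.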

Comment 5 Vasishta 2021-05-07 05:56:11 UTC
Hi Team,
Triggering another event refreshed the progress section, and the stale recovery event info disappeared.

Example -
I initiated:
>> ceph orch upgrade start
The progress section got updated:

  progress:
    Upgrade to 16.2.0-31.el8cp (0s)
      [=...........................]

Comment 7 Neha Ojha 2021-05-07 19:30:19 UTC
*** Bug 1958037 has been marked as a duplicate of this bug. ***