Bug 1856975
Summary: | On the cluster with 6 OSDs and running the FIO in the BG the recovery of degraded PGs after the upgrade takes a lot of time | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Petr Balogh <pbalogh>
Component: | ceph | Assignee: | Josh Durgin <jdurgin>
Status: | CLOSED DUPLICATE | QA Contact: | Raz Tamir <ratamir>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.4 | CC: | bniver, jdurgin, madam, ocs-bugs, sostapov, vakulkar
Target Milestone: | --- | Keywords: | Automation
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-07-21 19:50:19 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Petr Balogh
2020-07-14 20:14:04 UTC
Strange that you don't see the recovery IO in the Ceph status? I assume you are using the regular node types, EBS, etc.?

(In reply to Yaniv Kaul from comment #2)
> Strange that you don't see the recovery IO in the Ceph status?

@Yaniv, not sure; I was running the command from the ceph toolbox pod as usual.

> I assume you are using the regular node types, EBS, etc.?

Yep, this was a regular deployment over AWS IPI.

This doesn't seem to belong in ocs-operator; it is either rook or ceph. Moving to rook first for further analysis.

I think the high number of misplaced objects should be looked into. The OSD disruption during upgrade only lasts a brief time, so it is not clear why we see such a huge spike: 19374/58488 objects misplaced (33.125%).

This is closely related to the upgrade issue with unhealthy PGs in https://bugzilla.redhat.com/show_bug.cgi?id=1856254. Moving to the ceph component as it appears Rook is reconciling as expected.

This looks the same as https://bugzilla.redhat.com/show_bug.cgi?id=1856254 - the large number of misplaced objects is again due to most objects existing on osds 0,1,2. Based on the pg dump here, it looks like the cluster was expanded to 6 OSDs recently without having had a chance to backfill to them yet.

*** This bug has been marked as a duplicate of bug 1856254 ***

Hey Josh,

How can we see that the rebalance has completed after adding new OSDs, so we can decide we are ready to run the upgrade? I think we currently just check whether the cluster HEALTH is OK, and if so we start the upgrade. Does that mean we can see HEALTH_OK even while rebalancing is still running?

I'm looking at how we can improve the wait time after the add_capacity test, before we start the upgrade, to make sure we do not hit this issue. Thanks.

Elad is trying to fix the add_capacity test to wait for the rebalance to finish, and we have some discussion here: https://github.com/red-hat-storage/ocs-ci/pull/2570#issuecomment-664429968 and in the same PR here: https://github.com/red-hat-storage/ocs-ci/pull/2570#discussion_r460950298

Josh, can you or other Ceph folks help us figure out the right approach to properly wait for the rebalance to finish, and if possible how to estimate the time it needs? It would be good to have a calculated timeout, so we know how long it should take for the rebalance to finish and for ceph health to return to OK. Not sure if we can easily calculate that, or if you recommend something else? Thanks.

(In reply to Petr Balogh from comment #8)
> How can we see that the rebalance has completed after adding new OSDs, so we
> can decide we are ready to run the upgrade?
>
> I think we currently just check whether the cluster HEALTH is OK, and if so
> we start the upgrade.
>
> Does that mean we can see HEALTH_OK even while rebalancing is still running?

It sounds like you may be running into a case where the cluster is HEALTH_OK briefly, perhaps before beginning to backfill the new nodes.

> I'm looking at how we can improve the wait time after the add_capacity test,
> before we start the upgrade, to make sure we do not hit this issue.

Have a look at the upstream recovery checks - they look at whether all PGs are active and finished recovery, rather than relying on HEALTH_OK: https://github.com/ceph/ceph/blob/master/qa/tasks/ceph_manager.py#L2576

Also note that they check for progress by looking at whether recovery is actually advancing, rather than using a fixed timeout.
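As an illustration of that kind of check, here is a minimal sketch (not the actual ocs-ci or ceph-qa code; it assumes `ceph status --format json` can be run directly, e.g. via the rook-ceph toolbox pod, and that the `pgmap` fields read below are present in the output):

```python
# Minimal sketch (assumption): wait until every PG reports active+clean
# instead of relying on HEALTH_OK. Assumes the `ceph` CLI is reachable,
# e.g. wrapped via `oc rsh` into the rook-ceph toolbox pod.
import json
import subprocess
import time


def all_pgs_active_clean(ceph_cmd=("ceph",)):
    out = subprocess.check_output([*ceph_cmd, "status", "--format", "json"])
    pgmap = json.loads(out)["pgmap"]
    clean = sum(
        state["count"]
        for state in pgmap.get("pgs_by_state", [])
        if state["state_name"] == "active+clean"
    )
    return clean == pgmap["num_pgs"]


def wait_for_clean(poll_seconds=30, ceph_cmd=("ceph",)):
    # Poll until all PGs are active+clean; callers decide how long to wait.
    while not all_pgs_active_clean(ceph_cmd):
        time.sleep(poll_seconds)
```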
At a higher layer, if anything in the test does not complete in 12 hours, the test is killed: https://github.com/ceph/ceph/blob/master/qa/tasks/ceph_manager.py#L2371

Since hardware and workload vary so much, I'd suggest using a similar approach here: don't worry about calculating a timeout, but inspect whether recovery is progressing.
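One hedged way to translate that advice into the add_capacity/upgrade flow (hypothetical helper, not part of ocs-ci) is a progress watchdog: keep waiting while the degraded/misplaced object counts keep dropping, and only fail if they stall for a grace period:

```python
# Hypothetical progress watchdog (assumption): abort only if recovery stops
# making progress for `stall_limit` seconds, rather than after a fixed
# overall timeout. The degraded/misplaced counters only appear in the
# pgmap section of `ceph status --format json` while recovery is active,
# hence the .get() defaults.
import json
import subprocess
import time


def objects_left_to_recover(ceph_cmd=("ceph",)):
    out = subprocess.check_output([*ceph_cmd, "status", "--format", "json"])
    pgmap = json.loads(out)["pgmap"]
    return pgmap.get("degraded_objects", 0) + pgmap.get("misplaced_objects", 0)


def wait_for_rebalance(poll=60, stall_limit=1800, ceph_cmd=("ceph",)):
    remaining = objects_left_to_recover(ceph_cmd)
    last_progress = time.time()
    while remaining > 0:
        time.sleep(poll)
        current = objects_left_to_recover(ceph_cmd)
        if current < remaining:
            remaining, last_progress = current, time.time()
        elif time.time() - last_progress > stall_limit:
            raise TimeoutError(f"no recovery progress for {stall_limit}s")
    return True
```

The point is the same as in the upstream check: hardware and workload vary too much for a single calculated timeout, so the test waits on observed progress instead.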