Description of problem (please be as detailed as possible and provide log snippets):

On the cluster, after scaling up to 6 OSDs on an AWS IPI deployment and upgrading from 4.4.1 to the 4.4.2 build, I see:

18:12:26 - MainThread - ocs_ci.ocs.resources.storage_cluster - INFO - Verifying ceph health
18:12:28 - MainThread - ocs_ci.utility.utils - WARNING - Waiting for clean PGs. Degraded: 82 Undersized: 91.
18:12:28 - MainThread - ocs_ci.utility.utils - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN Degraded data redundancy: 6214/57972 objects degraded (10.719%), 82 pgs degraded, 91 pgs undersized

After almost 1 hour I see:

19:10:31 - MainThread - ocs_ci.utility.retry - WARNING - Ceph cluster health is not OK. Health: HEALTH_WARN Degraded data redundancy: 5720/58488 objects degraded (9.780%), 76 pgs degraded, 85 pgs undersized; 1 slow ops, oldest one blocked for 81 sec, osd.0 has slow ops , Retrying in 30 seconds...

So we went from 82 to 76 degraded PGs (about 10 minutes per PG). IO in the ceph cluster is:
client: 5.7 MiB/s rd, 55 MiB/s wr, 1.46k op/s rd, 2.25k op/s wr

When I checked the IO with the ceph status command I saw:

$ oc rsh -n openshift-storage rook-ceph-tools-6c67d65646-jf5b8 ceph status
  cluster:
    id:     55f50236-44d1-47fb-98d8-09e2edac4886
    health: HEALTH_WARN
            Degraded data redundancy: 5720/58488 objects degraded (9.780%), 76 pgs degraded, 85 pgs undersized
            1 slow ops, oldest one blocked for 97 sec, osd.0 has slow ops

  services:
    mon: 3 daemons, quorum a,b,c (age 2h)
    mgr: a(active, since 2h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 6 osds: 6 up (since 70m), 6 in (since 2h); 192 remapped pgs

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   3 pools, 288 pgs
    objects: 19.50k objects, 75 GiB
    usage:   232 GiB used, 12 TiB / 12 TiB avail
    pgs:     5720/58488 objects degraded (9.780%)
             19374/58488 objects misplaced (33.125%)
             107 active+remapped+backfill_wait
             96  active+clean
             75  active+undersized+degraded+remapped+backfill_wait
             9   active+undersized+remapped+backfill_wait
             1   active+undersized+degraded+remapped+backfilling

  io:
    client: 5.7 MiB/s rd, 55 MiB/s wr, 1.46k op/s rd, 2.25k op/s wr

Which means that at this pace the cluster will take ~13 hours to reach health OK.

Version of all relevant components (if applicable):
Upgrade from 4.4.1 to 4.4.2
OCP 4.4 nightly

Need to confirm whether such slow recovery of PGs is expected and why we see it every time after the upgrade.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
The

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI? Haven't tried; it's via our automation.

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy OCS 4.4.1
2. Run FIO in the background
3. Upgrade to 4.4.2

Actual results:
Degraded PGs, and it's taking a lot of time to recover.

Expected results:
Recovery completes much more quickly.

Additional info:
Jenkins job: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/9846/console
Must gather from cluster: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jnk-ai3c33-ua/jnk-ai3c33-ua_20200714T153442/logs/failed_testcase_ocs_logs_1594743991/test_upgrade_ocs_logs/
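The ~13 hour figure above is a linear extrapolation from the two log samples (82 degraded PGs at 18:12, 76 at 19:10). A minimal sketch of that arithmetic, assuming recovery continues at the observed rate (the helper name is mine, not from ocs-ci):

```python
def recovery_eta_hours(count_start, count_now, elapsed_seconds):
    """Linearly extrapolate how long until a degraded-PG (or
    degraded-object) count reaches zero, given two samples taken
    elapsed_seconds apart."""
    recovered = count_start - count_now
    if recovered <= 0:
        return float("inf")  # no progress observed between the samples
    rate = recovered / elapsed_seconds  # PGs (or objects) per second
    return count_now / rate / 3600.0

# Samples from the logs above: 82 -> 76 degraded PGs in ~58 minutes.
eta = recovery_eta_hours(82, 76, 58 * 60)
print(f"estimated {eta:.1f} hours until clean")  # same ballpark as the ~13h noted above
```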
Strange that you don't see the recovery IO in the Ceph status? I assume you are using the regular node types, EBS, etc...?
(In reply to Yaniv Kaul from comment #2) > Strange that you don't see the recovery IO in the Ceph status? @Yaniv, not sure; I was running the command from the ceph toolbox pod as usual. > I assume you are using the regular node types, EBS, etc...? Yep, this was a regular deployment over AWS IPI.
Doesn't seem to belong in ocs-operator; it's either rook or ceph. Moving to rook first for further analysis.
I think the high number of objects that are misplaced should be looked into. The OSD disruption lasts only a brief time during the upgrade, so I'm not sure why we are seeing such a huge spike: 19374/58488 objects misplaced (33.125%)
This is closely related to the upgrade issue with unhealthy PGs in https://bugzilla.redhat.com/show_bug.cgi?id=1856254. Moving to the ceph component as it appears Rook is reconciling as expected.
This looks the same as https://bugzilla.redhat.com/show_bug.cgi?id=1856254 - the large number of misplaced objects is again due to most objects existing on osds 0,1,2. Based on the pg dump here, it looks like it's due to the cluster being expanded to 6 osds recently without having a chance to backfill to them yet. *** This bug has been marked as a duplicate of bug 1856254 ***
Hey Josh, how can we see that the rebalance has completed after adding new OSDs, so we can decide we are ready to run the upgrade? I think we are currently just checking whether the HEALTH of the cluster is OK, and if so we start the upgrade. Does that mean we can have HEALTH_OK even while rebalancing is still running? I'm just looking at how to improve the wait after the add_capacity test before we start the upgrade, to be sure we don't hit this issue. Thanks
Elad is trying to fix the add_capacity test to wait for the rebalance to finish, and we have some discussion for example here: https://github.com/red-hat-storage/ocs-ci/pull/2570#issuecomment-664429968 and in the same PR here: https://github.com/red-hat-storage/ocs-ci/pull/2570#discussion_r460950298 Can you, Josh, or other Ceph folks help us figure out the right approach to properly waiting for the rebalance to finish and, if possible, to estimating how much time it will need? It would be good to have a calculated timeout, so we can predict how long the rebalance will take to finish and reach ceph health OK. Not sure if we can easily calculate that, or do you recommend something else? Thanks
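One possible building block for such a check: instead of trusting overall health, parse `ceph status --format json` and require that every PG is active+clean. A sketch under the assumption that `pgmap` carries `num_pgs` and `pgs_by_state` (field names should be verified against the Ceph version in use; the function names are illustrative, not ocs-ci API):

```python
import json
import subprocess

def all_pgs_active_clean(status):
    """Return True if every PG in a parsed `ceph status --format json`
    dict is exactly active+clean (no backfill_wait, degraded, etc.)."""
    pgmap = status["pgmap"]
    clean = sum(
        s["count"]
        for s in pgmap.get("pgs_by_state", [])
        if s["state_name"] == "active+clean"
    )
    return clean == pgmap["num_pgs"]

def cluster_is_rebalanced(toolbox_pod, namespace="openshift-storage"):
    """Run ceph status inside the toolbox pod and check PG cleanliness."""
    out = subprocess.check_output(
        ["oc", "rsh", "-n", namespace, toolbox_pod,
         "ceph", "status", "--format", "json"]
    )
    return all_pgs_active_clean(json.loads(out))
```

With the status shown in the description (96 of 288 PGs active+clean), this check would correctly report that the cluster is not yet rebalanced even if overall health briefly reads OK.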
(In reply to Petr Balogh from comment #8) > Hey Josh, > > how can we see that the rebalance has completed after adding new OSDs, so we > can decide we are ready to run the upgrade? > > I think we are currently just checking whether the HEALTH of the cluster is > OK, and if so we start the upgrade. > > Does that mean we can have HEALTH_OK even while rebalancing is still running? It sounds like you may be running into a case where the cluster is HEALTH_OK briefly, perhaps before beginning to backfill the new nodes. > I'm just looking at how to improve the wait after the add_capacity test > before we start the upgrade, to be sure we don't hit this issue. Have a look at the upstream recovery checks - they look at whether all PGs are active and finished recovery, rather than relying on HEALTH_OK: https://github.com/ceph/ceph/blob/master/qa/tasks/ceph_manager.py#L2576 Also note that it checks for recovery progress by looking at whether recovery is happening, rather than using a fixed timeout. At a higher layer, if anything in the test does not complete in 12 hours, the test is killed. https://github.com/ceph/ceph/blob/master/qa/tasks/ceph_manager.py#L2371 Since hardware and workload vary so much, I'd suggest using a similar approach here: don't worry about calculating a timeout, but inspect whether recovery is progressing.
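The progress-based approach described above - reset the clock whenever the unclean count drops, and only give up when recovery stalls - could be sketched like this (the sampling callback is injected so the caller decides how counts are obtained; all names are illustrative, not the ceph_manager.py or ocs-ci API):

```python
import time

def wait_for_recovery(get_unclean_count, no_progress_timeout=600,
                      poll_interval=30):
    """Wait until get_unclean_count() returns 0.

    Only gives up if the count fails to drop for no_progress_timeout
    seconds, i.e. recovery has stalled -- mirroring the upstream idea
    of checking progress rather than total elapsed time.
    """
    best = get_unclean_count()
    last_progress = time.monotonic()
    while best > 0:
        if time.monotonic() - last_progress > no_progress_timeout:
            raise TimeoutError(f"recovery stalled at {best} unclean PGs")
        time.sleep(poll_interval)
        current = get_unclean_count()
        if current < best:
            best = current
            last_progress = time.monotonic()  # progress made: reset the clock
    return True
```

An overall cap (like the upstream 12-hour test kill) can still be layered on top by the test harness; this loop only encodes the "is recovery moving?" criterion.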