Bug 1856975
Summary: | On the cluster with 6 OSDs and running the FIO in the BG the recovery of degraded PGs after the upgrade takes a lot of time | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Petr Balogh <pbalogh>
Component: | ceph | Assignee: | Josh Durgin <jdurgin>
Status: | CLOSED DUPLICATE | QA Contact: | Raz Tamir <ratamir>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.4 | CC: | bniver, jdurgin, madam, ocs-bugs, sostapov, vakulkar
Target Milestone: | --- | Keywords: | Automation
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-07-21 19:50:19 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Petr Balogh
2020-07-14 20:14:04 UTC
Strange that you don't see the recovery IO in the Ceph status? I assume you are using the regular node types, EBS, etc.?

(In reply to Yaniv Kaul from comment #2)
> Strange that you don't see the recovery IO in the Ceph status?

@Yaniv, not sure; I was running the command from the ceph toolbox pod as usual.

> I assume you are using the regular node types, EBS, etc.?

Yep, this was a regular deployment over AWS IPI.

This doesn't seem to belong in ocs-operator; it is either rook or ceph. Moving to rook first for further analysis.

I think the high number of misplaced objects should be looked into. The OSD disruption during upgrade only lasts a brief time, so it is not clear why we see such a huge spike: 19374/58488 objects misplaced (33.125%).

This is closely related to the upgrade issue with unhealthy PGs in https://bugzilla.redhat.com/show_bug.cgi?id=1856254. Moving to the ceph component as it appears Rook is reconciling as expected.

This looks the same as https://bugzilla.redhat.com/show_bug.cgi?id=1856254 - the large number of misplaced objects is again due to most objects existing on osds 0,1,2. Based on the pg dump here, it looks like the cluster was expanded to 6 OSDs recently without having had a chance to backfill to them yet.

*** This bug has been marked as a duplicate of bug 1856254 ***

Hey Josh,

How can we see that the rebalance has completed after adding new OSDs, so we can decide we are ready to run the upgrade? I think we currently just check whether the cluster HEALTH is OK, and if so we start the upgrade. Does that mean we can see HEALTH_OK even while rebalancing is still running?

I'm looking at how we can improve the wait time after the add_capacity test, before we start the upgrade, to make sure we do not hit this issue. Thanks.

Elad is trying to fix the add_capacity test to wait for the rebalance to finish, and we have some discussion here: https://github.com/red-hat-storage/ocs-ci/pull/2570#issuecomment-664429968 and in the same PR here: https://github.com/red-hat-storage/ocs-ci/pull/2570#discussion_r460950298

Josh, can you or other Ceph folks help us figure out the right approach to properly wait for the rebalance to finish, and if possible how to estimate the time it needs? It would be good to have a calculated timeout, so we know how long it should take for the rebalance to finish and for ceph health to return to OK. Not sure if we can easily calculate that, or if you recommend something else? Thanks.

(In reply to Petr Balogh from comment #8)
> How can we see that the rebalance has completed after adding new OSDs, so we
> can decide we are ready to run the upgrade?
>
> I think we currently just check whether the cluster HEALTH is OK, and if so
> we start the upgrade.
>
> Does that mean we can see HEALTH_OK even while rebalancing is still running?

It sounds like you may be running into a case where the cluster is HEALTH_OK briefly, perhaps before beginning to backfill the new nodes.

> I'm looking at how we can improve the wait time after the add_capacity test,
> before we start the upgrade, to make sure we do not hit this issue.

Have a look at the upstream recovery checks - they look at whether all PGs are active and finished recovery, rather than relying on HEALTH_OK: https://github.com/ceph/ceph/blob/master/qa/tasks/ceph_manager.py#L2576

Also note that they check for progress by looking at whether recovery is actually advancing, rather than using a fixed timeout.
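As an illustration of that kind of check, here is a minimal sketch (not the actual ocs-ci or ceph-qa code; it assumes `ceph status --format json` can be run directly, e.g. via the rook-ceph toolbox pod, and that the `pgmap` fields read below are present in the output):

```python
# Minimal sketch (assumption): wait until every PG reports active+clean
# instead of relying on HEALTH_OK. Assumes the `ceph` CLI is reachable,
# e.g. wrapped via `oc rsh` into the rook-ceph toolbox pod.
import json
import subprocess
import time


def all_pgs_active_clean(ceph_cmd=("ceph",)):
    out = subprocess.check_output([*ceph_cmd, "status", "--format", "json"])
    pgmap = json.loads(out)["pgmap"]
    clean = sum(
        state["count"]
        for state in pgmap.get("pgs_by_state", [])
        if state["state_name"] == "active+clean"
    )
    return clean == pgmap["num_pgs"]


def wait_for_clean(poll_seconds=30, ceph_cmd=("ceph",)):
    # Poll until all PGs are active+clean; callers decide how long to wait.
    while not all_pgs_active_clean(ceph_cmd):
        time.sleep(poll_seconds)
```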
At a higher layer, if anything in the test does not complete in 12 hours, the test is killed: https://github.com/ceph/ceph/blob/master/qa/tasks/ceph_manager.py#L2371

Since hardware and workload vary so much, I'd suggest using a similar approach here: don't worry about calculating a timeout, but inspect whether recovery is progressing.
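One hedged way to translate that advice into the add_capacity/upgrade flow (hypothetical helper, not part of ocs-ci) is a progress watchdog: keep waiting while the degraded/misplaced object counts keep dropping, and only fail if they stall for a grace period:

```python
# Hypothetical progress watchdog (assumption): abort only if recovery stops
# making progress for `stall_limit` seconds, rather than after a fixed
# overall timeout. The degraded/misplaced counters only appear in the
# pgmap section of `ceph status --format json` while recovery is active,
# hence the .get() defaults.
import json
import subprocess
import time


def objects_left_to_recover(ceph_cmd=("ceph",)):
    out = subprocess.check_output([*ceph_cmd, "status", "--format", "json"])
    pgmap = json.loads(out)["pgmap"]
    return pgmap.get("degraded_objects", 0) + pgmap.get("misplaced_objects", 0)


def wait_for_rebalance(poll=60, stall_limit=1800, ceph_cmd=("ceph",)):
    remaining = objects_left_to_recover(ceph_cmd)
    last_progress = time.time()
    while remaining > 0:
        time.sleep(poll)
        current = objects_left_to_recover(ceph_cmd)
        if current < remaining:
            remaining, last_progress = current, time.time()
        elif time.time() - last_progress > stall_limit:
            raise TimeoutError(f"no recovery progress for {stall_limit}s")
    return True
```

The point is the same as in the upstream check: hardware and workload vary too much for a single calculated timeout, so the test waits on observed progress instead.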