Bug 1809064
| Summary: | [RFE] Speed up OCS recovery/drain during OCP upgrade | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Neha Berry <nberry> |
| Component: | ceph | Assignee: | Neha Ojha <nojha> |
| Status: | CLOSED DUPLICATE | QA Contact: | Raz Tamir <ratamir> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3 | CC: | jdurgin, madam, nojha, ocs-bugs, owasserm, ratamir, rojoseph, shan, sostapov, tnielsen |
| Target Milestone: | --- | Keywords: | FutureFeature |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-22 19:54:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Neha Berry
2020-03-02 11:32:27 UTC
After a discussion with @Neha Ojha and @Josh Durgin to investigate the viability of giving more resources to the recovery process in between upgrades, we decided we needed a bug so people could weigh in on what kind of tradeoffs between I/O and recovery we want to have.

---

Another question is whether we could tune other recovery parameters without actually affecting client I/O. Are the current limits too conservative? Are the relevant knobs osd_max_backfills and osd_recovery_sleep? If there are more, can we capture them here?

---

This is for a cluster with portable OSDs, correct? If the node is drained, isn't the OSD coming back up on another node? Or is a new node not available to start the drained OSD? If the OSD is getting moved to another node, there shouldn't be any need to backfill the data, since there would be no data movement after the OSD is moved to the other node.

If an OCP upgrade is causing data movement, something doesn't sound right. Upgrades would cause components to stop temporarily, but as long as the OSDs come back up, I don't see why we would wait for data to backfill. It shouldn't be expected during upgrade. Why is data movement happening?

---

I'm concerned by the fact that in a small cluster you already have one less node to run on. Adding more pressure does not seem like the right thing to do.

---

Per discussion with Rohan, I now understand that the backfilling is not from moving data around, but is for completing the data commits for active client I/O while the OSD was down temporarily. The cluster is heavily loaded, so it takes time to come to a state with clean PGs.

If the cluster can't keep up with the backfill, we need to allow backfill to happen more aggressively, even if it means an impact to client I/O perf.

---

(In reply to Travis Nielsen from comment #6)
> Per discussion with Rohan, I now understand that the backfilling is not from
> moving data around, but is for completing the data commits for active client
> IO while the OSD was down temporarily. The cluster is heavily loaded so it
> takes time to come to a state with clean PGs.

It'd be nice if they could actually throttle those clients...

> If the cluster can't keep up with the backfill, we need to allow backfill to
> happen more aggressively even if it means impact to client IO perf.

Agree, but I do wonder if that will eventually converge - a hungry client will just keep us busy backfilling?

---

(In reply to Yaniv Kaul from comment #7)
> It'd be nice if they could actually throttle those clients...
>
> Agree, but I do wonder if that will eventually converge - a hungry client
> will just keep us busy backfilling?

Agreed; something doesn't sound right if it takes the OSD hours to complete the backfill of the new ingress from the short time that the OSD was down.

@Neha, can you provide more details on how busy the client I/O is during the upgrade? And if there is less client I/O, do you see any issues with the PGs not getting to a clean state?
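For anyone trying to answer that question, client and recovery throughput can be watched side by side from `ceph status`. A minimal sketch, assuming the `ceph` CLI is reachable (e.g. from the rook-ceph-tools pod); the pgmap counters used here are only present in the JSON output while there is matching activity, hence the defaults:

```python
#!/usr/bin/env python3
"""Poll `ceph status` and print client vs. recovery throughput side by side."""
import json
import subprocess
import time

def ceph_status():
    # Shell out to the ceph CLI; assumes admin credentials are available.
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    return json.loads(out)

while True:
    pgmap = ceph_status().get("pgmap", {})
    # These counters are omitted from the JSON when there is no such activity.
    client_rd = pgmap.get("read_bytes_sec", 0)
    client_wr = pgmap.get("write_bytes_sec", 0)
    recovery = pgmap.get("recovering_bytes_per_sec", 0)
    degraded = pgmap.get("degraded_objects", 0)
    print(f"client r/w: {client_rd}/{client_wr} B/s  "
          f"recovery: {recovery} B/s  degraded objects: {degraded}")
    time.sleep(10)
```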
---

All the default recovery/backfill settings favor client I/O. In that sense, yes, they are conservative enough to make sure client I/O is never bogged down by the impact of recovery.

---

Sorry, I was wondering if they are too conservative.
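For reference, the throttles named earlier in the thread can be inspected and overridden at runtime. A minimal sketch, assuming a Nautilus-or-later cluster where `ceph config` manages runtime options; the raised values are illustrative only, not a recommendation:

```python
#!/usr/bin/env python3
"""Temporarily loosen the recovery/backfill throttles, then restore defaults."""
import subprocess

def ceph(*args):
    subprocess.check_call(["ceph", *args])

# Inspect the current values. The defaults in this era (osd_max_backfills=1,
# osd_recovery_sleep_hdd=0.1s, osd_recovery_max_active=3) are tuned to
# protect client I/O.
for opt in ("osd_max_backfills", "osd_recovery_sleep_hdd",
            "osd_recovery_max_active"):
    ceph("config", "get", "osd", opt)

# Loosen for the upgrade window (illustrative values)...
ceph("config", "set", "osd", "osd_max_backfills", "4")
ceph("config", "set", "osd", "osd_recovery_sleep_hdd", "0")

# ...and remove the overrides once PGs are clean again, which reverts each
# option to its built-in default.
ceph("config", "rm", "osd", "osd_max_backfills")
ceph("config", "rm", "osd", "osd_recovery_sleep_hdd")
```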
---

I'm guessing this one also needs more detail about what kind of I/O is being run.

---

How much are we able to characterize the I/O in a general product?

Also, I think I heard someone say there were some known issues with Ceph Filesystem recovery?
---

Would it be possible to implement some sort of automatic resource allocation for recovery/backfill that reacts to client I/O?

---

Moving 4.4 BZs to 4.5.

---

Josh, Neha, is this tracked somewhere in Core RADOS along with the QoS work? The idea is to auto-tune Ceph to speed up recovery (give it a higher priority), i.e. prioritizing recovery over client I/O.

Thanks.

Pushing this again to 4.6, and thinking of moving it to Ceph if not already tracked.

---

(In reply to leseb from comment #14)
> Josh, Neha, is this tracked somewhere in Core RADOS along with the QoS work?
> The idea is to auto-tune Ceph to speed up recovery (give it a higher
> priority), i.e. prioritizing recovery over client I/O.
>
> Pushing this again to 4.6, and thinking of moving it to Ceph if not already
> tracked.

As Neha mentioned above, there is no actionable fix, since we don't have an understanding of what exactly the test is doing. We can't say that a higher recovery priority will help here if the problem is, for example, recovery of small files for CephFS. Neha asked for this information two months ago - at a minimum, what workload is being run. Can QE provide a detailed description of the test setup and what I/O was being done?

---

Hi @Neha, could you please provide the needed information?

---

I think this is an optimization that should live in Ceph, but it's a long-term goal.

---

> I am planning to test on AWS + OCP 4.5 + OCS 4.5 -> and upgrade between internal nightly builds of OCP 4.5

I don't know the current stability of either the OCP 4.5 or OCS 4.5 builds, but there is a simpler way to trigger a machine config operator upgrade, which will result in the same cyclic drain/reboot process (see the sketch at the end of this report): https://gist.github.com/rohantmp/113d6b18d5e6fee385e9738e90383246

---

*** This bug has been marked as a duplicate of bug 1856254 ***
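Regarding the machine config operator trigger mentioned above: one common way to force the same cyclic drain/reboot without an actual upgrade is to roll out a trivial MachineConfig, since the MCO cordons, drains, and reboots every node in the pool whenever its rendered config changes. A hedged sketch, not necessarily what the linked gist does; the object name and marker file are made up for illustration, and the ignition version must match the cluster's MCO (2.2.0 for the OCP 4.5 era discussed here):

```python
#!/usr/bin/env python3
"""Trigger an MCO rolling drain/reboot of the worker pool via `oc apply`."""
import subprocess

# A trivial MachineConfig: it only writes a marker file, but any change to
# the worker pool's rendered config forces a one-by-one drain + reboot.
MACHINE_CONFIG = """\
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-drain-test
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
        - path: /etc/drain-test-marker
          filesystem: root
          mode: 420
          contents:
            source: data:,drain-test
"""

# Apply the MachineConfig from stdin; deleting it later triggers another
# full drain/reboot cycle across the pool.
subprocess.run(["oc", "apply", "-f", "-"],
               input=MACHINE_CONFIG.encode(), check=True)
```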