Bug 1809064
| Summary: | [RFE] Speed up OCS recovery/drain during OCP upgrade | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Neha Berry <nberry> |
| Component: | ceph | Assignee: | Neha Ojha <nojha> |
| Status: | CLOSED DUPLICATE | QA Contact: | Raz Tamir <ratamir> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3 | CC: | jdurgin, madam, nojha, ocs-bugs, owasserm, ratamir, rojoseph, shan, sostapov, tnielsen |
| Target Milestone: | --- | Keywords: | FutureFeature |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-22 19:54:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Neha Berry
2020-03-02 11:32:27 UTC
After a discussion with @Neha Ojha and @Josh Durgin to investigate the viability of giving more resources to the recovery process in between upgrades, we decided we needed a bug so people could weigh in on what kind of tradeoffs between I/O and recovery we want to have.

---

Another question is whether we could tune other recovery parameters without actually affecting client I/O. Are the current limits too conservative? Are the relevant knobs osd_max_backfills and osd_recovery_sleep? If there are more, can we capture them here?

---

This is for a cluster with portable OSDs, correct? If the node is drained, isn't the OSD coming back up on another node? Or is a new node not available to start the drained OSD? If the OSD is getting moved to another node, there shouldn't be any need to backfill the data, since there would be no data movement after the OSD is moved to the other node.

If an OCP upgrade is causing data movement, something doesn't sound right. Upgrades would cause components to stop temporarily, but as long as the OSDs come back up, I don't see why we would wait for data to backfill. It shouldn't be expected during upgrade. Why is data movement happening?

---

I'm concerned by the fact that in a small cluster you already have one less node to run on. Adding more pressure does not seem like the right thing to do.

---

Per discussion with Rohan, I now understand that the backfilling is not from moving data around, but is for completing the data commits for active client I/O while the OSD was down temporarily. The cluster is heavily loaded, so it takes time to come to a state with clean PGs.

If the cluster can't keep up with the backfill, we need to allow backfill to happen more aggressively, even if it means an impact to client I/O perf.

---

(In reply to Travis Nielsen from comment #6)
> Per discussion with Rohan, I now understand that the backfilling is not from
> moving data around, but is for completing the data commits for active client
> IO while the OSD was down temporarily. The cluster is heavily loaded so it
> takes time to come to a state with clean PGs.

It'd be nice if they could actually throttle those clients...

> If the cluster can't keep up with the backfill, we need to allow backfill to
> happen more aggressively even if it means impact to client IO perf.

Agree, but I do wonder if that will eventually converge - a hungry client will just keep us busy backfilling?

---

(In reply to Yaniv Kaul from comment #7)
> It'd be nice if they could actually throttle those clients...
>
> Agree, but I do wonder if that will eventually converge - a hungry client
> will just keep us busy backfilling?

Agreed; something doesn't sound right if it takes the OSD hours to complete the backfill of the new ingress from the short time that the OSD was down.

@Neha, can you provide more details on how busy the client I/O is during the upgrade? And if there is less client I/O, do you see any issues with the PGs not getting to a clean state?
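For anyone trying to answer that question, client and recovery throughput can be watched side by side from `ceph status`. A minimal sketch, assuming the `ceph` CLI is reachable (e.g. from the rook-ceph-tools pod); the pgmap counters used here are only present in the JSON output while there is matching activity, hence the defaults:

```python
#!/usr/bin/env python3
"""Poll `ceph status` and print client vs. recovery throughput side by side."""
import json
import subprocess
import time

def ceph_status():
    # Shell out to the ceph CLI; assumes admin credentials are available.
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    return json.loads(out)

while True:
    pgmap = ceph_status().get("pgmap", {})
    # These counters are omitted from the JSON when there is no such activity.
    client_rd = pgmap.get("read_bytes_sec", 0)
    client_wr = pgmap.get("write_bytes_sec", 0)
    recovery = pgmap.get("recovering_bytes_per_sec", 0)
    degraded = pgmap.get("degraded_objects", 0)
    print(f"client r/w: {client_rd}/{client_wr} B/s  "
          f"recovery: {recovery} B/s  degraded objects: {degraded}")
    time.sleep(10)
```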
---

All the default recovery/backfill settings favor client I/O. In that sense, yes, they are conservative enough to make sure client I/O is never bogged down by the impact of recovery.

---

Sorry, I was wondering if they are too conservative.
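For reference, the throttles named earlier in the thread can be inspected and overridden at runtime. A minimal sketch, assuming a Nautilus-or-later cluster where `ceph config` manages runtime options; the raised values are illustrative only, not a recommendation:

```python
#!/usr/bin/env python3
"""Temporarily loosen the recovery/backfill throttles, then restore defaults."""
import subprocess

def ceph(*args):
    subprocess.check_call(["ceph", *args])

# Inspect the current values. The defaults in this era (osd_max_backfills=1,
# osd_recovery_sleep_hdd=0.1s, osd_recovery_max_active=3) are tuned to
# protect client I/O.
for opt in ("osd_max_backfills", "osd_recovery_sleep_hdd",
            "osd_recovery_max_active"):
    ceph("config", "get", "osd", opt)

# Loosen for the upgrade window (illustrative values)...
ceph("config", "set", "osd", "osd_max_backfills", "4")
ceph("config", "set", "osd", "osd_recovery_sleep_hdd", "0")

# ...and remove the overrides once PGs are clean again, which reverts each
# option to its built-in default.
ceph("config", "rm", "osd", "osd_max_backfills")
ceph("config", "rm", "osd", "osd_recovery_sleep_hdd")
```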
---

I'm guessing this one also needs more detail about what kind of I/O is being run.

---

How much are we able to characterize the I/O in a general product?

Also, I think I heard someone say there were some known issues with Ceph Filesystem recovery?
---

Would it be possible to implement some sort of automatic resource allocation for recovery/backfill that reacts to client I/O?

---

Moving 4.4 BZs to 4.5.

---

Josh, Neha, is this tracked somewhere in Core RADOS along with the QoS work? The idea is to auto-tune Ceph to speed up recovery (give it a higher priority), i.e. prioritizing recovery over client I/O.

Thanks.

Pushing this again to 4.6, and thinking of moving it to Ceph if not already tracked.

---

(In reply to leseb from comment #14)
> Josh, Neha, is this tracked somewhere in Core RADOS along with the QoS work?
> The idea is to auto-tune Ceph to speed up recovery (give it a higher
> priority), i.e. prioritizing recovery over client I/O.
>
> Pushing this again to 4.6, and thinking of moving it to Ceph if not already
> tracked.

As Neha mentioned above, there is no actionable fix, since we don't have an understanding of what exactly the test is doing. We can't say that a higher recovery priority will help here if the problem is, for example, recovery of small files for CephFS. Neha asked for this information two months ago - at a minimum, what workload is being run. Can QE provide a detailed description of the test setup and what I/O was being done?

---

Hi @Neha, could you please provide the needed information?

---

I think this is an optimization that should live in Ceph, but it's a long-term goal.

---

> I am planning to test on AWS + OCP 4.5 + OCS 4.5 -> and upgrade between internal nightly builds of OCP 4.5

I don't know the current stability of either the OCP 4.5 or OCS 4.5 builds, but there is a simpler way to trigger a machine config operator upgrade, which will result in the same cyclic drain/reboot process (see the sketch at the end of this report): https://gist.github.com/rohantmp/113d6b18d5e6fee385e9738e90383246

---

*** This bug has been marked as a duplicate of bug 1856254 ***
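Regarding the machine config operator trigger mentioned above: one common way to force the same cyclic drain/reboot without an actual upgrade is to roll out a trivial MachineConfig, since the MCO cordons, drains, and reboots every node in the pool whenever its rendered config changes. A hedged sketch, not necessarily what the linked gist does; the object name and marker file are made up for illustration, and the ignition version must match the cluster's MCO (2.2.0 for the OCP 4.5 era discussed here):

```python
#!/usr/bin/env python3
"""Trigger an MCO rolling drain/reboot of the worker pool via `oc apply`."""
import subprocess

# A trivial MachineConfig: it only writes a marker file, but any change to
# the worker pool's rendered config forces a one-by-one drain + reboot.
MACHINE_CONFIG = """\
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-drain-test
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
        - path: /etc/drain-test-marker
          filesystem: root
          mode: 420
          contents:
            source: data:,drain-test
"""

# Apply the MachineConfig from stdin; deleting it later triggers another
# full drain/reboot cycle across the pool.
subprocess.run(["oc", "apply", "-f", "-"],
               input=MACHINE_CONFIG.encode(), check=True)
```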