Description of problem:
[RFE] Increase the backfill/recovery priority of inactive PGs compared to PGs in active+<some other state>.

For example:
=================
    1 active+clean+scrubbing
    1 undersized+degraded+remapped+wait_backfill+peered
   33 active+degraded+remapped+backfilling
   24 active+recovery_wait+degraded+remapped
   11 active+remapped+backfill_toofull
53954 active+clean

In the example above, 33 PGs in active+<some other state> are being backfilled, but one PG (undersized+degraded+remapped+wait_backfill+peered) is inactive and is blocking client I/O precisely because it is inactive. Ceph should prioritize this PG over the other 33 PGs that are in active+<some other state>.

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 2.3

There is a lot of upstream work on this feature in Luminous and its upcoming point releases. This RFE bug tracks that work so it can be backported to Red Hat Ceph Storage 3.y. A backport[1] is available in Jewel, but it is not a complete resolution; more work is needed to fully deliver this feature. I discussed the current fix and the future work with Josh.

The current fix:
- only adjusts the priority of the backfill ops
- is not a perfect fix, but it does improve things
- because OSDs choose independently, one OSD could finish all of its high-priority backfills and start some lower-priority ones before the high-priority backfills on other OSDs have finished; some OSDs could have a bunch of inactive PGs to backfill while others have none, which could produce exactly the state shown in the example above
- when the OSDs with no inactive PGs are primary and start backfill, they have to start low-priority backfills
- could be improved in Luminous, since backfills can now be cancelled: e.g. when a higher-priority backfill needs the reservation, cancel a much lower-priority one to get it. This might increase total recovery time, since some backfills would have to be restarted, but it should help availability.

There is another PR in progress for the master branch:
https://github.com/ceph/ceph/pull/13723

In a large, live cluster it may be desirable to have particular PGs recovered before others. A real example is recovery after a rack failure, where a lot of PGs must be recovered and some of them host data for live VMs with a higher SLA than other VMs; in that case we would like the high-SLA VMs restored to full health and performance as fast as possible.

This PR adds four new commands:

1. ceph pg force-recovery
2. ceph pg force-backfill
3. ceph pg cancel-force-recovery
4. ceph pg cancel-force-backfill

which mark one or more specified PGs as "forced", maximizing their recovery or backfill priority. The PR also alters the default recovery priorities (capping the maximum at 254), so no other PG can get in the way. The user can restore the default priorities at any time with the "cancel-force-*" commands.

[1] https://github.com/ceph/ceph/pull/13232/commits/2f2032814189a4ecbf8dc01b59bebfae8ab3f524

$ git tag --contains 2f2032814189a4ecbf8dc01b59bebfae8ab3f524
v10.2.7
v10.2.8
v10.2.9
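The inactive-PG detection described above can be sketched in a few lines. This is only a hedged illustration, assuming pg-dump-style records with "pgid" and "state" fields (as in `ceph pg dump --format json` output); it is not the actual OSD-side logic:

```python
def inactive_pgs(pg_stats):
    """Return pgids whose state string lacks 'active'.

    Such PGs block client I/O, which is why they should be
    backfilled/recovered first.
    """
    return [pg["pgid"] for pg in pg_stats
            if "active" not in pg["state"].split("+")]


def force_commands(pg_stats):
    """Suggest a `ceph pg force-backfill`/`force-recovery` command
    for each inactive PG (commands per the PR above)."""
    cmds = []
    for pg in pg_stats:
        states = pg["state"].split("+")
        if "active" in states:
            continue  # already serving I/O; leave at default priority
        if "wait_backfill" in states or "backfilling" in states:
            cmds.append("ceph pg force-backfill " + pg["pgid"])
        else:
            cmds.append("ceph pg force-recovery " + pg["pgid"])
    return cmds


if __name__ == "__main__":
    # Sample mirroring the example above: only 1.2f is inactive.
    sample = [
        {"pgid": "1.2f",
         "state": "undersized+degraded+remapped+wait_backfill+peered"},
        {"pgid": "1.30", "state": "active+degraded+remapped+backfilling"},
        {"pgid": "1.31", "state": "active+clean"},
    ]
    print(inactive_pgs(sample))
    print(force_commands(sample))
```

An operator could feed the emitted commands back to the cluster to prioritize the PGs that are actually blocking I/O.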
Moving it to 3.2. If these 4 commands are enough to do the job for the feature requested here, we can close this as CURRENTRELEASE.

1. ceph pg force-recovery
2. ceph pg force-backfill
3. ceph pg cancel-force-recovery
4. ceph pg cancel-force-backfill
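The priority scheme these four commands rely on can be illustrated roughly as follows. The numeric values and boosts here are assumptions for illustration, not Ceph's actual constants; the one point taken from the PR is that forced PGs get the maximum priority while ordinary PGs are capped below it (at 254), so a forced PG always wins the reservation:

```python
FORCED_PRIORITY = 255        # assumed maximum, reserved for force-* PGs
DEFAULT_PRIORITY_CAP = 254   # ordinary PGs never exceed this (per the PR)


def recovery_priority(states, forced=False, base=180):
    """Toy priority calculation (base and boost values are made up).

    Inactive PGs get a large boost because they block client I/O;
    degraded PGs get a smaller one. Non-forced priorities are capped
    so a forced PG always outranks them.
    """
    if forced:
        return FORCED_PRIORITY
    prio = base
    if "active" not in states:
        prio += 60   # inactive: blocking client I/O, recover first
    if "degraded" in states:
        prio += 10   # fewer copies than desired
    return min(prio, DEFAULT_PRIORITY_CAP)
```

Under this scheme the inactive PG from the example (undersized+degraded+remapped+wait_backfill+peered) outranks all 33 active+degraded+remapped+backfilling PGs by default, and force-recovery/force-backfill still trumps everything.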
(In reply to Vikhyat Umrao from comment #2)
> Moving it to 3.2. If these 4 commands are fine to do the job for the feature
> which was requested here. we can close this as CURRENTRELEASE.
>
> 1. ceph pg force-recovery
> 2. ceph pg force-backfill
> 3. ceph pg cancel-force-recovery
> 4. ceph pg cancel-force-backfill

I'd say it's still worth improving the default recovery order, without requiring admin intervention. Moving to 3.* since this definitely won't make 3.2.
(In reply to Josh Durgin from comment #3)
> (In reply to Vikhyat Umrao from comment #2)
> > Moving it to 3.2. If these 4 commands are fine to do the job for the feature
> > which was requested here. we can close this as CURRENTRELEASE.
> >
> > 1. ceph pg force-recovery
> > 2. ceph pg force-backfill
> > 3. ceph pg cancel-force-recovery
> > 4. ceph pg cancel-force-backfill
>
> I'd say it's still worth improving the default recovery order, without
> requiring admin intervention. Moving to 3.* since this definitely won't make
> 3.2.

Thank you, Josh. I agree that would be best.
Moving to 4.0 after discussion with Vikhyat.
https://trello.com/c/3vMx1Ikk/418-osd-prioritize-recovery-backfilll-of-inactive-pgs