Description of problem:
[RFE] Increase the backfill/recovery priority of inactive PGs compared to PGs in active+<some other state>.

For example:
=================
    1 active+clean+scrubbing
    1 undersized+degraded+remapped+wait_backfill+peered
   33 active+degraded+remapped+backfilling
   24 active+recovery_wait+degraded+remapped
   11 active+remapped+backfill_toofull
53954 active+clean

In the example above, 33 PGs in active+<some other state> are being backfilled, but one PG (undersized+degraded+remapped+wait_backfill+peered) is inactive and is blocking client I/O precisely because it is inactive. Ceph should prioritize this PG over the other 33 PGs that are in active+<some other state>.

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 2.3

There is a lot of upstream work on this feature in Luminous and its upcoming point releases. This RFE bug tracks that work so it can be backported to Red Hat Ceph Storage 3.y. A backport[1] is available in Jewel, but it is not a complete resolution; more work is needed to fully deliver this feature. I discussed the current fix and the future work with Josh.

The current fix:
- only adjusts the priority of the backfill ops
- is not a perfect fix, but it does improve things
- because OSDs choose independently, one OSD could finish all of its high-priority backfills and start some lower-priority ones before the high-priority backfills on other OSDs have finished; some OSDs could have a bunch of inactive PGs to backfill while others have none, which could produce exactly the state shown in the example above
- when the OSDs with no inactive PGs are primary and start backfill, they have to start low-priority backfills
- could be improved in Luminous, since backfills can now be cancelled: e.g. when a higher-priority backfill needs the reservation, cancel a much lower-priority one to get it. This might increase total recovery time, since some backfills would have to be restarted, but it should help availability.

There is another PR in progress for the master branch:
https://github.com/ceph/ceph/pull/13723

In a large, live cluster it may be desirable to have particular PGs recovered before others. A real example is recovery after a rack failure, where a lot of PGs must be recovered and some of them host data for live VMs with a higher SLA than other VMs; in that case we would like the high-SLA VMs restored to full health and performance as fast as possible.

This PR adds four new commands:

1. ceph pg force-recovery
2. ceph pg force-backfill
3. ceph pg cancel-force-recovery
4. ceph pg cancel-force-backfill

which mark one or more specified PGs as "forced", maximizing their recovery or backfill priority. The PR also alters the default recovery priorities (capping the maximum at 254), so no other PG can get in the way. The user can restore the default priorities at any time with the "cancel-force-*" commands.

[1] https://github.com/ceph/ceph/pull/13232/commits/2f2032814189a4ecbf8dc01b59bebfae8ab3f524

$ git tag --contains 2f2032814189a4ecbf8dc01b59bebfae8ab3f524
v10.2.7
v10.2.8
v10.2.9
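The inactive-PG detection described above can be sketched in a few lines. This is only a hedged illustration, assuming pg-dump-style records with "pgid" and "state" fields (as in `ceph pg dump --format json` output); it is not the actual OSD-side logic:

```python
def inactive_pgs(pg_stats):
    """Return pgids whose state string lacks 'active'.

    Such PGs block client I/O, which is why they should be
    backfilled/recovered first.
    """
    return [pg["pgid"] for pg in pg_stats
            if "active" not in pg["state"].split("+")]


def force_commands(pg_stats):
    """Suggest a `ceph pg force-backfill`/`force-recovery` command
    for each inactive PG (commands per the PR above)."""
    cmds = []
    for pg in pg_stats:
        states = pg["state"].split("+")
        if "active" in states:
            continue  # already serving I/O; leave at default priority
        if "wait_backfill" in states or "backfilling" in states:
            cmds.append("ceph pg force-backfill " + pg["pgid"])
        else:
            cmds.append("ceph pg force-recovery " + pg["pgid"])
    return cmds


if __name__ == "__main__":
    # Sample mirroring the example above: only 1.2f is inactive.
    sample = [
        {"pgid": "1.2f",
         "state": "undersized+degraded+remapped+wait_backfill+peered"},
        {"pgid": "1.30", "state": "active+degraded+remapped+backfilling"},
        {"pgid": "1.31", "state": "active+clean"},
    ]
    print(inactive_pgs(sample))
    print(force_commands(sample))
```

An operator could feed the emitted commands back to the cluster to prioritize the PGs that are actually blocking I/O.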
Moving it to 3.2. If these 4 commands are enough to do the job for the feature requested here, we can close this as CURRENTRELEASE.

1. ceph pg force-recovery
2. ceph pg force-backfill
3. ceph pg cancel-force-recovery
4. ceph pg cancel-force-backfill
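The priority scheme these four commands rely on can be illustrated roughly as follows. The numeric values and boosts here are assumptions for illustration, not Ceph's actual constants; the one point taken from the PR is that forced PGs get the maximum priority while ordinary PGs are capped below it (at 254), so a forced PG always wins the reservation:

```python
FORCED_PRIORITY = 255        # assumed maximum, reserved for force-* PGs
DEFAULT_PRIORITY_CAP = 254   # ordinary PGs never exceed this (per the PR)


def recovery_priority(states, forced=False, base=180):
    """Toy priority calculation (base and boost values are made up).

    Inactive PGs get a large boost because they block client I/O;
    degraded PGs get a smaller one. Non-forced priorities are capped
    so a forced PG always outranks them.
    """
    if forced:
        return FORCED_PRIORITY
    prio = base
    if "active" not in states:
        prio += 60   # inactive: blocking client I/O, recover first
    if "degraded" in states:
        prio += 10   # fewer copies than desired
    return min(prio, DEFAULT_PRIORITY_CAP)
```

Under this scheme the inactive PG from the example (undersized+degraded+remapped+wait_backfill+peered) outranks all 33 active+degraded+remapped+backfilling PGs by default, and force-recovery/force-backfill still trumps everything.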
(In reply to Vikhyat Umrao from comment #2)
> Moving it to 3.2. If these 4 commands are fine to do the job for the feature
> which was requested here. we can close this as CURRENTRELEASE.
>
> 1. ceph pg force-recovery
> 2. ceph pg force-backfill
> 3. ceph pg cancel-force-recovery
> 4. ceph pg cancel-force-backfill

I'd say it's still worth improving the default recovery order, without requiring admin intervention. Moving to 3.* since this definitely won't make 3.2.
(In reply to Josh Durgin from comment #3)
> (In reply to Vikhyat Umrao from comment #2)
> > Moving it to 3.2. If these 4 commands are fine to do the job for the feature
> > which was requested here. we can close this as CURRENTRELEASE.
> >
> > 1. ceph pg force-recovery
> > 2. ceph pg force-backfill
> > 3. ceph pg cancel-force-recovery
> > 4. ceph pg cancel-force-backfill
>
> I'd say it's still worth improving the default recovery order, without
> requiring admin intervention. Moving to 3.* since this definitely won't make
> 3.2.

Thank you, Josh. I agree that would be best.
Moving to 4.0 after discussion with Vikhyat.
https://trello.com/c/3vMx1Ikk/418-osd-prioritize-recovery-backfilll-of-inactive-pgs