Bug 2126559 - [Workload-DFG] mclock backfill is getting higher priority than WPQ
Summary: [Workload-DFG] mclock backfill is getting higher priority than WPQ
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 6.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 6.0
Assignee: Sridhar Seshasayee
QA Contact: skanta
Docs Contact: Eliska
URL:
Whiteboard:
Depends On:
Blocks: 2126050
 
Reported: 2022-09-13 20:57 UTC by Tim Wilkinson
Modified: 2023-03-20 18:58 UTC
CC List: 20 users

Fixed In Version: ceph-17.2.5-26.el9cp
Doc Type: Known Issue
Doc Text:
.The Ceph OSD benchmark test might get skipped

Currently, the Ceph OSD benchmark test at boot-up might sometimes not run even with the `osd_mclock_force_run_benchmark_on_init` parameter set to `true`. As a consequence, the `osd_mclock_max_capacity_iops_[hdd,ssd]` parameter value is not overridden with the default values.

As a workaround, perform the following steps:

. Set `osd_mclock_force_run_benchmark_on_init` to `true`:
+
.Example
----
[ceph: root@host01 /]# ceph config set osd osd_mclock_force_run_benchmark_on_init true
----

. Remove the value on the respective OSD:
+
.Syntax
[source,subs="verbatim,quotes"]
----
ceph config rm OSD._OSD_ID_ osd_mclock_max_capacity_iops_[hdd,ssd]
----
+
.Example
----
[ceph: root@host01 /]# ceph config rm osd.0 osd_mclock_max_capacity_iops_hdd
----

. Restart the OSD.

This results in the `osd_mclock_max_capacity_iops_[ssd,hdd]` parameter being either set with the default value or the new value if it is within the threshold setting.
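
NOTE: As a hypothetical verification step that is not part of the doc text above, the effective value can be checked on the restarted OSD (the OSD id is illustrative):
----
[ceph: root@host01 /]# ceph config show osd.0 osd_mclock_max_capacity_iops_hdd
----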
Clone Of:
Environment:
Last Closed: 2023-03-20 18:58:05 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 57529 0 None None None 2022-09-13 21:21:51 UTC
Red Hat Issue Tracker RHCEPH-5257 0 None None None 2022-09-13 20:58:39 UTC
Red Hat Product Errata RHBA-2023:1360 0 None None None 2023-03-20 18:58:40 UTC

Internal Links: 2163473

Description Tim Wilkinson 2022-09-13 20:57:23 UTC
Description of problem:
----------------------
In testing the performance of the mClock OSD scheduler against that of WPQ, we see that the mClock high_client_ops profile places all of its priority on recovery instead of client I/O.
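
For reference, a minimal sketch of how the scheduler and profile being compared can be inspected or switched via the ceph CLI (the OSD id and the exact commands used during this testing are assumptions, not taken from this report):

# Check which operation queue an OSD is running (wpq or mclock_scheduler); osd.0 is illustrative.
ceph config get osd.0 osd_op_queue

# Select the mClock profile that is expected to favor client I/O over recovery.
ceph config set osd osd_mclock_profile high_client_ops

# Switch to WPQ for the comparison runs; OSDs must be restarted for the queue change to take effect.
ceph config set osd osd_op_queue wpq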



Version-Release number:
----------------------
kernel-4.18.0-348.12.2.el8_5.x86_64

{
    "mon": {
        "ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)": 3
    },
    "osd": {
        "ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)": 192
    },
    "mds": {},
    "rgw": {
        "ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)": 8
    },
    "overall": {
        "ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)": 206
    }
}



OSD failure Test Scenarios:
=========================

 Test cycle phases …

 Generic sized objs (max 64MB)
 ------------------
 - 7hr warp fill populates 6 buckets w/1.2M objs per
 - 1hr hybrid - no failure
 - 1hr hybrid - a single OSD device loss
 - 1hr hybrid - a second OSD node loss (24 OSDs)
 - 1hr hybrid - recovery


 Small sized objs (max 256KB)
 ----------------
 - 7hr warp fill populates 6 buckets w/27M objs per
 - 1hr hybrid - no failure
 - 1hr hybrid - one OSD node loss (24 OSDs)
 - 1hr hybrid - a second OSD node loss (24 OSDs)
 - 1hr hybrid - recovery


In comparing the mClock scheduler (default profile) with the WPQ scheduler, we examine the ceph.log files and note the PG states after each of the phases described above.
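
As an illustration of that step, the pgmap summaries quoted below can be pulled out of the cluster log with something like the following (the ceph.log path and filters are assumptions; cephadm clusters keep the log under /var/log/ceph/<fsid>/):

# Keep only the pgmap samples that report backfilling PGs.
grep 'pgmap v' /var/log/ceph/ceph.log | grep backfilling

# Count how many samples show PGs waiting behind backfill.
grep 'pgmap v' /var/log/ceph/ceph.log | grep -c backfill_wait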


Additional info:
----------------

===============================
Small Object Sizing (max 256KB)
===============================

WPQ - Phase 2 - 3200 PGs

2022-09-13T03:32:51.223941+0000 mgr.f28-h30-000-r630.rmkcbz (mgr.24206) 5307 : cluster [DBG] pgmap v5163: 4769 pgs: 3228 active+undersized+degraded, 103 active+undersized, 1438 active+clean; 7.7 TiB data, 17 TiB used, 293 TiB / 311 TiB avail; 288 MiB/s rd, 130 MiB/s wr, 90.85k op/s; 136403499/1089067356 objects degraded (12.525%)


WPQ - Phase 3 - 4000 PGs

2022-09-13T08:20:44.239410+0000 mgr.f28-h30-000-r630.rmkcbz (mgr.24206) 14053 : cluster [DBG] pgmap v14552: 4769 pgs: 1 active+undersized+degraded+remapped, 175 active+remapped+backfill_wait, 3912 active+undersized+degraded+remapped+backfill_wait, 5 active+undersized+degraded+remapped+backfilling, 676 active+clean; 8.7 TiB data, 22 TiB used, 333 TiB / 355 TiB avail; 283 MiB/s rd, 130 MiB/s wr, 90.47k op/s; 285294745/1228092120 objects degraded (23.231%); 22852039/1228092120 objects misplaced (1.861%); 4.7 MiB/s, 109 objects/s recovering


WPQ - Phase 4 - 4100 PGs

2022-09-13T08:20:42.187633+0000 mgr.f28-h30-000-r630.rmkcbz (mgr.24206) 14052 : cluster [DBG] pgmap v14551: 4769 pgs: 1 active+undersized+degraded+remapped, 1 active+recovering+undersized+remapped, 175 active+remapped+backfill_wait, 3912 active+undersized+degraded+remapped+backfill_wait, 5 active+undersized+degraded+remapped+backfilling, 675 active+clean; 8.7 TiB data, 22 TiB used, 333 TiB / 355 TiB avail; 56 MiB/s rd, 24 MiB/s wr, 17.44k op/s; 285282762/1228013376 objects degraded (23.231%); 22850638/1228013376 objects misplaced (1.861%); 2.0 MiB/s, 53 objects/s recovering


mClock - Phase 2 - 3145 PGs - all backfilling

2022-09-08T12:04:46.780444+0000 mgr.f28-h29-000-r630.laexho (mgr.14222) 19881 : cluster [DBG] pgmap v19111: 4769 pgs: 1622 active+clean, 1 activating, 3145 active+undersized+degraded+remapped+backfilling, 1 peering; 7.8 TiB data, 18 TiB used, 293 TiB / 311 TiB avail; 1.8 MiB/s rd, 750 KiB/s wr, 583 op/s; 136335864/1096688742 objects degraded (12.432%); 26491702/1096688742 objects misplaced (2.416%); 398 MiB/s, 2.70M keys/s, 8.95k objects/s recovering


mClock - Phase 3 - 3889 PGs backfilling

2022-09-08T15:06:54.615447+0000 mgr.f28-h29-000-r630.laexho (mgr.14222) 25391 : cluster [DBG] pgmap v25897: 4769 pgs: 1 peering, 1 active+recovering+undersized+degraded+remapped, 5 active+remapped+backfilling, 3889 active+undersized+degraded+remapped+backfilling, 2 active+undersized+remapped, 871 active+clean; 7.8 TiB data, 16 TiB used, 250 TiB / 266 TiB avail; 113 KiB/s rd, 1.0 MiB/s wr, 483 op/s; 203162167/1097925093 objects degraded (18.504%); 43382450/1097925093 objects misplaced (3.951%); 347 MiB/s, 662.25k keys/s, 7.76k objects/s recovering


mClock - Client - Phase 4
----------------
2022-09-08T19:11:39.687786+0000 mgr.f28-h29-000-r630.laexho (mgr.14222) 32831 : cluster [DBG] pgmap v34964: 4769 pgs: 3 active+recovering+undersized+remapped, 159 active+remapped+backfilling, 10 active+recovering+undersized+degraded+remapped, 4597 active+clean; 7.8 TiB data, 20 TiB used, 335 TiB / 355 TiB avail; 5.4 MiB/s rd, 54 MiB/s wr, 22.56k op/s; 861/1099784454 objects degraded (0.000%); 6182853/1099784454 objects misplaced (0.562%); 517 MiB/s, 11.54k objects/s recovering


================================
Generic Object Sizing (max 64MB)
================================

WPQ - Phase 2 

2022-09-11T23:00:21.020614+0000 mgr.f28-h22-000-r630.rblqtu (mgr.14266) 18883 : cluster [DBG] pgmap v18023: 4769 pgs: 80 active+undersized+degraded+remapped+backfill_wait, 23 active+undersized+degraded+remapped+backfilling, 4666 active+clean; 89 TiB data, 139 TiB used, 215 TiB / 353 TiB avail; 3.4 GiB/s rd, 2.0 GiB/s wr, 6.77k op/s; 708584/169087635 objects degraded (0.419%); 891 MiB/s, 8.71k keys/s, 265 objects/s recovering


WPQ - Phase 3

2022-09-12T01:35:36.467003+0000 mgr.f28-h22-000-r630.rblqtu (mgr.14266) 23609 : cluster [DBG] pgmap v22845: 4769 pgs: 13 active+undersized+remapped+backfill_wait, 91 active+undersized+degraded+remapped+backfilling, 1644 active+clean, 3021 active+undersized+degraded+remapped+backfill_wait; 100 TiB data, 136 TiB used, 173 TiB / 309 TiB avail; 1.6 GiB/s rd, 1.1 GiB/s wr, 3.57k op/s; 23545953/189389985 objects degraded (12.433%); 4573600/189389985 objects misplaced (2.415%); 2.7 GiB/s, 47.11k keys/s, 843 objects/s recovering


WPQ - Phase 4

2022-09-12T04:07:26.205572+0000 mgr.f28-h22-000-r630.rblqtu (mgr.14266) 28234 : cluster [DBG] pgmap v28942: 4769 pgs: 2 active+recovery_wait+remapped, 2 active+recovery_wait+degraded, 18 active+undersized+degraded+remapped+backfill_wait, 1600 active+clean, 3105 active+recovery_wait+undersized+degraded+remapped, 2 active+recovering+undersized+degraded+remapped, 40 active+remapped+backfill_wait; 107 TiB data, 172 TiB used, 184 TiB / 355 TiB avail; 1.5 GiB/s rd, 1.1 GiB/s wr, 3.55k op/s; 2652980/203673714 objects degraded (1.303%); 1245837/203673714 objects misplaced (0.612%); 46 MiB/s, 19 objects/s recovering


mClock - Phase 2 

2022-09-11T00:07:08.721539+0000 mgr.f28-h22-000-r630.qeqxij (mgr.14224) 18824 : cluster [DBG] pgmap v18093: 4769 pgs: 128 active+undersized+degraded+remapped+backfilling, 4641 active+clean; 88 TiB data, 137 TiB used, 217 TiB / 353 TiB avail; 1.4 GiB/s rd, 1.0 GiB/s wr, 3.36k op/s; 857018/166638669 objects degraded (0.514%); 3.3 GiB/s, 28.59k keys/s, 1.02k objects/s recovering


mClock - Phase 3

2022-09-11T02:44:03.777805+0000 mgr.f28-h22-000-r630.qeqxij (mgr.14224) 23599 : cluster [DBG] pgmap v23029: 4769 pgs: 1 peering, 3059 active+undersized+degraded+remapped+backfilling, 1709 active+clean; 98 TiB data, 133 TiB used, 175 TiB / 309 TiB avail; 111 MiB/s rd, 94 MiB/s wr, 264 op/s; 23041512/185769276 objects degraded (12.403%); 4479848/185769276 objects misplaced (2.412%); 11 GiB/s, 50.28k keys/s, 3.34k objects/s recovering


mClock - Recovery - Phase 4

2022-09-11T04:31:24.801597+0000 mgr.f28-h22-000-r630.qeqxij (mgr.14224) 26848 : cluster [DBG] pgmap v27826: 4769 pgs: 126 active+recovering+degraded, 8 active+recovering, 4239 active+clean, 102 active+undersized+degraded+remapped+backfilling, 168 active+remapped+backfilling, 23 active+recovering+undersized+degraded+remapped, 103 active+recovering+undersized+remapped; 98 TiB data, 154 TiB used, 199 TiB / 353 TiB avail; 215 KiB/s rd, 444 MiB/s wr, 979 op/s; 727468/185859453 objects degraded (0.391%); 1173884/185859453 objects misplaced (0.632%); 13 GiB/s, 4.12k objects/s recovering


Comment 60 errata-xmlrpc 2023-03-20 18:58:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:1360

