Bug 2294594 - [Workload-DFG][mClock] Inconsistent client throughput during recovery with mClock balanced profile
Summary: [Workload-DFG][mClock] Inconsistent client throughput during recovery with mClock balanced profile
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 7.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 8.0z2
Assignee: Sridhar Seshasayee
QA Contact: Harsh Kumar
URL:
Whiteboard:
Depends On:
Blocks: 2299480 2299482
 
Reported: 2024-06-27 15:01 UTC by Harsh Kumar
Modified: 2025-03-06 14:22 UTC
CC List: 13 users

Fixed In Version: ceph-19.2.0-60.el9cp
Doc Type: Bug Fix
Doc Text:
.New shard and multiple worker threads configuration now yields significant results in terms of consistency of client and recovery throughput
Previously, scheduling with mClock was not optimal with multiple OSD shards on an HDD-based Ceph cluster. Hence, client throughput was found to be inconsistent across test runs, coupled with multiple reported slow requests during recovery and backfill operations.
With this fix, the HDD OSD shard configuration is updated as follows:
- osd_op_num_shards_hdd = 1 (was 5)
- osd_op_num_threads_per_shard_hdd = 5 (was 1)
Now, the new shard and multiple worker threads configuration yields significant results in terms of consistency of client and recovery throughput across multiple test runs.
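As a reference only, a minimal sketch (not part of the fix itself; assumes the standard ceph CLI on a cephadm-managed cluster) of how the updated defaults can be verified or applied on an existing cluster:

    # Verify the effective values on a running OSD (option names as listed above)
    ceph config show osd.0 osd_op_num_shards_hdd
    ceph config show osd.0 osd_op_num_threads_per_shard_hdd

    # Apply the new values on a cluster that still carries the old defaults
    # (assumption: these options are read at OSD startup, so the OSDs must be restarted)
    ceph config set osd osd_op_num_shards_hdd 1
    ceph config set osd osd_op_num_threads_per_shard_hdd 5
    ceph orch restart osd    # hypothetical restart step for a cephadm-managed cluster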
Clone Of:
: 2299480 2299482 (view as bug list)
Environment:
Last Closed: 2025-03-06 14:22:04 UTC
Embargoed:
sseshasa: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 66289 0 None None None 2024-09-05 11:36:07 UTC
Github ceph ceph pull 58509 0 None open common/options: Change HDD OSD shard configuration defaults for mClock 2024-09-05 11:36:07 UTC
Github ceph ceph pull 59973 0 None open squid: common/options: Change HDD OSD shard configuration defaults for mClock 2024-11-13 23:06:09 UTC
Red Hat Issue Tracker RHCEPH-9243 0 None None None 2024-06-27 15:04:32 UTC
Red Hat Knowledge Base (Solution) 7092973 0 None None None 2024-10-25 16:52:35 UTC
Red Hat Product Errata RHBA-2025:2457 0 None None None 2025-03-06 14:22:16 UTC

Description Harsh Kumar 2024-06-27 15:01:04 UTC
Description of problem:
>>> Upstream tracker: https://tracker.ceph.com/issues/66289

	It has been observed during regular Release Criteria testing of 7.1 (and elsewhere) that when the cluster goes
	through a recovery phase, the average client throughput captured during this time is inconsistent over several runs.

	The test workflow where this was found is called OSDFailure testing and consists of the following rounds:
	    a. Warming up the OSDs with pure writes (a.k.a Fill workload)
	    b. Measuring the performance of the cluster over a period of 1 hour using hybrid workload
	    c. Injecting failure by bringing down an OSD host with continuous hybrid IOs
	    d. Injecting failure by bringing down a second OSD host with continuous hybrid IOs
	    e. The down OSD hosts are brought back up and the performance of the cluster is monitored during the recovery phase with continuous hybrid IOs

	While the performance of the cluster with respect to client IO has been consistent during phases a, b, and c, the
	same cannot be said when the cluster goes through phases d and e.

	In contrast, the performance has remained stable and consistent with the WPQ osd_op_queue throughout the testing.
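	For context, a minimal sketch of how the active scheduler and profile can be checked or switched; these are standard ceph CLI commands and are not taken from the test harness:

	    # Check which operation queue the OSDs use (mclock_scheduler vs wpq)
	    ceph config get osd osd_op_queue

	    # Check the active mClock profile (balanced / high_client_ops / high_recovery_ops)
	    ceph config get osd osd_mclock_profile

	    # Switch to WPQ for a comparison run
	    # (assumption: osd_op_queue is read at OSD startup, so the OSDs must be restarted)
	    ceph config set osd osd_op_queue wpq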

	Test phase descriptions specific to our use case:

	  Phase 1:
	    Consists of warming up the cluster followed by 1 measurement round
	    - The 192 OSDs are distributed across 8 nodes with 24 OSDs per node.
	    - 300 RGW buckets are created and are each filled with 750K objects
	    - Objects are in the range of small object sizes [1KiB, 4KiB, 16KiB, 64KiB, 256KiB]
	    - 5 clients together fill the RGW pool with around 225 million objects
	    - The client workload is initiated using the warp tool (a rough invocation sketch follows the phase descriptions).

	  Phase 2: OSD node 1 failure
	    Consists of injecting failure by bringing down an OSD host with continuous hybrid IOs
	    - In this phase one OSD node(24 OSDs) is brought down and the cluster is
	    monitored along with the collection of client and recovery metrics.

	  Phase 3: OSD node 2 failure
	    Consists of injecting failure by bringing down a second OSD host with continuous hybrid IOs
	    - After around a couple of hours, another OSD node (24 OSDs) is brought
	    down and the same metrics are collected as above.

	  Phase 4: Bring up both OSD nodes and monitor cluster.
	  Down OSD hosts are brought back up to gauge recovery performance
	  - Both the failed OSD nodes are brought up and the cluster recovery/backfill
	  is monitored along with the client metrics. The test doesn't wait for
	  the cluster to complete the recovery/backfill process fully due to the
	  long recovery times.
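
	  For illustration only, a rough sketch of the kind of warp invocations behind the fill and hybrid rounds above; the endpoint, credentials, bucket name, and exact flag spellings are assumptions and should be checked against the warp version in use:

	    # Fill round: pure small-object writes against the RGW endpoint (placeholder values)
	    warp put --host=rgw.example.com:8080 --access-key=ACCESS_KEY --secret-key=SECRET_KEY \
	        --bucket=warp-bucket-001 --obj.size=64KiB --concurrent=64 --duration=1h

	    # Hybrid measurement round: mixed GET/PUT/DELETE/STAT traffic
	    warp mixed --host=rgw.example.com:8080 --access-key=ACCESS_KEY --secret-key=SECRET_KEY \
	        --bucket=warp-bucket-001 --obj.size=64KiB --concurrent=64 --duration=1h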

Version-Release number of selected component (if applicable):
18.2.1-188.el9cp

How reproducible:
5/5

Steps to Reproduce:
1. Warm up the cluster by filling it to 10% capacity
2. Bring one OSD node down with continuous background IOs for 2 hours and observe the cluster behaviour during this time (see the sketch after these steps)
3. Bring another OSD node down with continuous background IOs for 2 hours and observe the cluster behaviour during this time
4. Bring all down OSD nodes back up with continuous background IOs for two hours and let the cluster recover
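
A minimal sketch of the failure injection and monitoring referred to in steps 2-4 (assumption: a cephadm deployment where all OSD daemons on a host hang off the cluster's systemd target; <fsid> is a placeholder):

    # On the OSD host being failed: stop all Ceph daemons on this host (its 24 OSDs)
    systemctl stop ceph-<fsid>.target

    # From an admin node, while the hybrid workload keeps running:
    ceph -s                              # overall cluster and recovery state
    ceph osd pool stats                  # per-pool client and recovery I/O rates
    ceph health detail | grep -i slow    # reported slow requests, if any

    # Bring the host back up to start the recovery/backfill phase
    systemctl start ceph-<fsid>.target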

Actual results:
Client Throughput observed during OSD down scenarios is inconsistent across multiple runs


Expected results:
Client throughput with background recovery, whether low or high, should be consistent across multiple runs

Additional info:

Results from different runs
=============================================================================================================================================================================
Result doc: https://docs.google.com/spreadsheets/d/1mdyRqcaQAtY4McMV3TLplXhYd3QU8baZUlYMQe98NSs/edit?gid=1735478984#gid=1735478984

  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |                                                                        First run | small sized object | mClock                                                                  |
  |                                                                             (RHCS 7.1 - 18.2.1-188.el9cp)                                                                       |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | Job |      Workload     | Total Read Throughput | Total Write Throughput | Avg Latency | Avg RGW | Avg RGW | Avg OSD | Avg OSD |         Avg Recovery           | Recovery Time |
  | ID  |                   |   MB/s   |   Objs/s   |   MB/s   |    Objs/s   |    (ms)     |   %CPU  |   %Mem  |   %CPU  |   %Mem  | with IO (MB/s) | w/o IO (MB/s) |     hh:mm     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | w1  | fill 225M objs    |     -    |      -     |   1128   |    17529    |      86     |         |         |         |         |        --      |       --      |               |
  | w2  | hybrid noFailure  |    485   |    7384    |    376   |     5745    |      66     |         |         |         |         |        --      |       --      |     2313      |
  | w2  | hybrid OSDnode1   |    234   |    3539    |    182   |     2754    |     133     |         |         |         |         |       138      |       --      |  PGs Unclean  |
  | w2  | hybrid OSDnode2   |    329   |    5030    |    257   |     3914    |      94     |         |         |         |         |        52      |       --      |     after     |
  | w2  | hybrid OSDrecover |    397   |    6040    |    308   |     4699    |      81     |         |         |         |         |         9      |       22      |   24 hours    |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Timeline for phase 3 from job w2
	2024/05/25-03:16:43: Starting warp hybrid phase 3 
	2024/05/25-03:20:25: Stop all OSDs on f27-h13-000-6048r
	2024/05/25-05:35:06: Completed warp hybrid phase 3 

	OSD perf dump - http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w2_240524-220748_runtest-71_small/osd-perfdumps/
	OSD logs: root@f28-h28-000-r630:~/rc-2024/OSDfailure-warp/RESULTS/w2_240524-220748_runtest-71_small/osd-logs
	or
	http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w2_240524-220748_runtest-71_small/osd-logs/


  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |                                                                          Second run | small sized object | mClock                                                               |
  |                                                                                (RHCS 7.1 - 18.2.1-188.el9cp)                                                                    |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | Job |      Workload     | Total Read Throughput | Total Write Throughput | Avg Latency | Avg RGW | Avg RGW | Avg OSD | Avg OSD |         Avg Recovery           | Recovery Time |
  | ID  |                   |   MB/s   |   Objs/s   |   MB/s   |    Objs/s   |    (ms)     |   %CPU  |   %Mem  |   %CPU  |   %Mem  | with IO (MB/s) | w/o IO (MB/s) |     hh:mm     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | w3  | fill 225M objs    |     -    |      -     |   1152   |    17855    |      84     |         |         |         |         |        --      |       --      |               |
  | w4  | hybrid noFailure  |    491   |    7493    |    382   |     5829    |      65     |         |         |         |         |        --      |       --      |     2180      |
  | w4  | hybrid OSDnode1   |    234   |    3579    |    183   |     2785    |     131     |         |         |         |         |       120      |       --      |  PGs Unclean  |
  | w4  | hybrid OSDnode2   |    241   |    3658    |    188   |     2846    |     126     |         |         |         |         |        70      |       --      |      after    |
  | w4  | hybrid OSDrecover |    400   |    6088    |    310   |     4736    |      80     |         |         |         |         |        12      |       23      |    24 hours   |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Timeline for phase 3 from job w4
	2024/05/27-07:22:56: Starting warp hybrid phase 3 
	2024/05/27-07:26:37: Stop all OSDs on f27-h13-000-6048r
	2024/05/27-09:36:56: Completed warp hybrid phase 3

	OSD perf dump - http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w4/osd-perfdumps/
	OSD logs: Not available
  
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |                                                                          Third run | small sized object | mClock                                                                |
  |                                                                               (RHCS 7.1 - 18.2.1-188.el9cp)                                                                     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | Job |      Workload     | Total Read Throughput | Total Write Throughput | Avg Latency | Avg RGW | Avg RGW | Avg OSD | Avg OSD |         Avg Recovery           | Recovery Time |
  | ID  |                   |   MB/s   |   Objs/s   |   MB/s   |    Objs/s   |    (ms)     |   %CPU  |   %Mem  |   %CPU  |   %Mem  | with IO (MB/s) | w/o IO (MB/s) |     hh:mm     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | w11 | fill 225M objs    |     -    |      -     |   1305   |    20224    |      74     |   721   |   0.21  |    91   |   2.3   |        --      |       --      |               |
  | w12 | hybrid noFailure  |    492   |    7493    |    382   |     5829    |      65     |   524   |   0.2   |    69   |   2.8   |        --      |       --      |      2363     |
  | w12 | hybrid OSDnode1   |    228   |    3449    |    178   |     2684    |     137     |   434   |   0.21  |    57   |   2.7   |       147      |       --      |  PGs Unclean  |
  | w12 | hybrid OSDnode2   |    331   |    5035    |    258   |     3917    |      93     |   375   |   0.2   |    51   |   2.7   |        52      |       --      |      after    |
  | w12 | hybrid OSDrecover |    414   |    6282    |    320   |     4888    |      79     |   354   |   0.2   |    43   |   2.2   |         6      |       28      |    24 hours   |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Timeline for phase 3 from job w12
	2024/06/10-10:02:42: Starting warp hybrid phase 3 
	2024/06/10-10:06:23: Stop all OSDs on f27-h13-000-6048r
	2024/06/10-12:21:08: Completed warp hybrid phase 3

	OSD perf dump - http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w12_240610-044907_runtest-71_runtest_small_repeat2/osd-perfdumps/
	OSD logs: http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w12_240610-044907_runtest-71_runtest_small_repeat2/osd-logs/



  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |                                                                        Fourth run | small sized object | mClock                                                                 |
  |                                                                            (RHCS 7.1 - 18.2.1-188.el9cp)                                                                        |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | Job |      Workload     | Total Read Throughput | Total Write Throughput | Avg Latency | Avg RGW | Avg RGW | Avg OSD | Avg OSD |         Avg Recovery           | Recovery Time |
  | ID  |                   |   MB/s   |   Objs/s   |   MB/s   |    Objs/s   |    (ms)     |   %CPU  |   %Mem  |   %CPU  |   %Mem  | with IO (MB/s) | w/o IO (MB/s) |     hh:mm     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | w13 | fill 225M objs    |     -    |      -     |   1306   |    20259    |      74     |   537   |   0.2   |    69   |   2.3   |        --      |       --      |               |
  | w14 | hybrid noFailure  |    491   |    7485    |    381   |     5823    |      66     |   291   |   0.2   |    38   |   2.8   |        --      |       --      |      2470     |
  | w14 | hybrid OSDnode1   |    217   |    3309    |    170   |     2575    |     145     |   277   |   0.2   |    36   |   2.7   |       140      |       --      |  PGs Unclean  |
  | w14 | hybrid OSDnode2   |    341   |    5193    |    266   |     4040    |      91     |   260   |   0.2   |    35   |   2.7   |        59      |       --      |      after    |
  | w14 | hybrid OSDrecover |    411   |    6237    |    318   |     4852    |      79     |   260   |   0.2   |    33   |   2.2   |         7      |       25      |    24 hours   |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Timeline for phase 3 from job w14
	2024/06/12-04:17:03: Starting warp hybrid phase 3 
	2024/06/12-04:20:44: Stop all OSDs on f27-h13-000-6048r
	2024/06/12-06:36:01: Completed warp hybrid phase 3

	OSD perf dump - http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w14_240611-225522_runtest-71_runtest_small_repeat3/osd-perfdumps/
	OSD logs: http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w14_240611-225522_runtest-71_runtest_small_repeat3/osd-logs/



  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |                                                                            First run | small sized object | WPQ                                                                 |
  |                                                                                (RHCS 7.1 - 18.2.1-188.el9cp)                                                                    |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | Job |      Workload     | Total Read Throughput | Total Write Throughput | Avg Latency | Avg RGW | Avg RGW | Avg OSD | Avg OSD |         Avg Recovery           | Recovery Time |
  | ID  |                   |   MB/s   |   Objs/s   |   MB/s   |    Objs/s   |    (ms)     |   %CPU  |   %Mem  |   %CPU  |   %Mem  | with IO (MB/s) | w/o IO (MB/s) |     hh:mm     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | w9  | fill 225M objs    |     -    |      -     |    1266  |    19365    |      77     |   482   |   0.2   |    64   |   2.3   |        --      |       --      |               |
  | w10 | hybrid noFailure  |    488   |    7433    |     379  |     5783    |      65     |   206   |   0.2   |    29   |   2.8   |        --      |       --      |      3835     |
  | w10 | hybrid OSDnode1   |    399   |    6075    |     309  |     4726    |      80     |   213   |   0.2   |    29   |   2.7   |        76      |       --      |  PGs Unclean  |
  | w10 | hybrid OSDnode2   |    386   |    5860    |     299  |     4559    |      82     |   219   |   0.2   |    30   |   2.7   |        42      |       --      |      after    |
  | w10 | hybrid OSDrecover |    410   |    6233    |     317  |     4849    |      81     |   224   |   0.2   |    29   |   2.2   |         5      |        6      |    24 hours   |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Timeline for phase 3 from job w10
	2024/06/08-15:05:03: Starting warp hybrid phase 3 
	2024/06/08-15:08:45: Stop all OSDs on f27-h13-000-6048r
	2024/06/08-17:26:07: Completed warp hybrid phase 3

	OSD perf dump - http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w10_240608-093336_runtest-71_runtest_small_wpq/osd-perfdumps/
	OSD logs: http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w10_240608-093336_runtest-71_runtest_small_wpq/osd-logs/

Comment 27 errata-xmlrpc 2025-03-06 14:22:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.0 security, bug fixes, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:2457

