Bug 2294594

Summary: [Workload-DFG][mClock] Inconsistent client throughput during recovery with mClock balanced profile
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Harsh Kumar <hakumar>
Component: RADOS Assignee: Sridhar Seshasayee <sseshasa>
Status: CLOSED ERRATA QA Contact: Harsh Kumar <hakumar>
Severity: high Docs Contact:
Priority: unspecified    
Version: 7.1 CC: bhkaur, bhubbard, bkunal, ceph-eng-bugs, cephqe-warriors, kbader, mcaldeir, ngangadh, nojha, rzarzyns, sseshasa, tserlin, vumrao
Target Milestone: --- Flags: sseshasa: needinfo-
Target Release: 8.0z2   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-19.2.0-60.el9cp Doc Type: Bug Fix
Doc Text:
.New shard and multiple worker threads configuration yields consistent client and recovery throughput
Previously, scheduling with mClock was not optimal with multiple OSD shards on an HDD-based Ceph cluster. As a result, client throughput was inconsistent across test runs, and multiple slow requests were reported during recovery and backfill operations. With this fix, the HDD OSD shard configuration is updated as follows:
- osd_op_num_shards_hdd = 1 (was 5)
- osd_op_num_threads_per_shard_hdd = 5 (was 1)
Now, the new shard and multiple worker threads configuration yields consistent client and recovery throughput across multiple test runs.
Story Points: ---
Clone Of:
: 2299480 2299482 (view as bug list)
Environment:
Last Closed: 2025-03-06 14:22:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2299480, 2299482    
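
A minimal sketch of how the updated HDD shard settings described in the Doc Text above can be checked or applied with the ceph CLI (the values shown are the post-fix defaults; setting them manually is only relevant on builds without the fix, and shard changes typically require an OSD restart to take effect):

    # Check the current HDD shard configuration
    ceph config get osd osd_op_num_shards_hdd
    ceph config get osd osd_op_num_threads_per_shard_hdd

    # Apply the post-fix values explicitly, if needed
    ceph config set osd osd_op_num_shards_hdd 1
    ceph config set osd osd_op_num_threads_per_shard_hdd 5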

Description Harsh Kumar 2024-06-27 15:01:04 UTC
Description of problem:
>>> Upstream tracker: https://tracker.ceph.com/issues/66289

	It has been observed during the regular Release Criteria testing of 7.1, and otherwise, that when the cluster goes
	 through a recovery phase, the average client throughput captured during this time is inconsistent over several runs.

	The test workflow where this was found is called OSDFailure testing and consists of the following rounds:
	    a. Warming up the OSDs with pure writes (a.k.a Fill workload)
	    b. Measuring the performance of the cluster over a period of 1 hour using hybrid workload
	    c. Injecting failure by bringing down an OSD host with continuous hybrid IOs
	    d. Injecting failure by bringing down a second OSD host with continuous hybrid IOs
	    e. The down OSD hosts are brought back up and the performance of the cluster is monitored during the recovery phase with continuous hybrid IOs

	While the performance of the cluster with respect to client IO has been consistent during phases a, b and c, the
	same cannot be said when the cluster goes through phases d and e.

	In contrast, the performance has remained stable and consistent throughout the testing with the WPQ osd_op_queue.
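
	A minimal sketch of the two scheduler configurations being compared, using standard ceph CLI calls (the exact
	commands used for these runs are not recorded here and are an assumption):

	    # mClock runs use the default scheduler with the balanced profile
	    ceph config get osd osd_op_queue          # expected: mclock_scheduler
	    ceph config get osd osd_mclock_profile    # expected: balanced

	    # WPQ comparison runs switch the scheduler; OSDs must be restarted for the change to take effect
	    ceph config set osd osd_op_queue wpq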

	Test phase descriptions specific to our use case:

	  Phase 1:
	    Consists of warming up the cluster followed by 1 measurement round
	    - The 192 OSDs are distributed across 8 nodes with 24 OSDs per node.
	    - 300 RGW buckets are created and are each filled with 750K objects
	    - Objects are in the range of small object sizes [1KiB, 4KiB, 16KiB, 64KiB, 256KiB]
	    - 5 clients together fill the RGW pool with around 225 million objects
	    - The client workload is initiated using the warp tool.
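
	    A rough example of what the fill workload could look like with warp (a sketch only; the endpoint, credentials,
	    bucket name, object size and concurrency below are placeholders/assumptions, not values taken from these runs):

	        warp put \
	          --host=rgw-endpoint:8080 \
	          --access-key=ACCESS --secret-key=SECRET \
	          --bucket=bench-bucket-001 \
	          --obj.size=4KiB \
	          --concurrent=64 \
	          --duration=1h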

	  Phase 2: OSD node 1 failure
	    Consists of injecting failure by bringing down an OSD host with continuous hybrid IOs
	    - In this phase, one OSD node (24 OSDs) is brought down and the cluster is
	    monitored along with the collection of client and recovery metrics.

	  Phase 3: OSD node 2 failure
	    Consists of injecting failure by bringing down a second OSD host with continuous hybrid IOs
	    - After around a couple of hours, another OSD node (24 OSDs) is brought
	    down and the same metrics are collected as above.

	  Phase 4: Bring up both OSD nodes and monitor cluster.
	  The down OSD hosts are brought back up to gauge recovery performance
	  - Both the failed OSD nodes are brought up and the cluster recovery/backfill
	  is monitored along with the client metrics. The test doesn't wait for
	  the cluster to complete the recovery/backfill process fully due to the
	  long recovery times.

Version-Release number of selected component (if applicable):
18.2.1-188.el9cp

How reproducible:
5/5

Steps to Reproduce:
1. Warm up the cluster by filling it to 10% of its capacity
2. Bring one OSD node down with continuous background IOs running for 2 hours, and observe the cluster behaviour during this time
3. Bring another OSD node down with continuous background IOs running for 2 hours, and observe the cluster behaviour during this time
4. Bring all of the down OSD nodes back up with continuous background IOs running for two hours, and let the cluster recover
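
A sketch of one way the failure injection and recovery in steps 2-4 could be driven on a cephadm-managed cluster (an assumption; the exact mechanism used for these runs is not recorded here):

    # On the OSD host being failed: stop every Ceph daemon on that host
    systemctl stop ceph.target

    # After the failure window: bring the host's daemons back up and let recovery/backfill proceed
    systemctl start ceph.target

    # From an admin node: watch recovery progress and client impact
    ceph -s
    ceph osd pool stats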

Actual results:
Client throughput observed during the OSD down scenarios is inconsistent across multiple runs


Expected results:
Client throughput with background recovery, whether low or high, should be consistent across multiple runs

Additional info:

Results from different runs
=============================================================================================================================================================================
Result doc: https://docs.google.com/spreadsheets/d/1mdyRqcaQAtY4McMV3TLplXhYd3QU8baZUlYMQe98NSs/edit?gid=1735478984#gid=1735478984

  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |                                                                        First run | small sized object | mClock                                                                  |
  |                                                                             (RHCS 7.1 - 18.2.1-188.el9cp)                                                                       |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | Job |      Workload     | Total Read Throughput | Total Write Throughput | Avg Latency | Avg RGW | Avg RGW | Avg OSD | Avg OSD |         Avg Recovery           | Recovery Time |
  | ID  |                   |   MB/s   |   Objs/s   |   MB/s   |    Objs/s   |    (ms)     |   %CPU  |   %Mem  |   %CPU  |   %Mem  | with IO (MB/s) | w/o IO (MB/s) |     hh:mm     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | w1  | fill 225M objs    |     -    |      -     |   1128   |    17529    |      86     |         |         |         |         |        --      |       --      |               |
  | w2  | hybrid noFailure  |    485   |    7384    |    376   |     5745    |      66     |         |         |         |         |        --      |       --      |     2313      |
  | w2  | hybrid OSDnode1   |    234   |    3539    |    182   |     2754    |     133     |         |         |         |         |       138      |       --      |  PGs Unclean  |
  | w2  | hybrid OSDnode2   |    329   |    5030    |    257   |     3914    |      94     |         |         |         |         |        52      |       --      |     after     |
  | w2  | hybrid OSDrecover |    397   |    6040    |    308   |     4699    |      81     |         |         |         |         |         9      |       22      |   24 hours    |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Timeline for phase 3 from job w2
	2024/05/25-03:16:43: Starting warp hybrid phase 3 
	2024/05/25-03:20:25: Stop all OSDs on f27-h13-000-6048r
	2024/05/25-05:35:06: Completed warp hybrid phase 3 

	OSD perf dump - http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w2_240524-220748_runtest-71_small/osd-perfdumps/
	OSD logs: root@f28-h28-000-r630:~/rc-2024/OSDfailure-warp/RESULTS/w2_240524-220748_runtest-71_small/osd-logs
	or
	http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w2_240524-220748_runtest-71_small/osd-logs/


  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |                                                                          Second run | small sized object | mClock                                                               |
  |                                                                                (RHCS 7.1 - 18.2.1-188.el9cp)                                                                    |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | Job |      Workload     | Total Read Throughput | Total Write Throughput | Avg Latency | Avg RGW | Avg RGW | Avg OSD | Avg OSD |         Avg Recovery           | Recovery Time |
  | ID  |                   |   MB/s   |   Objs/s   |   MB/s   |    Objs/s   |    (ms)     |   %CPU  |   %Mem  |   %CPU  |   %Mem  | with IO (MB/s) | w/o IO (MB/s) |     hh:mm     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | w3  | fill 225M objs    |     -    |      -     |   1152   |    17855    |      84     |         |         |         |         |        --      |       --      |               |
  | w4  | hybrid noFailure  |    491   |    7493    |    382   |     5829    |      65     |         |         |         |         |        --      |       --      |     2180      |
  | w4  | hybrid OSDnode1   |    234   |    3579    |    183   |     2785    |     131     |         |         |         |         |       120      |       --      |  PGs Unclean  |
  | w4  | hybrid OSDnode2   |    241   |    3658    |    188   |     2846    |     126     |         |         |         |         |        70      |       --      |      after    |
  | w4  | hybrid OSDrecover |    400   |    6088    |    310   |     4736    |      80     |         |         |         |         |        12      |       23      |    24 hours   |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Timeline for phase 3 from job w4
	2024/05/27-07:22:56: Starting warp hybrid phase 3 
	2024/05/27-07:26:37: Stop all OSDs on f27-h13-000-6048r
	2024/05/27-09:36:56: Completed warp hybrid phase 3

	OSD perf dump - http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w4/osd-perfdumps/
	OSD logs: Not available
  
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |                                                                          Third run | small sized object | mClock                                                                |
  |                                                                               (RHCS 7.1 - 18.2.1-188.el9cp)                                                                     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | Job |      Workload     | Total Read Throughput | Total Write Throughput | Avg Latency | Avg RGW | Avg RGW | Avg OSD | Avg OSD |         Avg Recovery           | Recovery Time |
  | ID  |                   |   MB/s   |   Objs/s   |   MB/s   |    Objs/s   |    (ms)     |   %CPU  |   %Mem  |   %CPU  |   %Mem  | with IO (MB/s) | w/o IO (MB/s) |     hh:mm     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | w11 | fill 225M objs    |     -    |      -     |   1305   |    20224    |      74     |   721   |   0.21  |    91   |   2.3   |        --      |       --      |               |
  | w12 | hybrid noFailure  |    492   |    7493    |    382   |     5829    |      65     |   524   |   0.2   |    69   |   2.8   |        --      |       --      |      2363     |
  | w12 | hybrid OSDnode1   |    228   |    3449    |    178   |     2684    |     137     |   434   |   0.21  |    57   |   2.7   |       147      |       --      |  PGs Unclean  |
  | w12 | hybrid OSDnode2   |    331   |    5035    |    258   |     3917    |      93     |   375   |   0.2   |    51   |   2.7   |        52      |       --      |      after    |
  | w12 | hybrid OSDrecover |    414   |    6282    |    320   |     4888    |      79     |   354   |   0.2   |    43   |   2.2   |         6      |       28      |    24 hours   |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Timeline for phase 3 from job w12
	2024/06/10-10:02:42: Starting warp hybrid phase 3 
	2024/06/10-10:06:23: Stop all OSDs on f27-h13-000-6048r
	2024/06/10-12:21:08: Completed warp hybrid phase 3

	OSD perf dump - http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w12_240610-044907_runtest-71_runtest_small_repeat2/osd-perfdumps/
	OSD logs: http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w12_240610-044907_runtest-71_runtest_small_repeat2/osd-logs/



  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |                                                                        Fourth run | small sized object | mClock                                                                 |
  |                                                                            (RHCS 7.1 - 18.2.1-188.el9cp)                                                                        |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | Job |      Workload     | Total Read Throughput | Total Write Throughput | Avg Latency | Avg RGW | Avg RGW | Avg OSD | Avg OSD |         Avg Recovery           | Recovery Time |
  | ID  |                   |   MB/s   |   Objs/s   |   MB/s   |    Objs/s   |    (ms)     |   %CPU  |   %Mem  |   %CPU  |   %Mem  | with IO (MB/s) | w/o IO (MB/s) |     hh:mm     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | w13 | fill 225M objs    |     -    |      -     |   1306   |    20259    |      74     |   537   |   0.2   |    69   |   2.3   |        --      |       --      |               |
  | w14 | hybrid noFailure  |    491   |    7485    |    381   |     5823    |      66     |   291   |   0.2   |    38   |   2.8   |        --      |       --      |      2470     |
  | w14 | hybrid OSDnode1   |    217   |    3309    |    170   |     2575    |     145     |   277   |   0.2   |    36   |   2.7   |       140      |       --      |  PGs Unclean  |
  | w14 | hybrid OSDnode2   |    341   |    5193    |    266   |     4040    |      91     |   260   |   0.2   |    35   |   2.7   |        59      |       --      |      after    |
  | w14 | hybrid OSDrecover |    411   |    6237    |    318   |     4852    |      79     |   260   |   0.2   |    33   |   2.2   |         7      |       25      |    24 hours   |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Timeline for phase 3 from job w14
	2024/06/12-04:17:03: Starting warp hybrid phase 3 
	2024/06/12-04:20:44: Stop all OSDs on f27-h13-000-6048r
	2024/06/12-06:36:01: Completed warp hybrid phase 3

	OSD perf dump - http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w14_240611-225522_runtest-71_runtest_small_repeat3/osd-perfdumps/
	OSD logs: http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w14_240611-225522_runtest-71_runtest_small_repeat3/osd-logs/



  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |                                                                            First run | small sized object | WPQ                                                                 |
  |                                                                                (RHCS 7.1 - 18.2.1-188.el9cp)                                                                    |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | Job |      Workload     | Total Read Throughput | Total Write Throughput | Avg Latency | Avg RGW | Avg RGW | Avg OSD | Avg OSD |         Avg Recovery           | Recovery Time |
  | ID  |                   |   MB/s   |   Objs/s   |   MB/s   |    Objs/s   |    (ms)     |   %CPU  |   %Mem  |   %CPU  |   %Mem  | with IO (MB/s) | w/o IO (MB/s) |     hh:mm     |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  | w9  | fill 225M objs    |     -    |      -     |    1266  |    19365    |      77     |   482   |   0.2   |    64   |   2.3   |        --      |       --      |               |
  | w10 | hybrid noFailure  |    488   |    7433    |     379  |     5783    |      65     |   206   |   0.2   |    29   |   2.8   |        --      |       --      |      3835     |
  | w10 | hybrid OSDnode1   |    399   |    6075    |     309  |     4726    |      80     |   213   |   0.2   |    29   |   2.7   |        76      |       --      |  PGs Unclean  |
  | w10 | hybrid OSDnode2   |    386   |    5860    |     299  |     4559    |      82     |   219   |   0.2   |    30   |   2.7   |        42      |       --      |      after    |
  | w10 | hybrid OSDrecover |    410   |    6233    |     317  |     4849    |      81     |   224   |   0.2   |    29   |   2.2   |         5      |        6      |    24 hours   |
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Timeline for phase 3 from job w10
	2024/06/08-15:05:03: Starting warp hybrid phase 3 
	2024/06/08-15:08:45: Stop all OSDs on f27-h13-000-6048r
	2024/06/08-17:26:07: Completed warp hybrid phase 3

	OSD perf dump - http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w10_240608-093336_runtest-71_runtest_small_wpq/osd-perfdumps/
	OSD logs: http://f28-h28-000-r630.rdu2.scalelab.redhat.com/RESULTS/w10_240608-093336_runtest-71_runtest_small_wpq/osd-logs/

Comment 27 errata-xmlrpc 2025-03-06 14:22:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.0 security, bug fixes, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:2457