Bug 1604108 - serious performance regression upgrading from 2.5 to 3.1
Summary: serious performance regression upgrading from 2.5 to 3.1
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RGW
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 3.2
Assignee: Matt Benjamin (redhat)
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1581350
 
Reported: 2018-07-19 21:02 UTC by John Harrigan
Modified: 2018-08-20 20:14 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-20 20:14:39 UTC
Embargoed:


Attachments
COSbench ioWorkload run for 24 hours (1.05 KB, application/xml)
2018-07-19 21:02 UTC, John Harrigan
perf graphs of RHCS25 run (223.34 KB, image/png)
2018-07-19 21:04 UTC, John Harrigan
perf graphs of RHCS31 run (203.33 KB, image/png)
2018-07-19 21:05 UTC, John Harrigan

Description John Harrigan 2018-07-19 21:02:56 UTC
Created attachment 1462233 [details]
COSbench ioWorkload run for 24 hours

Description of problem:
Storage workload DFG is seeing reduced performance and much longer client
latencies since upgrading a ceph cluster from RHCS 2.5 to RHCS 3.1 

Version-Release number of selected component (if applicable):
RHCS 2.5 --> RHCS 3.1 (12.2.5-32.el7cp) downloaded from 
http://download.eng.bos.redhat.com/composes/auto/ceph-3.1-rhel-7/RHCEPH-3.1-RHEL-7-20180712.ci.2/

Actual results:
The same I/O workload, applied for 24 hours on RHCS 2.5 and again on RHCS 3.1 (after upgrade), now runs slower.

Expected results:
Same workload would hopefully be running faster on RHCS 3.1

Additional info:
I dug into the COSbench CSV files for the two runs on the upgrade cluster, w74 (RHCS 2.5) and w75 (RHCS 3.1).
I calculated the average latencies for the first hour and the final hour of each run, for each of the four operations (read, list, write, and delete).

W74 (RHCS 2.5)
  READ:   first hour average = 470ms    last hour = 529ms
  LIST:   first hour average = 46ms     last hour = 64ms
  WRITE:  first hour average = 1680ms   last hour = 2723ms
  DELETE: first hour average = 143ms    last hour = 136ms

W75 (RHCS 3.1)
  READ:   first hour average = 1067ms   last hour = 3512ms
  LIST:   first hour average = 80ms     last hour = 5652ms
  WRITE:  first hour average = 708ms    last hour = 13133ms
  DELETE: first hour average = 111ms    last hour = 2558ms

Other than writes in the first hour, the RHCS 2.5 cluster is outperforming the upgraded RHCS 3.1 cluster.  
>> Look at degradation from first to last hour on all operations for RHCS 3.1
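
For anyone repeating this analysis, the hourly averages can be recomputed from the
per-stage COSbench CSV files with something along these lines (the response-time
column index, the sample interval, and the file name below are assumptions; adjust
them to match the actual CSV export):

  # average the response-time column over the first and last hour of a run
  # COL   = assumed 1-based index of the Avg-ResTime column in the CSV
  # NSAMP = samples per hour (720 assumes a 5-second COSbench sample interval)
  COL=4; NSAMP=720
  awk -F, -v c=$COL -v n=$NSAMP 'NR>1 {v[++t]=$c}
      END {for(i=1;i<=n;i++){f+=v[i]; l+=v[t-n+i]}
           printf "first hour = %.0f ms, last hour = %.0f ms\n", f/n, l/n}' w75-main.csv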

Comment 3 John Harrigan 2018-07-19 21:04:43 UTC
Created attachment 1462252 [details]
perf graphs of RHCS25 run

Comment 4 John Harrigan 2018-07-19 21:05:19 UTC
Created attachment 1462259 [details]
perf graphs of RHCS31 run

Comment 5 John Harrigan 2018-07-19 21:07:21 UTC
(In reply to John Harrigan from comment #4)
> Created attachment 1462259 [details]
> perf graphs of RHCS31 run

Observe how the performance tanks around sample #16500, which is roughly 22 hours into the runtime.

Comment 6 John Harrigan 2018-07-19 23:07:52 UTC
(In reply to John Harrigan from comment #0)
> Other than writes in the first hour, the RHCS 2.5 cluster is outperforming
> the upgraded RHCS 3.1 cluster.
> >> Look at degradation from first to last hour on all operations for RHCS 3.1

The final paragraph should state:
 During the first hour of runtime, 3.1 outperforms 2.5 on write and delete operations but is slower on read and list. The reduction in 3.1 performance from the first to the last hour is dramatic. About 22 hours into the run the 3.1 performance drops hard.

Comment 7 John Harrigan 2018-07-19 23:41:27 UTC
Tiffany, were any non-default Ceph settings applied to either the RHCS 2.5 or RHCS 3.1 clusters?
If so please list them.

Comment 8 Josh Durgin 2018-07-20 02:01:17 UTC
How are these results accounting for aging? This workload has a net positive rados object count skew, especially if GC is not keeping up, so at some point you will run into the next filestore splitting point, causing a perf drop.
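
(For reference: with filestore, a PG subdirectory splits once it holds more than
roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects,
so with the stock values of 2 and 10 that works out to 2 * 10 * 16 = 320 objects
per directory; the actual settings on this cluster would need to be confirmed.)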

Is this cluster still up so we can inspect it?

Comment 9 John Harrigan 2018-07-20 02:58:21 UTC
I have done back to back 24 hour runs on the Sizing cluster and a decrease
in performance between runs was recorded. However the latency increase on
the second run was on the order of 5-10% while here we saw much greater
latency increases.

The upgrade testing was purposely devised to mimic a customer upgrade scenario.
As such, the 2.5 cluster was upgraded and then the workload was re-run. This
approach definitely leaves the cluster and pools in a different state than a
cluster freshly installed with 3.1.

For example, the Sizing cluster is running 3.1 and its pools were created with
expectedNumObjects 500M. This Upgrade cluster had its pools created under
RHCS 2.5, where expectedNumObjects is not yet implemented, so filestore
splitting would be active for both the 2.5 and 3.1 test runs. Once upgraded
to 3.1, I would expect the new default for filestore_merge_threshold (changed
from 10 to -10 in RHCS 3.1) to take effect. Could a negative value of
filestore_merge_threshold (which disables filestore splitting) negatively
impact an upgraded cluster whose pools were not created with
expectedNumObjects?

There is also the parallel GC change in 3.1. Testing on the Sizing cluster
indicates that parallel GC activity adversely impacts client I/O rates
(roughly half the rate from hour 18 to hour 24). While it is helpful that GC
is more aggressive than previous implementations, the perf overhead may be a
customer dissatisfaction area.
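
If we do get another test window, one way to gauge whether GC is keeping up
during the run would be to sample the pending GC backlog periodically from an
RGW node, something like this (the exact JSON layout of the gc list output may
differ between releases):

  # rough count of objects currently queued for RGW garbage collection
  radosgw-admin gc list --include-all | grep -c '"oid"'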

Unfortunately we will be losing both clusters at the end of this week. This
cluster was already purged and redeployed with 2.5 tonight. We still need to
conduct the upgrade with an active I/O workload, which was not done during the
first upgrade procedure; to emulate customer usage we need to perform the
actual upgrade while a continuous I/O workload is running.

Comment 10 Josh Durgin 2018-07-20 03:41:43 UTC
(In reply to John Harrigan from comment #9)
> I have done back to back 24 hour runs on the Sizing cluster and a decrease
> in performance between runs was recorded. However the latency increase on
> the second run was on the order of 5-10% while here we saw much greater
> latency increases.

48 hours of load isn't necessarily the same - if there was a higher net rate of RADOS object creation in 3.1, e.g. due to the increased write throughput you saw, 3.1 could hit the next filestore split threshold sooner (e.g. at the end of this test run), whereas with 2.5 it might have taken longer to get there, so the drop did not show up in the same time frame.

> The upgrade testing was purposely devised to mimic a customer upgrade
> scenario.
> As such, the 2.5 cluster was upgraded and then the workload was re-run. This
> approach definitely leaves the cluster and pools in a different state than a
> cluster freshly installed with 3.1.

A customer who has been using RGW will have already finished the filestore splits in 2.5. That is, the steady state is no filestore splits.

> For example, the Sizing cluster is running 3.1 and its pools were created
> with expectedNumObjects 500M. This Upgrade cluster had its pools created
> under RHCS 2.5, where expectedNumObjects is not yet implemented, so filestore
> splitting would be active for both the 2.5 and 3.1 test runs. Once upgraded
> to 3.1, I would expect the new default for filestore_merge_threshold (changed
> from 10 to -10 in RHCS 3.1) to take effect. Could a negative value of
> filestore_merge_threshold (which disables filestore splitting) negatively
> impact an upgraded cluster whose pools were not created with
> expectedNumObjects?

This disables merging, and it wouldn't explain the perf drop at the end of the test run.
 
> There is also the parallel GC change in 3.1. Testing on the Sizing cluster
> indicates that parallel GC activity adversely impacts client I/O rates
> (roughly half the rate from hour 18 to hour 24). While it is helpful that GC
> is more aggressive than previous implementations, the perf overhead may be a
> customer dissatisfaction area.
> 
> Unfortunately we will be losing both clusters at the end of this week. This
> cluster was already purged and redeployed with 2.5 tonight. We still need to
> conduct the upgrade with an active I/O workload, which was not done during
> the first upgrade procedure; to emulate customer usage we need to perform the
> actual upgrade while a continuous I/O workload is running.

Can we get access to this cluster to investigate tomorrow then? Without further analysis of what is happening in the cluster, it's hard to conclude that there's a regression, and if there is one what the cause is.
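
Even a couple of quick data points before teardown would help; for example (the
OSD id and paths below are only examples, adjust for an actual OSD on this
cluster):

  # how wide the filestore collection directories have grown on one OSD
  find /var/lib/ceph/osd/ceph-140/current -maxdepth 3 -type d -name 'DIR_*' | wc -l
  # any split activity in the OSD log (may only show up at higher filestore debug levels)
  grep -i split /var/log/ceph/ceph-osd.140.log | tail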

Comment 11 John Harrigan 2018-07-20 03:54:41 UTC
Based on the amount of time it takes to reproduce these procedures the
cluster will not be in the right state tomorrow. I have requested that the
allocation be extended through next week but that looks unlikely.

Comment 12 Tiffany Nguyen 2018-07-20 20:14:50 UTC
(In reply to John Harrigan from comment #7)
> Tiffany, were any non-default Ceph settings applied to either the RHCS 2.5
> or RHCS 3.1 clusters?
> If so please list them.

It was using all default Ceph settings for both 2.5 and 3.1.

Comment 13 John Harrigan 2018-07-24 13:01:44 UTC
This was the status on the Upgrade cluster, post-upgrade.
Note that two of the recommended settings were not applied.
Should I open a new BZ for this?

[root@c08-h22-r630 ~]# ceph -v
ceph version 12.2.5-32.el7cp (6c5a0b29a0322f73c820c8b69785193d38fb2bfa) luminous (stable)


For the settings, only filestore_merge_threshold has been changed:

[root@c06-h09-6048r ~]# ceph daemon osd.140 config get filestore_merge_threshold
{
    "filestore_merge_threshold": "-10"
}
[root@c06-h09-6048r ~]# ceph daemon osd.140 config get objecter_inflight_ops
{
    "objecter_inflight_ops": "1024"
}
[root@c06-h09-6048r ~]# ceph daemon osd.140 config get rgw_thread_pool_size
{
    "rgw_thread_pool_size": "100"
}
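
For completeness, the same check could be repeated for every OSD on a host by
looping over the admin sockets (default /var/run/ceph socket paths assumed):

  for sock in /var/run/ceph/ceph-osd.*.asok; do
      echo -n "$sock: "
      ceph daemon $sock config get filestore_merge_threshold
  done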

Comment 14 John Harrigan 2018-07-24 14:53:04 UTC
(In reply to John Harrigan from comment #13)
> This was the status on the Upgrade cluster, post-upgrade.
> Note that two of the recommended settings were not applied.
> Should I open a new BZ for this?

Never mind, this is expected.
I now realize that only the "filestore_merge_threshold" change has made it downstream.

- John

Comment 17 Vasu Kulkarni 2018-07-27 15:57:09 UTC
Moving this back to 3.1 for now

Comment 18 Vasu Kulkarni 2018-07-27 16:19:41 UTC
The reason for the move was concern that this may be a serious bug that needs to be looked at, and to allow one more round of discussion in our upcoming DFG meeting with Matt and team.

Comment 21 John Harrigan 2018-08-15 11:23:52 UTC
SUMMARY:
Using a "steady-state" workload, the results do not indicate a performance
regression.
However, I think further testing is required to investigate RADOS object
creation in RHCS 3.1 on pre-existing pools.
The results from the Scale Lab, using the original workload (which caused RADOS
object creation), indicate a possible perf regression with the new RHCS 3.1
default of filestore_merge_threshold = -10 when using pools created without
expected_num_objects (i.e. pools created under RHCS 2.5).

TEST DESCRIPTION:
Based on Josh's input above (Comment #10) I modified the 'hybrid' workload to
avoid RADOS object creation while it runs. The specific change is the numOBJmax
setting in GCrate/vars.shinc:
  * in Scale Lab: numOBJmax=$(( (numOBJ * 10) ))
  * in BAGL:      numOBJmax=$numOBJ

So what is different here? In the Scale Lab the workload was expanding the
object space, while in BAGL it was emulating steady state with a constrained
object space and NOT causing RADOS object creation.

Using this "steady-state" workload on the BAGL cluster (4x OSDs and 6x Clients),
the following runs were performed:
  * two 24 hr runs completed with RHCS 2.5
  * Cluster upgraded from 2.5 to 3.1
  * two 24hr runs completed with RHCS 3.1 (post-upgrade)

The table with client latency timings can be seen here
https://docs.google.com/document/d/1mMfYsWRUym7RN7NtTxVKeJSjzvgmQCly-bhOk_mH6Kc/edit#bookmark=id.jsfvuzevk41

The complete writeup of the testing effort is here
https://docs.google.com/document/d/1mMfYsWRUym7RN7NtTxVKeJSjzvgmQCly-bhOk_mH6Kc/edit?usp=sharing

Comment 22 John Harrigan 2018-08-20 20:13:50 UTC
Description:

Tiffany and I ran some tests on the BAGL cluster to try to isolate the impact
of the filestore_merge_threshold setting and answer this question:
> when running workloads which cause new RADOS object creation, is there a
> perf regression due to the change in the default value of
> filestore_merge_threshold (from 10 to -10)? This will be the case when
> customers upgrade to RHCS 3.1 and use existing pools. It is also the case
> when customers create a lot of objects in pools created without
> expected_num_objects.

We investigated these four scenarios (FMT = filestore_merge_threshold):
1) RHCS 3.1 using new default FMT value of -10
2) RHCS 3.1 using previous FMT value of 10
3) RHCS 2.5 using default FMT value of 10
4) RHCS 3.1 using default FMT value of -10 with expected_num_objects


Since cluster aging has a direct impact on the results, we deployed a fresh
Ceph cluster for each of these scenarios and then immediately ran a
write-intensive workload, filling the cluster to 50%.
We used COSbench to issue and monitor the cluster-fill workload, recording
bandwidth, average response time, and 99th-percentile response time:
bandwidth, average response time and 99% response time:
1) RHCS 3.1 using new default FMT value of -10
   Bandwidth=666.5MB/S  Avg-ResTime=793ms  99%-RT=15790ms

2) RHCS 3.1 using previous FMT value of 10
   Bandwidth=698MB/S  Avg-ResTime=764ms  99%-RT=15570ms

3) RHCS 2.5 using default FMT value of 10
   Bandwidth=516MB/S  Avg-ResTime=1024ms  99%-RT=23270ms (SLOWEST)

4) RHCS 3.1 using default FMT value of -10 with expected_num_objects
   Bandwidth=1.03GB/S  Avg-ResTime=513ms  99%-RT=9710ms (FASTEST)

The results show that all the RHCS 3.1 cluster fills ran significantly faster,
with lower latencies, than RHCS 2.5 (Scenario #3). Amongst the RHCS 3.1 tests,
Scenario #2 was slightly faster than Scenario #1, but not alarmingly so.

Scenario #4 ran on RHCS 3.1 and the pool was created using
expected_num_objects, which precreates the filestore directory structure. As 
expected that test provided the best performance.
That is the method we want RGW customers to adopt. It is new to RHCS 3.1 and
is being documented in the RGW for Production Guide.
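
For reference, creating an RGW data pool with expected_num_objects looks roughly
like this (the pool name, PG counts, rule name, and object count below are
placeholders; the RGW for Production Guide has the authoritative procedure):

  ceph osd pool create default.rgw.buckets.data 4096 4096 replicated replicated_rule 500000000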

I am closing this BZ.

