Created attachment 1462233 [details]
COSbench ioWorkload run for 24 hours

Description of problem:
The Storage Workload DFG is seeing reduced performance and much longer client latencies since upgrading a Ceph cluster from RHCS 2.5 to RHCS 3.1.

Version-Release number of selected component (if applicable):
RHCS 2.5 --> RHCS 3.1 (12.2.5-32.el7cp) downloaded from
http://download.eng.bos.redhat.com/composes/auto/ceph-3.1-rhel-7/RHCEPH-3.1-RHEL-7-20180712.ci.2/

Actual results:
The same I/O workload, applied for 24 hours on RHCS 2.5 and again on RHCS 3.1 (after the upgrade), now runs slower.

Expected results:
The same workload would hopefully run faster on RHCS 3.1.

Additional info:
I dug into the COSbench CSV files for the two runs on the upgrade cluster, w74 (RHCS 2.5) and w75 (RHCS 3.1), and calculated the average latency for the first hour and for the final hour of each run, for each of the four operations (read, list, write and delete).

W74 (RHCS 2.5)
READ:   first hour average = 470ms    last hour = 529ms
LIST:   first hour average = 46ms     last hour = 64ms
WRITE:  first hour average = 1680ms   last hour = 2723ms
DELETE: first hour average = 143ms    last hour = 136ms

W75 (RHCS 3.1)
READ:   first hour average = 1067ms   last hour = 3512ms
LIST:   first hour average = 80ms     last hour = 5652ms
WRITE:  first hour average = 708ms    last hour = 13133ms
DELETE: first hour average = 111ms    last hour = 2558ms

Other than writes in the first hour, the RHCS 2.5 cluster is outperforming the upgraded RHCS 3.1 cluster.
>> Look at the degradation from the first to the last hour on all operations for RHCS 3.1.
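For reference, the hourly averages above come from simple scripting over the COSbench snapshot CSVs. A minimal sketch of that kind of calculation is below; the file name, column number and row count are assumptions (they depend on the actual w74/w75 CSV layout and the snapshot interval), not the exact commands used:

# Average the assumed Avg-ResTime column (here column 5) over the first hour of
# snapshots (here 720 rows, assuming a 5-second snapshot interval); adjust to
# the real CSV layout before using.
awk -F',' 'NR > 1 && NR <= 721 { sum += $5; n++ } END { printf "%.0f ms\n", sum / n }' w74-read-timeline.csv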
Created attachment 1462252 [details] perf graphs of RHCS25 run
Created attachment 1462259 [details] perf graphs of RHCS31 run
(In reply to John Harrigan from comment #4)
> Created attachment 1462259 [details]
> perf graphs of RHCS31 run

Observe how the perf tanks around sample #16500, which is roughly 22 hours into the runtime.
(In reply to John Harrigan from comment #0)
> Other than writes in the first hour, the RHCS 2.5 cluster is outperforming
> the upgraded RHCS 3.1 cluster.
> >> Look at the degradation from the first to the last hour on all operations for RHCS 3.1.

The final paragraph should state: during the first hour of runtime, 3.1 outperforms 2.5 on write and delete operations but is slower on read and list. The degradation in 3.1 performance from the first hour to the last is dramatic; about 22 hours into the run, 3.1 performance drops hard.
Tiffany, were any non-default Ceph settings applied to either the RHCS 2.5 or RHCS 3.1 clusters? If so, please list them.
How are these results accounting for aging? This workload has a net positive rados object count skew, especially if GC is not keeping up, so at some point you will run into the next filestore splitting point, causing a perf drop. Is this cluster still up so we can inspect it?
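If the cluster is still reachable, a quick way to sanity-check where splitting stands would be something like the following on an OSD node (the OSD ID and data path here are examples):

# Current split/merge settings on a running OSD:
ceph daemon osd.0 config get filestore_merge_threshold
ceph daemon osd.0 config get filestore_split_multiple
# Rough view of how far the PG collections have already fanned out on disk:
find /var/lib/ceph/osd/ceph-0/current -maxdepth 3 -type d | wc -l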
I have done back-to-back 24 hour runs on the Sizing cluster and a decrease in performance between runs was recorded. However, the latency increase on the second run was on the order of 5-10%, while here we saw much greater latency increases.

The upgrade testing was purposely devised to mimic a customer upgrade scenario. As such, the 2.5 cluster is upgraded and then the workload is re-run. This approach definitely leaves the cluster and pools in a different state than a cluster which is installed with 3.1. For example, the Sizing cluster is running 3.1 and the pools were created with expectedNumObjects 500M. This Upgrade cluster had the pools created under RHCS 2.5, in which expectedNumObjects is not yet implemented, so filestore splitting would be active for both the 2.5 and 3.1 test runs. Once upgraded to 3.1, I would expect the new setting ("filestore_merge_threshold changed from 10 to -10 in RHCS 3.1") to take effect. Could a negative value of filestore_merge_threshold (which disables filestore splitting) negatively impact an upgraded cluster whose pools were not created with expectedNumObjects?

And the parallel GC change with 3.1: testing on the Sizing cluster indicates that parallel GC activity adversely impacts client I/O rates (roughly half the rate from hour 18 to hour 24). While it's helpful being more aggressive than previous implementations, the perf overhead may be a customer dissatisfaction area.

Unfortunately, we will be losing both clusters at the end of this week. This cluster was already purged and redeployed with 2.5 tonight. There is a need to conduct the upgrade process with an active I/O workload, which was not done during the first upgrade procedure. In order to emulate customer usage we need to perform the actual upgrade while a continuous I/O workload is active.
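For reference, creating pools with expected_num_objects (as was done on the Sizing cluster) looks roughly like this; the pool name, PG counts, rule name and object count below are illustrative, not the exact values used:

ceph osd pool create default.rgw.buckets.data 4096 4096 replicated replicated_rule 500000000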
(In reply to John Harrigan from comment #9)
> I have done back-to-back 24 hour runs on the Sizing cluster and a decrease
> in performance between runs was recorded. However, the latency increase on
> the second run was on the order of 5-10%, while here we saw much greater
> latency increases.

48 hours of load isn't necessarily the same: if there was a higher net rate of RADOS object creation in 3.1, e.g. due to the increased write throughput you saw, 3.1 could hit the next filestore split threshold sooner (e.g. at the end of this test run), where it might have taken longer to get there with 2.5, so it did not show up in the same time frame.

> The upgrade testing was purposely devised to mimic a customer upgrade
> scenario. As such, the 2.5 cluster is upgraded and then the workload is
> re-run. This approach definitely leaves the cluster and pools in a
> different state than a cluster which is installed with 3.1.

A customer who has been using RGW will have already finished the filestore splits in 2.5. That is, the steady state is no filestore splits.

> For example, the Sizing cluster is running 3.1 and the pools were created
> with expectedNumObjects 500M. This Upgrade cluster had the pools created
> under RHCS 2.5, in which expectedNumObjects is not yet implemented, so
> filestore splitting would be active for both the 2.5 and 3.1 test runs.
> Once upgraded to 3.1, I would expect the new setting
> ("filestore_merge_threshold changed from 10 to -10 in RHCS 3.1") to take
> effect. Could a negative value of filestore_merge_threshold (which disables
> filestore splitting) negatively impact an upgraded cluster whose pools were
> not created with expectedNumObjects?

This disables merging, not splitting, and it wouldn't explain the perf drop at the end of the test run.

> And the parallel GC change with 3.1: testing on the Sizing cluster
> indicates that parallel GC activity adversely impacts client I/O rates
> (roughly half the rate from hour 18 to hour 24). While it's helpful being
> more aggressive than previous implementations, the perf overhead may be a
> customer dissatisfaction area.
>
> Unfortunately, we will be losing both clusters at the end of this week.
> This cluster was already purged and redeployed with 2.5 tonight. There is
> a need to conduct the upgrade process with an active I/O workload, which
> was not done during the first upgrade procedure. In order to emulate
> customer usage we need to perform the actual upgrade while a continuous
> I/O workload is active.

Can we get access to this cluster to investigate tomorrow, then? Without further analysis of what is happening in the cluster, it's hard to conclude that there's a regression, and if there is one, what the cause is.
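To put a number on the splitting point: with the RHCS 3.1 default discussed here (filestore_merge_threshold = -10), and assuming filestore_split_multiple is still at its default of 2, a filestore subdirectory splits once it holds roughly abs(filestore_merge_threshold) * filestore_split_multiple * 16 = 10 * 2 * 16 = 320 objects. A workload with net object growth therefore keeps pushing PGs toward that next split, while a steady-state workload does not.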
Based on the amount of time it takes to reproduce these procedures, the cluster will not be in the right state tomorrow. I have requested that the allocation be extended through next week, but that looks unlikely.
(In reply to John Harrigan from comment #7)
> Tiffany, were any non-default Ceph settings applied to either the RHCS 2.5
> or RHCS 3.1 clusters? If so, please list them.

It was using all default Ceph settings for both 2.5 and 3.1.
This was the status on the Upgrade cluster, post-upgrade.
Note that two of the recommended settings were not applied.
Should I open a new BZ for this?

[root@c08-h22-r630 ~]# ceph -v
ceph version 12.2.5-32.el7cp (6c5a0b29a0322f73c820c8b69785193d38fb2bfa) luminous (stable)

For the settings, only filestore_merge_threshold has been changed:

[root@c06-h09-6048r ~]# ceph daemon osd.140 config get filestore_merge_threshold
{
    "filestore_merge_threshold": "-10"
}
[root@c06-h09-6048r ~]# ceph daemon osd.140 config get objecter_inflight_ops
{
    "objecter_inflight_ops": "1024"
}
[root@c06-h09-6048r ~]# ceph daemon osd.140 config get rgw_thread_pool_size
{
    "rgw_thread_pool_size": "100"
}
(In reply to John Harrigan from comment #13)
> This was the status on the Upgrade cluster, post-upgrade.
> Note that two of the recommended settings were not applied.
> Should I open a new BZ for this?

Never mind, this is expected. I now realize that only "filestore_merge_threshold" has made it downstream.
- John
Moving this back to 3.1 for now
The reason for the move is the concern that this may be a serious bug that needs to be looked at, and we want one more round of discussion in our upcoming DFG meeting with Matt and team.
SUMMARY:
Using a "steady-state" workload, the results do not indicate a performance regression. However, I think further testing is required to investigate RADOS object creation in RHCS 3.1 on pre-existing pools. The results from the Scale Lab, using the original workload (which caused RADOS object creation), indicate a possible perf regression with the new RHCS 3.1 default, filestore_merge_threshold = -10, when using pools created without expected_num_objects (i.e. from RHCS 2.5).

TEST DESCRIPTION:
Based on Josh's input above (Comment #10), I modified the 'hybrid' workload to avoid RADOS object creation while it runs. The specific change is the numOBJmax setting in GCrate/vars.shinc:
* in Scale Lab: numOBJmax=$(( (numOBJ * 10) ))
* in BAGL: numOBJmax=$numOBJ

So what is different here? In the Scale Lab the workload was expanding the object space, while in BAGL it was emulating steady state with a constrained object space, NOT causing RADOS object creation.

Using this "steady-state" workload on the BAGL cluster (4x OSDs and 6x clients), the following runs were performed:
* two 24 hr runs completed with RHCS 2.5
* cluster upgraded from 2.5 to 3.1
* two 24 hr runs completed with RHCS 3.1 (post-upgrade)

The table with client latency timings can be seen here:
https://docs.google.com/document/d/1mMfYsWRUym7RN7NtTxVKeJSjzvgmQCly-bhOk_mH6Kc/edit#bookmark=id.jsfvuzevk41
The complete writeup of the testing effort is here:
https://docs.google.com/document/d/1mMfYsWRUym7RN7NtTxVKeJSjzvgmQCly-bhOk_mH6Kc/edit?usp=sharing
Description:
Tiffany and I ran some tests on the BAGL cluster to try to isolate the impact of the filestore_merge_threshold (FMT) setting and answer this question:

> When running workloads which cause new RADOS object creation, is there a
> perf regression due to the change in the default value of
> filestore_merge_threshold (from 10 to -10)? This will be the case when
> customers upgrade to RHCS 3.1 and use existing pools. It is also the case
> when customers create a lot of objects in pools created without
> expected_num_objects.

We investigated these four scenarios:
1) RHCS 3.1 using the new default FMT value of -10
2) RHCS 3.1 using the previous FMT value of 10
3) RHCS 2.5 using the default FMT value of 10
4) RHCS 3.1 using the default FMT value of -10, with expected_num_objects

Since cluster aging has a direct impact on the results, we deployed a fresh Ceph cluster for each of these scenarios and then immediately ran a write-intensive workload, filling the cluster to 50%. We used COSbench to issue and monitor the fill workload, recording bandwidth, average response time and 99th-percentile response time:

1) RHCS 3.1, new default FMT = -10:
   Bandwidth=666.5MB/s  Avg-ResTime=793ms   99%-RT=15790ms
2) RHCS 3.1, previous FMT = 10:
   Bandwidth=698MB/s    Avg-ResTime=764ms   99%-RT=15570ms
3) RHCS 2.5, default FMT = 10:
   Bandwidth=516MB/s    Avg-ResTime=1024ms  99%-RT=23270ms  (SLOWEST)
4) RHCS 3.1, default FMT = -10, with expected_num_objects:
   Bandwidth=1.03GB/s   Avg-ResTime=513ms   99%-RT=9710ms   (FASTEST)

The results show that all of the RHCS 3.1 cluster fills ran significantly faster, with lower latencies, than RHCS 2.5 (scenario #3). Among the RHCS 3.1 tests, scenario #2 was slightly faster than scenario #1, but not alarmingly so. Scenario #4 ran on RHCS 3.1 with the pool created using expected_num_objects, which pre-creates the filestore directory structure; as expected, that test provided the best performance. That is the method we want RGW customers to adopt. It is new to RHCS 3.1 and is being documented in the RGW for Production Guide. (The configuration delta between scenarios #1 and #2 is sketched at the end of this comment.)

I am closing this BZ.
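For reference, the only configuration delta between scenarios #1 and #2 was the merge threshold override, which amounts to roughly the following in ceph.conf on the OSD nodes before deployment (a sketch; the exact ceph-ansible override mechanics are not shown):

[osd]
filestore_merge_threshold = 10

Scenario #4 additionally created the pools with expected_num_objects, along the lines sketched earlier in this bug.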