Bug 2045072

Summary: AWS - degradation on OCP 4.10 + ODF 4.10 vs OCP 4.9 + ODF 4.9 in files per second in CephFS 4KB create and append actions, and 16KB create actions
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Yuli Persky <ypersky>
Component: ceph    Assignee: Travis Nielsen <tnielsen>
ceph sub component: RBD QA Contact: Elad <ebenahar>
Status: CLOSED INSUFFICIENT_DATA Docs Contact:
Severity: unspecified    
Priority: unspecified CC: alayani, bniver, ekuric, jopinto, kramdoss, madam, mmuench, muagarwa, ocs-bugs, odf-bz-bot, pnataraj, shberry, tnielsen, ypadia
Version: 4.10    Keywords: Automation, Performance
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-10-19 06:04:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yuli Persky 2022-01-25 14:53:43 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


AWS platform - there is a degradation on OCP 4.10 + ODF 4.10 vs OCP 4.9 + ODF 4.9 in files per second for CephFS 4KB create and append actions and 16KB create actions.

The comparison report with the exact files-per-second rates between 4.10.0 (build 73) and 4.9.0, together with graphs, is available on the Performance Dashboard here:

http://ocsperf.ceph.redhat.com:8080/index.php?version1=13&build1=26&platform1=1&az_topology1=1&test_name%5B%5D=2&version2=14&build2=28&platform2=1&az_topology2=1&version3=&build3=&version4=&build4=&submit=Choose+options


Version of all relevant components (if applicable):

ODF 4.10.0.73


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Yes - over a number of executions we saw this problem persist.

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Run tests/e2e/performance/io_workload/test_small_file_workload
2. Compare the CephFS files-per-second rate to the 4.9 results (you may use the Performance Dashboard http://ocsperf.ceph.redhat.com:8080/ or this report: https://docs.google.com/document/d/1OJfARHBAJs6bkYqri_HpSNM_N5gchUQ6P-lKe6ujQ6o/edit#heading=h.soyx5tyy3ajz ); see the comparison sketch below.
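For step 2, a rough comparison sketch in Python (the values are placeholders to be filled in from the Performance Dashboard, not measured numbers; the operation names just mirror the ones tracked in this BZ):

# Hypothetical helper for step 2: compute the files/sec change of a 4.10 run
# against the 4.9 baseline. A negative percentage indicates a regression.
def pct_change(old: float, new: float) -> float:
    return (new - old) / old * 100.0

# Placeholder values -- fill these in from the Performance Dashboard.
baseline_49 = {"create-4KB": None, "append-4KB": None, "create-16KB": None}
candidate_410 = {"create-4KB": None, "append-4KB": None, "create-16KB": None}

for op in baseline_49:
    old, new = baseline_49[op], candidate_410[op]
    if old and new:
        print(f"{op}: {pct_change(old, new):+.1f}% vs. 4.9 (files/sec)")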


Actual results:

In CephFS, for 4KB file size, the "files per second" rate for the Create and Append actions is lower in 4.10 than in 4.9.
 

Expected results:

The files-per-second rate should be similar to or higher than in 4.9.

Additional info:

Link to Performance Dashboard with the results and comparison tables: http://ocsperf.ceph.redhat.com:8080/index.php?version1=13&build1=26&platform1=1&az_topology1=1&test_name%5B%5D=2&version2=14&build2=28&platform2=1&az_topology2=1&version3=&build3=&version4=&build4=&submit=Choose+options

Link to Jenkins Job : 

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/view/Performance/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-performance/58/

Link to must gather logs: 

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-058ai3c33-p/j-058ai3c33-p_20220105T230912/logs/testcases_1641427511/

Comment 2 yati padia 2022-02-09 07:18:17 UTC
ypersky, is the must-gather attached here for 4.10 or 4.9? Also, in order to compare the performance, we will need must-gathers for both 4.9 and 4.10 to determine whether the time taken to create and append the files is at the ceph-csi level or not.

Comment 3 Yuli Persky 2022-03-02 11:55:58 UTC
The must gather provided above was for the 4.10 run.

I've run the small files tests again on 4.9 and this is the link to the 4.9 must gather:

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-ypersky-a9/lr5-ypersky-a9_20220301T225837/logs/testcases_1646217501/

Comment 4 Mudit Agarwal 2022-03-08 13:22:06 UTC
Not a 4.10 blocker, moving out

Comment 5 Scott Ostapovicz 2022-03-08 15:48:41 UTC
@tnielsen have we confirmed this is a degradation in CephFS performance and not a CSI issue?

Comment 6 Travis Nielsen 2022-03-08 18:55:31 UTC
The CSI driver is only involved in provisioning and mounting. Not sure how that would affect CephFS write performance, since the CSI driver is not in the data path.

Comment 7 Scott Ostapovicz 2022-03-14 14:48:15 UTC
@vshankar Was there a change in performance in ceph fs in this time frame?
@tnielsen Did the AWS provisioning of the PV change in this time frame?  Was there a CI change?

Comment 9 Scott Ostapovicz 2022-03-21 14:51:07 UTC
as we discussed offline, back to you Travis

Comment 10 Travis Nielsen 2022-03-21 17:50:27 UTC
Scott and I discussed that there are no known changes to Rook or CephFS that would affect performance from 4.9 to 4.10. 

Yuli, a common issue with AWS clusters is that performance is not guaranteed for the devices that are provisioned. What is the storage class specified for the mons and OSDs when creating the cluster (the storageClassDeviceSets volume template in the CephCluster CR)? Is it gp2? If you need consistent testing, you would need to use a storage class that guarantees consistent IOPS.
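A minimal sketch of how one could check which storage class backs the OSD volume claims (assumptions: the kubernetes Python client is installed, a valid kubeconfig is loaded, and the cluster uses the default ODF CephCluster name/namespace; adjust to the actual deployment):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Assumed name/namespace for a default ODF deployment.
cephcluster = api.get_namespaced_custom_object(
    group="ceph.rook.io",
    version="v1",
    namespace="openshift-storage",
    plural="cephclusters",
    name="ocs-storagecluster-cephcluster",
)

# storageClassDeviceSets live under spec.storage; each set carries PVC
# templates whose storageClassName (e.g. gp2 vs. io2) determines which EBS
# volume type backs the OSDs.
for device_set in cephcluster["spec"]["storage"].get("storageClassDeviceSets", []):
    for vct in device_set.get("volumeClaimTemplates", []):
        print(device_set["name"], vct["spec"].get("storageClassName"))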

Comment 11 Yuli Persky 2022-04-12 21:32:20 UTC
@Travis Nielsen,

The test is using a default storage class, not gp2.
When we test performance, we should use the same storage class as the customers would, and that would be the default storage class.

Which storage class, in your opinion, guarantees consistent IOPS and which does not?

Comment 12 Travis Nielsen 2022-04-12 22:15:43 UTC
io2 looks like the most predictable according to the AWS volume types page [1] (rough numbers are worked through in the sketch below):
- gp2: 100 IOPS to a maximum of 16,000 IOPS, and provides up to 250 MB/s of throughput per volume
- io1: up to 50 IOPS/GB to a maximum of 64,000 IOPS, and provides up to 1,000 MB/s of throughput per volume
- io2: 500 IOPS for every provisioned GB

[1] https://aws.amazon.com/ebs/volume-types
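To put those ratios in perspective, a back-of-the-envelope sketch (the 512 GiB device size is purely an illustrative assumption; gp2's ~3 IOPS/GiB baseline comes from the same AWS page, and real volumes are also subject to per-volume maximums not repeated here):

# Rough theoretical IOPS for one EBS-backed OSD device, using the ratios above.
size_gib = 512  # illustrative assumption, not the size used in these test runs

gp2_iops = min(max(3 * size_gib, 100), 16_000)  # ~3 IOPS/GiB baseline, 100..16,000
io1_iops = min(50 * size_gib, 64_000)           # up to 50 IOPS/GB, max 64,000
io2_iops = 500 * size_gib                       # 500 IOPS per provisioned GB

print(f"gp2: {gp2_iops}, io1: {io1_iops}, io2: {io2_iops} IOPS")
# -> gp2: 1536, io1: 25600, io2: 256000 IOPS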

Comment 13 Yuli Persky 2022-04-18 23:01:06 UTC
I've run the same test (test_small_file_workload.py) again on 4.10.0.221, on AWS.
The purpose of the run was to see whether the degradation is consistent. 

The relevant Jenkins job (where all the must gather logs are stored) is: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/11859/

The Performance Dashboard comparison of the new run on 4.10.0.221 vs 4.9 is available here:

http://ocsperf.ceph.redhat.com:8080/index.php?version1=13&build1=26&platform1=1&az_topology1=1&test_name%5B%5D=2&version2=14&build2=63&platform2=1&az_topology2=1&version3=&build3=&version4=&build4=&submit=Choose+options

We actually do see a similar degradation AGAIN in the CephFS small files results.

The degradation for 4KB file size is 8% for the create action and 25% for the append action.

The degradation for 16KB file size is 32% for the create action.

This confirms the first results that we got on build 4.10.0.73, when I opened this BZ.

Looks like the degradation IS consistent. And the test is using the default storage classes, similar to what customers are likely to do.