Description of problem (please be as detailed as possible and provide log snippets):

AWS platform - there is a degradation on OCP 4.10 + ODF 4.10 vs. OCP 4.9 + ODF 4.9 in files per second in CephFS for the 4KB create and append actions and the 16KB create action. The comparison report with the exact files-per-second rates between 4.10.0 (build 73) and 4.9.0, along with graphs, is available on the Performance Dashboard here:
http://ocsperf.ceph.redhat.com:8080/index.php?version1=13&build1=26&platform1=1&az_topology1=1&test_name%5B%5D=2&version2=14&build2=28&platform2=1&az_topology2=1&version3=&build3=&version4=&build4=&submit=Choose+options

Version of all relevant components (if applicable):
ODF 4.10.0.73

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Yes - on a number of executions we saw this problem persist.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Run tests/e2e/performance/io_workload/test_small_file_workload (a standalone sketch of what the files-per-second measurement does is included below).
2. Compare the CephFS files-per-second rate to the 4.9 results. You may use the performance dashboard http://ocsperf.ceph.redhat.com:8080/ or this report: https://docs.google.com/document/d/1OJfARHBAJs6bkYqri_HpSNM_N5gchUQ6P-lKe6ujQ6o/edit#heading=h.soyx5tyy3ajz

Actual results:
In CephFS, for the 4KB file size, the "files per second" rate for the create and append actions is lower in 4.10 than in 4.9.

Expected results:
The files-per-second rate should be similar to or higher than in 4.9.

Additional info:
Link to Performance Dashboard with the results and comparison tables:
http://ocsperf.ceph.redhat.com:8080/index.php?version1=13&build1=26&platform1=1&az_topology1=1&test_name%5B%5D=2&version2=14&build2=28&platform2=1&az_topology2=1&version3=&build3=&version4=&build4=&submit=Choose+options

Link to Jenkins job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/view/Performance/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-performance/58/

Link to must gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-058ai3c33-p/j-058ai3c33-p_20220105T230912/logs/testcases_1641427511/
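For reference, a minimal standalone sketch (not the ocs-ci smallfile test itself) of what the "files per second" metric measures: a timed create phase and a timed append phase over many small files on a mounted CephFS PVC. The mount path and file count below are placeholders.

import os
import time

MOUNT_POINT = "/mnt/cephfs-pvc"   # placeholder: path where the CephFS PVC is mounted
FILE_SIZE = 4 * 1024              # 4 KB, matching the regressed case
NUM_FILES = 10000                 # placeholder file count

def timed_phase(name, func):
    start = time.monotonic()
    func()
    elapsed = time.monotonic() - start
    print(f"{name}: {NUM_FILES / elapsed:.1f} files/sec")

def create_files():
    payload = b"x" * FILE_SIZE
    for i in range(NUM_FILES):
        with open(os.path.join(MOUNT_POINT, f"file_{i}"), "wb") as f:
            f.write(payload)

def append_files():
    payload = b"y" * FILE_SIZE
    for i in range(NUM_FILES):
        with open(os.path.join(MOUNT_POINT, f"file_{i}"), "ab") as f:
            f.write(payload)

if __name__ == "__main__":
    timed_phase("create", create_files)
    timed_phase("append", append_files)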
ypersky, is the must-gather attached here for 4.10 or 4.9? Also, in order to compare the performance, we will need must-gathers for both 4.9 and 4.10 to determine whether the time taken to create and append the files is spent at the ceph-csi level or not.
The must gather provided above was for the 4.10 run. I've run the small files tests again on 4.9, and this is the link to the 4.9 must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-ypersky-a9/lr5-ypersky-a9_20220301T225837/logs/testcases_1646217501/
Not a 4.10 blocker, moving out
@tnielsen Have we confirmed that this is a degradation in CephFS performance and not a CSI issue?
The CSI driver is only involved in the provisioning and mounting. Not sure how that would affect the cephfs write performance since the csi driver is not in the data path.
@vshankar Was there a change in CephFS performance in this time frame? @tnielsen Did the AWS provisioning of the PVs change in this time frame? Was there a CI change?
as we discussed offline, back to you Travis
Scott and I discussed that there are no known changes to Rook or CephFS that would affect performance from 4.9 to 4.10.

Yuli, a common issue with AWS clusters is that performance is not guaranteed for the devices that are provisioned. What is the storage class specified for the mons and OSDs when creating the cluster (the storageClassDeviceSets volume claim templates in the CephCluster CR)? Is it gp2? If you need consistent testing, you would need to use a storage class that guarantees consistent IOPS.
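To answer the storage class question, here is a small sketch that reads the volume claim templates from the CephCluster CR. It assumes the usual openshift-storage namespace and the ODF CephCluster name ocs-storagecluster-cephcluster; adjust both for your deployment.

import json
import subprocess

# Placeholder names: adjust the namespace / CephCluster name for your deployment.
NAMESPACE = "openshift-storage"
CEPH_CLUSTER = "ocs-storagecluster-cephcluster"

out = subprocess.check_output([
    "oc", "get", "cephcluster", CEPH_CLUSTER, "-n", NAMESPACE, "-o", "json",
])
spec_storage = json.loads(out)["spec"]["storage"]

# Each storageClassDeviceSet declares its backing storage class in its volume claim template.
for device_set in spec_storage.get("storageClassDeviceSets", []):
    for template in device_set.get("volumeClaimTemplates", []):
        sc = template.get("spec", {}).get("storageClassName", "<cluster default>")
        print(f"device set {device_set.get('name')}: storage class {sc}")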
@Travis Nielsen, the test is using the default storage class, not gp2. When we test performance, we should use the same storage class as customers do, and that would be the default storage class. In your opinion, which storage classes guarantee consistent IOPS and which do not?
io2 looks like the most predictable according to the AWS volume types page [1]:
- gp2: 100 IOPS to a maximum of 16,000 IOPS, and up to 250 MB/s of throughput per volume
- io1: up to 50 IOPS/GB to a maximum of 64,000 IOPS, and up to 1,000 MB/s of throughput per volume
- io2: 500 IOPS for every provisioned GB

[1] https://aws.amazon.com/ebs/volume-types
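A quick way to see which EBS volume type each storage class on the cluster actually maps to - a sketch assuming the storage classes use an AWS EBS provisioner, which exposes the volume type through the "type" parameter:

import json
import subprocess

out = subprocess.check_output(["oc", "get", "storageclass", "-o", "json"])
for sc in json.loads(out)["items"]:
    name = sc["metadata"]["name"]
    provisioner = sc.get("provisioner", "")
    ebs_type = sc.get("parameters", {}).get("type", "n/a")   # gp2 / gp3 / io1 / io2 ...
    is_default = sc["metadata"].get("annotations", {}).get(
        "storageclass.kubernetes.io/is-default-class", "false")
    print(f"{name}: provisioner={provisioner} ebs-type={ebs_type} default={is_default}")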
I've run the same test (test_small_file_workload.py) again on 4.10.0.221, on AWS. The purpose of the run was to see whether the degradation is consistent.

The relevant Jenkins job (where all the must gather logs are stored) is:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/11859/

The Performance Dashboard comparison of the new run on 4.10.0.221 vs. 4.9 is available here:
http://ocsperf.ceph.redhat.com:8080/index.php?version1=13&build1=26&platform1=1&az_topology1=1&test_name%5B%5D=2&version2=14&build2=63&platform2=1&az_topology2=1&version3=&build3=&version4=&build4=&submit=Choose+options

We again see a similar degradation in the CephFS small files results: for the 4KB file size, the degradation is 8% for the create action and 25% for the append action; for the 16KB file size, it is 32% for the create action. This confirms the first results we got on build 4.10.0.73, when I opened this BZ. The degradation looks consistent. And the test is using the default storage classes, similarly to what customers are likely to do.
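For clarity, this is how the percentage figures above are derived from two files-per-second measurements; the values in the example are placeholders, not the measured results:

def degradation_pct(baseline_fps, current_fps):
    # Positive result means the newer build is slower than the baseline.
    return (baseline_fps - current_fps) / baseline_fps * 100

# Hypothetical values, only to show the formula:
print(f"{degradation_pct(1000.0, 750.0):.0f}% slower")   # -> 25% slower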