Description of problem (please be as detailed as possible and provide log snippets):

There is a performance degradation on the VMware LSO platform in CephFS clone creation times in OCP 4.9 + ODF 4.9 vs OCP 4.8 + OCS 4.8.

The detailed report is available here:
https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#heading=h.x348ywf5r26r

CephFS clone creation times in OCP 4.8 + OCS 4.8:
Clone size: 1 Gi, creation time: 2.63 sec
Clone size: 25 Gi, creation time: 64.12 sec
Clone size: 50 Gi, creation time: 95.025 sec

CephFS clone creation times in OCP 4.9 + ODF 4.9:
Clone size: 1 Gi, creation time: 8.12 sec
Clone size: 25 Gi, creation time: 64.74 sec (the same as in 4.8 + 4.8)
Clone size: 50 Gi, creation time: 193.26 sec

Please note that the degradation for the 1 Gi and 50 Gi clones is consistent - the test was run a number of times with similar results.

Version-Release number of selected component (if applicable):

OCS versions
==============
NAME                     DISPLAY                       VERSION   REPLACES   PHASE
noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded

ODF (OCS) build: full_version: 4.9.0-210.ci

Rook versions
===============
2021-11-04 09:27:36.633082 I | op-flags: failed to set flag "logtostderr". no such flag -logtostderr
rook: 4.9-210.f6e2005.release_4.9
go: go1.16.6

Ceph versions
===============
ceph version 16.2.0-143.el8cp (0e2c6f9639c37a03e55885fb922dc0cb1b5173cb) pacific (stable)

The full version list is available here:
http://ocsperf.ceph.redhat.com/logs/Performance_tests/4.9/RC0/Vmware-LSO/versions.txt

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Not relevant for a performance bug.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Not relevant

If this is a regression, please provide more details to justify this:

CephFS clone creation times in OCP 4.8 + OCS 4.8:
Clone size: 1 Gi, creation time: 2.63 sec
Clone size: 25 Gi, creation time: 64.12 sec
Clone size: 50 Gi, creation time: 95.025 sec

CephFS clone creation times in OCP 4.9 + ODF 4.9:
Clone size: 1 Gi, creation time: 8.12 sec
Clone size: 25 Gi, creation time: 64.74 sec (the same as in 4.8 + 4.8)
Clone size: 50 Gi, creation time: 193.26 sec

Please note that the degradation for the 1 Gi and 50 Gi clones is consistent - the test was run a number of times with similar results.

Steps to Reproduce:
1. Run the test_pvc_clone_performance.py test on a VMware LSO cluster.
2. Compare the clone creation times to the 4.8 measurements.

Actual results:
The current measurements show degradation (longer creation times for the 1 Gi and 50 Gi clones) on OCP 4.9 + ODF 4.9 vs OCP 4.8 + OCS 4.8.

Expected results:
The measurements should be the same or better (shorter times) in 4.9 + 4.9.

Additional info:
The full VMware LSO comparison report is available here:
https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#
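For illustration only, a minimal sketch (not the actual ocs-ci test) of how a clone creation time like the ones above can be measured: create a CephFS PVC clone via a dataSource reference to a parent PVC and time how long it takes to reach Bound. The namespace, PVC names, and storage class below are assumed placeholders, and the sketch assumes Immediate volume binding; the real test may derive its timing differently (e.g., from provisioning logs or events).

# Sketch: time a CephFS PVC clone from creation to Bound.
# NAMESPACE, PARENT_PVC, and the clone name are hypothetical placeholders.
import time
from kubernetes import client, config

NAMESPACE = "clone-perf-test"                  # hypothetical namespace
PARENT_PVC = "cephfs-parent-pvc"               # hypothetical parent PVC
STORAGE_CLASS = "ocs-storagecluster-cephfs"    # default ODF CephFS SC (adjust if different)
CLONE_NAME = "cephfs-clone-50g"
CLONE_SIZE = "50Gi"

config.load_kube_config()
core = client.CoreV1Api()

clone = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name=CLONE_NAME),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        storage_class_name=STORAGE_CLASS,
        resources=client.V1ResourceRequirements(requests={"storage": CLONE_SIZE}),
        # Cloning an existing PVC is requested via dataSource.
        data_source=client.V1TypedLocalObjectReference(
            kind="PersistentVolumeClaim", name=PARENT_PVC),
    ),
)

start = time.time()
core.create_namespaced_persistent_volume_claim(NAMESPACE, clone)
while True:
    pvc = core.read_namespaced_persistent_volume_claim(CLONE_NAME, NAMESPACE)
    if pvc.status.phase == "Bound":
        break
    time.sleep(1)
print(f"Clone creation time: {time.time() - start:.2f} sec")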
@ypersky.com I don't see a detailed description in this bug. Can you please update it with the report and other important details?
I apologize for not providing a proper description earlier. The first comment has been updated with all the information; please let me know if you need any further input.
@Yug Gupta, It is not possible to deploy OCP 4.8 + ODF 4.9 on a VMware LSO cluster; this combination is not supported.

As for other platforms, here is the AWS 4.9 report:
https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.2m8gdjc4jhzo

From the comparison between 4.8 and 4.9, no degradation in clone creation times is seen on AWS. However, we do see degradation on VMware LSO.
@Yug Gupta, So which component should I change this BZ to?
Regarding must-gather logs: unfortunately we did not collect them, and the cluster is no longer available. If needed, I can reproduce the problem on a newly deployed cluster and collect must-gather manually, or start the test from Jenkins so that must-gather is collected automatically.
ypersky, we will need must-gather here to calculate and check the time spent by ceph-csi.
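To illustrate what checking the ceph-csi time from a must-gather could look like: a rough sketch that computes the elapsed time between two klog-formatted lines in a provisioner/ceph-csi log. The regex patterns and PVC name below are assumptions, not the actual log messages; adjust them to whatever the logs in the must-gather actually contain.

# Sketch: elapsed time between two matching lines in a klog-style log file.
import re
import sys
from datetime import datetime

# klog header, e.g. "I0228 12:40:19.633082 ..." (severity + MMDD + time)
KLOG_TS = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)")

def timestamp(line, year=2022):
    """Parse the klog timestamp at the start of a log line, if present."""
    m = KLOG_TS.match(line)
    if not m:
        return None
    return datetime.strptime(f"{year}{m.group(1)} {m.group(2)}", "%Y%m%d %H:%M:%S.%f")

def elapsed(log_path, start_pattern, end_pattern):
    """Seconds between the first line matching start_pattern and the next matching end_pattern."""
    start = end = None
    start_re, end_re = re.compile(start_pattern), re.compile(end_pattern)
    with open(log_path) as f:
        for line in f:
            if start is None and start_re.search(line):
                start = timestamp(line)
            elif start is not None and end_re.search(line):
                end = timestamp(line)
                break
    return (end - start).total_seconds() if start and end else None

if __name__ == "__main__":
    # Example: time between the provisioning request and the success message for
    # a clone PVC named "pvc-clone-50g" (both patterns are placeholder assumptions).
    print(elapsed(sys.argv[1],
                  r"provision.*pvc-clone-50g",
                  r"successfully provisioned.*pvc-clone-50g"))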
I've run the test_pvc_clone_performance.py test again on a 4.9 (OCP + ODF) VMware LSO cluster.

This is the link to the Jenkins job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10491/

This is the link to the must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646291211/

Please note that there is a chance the relevant must-gather is located in one of the testcase* directories here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/
I'm keeping this link here to be on the safe side; however, the first link should contain the relevant logs.

I've also run the test_pvc_clone_performance.py test on a 4.10 VMware LSO cluster, and the must-gather is available here:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10162/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/
Please also note the following:

1) The VMware LSO comparison report is available here:
https://docs.google.com/document/d/19ZRfwhfbpYF2f6hUxCM5lCt0uLoNo3ibOMWTbZTUTqw/edit#

2) When I run the test_pvc_clone_creation_performance.py test on a newly deployed 4.9 OCP + 4.9 ODF cluster, the CephFS clone creation measurements are:

4.9 CephFS clone creation times:
1 GB clone: 1.758 sec
25 GB clone: 64.390 sec
50 GB clone: 128.320 sec
100 GB clone: 256.135 sec

These measurements are similar to the 4.8 results (taken from this bug's description):
Clone size: 1 Gi, creation time: 2.63 sec
Clone size: 25 Gi, creation time: 64.12 sec
Clone size: 50 Gi, creation time: 95.025 sec

and much better than the 4.9 results also mentioned in the description of this bug (copied here):
Clone size: 1 Gi, creation time: 8.12 sec
Clone size: 25 Gi, creation time: 64.74 sec (the same as in 4.8 + 4.8)
Clone size: 50 Gi, creation time: 193.26 sec

Also please note the current 4.10 results:

4.10 CephFS clone creation times:
1 GB clone: 2.021 sec
25 GB clone: 49.336 sec
50 GB clone: 131.583 sec
100 GB clone: 258.400 sec

I have an explanation for the different 4.9 measurements: it looks like we need to add more samples to the clone creation/deletion test (this is already in our work plan).

Taking all of the above into consideration, I think we can close this bug. The only degradation in 4.10 vs 4.8 is in the CephFS 50 GB clone (~30%). However, the 100 GB clone creation time is similar in both 4.8 and 4.10. QE should indeed add more samples to this test so that the measurements are more accurate.
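For reference, a small sketch that expresses the clone creation measurements quoted in this bug as a relative change; the values used below are the 4.8 vs 4.9 numbers from the bug description.

# Sketch: percentage change of 4.9 clone creation times relative to 4.8
# (values copied from the bug description).
def pct_change(old, new):
    """Percentage change of `new` relative to `old` (positive = slower)."""
    return (new - old) / old * 100

times_48 = {"1Gi": 2.63, "25Gi": 64.12, "50Gi": 95.025}
times_49 = {"1Gi": 8.12, "25Gi": 64.74, "50Gi": 193.26}

for size in times_48:
    print(f"{size}: {pct_change(times_48[size], times_49[size]):+.1f}%")
# Output: 1Gi: +208.7%, 25Gi: +1.0%, 50Gi: +103.4%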