Description of problem:

On both platforms we found a degradation in CephFS snapshot restore times when comparing OCP 4.9 + ODF 4.9 results to OCP 4.8 + OCS 4.8 results: on AWS, a severe degradation of over 200% for a 100 GiB snapshot, and on VMware LSO, a 30-50% degradation for 10 GiB and 100 GiB snapshots.

AWS 4.9 vs 4.8 comparison report: https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#
VMware LSO 4.9 vs 4.8 comparison report: https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#

Version-Release number of selected component (if applicable):

OCS versions
==============
NAME                      DISPLAY                       VERSION   REPLACES   PHASE
noobaa-operator.v4.9.0    NooBaa Operator               4.9.0                Succeeded
ocs-operator.v4.9.0       OpenShift Container Storage   4.9.0                Succeeded
odf-operator.v4.9.0       OpenShift Data Foundation     4.9.0                Succeeded

ODF (OCS) build: full_version: 4.9.0-210.ci

Rook versions
===============
2021-11-04 09:27:36.633082 I | op-flags: failed to set flag "logtostderr". no such flag -logtostderr
rook: 4.9-210.f6e2005.release_4.9
go: go1.16.6

Ceph versions
===============
ceph version 16.2.0-143.el8cp (0e2c6f9639c37a03e55885fb922dc0cb1b5173cb) pacific (stable)

Full version list is available here: http://ocsperf.ceph.redhat.com/logs/Performance_tests/4.9/RC0/Vmware-LSO/versions.txt

How reproducible:

Steps to Reproduce:
1. Run the pvc_snapshot_performance.py test on AWS and VMware LSO clusters (a minimal timing sketch of the restore measurement is shown below).
2. Compare the measured snapshot restore times to the OCP 4.8 + OCS 4.8 results.
3.

Actual results:
The restore times measured on both the AWS and VMware LSO clusters with OCP 4.9 + ODF 4.9 are worse than with 4.8; see the AWS and VMware LSO comparison reports linked above.

Expected results:
Snapshot restore times should be no worse than with OCP 4.8 + OCS 4.8.

Additional info:
The AWS and VMware LSO 4.9 vs 4.8 comparison reports are linked above.
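For reference, the restore measurement boils down to creating a PVC from an existing VolumeSnapshot and timing how long it takes to become usable. Below is a minimal sketch of that idea, not the actual ocs-ci test code: the namespace, snapshot name, PVC name and size are placeholders, the storage class name is an assumption about a typical ODF CephFS class, and it treats "PVC reaches Bound" as a proxy for restore time. It assumes `oc` is on PATH and logged in to the cluster.

#!/usr/bin/env python3
"""Sketch: time a PVC restore from a CephFS VolumeSnapshot."""
import subprocess
import time

NAMESPACE = "snapshot-perf"              # placeholder
SNAPSHOT_NAME = "cephfs-snap-100gi"      # placeholder
RESTORE_PVC = "cephfs-restore-100gi"     # placeholder
STORAGE_CLASS = "ocs-storagecluster-cephfs"  # assumed typical ODF CephFS storage class
SIZE = "100Gi"

PVC_MANIFEST = f"""
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {RESTORE_PVC}
  namespace: {NAMESPACE}
spec:
  storageClassName: {STORAGE_CLASS}
  dataSource:
    name: {SNAPSHOT_NAME}
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: {SIZE}
"""

start = time.time()
# Create the restored PVC from the snapshot.
subprocess.run(["oc", "apply", "-f", "-"], input=PVC_MANIFEST,
               text=True, check=True)

# Poll until the restored PVC is Bound; use the elapsed time as the restore time.
while True:
    phase = subprocess.run(
        ["oc", "-n", NAMESPACE, "get", "pvc", RESTORE_PVC,
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True).stdout.strip()
    if phase == "Bound":
        break
    time.sleep(1)

print(f"Restore time for {SIZE}: {time.time() - start:.2f} sec")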
Comparison data from the VMware LSO report (https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#):

VMware LSO - snapshot restore times - CephFS - OCP 4.8 + OCS 4.8
1 Gi:   2.74 sec
10 Gi:  12.6 sec
100 Gi: 148.63 sec

VMware LSO - snapshot restore times - CephFS - OCP 4.9 + ODF 4.9
1 Gi:   6.489 sec
10 Gi:  21.46 sec
100 Gi: 213.9 sec

Comparison data from the AWS report (https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#):

AWS - snapshot restore times - CephFS - OCP 4.8 + OCS 4.8
100 Gi: 145.0 sec

AWS - snapshot restore times - CephFS - OCP 4.9 + ODF 4.9
100 Gi: 500 sec
@Yug Gupta, the AWS report with the OCP 4.8 + ODF 4.9 results is available here: https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.jmpjljqpvc8y

Measurements from that report:

AWS - CephFS snapshot restore times (OCP 4.8 + OCS 4.8)
1 Gi:   3.48 sec
10 Gi:  36.4 sec
100 Gi: 145.0 sec

AWS - CephFS snapshot restore times (OCP 4.8 + ODF 4.9)
1 Gi:   3.848 sec
10 Gi:  43.06 sec
100 Gi: 427.47 sec (194% degradation!)

Conclusion: on AWS (OCP 4.8 + ODF 4.9) we also see a degradation in CephFS snapshot restore times, especially for the 100 Gi snapshot.
Unfortunately, the must-gather logs were not kept. Please let me know if you need me to reproduce this bug and supply a must-gather from a newly deployed cluster.
@ypersky, we will need a must-gather here to calculate and check the time spent by ceph-csi.
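As a side note, once a must-gather is available, the time spent in ceph-csi can be roughly estimated from the CSI provisioner log timestamps. A rough sketch of that idea is below; the klog-style timestamp prefix, the "GRPC call:" / "GRPC response:" wording, the hardcoded year, and the log file name are all assumptions about the log format and should be checked against the actual logs (real analysis should also match request IDs rather than pairing a call with the next response).

import re
from datetime import datetime

# Assumed klog-style prefix, e.g. "I0228 12:40:19.123456 ..."; adjust to the real format.
TS_RE = re.compile(r"^[IWE](\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)")

def parse_ts(line):
    m = TS_RE.match(line)
    if not m:
        return None
    # klog prefixes carry no year; assume one (here 2022) just for duration math.
    return datetime.strptime(f"2022{m.group(1)} {m.group(2)}", "%Y%m%d %H:%M:%S.%f")

def create_volume_durations(log_path):
    """Pair each CreateVolume call line with the next response line and
    report elapsed seconds. This is only an approximation: the marker
    strings are assumptions about ceph-csi log wording."""
    durations, start = [], None
    with open(log_path) as f:
        for line in f:
            ts = parse_ts(line)
            if ts is None:
                continue
            if "GRPC call:" in line and "CreateVolume" in line:
                start = ts
            elif "GRPC response:" in line and start is not None:
                durations.append((ts - start).total_seconds())
                start = None
    return durations

if __name__ == "__main__":
    # Placeholder log file name from the must-gather.
    for d in create_volume_durations("csi-cephfsplugin-provisioner.log"):
        print(f"CreateVolume took {d:.2f} sec")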
I've run the snapshot performance test again on the 4.10 and 4.9 VMware LSO platforms (a report with links to the dashboards is available here: https://docs.google.com/document/d/19ZRfwhfbpYF2f6hUxCM5lCt0uLoNo3ibOMWTbZTUTqw/edit#).

VMware LSO 4.10 snapshot restore times:
Note: the relevant Jenkins run is https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10162/
Must-gather logs are available here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/
1 GB snapshot:   4.478 sec
10 GB snapshot:  18.663 sec
100 GB snapshot: 151.306 sec

VMware LSO 4.9 snapshot restore times - new run:
Note: must-gather logs for this run are available here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646258487/ or here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646291211/
The relevant Jenkins run is https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10483/
1 GB snapshot:   7.301 sec
10 GB snapshot:  19.610 sec
100 GB snapshot: 172.093 sec

VMware LSO - snapshot restore times - CephFS - OCP 4.9 + ODF 4.9 (old run - no must-gather available):
1 Gi:   6.489 sec
10 Gi:  21.46 sec
100 Gi: 213.9 sec

VMware LSO 4.8 snapshot restore times (from this report: https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.jmpjljqpvc8y) - no must-gather is available for this run:
1 Gi:   2.74 sec
10 Gi:  12.6 sec
100 Gi: 148.63 sec

Please note that the test reports the AVERAGE of 3 sampled snapshots (creation/restore times and speed).

We still see a degradation of CephFS snapshot restore times in both 4.9 and 4.10 versus the 4.8 results, and I think this degradation should be investigated and fixed. Please let me know if you need any other information from my side.

Please also note that it is not possible to deploy a mixed-version OCP + ODF cluster on VMware LSO.
One thing I did not add to this BZ is a 4.8 must-gather. For that I would need to deploy a new 4.8 VMware LSO cluster, which is resource-consuming; we currently have 4.9 and 4.10 VMware LSO clusters on that DC with tests running, so if you need a 4.8 cluster, it can be deployed AFTER the tests on 4.9 and 4.10 are done. Please let me know if the must-gathers already supplied are sufficient. If not, I'll look for a way to deploy a 4.8 VMware LSO cluster.
> VMware LSO 4.8 snapshot restore times (from this report:
> https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.jmpjljqpvc8y) - no must-gather is available for this run:
>
> 1 Gi: 2.74 sec
> 10 Gi: 12.6 sec
> 100 Gi: 148.63 sec

The durations from the referenced doc for OCS 4.8 are different:
1 Gi:   3.48
10 Gi:  36.4
100 Gi: 145.0

I would say these values pretty much match the ones for 4.10:
1 GB snapshot:   4.478 sec
10 GB snapshot:  18.663 sec
100 GB snapshot: 151.306 sec

> Please note that the test reports the AVERAGE of 3 sampled snapshots
> (creation/restore times and speed).

Are the 4.8 runs also averaged values?
From the data shared by Yuli, I see an improvement in the 4.10 build in comparison to the new 4.9 build. Also, we don't know what went wrong in the earlier run that showed the degradation, and without a must-gather we can't debug it either. IMO, there is nothing to debug here, as we are getting a better result.
@Rakshith -

> The durations from the referenced doc for OCS 4.8 are different:
>
> 1 Gi: 3.48
> 10 Gi: 36.4
> 100 Gi: 145.0

Where do those measurements appear? I'm looking at this doc (page 13): https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit# and the 4.8 CephFS snapshot restore times there are:
1 Gi:   2.74
10 Gi:  12.6
100 Gi: 148.63

If we compare those results to the latest 4.10 results (taken from this report: https://docs.google.com/document/d/19ZRfwhfbpYF2f6hUxCM5lCt0uLoNo3ibOMWTbZTUTqw/edit#), the 4.10 CephFS snapshot restore times are:
1 Gi:   4.478
10 Gi:  18.663
100 Gi: 151.306

===> We do see a degradation in the restore times for the 1 Gi and 10 Gi snapshot sizes.

And per your question - yes, even the 4.8 results are averages of 3 samples.

So do you think this degradation is meaningless, or is it worth investigating?
I think this BZ is not a blocker: even if there is a regression between 4.10 and 4.8, it is not a major one. Regarding re-running the test with more samples (currently it runs with 10 samples) - let's talk and try to understand how many samples need to be run manually to finally determine whether there is a regression or not; see the small averaging sketch below.
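One simple way to decide whether more samples are needed is to look at the spread of the per-sample restore times rather than only their average. A minimal sketch of that check (the sample values below are made up for illustration, not real measurements):

from statistics import mean, stdev

def summarize(samples):
    """Return mean and sample standard deviation of restore times (seconds)."""
    return mean(samples), stdev(samples)

# Hypothetical per-sample restore times for a 10 Gi snapshot on two builds.
old_build = [12.1, 13.0, 12.7]   # illustrative values only
new_build = [18.2, 19.1, 18.6]   # illustrative values only

for name, samples in (("4.8", old_build), ("4.10", new_build)):
    m, s = summarize(samples)
    print(f"{name}: mean={m:.2f}s stdev={s:.2f}s n={len(samples)}")

# If the two means differ by much more than the combined spread (several
# standard deviations), a few samples are enough to call it a regression;
# if the ranges overlap, more samples are needed before concluding anything.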
Moving this out of 4.10 based on the above comment; we will keep investigating.
@Rakshith, the report you are looking at (https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.jmpjljqpvc8y) is the AWS 4.9 vs 4.8 comparison. Therefore, on AWS the 4.8 CephFS snapshot restore times are:
1 Gi:   3.48
10 Gi:  36.4
100 Gi: 145.0

On VMware LSO, the 4.8 CephFS snapshot restore times are:
1 Gi:   2.74
10 Gi:  12.6
100 Gi: 148.63

The rest of the numbers I posted here refer to VMware LSO as well. The 4.10 CephFS snapshot restore times are:
1 Gi:   4.478
10 Gi:  18.663
100 Gi: 151.306

So we do see a degradation in CephFS restore times in 4.10 for the 1 Gi and 10 Gi snapshots. How do we proceed from here?

Please note the following:

VMware LSO 4.10 snapshot restore times:
Note: the relevant Jenkins run is https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10162/
Must-gather logs are available here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/
1 GB snapshot:   4.478 sec
10 GB snapshot:  18.663 sec
100 GB snapshot: 151.306 sec

VMware LSO 4.9 snapshot restore times - new run:
Note: must-gather logs for this run are available here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646258487/ or here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646291211/
The relevant Jenkins run is https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10483/
1 GB snapshot:   7.301 sec
10 GB snapshot:  19.610 sec
100 GB snapshot: 172.093 sec
@ypersky, from the details shared above, I see:

VMware LSO 4.10 snapshot restore times:
1 GB snapshot:   4.478 sec
10 GB snapshot:  18.663 sec
100 GB snapshot: 151.306 sec

VMware LSO 4.9 snapshot restore times - new run:
1 GB snapshot:   7.301 sec
10 GB snapshot:  19.610 sec
100 GB snapshot: 172.093 sec

This indicates that we have an improvement in the 4.10 build, and I am not sure how we can debug the earlier build that showed the degradation without any must-gather or other information.
@Yati, I think we can close this BZ (since the 4.8 results may have been affected by an insufficient number of samples); if we see any degradation in the future, I'll open a new BZ.
Thanks, closing this bug.