Bug 2023512
| Summary: | VMware LSO - slight degradation in RBD snapshot creation times in OCP 4.9 + ODF 4.9 vs OCP 4.8 + OCS 4.8 | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Yuli Persky <ypersky> |
| Component: | csi-driver | Assignee: | yati padia <ypadia> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Elad <ebenahar> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.10 | CC: | alayani, jopinto, kramdoss, madam, mmuench, mrajanna, muagarwa, ocs-bugs, odf-bz-bot, rar, ypadia |
| Target Milestone: | --- | Keywords: | Automation, Performance |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-10-04 02:13:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Yuli Persky
2021-11-15 22:24:46 UTC
@Madhu - I plan to test this on OCP 4.8 + ODF 4.9 and will update this BZ with the results once they are available.

Comparison data from the report (https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#):

RBD snapshot creation times in OCP4.8+OCS4.8:
1 Gi: 0.526 sec
10 Gi: 0.191 sec
100 Gi: 0.377 sec

RBD snapshot creation times in OCP4.9+ODF4.9:
1 Gi: 0.63 sec
10 Gi: 0.34 sec
100 Gi: 0.56 sec

(In reply to Yuli Persky from comment #4)
> Comparison data from the report
> (https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#):
>
> RBD snapshot creation times in OCP4.8+OCS4.8:
>
> 1 Gi: 0.526 sec
> 10 Gi: 0.191 sec
> 100 Gi: 0.377 sec
>
> RBD snapshot creation times in OCP4.9+ODF4.9:
>
> 1 Gi: 0.63 sec
> 10 Gi: 0.34 sec
> 100 Gi: 0.56 sec

@ypersky did you try this on OCP 4.8 + ODF 4.9 as mentioned by Madhu? It would be great if you could share the details for this combination.

@yati padia and @Madhu,
After trying to deploy a VMware LSO cluster with OCP 4.8 and ODF 4.9, it was found that this combination is not supported. It is not possible to deploy an LSO cluster with OCP 4.8 and ODF 4.9, so I cannot provide any statistics or measurements from such a cluster.

@Yug Gupta,
As you've mentioned, it is not possible to deploy ODF 4.9 with OCP 4.8 on VMware LSO.
As for other platforms, the AWS comparison report is available here: https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.2m8gdjc4jhzo and it contains a comparison between OCP 4.8 + OCS 4.8 and OCP 4.9 + ODF 4.9.
As can be seen from the AWS comparison, for the 100GB RBD snapshot there is also a degradation of 100%.

AWS:
100GB RBD snapshot in 4.8 took 0.79 sec to create
100GB RBD snapshot in 4.9 took 1.6 sec to create

However, for 1GB/10GB snapshots on AWS we see an improvement.
At the same time, on VMware LSO we see degradation in snapshot creation times for all the snapshot sizes:

VMware LSO:

RBD snapshot creation times in OCP4.8+OCS4.8:
1 Gi: 0.526 sec
10 Gi: 0.191 sec
100 Gi: 0.377 sec

RBD snapshot creation times in OCP4.9+ODF4.9:
1 Gi: 0.63 sec
10 Gi: 0.34 sec
100 Gi: 0.56 sec

@Yug Gupta,
The relevant AWS comparison is available here: https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#

AWS - RBD snapshot creation times in OCP 4.8 + OCS 4.8:
1 Gi: 1.69 sec
10 Gi: 1.507 sec
100 Gi: 0.79 sec

AWS - RBD snapshot creation times in OCP 4.8 + ODF 4.9:
1 Gi: 1.99 sec (degradation of 17%)
10 Gi: 1.84 sec (degradation of 22%)
100 Gi: 1.72 sec (degradation of 117%!)

Conclusion: on OCP 4.8 + ODF 4.9 we see a similar degradation as on VMware LSO.

@Rakshith per your question - unfortunately the cluster is not available now and the must gather was not kept on Jenkins, so it's not available either.
We will definitely make a note and make sure that all Performance bugs are opened with a link to the must gather logs.

ypersky we will need must-gather here to calculate and check the time spent by ceph-csi.

Not a 4.10 blocker

1) I've run the test_pvc_snapshot_performance test again on VMWARE LSO 4.10 ocp + 4.10 odf and once again on VMWARE LSO 4.9 ocp + 4.9 odf.
4.9 + 4.9 Jenkins Job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10483/
4.9 ocp + odf must gather logs link: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646258487/ or http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646291211/

4.10 + 4.10 Jenkins Job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10162/
4.10 ocp + odf must gather logs link: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/

2) Let me summarise the results of ALL the test runs:

RBD snapshot creation times in VMWARE LSO OCP4.8+OCS4.8 (no must gather available):
1 Gi: 0.526 sec
10 Gi: 0.191 sec
100 Gi: 0.377 sec

RBD snapshot creation times in VMWARE LSO OCP4.9+ODF4.9 (this is the FIRST 4.9 test run - no must gather available):
1 Gi: 0.63 sec
10 Gi: 0.34 sec
100 Gi: 0.56 sec

RBD snapshot creation times in VMWARE LSO OCP4.9+ODF4.9 (this is the SECOND test run - the link to the Jenkins job and must gather logs is at the beginning of this comment):
1 Gi: 0.501 sec
10 Gi: 0.466 sec
100 Gi: 0.798 sec

RBD snapshot creation times in VMWARE LSO OCP4.10+ODF4.10 (the link to the Jenkins job and must gather logs is at the beginning of this comment):
1 Gi: 1.875 sec
10 Gi: 1.893 sec
100 Gi: 1.806 sec

3) As we can see, the SECOND 4.9 run confirms the 4.9 degradation vs 4.8, and the 4.10 run shows a HUGE degradation compared to 4.9 and certainly to 4.8.

Please approve changing (or change yourself) this bug's name to: VMWARE LSO - Significant degradation in RBD snapshot creation times in ODF 4.10 vs 4.9/4.8.

I think that at this point this is what we need to fix, since we are seeing successive degradation starting in 4.9, and in 4.10 this becomes a major degradation.
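
The run-to-run comparison summarised above can be recomputed with a few lines of Python. This is only a sketch: the timings are copied from the VMware LSO summary in this comment, while the names `RUNS`, `percent_change` and `compare` are illustrative and not part of the ocs-ci test suite.

```python
# Snapshot creation times in seconds, copied from the VMware LSO summary above.
RUNS = {
    "OCP4.8+OCS4.8": {"1Gi": 0.526, "10Gi": 0.191, "100Gi": 0.377},
    "OCP4.9+ODF4.9 (first run)": {"1Gi": 0.63, "10Gi": 0.34, "100Gi": 0.56},
    "OCP4.9+ODF4.9 (second run)": {"1Gi": 0.501, "10Gi": 0.466, "100Gi": 0.798},
    "OCP4.10+ODF4.10": {"1Gi": 1.875, "10Gi": 1.893, "100Gi": 1.806},
}


def percent_change(baseline, current):
    """Relative change of `current` vs `baseline` in percent (positive = slower)."""
    return (current - baseline) / baseline * 100.0


def compare(baseline_name, current_name, runs=RUNS):
    """Per-size percent change between two named runs, rounded to one decimal."""
    base, cur = runs[baseline_name], runs[current_name]
    return {size: round(percent_change(base[size], cur[size]), 1) for size in base}


if __name__ == "__main__":
    baseline = "OCP4.8+OCS4.8"
    for name in RUNS:
        if name != baseline:
            print(name, "vs", baseline, compare(baseline, name))
```

Printing the per-size figures side by side makes it easier to see which sizes moved between runs, rather than relying on a single headline number.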
Note: the full VMware LSO comparison report (including links to the Performance dashboard), 4.10 vs 4.9, is available here: https://docs.google.com/document/d/19ZRfwhfbpYF2f6hUxCM5lCt0uLoNo3ibOMWTbZTUTqw/edit#

Hi ypersky,
Checking the logs for ODF4.10, I see the snapshot creation was not successful:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-7052e55c850e1630655e5455edd86de46b070911af017d34868a9f9f7045f6d2/namespaces/openshift-storage/pods/csi-rbdplugin-provisioner-dccd97fb8-7cwcp/csi-rbdplugin/csi-rbdplugin/logs/current.log

```
022-02-22T08:18:44.697840602Z I0222 08:18:44.697428 1 utils.go:191] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC call: /csi.v1.Controller/CreateSnapshot
2022-02-22T08:18:44.697840602Z I0222 08:18:44.697548 1 utils.go:195] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC request: {"name":"snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc","parameters":{"clusterID":"openshift-storage"},"secrets":"***stripped***","source_volume_id":"0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218"}
2022-02-22T08:18:44.697840602Z E0222 08:18:44.697740 1 controllerserver.go:1024] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc failed to get backend volume for 0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218: pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.697840602Z E0222 08:18:44.697765 1 utils.go:200] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC error: rpc error: code = NotFound desc = pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.723092019Z I0222 08:18:44.723039 1 utils.go:191] ID: 5395 GRPC call: /csi.v1.Identity/GetPluginInfo
2022-02-22T08:18:44.723124191Z I0222 08:18:44.723103 1 utils.go:195] ID: 5395 GRPC request: {}
2022-02-22T08:18:44.723124191Z I0222 08:18:44.723110 1 identityserver-default.go:38] ID: 5395 Using default GetPluginInfo
2022-02-22T08:18:44.723150461Z I0222 08:18:44.723136 1 utils.go:202] ID: 5395 GRPC response: {"name":"openshift-storage.rbd.csi.ceph.com","vendor_version":"release-4.10"}
2022-02-22T08:18:44.723582197Z I0222 08:18:44.723557 1 utils.go:191] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC call: /csi.v1.Controller/CreateSnapshot
2022-02-22T08:18:44.723654902Z I0222 08:18:44.723624 1 utils.go:195] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC request: {"name":"snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc","parameters":{"clusterID":"openshift-storage"},"secrets":"***stripped***","source_volume_id":"0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218"}
2022-02-22T08:18:44.723891758Z E0222 08:18:44.723806 1 controllerserver.go:1024] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc failed to get backend volume for 0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218: pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.723891758Z E0222 08:18:44.723847 1 utils.go:200] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC error: rpc error: code = NotFound desc = pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.747813112Z I0222 08:18:44.747763 1 utils.go:191] ID: 5397 GRPC call: /csi.v1.Identity/GetPluginInfo
2022-02-22T08:18:44.747841524Z I0222 08:18:44.747822 1 utils.go:195] ID: 5397 GRPC request: {}
2022-02-22T08:18:44.747841524Z I0222 08:18:44.747828 1 identityserver-default.go:38] ID: 5397 Using default GetPluginInfo
2022-02-22T08:18:44.747902159Z I0222 08:18:44.747855 1 utils.go:202] ID: 5397 GRPC response: {"name":"openshift-storage.rbd.csi.ceph.com","vendor_version":"release-4.10"}
```

Did you check if the snapshots were successfully created with `oc get volumesnapshot`?
And if the snapshot was not created, how did you calculate the time?
Please correct me if I am looking at the wrong place.
cc @rar

@Yati,
I'm sure that all the sampled snapshots were successfully created. Otherwise the test would not pass (and it did pass).

@ypersky We have another bug opened to resolve the issue of `pool not found`. Once this is resolved you can retest it.
Link to the bug: https://bugzilla.redhat.com/show_bug.cgi?id=1972013

Moving to 4.12 while we are waiting for the results

We'll re-run the test once we have a VMWARE LSO cluster (currently we do not have the resources to deploy such a cluster).
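
When retesting, one way to cross-check the provisioner log against the test result is to group `GRPC error` lines by `Req-ID` for `CreateSnapshot` calls. The following is a minimal sketch that assumes the line layout seen in the csi-rbdplugin excerpt quoted in this bug; the regular expressions and the function name `failed_create_snapshot_calls` are illustrative and not part of ceph-csi.

```python
import re
from collections import defaultdict

# Patterns assume the log line layout quoted earlier in this bug.
CALL_RE = re.compile(r"ID: (\d+) Req-ID: (\S+) GRPC call: (\S+)")
ERROR_RE = re.compile(r"ID: (\d+) Req-ID: (\S+) GRPC error: (.*)")


def failed_create_snapshot_calls(lines):
    """Return {req_id: [error message, ...]} for CreateSnapshot calls that logged a GRPC error."""
    create_ids = set()  # gRPC call IDs that belong to CreateSnapshot requests
    errors = defaultdict(list)
    for line in lines:
        call = CALL_RE.search(line)
        if call:
            if call.group(3).endswith("/CreateSnapshot"):
                create_ids.add(call.group(1))
            continue
        err = ERROR_RE.search(line)
        if err and err.group(1) in create_ids:
            errors[err.group(2)].append(err.group(3).strip())
    return dict(errors)


if __name__ == "__main__":
    import sys

    # Usage (path is whatever current.log you extracted from must-gather):
    #   python scan_log.py current.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for req_id, messages in failed_create_snapshot_calls(fh).items():
            print(req_id, len(messages), "failed attempt(s):", messages[-1])
```

An error on one attempt does not by itself show that the snapshot never became ready, since a later retry may succeed; `oc get volumesnapshot` remains the direct check.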