Description of problem (please be as detailed as possible and provide log snippets):

A degradation in RBD snapshot creation time was found on a 4.9 OCP + 4.9 ODF VMware LSO platform, for all PVC/snapshot sizes. The degradation is up to 50%, and it grows with the snapshot size.

The full VMware LSO comparison report is available here:
https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#heading=h.91a8e8qdrquk

Please note that the measurements are still small numbers (less than 1 second per snapshot creation). However, the results in the previous version were better.

Version-Release number of selected component (if applicable):

OCS versions
==============
NAME                     DISPLAY                       VERSION   REPLACES   PHASE
noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded

ODF (OCS) build: full_version: 4.9.0-210.ci

Rook versions
===============
2021-11-04 09:27:36.633082 I | op-flags: failed to set flag "logtostderr". no such flag -logtostderr
rook: 4.9-210.f6e2005.release_4.9
go: go1.16.6

Ceph versions
===============
ceph version 16.2.0-143.el8cp (0e2c6f9639c37a03e55885fb922dc0cb1b5173cb) pacific (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible? Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
The measurements are better on OCP 4.8 + OCS 4.8 on the same platform.

Steps to Reproduce:
1. Run the test_pvc_snapshot_performance.py test.
2. Measure RBD snapshot creation time on 4.9 + 4.9 and compare it to 4.8 + 4.8.
3.

Actual results:
RBD snapshot creation time on 4.9 + 4.9 is longer than on 4.8 + 4.8, while each measurement is still less than 1 second.

Expected results:
The measurements should be the same as or better than on 4.8 + 4.8.

Additional info:
Comparison report is available here:
https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#heading=h.cv93fomn5oqu
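For context on step 1, below is a minimal sketch of how a single RBD snapshot creation can be timed from the CLI. It only illustrates the measurement idea, assuming an existing VolumeSnapshot manifest; the function name and polling approach are mine and are not taken from test_pvc_snapshot_performance.py.

```python
# Hedged sketch: time a VolumeSnapshot from "oc apply" until status.readyToUse is True.
# The manifest path, snapshot name, and namespace are placeholders.
import json
import subprocess
import time


def time_snapshot_creation(manifest: str, name: str, namespace: str) -> float:
    start = time.time()
    subprocess.run(["oc", "apply", "-n", namespace, "-f", manifest], check=True)
    while True:
        out = subprocess.run(
            ["oc", "get", "volumesnapshot", name, "-n", namespace, "-o", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        if json.loads(out).get("status", {}).get("readyToUse"):
            return time.time() - start
        time.sleep(0.1)


# Example: elapsed = time_snapshot_creation("pvc-snap-1gi.yaml", "pvc-snap-1gi", "snapshot-test")
```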
@Madhu - I plan to test this on OCP 4.8 + ODF 4.9 and will update this BZ with the results once they are available.
Comparison data from the report (https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#):

RBD snapshot creation times on OCP 4.8 + OCS 4.8:
1 Gi:   0.526 sec
10 Gi:  0.191 sec
100 Gi: 0.377 sec

RBD snapshot creation times on OCP 4.9 + ODF 4.9:
1 Gi:   0.63 sec
10 Gi:  0.34 sec
100 Gi: 0.56 sec
(In reply to Yuli Persky from comment #4)
> Comparison data from the report
> (https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#):
>
> RBD snapshot creation times on OCP 4.8 + OCS 4.8:
> 1 Gi:   0.526 sec
> 10 Gi:  0.191 sec
> 100 Gi: 0.377 sec
>
> RBD snapshot creation times on OCP 4.9 + ODF 4.9:
> 1 Gi:   0.63 sec
> 10 Gi:  0.34 sec
> 100 Gi: 0.56 sec

@ypersky, did you try this on OCP 4.8 + ODF 4.9 as mentioned by Madhu? It would be great if you could provide the details for this combination.
@yati padia and @Madhu, after trying to deploy a VMware LSO cluster with OCP 4.8 and ODF 4.9, it was found that such a combination is not supported. It is not possible to deploy an LSO cluster with OCP 4.8 and ODF 4.9, therefore I cannot provide any statistics or measurements from such a cluster.
@Yug Gupta, as you've mentioned, it is not possible to deploy ODF 4.9 with OCP 4.8 on VMware LSO.

As for other platforms, the AWS comparison report is available here:
https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.2m8gdjc4jhzo
It contains a comparison between OCP 4.8 + OCS 4.8 and OCP 4.9 + ODF 4.9.

As can be seen from the AWS comparison, for a 100 GB RBD snapshot there is also a degradation of about 100%:

AWS:
100 GB RBD snapshot in 4.8 was created in 0.79 sec
100 GB RBD snapshot in 4.9 was created in 1.6 sec

However, for 1 GB/10 GB snapshots on AWS we see an improvement. At the same time, on VMware LSO we see a degradation in snapshot creation times for all the snapshot sizes:

VMware LSO:

RBD snapshot creation times on OCP 4.8 + OCS 4.8:
1 Gi:   0.526 sec
10 Gi:  0.191 sec
100 Gi: 0.377 sec

RBD snapshot creation times on OCP 4.9 + ODF 4.9:
1 Gi:   0.63 sec
10 Gi:  0.34 sec
100 Gi: 0.56 sec
@Yug Gupta, the relevant AWS comparison is available here:
https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#

AWS - RBD snapshot creation times on OCP 4.8 + OCS 4.8:
1 Gi:   1.69 sec
10 Gi:  1.507 sec
100 Gi: 0.79 sec

AWS - RBD snapshot creation times on OCP 4.8 + ODF 4.9:
1 Gi:   1.99 sec (degradation of 17%)
10 Gi:  1.84 sec (degradation of 22%)
100 Gi: 1.72 sec (degradation of 117%!)

Conclusion: on OCP 4.8 + ODF 4.9 we see a degradation similar to the one on VMware LSO.
@Rakshith, per your question - unfortunately the cluster is not available now, and the must-gather was not kept on Jenkins, so it's not available either. We will definitely make a note of this and make sure that all performance bugs are opened with a link to must-gather logs.
ypersky, we will need a must-gather here to calculate and check the time spent by ceph-csi.
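Once a must-gather is attached, one rough way to estimate the time ceph-csi itself spends per CreateSnapshot request is to pair each "GRPC call" line with the "GRPC response"/"GRPC error" line carrying the same ID in the csi-rbdplugin provisioner log. The sketch below assumes the log line layout shown in the snippet quoted later in this bug; the exact format may differ between builds.

```python
# Hedged sketch: pair CreateSnapshot "GRPC call" lines with the "GRPC response"/"GRPC error"
# line that carries the same ID, and report the elapsed time per call.
import re
from datetime import datetime

LINE = re.compile(r"^(\S+)Z .*\] ID: (\d+) (?:Req-ID: \S+ )?GRPC (call|response|error)")


def create_snapshot_durations(log_path: str) -> dict:
    starts, durations = {}, {}
    with open(log_path) as log:
        for line in log:
            match = LINE.match(line)
            if not match:
                continue
            ts_raw, call_id, kind = match.groups()
            # Trim the nanosecond timestamp to microseconds so strptime can parse it.
            ts = datetime.strptime(ts_raw[:26], "%Y-%m-%dT%H:%M:%S.%f")
            if kind == "call" and "/csi.v1.Controller/CreateSnapshot" in line:
                starts[call_id] = ts
            elif kind in ("response", "error") and call_id in starts:
                durations[call_id] = (ts - starts.pop(call_id)).total_seconds()
    return durations


# Example: create_snapshot_durations("csi-rbdplugin/logs/current.log")
```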
Not a 4.10 blocker
1) I've run the test_pvc_snapshot_performance test again on VMware LSO 4.10 OCP + 4.10 ODF, and once again on VMware LSO 4.9 OCP + 4.9 ODF.

4.9 + 4.9 Jenkins job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10483/
4.9 OCP + ODF must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646258487/
or
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646291211/

4.10 + 4.10 Jenkins job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10162/
4.10 OCP + ODF must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/

2) Let me summarise the results of ALL the test runs:

RBD snapshot creation times on VMware LSO OCP 4.8 + OCS 4.8 (no must-gather available):
1 Gi:   0.526 sec
10 Gi:  0.191 sec
100 Gi: 0.377 sec

RBD snapshot creation times on VMware LSO OCP 4.9 + ODF 4.9 (FIRST 4.9 test run - no must-gather available):
1 Gi:   0.63 sec
10 Gi:  0.34 sec
100 Gi: 0.56 sec

RBD snapshot creation times on VMware LSO OCP 4.9 + ODF 4.9 (SECOND test run - links to the Jenkins job and must-gather logs are at the beginning of this comment):
1 Gi:   0.501 sec
10 Gi:  0.466 sec
100 Gi: 0.798 sec

RBD snapshot creation times on VMware LSO OCP 4.10 + ODF 4.10 (links to the Jenkins job and must-gather logs are at the beginning of this comment):
1 Gi:   1.875 sec
10 Gi:  1.893 sec
100 Gi: 1.806 sec

3) As we can see, the SECOND 4.9 run confirms the 4.9 degradation vs 4.8, and the 4.10 run shows a huge degradation compared to 4.9, and certainly to 4.8.

Please approve changing (or change yourself) this bug title to: "VMware LSO - Significant degradation in RBD snapshot creation times in ODF 4.10 vs 4.9/4.8". I think that at this point this is what we need to fix, since we are seeing a progressive degradation starting in 4.9 that becomes a major degradation in 4.10.
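To put a number on the 4.10 regression, here is the arithmetic for the second 4.9 run vs the 4.10 run above (my own calculation from the listed values, not taken from the report):

```python
# Degradation of 4.10 vs the second 4.9 run, from the numbers above.
second_49 = {"1Gi": 0.501, "10Gi": 0.466, "100Gi": 0.798}
v410 = {"1Gi": 1.875, "10Gi": 1.893, "100Gi": 1.806}

for size in second_49:
    pct = (v410[size] / second_49[size] - 1) * 100
    print(f"{size}: {pct:.0f}% slower in 4.10")  # ~274%, ~306%, ~126%
```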
Note: the full VMware LSO comparison report for 4.10 vs 4.9 (including links to the Performance dashboard) is available here: https://docs.google.com/document/d/19ZRfwhfbpYF2f6hUxCM5lCt0uLoNo3ibOMWTbZTUTqw/edit#
Hi ypersky,

Checking the logs for ODF 4.10, I see the snapshot creation was not successful:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-7052e55c850e1630655e5455edd86de46b070911af017d34868a9f9f7045f6d2/namespaces/openshift-storage/pods/csi-rbdplugin-provisioner-dccd97fb8-7cwcp/csi-rbdplugin/csi-rbdplugin/logs/current.log

```
2022-02-22T08:18:44.697840602Z I0222 08:18:44.697428 1 utils.go:191] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC call: /csi.v1.Controller/CreateSnapshot
2022-02-22T08:18:44.697840602Z I0222 08:18:44.697548 1 utils.go:195] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC request: {"name":"snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc","parameters":{"clusterID":"openshift-storage"},"secrets":"***stripped***","source_volume_id":"0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218"}
2022-02-22T08:18:44.697840602Z E0222 08:18:44.697740 1 controllerserver.go:1024] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc failed to get backend volume for 0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218: pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.697840602Z E0222 08:18:44.697765 1 utils.go:200] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC error: rpc error: code = NotFound desc = pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.723092019Z I0222 08:18:44.723039 1 utils.go:191] ID: 5395 GRPC call: /csi.v1.Identity/GetPluginInfo
2022-02-22T08:18:44.723124191Z I0222 08:18:44.723103 1 utils.go:195] ID: 5395 GRPC request: {}
2022-02-22T08:18:44.723124191Z I0222 08:18:44.723110 1 identityserver-default.go:38] ID: 5395 Using default GetPluginInfo
2022-02-22T08:18:44.723150461Z I0222 08:18:44.723136 1 utils.go:202] ID: 5395 GRPC response: {"name":"openshift-storage.rbd.csi.ceph.com","vendor_version":"release-4.10"}
2022-02-22T08:18:44.723582197Z I0222 08:18:44.723557 1 utils.go:191] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC call: /csi.v1.Controller/CreateSnapshot
2022-02-22T08:18:44.723654902Z I0222 08:18:44.723624 1 utils.go:195] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC request: {"name":"snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc","parameters":{"clusterID":"openshift-storage"},"secrets":"***stripped***","source_volume_id":"0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218"}
2022-02-22T08:18:44.723891758Z E0222 08:18:44.723806 1 controllerserver.go:1024] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc failed to get backend volume for 0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218: pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.723891758Z E0222 08:18:44.723847 1 utils.go:200] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC error: rpc error: code = NotFound desc = pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.747813112Z I0222 08:18:44.747763 1 utils.go:191] ID: 5397 GRPC call: /csi.v1.Identity/GetPluginInfo
2022-02-22T08:18:44.747841524Z I0222 08:18:44.747822 1 utils.go:195] ID: 5397 GRPC request: {}
2022-02-22T08:18:44.747841524Z I0222 08:18:44.747828 1 identityserver-default.go:38] ID: 5397 Using default GetPluginInfo
2022-02-22T08:18:44.747902159Z I0222 08:18:44.747855 1 utils.go:202] ID: 5397 GRPC response: {"name":"openshift-storage.rbd.csi.ceph.com","vendor_version":"release-4.10"}
```

Did you check if the snapshots were successfully created with `oc get volumesnapshot`? And if the snapshot was not created, how did you calculate the time? Please correct me if I am looking at the wrong place.

cc @rar
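For future runs, a small check along these lines could be attached to the results to answer this directly, listing each VolumeSnapshot's readyToUse and creationTime in the test namespace (an assumed workflow using `oc`; the namespace in the usage example is a placeholder):

```python
# Hedged sketch: report readyToUse and creationTime for every VolumeSnapshot in a namespace,
# so the measured samples can be cross-checked against errors in the provisioner log.
import json
import subprocess


def snapshot_statuses(namespace: str):
    out = subprocess.run(
        ["oc", "get", "volumesnapshot", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    for item in json.loads(out).get("items", []):
        status = item.get("status", {})
        yield item["metadata"]["name"], status.get("readyToUse"), status.get("creationTime")


# Example:
# for name, ready, created in snapshot_statuses("snapshot-test"):
#     print(name, ready, created)
```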
@Yati, I'm sure that all the sampled snapshots were successfully created. Otherwise the test would not have passed (and it did pass).
@ypersky We have another bug opened to resolve the issue of `pool not found`. Once this is resolved you can retest it. Link to the bug: https://bugzilla.redhat.com/show_bug.cgi?id=1972013
Moving to 4.12 while we are waiting for the results
We'll re-run the test once we have a VMware LSO cluster (currently we do not have the resources to deploy such a cluster).