Bug 2023512 - VMware LSO - slight degradation in RBD snapshot creation times in OCP 4.9 + ODF 4.9 vs OCP 4.8 + OCS 4.8
Summary: VMware LSO - slight degradation in RBD snapshot creation times in OCP 4.9 + ODF 4.9 vs OCP 4.8 + OCS 4.8
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: yati padia
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-15 22:24 UTC by Yuli Persky
Modified: 2023-08-09 16:37 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-04 02:13:22 UTC
Embargoed:



Description Yuli Persky 2021-11-15 22:24:46 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

A degradation in RBD snapshot creation time was found on the OCP 4.9 + ODF 4.9 VMware LSO platform for all PVC/snapshot sizes. The degradation is up to 50%, and the larger the snapshot size, the larger the degradation.


The full VMware LSO comparison report is available here: 

https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#heading=h.91a8e8qdrquk

Please note that the measurements are still small numbers (less than 1 second per snapshot creation). However, the results in the previous version were better. 



Version-Release number of selected component (if applicable):

OCS versions
	==============

		NAME                     DISPLAY                       VERSION   REPLACES   PHASE
		noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
		ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
		odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded
		
		ODF (OCS) build :		      full_version: 4.9.0-210.ci
		
	Rook versions
	===============

		2021-11-04 09:27:36.633082 I | op-flags: failed to set flag "logtostderr". no such flag -logtostderr
		rook: 4.9-210.f6e2005.release_4.9
		go: go1.16.6
		
	Ceph versions
	===============

		ceph version 16.2.0-143.el8cp (0e2c6f9639c37a03e55885fb922dc0cb1b5173cb) pacific (stable)



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?

Yes 


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:

The measurements are better in OCP 4.8 + OCS 4.8 on the same platform. 


Steps to Reproduce:
1. Run the test_pvc_snapshot_performance.py test.
2. Measure the RBD snapshot creation time on OCP 4.9 + ODF 4.9 and compare it to OCP 4.8 + OCS 4.8 (a rough timing sketch is shown below).
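
For reference, here is a minimal sketch of the kind of measurement being compared - timing a VolumeSnapshot from the creation request until readyToUse=true. This is not the actual test_pvc_snapshot_performance.py implementation; the namespace, PVC name, and snapshot class name below are assumptions:

```python
# Hypothetical sketch: time a CSI VolumeSnapshot from the creation request
# until readyToUse=true, the quantity compared between releases in this BZ.
import json
import subprocess
import time

NAMESPACE = "snapshot-perf-test"                       # assumption
PVC_NAME = "pvc-test-1g"                               # assumption
SNAP_CLASS = "ocs-storagecluster-rbdplugin-snapclass"  # assumed RBD snapshot class


def time_snapshot_creation(name: str) -> float:
    manifest = {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "spec": {
            "volumeSnapshotClassName": SNAP_CLASS,
            "source": {"persistentVolumeClaimName": PVC_NAME},
        },
    }
    start = time.time()
    subprocess.run(["oc", "create", "-f", "-"],
                   input=json.dumps(manifest).encode(), check=True)
    # Poll until the snapshot controller reports the snapshot as ready.
    while True:
        ready = subprocess.run(
            ["oc", "-n", NAMESPACE, "get", "volumesnapshot", name,
             "-o", "jsonpath={.status.readyToUse}"],
            capture_output=True, text=True, check=True).stdout.strip()
        if ready == "true":
            return time.time() - start
        time.sleep(0.1)


if __name__ == "__main__":
    elapsed = time_snapshot_creation("rbd-snap-sample")
    print(f"RBD snapshot creation took {elapsed:.3f} sec")
```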


Actual results:

RBD snapshot creation time in 4.9 + 4.9 is longer than in 4.8 + 4.8, although each individual measurement is still less than 1 second. 


Expected results:

The measurements should be the same as, or better than, those in 4.8 + 4.8. 


Additional info:

Comparison report is available here: https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#heading=h.cv93fomn5oqu

Comment 3 Yuli Persky 2021-11-17 10:42:52 UTC
@Madhu - 

I plan to test this on OCP 4.8 + ODF 4.9 and will update this BZ with the results once they are available.

Comment 4 Yuli Persky 2021-11-17 10:57:04 UTC
Comparison data from the report (https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#): 


RBD snapshot creation times in OCP 4.8 + OCS 4.8:

1 Gi: 0.526 sec
10 Gi: 0.191 sec
100 Gi: 0.377 sec

RBD snapshot creation times in OCP 4.9 + ODF 4.9:

1 Gi: 0.63 sec
10 Gi: 0.34 sec
100 Gi: 0.56 sec

Comment 6 yati padia 2021-11-23 05:37:32 UTC
 (In reply to Yuli Persky from comment #4)
> Comparison data from the report
> (https://docs.google.com/document/d/
> 1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit# ) : 
> 
> 
> RBD snapshot creation times in OCP4.8+OSC4.8
> 
> 1 Gi  : 0.526 sec
> 10 Gi: 0.191 sec
> 100 Gi : 0.377 sec 
> 
> RBD snapshot creation times in OCP4.9+ODF4.9: 
> 
> 1 Gi  : 0.63 sec
> 10 Gi: 0.34 sec
> 100 Gi : 0.56 sec

@ypersky, did you try this on OCP 4.8 + ODF 4.9 as mentioned by Madhu? It would be great if you could share the details for this combination.

Comment 7 Yuli Persky 2021-11-23 21:06:31 UTC
@yati padia and @

Comment 8 Yuli Persky 2021-11-23 21:08:05 UTC
@yati padia and @Madhu,


After trying to deploy a VMware LSO cluster with OCP 4.8 and ODF 4.9, it was found that this combination is not supported. 
It is not possible to deploy an LSO cluster with OCP 4.8 and ODF 4.9, so I cannot provide any statistics or measurements from such a cluster.

Comment 10 Yuli Persky 2021-12-01 15:16:32 UTC
@Yug Gupta,

As you've mentioned, it is not possible to deploy ODF 4.9 with OCP 4.8 on VMware LSO. 
As for other platforms, the AWS comparison report is available here: https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.2m8gdjc4jhzo

It contains a comparison between OCP 4.8 + OCS 4.8 and OCP 4.9 + ODF 4.9. 

As can be seen from the AWS comparison:

For the 100 GB RBD snapshot there is also a degradation of ~100%. 

AWS: 

A 100 GB RBD snapshot in 4.8 was created in 0.79 sec
A 100 GB RBD snapshot in 4.9 was created in 1.6 sec

However, for 1 GB/10 GB snapshots on AWS we see an improvement. 

At the same time, on VMware LSO we see degradation in snapshot creation times for all snapshot sizes: 

VMware LSO: 

RBD snapshot creation times in OCP 4.8 + OCS 4.8:

 1 Gi  : 0.526 sec
 10 Gi: 0.191 sec
 100 Gi : 0.377 sec 
 
 RBD snapshot creation times in OCP 4.9 + ODF 4.9: 
 
 1 Gi  : 0.63 sec
 10 Gi: 0.34 sec
 100 Gi : 0.56 sec

Comment 13 Yuli Persky 2021-12-20 18:10:32 UTC
@Yug Gupta,

The relevant AWS comparison is available here: 

https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#

AWS - RBD Snapshot creation times in OCP 4.8 + OCS 4.8: 

1 Gi : 1.69 sec
10 Gi: 1.507 sec
100 Gi: 0.79 sec



AWS - RBD snapshot creation times in OCP 4.8 + ODF 4.9: 

1 Gi:   1.99 sec (degradation of 17%) 
10 Gi:  1.84 sec (degradation of 22%)
100 Gi: 1.72 sec (degradation of 117%!)

Conclusion: on OCP 4.8 + ODF 4.9 we see a degradation similar to the one on VMware LSO.

Comment 15 Yuli Persky 2022-02-02 10:38:56 UTC
@Rakshith, per your question - unfortunately the cluster is not available now, and the must-gather was not kept on Jenkins, so it is not available either. 

We will definitely make a note and ensure that all performance bugs are opened with a link to must-gather logs.

Comment 16 yati padia 2022-02-04 05:46:20 UTC
@ypersky, we will need a must-gather here to calculate and check the time spent by ceph-csi.
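
As a side note, the per-request time spent inside ceph-csi can be estimated from the csi-rbdplugin provisioner log in a must-gather by pairing each "GRPC call: /csi.v1.Controller/CreateSnapshot" line with the "GRPC response" or "GRPC error" line carrying the same ID. The sketch below is a rough, hypothetical helper (the log path is a placeholder), assuming the standard csi-rbdplugin log line format:

```python
# Hypothetical helper: pair CreateSnapshot "GRPC call" lines with the
# "GRPC response"/"GRPC error" line carrying the same ID and print the delta.
import re
from datetime import datetime

LOG_PATH = "csi-rbdplugin/current.log"  # placeholder: must-gather log location

LINE_RE = re.compile(
    r"^(\S+) [IE]\d+ .*\bID: (\d+)\b.*GRPC "
    r"(call: /csi\.v1\.Controller/CreateSnapshot|response|error)"
)


def parse_ts(raw: str) -> datetime:
    # "2022-02-22T08:18:44.697840602Z" -> truncate to microsecond precision
    base, frac = raw.rstrip("Z").split(".")
    return datetime.strptime(f"{base}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")


call_started = {}
with open(LOG_PATH) as log:
    for line in log:
        match = LINE_RE.match(line)
        if not match:
            continue
        ts, req_id, kind = parse_ts(match.group(1)), match.group(2), match.group(3)
        if kind.startswith("call"):
            call_started[req_id] = ts
        elif req_id in call_started:
            delta = (ts - call_started.pop(req_id)).total_seconds()
            print(f"ID {req_id}: CreateSnapshot handled in {delta:.3f} sec")
```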

Comment 17 Mudit Agarwal 2022-03-08 13:42:57 UTC
Not a 4.10 blocker

Comment 18 Yuli Persky 2022-03-08 21:05:58 UTC
1) I've run the test_pvc_snapshot_performance test again on VMware LSO OCP 4.10 + ODF 4.10, and once again on VMware LSO OCP 4.9 + ODF 4.9. 

4.9 + 4.9 Jenkins Job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10483/

OCP 4.9 + ODF 4.9 must-gather logs link: 

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646258487/
                           or
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646291211/                                   


4.10 + 4.10 Jenkins Job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10162/ 

OCP 4.10 + ODF 4.10 must-gather logs link: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/


2) Let me summarise the results of ALL the test runs: 


RBD snapshot creation times in VMware LSO OCP 4.8 + OCS 4.8 (no must-gather available):  

1 Gi  : 0.526 sec
10 Gi: 0.191 sec
100 Gi : 0.377 sec 

RBD snapshot creation times in VMware LSO OCP 4.9 + ODF 4.9 (this is the FIRST 4.9 test run - no must-gather available): 

1 Gi  : 0.63 sec
10 Gi: 0.34 sec
100 Gi : 0.56 sec


RBD snapshot creation times in VMware LSO OCP 4.9 + ODF 4.9 (this is the SECOND test run - links to the Jenkins job and must-gather logs are at the beginning of this comment): 

1 Gi: 0.501 sec
10 Gi: 0.466 sec
100 Gi: 0.798 sec

RBD snapshot creation times in VMware LSO OCP 4.10 + ODF 4.10 (links to the Jenkins job and must-gather logs are at the beginning of this comment): 

1 Gi: 1.875 sec
10 Gi: 1.893 sec
100 Gi: 1.806 sec

3) As we can see, the SECOND 4.9 run confirms the 4.9 degradation vs 4.8, and the 4.10 run shows a HUGE degradation compared to 4.9, and even more so compared to 4.8. 

Please approve renaming (or rename yourself) this bug to: VMware LSO - Significant degradation in RBD snapshot creation times in ODF 4.10 vs 4.9/4.8. 

I think this is what we need to fix at this point, since we are seeing successive degradation starting with 4.9, and in 4.10 it becomes a major degradation.

Comment 19 Yuli Persky 2022-03-08 21:06:59 UTC
Note: the full VMware LSO comparison report (including links to the Performance dashboard) for 4.10 vs 4.9 is available here: 

https://docs.google.com/document/d/19ZRfwhfbpYF2f6hUxCM5lCt0uLoNo3ibOMWTbZTUTqw/edit#

Comment 20 yati padia 2022-03-09 04:23:24 UTC
Hi ypersky,
Checking the logs for ODF 4.10, I see that the snapshot creation was not successful:

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-7052e55c850e1630655e5455edd86de46b070911af017d34868a9f9f7045f6d2/namespaces/openshift-storage/pods/csi-rbdplugin-provisioner-dccd97fb8-7cwcp/csi-rbdplugin/csi-rbdplugin/logs/current.log
```
2022-02-22T08:18:44.697840602Z I0222 08:18:44.697428       1 utils.go:191] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC call: /csi.v1.Controller/CreateSnapshot
2022-02-22T08:18:44.697840602Z I0222 08:18:44.697548       1 utils.go:195] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC request: {"name":"snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc","parameters":{"clusterID":"openshift-storage"},"secrets":"***stripped***","source_volume_id":"0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218"}
2022-02-22T08:18:44.697840602Z E0222 08:18:44.697740       1 controllerserver.go:1024] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc failed to get backend volume for 0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218: pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.697840602Z E0222 08:18:44.697765       1 utils.go:200] ID: 5394 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC error: rpc error: code = NotFound desc = pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.723092019Z I0222 08:18:44.723039       1 utils.go:191] ID: 5395 GRPC call: /csi.v1.Identity/GetPluginInfo
2022-02-22T08:18:44.723124191Z I0222 08:18:44.723103       1 utils.go:195] ID: 5395 GRPC request: {}
2022-02-22T08:18:44.723124191Z I0222 08:18:44.723110       1 identityserver-default.go:38] ID: 5395 Using default GetPluginInfo
2022-02-22T08:18:44.723150461Z I0222 08:18:44.723136       1 utils.go:202] ID: 5395 GRPC response: {"name":"openshift-storage.rbd.csi.ceph.com","vendor_version":"release-4.10"}
2022-02-22T08:18:44.723582197Z I0222 08:18:44.723557       1 utils.go:191] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC call: /csi.v1.Controller/CreateSnapshot
2022-02-22T08:18:44.723654902Z I0222 08:18:44.723624       1 utils.go:195] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC request: {"name":"snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc","parameters":{"clusterID":"openshift-storage"},"secrets":"***stripped***","source_volume_id":"0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218"}
2022-02-22T08:18:44.723891758Z E0222 08:18:44.723806       1 controllerserver.go:1024] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc failed to get backend volume for 0001-0011-openshift-storage-0000000000000017-00efe30a-93b8-11ec-8972-0a580a800218: pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.723891758Z E0222 08:18:44.723847       1 utils.go:200] ID: 5396 Req-ID: snapshot-c4a6d03e-46ea-4868-9c6a-0c0c582b66bc GRPC error: rpc error: code = NotFound desc = pool not found: pool ID(23) not found in Ceph cluster
2022-02-22T08:18:44.747813112Z I0222 08:18:44.747763       1 utils.go:191] ID: 5397 GRPC call: /csi.v1.Identity/GetPluginInfo
2022-02-22T08:18:44.747841524Z I0222 08:18:44.747822       1 utils.go:195] ID: 5397 GRPC request: {}
2022-02-22T08:18:44.747841524Z I0222 08:18:44.747828       1 identityserver-default.go:38] ID: 5397 Using default GetPluginInfo
2022-02-22T08:18:44.747902159Z I0222 08:18:44.747855       1 utils.go:202] ID: 5397 GRPC response: {"name":"openshift-storage.rbd.csi.ceph.com","vendor_version":"release-4.10"}
```

Did you check whether the snapshots were successfully created with `oc get volumesnapshot`?
And if the snapshots were not created, how did you calculate the time?

Please correct me if I am looking at the wrong place.

cc @rar
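
For completeness, a small hypothetical check of whether the sampled snapshots actually reached readyToUse=true, along the lines of the `oc get volumesnapshot` suggestion above; the namespace is a placeholder:

```python
# Hypothetical check: list VolumeSnapshots in the test namespace and print
# whether each one reached readyToUse=true, plus its creationTime.
import json
import subprocess

NAMESPACE = "snapshot-perf-test"  # placeholder

raw = subprocess.run(
    ["oc", "-n", NAMESPACE, "get", "volumesnapshot", "-o", "json"],
    capture_output=True, text=True, check=True).stdout

for snap in json.loads(raw)["items"]:
    status = snap.get("status", {})
    print(snap["metadata"]["name"],
          "readyToUse:", status.get("readyToUse"),
          "creationTime:", status.get("creationTime"))
```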

Comment 21 Yuli Persky 2022-03-16 12:02:20 UTC
@Yati,

I'm sure that all the sampled snapshots were successfully created. Otherwise the test would not have passed (and it did pass).

Comment 22 yati padia 2022-03-23 03:46:04 UTC
@ypersky, we have another bug open to resolve the `pool not found` issue. Once it is resolved, you can retest.
Link to the bug: https://bugzilla.redhat.com/show_bug.cgi?id=1972013

Comment 23 Mudit Agarwal 2022-06-20 14:46:24 UTC
Moving to 4.12 while we are waiting for the results

Comment 24 Yuli Persky 2022-06-26 09:32:33 UTC
We'll re-run the test once we have a VMware LSO cluster (currently we do not have the resources to deploy such a cluster).

