Bug 2023484 - VMware LSO + AWS - degradation for CephFS Snapshot Restore times for 100GB and 10GB PVCs
Summary: VMware LSO + AWS - degradation for CephFS Snapshot Restore times for 100GB and 10GB PVCs
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: yati padia
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-15 21:21 UTC by Yuli Persky
Modified: 2023-08-09 16:37 UTC (History)
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-11 15:41:53 UTC
Embargoed:



Description Yuli Persky 2021-11-15 21:21:38 UTC
Description of problem:

On both platforms we found a degradation in CephFS Snapshot Restore times (on AWS, a severe degradation of over 200% for the 100GB snapshot; on VMware LSO, a 30-50% degradation for the 10GB and 100GB snapshots) when comparing the OCP 4.9 + ODF 4.9 results to the OCP 4.8 + OCS 4.8 results.

AWS 4.9 vs 4.8 comparison report is available here: 

https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#

VMware LSO 4.9 vs 4.8 comparison report is available here: 

https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#


Version-Release number of selected component (if applicable):

OCS versions
	==============

		NAME                     DISPLAY                       VERSION   REPLACES   PHASE
		noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
		ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
		odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded
		
		ODF (OCS) build :		      full_version: 4.9.0-210.ci
		
	Rook versions
	===============

		2021-11-04 09:27:36.633082 I | op-flags: failed to set flag "logtostderr". no such flag -logtostderr
		rook: 4.9-210.f6e2005.release_4.9
		go: go1.16.6
		
	Ceph versions
	===============

		ceph version 16.2.0-143.el8cp (0e2c6f9639c37a03e55885fb922dc0cb1b5173cb) pacific (stable)


The full version list is available here:

http://ocsperf.ceph.redhat.com/logs/Performance_tests/4.9/RC0/Vmware-LSO/versions.txt



How reproducible:


Steps to Reproduce:
1. Run the pvc_snapshot_performance.py test on AWS and VMware LSO clusters (a minimal timing sketch is shown after these steps).
2. Compare the Snapshot Restore time measurements to the OCP 4.8 + OCS 4.8 results.
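For context, restore time here can be measured from the creation of a PVC whose dataSource points at an existing VolumeSnapshot until that PVC reports Bound. Below is a minimal, hypothetical sketch of that measurement - not the actual pvc_snapshot_performance.py implementation; the namespace, VolumeSnapshot name, and StorageClass are placeholders.

#!/usr/bin/env python3
# Rough sketch (not the ocs-ci test): time a CephFS PVC restore from an
# existing VolumeSnapshot. Namespace, snapshot name, and StorageClass are
# placeholders.
import subprocess
import time

NAMESPACE = "snapshot-perf"                      # placeholder
SNAPSHOT = "cephfs-snap-100gi"                   # placeholder VolumeSnapshot
STORAGE_CLASS = "ocs-storagecluster-cephfs"

RESTORE_PVC = f"""
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-pvc
  namespace: {NAMESPACE}
spec:
  storageClassName: {STORAGE_CLASS}
  dataSource:
    name: {SNAPSHOT}
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 100Gi
"""

def pvc_phase():
    # Current phase of the restored PVC (Pending/Bound/...).
    out = subprocess.run(
        ["oc", "-n", NAMESPACE, "get", "pvc", "restored-pvc",
         "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True)
    return out.stdout.strip()

start = time.time()
subprocess.run(["oc", "apply", "-f", "-"], input=RESTORE_PVC, text=True, check=True)

# Restore time = time until the PVC created from the snapshot becomes Bound.
while pvc_phase() != "Bound":
    time.sleep(1)

print(f"Snapshot restore time: {time.time() - start:.2f} sec")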

Actual results:

The results measured on both AWS and VMware LSO clusters with OCP 4.9 + ODF 4.9 are worse than with 4.8; see the comparison reports below.

AWS 4.9 vs 4.8 comparison report is available here: 

https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#

VMware LSO 4.9 vs 4.8 comparison report is available here: 

https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#



Expected results:

Snapshot Restore times should be no worse than with OCP 4.8 + OCS 4.8.


Additional info:


AWS 4.9 vs 4.8 comparison report is available here: 

https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#

VMware LSO 4.9 vs 4.8 comparison report is available here: 

https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#

Comment 2 Yuli Persky 2021-11-17 11:05:34 UTC
Comparison data from the VMware LSO report (https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#):

VMware LSO - Snapshot Restore times - CephFS - OCP 4.8 + OCS 4.8

1 Gi: 2.74 sec
10 Gi: 12.6 sec
100 Gi: 148.63 sec 


VMware LSO - Snapshot Restore times - CephFS - OCP 4.9 + ODF 4.9 

1 Gi: 6.489 sec
10 Gi: 21.46 sec
100 Gi: 213.9 sec 



Comparison data from the AWS report (https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#):


AWS - Snapshot Restore times - CephFS - OCP 4.8 + OCS 4.8

100 Gi: 145.0 sec

AWS - Snapshot Restore times - CephFS - OCP 4.9 + ODF 4.9

100 Gi: 500 sec

Comment 7 Yuli Persky 2021-12-20 12:26:01 UTC
@Yug Gupta,

AWS report with the OCP 4.8 + ODF 4.9 results is available here: https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.jmpjljqpvc8y

Measurements from that report: 

AWS - CephFS Snapshot Restore times (OCP 4.8 + OCS 4.8)


1 Gi    3.48 sec
10 Gi   36.4 sec
100 Gi  145.0  sec


AWS - CephFS Snapshot Restore times (OCP 4.8 + ODF 4.9)

1 Gi    3.848 sec
10 Gi   43.06 sec
100 Gi  427.47 sec   (194% degradation!)


Conclusion:

On AWS (OCP 4.8 + ODF 4.9) we also see a degradation in CephFS snapshot restore times, especially for the 100 Gi snapshot restore time.
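For reference, the degradation percentage quoted above is simply the relative increase in the average restore time; a quick sanity check against the 100 Gi numbers from this comment:

# Relative increase in average restore time, using the 100 Gi values above.
old, new = 145.0, 427.47        # OCP 4.8 + OCS 4.8 vs OCP 4.8 + ODF 4.9, seconds
print(f"{(new - old) / old * 100:.0f}% slower")   # ~195%, in line with the ~194% quoted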

Comment 8 Yuli Persky 2022-02-02 11:41:42 UTC
Unfortunately, the must-gather logs were not kept.
Please let me know if you need me to reproduce this bug and supply a must-gather from a newly deployed cluster.

Comment 9 yati padia 2022-02-03 12:56:43 UTC
ypersky, we will need a must-gather here to calculate and check the time spent by ceph-csi.
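For illustration, once a must-gather is available, the time spent in ceph-csi on a given operation can be roughly estimated from the timestamps of the matching CSI pod log lines. A rough sketch, assuming klog-style timestamps (e.g. "I0203 12:56:43.123456 ..."); the log file path and search pattern are placeholders supplied on the command line:

#!/usr/bin/env python3
# Rough sketch: report the time span covered by CSI pod log lines that match a
# pattern (e.g. "CreateVolume") inside a must-gather. Assumes klog timestamps.
import re
import sys
from datetime import datetime

KLOG_TS = re.compile(r"^[IWEF](\d{2})(\d{2}) (\d{2}:\d{2}:\d{2}\.\d+)")

def timestamps(path, pattern):
    # Yield a datetime for every log line containing `pattern`.
    for line in open(path, errors="replace"):
        if pattern not in line:
            continue
        m = KLOG_TS.match(line)
        if m:
            month, day, hms = m.groups()
            # klog omits the year; assume the whole run falls within one year.
            yield datetime.strptime(f"2022-{month}-{day} {hms}",
                                    "%Y-%m-%d %H:%M:%S.%f")

if __name__ == "__main__":
    # Usage: python csi_span.py <csi-pod-log-file> <pattern>
    path, pattern = sys.argv[1], sys.argv[2]
    ts = sorted(timestamps(path, pattern))
    if ts:
        span = (ts[-1] - ts[0]).total_seconds()
        print(f"{pattern}: {len(ts)} lines, first {ts[0]}, last {ts[-1]}, span {span:.1f}s")
    else:
        print(f"No lines matching {pattern!r} in {path}")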

Comment 11 Yuli Persky 2022-03-07 20:01:11 UTC
I've run the Snapshot Performance test again on the 4.10 and 4.9 VMware LSO platforms (the report with links to the dashboards is available here: https://docs.google.com/document/d/19ZRfwhfbpYF2f6hUxCM5lCt0uLoNo3ibOMWTbZTUTqw/edit# ).

VMWare LSO 4.10 Snapshot Restore times : 
Note: Relevant Jenkins run is : https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10162/

Must gather logs are available here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/

1GB snapshot: 4.478 sec
10GB snapshot: 18.663 sec
100 GB snapshot: 151.306 sec


VMWare LSO 4.9 Snapshot Restore times  - new run:

Note: the must-gather logs for this run are available here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646258487/ or here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646291211/

This is the relevant Jenkins run: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10483/


1GB snapshot: 7.301 sec
10GB snapshot: 19.610 sec
100GB snapshot: 172.093 sec 


VMware LSO - Snapshot Restore times - CephFS - OCP 4.9 + ODF 4.9 (old run - no must-gather available)

1 Gi: 6.489 sec
10 Gi: 21.46 sec
100 Gi: 213.9 sec 



VMWare LSO 4.8 Snapshot Restore times (from this report: https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.jmpjljqpvc8y ) - no must-gather is available for this run.

1 Gi: 2.74 sec
10 Gi: 12.6 sec
100 Gi: 148.63 sec 

Please note that the test reports the AVERAGE of 3 sampled snapshots (creation/restore times and speed).

We do see a degradation in CephFS Snapshot Restore times in both 4.9 and 4.10 versus the 4.8 results, and I think this degradation should be investigated and fixed.

Please let me know in case you need any other information from my side.
Please note that it is not possible to deploy a mixed-version OCP + ODF cluster on VMware LSO.

Comment 12 Yuli Persky 2022-03-07 20:06:49 UTC
One thing I did not add to this BZ is a 4.8 must-gather.
For that I would need to deploy a new 4.8 VMware LSO cluster (resource consuming; we currently have 4.9 and 4.10 VMware LSO clusters on that DC with tests running.
Therefore, if you need a 4.8 cluster, it can be deployed AFTER the tests on 4.9 and 4.10 are done).
Please let me know if the already supplied must-gathers are sufficient. If not, I'll look for a way to deploy a 4.8 VMware LSO cluster.

Comment 13 Rakshith 2022-03-08 04:48:12 UTC
> VMWare LSO 4.8 Snapshot Restore times ( from this report
> https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.jmpjljqpvc8y ) - no must gather is available for this
> run. 
> 
> 1 Gi: 2.74 sec
> 10 Gi: 12.6 sec
> 100 Gi: 148.63 sec 
> 

The durations from the doc referenced for OCS 4.8 are different:

1 Gi: 3.48
10 Gi: 36.4
100 Gi: 145.0

I would say these values pretty much match the ones for 4.10:

1GB snapshot: 4.478 sec
10GB snapshot: 18.663 sec
100 GB snapshot: 151.306 sec

> Please note that the test reports AVERAGE of 3 sampled snapshots (
> creation/restore times and speed). 

Are even the 4.8 runs averaged values?

Comment 14 yati padia 2022-03-08 05:28:33 UTC
From the data shared by Yuli, I see an improvement in the 4.10 build in comparison to the new 4.9 build.
Also, we don't know what went wrong earlier that showed the degradation, and without a must-gather we can't debug it either.
IMO, there is nothing to debug here, as we are getting a better result.

Comment 15 Yuli Persky 2022-03-08 21:23:25 UTC
@Rakshith - 


> The durations from the doc referenced for OCS 4.8 are different:
> 
> 1 Gi: 3.48
> 10 Gi: 36.4
> 100 Gi: 145.0

Where do those measurements appear?

I'm looking at this doc (page 13): https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#

and the 4.8 CephFS Snapshot Restore times are: 

1Gi: 2.74
10Gi: 12.6
100 Gi: 148.63


If we compare those results to the latest 4.10 results (taken from this report: https://docs.google.com/document/d/19ZRfwhfbpYF2f6hUxCM5lCt0uLoNo3ibOMWTbZTUTqw/edit# ),

the 4.10 CephFS Snapshot Restore times are:


1Gi: 4.478
10Gi: 18.663
100Gi: 151.306


===> We do see a degradation in the restore times for the 1Gi and 10Gi snapshot sizes.
And per your question - yes, even the 4.8 results are averages of 3 samples.

So do you think this degradation is meaningless? Or is it worth investigating?

Comment 17 Yuli Persky 2022-03-09 11:48:51 UTC
I think that this BZ is not a blocker, since even if there is a regression between 4.10 and 4.8, it is not a major one.

Regarding re-running the test with more samples (currently it runs with 10 samples) - let's talk and try to understand how many samples need to be executed manually to determine conclusively whether there is a regression or not.
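As an aid for that discussion, here is a minimal sketch of one way to check whether the difference between two builds is statistically significant given per-sample restore times. The sample values below are placeholders, not measured data, and scipy is assumed to be available.

# Minimal sketch: is build B significantly slower than build A for a given
# snapshot size? Sample values are placeholders, not measured data.
from scipy import stats

def significant_regression(baseline, candidate, alpha=0.05):
    # Welch's t-test, one-sided: True if `candidate` is significantly slower.
    res = stats.ttest_ind(candidate, baseline, equal_var=False)
    return res.statistic > 0 and res.pvalue / 2 < alpha

# Placeholder per-sample 100 Gi restore times (seconds) from two builds.
baseline_samples = [148.1, 150.2, 147.6]
candidate_samples = [210.4, 215.7, 212.9]
print(significant_regression(baseline_samples, candidate_samples))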

Comment 18 Mudit Agarwal 2022-03-09 12:42:52 UTC
Moving out of 4.10 based on the above comment, we will keep investigating.

Comment 19 Yuli Persky 2022-03-09 22:51:21 UTC
@Rakshith,

The report you are looking at (https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.jmpjljqpvc8y) is the AWS 4.9 vs 4.8 comparison.

Therefore, on AWS 4.8 the CephFS Snapshot Restore times are:
1 Gi: 3.48
10 Gi: 36.4
100 Gi: 145.0

On VMware LSO 4.8, the CephFS Snapshot Restore times are:

1Gi: 2.74
10Gi: 12.6
100 Gi: 148.63

The rest of the numbers I posted here refer to VMware LSO as well.

The 4.10 CephFS Snapshot Restore times are:


1Gi: 4.478
10Gi: 18.663
100Gi: 151.306 

Therefore, we do see a degradation in CephFS Snapshot Restore times in 4.10 for the 1Gi and 10Gi snapshots.

How do we proceed from here? 

Please note the following: 

VMWare LSO 4.10 Snapshot Restore times : 
Note: Relevant Jenkins run is : https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10162/

Must gather logs are available here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/

1GB snapshot: 4.478 sec
10GB snapshot: 18.663 sec
100 GB snapshot: 151.306 sec


VMWare LSO 4.9 Snapshot Restore times  - new run:

Note: the must-gather logs for this run are available here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646258487/ or here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646291211/

This is the relevant Jenkins run: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10483/


1GB snapshot: 7.301 sec
10GB snapshot: 19.610 sec
100GB snapshot: 172.093 sec

Comment 21 yati padia 2022-03-10 05:01:05 UTC
@ypersky, from the details you shared above, I see:

VMWare LSO 4.10 Snapshot Restore times : 
1GB snapshot: 4.478 sec
10GB snapshot: 18.663 sec
100 GB snapshot: 151.306 sec


VMWare LSO 4.9 Snapshot Restore times  - new run:
1GB snapshot: 7.301 sec
10GB snapshot: 19.610 sec
100GB snapshot: 172.093 sec

This indicates that we have an improvement in the 4.10 build.
However, I am not sure how we can debug the earlier build that showed the degradation without any must-gather or other information about the issue.

Comment 22 Yuli Persky 2022-03-11 13:53:19 UTC
@Yati, 

I think we can close this BZ (since the 4.8 results may have been affected by an insufficient number of samples), and if we see any degradation in the future, I'll open a new BZ.

Comment 23 yati padia 2022-03-11 15:41:53 UTC
Thanks, closing this bug.

