Bug 2024132 - VMware LSO - degradation of performance in CephFS clone creation times in OCP4.9+ODF4.9 vs OCP4.8+OCS4.8
Summary: VMware LSO - degradation of performance in CephFS clone creation times in OCP...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Humble Chirammal
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-17 11:58 UTC by Yuli Persky
Modified: 2023-08-09 16:37 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-07 02:40:06 UTC
Embargoed:



Description Yuli Persky 2021-11-17 11:58:46 UTC
Description of problem (please be as detailed as possible and provide log snippets):

There is a performance degradation on the VMware LSO platform in CephFS clone creation times on OCP 4.9 + ODF 4.9 compared to OCP 4.8 + OCS 4.8.

The detailed report is available here: https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#heading=h.x348ywf5r26r

CephFS Clone Creation times in OCP 4.8 + OCS 4.8: 

Clone size: 1 Gi, Creation time: 2.63 sec
Clone size: 25 Gi, Creation time: 64.12 sec
Clone size: 50 Gi, Creation time: 95.025 sec


CephFS Clone Creation times in OCP 4.9 + ODF 4.9: 

Clone size: 1 Gi, Creation time: 8.12 sec
Clone size: 25 Gi, Creation time: 64.74 sec (here the time is the same as in 4.8 + 4.8)
Clone size: 50 Gi, Creation time: 193.26 sec

Please note that the degradation for 1 Gi and 50 Gi clones is consistent: the test was run a number of times with similar results.
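
For clarity, a CephFS clone here is a new PVC whose dataSource references an existing CephFS-backed PVC, and the creation time is essentially the time until the cloned PVC is provisioned. The actual measurements come from test_pvc_clone_performance.py in ocs-ci; the snippet below is only a minimal illustrative sketch using the Kubernetes Python client, with the namespace, PVC names, and the ocs-storagecluster-cephfs storage class name assumed for the example.

    # Illustrative sketch only (not the ocs-ci test): create a CephFS PVC clone
    # and time how long it takes to become Bound. Namespace, PVC names and
    # storage class are assumptions for the example.
    import time
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    namespace = "clone-perf-test"  # assumed test namespace
    clone_pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "cephfs-pvc-clone"},
        "spec": {
            "storageClassName": "ocs-storagecluster-cephfs",  # assumed ODF CephFS class
            "dataSource": {  # clone source: an existing, populated CephFS PVC
                "kind": "PersistentVolumeClaim",
                "name": "cephfs-pvc-source",
            },
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": "50Gi"}},
        },
    }

    start = time.time()
    core.create_namespaced_persistent_volume_claim(namespace, clone_pvc)
    while core.read_namespaced_persistent_volume_claim("cephfs-pvc-clone", namespace).status.phase != "Bound":
        time.sleep(1)
    print(f"clone reached Bound after {time.time() - start:.2f} sec")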



Version-Release number of selected component (if applicable):

OCS versions
	==============

		NAME                     DISPLAY                       VERSION   REPLACES   PHASE
		noobaa-operator.v4.9.0   NooBaa Operator               4.9.0                Succeeded
		ocs-operator.v4.9.0      OpenShift Container Storage   4.9.0                Succeeded
		odf-operator.v4.9.0      OpenShift Data Foundation     4.9.0                Succeeded
		
		ODF (OCS) build :		      full_version: 4.9.0-210.ci
		
	Rook versions
	===============

		2021-11-04 09:27:36.633082 I | op-flags: failed to set flag "logtostderr". no such flag -logtostderr
		rook: 4.9-210.f6e2005.release_4.9
		go: go1.16.6
		
	Ceph versions
	===============

		ceph version 16.2.0-143.el8cp (0e2c6f9639c37a03e55885fb922dc0cb1b5173cb) pacific (stable)


The full version list is available here:

http://ocsperf.ceph.redhat.com/logs/Performance_tests/4.9/RC0/Vmware-LSO/versions.txt



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Not relevant for a performance bug.


Is there any workaround available to the best of your knowledge?

No 

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

Not relevant 

If this is a regression, please provide more details to justify this:

CephFS Clone Creation times in OCP 4.8 + OCS 4.8: 

Clone size: 1 Gi, Creation time: 2.63 sec
Clone size: 25 Gi, Creation time: 64.12 sec
Clone size: 50 Gi, Creation time: 95.025 sec


CephFS Clone Creation times in OCP 4.9 + ODF 4.9: 

Clone size: 1 Gi, Creation time: 8.12 sec
Clone size: 25 Gi, Creation time: 64.74 sec (here the time is the same as in 4.8 + 4.8)
Clone size: 50 Gi, Creation time: 193.26 sec

Please note that the degradation for 1 Gi and 50 Gi clones is consistent: the test was run a number of times with similar results.



Steps to Reproduce:
1. Run the test_pvc_clone_performance.py test on a VMware LSO cluster.
2. Compare the clone creation times to the 4.8 measurements (see the comparison sketch below).
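
A minimal sketch of the comparison in step 2, using the numbers from this report (plain Python, not part of ocs-ci; the 10% threshold is an arbitrary choice for the example):

    # Compare 4.9 CephFS clone creation times against the 4.8 baseline from this
    # report and flag sizes that regressed by more than an (arbitrary) 10%.
    baseline_48 = {"1Gi": 2.63, "25Gi": 64.12, "50Gi": 95.025}  # OCP 4.8 + OCS 4.8
    results_49 = {"1Gi": 8.12, "25Gi": 64.74, "50Gi": 193.26}   # OCP 4.9 + ODF 4.9
    THRESHOLD_PCT = 10.0

    for size, old in baseline_48.items():
        new = results_49[size]
        change_pct = (new - old) / old * 100
        flag = "REGRESSION" if change_pct > THRESHOLD_PCT else "ok"
        print(f"{size}: {old:.2f}s -> {new:.2f}s ({change_pct:+.1f}%) {flag}")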


Actual results:

The current measurements show degradation (longer creation times for 1 GB and 50 GB clones) on OCP 4.9 + ODF 4.9 vs OCP 4.8 + OCS 4.8.


Expected results:

The measurements should be the same or better (shorter times) on OCP 4.9 + ODF 4.9.


Additional info:


The full VMware LSO comparison report is available here: https://docs.google.com/document/d/1Ft7gzWCcID2RTXILW3GrN8a6O5v5VidDICuG_tX__v8/edit#

Comment 3 yati padia 2021-11-22 07:36:02 UTC
@ypersky.com I don't see a detailed description in the bug description. Can you please update the bug with the report and other important details?

Comment 4 Yuli Persky 2021-11-23 10:48:36 UTC
I apologize for not providing a proper description earlier.
The first comment was updated with all the information; please let me know in case you need any further input.

Comment 6 Yuli Persky 2021-12-01 13:48:01 UTC
@Yug Gupta,

It is not possible to deploy OCP 4.8 + ODF 4.9 on a VMware LSO cluster; this combination is not supported.

As for other platforms, here is the AWS 4.9 report:

https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit#heading=h.2m8gdjc4jhzo

The comparison between 4.8 and 4.9 shows no degradation in clone creation times on AWS.
However, we do see degradation on VMware LSO.

Comment 9 Yuli Persky 2021-12-20 18:45:39 UTC
@Yug Gupta,

So which component should I change this BZ to?

Comment 10 Yuli Persky 2022-02-02 11:11:05 UTC
Regarding must-gather logs: unfortunately we did not collect them, and the cluster is no longer available.
If needed, I can reproduce the problem on a newly deployed cluster and collect the must-gather, or start the test from Jenkins, in which case the must-gather will be collected automatically.

Comment 11 yati padia 2022-02-03 12:54:34 UTC
@ypersky, we will need the must-gather here to calculate and check the time spent by ceph-csi.
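
As a rough illustration of how the must-gather could be used for that, the sketch below just pulls the CephFS provisioner log lines that mention a given PVC or volume name, so the time spent on the clone can be read from the surrounding timestamps. The directory layout and the csi-cephfsplugin-provisioner pod-name pattern are assumptions, not a documented interface.

    # Rough sketch: grep an unpacked must-gather for ceph-csi CephFS provisioner
    # log lines that mention a given PVC/volume name. Pod-name pattern and
    # directory layout are assumptions for illustration.
    import os
    import sys

    must_gather_dir = sys.argv[1]  # path to the unpacked must-gather
    needle = sys.argv[2]           # PVC name or volume handle to search for

    for root, _dirs, files in os.walk(must_gather_dir):
        for fname in files:
            path = os.path.join(root, fname)
            if "csi-cephfsplugin-provisioner" not in path:  # assumed pod-name pattern
                continue
            with open(path, errors="replace") as f:
                for line in f:
                    if needle in line:
                        print(f"{path}: {line.rstrip()}")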

Comment 12 Yuli Persky 2022-03-06 22:09:42 UTC
I've run the test_pvc_clone_performance.py test again on a 4.9 (OCP + ODF) VMware LSO cluster.
This is the link to the Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10491/
This is the link to the must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/testcases_1646291211/

Please note that there is a chance that the relevant must-gather might be located in one of the testcase* directories here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-lso9/ypersky-lso9_20220228T124019/logs/
I'm keeping this link here to be on the safe side; however, the first link should contain the relevant logs.


I've also run the test_pvc_clone_performance.py test on a 4.10 VMware LSO cluster, and the must-gather is available here:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10162/

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/ypersky-local10/ypersky-local10_20220215T103839/logs/testcases_1645436035/

Comment 13 Yuli Persky 2022-03-06 22:27:53 UTC
Please also note the following: 

1) The VMware LSO comparison report is available here:

https://docs.google.com/document/d/19ZRfwhfbpYF2f6hUxCM5lCt0uLoNo3ibOMWTbZTUTqw/edit#

2) When I ran the test_pvc_clone_creation_performance.py test on a newly deployed 4.9 OCP + 4.9 ODF cluster, the measurements for CephFS clone creation were:

4.9 CephFS Clone creation times:

1 GB clone: 1.758 sec
25 GB clone: 64.390 sec
50 GB clone: 128.320 sec
100 GB clone: 256.135 sec

Those measurements are similar to the 4.8 results (taken from this bug's description):

Clone size: 1 Gi, Creation time: 2.63 sec
Clone size: 25 Gi, Creation time: 64.12 sec
Clone size: 50 Gi, Creation time: 95.025 sec

and much better than the 4.9 results also mentioned in the description of this bug (copied here):

Clone size: 1 Gi, Creation time: 8.12 sec
Clone size: 25 Gi, Creation time: 64.74 sec (here the time is the same as in 4.8 + 4.8)
Clone size: 50 Gi, Creation time: 193.26 sec

Also please note the current 4.10 results: 

4.10 CephFS Clone creation times:

1 GB clone: 2.021 sec
25 GB clone: 49.336 sec
50 GB clone: 131.583 sec
100 GB clone: 258.400 sec


I have an explanation for the DIFFERENT 4.9 measurements: it looks like we need to add more samples to the clone creation/deletion test (this is already in our work plan).

Taking all of the above into consideration, I think we can close this bug. The only degradation in 4.10 vs 4.8 is in the CephFS 50 GB clone (~30%); however, the 100 GB clone creation time is similar in both 4.8 and 4.10.
QE should indeed add more samples to this test so that the measurements are more accurate.
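
For what it's worth, with more samples per clone size the report could show a mean and standard deviation instead of a single value; a minimal sketch with placeholder numbers:

    # Minimal sketch: aggregate several timed runs per clone size. The sample
    # values below are placeholders, not real test output.
    from statistics import mean, stdev

    samples_sec = {
        "1Gi": [2.6, 1.8, 2.1],
        "50Gi": [96.1, 94.8, 95.3],
    }

    for size, times in samples_sec.items():
        print(f"{size}: mean={mean(times):.2f}s stdev={stdev(times):.2f}s n={len(times)}")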

