Bug 2039881

Summary: AWS - degradation in pvc attach time for both RBD and CephFS PVCs in ODF 4.10 vs ODF 4.9
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Yuli Persky <ypersky>
Component: csi-driver
Assignee: Rakshith <rar>
Status: CLOSED NOTABUG
QA Contact: Elad <ebenahar>
Severity: unspecified
Docs Contact:
Priority: medium
Version: 4.10
CC: alayani, jopinto, kramdoss, madam, mmuench, ocs-bugs, odf-bz-bot, rar
Target Milestone: ---
Keywords: Automation, Performance, Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-10 04:15:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Yuli Persky 2022-01-12 16:25:13 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

There is a degradation in pvc attach time for both RBD and CephFS PVCs in ODF 4.10 vs ODF 4.9 

Version of all relevant components (if applicable):

ODF 4.10.0.50

Note: you may find additional details in the following Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/view/Performance/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-performance/56/


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3


Is this issue reproducible?

Yes.
I also reproduced this problem (degradation) on a cluster deployed with OCP 4.9 and ODF 4.10.


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:

In 4.9 OCP + 4.9 ODF, the average of 10 PVC attach times was:

RBD: 7.4 sec
CephFS: 6.6 sec


In 4.10 OCP + 4.10 ODF, the average of 10 PVC attach times was:

RBD: 10.2 sec
CephFS: 8.8 sec


In 4.9 OCP + 4.10 ODF, the average of 10 PVC attach times was:

RBD: 12.8 sec
CephFS: 11 sec

The detailed comparison report is available here: 

https://docs.google.com/document/d/1OJfARHBAJs6bkYqri_HpSNM_N5gchUQ6P-lKe6ujQ6o/edit#


Steps to Reproduce:
1. Run the test_pvc_attachtime.py test.
2. Compare its results (average attach time of 10 samples) to the 4.9 results (available in this report: https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit).
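As a rough sketch of what such a test measures (this is an assumption about the approach, not the actual implementation of test_pvc_attachtime.py), attach time can be taken as the delta between the PVC's creation timestamp and the pod reporting Running, averaged over the samples; the timestamps below are hypothetical:

```python
from datetime import datetime

def attach_time_seconds(pvc_create_ts: str, pod_start_ts: str) -> float:
    """Seconds between PVC creation and the pod reporting Running.

    Timestamps use the RFC 3339 format found on Kubernetes objects.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime.strptime(pvc_create_ts, fmt)
    end = datetime.strptime(pod_start_ts, fmt)
    return (end - start).total_seconds()

def average_attach_time(samples):
    """Average attach time over (pvc_created, pod_started) samples."""
    return sum(attach_time_seconds(a, b) for a, b in samples) / len(samples)

# Illustrative sample data (hypothetical timestamps; the real test uses 10 samples):
samples = [
    ("2022-01-12T10:00:00Z", "2022-01-12T10:00:07Z"),
    ("2022-01-12T10:01:00Z", "2022-01-12T10:01:08Z"),
    ("2022-01-12T10:02:00Z", "2022-01-12T10:02:06Z"),
]
print(average_attach_time(samples))  # 7.0
```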


Actual results:

Average attach time in ODF 4.10 (with both OCP 4.9 and OCP 4.10) is at least 30% worse than in OCP 4.9 + ODF 4.9, for both RBD and CephFS.
Please note that this is an average of 10 samples.
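The "at least 30%" figure follows directly from the averages above; a quick check:

```python
def regression_pct(new: float, old: float) -> float:
    """Percent slowdown of `new` relative to `old`."""
    return (new - old) / old * 100

# Averages from the comparison above (seconds)
print(round(regression_pct(10.2, 7.4), 1))  # RBD, 4.10 vs 4.9 -> 37.8
print(round(regression_pct(8.8, 6.6), 1))   # CephFS, 4.10 vs 4.9 -> 33.3
```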

Expected results:

Average attach time should be the same as, or shorter than, in OCP 4.9 + ODF 4.9.


Additional info:

Relevant Jenkins job:

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/view/Performance/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-performance/56/

Comparison report: 

https://docs.google.com/document/d/1OJfARHBAJs6bkYqri_HpSNM_N5gchUQ6P-lKe6ujQ6o/edit#

Comment 5 Yuli Persky 2022-02-23 15:39:54 UTC
@Rakshith

Thank you for pointing out that the default `imagePullPolicy` is `Always`.
We checked our test logs and this is indeed what is going on, not only in the test_pvc_attachtime test but also in others.
This means that the currently reported attach/reattach time measurements include the image pull.
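For reference, the re-pull can be avoided by setting the policy explicitly on the test pod. A minimal sketch (the pod, image, and PVC names below are placeholders, not the actual test manifest):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pvc-attach-test        # hypothetical name
spec:
  containers:
    - name: app
      image: quay.io/example/busybox:latest   # placeholder image
      imagePullPolicy: IfNotPresent           # don't re-pull on every run
      volumeMounts:
        - name: data
          mountPath: /mnt/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-pvc    # hypothetical PVC name
```

With `IfNotPresent`, the image is pulled only when it is missing from the node, so the measured pod start time reflects volume attach rather than network transfer.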

I've added fixing all three tests to the QPAS team workplan at P0 priority.

After the tests are fixed, we will be able to provide more accurate attach/reattach times.

Comment 6 Yuli Persky 2022-03-07 22:05:45 UTC
All the performance tests that were using the default pull policy (Always) were fixed so that they no longer pull the image each time.

I will run them on 4.10 and 4.9 and post the results of the comparison here.

Comment 7 Yuli Persky 2022-03-09 18:10:15 UTC
An Update: 

I've run the fixed pvc_attachtime.py test (the fix: the image is no longer pulled each time) on the latest 4.10.0 build 184 and on 4.9.4 build 7.
The comparison is available here: 

http://ocsperf.ceph.redhat.com:8080/index.php?version1=17&build1=51&platform1=1&az_topology1=1&test_name%5B%5D=9&version2=14&build2=53&platform2=1&az_topology2=1&version3=&build3=&version4=&build4=&submit=Choose+options

4.9 must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-ypersky-9aws/lr5-ypersky-9aws_20220309T120256/logs/testcases_1646831255/
4.10 must gather:  http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-ypersky-10aws/lr5-ypersky-10aws_20220309T120401/logs/testcases_1646831358/

The comparison shows an IMPROVEMENT in 4.10.0.184 in PVC attach time for both RBD (47%) and CephFS (42%).

For reference, the earlier measurements (average of 10 PVC attach times):

4.9 OCP + 4.9 ODF: RBD 7.4 sec, CephFS 6.6 sec
4.10 OCP + 4.10 ODF: RBD 10.2 sec, CephFS 8.8 sec
4.9 OCP + 4.10 ODF: RBD 12.8 sec, CephFS 11 sec

With the newly executed fixed test on 4.10 OCP + 4.10 ODF, the average of 10 PVC attach times is:

RBD: 5.4 sec
CephFS: 5.6 sec

These are the best times measured so far.
Therefore the BZ should be closed.