Bug 2039924 - AWS - degradation in pod reattach time for both CephFS Pods with ~850K files in ODF 4.10 vs ODF 4.9
Summary: AWS - degradation in pod reattach time for both CephFS Pods with ~850K files...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: ---
Assignee: Rakshith
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-12 18:01 UTC by Yuli Persky
Modified: 2023-08-09 16:37 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-11 02:33:27 UTC
Embargoed:



Description Yuli Persky 2022-01-12 18:01:27 UTC
Description of problem:

There is a degradation in pod reattach time for CephFS pods with ~850K files in ODF 4.10 compared to ODF 4.9.


Version-Release number of selected component (if applicable):

ODF 4.10.0.50

Note: additional details are available in the following Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/view/Performance/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-performance/56/


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:

In OCP 4.9 + ODF 4.9, the average reattach time over 10 samples for a CephFS pod with around 850K files was:

CephFS: 178 sec


In OCP 4.10 + ODF 4.10, the average over 10 samples was:

CephFS: 282 sec


In OCP 4.9 + ODF 4.10, the average over 10 samples was:

CephFS: 228 sec



How reproducible:


Steps to Reproduce:
1. Run the tests/e2e/performance/csi_tests/test_pod_reattachtime.py test.
2. Compare its results (average reattach time over 10 samples) to the 4.9 results (available in this report: https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit). A rough sketch of the measurement approach is shown below.
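
For illustration only, here is a minimal sketch (not the actual ocs-ci test) of how the reattach time measurement can be approximated with the kubernetes Python client: delete the pod that mounts the pre-populated CephFS PVC, recreate it, and time how long the new pod takes to reach Running. The namespace, PVC name, pod names and image below are placeholder assumptions.

# Minimal sketch of the measurement, assuming a namespace and a pre-populated
# CephFS PVC already exist; the real logic lives in
# tests/e2e/performance/csi_tests/test_pod_reattachtime.py in ocs-ci.
import time
from kubernetes import client, config
from kubernetes.client.rest import ApiException

NAMESPACE = "reattach-test"          # placeholder namespace
PVC_NAME = "cephfs-pvc-850k-files"   # placeholder PVC holding ~850K files


def make_pod(name: str) -> client.V1Pod:
    """Pod that mounts the pre-populated CephFS PVC."""
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, namespace=NAMESPACE),
        spec=client.V1PodSpec(
            containers=[client.V1Container(
                name="app",
                image="quay.io/centos/centos:stream9",
                command=["sleep", "infinity"],
                volume_mounts=[client.V1VolumeMount(name="data", mount_path="/mnt/data")],
            )],
            volumes=[client.V1Volume(
                name="data",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name=PVC_NAME),
            )],
        ),
    )


def wait_until_gone(core: client.CoreV1Api, name: str) -> None:
    """Block until the pod object has been fully deleted."""
    while True:
        try:
            core.read_namespaced_pod(name, NAMESPACE)
            time.sleep(1)
        except ApiException as err:
            if err.status == 404:
                return
            raise


def measure_reattach(core: client.CoreV1Api, old: str, new: str) -> float:
    """Delete the old pod, recreate one on the same PVC, and return the
    seconds the new pod takes to reach Running (the 'reattach time')."""
    core.delete_namespaced_pod(old, NAMESPACE)
    wait_until_gone(core, old)
    start = time.time()
    core.create_namespaced_pod(NAMESPACE, make_pod(new))
    while core.read_namespaced_pod(new, NAMESPACE).status.phase != "Running":
        time.sleep(1)
    return time.time() - start


if __name__ == "__main__":
    config.load_kube_config()
    core = client.CoreV1Api()
    # Assumes a pod named reattach-0 mounting PVC_NAME is already Running.
    samples = [measure_reattach(core, f"reattach-{i}", f"reattach-{i + 1}")
               for i in range(10)]
    print(f"average reattach time over {len(samples)} samples: "
          f"{sum(samples) / len(samples):.1f} sec")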

Actual results:

Average pod reattach time is ~28% longer with ODF 4.10 on OCP 4.9 (228 sec vs 178 sec), and ~58% longer on OCP 4.10 + ODF 4.10 (282 sec vs 178 sec).


Expected results:

Average pod reattach time should be the same as or shorter than in 4.9.


Additional info:


Relevant Jenkins job:

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/view/Performance/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-performance/56/

Comparison report: 

https://docs.google.com/document/d/1OJfARHBAJs6bkYqri_HpSNM_N5gchUQ6P-lKe6ujQ6o/edit#

Comment 3 Mudit Agarwal 2022-01-13 04:48:35 UTC
Yuli,

Can we also run ODF 4.9 + OCP 4.10?

Also, we will need must-gather for all the runs.

Comment 4 Yuli Persky 2022-01-13 10:19:56 UTC
@Mudit Agarwal

1) I did run the test on OCP 4.9 and ODF 4.10 (see the results in the bug description).

Is OCP 4.10 + ODF 4.9 a supported combination?

Please confirm here if it is, and I will try to deploy it and run the test.



2) Must-gather for the OCP 4.10 + ODF 4.10 run is available here:

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-056ai3c33-p/j-056ai3c33-p_20211230T130122/logs/testcases_1640872857/

Comment 5 Mudit Agarwal 2022-01-13 10:35:23 UTC
>> Is OCP 4.10 + ODF 4.9 a supported combination?
Yes. Until ODF 4.10 is released, people will only have ODF 4.9 on their OCP 4.10 clusters.

Also, I want to narrow down the problem area. This will help us determine whether the regression is in ODF or OCP.

Comment 6 Yuli Persky 2022-01-13 13:37:11 UTC
@Mudit, 

I will deploy OCP 4.10 with ODF 4.9, run the test, and report the results.

Comment 7 Yuli Persky 2022-01-23 20:47:31 UTC
@Mudit Agarwal,

Per your request I've deployed an OCP 4.10 + ODF 4.9 cluster.

The CephFS pod reattach time results on OCP 4.10 + ODF 4.9 also show degradation compared to OCP 4.9 + ODF 4.9.

CephFS reattach times for a pod with ~200K files:

OCP 4.9 + ODF 4.9:    41 sec
OCP 4.9 + ODF 4.10:   41.1 sec
OCP 4.10 + ODF 4.9:   52.78 sec
OCP 4.10 + ODF 4.10:  47.43 sec


CephFS reattach times for a pod with ~850K files:

OCP 4.9 + ODF 4.9:    178 sec
OCP 4.9 + ODF 4.10:   228.9 sec
OCP 4.10 + ODF 4.9:   266.14 sec
OCP 4.10 + ODF 4.10:  282 sec


The full comparison report, which includes these results, is available here:

https://docs.google.com/document/d/1OJfARHBAJs6bkYqri_HpSNM_N5gchUQ6P-lKe6ujQ6o/edit#

Comment 10 Yuli Persky 2022-03-03 13:25:58 UTC
I've run the pod reattach test again on OCP 4.9 + ODF 4.9. The relevant must-gather logs are available here:

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-ypersky-a9/lr5-ypersky-a9_20220301T225837/logs/testcases_1646222463/

As for the combinational cluster: unfortunately I do not have a must-gather for that run.

Comment 12 Yuli Persky 2022-03-07 21:01:07 UTC
Hi Rakshith,

The performance suite was run in bulk (one test after another) via Jenkins, and must-gather was collected AFTER all the tests had run.
Therefore I cannot narrow it down, unfortunately.
Also, we have not yet added CSI times to this test. That is pending in our team's work plan, and I hope to have it added to the test in the near future.
Also, please note that this test will be enhanced so that the image pull is not included each time we create a pod.
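
For reference, a minimal illustration (an assumption about the nature of that enhancement, not the actual ocs-ci change) of how a pod spec can avoid pulling the image on every pod creation by setting imagePullPolicy to IfNotPresent:

# Illustrative snippet, not the actual ocs-ci patch: with imagePullPolicy set to
# IfNotPresent the kubelet reuses the locally cached image, so image pull time no
# longer inflates the measured reattach time once the image is present on a node.
# (The default policy resolves to Always for :latest or untagged images, which
# re-pulls the image on every pod start.)
from kubernetes import client

container = client.V1Container(
    name="app",
    image="quay.io/centos/centos:stream9",   # placeholder image
    command=["sleep", "infinity"],
    image_pull_policy="IfNotPresent",
)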

Comment 13 Yuli Persky 2022-03-09 22:40:58 UTC
The fixed test (the default pod policy will no longer pull the image each time) is running

on 4.9.4 build 7 : https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10676/

on 4.10.0 build 184: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10677/

When the tests finish, I'll update with the results comparison.

Comment 14 Yuli Persky 2022-03-10 15:45:40 UTC
The results of the fixed pod reattach time tests are available on the Performance Dashboard at this link:

http://ocsperf.ceph.redhat.com:8080/index.php?version1=17&build1=51&platform1=1&az_topology1=1&test_name%5B%5D=6&version2=14&build2=53&platform2=1&az_topology2=1&version3=&build3=&version4=&build4=&submit=Choose+options

4.9 Jenkins Job: 

https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10676/parameters/

4.9 must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-ypersky-9aws/lr5-ypersky-9aws_20220309T120256/logs/testcases_1646865587/

4.10 Jenkins Job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10677/

4.10 must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-ypersky-10aws/lr5-ypersky-10aws_20220309T120401/logs/testcases_1646865624/


The measurements are as follows:

4.9.4 build 7: CephFS pod reattach time for a pod with ~850K files: 308.219 sec
4.10.0 build 184: CephFS pod reattach time for a pod with ~850K files: 315.914 sec

Both measurements are higher than those taken during the previous run, but they do NOT show a degradation between 4.9 and 4.10.
Therefore I think we should close this BZ.

Please note that in general the pod reattach time measurements on 4.10 are high (315 seconds) compared to gp2 performance.
But that's a different issue, not related to a degradation in ODF, and a separate BZ will be filed on that (providing all the details).

