Description of problem:
There is a degradation in pod reattach time for CephFS pods with ~850K files in ODF 4.10 vs ODF 4.9.

Version-Release number of selected component (if applicable):
ODF 4.10.0.50

Note: you may find additional details in the following Jenkins job:
https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/view/Performance/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-performance/56/

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:
In 4.9 OCP + 4.9 ODF, the average of 10 reattach times for a CephFS pod with around 850K files was: CephFS: 178 sec
In 4.10 OCP + 4.10 ODF, the average of 10 pod reattach times was: CephFS: 282 sec
In 4.9 OCP + 4.10 ODF, the average of 10 pod reattach times was: CephFS: 228 sec

How reproducible:

Steps to Reproduce:
1. Run the tests/e2e/performance/csi_tests/test_pod_reattachtime.py test
2. Compare its results (average reattach time of 10 samples) to the 4.9 results (available in this report: https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit)
3.

Actual results:
Average pod reattach time is ~30% longer in ODF 4.10

Expected results:
Average pod reattach time should be the same as or shorter than in 4.9

Additional info:
Relevant Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/view/Performance/job/qe-trigger-aws-ipi-3az-rhcos-3m-3w-performance/56/
Comparison report: https://docs.google.com/document/d/1OJfARHBAJs6bkYqri_HpSNM_N5gchUQ6P-lKe6ujQ6o/edit#
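For context, the test referenced in the reproduction steps measures how long it takes a pod that re-uses a pre-populated CephFS PVC to reach the Running state after the previous pod is deleted. Below is a minimal sketch of that measurement logic using the kubernetes Python client; the function names, polling approach, and manifest handling are illustrative assumptions, not the actual ocs-ci implementation.

import time
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

def wait_for_pod_deleted(core, namespace, name, timeout=300):
    # Poll until the pod object is gone so the CephFS PVC is free to be re-attached.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            core.read_namespaced_pod(name, namespace)
        except ApiException as err:
            if err.status == 404:
                return
            raise
        time.sleep(2)
    raise TimeoutError("pod %s was not deleted within %ss" % (name, timeout))

def measure_pod_reattach_time(namespace, old_pod_name, new_pod_manifest, timeout=600):
    # Delete the pod holding the pre-filled CephFS PVC, re-create a pod against
    # the same PVC, and return the number of seconds until the new pod is Running.
    config.load_kube_config()
    core = client.CoreV1Api()

    core.delete_namespaced_pod(old_pod_name, namespace)
    wait_for_pod_deleted(core, namespace, old_pod_name)

    start = time.time()
    core.create_namespaced_pod(namespace, new_pod_manifest)
    new_name = new_pod_manifest["metadata"]["name"]
    w = watch.Watch()
    for event in w.stream(core.list_namespaced_pod, namespace=namespace,
                          field_selector="metadata.name=" + new_name,
                          timeout_seconds=timeout):
        if event["object"].status.phase == "Running":
            w.stop()
            return time.time() - start
    raise TimeoutError("new pod did not reach Running within the timeout")

The reported numbers are the average of 10 such samples against the same pre-filled PVC.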
Yuli, can we also run ODF 4.9 + OCP 4.10? Also, we will need must-gather for all the runs.
@Mudit Agarwal
1) I did run the test on OCP 4.9 + ODF 4.10 (see the results in the bug description). Is OCP 4.10 + ODF 4.9 a supported combination? Please write here if yes, and I will try to deploy it and run the test.
2) Must-gather for the OCP 4.10 + ODF 4.10 run is available here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-056ai3c33-p/j-056ai3c33-p_20211230T130122/logs/testcases_1640872857/
>> Is OCP 4.10 + ODF 4.9 a supported combination?
Yes, until ODF 4.10 is released people will have only ODF 4.9 on their OCP 4.10 clusters. Also, I want to narrow down the problem area; this will help us determine whether the regression is in ODF or OCP.
@Mudit, I will deploy OCP 4.10 with ODF 4.9, run the test, and report the results.
@Mudit Agarwal, per your request I've deployed an OCP 4.10 + ODF 4.9 cluster. The CephFS pod reattach times on 4.10 + 4.9 also show degradation compared to OCP 4.9 + ODF 4.9.

CephFS reattach times for a pod with ~200K files:
OCP 4.9 + ODF 4.9: 41 sec; OCP 4.9 + ODF 4.10: 41.1 sec; OCP 4.10 + ODF 4.9: 52.78 sec; OCP 4.10 + ODF 4.10: 47.43 sec

CephFS reattach times for a pod with ~850K files:
OCP 4.9 + ODF 4.9: 178 sec; OCP 4.9 + ODF 4.10: 228.9 sec; OCP 4.10 + ODF 4.9: 266.14 sec; OCP 4.10 + ODF 4.10: 282 sec

The full comparison report, which includes these results, is available here: https://docs.google.com/document/d/1OJfARHBAJs6bkYqri_HpSNM_N5gchUQ6P-lKe6ujQ6o/edit#
Please note: must-gather is available here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-056ai3c33-p/j-056ai3c33-p_20211230T130122/logs/testcases_1640872857/
I've run the pod reattach test again on 4.9 OCP + 4.9 ODF. You can find the relevant must-gather logs here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-ypersky-a9/lr5-ypersky-a9_20220301T225837/logs/testcases_1646222463/ As for the mixed-version (combinational) cluster, unfortunately I do not have a must-gather for that run.
Hi Rakshith, the performance suite was run in bulk (one test after another) via Jenkins, and must-gather was collected AFTER all the tests ran. Therefore, unfortunately, I cannot narrow it down. Also, we have not yet added CSI times to this test; this is pending in our team's work plan, and I hope to have that added to the test in the near future. Please also note that this test will be enhanced so that the container image is not pulled each time we create a pod.
The fixed test (the default pod image pull policy will no longer pull the image each time) is running on 4.9.4 build 7: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10676/ and on 4.10.0 build 184: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10677/ When the tests finish I'll update with the results comparison.
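For reference, the fix mentioned above concerns the pod's image pull behaviour: with imagePullPolicy set to IfNotPresent, the node re-uses the locally cached image instead of pulling it on every pod creation, so image pull time no longer inflates the measured reattach time. Below is a minimal illustrative pod manifest (the image, pod, and PVC names are placeholders, not the exact manifest used by the test):

reattach_pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "cephfs-reattach-pod"},
    "spec": {
        "containers": [{
            "name": "workload",
            "image": "quay.io/example/perf-workload:latest",  # placeholder image
            # Re-use the locally cached image instead of pulling it every time.
            "imagePullPolicy": "IfNotPresent",
            "volumeMounts": [{"name": "data", "mountPath": "/mnt/data"}],
        }],
        "volumes": [{
            "name": "data",
            # PVC that was pre-populated with ~850K files (name is illustrative).
            "persistentVolumeClaim": {"claimName": "cephfs-850k-files-pvc"},
        }],
    },
}

Such a manifest could be passed directly to create_namespaced_pod() in the measurement sketch above.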
The results of the fixed pod reattach time tests are available on the Performance Dashboard at this link: http://ocsperf.ceph.redhat.com:8080/index.php?version1=17&build1=51&platform1=1&az_topology1=1&test_name%5B%5D=6&version2=14&build2=53&platform2=1&az_topology2=1&version3=&build3=&version4=&build4=&submit=Choose+options

4.9 Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10676/parameters/
4.9 must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-ypersky-9aws/lr5-ypersky-9aws_20220309T120256/logs/testcases_1646865587/
4.10 Jenkins job: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/10677/
4.10 must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/lr5-ypersky-10aws/lr5-ypersky-10aws_20220309T120401/logs/testcases_1646865624/

The measurements are the following:
4.9.4 build 7, CephFS pod reattach time for a pod with ~850K files: 308.219 sec
4.10.0 build 184, CephFS pod reattach time for a pod with ~850K files: 315.914 sec

Both measurements are higher than those taken during the previous run, but they do NOT show a degradation between versions. Therefore I think we should close this BZ. Please note that, in general, the pod reattach times on 4.10 are high (315 seconds) when compared to gp2 performance, but that is a different issue, not related to a degradation in ODF, and a separate BZ will be filed for it (providing all the details).