Bug 1984804

Summary: [Tracker for OCP BZ #1988013] AWS - degradation in RBD pod reattach time in OCP 4.8 vs 4.7
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Yuli Persky <ypersky>
Component: csi-driver Assignee: Humble Chirammal <hchiramm>
Status: CLOSED NOTABUG QA Contact: Elad <ebenahar>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.8 CC: alayani, kramdoss, madam, muagarwa, ocs-bugs, odf-bz-bot, owasserm, ratamir, rcyriac
Target Milestone: --- Keywords: Automation, Performance, Regression
Target Release: --- Flags: kramdoss: needinfo+
kramdoss: needinfo+
kramdoss: needinfo+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1988013 (view as bug list) Environment:
Last Closed: 2021-09-14 09:09:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1988013    

Description Yuli Persky 2021-07-22 09:35:16 UTC
Description of problem:

AWS platform: pod reattach time for an RBD-backed pod has degraded in 4.8 compared with 4.7.

In 4.7 on AWS, it took around 29 sec for an RBD pod to reattach.
In 4.8 we took 5 measurements:

39.93 sec
39.62 sec
48.83 sec
39.8 sec
44.55 sec
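
For reference, the size of the degradation can be checked directly from the numbers above. The following is a minimal sketch, assuming the 4.7 baseline is exactly 29 sec (the report only says "around 29 sec"); the names BASELINE_47_SEC, SAMPLES_48_SEC and percent_increase are illustrative and are not part of the test suite.

```python
from statistics import mean

# Assumption: the 4.7 baseline is taken as exactly 29 sec ("around 29 sec" in the report).
BASELINE_47_SEC = 29.0
# The five 4.8 reattach measurements listed in the report, in seconds.
SAMPLES_48_SEC = [39.93, 39.62, 48.83, 39.8, 44.55]


def percent_increase(baseline: float, value: float) -> float:
    """Return how much larger `value` is than `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


mean_48 = mean(SAMPLES_48_SEC)
increase = percent_increase(BASELINE_47_SEC, mean_48)

print(f"4.8 mean reattach time: {mean_48:.2f} sec")
print(f"Increase over 4.7 baseline: {increase:.1f}%")
```

With these inputs the 4.8 mean is about 42.5 sec, roughly 47% above the assumed baseline.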



Version-Release number of selected component (if applicable):

HW Platform	AWS
Number of OCS nodes	3
Number of total OSDs	3
OSD Size (TiB)	2.00
Total available storage (GiB)	6,140
OCP Version	4.8.0-0.nightly-2021-07-04-112043
OCS Version	4.8.0-444.ci
Ceph Version	14.2.11-183.el8cp

How reproducible:

Reproducible all the time on AWS (attach a pod to an RBD PVC).

Steps to Reproduce:
1. Deploy an AWS cluster with 2 TB OSDs
2. Run tests/e2e/performance/test_pod_reattachtime.py

Actual results:

Pod creation time is longer than in 4.7 (a degradation of around 40%-50%).

Expected results:

Pod creation time should be the same or better than in 4.7. 

Additional info:

The complete AWS comparison report for 4.7 vs 4.8 is available here: 
https://docs.google.com/document/d/1-lOb4szqLM4LoWnMr_JCp9zurBqpjeva5BUEH-yer4s/edit?ts=60f62010#

The console logs are available here (a separate log for each sample execution):

10.70.39.233:/ypersky_report_logs/48/aws/

Must-gather logs are being collected and a link will be posted shortly.

Comment 2 Yuli Persky 2021-07-22 10:00:24 UTC
Must-gather logs are available here: 

10.70.39.233:/home/ypersky/bz_1984804/logs-20210722-145415

Comment 4 Humble Chirammal 2021-07-23 04:49:00 UTC
Yuli, how can we access the MG logs at 10.70.39.233:/home/ypersky/bz_1984804/logs-20210722-145415?

Comment 5 krishnaram Karthick 2021-07-24 15:52:20 UTC
After discussing with engineering, we re-ran the tests with the following combinations to rule out any issues in OCP:

1) OCP 4.8 + OCS 4.8 - https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4756/consoleFull
2) OCP 4.7 + OCS 4.8 - https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/4757/consoleFull

Reattach time for RBD in OCP 4.8 + OCS 4.8 is 43.99 seconds
Reattach time for RBD in OCP 4.7 + OCS 4.8 is 29.17 seconds

Must-gather logs for both cases will be attached shortly.

Comment 7 Mudit Agarwal 2021-07-26 10:10:05 UTC
Thanks Karthick.

So, the above data suggests that there is a regression in OCP 4.8.
Just for the record, we are using new sidecar images in OCS 4.8.

Comment 11 Humble Chirammal 2021-07-29 05:15:28 UTC
Hi Avi, thanks for collecting the above metrics too. With that, the available data looks like this:

Reattach time for RBD in OCP 4.8 + OCS 4.7 is 43.99 seconds
Reattach time for RBD in OCP 4.8 + OCS 4.8 is 43.99 seconds
Reattach time for RBD in OCP 4.7 + OCS 4.8 is 29.17 seconds
Reattach time for RBD in OCP 4.7 + OCS 4.7 is 29.1 seconds
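
To make the comparison explicit, here is a minimal sketch that splits the four numbers above into an average OCP effect and an average OCS effect. The values are copied as quoted; the names REATTACH_SEC and average_effect are illustrative only.

```python
# Reattach times in seconds, keyed by (OCP version, OCS version), as quoted above.
REATTACH_SEC = {
    ("4.8", "4.7"): 43.99,
    ("4.8", "4.8"): 43.99,
    ("4.7", "4.8"): 29.17,
    ("4.7", "4.7"): 29.1,
}


def average_effect(times: dict, axis: int) -> float:
    """Average change in seconds when one axis moves from 4.7 to 4.8.

    axis=0 varies the OCP version, axis=1 varies the OCS version;
    the other version is held fixed and the two deltas are averaged.
    """
    deltas = []
    for fixed in ("4.7", "4.8"):
        if axis == 0:
            old, new = times[("4.7", fixed)], times[("4.8", fixed)]
        else:
            old, new = times[(fixed, "4.7")], times[(fixed, "4.8")]
        deltas.append(new - old)
    return sum(deltas) / len(deltas)


ocp_effect = average_effect(REATTACH_SEC, axis=0)
ocs_effect = average_effect(REATTACH_SEC, axis=1)

print(f"Average change from OCP 4.7 to 4.8: {ocp_effect:+.2f} sec")
print(f"Average change from OCS 4.7 to 4.8: {ocs_effect:+.2f} sec")
```

On these four data points the OCP upgrade accounts for nearly 15 sec, while the OCS upgrade changes the result by well under 0.1 sec.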

As mentioned earlier, it seems that OCS 4.7 and 4.8 respond pretty much the same way against the same OCP version. However, the VMware test result [1] for reattach reports an improvement in performance with the 4.8 versions:

For POD attach time we can observe improvement of ~70% on CephFS
For POD reattach time we can observe improvement of ~50% on RBD 

Do the builds and hardware remain the same across these tests on the different platforms (AWS and VMware)?

[1] https://docs.google.com/document/d/1KDPPfVywM5-Y4MzYOSUndAnAbPfhgth9UazppOOfMck/edit#

Comment 12 Avi Liani 2021-07-29 06:01:14 UTC
(In reply to Humble Chirammal from comment #11)
> Hi Avi,  Thanks for the collecting above metrics too. With that, from the
> available data it looks like below.
> 
> Reattach time for RBD in OCP 4.8 + OCS 4.7 is 43.99 seconds
> Reattach time for RBD in OCP 4.8 + OCS 4.8 is 43.99 seconds
> Reattach time for RBD in OCP 4.7 + OCS 4.8 is 29.17 seconds
> Reattach time for RBD in OCP 4.7 + OCS 4.7 is 29.1 seconds
> 
> As mentioned earlier, it seems that OCS 4.7 and 4.8  against same OCP
> versions respond pretty much the same way.  However while looking at the
> vmware test result [1] for reattach, it has reported an improvement in
> performance with 4.8 versions:
> 
> For POD attach time we can observe improvement of ~70% on CephFS
> For POD reattach time we can observe improvement of ~50% on RBD 
> 
> Are these build and hardware remains same across these tests in different (
> aws and vmware) platforms?  

Yes, during the test the hardware and build remain the same.

> 
> [1]
> https://docs.google.com/document/d/1KDPPfVywM5-Y4MzYOSUndAnAbPfhgth9UazppOOfMck/edit#

Comment 13 Humble Chirammal 2021-07-30 05:34:32 UTC
(In reply to Avi Liani from comment #12)
> (In reply to Humble Chirammal from comment #11)
> > Hi Avi,  Thanks for the collecting above metrics too. With that, from the
> > available data it looks like below.
> > 
> > Reattach time for RBD in OCP 4.8 + OCS 4.7 is 43.99 seconds
> > Reattach time for RBD in OCP 4.8 + OCS 4.8 is 43.99 seconds
> > Reattach time for RBD in OCP 4.7 + OCS 4.8 is 29.17 seconds
> > Reattach time for RBD in OCP 4.7 + OCS 4.7 is 29.1 seconds
> > 
> > As mentioned earlier, it seems that OCS 4.7 and 4.8  against same OCP
> > versions respond pretty much the same way.  However while looking at the
> > vmware test result [1] for reattach, it has reported an improvement in
> > performance with 4.8 versions:
> > 
> > For POD attach time we can observe improvement of ~70% on CephFS
> > For POD reattach time we can observe improvement of ~50% on RBD 
> > 
> > Are these build and hardware remains same across these tests in different (
> > aws and vmware) platforms?  
> 
> Yes, during the test hardware and build remains the same.

This is a bit confusing: if all the OCP builds and hardware remain the same, and the reattach time regression shows up on AWS but not on VMware, it is difficult to conclude that the OCP code even has a regression.


> 
> > 
> > [1]
> > https://docs.google.com/document/d/1KDPPfVywM5-Y4MzYOSUndAnAbPfhgth9UazppOOfMck/edit#

Comment 15 Mudit Agarwal 2021-09-06 09:41:59 UTC
Hi Yuli/Avi/Karthick

We have a request from Jan on the OCP BZ, PTAL

https://bugzilla.redhat.com/show_bug.cgi?id=1988013#c19

Comment 16 Humble Chirammal 2021-09-14 09:09:19 UTC
I am closing this bug as per the comment (https://bugzilla.redhat.com/show_bug.cgi?id=1988013#c23) in the tracking issue. Please feel free to open a new issue if we come across the same problem.