Bug 2181054

Summary: VMware LSO - ODF 4.12 performance is worse than ODF 4.11, apparently following changes in ceph
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Yuli Persky <ypersky>
Component: ceph    Assignee: Neha Ojha <nojha>
ceph sub component: Ceph-MGR QA Contact: Elad <ebenahar>
Status: CLOSED NOTABUG Docs Contact:
Severity: unspecified    
Priority: unspecified CC: bhubbard, bniver, jopinto, kramdoss, muagarwa, odf-bz-bot, rperiyas, sostapov
Version: 4.12    Keywords: Automation, Performance
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-08-08 13:03:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yuli Persky 2023-03-23 00:12:06 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

We see a clear performance degradation on various 4.12 builds vs. various 4.11 builds (4.11.4 tested).
The degradation is as follows:

IO Performance: 
FIO - degradation in IOPS and throughput (persistent in 4.12.0-145, 4.12.0-167, and 4.12.0-173) for sequential and random IO on both CephFS and RBD (see the comparison sketch after this list).
FIO Compressed - degradation in IOPS and throughput (persistent in 4.12.0-152, 4.12.0-167, and 4.12.0-173) for both sequential and random IO.
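
As an illustration of how the degradation figures above are derived, here is a minimal baseline-vs-current comparison sketch in Python (the IOPS values are placeholders, not numbers measured on the clusters in this report):

    # Minimal sketch: percent change between a 4.11 baseline and a 4.12 run
    # for the same FIO workload. The numbers are placeholders only.
    def percent_change(baseline, current):
        # A negative result means a degradation relative to the 4.11 baseline.
        return (current - baseline) / baseline * 100.0

    # Hypothetical IOPS from identical random-write FIO jobs on RBD PVCs.
    iops_4_11 = 12000
    iops_4_12 = 9500
    print(f"Random-write IOPS change: {percent_change(iops_4_11, iops_4_12):+.1f}%")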

Snapshots and Clones: 
Single Snapshot Restore Time (CephFS) - degradation in restore time and speed. We see a degradation in RBD as well; while the RBD times are still short, the degradation is persistent in 4.12.0-145 and 4.12.0-167 (see the timing sketch after this list).
Snapshot Creation - Multiple Files (CephFS) - degradation in snapshot creation time, persistent in 4.12.0-145 and 4.12.0-167.
Single Clone Creation Time (CephFS) - degradation in creation time (persistent in 4.12.0-145, 4.12.0-167, and 4.12.0-173).
Multiple Clone Creation - degradation in average creation time for both RBD and CephFS.
Bulk Clone Creation Time (CephFS) - degradation in CephFS bulk clone creation time (persistent in 4.12.0-145 and 4.12.0-167, though the 4.12.0-167 results are better).
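
The restore-time measurement above boils down to timing how long a PVC created from a VolumeSnapshot takes to become Bound. A minimal sketch of that measurement with the kubernetes Python client (PVC and namespace names are hypothetical; the actual ocs_ci tests add setup, teardown and workload verification on top of this):

    # Minimal sketch: time how long a PVC restored from a VolumeSnapshot
    # takes to reach the Bound phase. All names are hypothetical.
    import time
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    def wait_for_pvc_bound(name, namespace, timeout=1800, interval=5):
        start = time.time()
        while time.time() - start < timeout:
            pvc = core.read_namespaced_persistent_volume_claim(name, namespace)
            if pvc.status.phase == "Bound":
                return time.time() - start  # restore time in seconds
            time.sleep(interval)
        raise TimeoutError(f"PVC {name} not Bound within {timeout}s")

    # Assumes the PVC restoring the snapshot has already been created.
    restore_seconds = wait_for_pvc_bound("restored-cephfs-pvc", "perf-test")
    print(f"Snapshot restore time: {restore_seconds:.1f}s")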

Pod Attach - Reattach Time: 
There is a degradation in pod reattach time for both RBD and CephFS PVCs, especially CephFS; for pods with more files (checked up to ~820K files) the degradation is much more significant (see the sketch after this list).
Bulk Pod Attach Time - there is a degradation in reattach time (persistent in 4.12.0-145, 4.12.0-167, and 4.12.0-173) for both RBD and CephFS pods.
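
The pod reattach measurement follows the same pattern as the PVC sketch above, polling the pod phase instead of the PVC phase. A compact sketch, again with hypothetical names:

    # Minimal sketch: time how long a re-created pod (reusing the same PVC)
    # takes to reach the Running phase. All names are hypothetical.
    import time
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    def wait_for_pod_running(name, namespace, timeout=3600, interval=5):
        start = time.time()
        while time.time() - start < timeout:
            pod = core.read_namespaced_pod(name, namespace)
            if pod.status.phase == "Running":
                return time.time() - start  # reattach time in seconds
            time.sleep(interval)
        raise TimeoutError(f"Pod {name} not Running within {timeout}s")

    print(f"Pod reattach time: {wait_for_pod_running('fio-pod', 'perf-test'):.1f}s")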


Version of all relevant components (if applicable):

                4.11 cluster                           4.12 cluster
OCP Version     4.11.0-0.nightly-2023-03-07-114656     4.12.0-0.nightly-2023-02-04-034821
ODF Version     4.11.4-4                               4.12.0-173
Ceph Version    16.2.8-84.el8cp                        16.2.10-94.el8cp
Cluster name    ypersky-lso411                         ypersky-173a


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, since IO performance and creation/deletion/attach/reattach times have a direct impact on the customer's experience. 


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3


Can this issue be reproduced?

Yes! 
I've run the performance suite on 2 different 4.11 builds and 3 different 4.12 builds, and comparing any pair of results (4.12 vs. 4.11) shows a clear degradation in 4.12.


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:

Yes, this is a regression: in 4.11 the IO performance and all the creation/deletion/attach/reattach measurements are much better.


Steps to Reproduce:

1. Deploy a VMware LSO cluster with a 4.12 GA build.
2. Run the performance suite tests (the "performance" marker, tests/e2e/performance in the ocs_ci project), as sketched below.
3. Compare those results to any 4.11.x results (any release up to, but not including, 4.11.5).
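
For step 2, the suite can be launched through the ocs_ci test runner. A minimal invocation sketch wrapped in Python (the exact run-ci flags, config file and cluster path below are assumptions and depend on how the cluster was deployed):

    # Minimal sketch: run the ocs_ci performance suite against an existing
    # cluster. Flags and paths are assumptions; adjust them to the actual
    # deployment before use.
    import subprocess

    cmd = [
        "run-ci",
        "-m", "performance",              # select tests with the performance marker
        "tests/e2e/performance",          # performance test directory in ocs_ci
        "--cluster-path", "/path/to/cluster-dir",
        "--ocsci-conf", "/path/to/ocsci-conf.yaml",
    ]
    subprocess.run(cmd, check=True)

Running the same command against the 4.11 and the 4.12 cluster produces the two result sets that are compared in the report linked under Additional info.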


Actual results:

We see a clear degradation in 4.12 vs. 4.11 across all the measurements listed in the Description above: FIO IOPS and throughput (sequential and random, compressed and uncompressed, on both CephFS and RBD), snapshot restore and creation times, single/multiple/bulk clone creation times, and pod attach/reattach times (especially for CephFS pods with many files).


Expected results:

No degradation should be seen in the 4.12 results


Additional info:

Please refer to this document (Performance Comparison Report): https://docs.google.com/document/d/15ATM0gDw0Df25uYkLy7A_TKK9oHbNXH-Zt4DW9-t3r0/edit#
It contains a link to the Performance Dashboard and links to the Jenkins jobs (with the names of the clusters and the full run logs).

Comment 4 Mudit Agarwal 2023-08-08 13:03:42 UTC
Closing due to inactivity