Bug 2181054 - VMware LSO - ODF 4.12 performance is worse than ODF 4.11, apparently following changes in Ceph
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Neha Ojha
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-03-23 00:12 UTC by Yuli Persky
Modified: 2023-08-09 16:37 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-08-08 13:03:42 UTC
Embargoed:



Description Yuli Persky 2023-03-23 00:12:06 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

We see a clear performance degradation on various 4.12 builds compared to various 4.11 builds (4.11.4 tested).
The degradation is as follows:

IO Performance:
FIO - degradation in IOPS and throughput (persistent in 4.12.0-145, 4.12.0-167 and 4.12.0-173) for sequential and random IO on both CephFS and RBD.
FIO Compressed - degradation in IOPS and throughput (persistent in 4.12.0-152, 4.12.0-167 and 4.12.0-173) for both sequential and random IO.

Snapshots and Clones:
Single Snapshot Restore Time (CephFS) - degradation in restore time and speed. We see a degradation in RBD as well; while the times there are still short, it is persistent in 4.12.0-145 and 4.12.0-167.
Snapshot Creation - Multiple Files (CephFS) - degradation in snapshot creation time, persistent in 4.12.0-145 and 4.12.0-167.
Single Clone Creation Time (CephFS) - degradation in creation time (persistent in 4.12.0-145, 4.12.0-167 and 4.12.0-173).
Multiple Clone Creation - degradation in average creation time for both RBD and CephFS.
Bulk Clone Creation Time (CephFS) - degradation in CephFS bulk clone creation time (persistent in 4.12.0-145 and 4.12.0-167, though the 4.12.0-167 results are better).

Pod Attach - Reattach Time:
There is a degradation in pod reattach time for both RBD and CephFS PVCs, especially CephFS; for pods with more files (checked up to ~820K files) the degradation is much more significant.
Bulk Pod Attach Time - there is a degradation in reattach time (persistent in 4.12.0-145, 4.12.0-167 and 4.12.0-173) for both RBD and CephFS pods.
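
For context, a minimal sketch (not part of the original report) of how a percentage regression between two builds can be computed from FIO summary numbers; the metric names and the 5% threshold are illustrative assumptions, not values taken from the comparison report:

# Illustrative helper for quantifying a regression between two builds.
# Inputs are plain IOPS/throughput numbers taken from FIO summaries; none of
# the names or thresholds here come from the bug report itself.

def regression_pct(baseline: float, candidate: float) -> float:
    """Percentage drop of `candidate` relative to `baseline` (positive = slower)."""
    if baseline <= 0:
        raise ValueError("baseline must be a positive measurement")
    return (baseline - candidate) / baseline * 100.0

def compare_runs(run_411: dict, run_412: dict, threshold: float = 5.0) -> dict:
    """Flag metrics (e.g. 'rbd_seq_iops') whose drop exceeds `threshold` percent."""
    flagged = {}
    for metric, baseline in run_411.items():
        drop = regression_pct(baseline, run_412[metric])
        if drop > threshold:
            flagged[metric] = round(drop, 1)
    return flagged

Feeding the per-build FIO summaries (IOPS, MiB/s) into compare_runs gives, per metric, the size of the drop that is described qualitatively above.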


Version of all relevant components (if applicable):

              4.11 build                           4.12 build
OCP Version   4.11.0-0.nightly-2023-03-07-114656   4.12.0-0.nightly-2023-02-04-034821
ODF Version   4.11.4-4                             4.12.0-173
Ceph Version  16.2.8-84.el8cp                      16.2.10-94.el8cp
Cluster name  ypersky-lso411                       ypersky-173a


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, since IO performance and creation/deletion/attach/reattach times have a direct impact on the customer's experience. 


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3


Can this issue be reproduced?

Yes! 
I've run the performance suite on 2 different 4.11 builds and 3 different 4.12 builds, and every pairwise comparison of the results (4.12 vs 4.11) shows a clear degradation in 4.12.


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:

Yes, this is a regression, since in 4.11 the IO performance and all the creation/deletion/attach/reattach measurements are much better.


Steps to Reproduce:

1. Deploy a VMware LSO cluster with the 4.12 GA build.
2. Run the performance suite tests (the "performance" marker, tests/e2e/performance in the ocs_ci project); see the sketch after these steps.
3. Compare the results to any 4.11.x results (up to, but not including, 4.11.5).
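
For reference, a minimal sketch of step 2, assuming an ocs-ci checkout and an already-deployed cluster. ocs-ci is normally driven through its run-ci wrapper, so the configuration file, cluster path and exact options below are illustrative assumptions rather than the exact command line used in the Jenkins jobs:

# Hypothetical driver for the ocs-ci performance suite (normally run via run-ci).
# The paths and option values are placeholders; see the ocs-ci docs for the
# exact invocation used in the linked Jenkins jobs.
import pytest

pytest_args = [
    "-m", "performance",                          # select the performance-marked tests
    "tests/e2e/performance",                      # the suite referenced in this report
    "--cluster-path", "/path/to/cluster-dir",     # assumed: deployed-cluster auth/metadata dir
    "--ocsci-conf", "conf/my_vsphere_lso.yaml",   # assumed: vSphere LSO cluster configuration
    "--junit-xml", "perf_results_4_12.xml",       # keep the results for the 4.11 vs 4.12 comparison
]

exit_code = pytest.main(pytest_args)
print(f"pytest exit code: {exit_code}")

Running the same invocation against a 4.11 cluster and a 4.12 cluster and comparing the collected results corresponds to steps 2-3 above.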


Actual results:

We see a clear degradation in all of the areas listed in the Description above: FIO IOPS and throughput (sequential and random, CephFS and RBD, including compressed workloads), snapshot restore and creation times, single/multiple/bulk clone creation times, and pod attach/reattach times (single and bulk, RBD and CephFS).


Expected results:

No degradation should be seen in the 4.12 results


Additional info:

Please refer to this document (Performance Comparison Report): https://docs.google.com/document/d/15ATM0gDw0Df25uYkLy7A_TKK9oHbNXH-Zt4DW9-t3r0/edit#
It contains a link to the Performance Dashboard and links to the Jenkins jobs (with the names of the clusters and the full run logs).

Comment 4 Mudit Agarwal 2023-08-08 13:03:42 UTC
Closing due to inactivity

