Description of problem (please be as detailed as possible and provide log
snippets):
We see a clear performance degradation on various 4.12 builds vs. various 4.11 builds (4.11.4 tested).
The degradation is as follows:
IO Performance:
FIO - degradation in IOPS and throughput (persistent in 4.12.0-145, 4.12.0-167 and 4.12.0-173) for sequential and random IO on both CephFS and RBD (a sketch of the kind of FIO job involved appears after this list).
FIO Compressed - degradation in IOPS and throughput (persistent in 4.12.0-152, 4.12.0-167 and 4.12.0-173) for both sequential and random IO.
Snapshots and Clones:
Single Snapshot Restore Time (CephFS) - degradation in restore time and speed. We see a degradation in RBD as well; although the times are still short, it is persistent in 4.12.0-145 and 4.12.0-167.
Snapshot Creation - Multiple Files (CephFS) - degradation in snapshot creation time, persistent in 4.12.0-145 and 4.12.0-167.
Single Clone Creation Time (CephFS) - degradation in creation time (persistent in 4.12.0-145, 4.12.0-167 and 4.12.0-173).
Multiple Clone Creation - degradation in average creation time for both RBD and CephFS.
Bulk Clone Creation Time (CephFS) - degradation in CephFS bulk clone creation time (persistent in 4.12.0-145 and 4.12.0-167, though the results in 4.12.0-167 are better).
Pod Attach/Reattach Time:
There is a degradation in pod reattach time for both RBD and CephFS PVCs, especially CephFS; for pods with more files (checked up to ~820K files) the degradation is much more significant.
Bulk Pod Attach Time - there is a degradation in reattach time (persistent in 4.12.0-145, 4.12.0-167 and 4.12.0-173) for both RBD and CephFS pods.
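For reference, the FIO numbers above are presumably collected by running FIO jobs inside pods against CephFS/RBD-backed PVCs. Below is a minimal sketch of that kind of measurement; the exact job parameters used by the ocs_ci performance suite are not listed in this report, so the block size, IO depth, runtime and mount path here are placeholder assumptions for illustration only.

# Illustrative only: block size, IO depth, runtime and mount path are assumptions,
# not the exact FIO job definition used by the ocs_ci performance suite.
import json
import subprocess

def run_fio(mount_path, rw="randread", bs="4k", runtime=60):
    """Run one FIO job against a mounted PVC and return (IOPS, bandwidth in KiB/s)."""
    cmd = [
        "fio",
        "--name=perf-check",
        f"--directory={mount_path}",   # mount point of the CephFS or RBD PVC
        f"--rw={rw}",                  # read/write/randread/randwrite
        f"--bs={bs}",
        "--size=1g",
        "--ioengine=libaio",
        "--direct=1",
        "--iodepth=16",
        "--time_based",
        f"--runtime={runtime}",
        "--group_reporting",
        "--output-format=json",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    job = json.loads(result.stdout)["jobs"][0]
    side = "read" if "read" in rw else "write"
    return job[side]["iops"], job[side]["bw"]

# Example: run the same job on a 4.11 and a 4.12 cluster and compare the numbers.
# print(run_fio("/mnt/cephfs-pvc", rw="randwrite"))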
Version of all relevant components (if applicable):
Component      4.11 cluster                          4.12 cluster
OCP Version    4.11.0-0.nightly-2023-03-07-114656    4.12.0-0.nightly-2023-02-04-034821
ODF Version    4.11.4-4                              4.12.0-173
Ceph Version   16.2.8-84.el8cp                       16.2.10-94.el8cp
Cluster name   ypersky-lso411                        ypersky-173a
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, since IO performance and creation/deletion/attach/reattach times have a direct impact on the customer's experience.
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3
Is this issue reproducible?
Yes!
I've run the performance suite on 2 different 4.11 builds and 3 different 4.12 builds, and comparing any pair of results (4.12 vs. 4.11) shows a clear degradation in 4.12.
Can this issue be reproduced from the UI?
If this is a regression, please provide more details to justify this:
Yes, this is a regression, since in 4.11 the IO performance and all the creation/deletion/attach/reattach measurements are much better.
Steps to Reproduce:
1. Deploy a VMware LSO cluster with a 4.12 GA build.
2. Run the performance suite tests (performance marker, tests/e2e/performance in the ocs_ci project); see the sketch below.
3. Compare the results to any 4.11.X results (X up to 5, excluding 4.11.5).
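A minimal sketch of step 2, assuming a working ocs_ci checkout and an already-deployed cluster. run-ci is ocs_ci's pytest wrapper; the --cluster-path/--ocsci-conf arguments and the file paths below are assumptions that may differ between ocs_ci versions, so check the ocs_ci documentation for the exact invocation.

# Sketch only: flags and paths are assumptions; consult the ocs_ci docs for
# the exact invocation matching your environment and ocs_ci version.
import subprocess

def run_performance_suite(cluster_path, ocsci_conf):
    cmd = [
        "run-ci",                         # ocs_ci's pytest wrapper
        "-m", "performance",              # select only performance-marked tests
        "tests/e2e/performance",
        "--cluster-path", cluster_path,   # directory holding the cluster's auth/metadata
        "--ocsci-conf", ocsci_conf,       # platform config, e.g. a vSphere LSO conf file
    ]
    subprocess.run(cmd, check=True)

# run_performance_suite("/home/user/clusters/ypersky-173a", "conf/vsphere_lso.yaml")

Running the same command against both the 4.11 and the 4.12 cluster produces the two result sets that are compared in step 3.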
Actual results:
We see a clear degradation.
The degradation matches the breakdown given in the Description of problem section above (IO performance, snapshots and clones, pod attach/reattach times).
Expected results:
No degradation should be seen in the 4.12 results.
Additional info:
Please refer to this document (Performance Comparison Report): https://docs.google.com/document/d/15ATM0gDw0Df25uYkLy7A_TKK9oHbNXH-Zt4DW9-t3r0/edit#
It contains a link to the Performance Dashboard and links to the Jenkins jobs (with the names of the clusters and the full run logs).