Bug 1984590

Summary: AWS - Degradation of performance of small files CephFS 4 KB file size in 4.8 compared to 4.7 results
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Yuli Persky <ypersky>
Component: ceph
Assignee: Greg Farnum <gfarnum>
Status: CLOSED DUPLICATE
QA Contact: Elad <ebenahar>
Severity: unspecified
Priority: unspecified
Docs Contact:
Version: 4.8
CC: alayani, bengland, bniver, khiremat, kramdoss, madam, muagarwa, ocs-bugs, odf-bz-bot, pdonnell, rcyriac, shberry, vshankar
Target Milestone: ---
Keywords: Performance, Regression
Target Release: ---
Flags: kramdoss: needinfo+
       muagarwa: needinfo? (shberry)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-14 14:36:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Yuli Persky 2021-07-21 17:40:26 UTC
Description of problem (please be as detailed as possible and provide log snippets):

There is a degradation of performance (IO, throughput, and latency are about 90% worse than in 4.7) for small files: CephFS interface, 4 KB file size, on the AWS platform.

The full comparison report is available here (small files results, page 18):
 
https://docs.google.com/document/d/1-lOb4szqLM4LoWnMr_JCp9zurBqpjeva5BUEH-yer4s/edit?ts=60f62010#

Version of all relevant components (if applicable):


HW Platform	vsphere
Number of OCS nodes	3
Number of total OSDs	3
OSD Size (TiB)	1.46
Total available storage (GiB)	4,467
OCP Version	4.8.0-0.nightly-2021-07-16-010020
OCS Version	4.8.0-450.ci
Ceph Version	14.2.11-184.el8cp

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Lower throughput and higher latency during I/O.


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3


Is this issue reproducible?

Yes 

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:

In 4.7, throughput and IO were higher and latency was lower, as can be seen in the performance report:

https://docs.google.com/document/d/1-lOb4szqLM4LoWnMr_JCp9zurBqpjeva5BUEH-yer4s/edit?ts=60f62010#


Steps to Reproduce:

0. Deploy an AWS cluster with 2 TB OSD size.
1. Run tests/e2e/performance/test_small_file_workload.py
2. Compare the IO, throughput, and latency results for CephFS 4 KB files to the 4.7 results.


Actual results:

IO and throughput are about 90% worse than in 4.7 on the same platform (AWS).

Expected results:

The results should be the same as or better than in 4.7.


Additional info:


The data is also available here : 
http://ocsperf.ceph.redhat.com:8080/index.php?version1=5&build1=10&platform1=1&az_topology1=1&test_name[]=2&test_name[]=3&version2=&build2=&version3=&build3=&submit=Choose+options

Test console log is available here: 

10.70.39.233:/ypersky_report_logs/48/aws/

Comment 2 Yuli Persky 2021-07-21 18:56:27 UTC
Please note that the platform is AWS, not vSphere/VMware (as appears by mistake in the description).

Comment 3 Yuli Persky 2021-07-21 18:57:07 UTC
Full version list : 

HW Platform	AWS
Number of OCS nodes	3
Number of total OSDs	3
OSD Size (TiB)	2.00
Total available storage (GiB)	6,140
OCP Version	4.8.0-0.nightly-2021-07-04-112043
OCS Version	4.8.0-444.ci
Ceph Version	14.2.11-183.el8cp

Comment 7 Humble Chirammal 2021-07-26 10:38:39 UTC
While we collect the MG (must-gather), CCing the CephFS team here too, as I am not sure whether there are any significant changes in Ceph core in this area.

14.2.11-184.el8cp

Comment 8 Humble Chirammal 2021-07-29 14:14:58 UTC
(In reply to Humble Chirammal from comment #7)
> While we collect the MG (must-gather), CCing the CephFS team here too, as I am
> not sure whether there are any significant changes in Ceph core in this area.
> 
> 14.2.11-184.el8cp

Below are the Ceph versions mentioned in the doc for the 4.8 and 4.7 tests:

4.8 version : 14.2.11-184.el8cp

4.7 version:  14.2.11-147.el8cp

Comment 13 Ben England 2021-09-07 12:30:54 UTC
The numbers in this report show a regression for sure, but the difference between the throughput and IOPS numbers seems a bit strange. For 4-KiB files (indeed for any file smaller than 1 MiB), IOPS should equal files/sec, and throughput (MB/s) should be files/sec x file size in KiB / 1000. In particular, some of the test1 results do not fit the above formula. I commented on that in the doc. What happened? Let's find out. The raw log data is linked from the perf doc; perhaps the answer is in there.
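
For reference, a minimal sketch of that consistency check (illustrative numbers only, not taken from the report):

    # Expected relationship for files smaller than 1 MiB:
    #   IOPS == files/sec
    #   throughput (MB/s) == files/sec * file size (KiB) / 1000
    file_size_kib = 4
    files_per_sec = 2500                  # hypothetical smallfile result

    expected_iops = files_per_sec         # one I/O per 4 KiB file
    expected_mb_per_sec = files_per_sec * file_size_kib / 1000

    print(f"expected IOPS      : {expected_iops}")             # 2500
    print(f"expected throughput: {expected_mb_per_sec} MB/s")  # 10.0 MB/s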

Avi, +1. If there are 3 samples, then as long as the %deviation is low (i.e. under 10%), I don't think it's noise in the measurement. What is the %deviation in these measurements?
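
If it helps, a quick way to compute the %deviation over the 3 samples (the files/sec values below are hypothetical; plug in the actual per-run numbers):

    from statistics import mean, stdev

    samples = [2400.0, 2500.0, 2450.0]           # hypothetical files/sec from the 3 runs
    pct_deviation = 100 * stdev(samples) / mean(samples)
    print(f"%deviation = {pct_deviation:.1f}%")  # under 10% => not measurement noise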

Was cache dropping used?

Remember that smallfile is not using O_DIRECT I/O, unlike fio.  Unless you request fsync: y, it does not flush dirty pages.   Also, smallfile has no notion of a "prefill" where it preallocates the space.  So it's a very different workload.
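
To make the buffered-I/O point concrete, a small sketch (the mount path is hypothetical): without O_DIRECT, a 4 KiB write completes as soon as the data is in the page cache; only an explicit fsync (which is what "fsync: y" adds) forces it to stable storage.

    import os

    data = b"x" * 4096
    with open("/mnt/cephfs/smallfile_demo", "wb") as f:   # hypothetical CephFS mount path
        f.write(data)         # returns once the data is in the page cache
        f.flush()
        os.fsync(f.fileno())  # the extra step smallfile performs only with fsync: y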

Smallfile tests don't generate readdirs unless you specifically request that operation.

Comment 15 Shekhar Berry 2021-09-13 13:39:15 UTC
Can you change the target_size_ratio of the CephFS pool and rerun the smallfile test for CephFS? This will tell Ceph that it should expect most data on the CephFS pool, and it will align the PGs accordingly.

Here's how you set it: 
ceph osd pool set ocs-storagecluster-cephfilesystem-data0 target_size_ratio 0.95

and change the RBD pool to 0.05
ceph osd pool set ocs-storagecluster-cephblockpool target_size_ratio 0.05

Wait for all the PGs to balance before running the test (pg_num and pgp_num should match in the pool description; check with ceph osd pool ls detail).
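
A rough way to script that wait, assuming the pool name above and the textual "pg_num N pgp_num M" fields printed by ceph osd pool ls detail (adjust to your environment):

    import re
    import subprocess
    import time

    def pgs_balanced(pool):
        out = subprocess.run(["ceph", "osd", "pool", "ls", "detail"],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if f"'{pool}'" in line:
                m = re.search(r"pg_num (\d+) pgp_num (\d+)", line)
                return bool(m) and m.group(1) == m.group(2)
        return False

    while not pgs_balanced("ocs-storagecluster-cephfilesystem-data0"):
        time.sleep(30)   # re-check until pg_num == pgp_num, then start the test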

Comment 18 Yaniv Kaul 2021-10-11 06:11:06 UTC
Avi/Yuli, how are the results of 4.9 compared to 4.8?

Comment 19 Mudit Agarwal 2021-10-18 07:56:19 UTC
Actively being looked at by the Ceph folks; changing the component.

Comment 20 Yuli Persky 2021-10-19 06:42:06 UTC
@Yaniv, 

The AWS Performance report ( 4.9 vs 4.8) is available here: 

https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit

Comment 21 Mudit Agarwal 2021-11-02 14:48:35 UTC
Not a 4.9 regression, still being discussed.
Moving it out based on the offline discussion with QE.

Comment 25 Patrick Donnelly 2022-02-27 16:48:06 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=2015520#c29

Defer to Ben/Venky. (I'm on paternity leave.)

Comment 26 Mudit Agarwal 2022-03-11 02:37:06 UTC
No decision yet on this, not a 4.10 blocker. Moving it out.
Setting NI on Venky based on Patrick's comment.

Comment 27 Greg Farnum 2022-03-14 14:36:44 UTC
Closing as dupe based on https://bugzilla.redhat.com/show_bug.cgi?id=2015520#c29 and the follow-on discussions there and here.

*** This bug has been marked as a duplicate of bug 2015520 ***

Comment 28 Venky Shankar 2022-03-16 13:20:03 UTC
Clearing my NI.