Description of problem (please be as detailed as possible and provide log snippets):
There is a performance degradation for small files (CephFS interface, 4 KB file size) on the AWS platform: IO, throughput, and latency are about 90% worse than in 4.7. The full comparison report is available here (small-files results, page 18):
https://docs.google.com/document/d/1-lOb4szqLM4LoWnMr_JCp9zurBqpjeva5BUEH-yer4s/edit?ts=60f62010#

Version of all relevant components (if applicable):
HW Platform: vsphere
Number of OCS nodes: 3
Number of total OSDs: 3
OSD Size (TiB): 1.46
Total available storage (GiB): 4,467
OCP Version: 4.8.0-0.nightly-2021-07-16-010020
OCS Version: 4.8.0-450.ci
Ceph Version: 14.2.11-184.el8cp

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Lower throughput and higher latency during IO.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
In 4.7 the throughput and IO were higher and latency was lower, as can be seen in the performance report:
https://docs.google.com/document/d/1-lOb4szqLM4LoWnMr_JCp9zurBqpjeva5BUEH-yer4s/edit?ts=60f62010#

Steps to Reproduce:
0. Deploy an AWS cluster with 2 TB OSD size.
1. Run tests/e2e/performance/test_small_file_workload.py
2. Compare the IO, throughput, and latency results for CephFS 4 KB files to the 4.7 results.

Actual results:
IO and throughput are about 90% worse than in 4.7 on the same platform (AWS).

Expected results:
The results should be the same as or better than in 4.7.

Additional info:
The data is also available here:
http://ocsperf.ceph.redhat.com:8080/index.php?version1=5&build1=10&platform1=1&az_topology1=1&test_name[]=2&test_name[]=3&version2=&build2=&version3=&build3=&submit=Choose+options
The test console log is available here: 10.70.39.233:/ypersky_report_logs/48/aws/
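As a side note on the numbers above, here is a minimal sketch of the relative-degradation calculation implied by "about 90% worse", assuming it means the percentage change against the 4.7 baseline (the sample values below are made up for illustration, not taken from the report):

# Hypothetical example: throughput drops from 100 MB/s (4.7) to 10 MB/s (4.8).
baseline_47 = 100.0   # 4.7 throughput in MB/s (made-up value)
measured_48 = 10.0    # 4.8 throughput in MB/s (made-up value)

degradation_pct = (baseline_47 - measured_48) / baseline_47 * 100
print(f"degradation vs 4.7: {degradation_pct:.0f}%")  # -> 90%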
Please note that the platform is AWS, not VMware (as appears by mistake in the description).
Full version list:
HW Platform: AWS
Number of OCS nodes: 3
Number of total OSDs: 3
OSD Size (TiB): 2.00
Total available storage (GiB): 6,140
OCP Version: 4.8.0-0.nightly-2021-07-04-112043
OCS Version: 4.8.0-444.ci
Ceph Version: 14.2.11-183.el8cp
While we collect the MG (must-gather), CCing the CephFS team here too, as I am not aware of any significant changes in Ceph core in this area.

14.2.11-184.el8cp
(In reply to Humble Chirammal from comment #7)
> While we collect the MG, Ccing CephFS team here too as I am not sure any
> significant changes in Ceph Core on this area.
>
> 14.2.11-184.el8cp

Below are the Ceph versions mentioned in the doc for the 4.8 and 4.7 tests:
4.8 version: 14.2.11-184.el8cp
4.7 version: 14.2.11-147.el8cp
The numbers in this report show a regression for sure, but the relationship between the throughput and IOPS numbers seems a bit strange. For 4 KiB files (indeed for any file smaller than 1 MiB), IOPS should equal files/sec, and throughput (MB/s) should be files/sec x file size in KiB / 1000. In particular, some of the test1 results do not fit the above formula. I commented on that in the doc. What happened? Let's find out. The raw log data is linked from the perf doc; perhaps the answer is in there.

Avi, +1. If there are 3 samples, then as long as the %deviation is low (i.e. under 10%) I don't think it's noise in the measurement. What is the %deviation in these measurements? Was cache dropping used?

Remember that smallfile does not use O_DIRECT I/O, unlike fio. Unless you request fsync: y, it does not flush dirty pages. Also, smallfile has no notion of a "prefill" where it preallocates the space, so it's a very different workload. Smallfile tests don't generate readdirs unless you specifically request that operation.
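For reference, a minimal sketch of the sanity check described above, i.e. verifying that a reported IOPS/throughput pair is internally consistent (the sample numbers are hypothetical, not taken from the report):

# For files smaller than 1 MiB, IOPS should equal files/sec and the
# throughput should follow directly from files/sec and the file size.

def expected_throughput_mbps(files_per_sec: float, file_size_kib: float) -> float:
    """Throughput (MB/s) = files/sec * file size in KiB / 1000."""
    return files_per_sec * file_size_kib / 1000.0

# Hypothetical reported numbers for a 4 KiB test:
reported_files_per_sec = 2500.0
reported_mbps = 10.0

expected = expected_throughput_mbps(reported_files_per_sec, 4.0)
# If reported and expected throughput disagree by more than a few percent,
# the reported IOPS/throughput pair does not fit the formula above.
print(f"expected ~{expected:.2f} MB/s, reported {reported_mbps:.2f} MB/s")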
Can you change the target_size_ratio of the CephFS pool and rerun the smallfile test for CephFS? This will tell Ceph that it should expect most of the data on the CephFS pool, and it will align the PGs accordingly.

Here's how you set it:
ceph osd pool set ocs-storagecluster-cephfilesystem-data0 target_size_ratio 0.95

and change the RBD pool to 0.05:
ceph osd pool set ocs-storagecluster-cephblockpool target_size_ratio 0.05

Wait for all the PGs to balance before running the test (pg_num and pgp_num should match in the pool description; see ceph osd pool ls detail).
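A minimal sketch of one way to wait for the PGs to settle before starting the run, assuming the ceph CLI is reachable from where the script runs (the helper name, polling interval, and text-output parsing are illustrative and may need adjusting per Ceph release):

import re
import subprocess
import time

def pools_balanced() -> bool:
    """Return True when every pool reports pg_num == pgp_num in the
    text output of 'ceph osd pool ls detail' (parsed with a regex)."""
    out = subprocess.check_output(["ceph", "osd", "pool", "ls", "detail"], text=True)
    pairs = re.findall(r"pg_num (\d+) pgp_num (\d+)", out)
    return bool(pairs) and all(pg == pgp for pg, pgp in pairs)

# Poll until the pools settle, then start the smallfile run.
while not pools_balanced():
    time.sleep(30)
print("pg_num == pgp_num for all pools; PGs have settled")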
Avi/Yuli, how do the 4.9 results compare to 4.8?
This is actively being looked at by the Ceph folks; changing the component.
@Yaniv, the AWS performance report (4.9 vs 4.8) is available here:
https://docs.google.com/document/d/1vyufd55iDyvKeYOwoXwKSsNoRK2VR41QNTuH-iERR8s/edit
Not a 4.9 regression; still being discussed. Moving it out based on the offline discussion with QE.
https://bugzilla.redhat.com/show_bug.cgi?id=2015520#c29 Defer to Ben/Venky. (I'm on paternity leave.)
No decision yet on this, not a 4.10 blocker. Moving it out. Setting NI on Venky based on Patrick's comment.
Closing as dupe based on https://bugzilla.redhat.com/show_bug.cgi?id=2015520#c29 and the follow-on discussions there and here.

*** This bug has been marked as a duplicate of bug 2015520 ***
Clearing my NI.