Bug 2064135
| Summary: | VMware LSO - degradation in CephFS small file workload File Per Second results in 4.10 compared to 4.9 | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Yuli Persky <ypersky> |
| Component: | ceph | Assignee: | Venky Shankar <vshankar> |
| ceph sub component: | CephFS | QA Contact: | Elad <ebenahar> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | alayani, bniver, hyelloji, jlayton, jopinto, kramdoss, madam, mmuench, muagarwa, ocs-bugs, odf-bz-bot, pnataraj, rar |
| Version: | 4.10 | Keywords: | Automation, Performance, Regression |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-29 02:26:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Yuli Persky, 2022-03-15 07:03:48 UTC)
We have nearly-identical bzs reported for 4.7->4.8 and 4.8->4.9: https://bugzilla.redhat.com/show_bug.cgi?id=1984590, https://bugzilla.redhat.com/show_bug.cgi?id=2015520#c29

Is the claim really that some block sizes are now performing at <1% of the starting performance? If not, I think there's something fundamentally broken with these tests and the reports they're generating. This is backed up by Ben England's comments at https://bugzilla.redhat.com/show_bug.cgi?id=2015520#c30. Now, the linked bz is discussing some performance changes that may look real and come down to configuration changes between OCS/ODF releases, but they're nothing like the 97% number quoted here. Can we rely on the discussion the performance engineers are having there instead of assigning new tickets to the CephFS team?

Not a 4.10 blocker; it needs more investigation, as mentioned by Greg in the previous comment.

@Greg Farnum,

Can you please specify why you think that the small files test run in this BZ is broken? What tool/load are you using to measure IO performance that makes you think the small files test we run is reporting false results?

Also, please note that the bug Ben England commented on (https://bugzilla.redhat.com/show_bug.cgi?id=2015520#c30) was opened for results of a DIFFERENT load - the FIO benchmark - so I think it is not relevant to this BZ. Please correct me if I'm mistaken.

+1 to comment 6. Also, how does upstream Ceph CI measure CephFS (or RBD) performance at high file counts with small file sizes?

(In reply to Yuli Persky from comment #6)
> @Greg Farnum,

Yuli - if you ask a question, please use the NEEDINFO feature of Bugzilla.

> Can you please specify why do you think that the small files test run in
> this BZ is broken?
> What tool/load are you using to measure IO performance which makes you think
> that the small files we run is reporting false results?
>
> Also, please note that the bug Ben England commented on
> https://bugzilla.redhat.com/show_bug.cgi?id=2015520#c30 was open for
> results of DIFFERENT load - which is the FIO benchmark, therefore I think
> this is not relevant to this BZ.
> Please correct me if I'm mistaken.

What kernel versions are in these OCP releases? The difference in create/delete/append performance may be explained by the async ops defaulting to on (by mistake, IIRC?) and then being turned off in the kernel client. You should needinfo Jeff Layton to check that once you have the kernel versions.

However, I don't think that can apply to the read numbers, unless those are indirectly measuring create performance as well.

@Greg Farnum,

4.18.0-305.40.2.el8_4.x86_64 - this is the kernel version that was on the 4.10 cluster.

Jeff, which kernel versions were the ones that changed the async ops defaults?

The patch that turned on async dirops by default went into kernel-4.18.0-357.el8, and was reverted in kernel-4.18.0-372.1.1.el8. It was never backported to 8.4 kernels (-305.el8 series).

(In reply to Yuli Persky from comment #11)
> @Greg Farnum,
>
> 4.18.0-305.40.2.el8_4.x86_64 - this is the Kernel version that was on the
> 4.10 cluster.

What kernel was on the 4.9 cluster?
@Jeff Layton, for the 4.9 cluster the versions were:

kernel 4.18.0-305.40.2.el8_4 → 4.18.0-305.45.1.el8_4
kernel-core 4.18.0-305.40.2.el8_4 → 4.18.0-305.45.1.el8_4
kernel-modules 4.18.0-305.40.2.el8_4 → 4.18.0-305.45.1.el8_4
kernel-modules-extra 4.18.0-305.40.2.el8_4 → 4.18.0-305.45.1.el8_4

I took it from here: https://releases-rhcos-art.cloud.privileged.psi.redhat.com/?release=49.84.202111231504-0&stream=releases%2Frhcos-4.9#49.84.202111231504-0
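For anyone triaging a similar report, the comments above pin down the relevant kernel window precisely, so a quick version check is enough to rule the async dirops change in or out. Below is a minimal sketch, not an existing tool: the NVR parsing is a simplifying assumption about the `4.18.0-NNN...el8` naming scheme, and `rpm`'s own label comparison would be more robust in general.

```python
import re

# Window in which the CephFS kernel client enabled async dirops by default,
# per the comments above: introduced in kernel-4.18.0-357.el8, reverted in
# kernel-4.18.0-372.1.1.el8, and never backported to the -305.el8 (8.4) series.

def release_tuple(kernel: str) -> tuple:
    """Pull the dist-build part out of e.g. '4.18.0-305.40.2.el8_4.x86_64'
    and return it as a tuple of ints, e.g. (305, 40, 2)."""
    m = re.match(r"4\.18\.0-([\d.]+)\.el8", kernel)
    if not m:
        raise ValueError(f"unexpected kernel release format: {kernel}")
    return tuple(int(p) for p in m.group(1).split("."))

ASYNC_DIROPS_ON = release_tuple("4.18.0-357.el8")            # (357,)
ASYNC_DIROPS_REVERTED = release_tuple("4.18.0-372.1.1.el8")  # (372, 1, 1)

def async_dirops_default_on(kernel: str) -> bool:
    """True if this kernel build falls inside the async-dirops-by-default window."""
    build = release_tuple(kernel)
    return ASYNC_DIROPS_ON <= build < ASYNC_DIROPS_REVERTED

# The kernels reported in this bug are all in the -305.el8_4 series,
# i.e. below the window, so this change cannot explain the 4.9 -> 4.10 gap.
for k in ("4.18.0-305.40.2.el8_4.x86_64", "4.18.0-305.45.1.el8_4"):
    print(k, async_dirops_default_on(k))  # both print False
```

Both kernels quoted in this bug land below the window, which matches the conclusion above that the async dirops default change does not apply to these clusters.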