| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | iozone + fsync with IO barriers on performs poorly with small files (up to 4MB), especially for read operations | | |
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Barry Marson <bmarson> |
| Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> |
| Status: | CLOSED NOTABUG | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.2 | CC: | bstevens, dchinner, esandeen, jeder, jmoyer, lczerner, perfbz, rwheeler, SCHAKRAB |
| Target Milestone: | rc | | |
| Target Release: | 6.2 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-06-06 15:38:03 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Description: Barry Marson, 2011-08-11 21:06:36 UTC
(In reply to comment #0)
> This was tested with ext4, but I'm fairly sure I have data that shows
> it occurs on multiple file systems.

It'd be great if you could provide that data, too.

(In reply to comment #0)
> Steps to Reproduce:
> 1. run iozone to file system with/without barriers enabled with option to
> include fsync at end.

I guess that means -e? Which version of iozone do you use? For iozone3_397, that option doesn't even come into play for read tests (as I'd expect).

Perhaps the storage is flushing its cache internally after fsync returns, slowing down the subsequent reads? This doesn't make much sense to me at all...

I'm assuming these are the overall numbers being compared (chopped off after the read tests for easier reading):
| I/O BARR | FILE & REC SIZES (KB) | ALL IOS | INIT WRITE | RE-WRITE | READ | RE-READ | RANDOM READ | RANDOM WRITE | BACKWD READ |
|---|---|---|---|---|---|---|---|---|---|
| OFF | ALL | 328 | 61 | 68 | 524 | 1873 | 1695 | 57 | 504 |
| ON | ALL | 315 | 64 | 70 | 576 | 1101 | 1024 | 60 | 556 |
| % change | ALL | . | +5.2 | . | +10.0 | -41.2 | -39.5 | . | +10.3 |
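As a sanity check on what the percentage row means (it is not labelled in the raw output, so this is an inference from the numbers): taking the RE-READ column, (1101 - 1873) / 1873 × 100 ≈ -41.2%, which matches the -41.2 shown, so the row is the barrier-on result relative to barrier-off.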
That to me looks like the read from disk is 10% faster with barriers on, but cache hit workloads (reread and random read) are significantly slower with barriers on.
I say "cache hit workloads" because I see no reason for sequential re-reads being 3x faster than the initial read unless it is coming out of cache. Hence I can't see how barriers affect these results at all. Changes in cache hit ranges are usually caused by changes in VM alogrithms, not barriers.
Indeed - why is the initial write rate only ~60MB/s, yet the read rates almost an order of magnitude faster @ 575MB/s? Is the initial read also partially hitting the cache?
These numbers, from the -131 kernel run (once again trimmed), show the percentage change with barriers on relative to barriers off, per file size:
| FILE SIZE (KB) | ALL IOS | INIT WRITE | RE-WRITE | READ | RE-READ | RANDOM READ | RANDOM WRITE | BACKWD READ |
|---|---|---|---|---|---|---|---|---|
| 4 | -53.8 | -17.6 | -21.4 | -33.3 | -98.3 | -97.3 | -21.8 | -27.5 |
| 8 | -51.9 | -19.6 | -23.6 | -30.0 | -97.5 | -95.9 | -23.2 | -31.7 |
| 16 | -47.8 | -23.6 | -6.5 | -28.9 | -96.0 | -94.7 | -11.6 | -18.6 |
| 32 | -40.9 | . | -10.9 | -5.9 | -93.3 | -91.8 | -5.1 | -45.1 |
These tend to imply that the re-read speed is 20-30x slower than in the barrier-off case: if we are doing ~1500MB/s with barriers off, then we're doing 50-60MB/s with barriers on, which, judging by the initial write rate, is about disk speed. IOWs, it's the difference between a 100% cache hit workload and a 100% cache miss workload.
How big is the data set being tested? Does it fit in memory, or does a significant portion of it fit in memory? I can see that if the file set size is roughly the same as or smaller than memory, then adding barriers might change the order of IO completion and hence potentially the order in which pages are reclaimed from the page cache off the LRU. That would then affect the subsequent read rates, as those depend on what stayed resident in cache and what didn't. And that in turn affects the subsequent re-read, because some pages would now be on the active LRU rather than the inactive LRU, so reclaim patterns would change again and more of the file might need to be re-read from disk...
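One way to check that hypothesis is to measure how much of the test file is actually resident in the page cache between phases. A minimal sketch using mincore() (illustrative only; not part of iozone or the original test harness, and the file path is simply whatever file iozone wrote):

```c
/* Sketch: report how many pages of a file are resident in the page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    long psz = sysconf(_SC_PAGESIZE);
    size_t pages = (st.st_size + psz - 1) / psz;

    /* Map the file without touching it, then ask the kernel which of its
     * pages are already present in the page cache. */
    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned char *vec = malloc(pages);
    if (!vec || mincore(map, st.st_size, vec) < 0) { perror("mincore"); return 1; }

    size_t resident = 0;
    for (size_t i = 0; i < pages; i++)
        resident += vec[i] & 1;

    printf("%zu of %zu pages resident (%.1f%%)\n",
           resident, pages, pages ? 100.0 * resident / pages : 0.0);

    free(vec);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```

Running this on the iozone file between the write and read phases, on both barrier and nobarrier mounts, would show whether barriers really change what stays resident.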
Eric, I'm having trouble locating the multi-file-system testing with barrier/nobarrier. I wasn't the one who ran it; I may have to rerun it to verify. As for the raw iozone data, I'll post that in a tarball ASAP. The version of iozone is 3.327, obviously not the newest, but what we have been using for a while. The latest code does do an fsync even after a set of reads.

Dave, as for file sizes: the analysis shows files up to 2GB for the entire test. This represents approximately half of memory. It is indeed an in-cache test, but it forces the fsync to measure the time to commit to stable storage. The intent of re-read is to measure cached behaviour and hence should be near memory speed, but the barrier run delivers less.

Reads were better with barriers on with the -173-bz kernel, but it's not totally clear whether that's due to the patch from the bz or even recent barrier code updates. I ran the release kernel comparisons mostly to see whether the re-read/random read issue was a regression or has always been present in at least RHEL 6.

Barry

Created attachment 518031 [details]
tarball of all 6 runs, includes full raw iozone data plus more info about the system
So, I retested with a smaller matrix, and got similar results:

OPTIONS="-az -f /RHTSspareLUN1/iozone-ext4 -n 128k -g 1024k -y 1k -q 128k -e"

Read and re-read were way down with barriers. But we're not hitting the disk for reads in either case:

Total (nobarrier-trace):
Reads Queued:          4,       16KiB    Writes Queued:      21,399,   85,596KiB
Read Dispatches:       0,        0KiB    Write Dispatches:        0,        0KiB
Reads Requeued:        0                 Writes Requeued:         0
Reads Completed:       4,       16KiB    Writes Completed:   21,399,   85,596KiB
Read Merges:           0,        0KiB    Write Merges:            0,        0KiB
IO unplugs:        1,182                 Timer unplugs:           0

Total (barrier-trace):
Reads Queued:          4,       16KiB    Writes Queued:      22,167,   85,596KiB
Read Dispatches:       0,        0KiB    Write Dispatches:        0,        0KiB
Reads Requeued:        0                 Writes Requeued:         0
Reads Completed:       4,       16KiB    Writes Completed:   22,167,   85,596KiB
Read Merges:           0,        0KiB    Write Merges:            0,        0KiB
IO unplugs:        1,174                 Timer unplugs:           0

I also retested the same options above without "-e", and the slowdown on read/re-read goes away; in fact it speeds up. So, um ....?

If you're testing reads, then it makes zero sense to fsync. What's the rationale for that? Does iozone issue fsync() calls on files it reads during the read phase?

It does, oddly enough (at least the one bmarson runs does):
5860 open("/RHTSspareLUN1/iozone-ext4", O_RDONLY) = 3
5860 fsync(3) = 0
5860 read(3, " \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
5860 lseek(3, 0, SEEK_SET) = 0
5860 gettimeofday({1313176642, 871297}, NULL) = 0
5860 read(3, " \0\0\0\0\0\0\0\0\0\0\0\0\0\\0\0\0"..., 131072) = 131072
...
5860 fsync(3) = 0
5860 gettimeofday({1313176642, 872652}, NULL) = 0
5860 fsync(3) = 0
5860 close(3) = 0
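To isolate the cost of that pattern, here is a minimal standalone sketch (not iozone's code; the default path merely reuses the one from the trace above, and the 128KB buffer mirrors the traced read size) that times a cached sequential read with and without the leading fsync():

```c
/* Sketch: time a sequential read of an existing file, with and without a
 * leading fsync() on the read-only descriptor, mimicking the traced pattern. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

static double elapsed_ms(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_usec - a.tv_usec) / 1000.0;
}

static void timed_read(const char *path, int do_fsync)
{
    char buf[131072];
    struct timeval t0, t1;
    ssize_t n;

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }

    if (do_fsync)
        fsync(fd);              /* the same call iozone issues before reading */

    gettimeofday(&t0, NULL);
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;                       /* sequential read of the whole file */
    gettimeofday(&t1, NULL);

    printf("%s fsync: %.3f ms\n", do_fsync ? "with   " : "without",
           elapsed_ms(t0, t1));
    close(fd);
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/RHTSspareLUN1/iozone-ext4";
    timed_read(path, 0);
    timed_read(path, 1);
    return 0;
}
```

Comparing the two timings on barrier and nobarrier mounts would show directly whether the fsync() on a clean, read-only descriptor is where the time goes.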
iozone does an fsync on flush to make sure that all metadata is synchronized ... at least that's what the authors told me years back when I asked them.

Barry

The pattern makes no sense when using noatime (should be no dirty data?) unless the file it is reading is still dirty from the initial write and still cached? Note that fsync will always flush the cache, as that is required for proper O_DIRECT write semantics. On extN I also wouldn't be surprised if it flushes out metadata that isn't really required.

Well, the man page claims it will flush out metadata. Note we are not testing with noatime; that was exercised only to make sure it wasn't related to the performance problem (data not posted). Remember, iozone gets run on many different platforms and in many modes ...

I wonder why reads of a cached file are so much slower than the re-reads. The data is already cached from the rewrite and it was fsynced. Is this potentially the cost of the VM having to flip the page cache pages to read-only? I've seen this before in different scenarios, configs and even OSs.

Barry

If you are not running with noatime, fsync will flush out the actual inode updates, which now actually hit the platter. And the re-reads are faster because the default relatime mode won't flush the atime on the second update unless the ctime changed.

Mounting with noatime, reads & re-reads are both down about 50% in my testing when barriers are on. Without noatime, read is down 96%, re-read down about 50%. (This is on a single slow spindle, FWIW.)

(In reply to comment #10)
> It does, oddly enough (at least the one bmarson runs does)
>
> 5860 open("/RHTSspareLUN1/iozone-ext4", O_RDONLY) = 3
> 5860 fsync(3) = 0
> 5860 read(3, " \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
> 5860 lseek(3, 0, SEEK_SET) = 0
> 5860 gettimeofday({1313176642, 871297}, NULL) = 0
> 5860 read(3, " \0\0\0\0\0\0\0\0\0\0\0\0\0\\0\0\0"..., 131072) = 131072
> ...
> 5860 fsync(3) = 0
> 5860 gettimeofday({1313176642, 872652}, NULL) = 0
> 5860 fsync(3) = 0
> 5860 close(3) = 0

So for a small file, say 4k, there are 3 fsync() calls for one 4k read? If fsync() is doing any extra work even when the inode is clean, then that's going to make a huge difference to performance.... Does anyone really do this in a performance-critical loop?

Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

It seems that we came to an understanding of this issue and can close out this BZ. It is probably worth summarizing the above in a more widely viewed place for users to get at.

Closing this out - a lot of code has changed since this was originally opened. If this is still an issue, let's reopen it with refreshed detail. Thanks!