Created attachment 517894 [details]
Iozone + fsync analysis comparing rhel 6.0, 6.1, and a 2.6.32-171.bz725816 kernels

Description of problem:

Additional testing, prompted by the regressions noted in bz 725816, has shown that iozone with small files (up to 4MB) performs poorly when barriers are enabled, even against enterprise storage. Disabling barriers makes it all go away. This was tested with ext4, but I'm fairly sure I have data that shows it occurs on multiple file systems. The deadline scheduler was used throughout.

The iozone matrix operations which perform the worst are re-reads followed by random reads. Performance is upwards of 95+% off. I tried the fix in bz 725816 but saw no difference.

The configuration used was perf7, a relatively old AMD server with an FC-HBA-presented LUN from an HP MSA2000. dm multipath is being used, but quite loosely.

# multipath -ll
mpathe (3600c0ff000d52efb3bbf094901000000) dm-7 HP,MSA2212fc
size=140G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 10:0:4:5 sdg 8:96  active ready running
  `- 10:0:5:5 sdn 8:208 active ready running
...

The file system is rebuilt for each run.

Attaching the iozone analysis for 6.0, 6.1, and the kernel from bz 725816. One thing seems clear: this problem probably has little or nothing to do with the barrier changes in 6.1 or the newer changes for 6.2, since the attachment shows issues with 6.0 as well.

Barry

Version-Release number of selected component (if applicable):
2.6.32-71
2.6.32-131.0.15
2.6.32-171.bz725816

How reproducible:
every time

Steps to Reproduce:
1. run iozone against the file system with/without barriers enabled, with the option to include fsync at the end.
2.
3.

Actual results:
see attachment

Expected results:

Additional info:
Just in case the problem was related to tons of metadata activity, I tried a run with noatime,nodiratime but saw no big change with the reads.
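For reference, a minimal sketch of the reproduction flow described above (the mount point, the 2g maximum size, and tuning one path's scheduler are illustrative assumptions, not the exact perf7 invocation):

mkfs.ext4 /dev/mapper/mpathe
mount -o barrier=1 /dev/mapper/mpathe /mnt/test        # repeat with -o barrier=0 for the comparison run
echo deadline > /sys/block/sdg/queue/scheduler         # deadline on the underlying paths (repeat for sdn)
# -az runs the automatic matrix; -e includes flush/fsync in the timing
iozone -az -e -f /mnt/test/iozone-ext4 -g 2g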
(In reply to comment #0)
> This was tested with ext4, but I'm fairly sure I have data that shows
> it occurs on multiple file systems.

It'd be great if you could provide that data, too.
(In reply to comment #0)
> Steps to Reproduce:
> 1. run iozone to file system with/without barriers enabled with option to
>    include fsync at end.

I guess that means -e?

Which version of iozone do you use? For iozone3_397, that option doesn't even come into play for read tests (as I'd expect).

Perhaps the storage is flushing cache internally after fsync returns, slowing down the subsequent reads? This doesn't make much sense to me at all...
I'm assuming these are the overall numbers being compared (chopped off after the read tests for easier reading):

I/O   FILE & REC    ALL    INIT    RE             RE      RANDOM  RANDOM  BACKWD
BARR  SIZES (KB)    IOS    WRITE   WRITE  READ    READ    READ    WRITE   READ
=====----------------------------------------------------------------------------
OFF   ALL           328    61      68     524     1873    1695    57      504
ON    ALL           315    64      70     576     1101    1024    60      556
      ALL           .      +5.2    .      +10.0   -41.2   -39.5   .       +10.3

That to me looks like the read from disk is 10% faster with barriers on, but cache hit workloads (reread and random read) are significantly slower with barriers on. I say "cache hit workloads" because I see no reason for sequential re-reads to be 3x faster than the initial read unless they are coming out of cache.

Hence I can't see how barriers affect these results at all. Changes in cache hit rates are usually caused by changes in VM algorithms, not barriers. Indeed - why is the initial write rate only ~60MB/s, yet the read rates are almost an order of magnitude faster @ 575MB/s? Is the initial read also partially hitting the cache?

These numbers from the -131 run (once again trimmed):

FILE        ALL    INIT    RE             RE      RANDOM  RANDOM  BACKWD
SIZE (KB)   IOS    WRITE   WRITE  READ    READ    READ    WRITE   READ
=====------------------------------------------------------------------
4           -53.8  -17.6   -21.4  -33.3   -98.3   -97.3   -21.8   -27.5
8           -51.9  -19.6   -23.6  -30.0   -97.5   -95.9   -23.2   -31.7
16          -47.8  -23.6   -6.5   -28.9   -96.0   -94.7   -11.6   -18.6
32          -40.9  .       -10.9  -5.9    -93.3   -91.8   -5.1    -45.1

tend to imply that the re-read speed is 20-30x slower than in the barrier-off case. If we are doing 1500MB/s with barriers off, then we're doing 50-60 MB/s with barriers on, which, according to the initial write rate, is about disk speed. IOWs, the difference between a 100% cache hit workload and a 100% cache miss workload.

How big is the data set being tested? Does it fit in memory, or does a significant portion of it fit in memory?

I can see that if the file set size is roughly the same as or smaller than memory, then adding barriers might change the order of IO completion and hence potentially the order of pages being reclaimed from the page cache off the LRU. That would then affect the subsequent read rates, as they depend on what stayed resident in cache and what didn't. And that then affects the subsequent re-read, because some pages would now be on the active LRU rather than the inactive LRU, so reclaim patterns would change again and hence more of the file might need to be re-read from disk...
Eric,

I'm having trouble locating the multi file system barrier/nobarrier testing. I wasn't the one to run it, so I may have to rerun it to verify. As for the raw iozone data, I'll post that in a tarball ASAP. The version of iozone is 3.327 - obviously not the newest, but what we have been using for a while. The latest code does do an fsync even after a set of reads.

Dave,

As for file sizes ... the analysis shows files up to 2GB for the entire test, which represents approximately half of memory. It is indeed an in-cache test, but it forces the fsync to measure the time to commit to stable storage. The intent of re-read is to measure cached behaviour and hence should be near memory speed, but the barrier run does considerably less.

Reads were better with barriers on with the -173-bz kernel, but it's not totally clear whether that's due to the patch from the bz or the recent barrier code updates. I ran the release kernel comparisons mostly to see whether the re-read/random read problem was a regression or has always been in at least RHEL6.

Barry
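A rough way to sanity-check the "half of memory" point and the LRU question during a run (nothing here is specific to perf7):

free -m                                              # total vs cached memory
grep -E 'MemTotal|^Cached|Active\(file\)|Inactive\(file\)' /proc/meminfo
# MemTotal should comfortably exceed the ~2GB file set if the re-read phase is
# meant to be a pure cache hit; Active(file)/Inactive(file) show how much file
# data sits on each LRU list before and after the barrier run.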
Created attachment 518031 [details]
tarball of all 6 runs, includes full raw iozone data plus more info about the system
So, I retested with a smaller matrix, and got similar results:

OPTIONS="-az -f /RHTSspareLUN1/iozone-ext4 -n 128k -g 1024k -y 1k -q 128k -e"

read & reread were way down with barriers. But we're not hitting the disk for reads in either case:

Total (nobarrier-trace):
 Reads Queued:           4,       16KiB   Writes Queued:      21,399,   85,596KiB
 Read Dispatches:        0,        0KiB   Write Dispatches:        0,        0KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        4,       16KiB   Writes Completed:   21,399,   85,596KiB
 Read Merges:            0,        0KiB   Write Merges:            0,        0KiB
 IO unplugs:         1,182                Timer unplugs:           0

Total (barrier-trace):
 Reads Queued:           4,       16KiB   Writes Queued:      22,167,   85,596KiB
 Read Dispatches:        0,        0KiB   Write Dispatches:        0,        0KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        4,       16KiB   Writes Completed:   22,167,   85,596KiB
 Read Merges:            0,        0KiB   Write Merges:            0,        0KiB
 IO unplugs:         1,174                Timer unplugs:           0

I also retested the same options as above with no "-e", and the slowdown on read/reread goes away; in fact it speeds up.

So, um ....?
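The totals above are the per-device summary blkparse prints at the end of a trace; a rough sketch of how they can be captured (the device name is illustrative, use whatever backs the test mount):

blktrace -d /dev/mapper/mpathe -o barrier-trace &   # trace while the iozone run is in flight
iozone $OPTIONS
kill -INT %1                                        # stop blktrace
blkparse -i barrier-trace | tail -30                # the "Total" summary is at the end of the output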
If you're testing reads, then it makes zero sense to fsync. What's the rationale for that?
Does iozone issue fsync() calls on files it reads during the read phase?
It does, oddly enough (at least the one bmarson runs does):

5860  open("/RHTSspareLUN1/iozone-ext4", O_RDONLY) = 3
5860  fsync(3)                          = 0
5860  read(3, " \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
5860  lseek(3, 0, SEEK_SET)             = 0
5860  gettimeofday({1313176642, 871297}, NULL) = 0
5860  read(3, " \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
...
5860  fsync(3)                          = 0
5860  gettimeofday({1313176642, 872652}, NULL) = 0
5860  fsync(3)                          = 0
5860  close(3)                          = 0
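For anyone who wants to reproduce the trace, something along these lines should show the same open/fsync/read pattern (the iozone options here mirror the smaller matrix above and are just one reasonable choice):

strace -f -o iozone.strace -e trace=open,read,lseek,fsync,close \
    iozone -az -e -f /RHTSspareLUN1/iozone-ext4 -n 128k -g 1024k -y 1k -q 128k
grep -A3 'O_RDONLY' iozone.strace     # read-phase opens, each followed by fsync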
iozone does an fsync on flush to make sure that all metadata is synchronized ... at least that's what the authors told me years back when I asked them.

Barry
The pattern makes no sense when using noatime (should be no dirty data?) unless the file it is reading is still dirty from the initial write and still cached?
Note that fsync will always flush the cache, as that is required for proper O_DIRECT write semantics. On extN I also wouldn't be surprised if it flushes out metadata that isn't really required.
Well, the man page claims it will flush out metadata. Note we are not testing with noatime; that was exercised only to make sure it wasn't related to the performance problem (data not posted). Remember, iozone gets run on many different platforms and in many different modes ...

I wonder why reads of a cached file are so much slower than the rereads. The data is already cached from the rewrite and it was fsynced. Is this potentially the cost of the VM having to flip the page cache pages to read-only? I've seen this before in different scenarios, configs, and even OSes.

Barry
If you are not running with noatime, fsync will flush out the actual inode updates, which now actually hit the platter.
And the rereads are faster because the default relatime mode won't update the atime on the second read, unless the ctime changed.
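A quick way to watch the relatime vs noatime effect on the test file (the path is just the one from the trace above; timestamps are illustrative):

stat -c 'atime=%x ctime=%z' /RHTSspareLUN1/iozone-ext4
cat /RHTSspareLUN1/iozone-ext4 > /dev/null      # first read may update atime under relatime
stat -c 'atime=%x ctime=%z' /RHTSspareLUN1/iozone-ext4
cat /RHTSspareLUN1/iozone-ext4 > /dev/null      # second read should leave atime alone
stat -c 'atime=%x ctime=%z' /RHTSspareLUN1/iozone-ext4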
Mounting with noatime, reads & rereads are both down about 50% in my testing, when barriers are on. Without noatime, read is down 96%, reread down about 50%. (this is on a single slow spindle, FWIW)
(In reply to comment #10)
> It does, oddly enough (at least the one bmarson runs does)
>
> 5860  open("/RHTSspareLUN1/iozone-ext4", O_RDONLY) = 3
> 5860  fsync(3)                          = 0
> 5860  read(3, " \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
> 5860  lseek(3, 0, SEEK_SET)             = 0
> 5860  gettimeofday({1313176642, 871297}, NULL) = 0
> 5860  read(3, " \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
> ...
> 5860  fsync(3)                          = 0
> 5860  gettimeofday({1313176642, 872652}, NULL) = 0
> 5860  fsync(3)                          = 0
> 5860  close(3)                          = 0

So for a small file, say 4k, there are 3 fsync() calls for one 4k read? If fsync() is doing any extra work even when the inode is clean, then that's going to make a huge difference to performance.... Does anyone really do this in a performance critical loop?
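One rough way to probe the "extra work on a clean inode" question directly (assuming xfsprogs is installed; xfs_io's generic commands work on ext4 too) is to open the already-written file read-only and time nothing but the fsync:

strace -T -e trace=fsync xfs_io -r -c fsync /RHTSspareLUN1/iozone-ext4
# -r opens the file read-only, -c fsync issues a single fsync(); strace -T
# prints the time spent in the syscall, so any cache-flush cost shows up even
# though the file data is clean. Compare barrier=1 vs barrier=0 mounts.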
Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
It seems that we came to an understanding of this issue and can close out this BZ. It's probably worth summarizing the above in a more widely viewed place for users to get at.
Closing this out - a lot of code has changed since this was originally opened. If this is still an issue, let's reopen it with refreshed detail. Thanks!