Bug 614605
| Summary: | [Intel 6.1 Bug] direct IO with dd seems broken compared to RHEL 5.4 | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Doug Nelson <doug.nelson> |
| Component: | coreutils | Assignee: | Ondrej Vasik <ovasik> |
| Status: | CLOSED ERRATA | QA Contact: | qe-baseos-daemons |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 6.0 | CC: | andi.kleen, anil.k.garg, azelinka, doug.nelson, dshaks, esandeen, jane.lv, jmoyer, jvillalo, kdudka, keve.a.gabbert, luming.yu, luyu, lwoodman, matthew.r.wilcox, meyering, mhusnain, rdoty, rpacheco, rwheeler |
| Target Milestone: | rc | ||
| Target Release: | 6.1 | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | coreutils-8.4-12.el6 | Doc Type: | Bug Fix |
| Doc Text: | Previously, when the dd utility used pipes, it read and wrote partial blocks. When a written block was shorter than the specified output block size, the "oflag=direct" flag was turned off, which resulted in degraded I/O performance. The workaround for this behaviour, which involves adding "iflag=fullblock", is now described in the info documentation. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-05-19 13:50:48 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 580566, 600438 | ||
| Attachments: | |||
Thanks for the report - marking this as a regression and I will try to find out what's wrong there. However, there were many changes in dd between the 5.97 and 8.4 coreutils releases, so it may take a while.

(In reply to comment #0)
> To narrow down the problem, I did this experiment with an EL6 Beta 2 base
> install with both the el6 dd command and then again with the el5.4 dd command.

Just to clarify, do you mean you ran the el5.4 dd binary on an el6 kernel and the problem did not present itself? Thanks.

(In reply to comment #5)
> (In reply to comment #0)
> > To narrow down the problem, I did this experiment with an EL6 Beta 2 base
> > install with both the el6 dd command and then again with the el5.4 dd command.
>
> Just to clarify, do you mean you ran the el5.4 dd binary on an el6 kernel and
> the problem did not present itself?
>
> Thanks.

Yes, that is exactly what I did.

So far I tried the quick test on ext4 with the RHEL-5 (compiled on RHEL-6) and RHEL-6 dd binaries on the RHEL-6 beta2 kernel. I saw no obvious difference in performance...

It would be really good to know what's different - a good start would be to strace both runs and attach the result here. Could you please do that, Doug? TIA. Another helpful thing would be a callgrind profiling analysis... but strace is a good place to start.

(In reply to comment #8)
> So far I tried the quick test on ext4 with the RHEL-5 (compiled on RHEL-6) and
> RHEL-6 dd binaries on the RHEL-6 beta2 kernel. I saw no obvious difference in
> performance...
>
> It would be really good to know what's different - a good start would be to
> strace both runs and attach the result here. Could you please do that, Doug?
> TIA.

I'll try to grab some straces today.

Created attachment 433269 [details]
straces for the output dd's for rhel5.4 and 6.0 beta 2. linux perf counters from high system time during el6 restore
I've included some strace output files for el5 and el6, and some Linux perf counter CPU-cycle and call-graph data from the high-system-time portion of the el6 database restore (when the memory is being freed up).
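For completeness, here is a sketch of how such traces can be captured for the output-side dd (the exact invocations used for the attached files are not recorded in this report; the trace file name below is illustrative):

dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | strace -tt -o /tmp/dd-el6-out.strace dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct

Running the same pipeline with the el5.4 dd binary and diffing the two traces is one way to spot the extra fcntl() calls mentioned in the workaround comment below.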
I believe that I've found a workaround for this problem. I changed bs=1024k to obs=1024k for the second dd in my example line, and all the fcntl lines are gone from the strace and the dd's are not chewing up all my memory.

Bad EL6 behavior
----------------
dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct

Good EL6 behavior
-----------------
dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct ibs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 obs=1024k oflag=direct

Direct I/O seems to be turned off after the open() if I set bs=1024k on the dd that is receiving data from the pipe and writing to the raw partition. It seems to work if I only use obs=1024k. I'm happy with this workaround; I'll leave it to you to decide if there is a bug here.

thanks, doug

I think the difference is caused by the
http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=5929322ccb1f9d27c1b07b746d37419d17a7cbf6
commit - if the write size is different from the output block size, the O_DIRECT flag is turned off.

Re: comment 18 - does that mean that this is not a bug? Or do we expect direct I/O to work when the write size is different from the output block size? We should either document that restriction somewhere or possibly see if we can relax it. It would seem that O_DIRECT should work fine as long as both input and output block sizes are properly aligned?

(In reply to comment #18)
> I think the difference is caused by the
> http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=5929322ccb1f9d27c1b07b746d37419d17a7cbf6
> commit - if the write size is different from the output block size, the O_DIRECT
> flag is turned off.

If someone is using O_DIRECT, they should well understand the limitations. I think it would be fine to spit out an error and not do anything if the specified file size is smaller than the hardware's logical block size. Falling back to buffered I/O is surely not what the user wanted.

Going back to comment 11, I'm not sure why an ibs of 1024k would cause the oflag=direct to be ignored.

Re comment #19: Adding the upstream maintainer to CC... Jim, what do you think about this issue?

Doug,
The problem with your command on 6.2 was that the latter dd
was reading from a pipe, which led inevitably to it reading
partial blocks, *and writing* them. The moment it wrote a block
shorter than the maximum output block size, that caused dd to
turn off O_DIRECT. As mentioned in the commit log:
* src/dd.c (iwrite): Turn off O_DIRECT for any
smaller-than-obs-sized write. Don't bother to restore it.
I suggest that you use iflag=fullblock in the latter dd invocation.
That will ensure that all but the last write is of the specified size,
and thus will not disable O_DIRECT, except, possibly, for the final write.
BTW, using obs=1M (leaving the default ibs at 512B) on the pipe-reading
dd implies reblocking, so has the same net effect: all output
blocks are "full", except possibly the last one.
I haven't tried to reproduce the VM-exhausting behavior yet.
Has anyone else succeeded in demonstrating that?
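Concretely, applied to the pipeline from the original report, that suggestion looks like this (just a sketch; the paths, block sizes, and the dd.el6 binary name are taken from comment #0):

dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 bs=1024k iflag=fullblock oflag=direct

Here iflag=fullblock makes the pipe-reading dd accumulate full 1024k input blocks, so every write except possibly the last one is obs-sized and O_DIRECT stays enabled. As noted above, using obs=1024k on the pipe-reading dd has a similar re-blocking effect.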
Jim, would it make sense for oflag=direct to imply the iflag=fullblock option? I find it hard to imagine a situation where the user intends the current behaviour.

Matthew, that is probably the way to go. Another possibility is to warn about it, I guess.

BTW, what I said above ("has the same net effect") may be true in most cases when the output is a regular file, but is not true in general, i.e., in the presence of interrupts or when writing to a pipe.

What about a note in the info documentation saying:
--- coreutils-8.4-orig/doc/coreutils.texi 2011-01-31 14:48:00.136484054 +0100
+++ coreutils-8.4/doc/coreutils.texi 2011-01-31 14:52:57.581472390 +0100
@@ -7909,6 +7909,9 @@ Note that the kernel may impose restrict
For example, with an ext4 destination file system and a linux-based kernel,
using @samp{oflag=direct} will cause writes to fail with @code{EINVAL} if the
output buffer size is not a multiple of 512.
+This flag is turned off automatically when partial block is written
+(e.g. after reading via pipe), you may consider using @samp{iflag=fullblock}
+to prevent that.
@item directory
@opindex directory
Hi Ondrej,
Adjusting the documentation sounds fine for 6.1:
+Note that this flag is turned off automatically when a partial block
+is written, which happens when reading from a pipe and not re-blocking.
+You can prevent that by using @samp{iflag=fullblock}.
However, for upstream we should probably do better...
as you suggested privately: making oflag=direct imply iflag=fullblock
*might* be ok. An alternative would be to make dd warn when
using oflag=direct without iflag=fullblock.
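For what it's worth, one way to check whether a given invocation keeps O_DIRECT is to trace the output-side dd (a sketch, not something prescribed by this bug): dd drops the flag on the already-open descriptor, which is the fcntl() traffic Doug noticed in his straces. The input file and device below are the ones from comment #0.

gunzip -c < /mnt/backup_raid5_1/disk-e1-d1s1.gz | strace -e trace=fcntl dd of=/dev/disk-e1-d1s1 bs=1024k iflag=fullblock oflag=direct

With iflag=fullblock, O_DIRECT should stay on for everything except possibly the final partial block; if the flag is being dropped earlier, the fcntl() calls make that visible, matching what Doug saw in his traces.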
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously, when the dd utility used pipes, it read and wrote partial blocks. When a written block was shorter than the specified output block size, the "oflag=direct" flag was turned off, which resulted in degraded I/O performance. The workaround for this behaviour, which involves adding "iflag=fullblock", is now described in the info documentation.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2011-0646.html
Created attachment 431890 [details]
vmstat for both tests, proc_mount, meminfo, lsscsi, lvs, mdstat, vgs, and dd versions.

Description of problem:

I hit this problem during a cold database restore. I run 64 dd processes reading from two large XFS filesystems and writing to raw disk partitions. My filesystems live on LVM volumes which are built on md RAID 5 volumes. Here's an example of the dd command that I was using:

dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct &

This command on RHEL 5.4 works great, and my dd's chug along just fine. The problem with the RHEL 6 dd is that it uses up all the system memory, and then all I/O stops while the buffers are being flushed out. This seems like something is not working with the direct flag for dd in EL6.

Version-Release number of selected component (if applicable):

RHEL 6 Beta 2 base install
2.6.32-44.el6.x86_64 kernel
coreutils-8.4-7.el6.x86_64
EL6 - dd (coreutils) 8.4
EL5.4 - dd (coreutils) 5.97

How reproducible:

Do some dd's with the direct I/O flag using the el6 dd. Repeat with the el5.4 dd and see the difference in memory consumption. Eventually, you'll chew up all the memory and go to 80% system time while the memory is being freed.

Additional info:

I've attached the /proc/mounts info that I was using along with two vmstat files, one for the el6 dd and one for the el5.4 dd.

To narrow down the problem, I did this experiment with an EL6 Beta 2 base install with both the el6 dd command and then again with the el5.4 dd command.
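A minimal way to observe the memory behaviour described above (a sketch; the interval, log file name, and use of a single restore stream are illustrative, while the pipeline itself is the one from the description):

vmstat 5 > vmstat.el6.log &
dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct

Repeating the run with the el5.4 dd binary and comparing the "cache" and "sy" columns of the two vmstat logs should show the difference reported in the attached files.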