Bug 614605 - [Intel 6.1 Bug] direct IO with dd seems broken compared to RHEL 5.4
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: coreutils
Version: 6.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 6.1
Assigned To: Ondrej Vasik
QA Contact: qe-baseos-daemons
Depends On:
Blocks: 600438 580566
Reported: 2010-07-14 16:47 EDT by Doug Nelson
Modified: 2011-05-19 09:50 EDT (History)
20 users

See Also:
Fixed In Version: coreutils-8.4-12.el6
Doc Type: Bug Fix
Doc Text:
Previously, when the dd utility read from a pipe, it could read and write partial blocks. When a written block was shorter than the specified output block size, dd turned off the "oflag=direct" flag, which resulted in degraded I/O performance. The workaround for this behavior, adding "iflag=fullblock" to the invocation, is now described in the info documentation.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-05-19 09:50:48 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
vmstat for both tests, proc_mount, meminfo, lsscsi, lvs, mdstat, vgs, and dd versions. (13.16 KB, application/x-compressed-tar)
2010-07-14 16:47 EDT, Doug Nelson
no flags
straces for the output dd's for rhel5.4 and 6.0 beta 2. linux perf counters from high system time during el6 restore (3.96 MB, application/x-compressed-tar)
2010-07-20 17:37 EDT, Doug Nelson
no flags


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0646 normal SHIPPED_LIVE coreutils bug fix update 2011-05-18 14:11:00 EDT

Description Doug Nelson 2010-07-14 16:47:53 EDT
Created attachment 431890 [details]
vmstat for both tests, proc_mount, meminfo, lsscsi, lvs, mdstat, vgs, and dd versions.

Description of problem:

I hit this problem during a cold database restore.  I run 64 dd processes reading from two large XFS filesystems and writing to raw disk partitions.  My filesystems live on LVM volumes built on MD RAID 5 volumes.

Here's an example of the dd command that I was using:

dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct &


This command works great on RHEL 5.4, and my dd's chug along just fine.

The problem with the RHEL 6 dd is that it uses up all the system memory, and then all IO stops while the buffers are flushed out. It seems as though the direct flag is not working for dd in EL6.

Version-Release number of selected component (if applicable):

RHEL 6 Beta 2 base install
2.6.32-44.el6.x86_64 kernel

coreutils-8.4-7.el6.x86_64

EL6   -dd (coreutils) 8.4
EL5.4 -dd (coreutils) 5.97 

How reproducible:
Do some dd's with the direct IO flag using the el6 dd. Repeat with the el5.4 dd and see the difference in memory consumption. Eventually, you'll chew up all the memory and go to 80% system time while the memory is freed.
  

Additional info:

I've attached the /proc/mount info that I was using along with two vmstat files, one for el6 dd and one for the el5.4 dd.    

To narrow down the problem, I did this experiment with an EL6 Beta 2 base install with both the el6 dd command and then again with the el5.4 dd command.
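
For reference, the comparison described above boils down to a run like the following (a sketch using the paths from this report; anything on the target device gets overwritten):

# Sketch of the comparison described above; the target device is overwritten.
vmstat 1 > vmstat.el6.out & VMSTAT_PID=$!
dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct
kill $VMSTAT_PID
# Repeat with the el5.4 dd binary and compare the "cache" column of the two
# vmstat outputs: the el6 run fills the page cache, the el5.4 run does not.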
Comment 2 Ondrej Vasik 2010-07-15 04:26:07 EDT
Thanks for the report; I'm marking this as a regression and will try to find out what's wrong. However, there were many changes in dd between coreutils 5.97 and 8.4, so it may take a while.
Comment 5 Jeff Moyer 2010-07-19 10:08:24 EDT
(In reply to comment #0)

> To narrow down the problem, I did this experiment with an EL6 Beta 2 base
> install with both the el6 dd command and then again with the el5.4 dd command.    

Just to clarify, do you mean you ran the el5.4 dd binary on an el6 kernel and the problem did not present itself?

Thanks.
Comment 6 Doug Nelson 2010-07-19 11:51:07 EDT
(In reply to comment #5)
> (In reply to comment #0)
> 
> > To narrow down the problem, I did this experiment with an EL6 Beta 2 base
> > install with both the el6 dd command and then again with the el5.4 dd command.    
> 
> Just to clarify, do you mean you ran the el5.4 dd binary on an el6 kernel and
> the problem did not present itself?
> 
> Thanks.    

Yes, that is exactly what I did.
Comment 8 Ondrej Vasik 2010-07-19 13:27:13 EDT
So far I tried the quick test on ext4 with RHEL-5 (compiled on RHEL-6) and RHEL-6 dd binaries on RHEL-6 beta2 kernel. I saw no obvious difference in performance...

It would be really good to know what's different - a good start would be to strace both runs and attach the results here. Could you please do that, Doug? TIA.

Another helpful thing would be a callgrind profiling analysis... but strace is a good place to start.
Comment 9 Doug Nelson 2010-07-20 09:51:13 EDT
(In reply to comment #8)
> So far I tried the quick test on ext4 with RHEL-5 (compiled on RHEL-6) and
> RHEL-6 dd binaries on RHEL-6 beta2 kernel. I saw no obvious difference in
> performance...
> 
> It would be really good to know what's different - good start could be to
> strace both runs and attach the result here. Could you please do that, Doug?
> TIA. 
> 

I'll try to grab some straces today.
Comment 10 Doug Nelson 2010-07-20 17:37:19 EDT
Created attachment 433269 [details]
straces for the output dd's for rhel5.4 and 6.0 beta 2.   linux perf counters from high system time during el6 restore

I've included some strace output files for el5 and el6, and some linux perf counter cpu cycle and callgraph data from the high system time portion of the el6 database restore (when the mem is being freed up).
Comment 11 Doug Nelson 2010-07-20 20:26:46 EDT
I believe that I've found a workaround for this problem.   

I changed bs=1024k to obs=1024k for the second dd in my example line; now the fcntl lines are gone from the strace and the dd's are no longer chewing up all my memory.

Bad EL6 behavior
----------------
dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct

Good EL6 behavior
-----------------
dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct ibs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 obs=1024k oflag=direct


O_DIRECT seems to be turned off after the open() if I set bs=1024k on the dd that is receiving data from the pipe and writing to the raw partition.

Seems to be working if I only use obs=1024k.
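
One way to confirm this from the strace output is to trace only open and fcntl on the writing dd (a simplified sketch of the pipeline; the exact fcntl arguments may vary by coreutils version):

# With bs=1024k, short reads from the pipe lead to short writes, and the
# trace should show fcntl(..., F_SETFL, ...) calls dropping O_DIRECT;
# with obs=1024k those fcntl calls should be absent.
gunzip -c /mnt/backup_raid5_1/disk-e1-d1s1.gz | strace -e trace=open,fcntl -o dd.strace dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct
grep fcntl dd.strace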

I'm happy with this workaround.  I'll leave it to you to decide if there is a bug here.

thanks,

doug
Comment 18 Ondrej Vasik 2011-01-25 04:00:06 EST
I think the difference is caused by commit http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=5929322ccb1f9d27c1b07b746d37419d17a7cbf6 - if the write size differs from the output block size, the O_DIRECT flag is turned off.
Comment 19 Russell Doty 2011-01-27 14:50:13 EST
Re: comment 18 - does that mean that this is not a bug? Or do we expect direct I/O to keep working when the write size differs from the output block size?
Comment 20 Ric Wheeler 2011-01-27 15:12:30 EST
We should either document that restriction somewhere or possibly see if we can relax it.

It would seem that O_DIRECT should work fine as long as both input and output block sizes are properly aligned?
Comment 21 Jeff Moyer 2011-01-27 15:31:47 EST
(In reply to comment #18)
> I think the difference is caused by
> http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=5929322ccb1f9d27c1b07b746d37419d17a7cbf6
> commit - if the write size is different from output blocksize, O_DIRECT flag is
> turned off.

If someone is using O_DIRECT, they should well understand the limitations.  I think it would be fine to spit out an error and not do anything if the specified file size is smaller than the hardware's logical block size.  Falling back to buffered I/O is surely not what the user wanted.
Comment 22 Jeff Moyer 2011-01-27 15:36:17 EST
Going back to comment 11, I'm not sure why an ibs of 1024k would cause the oflag=direct to be ignored.
Comment 23 Ondrej Vasik 2011-01-28 08:43:48 EST
Re comment#19 : Adding upstream maintainer to cc ... Jim, what do you think about this issue?
Comment 24 Jim Meyering 2011-01-28 09:46:29 EST
Doug,

The problem with your command on 6.0 was that the latter dd
was reading from a pipe, which led inevitably to it reading
partial blocks, *and writing* them.  The moment it wrote a block
shorter than the maximum output block size, that caused dd to
turn off O_DIRECT.  As mentioned in the commit log:

    * src/dd.c (iwrite): Turn off O_DIRECT for any
    smaller-than-obs-sized write.  Don't bother to restore it.

I suggest that you use iflag=fullblock in the latter dd invocation.
That will ensure that all but the last write is of the specified size,
and thus will not disable O_DIRECT, except, possibly, for the final write.
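
Applied to the pipeline from the original report, that suggestion would look like this (a sketch; paths are taken from comment 0):

# iflag=fullblock makes the pipe-reading dd accumulate full input blocks
# before writing, so O_DIRECT stays enabled for all but possibly the last write.
dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 iflag=fullblock bs=1024k oflag=direct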

BTW, using obs=1M (leaving the default ibs at 512B) on the pipe-reading
dd implies reblocking, so has the same net effect: all output
blocks are "full", except possibly the last one.

I haven't tried to reproduce the VM-exhausting behavior yet.
Has anyone else succeeded in demonstrating that?
Comment 25 Matthew Wilcox 2011-01-28 09:51:58 EST
Jim, would it make sense for oflag=direct to imply the iflag=fullblock option?  I find it hard to imagine a situation where the user intends the current behaviour.
Comment 26 Jim Meyering 2011-01-28 10:05:16 EST
Matthew, that is probably the way to go.
Another possibility is to warn about it, I guess.

BTW, what I said above "has the same net effect" may be true
in most cases when the output is a regular file, but is not true in general,
i.e., in the presence of interrupts or when writing to a pipe.
Comment 27 Ondrej Vasik 2011-01-31 08:55:01 EST
What about a docs note in the info documentation saying:

--- coreutils-8.4-orig/doc/coreutils.texi	2011-01-31 14:48:00.136484054 +0100
+++ coreutils-8.4/doc/coreutils.texi	2011-01-31 14:52:57.581472390 +0100
@@ -7909,6 +7909,9 @@ Note that the kernel may impose restrict
 For example, with an ext4 destination file system and a linux-based kernel,
 using @samp{oflag=direct} will cause writes to fail with @code{EINVAL} if the
 output buffer size is not a multiple of 512.
+This flag is turned off automatically when partial block is written
+(e.g. after reading via pipe), you may consider using @samp{iflag=fullblock}
+to prevent that.
 
 @item directory
 @opindex directory
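
For illustration, the EINVAL restriction mentioned in the note can be demonstrated directly (a sketch; run on an ext4 filesystem, exact error text varies by coreutils version):

# 1000 bytes is not a multiple of 512, so the O_DIRECT write is expected
# to fail with EINVAL:
dd if=/dev/zero of=testfile bs=1000 count=1 oflag=direct
# 1024 is a multiple of 512, so this should succeed:
dd if=/dev/zero of=testfile bs=1024 count=1 oflag=direct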
Comment 28 Jim Meyering 2011-02-02 06:57:02 EST
Hi Ondrej,

Adjusting the documentation sounds fine for 6.1:

+Note that this flag is turned off automatically when a partial block
+is written, which happens when reading from a pipe and not re-blocking.
+You can prevent that by using @samp{iflag=fullblock}.

However, for upstream we should probably do better...
as you suggested privately: making oflag=direct imply iflag=fullblock
*might* be ok.  An alternative would be to make dd warn when
using oflag=direct without iflag=fullblock.
Comment 34 Misha H. Ali 2011-05-10 01:25:28 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, when the dd utility read from a pipe, it could read and write partial blocks. When a written block was shorter than the specified output block size, dd turned off the "oflag=direct" flag, which resulted in degraded I/O performance. The workaround for this behavior, adding "iflag=fullblock" to the invocation, is now described in the info documentation.
Comment 35 errata-xmlrpc 2011-05-19 09:50:48 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0646.html
