Bug 614605 - [Intel 6.1 Bug] direct IO with dd seems broken compared to RHEL 5.4
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: coreutils
Version: 6.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 6.1
Assigned To: Ondrej Vasik
QA Contact: qe-baseos-daemons
Depends On:
Blocks: 600438 580566
Reported: 2010-07-14 16:47 EDT by Doug Nelson
Modified: 2011-05-19 09:50 EDT (History)
20 users

See Also:
Fixed In Version: coreutils-8.4-12.el6
Doc Type: Bug Fix
Doc Text:
Previously, when the dd utility read from a pipe, it could read and write partial blocks. When a written block was shorter than the specified output block size, dd turned off the "oflag=direct" flag, which resulted in degraded I/O performance. The workaround for this behavior, adding "iflag=fullblock" to the invocation, is now described in the info documentation.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-05-19 09:50:48 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
vmstat for both tests, proc_mount, meminfo, lsscsi, lvs, mdstat, vgs, and dd versions. (13.16 KB, application/x-compressed-tar)
2010-07-14 16:47 EDT, Doug Nelson
no flags
straces for the output dd's for rhel5.4 and 6.0 beta 2. linux perf counters from high system time during el6 restore (3.96 MB, application/x-compressed-tar)
2010-07-20 17:37 EDT, Doug Nelson
no flags


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0646 normal SHIPPED_LIVE coreutils bug fix update 2011-05-18 14:11:00 EDT

Description Doug Nelson 2010-07-14 16:47:53 EDT
Created attachment 431890 [details]
vmstat for both tests, proc_mount, meminfo, lsscsi, lvs, mdstat, vgs, and dd versions.

Description of problem:

I hit this problem during a cold database restore.  I run 64 dd processes reading from two large XFS filesystems and writing to raw disk partitions.  My filesystems live on LVM volumes built on MD RAID 5 volumes.

Here's an example of the dd command that I was using:

dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct &


This command works great on RHEL 5.4, and my dd's chug along just fine.

The problem with the RHEL 6 dd is that it uses up all the system memory, and then all IO stops while the buffers are flushed out. It seems as though the direct flag is not working for dd in EL6.

Version-Release number of selected component (if applicable):

RHEL 6 Beta 2 base install
2.6.32-44.el6.x86_64 kernel

coreutils-8.4-7.el6.x86_64

EL6   -dd (coreutils) 8.4
EL5.4 -dd (coreutils) 5.97 

How reproducible:
Do some dd's with the direct IO flag using the el6 dd. Repeat with the el5.4 dd and see the difference in memory consumption. Eventually, you'll chew up all the memory and go to 80% system time while the memory is freed.
  

Additional info:

I've attached the /proc/mount info that I was using along with two vmstat files, one for el6 dd and one for the el5.4 dd.    

To narrow down the problem, I did this experiment with an EL6 Beta 2 base install with both the el6 dd command and then again with the el5.4 dd command.
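
For reference, the comparison described above boils down to a run like the following (a sketch using the paths from this report; anything on the target device gets overwritten):

# Sketch of the comparison described above; the target device is overwritten.
vmstat 1 > vmstat.el6.out & VMSTAT_PID=$!
dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct
kill $VMSTAT_PID
# Repeat with the el5.4 dd binary and compare the "cache" column of the two
# vmstat outputs: the el6 run fills the page cache, the el5.4 run does not.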
Comment 2 Ondrej Vasik 2010-07-15 04:26:07 EDT
Thanks for the report; I'm marking this as a regression and will try to find out what's wrong. However, there were many changes in dd between coreutils 5.97 and 8.4, so it may take a while.
Comment 5 Jeff Moyer 2010-07-19 10:08:24 EDT
(In reply to comment #0)

> To narrow down the problem, I did this experiment with an EL6 Beta 2 base
> install with both the el6 dd command and then again with the el5.4 dd command.    

Just to clarify, do you mean you ran the el5.4 dd binary on an el6 kernel and the problem did not present itself?

Thanks.
Comment 6 Doug Nelson 2010-07-19 11:51:07 EDT
(In reply to comment #5)
> (In reply to comment #0)
> 
> > To narrow down the problem, I did this experiment with an EL6 Beta 2 base
> > install with both the el6 dd command and then again with the el5.4 dd command.    
> 
> Just to clarify, do you mean you ran the el5.4 dd binary on an el6 kernel and
> the problem did not present itself?
> 
> Thanks.    

Yes, that is exactly what I did.
Comment 8 Ondrej Vasik 2010-07-19 13:27:13 EDT
So far I tried the quick test on ext4 with RHEL-5 (compiled on RHEL-6) and RHEL-6 dd binaries on RHEL-6 beta2 kernel. I saw no obvious difference in performance...

It would be really good to know what's different - a good start would be to strace both runs and attach the results here. Could you please do that, Doug? TIA.

Another helpful thing would be a callgrind profiling analysis... but strace is a good place to start.
Comment 9 Doug Nelson 2010-07-20 09:51:13 EDT
(In reply to comment #8)
> So far I tried the quick test on ext4 with RHEL-5 (compiled on RHEL-6) and
> RHEL-6 dd binaries on RHEL-6 beta2 kernel. I saw no obvious difference in
> performance...
> 
> It would be really good to know what's different - good start could be to
> strace both runs and attach the result here. Could you please do that, Doug?
> TIA. 
> 

I'll try to grab some straces today.
Comment 10 Doug Nelson 2010-07-20 17:37:19 EDT
Created attachment 433269 [details]
straces for the output dd's for rhel5.4 and 6.0 beta 2.   linux perf counters from high system time during el6 restore

I've included some strace output files for el5 and el6, and some linux perf counter cpu cycle and callgraph data from the high system time portion of the el6 database restore (when the mem is being freed up).
Comment 11 Doug Nelson 2010-07-20 20:26:46 EDT
I believe that I've found a workaround for this problem.   

I changed bs=1024k to obs=1024k for the second dd in my example line; now the fcntl lines are gone from the strace and the dd's are no longer chewing up all my memory.

Bad EL6 behavior
----------------
dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct

Good EL6 behavior
-----------------
dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct ibs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 obs=1024k oflag=direct


O_DIRECT seems to be turned off after the open() if I set bs=1024k on the dd that is receiving data from the pipe and writing to the raw partition.

Seems to be working if I only use obs=1024k.
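
One way to confirm this from the strace output is to trace only open and fcntl on the writing dd (a simplified sketch of the pipeline; the exact fcntl arguments may vary by coreutils version):

# With bs=1024k, short reads from the pipe lead to short writes, and the
# trace should show fcntl(..., F_SETFL, ...) calls dropping O_DIRECT;
# with obs=1024k those fcntl calls should be absent.
gunzip -c /mnt/backup_raid5_1/disk-e1-d1s1.gz | strace -e trace=open,fcntl -o dd.strace dd.el6 of=/dev/disk-e1-d1s1 bs=1024k oflag=direct
grep fcntl dd.strace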

I'm happy with this workaround.  I'll leave it to you to decide if there is a bug here.

thanks,

doug
Comment 18 Ondrej Vasik 2011-01-25 04:00:06 EST
I think the difference is caused by commit http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=5929322ccb1f9d27c1b07b746d37419d17a7cbf6 - if the write size differs from the output block size, the O_DIRECT flag is turned off.
Comment 19 Russell Doty 2011-01-27 14:50:13 EST
Re: comment 18 - does that mean that this is not a bug? Or do we expect direct I/O to keep working when the write size differs from the output block size?
Comment 20 Ric Wheeler 2011-01-27 15:12:30 EST
We should either document that restriction somewhere or possibly see if we can relax it.

It would seem that O_DIRECT should work fine as long as both input and output block sizes are properly aligned?
Comment 21 Jeff Moyer 2011-01-27 15:31:47 EST
(In reply to comment #18)
> I think the difference is caused by
> http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=5929322ccb1f9d27c1b07b746d37419d17a7cbf6
> commit - if the write size is different from output blocksize, O_DIRECT flag is
> turned off.

If someone is using O_DIRECT, they should well understand the limitations.  I think it would be fine to spit out an error and not do anything if the specified file size is smaller than the hardware's logical block size.  Falling back to buffered I/O is surely not what the user wanted.
Comment 22 Jeff Moyer 2011-01-27 15:36:17 EST
Going back to comment 11, I'm not sure why an ibs of 1024k would cause the oflag=direct to be ignored.
Comment 23 Ondrej Vasik 2011-01-28 08:43:48 EST
Re comment#19 : Adding upstream maintainer to cc ... Jim, what do you think about this issue?
Comment 24 Jim Meyering 2011-01-28 09:46:29 EST
Doug,

The problem with your command on 6.0 was that the latter dd
was reading from a pipe, which led inevitably to it reading
partial blocks, *and writing* them.  The moment it wrote a block
shorter than the maximum output block size, that caused dd to
turn off O_DIRECT.  As mentioned in the commit log:

    * src/dd.c (iwrite): Turn off O_DIRECT for any
    smaller-than-obs-sized write.  Don't bother to restore it.

I suggest that you use iflag=fullblock in the latter dd invocation.
That will ensure that all but the last write is of the specified size,
and thus will not disable O_DIRECT, except, possibly, for the final write.
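
Applied to the pipeline from the original report, that suggestion would look like this (a sketch; paths are taken from comment 0):

# iflag=fullblock makes the pipe-reading dd accumulate full input blocks
# before writing, so O_DIRECT stays enabled for all but possibly the last write.
dd.el6 if=/mnt/backup_raid5_1/disk-e1-d1s1.gz iflag=direct bs=1024k | gunzip -c | dd.el6 of=/dev/disk-e1-d1s1 iflag=fullblock bs=1024k oflag=direct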

BTW, using obs=1M (leaving the default ibs at 512B) on the pipe-reading
dd implies reblocking, so has the same net effect: all output
blocks are "full", except possibly the last one.

I haven't tried to reproduce the VM-exhausting behavior yet.
Has anyone else succeeded in demonstrating that?
Comment 25 Matthew Wilcox 2011-01-28 09:51:58 EST
Jim, would it make sense for oflag=direct to imply the iflag=fullblock option?  I find it hard to imagine a situation where the user intends the current behaviour.
Comment 26 Jim Meyering 2011-01-28 10:05:16 EST
Matthew, that is probably the way to go.
Another possibility is to warn about it, I guess.

BTW, what I said above "has the same net effect" may be true
in most cases when the output is a regular file, but is not true in general,
i.e., in the presence of interrupts or when writing to a pipe.
Comment 27 Ondrej Vasik 2011-01-31 08:55:01 EST
What about a docs note in the info documentation saying:

--- coreutils-8.4-orig/doc/coreutils.texi	2011-01-31 14:48:00.136484054 +0100
+++ coreutils-8.4/doc/coreutils.texi	2011-01-31 14:52:57.581472390 +0100
@@ -7909,6 +7909,9 @@ Note that the kernel may impose restrict
 For example, with an ext4 destination file system and a linux-based kernel,
 using @samp{oflag=direct} will cause writes to fail with @code{EINVAL} if the
 output buffer size is not a multiple of 512.
+This flag is turned off automatically when partial block is written
+(e.g. after reading via pipe), you may consider using @samp{iflag=fullblock}
+to prevent that.
 
 @item directory
 @opindex directory
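
For illustration, the EINVAL restriction mentioned in the note can be demonstrated directly (a sketch; run on an ext4 filesystem, exact error text varies by coreutils version):

# 1000 bytes is not a multiple of 512, so the O_DIRECT write is expected
# to fail with EINVAL:
dd if=/dev/zero of=testfile bs=1000 count=1 oflag=direct
# 1024 is a multiple of 512, so this should succeed:
dd if=/dev/zero of=testfile bs=1024 count=1 oflag=direct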
Comment 28 Jim Meyering 2011-02-02 06:57:02 EST
Hi Ondrej,

Adjusting the documentation sounds fine for 6.1:

+Note that this flag is turned off automatically when a partial block
+is written, which happens when reading from a pipe and not re-blocking.
+You can prevent that by using @samp{iflag=fullblock}.

However, for upstream we should probably do better...
as you suggested privately: making oflag=direct imply iflag=fullblock
*might* be ok.  An alternative would be to make dd warn when
using oflag=direct without iflag=fullblock.
Comment 34 Misha H. Ali 2011-05-10 01:25:28 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, when the dd utility read from a pipe, it could read and write partial blocks. When a written block was shorter than the specified output block size, dd turned off the "oflag=direct" flag, which resulted in degraded I/O performance. The workaround for this behavior, adding "iflag=fullblock" to the invocation, is now described in the info documentation.
Comment 35 errata-xmlrpc 2011-05-19 09:50:48 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0646.html
