Bug 689040

Summary: Disk writes are significantly slower with the RHEL kernel than with a kernel built from kernel.org.

Product: Red Hat Enterprise Linux 6
Version: 6.0
Component: kernel
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED NOTABUG
Reporter: dcardimi
Assignee: Red Hat Kernel Manager <kernel-mgr>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: esandeen, jmoyer, rwheeler
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2011-03-25 16:45:48 UTC

Attachments:
  Kernel config from 2.6.32.27 build
  Kernel config from 2.6.38 build

Description dcardimi 2011-03-18 21:40:50 UTC
Description of problem:
Using a binary compiled from the test source referenced below, run the program on RHEL 6 kernel-2.6.32-71.18.2.el6.x86_64.rpm or kernel-2.6.32-71.el6.x86_64.rpm.  Notice that the runtime reported by the program is significantly longer than with a kernel built from kernel.org (I tried versions 2.6.32.27 and 2.6.38).  In my case, the kernels built from kernel.org averaged 2-2.5 seconds for this test program, while the RHEL kernels averaged 40-41 seconds.  I'm not sure if it matters, but I was using an ext3 filesystem; I also mounted the same ext3 filesystem as ext4, with the same results.

There were some changes related to this issue checked into fs/buffer.c at kernel.org on 2/18/2009 and 4/6/2009.

Version-Release number of selected component (if applicable):
kernel-2.6.32-71.18.2.el6.x86_64.rpm
kernel-2.6.32-71.el6.x86_64.rpm

How reproducible:
Every time.

Steps to Reproduce:
Use a testcase similar to the following (adapted from kernel.org commit http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=78f707bfc723552e8309b7c38a8d0cc51012e813):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>     /* fsync() */
#include <sys/time.h>   /* gettimeofday(), timersub() */

#define ROWS 2000

int main(void)
{
    int fdes, i;
    FILE *fp;
    struct timeval start, end, res;

    gettimeofday(&start, NULL);
    /* Append one line at a time, forcing each to stable storage with fsync(). */
    for (i = 0; i < ROWS; i++) {
        fp = fopen("test_file", "a");
        if (!fp) {
            perror("fopen");
            exit(1);
        }
        fprintf(fp, "Some Text Data\n");
        fdes = fileno(fp);
        fsync(fdes);
        fclose(fp);
    }
    gettimeofday(&end, NULL);

    timersub(&end, &start, &res);
    fprintf(stdout, "time to write %d lines is %ld (msec)\n", ROWS,
            (long)(res.tv_sec * 1000000 + res.tv_usec) / 1000);

    return 0;
}
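
(For reference, the testcase builds with a plain C compiler invocation, e.g. "gcc -o fsync_test fsync_test.c" with arbitrary source/binary names, and should be run from the filesystem being tested.)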
  
Actual results:
The program runtime is significantly longer with the RHEL kernel (~40 seconds compared to ~2 seconds).

Expected results:
The program runtime should be the same using the RHEL kernel or the stock kernel from kernel.org.

Additional info:

Comment 2 dcardimi 2011-03-18 21:59:04 UTC
This affects applications using sqlite in synchronous write mode.
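
To illustrate the pattern (a rough sketch only, not our actual application; the database file and table names are made up): with SQLite's default synchronous=FULL setting, every autocommitted INSERT ends in an fsync() on the database, so a loop of single-row INSERTs exercises the same path as the fopen/fsync testcase in the description above.

/* Rough sketch, not the real application: each autocommitted INSERT
 * is a synchronous commit and ends with an fsync().
 * Build with: gcc -o sqlite_sync sqlite_sync.c -lsqlite3
 */
#include <stdio.h>
#include <sqlite3.h>

#define ROWS 2000

int main(void)
{
    sqlite3 *db;
    char *err = NULL;
    int i;

    if (sqlite3_open("test.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    /* FULL is the default; set it explicitly for clarity. */
    sqlite3_exec(db, "PRAGMA synchronous = FULL;", NULL, NULL, &err);
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS t (msg TEXT);", NULL, NULL, &err);

    /* One autocommitted INSERT per iteration => one synchronous commit per row. */
    for (i = 0; i < ROWS; i++)
        sqlite3_exec(db, "INSERT INTO t VALUES ('Some Text Data');",
                     NULL, NULL, &err);

    sqlite3_close(db);
    return 0;
}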

Comment 3 dcardimi 2011-03-18 22:35:49 UTC
I also tried the Fedora 14 kernel (kernel-2.6.35.11-83.fc14.x86_64.rpm), with the same result (slow writes).

Comment 4 Ric Wheeler 2011-03-19 12:00:29 UTC
Can you add to this some description of your storage hardware and your server?

If you are using an array (i.e. something with a non-volatile write cache), can you retry with "mount -o nobarrier"?

Thanks!

Comment 5 dcardimi 2011-03-20 16:12:36 UTC
This was tested on a Pentium 4 2.8GHz, w/ 2G RAM, with a single 250GB serial ATA Western Digital drive (no RAID setup).

Comment 6 Jeff Moyer 2011-03-21 13:47:11 UTC
Were you using the same mount options between kernel revisions?  Specifically, if barriers are enabled, it will definitely take longer to fsync.  What I/O scheduler was used in testing?

Comment 7 dcardimi 2011-03-21 19:27:34 UTC
Yes, the same mount options were used in all tests.  In all cases, the CFQ IO scheduler was used.

Comment 8 Eric Sandeen 2011-03-21 20:46:29 UTC
Can you please provide the kernel config used for the upstream kernel build & test?

Thanks,
-Eric

Comment 9 dcardimi 2011-03-21 21:18:44 UTC
Created attachment 486697 [details]
Kernel config from 2.6.32.27 build

I've attached the .config from my 2.6.32.27 build.

Comment 10 Eric Sandeen 2011-03-21 21:34:18 UTC
# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set

is likely relevant, especially in light of the fsync() call in your test.

data=writeback is faster, but can lead to data exposure after a crash, so in Fedora and RHEL we have data=ordered as the default.  (Upstream later changed its default back to ordered as well.)  data=ordered does, however, have issues with fsyncs getting tangled up with all of the buffered data for the filesystem.

If you have the time, could you re-compare with CONFIG_EXT3_DEFAULTS_TO_ORDERED=y on your upstream kernel?

Thanks,

-Eric

Comment 11 dcardimi 2011-03-21 22:46:47 UTC
Created attachment 486701 [details]
Kernel config from 2.6.38 build

Here is a .config from my 2.6.38 kernel, which did have CONFIG_EXT3_DEFAULTS_TO_ORDERED=y.  This kernel also showed the faster runtimes.

Comment 12 Eric Sandeen 2011-03-21 22:55:51 UTC
Yes, the default was changed again by 2.6.38 (it changed back in 2.6.36).

That one config option really could make all the difference.

See also:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=aa32a796389bedbcf1c7714385b18714a0743810

-Eric

Comment 13 dcardimi 2011-03-21 23:11:26 UTC
Should I try CONFIG_EXT3_DEFAULTS_TO_ORDERED=y with the 2.6.32.27 kernel to see if the runtime of the test case is slower?  My 2.6.38 config had it set to "y", and the runtime of the test case was still much faster than the 2.6.32-71.18.2 RH kernel.

Comment 14 Eric Sandeen 2011-03-21 23:47:33 UTC
So 2.6.32/RHEL6 and 2.6.35/F14 were "slow" runs (RHEL6 was ~40s, F14 unspecified); these both default to ordered mode.

2.6.32/vanilla was "fast" (~2s) with writeback mode.
2.6.38/vanilla was "fast" (?s) with ordered mode.

2.6.38 does seem to be a bit of an anomaly; I'm not quite sure what to make of that yet, so I'll give it a whirl myself.

FWIW, you could test your vanilla 2.6.32 without rebuilding, just mount the filesystem with -o data=ordered to override the writeback default.

Comment 15 Eric Sandeen 2011-03-21 23:49:45 UTC
One other question: are you testing on a dedicated filesystem, or on the root fs?

Comment 16 dcardimi 2011-03-21 23:53:55 UTC
Yes, 2.6.32/RHEL6 & 2.6.35/F14 both took around 40s.  2.6.32 & 2.6.38 vanilla were around 2s.  In all cases, tests were done on the root filesystem.

I'll try the 2.6.32 test tomorrow with the different mount option.

Comment 17 Eric Sandeen 2011-03-22 00:13:01 UTC
Ok, for starters, testing on the root filesystem may skew results, because in data=ordered mode, the fsyncs in your testcase may end up flushing out system data.  This is a known, but unfortunate, behavior of the ext3 journal design.

In any case, I was able to do some testing here.  I should have gone with Ric's first hunch about the barriers in comment #4, because that's making all the difference, not the journal mode.

With:

# mount -o data=writeback,barrier=0 /dev/sdb5 /mnt/test

(upstream defaults in 2.6.32)

I got < 1.5s.

With:

# mount -o data=ordered,barrier=1 /dev/sdb5 /mnt/test

(RHEL6 defaults - data integrity over speed, not vice versa)

I got 55s.

For the full ordered/writeback and barrier/nobarrier matrix, it looks like this:

           barrier    nobarrier
ordered      55s         1.5s
writeback    55s         1.5s

With barriers enabled, an fsync() causes a disk cache flush.  Without this, data is not persistent on disk; it is only in the disk's write cache, and will be lost if power is lost.

http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/writebarr.html

I also see the fast time on upstream .38 (well, the F15 kernel, close enough) but again that's mounting without barriers.

With barriers on, on 2.6.38, I see 56s.

Sorry for the wild goose chase on the journal config; this is entirely a write barrier issue.
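
For what it's worth, a variant of the testcase that keeps the file open and issues a single fsync() after all the appends pays the flush cost only once; just a sketch (same file name as the original testcase) to show where the time goes:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

#define ROWS 2000

int main(void)
{
    FILE *fp;
    struct timeval start, end, res;
    int i;

    gettimeofday(&start, NULL);
    fp = fopen("test_file", "a");
    if (!fp) {
        perror("fopen");
        exit(1);
    }
    /* Buffer all the appends, then flush and fsync once; with barriers
     * enabled this issues roughly one cache flush instead of one per line. */
    for (i = 0; i < ROWS; i++)
        fprintf(fp, "Some Text Data\n");
    fflush(fp);
    fsync(fileno(fp));
    fclose(fp);
    gettimeofday(&end, NULL);

    timersub(&end, &start, &res);
    fprintf(stdout, "time to write %d lines is %ld (msec)\n", ROWS,
            (long)(res.tv_sec * 1000000 + res.tv_usec) / 1000);
    return 0;
}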

Comment 18 dcardimi 2011-03-25 16:33:14 UTC
I've retested this with a number of kernel versions with different mount options, and confirmed that this is simply a barrier issue.

Thanks for investigating.

Comment 19 Eric Sandeen 2011-03-25 16:45:48 UTC
Ok, thanks for testing!