Description of problem:

Using a binary compiled from the test source below, run the program on RHEL 6 kernel-2.6.32-71.18.2.el6.x86_64.rpm or kernel-2.6.32-71.el6.x86_64.rpm. The runtime is significantly longer than with a kernel built from kernel.org (I tried versions 2.6.32.27 and 2.6.38). In my case, the kernels built from kernel.org averaged 2-2.5 seconds to run this test program, while the RHEL kernels averaged 40-41 seconds for the same program. I'm not sure if it matters, but I was using an ext3 filesystem; I also mounted the ext3 filesystem as ext4, with the same results. There were some changes related to this issue in fs/buffer.c checked into kernel.org on 2/18/2009 and 4/6/2009.

Version-Release number of selected component (if applicable):
kernel-2.6.32-71.18.2.el6.x86_64.rpm
kernel-2.6.32-71.el6.x86_64.rpm

How reproducible:
Every time.

Steps to Reproduce:
Compile and run a testcase similar to this (adapted from kernel.org commit http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=78f707bfc723552e8309b7c38a8d0cc51012e813):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

#define ROWS 2000

int main(int argc, char **argv)
{
    int fdes, i;
    FILE *fp;
    struct timeval start;
    struct timeval end;
    struct timeval res;

    gettimeofday(&start, NULL);
    for (i = 0; i < ROWS; i++) {
        fp = fopen("test_file", "a");
        fprintf(fp, "Some Text Data\n");
        fdes = fileno(fp);
        fsync(fdes);
        fclose(fp);
    }
    gettimeofday(&end, NULL);
    timersub(&end, &start, &res);
    fprintf(stdout, "time to write %d lines is %ld(msec)\n", ROWS,
            (res.tv_sec * 1000000 + res.tv_usec) / 1000);
    return 0;
}

Actual results:
The program runtime is significantly longer on the RHEL kernel (~40 seconds compared to ~2 seconds).

Expected results:
The program runtime should be about the same on the RHEL kernel and the stock kernel from kernel.org.

Additional info:
This affects applications using sqlite in synchronous write mode.
I also tried the Fedora 14 kernel, kernel-2.6.35.11-83.fc14.x86_64.rpm, with the same results (slow writes).
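For reference, here is one way to measure the per-call fsync() cost in the testcase above directly, rather than only the loop total. This is just an illustrative sketch; the file name "test_file_lat" and the reporting format are arbitrary choices, not part of the original testcase.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

#define ROWS 2000

int main(int argc, char **argv)
{
    struct timeval t0, t1, d;
    long usec, total_usec = 0, max_usec = 0;
    FILE *fp;
    int i;

    for (i = 0; i < ROWS; i++) {
        fp = fopen("test_file_lat", "a");
        if (fp == NULL) {
            perror("fopen");
            return 1;
        }
        fprintf(fp, "Some Text Data\n");
        fflush(fp);                     /* hand the line to the kernel first */

        gettimeofday(&t0, NULL);
        fsync(fileno(fp));              /* time only the fsync itself */
        gettimeofday(&t1, NULL);
        fclose(fp);

        timersub(&t1, &t0, &d);
        usec = d.tv_sec * 1000000 + d.tv_usec;
        total_usec += usec;
        if (usec > max_usec)
            max_usec = usec;
    }

    printf("fsync: avg %ld usec, max %ld usec over %d calls\n",
           total_usec / ROWS, max_usec, ROWS);
    return 0;
}

With the numbers reported above, ~2s total over 2000 iterations works out to roughly 1 ms per fsync, and ~40s works out to roughly 20 ms per fsync, so the per-call view makes the difference easy to see.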
Can you add some description of your storage hardware and your server? If you are using an array (i.e. something with a non-volatile write cache), can you retry with "mount -o nobarrier"? Thanks!
This was tested on a Pentium 4 2.8GHz, w/ 2G RAM, with a single 250GB serial ATA Western Digital drive (no RAID setup).
Were you using the same mount options between kernel revisions? Specifically, if barriers are enabled, fsync will definitely take longer. What I/O scheduler was used in testing?
Yes, the same mount options were used in all tests. In all cases, the CFQ IO scheduler was used.
Can you please provide the kernel config used for the upstream kernel build & test? Thanks, -Eric
Created attachment 486697 [details]
Kernel config from 2.6.32.27 build

I've attached the .config from my 2.6.32.27 build.
"# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set" is likely relevant, especially in light of the fsync() call in your test.

data=writeback is faster, but can lead to data exposure after a crash, so in Fedora and RHEL we have data=ordered as the default. (Upstream later changed its default back to ordered as well.) data=ordered does, however, have issues with fsyncs getting tangled up with all of the buffered data for the filesystem.

If you have the time, could you re-compare with CONFIG_EXT3_DEFAULTS_TO_ORDERED=y on your upstream kernel?

Thanks,
-Eric
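As a rough illustration of the "fsyncs getting tangled up with buffered data" behavior described above (a sketch only, not from this report; the file names and the 256MB figure are arbitrary): dirty a large unrelated file without syncing it, then time an fsync() of a tiny file. In data=ordered mode the small file's fsync can end up waiting on the unrelated dirty data as well, because the journal commit must push out the ordered data first.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define BIG_MB 256

int main(int argc, char **argv)
{
    static char buf[1024 * 1024];
    struct timeval t0, t1, d;
    int big, small, i;

    memset(buf, 'x', sizeof(buf));

    /* Step 1: create a lot of dirty, unsynced data in an unrelated file. */
    big = open("big_dirty_file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (big < 0) {
        perror("open big_dirty_file");
        return 1;
    }
    for (i = 0; i < BIG_MB; i++) {
        if (write(big, buf, sizeof(buf)) != sizeof(buf)) {
            perror("write big_dirty_file");
            return 1;
        }
    }
    /* Note: no fsync on 'big' - its data stays dirty in the page cache. */

    /* Step 2: append one line to a tiny file and time only its fsync. */
    small = open("small_file", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (small < 0) {
        perror("open small_file");
        return 1;
    }
    if (write(small, "Some Text Data\n", 15) != 15) {
        perror("write small_file");
        return 1;
    }

    gettimeofday(&t0, NULL);
    fsync(small);
    gettimeofday(&t1, NULL);
    timersub(&t1, &t0, &d);

    printf("fsync of the 15-byte file took %ld usec\n",
           d.tv_sec * 1000000 + d.tv_usec);

    close(small);
    close(big);
    return 0;
}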
Created attachment 486701 [details]
Kernel config from 2.6.38 build

Here is the .config from my 2.6.38 kernel, which did have CONFIG_EXT3_DEFAULTS_TO_ORDERED=y. This kernel also showed the faster runtimes.
Yes, the default was changed again by 2.6.38 (it changed back in 2.6.36). That one config option really could make all the difference. See also:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=aa32a796389bedbcf1c7714385b18714a0743810

-Eric
Should I try CONFIG_EXT3_DEFAULTS_TO_ORDERED=y with the 2.6.32.27 kernel to see if the runtime of the test case is slower? My 2.6.38 config had it set to "y", and the runtime of the test case was still much faster than the 2.6.32-71.18.2 RH kernel.
So, summarizing:

- 2.6.32/RHEL6 and 2.6.35/F14 were "slow" runs (RHEL6 was ~40s, F14 unspecified); these both default to ordered mode.
- 2.6.32/vanilla was "fast" (~2s) with writeback mode.
- 2.6.38/vanilla was "fast" (?s) with ordered mode.

2.6.38 does seem to be a bit of an anomaly; not quite sure what to make of that yet, so I guess I'll give it a whirl myself.

FWIW, you could test your vanilla 2.6.32 without rebuilding: just mount the filesystem with -o data=ordered to override the writeback default.
One other question, are you testing on a dedicated filesystem, or on the root fs?
Yes, 2.6.32/RHEL6 & 2.6.35/F14 both took around 40s. 2.6.32 & 2.6.38 vanilla were around 2s. In all cases, tests were done on the root filesystem. I'll try the 2.6.32 test tomorrow with the different mount option.
Ok, for starters, testing on the root filesystem may skew results, because in data=ordered mode the fsyncs in your testcase may end up flushing out system data as well. This is a known, but unfortunate, behavior of the ext3 journal design.

In any case, I was able to do some testing here. I should have gone with Ric's first hunch about the barriers in comment #4, because that is what's making all the difference, not the journal mode.

With:
# mount -o data=writeback,barrier=0 /dev/sdb5 /mnt/test
(the upstream defaults in 2.6.32) I got < 1.5s.

With:
# mount -o data=ordered,barrier=1 /dev/sdb5 /mnt/test
(the RHEL6 defaults - data integrity over speed, not vice versa) I got 55s.

For the full ordered/writeback and barrier/nobarrier matrix, it looks like this:

              barrier    nobarrier
  ordered       55s        1.5s
  writeback     55s        1.5s

With barriers enabled, an fsync() causes a disk cache flush. Without barriers, data is not persistent on disk after fsync(); it is only in the disk's write cache, and will be lost if power is lost. See:

http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/writebarr.html

I also see the fast time on upstream .38 (well, the F15 kernel, close enough), but again that's mounting without barriers. With barriers on, on 2.6.38, I see 56s.

Sorry for the wild goose chase on the journal config; this is entirely a write barrier issue.
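Since the results hinge entirely on the mount options in effect, here is a small sketch (not from this report) that prints what /proc/mounts shows for a given mount point, so a test run can record exactly which data= and barrier settings it ran with. Whether the barrier state appears in /proc/mounts depends on the filesystem and kernel version, and the program name and argument handling below are arbitrary.

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *target = (argc > 1) ? argv[1] : "/";
    char dev[256], mnt[256], type[64], opts[512];
    FILE *fp = fopen("/proc/mounts", "r");

    if (!fp) {
        perror("/proc/mounts");
        return 1;
    }
    /* Each line: device mountpoint fstype options dump pass */
    while (fscanf(fp, "%255s %255s %63s %511s %*d %*d",
                  dev, mnt, type, opts) == 4) {
        if (strcmp(mnt, target) == 0)
            printf("%s on %s type %s (%s)\n", dev, mnt, type, opts);
    }
    fclose(fp);
    return 0;
}

Run it against the mount point under test, e.g. "./showopts /mnt/test" for the filesystem mounted above (the binary name is arbitrary).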
I've retested this with a number of kernel versions w/ different mount options, and confirmed that this is simply a barrier issue. Thanks for investigating.
Ok, thanks for testing!