Red Hat Bugzilla – Bug 156437
Writing large file to 1TB ext3 volume sometimes very slow
Last modified: 2010-10-21 22:57:16 EDT
From Bugzilla Helper:
User-Agent: Opera/7.54 (X11; Linux i686; U) [en]
Description of problem:
We have a Dell 6450, 4 - PIII Xeon 700Mhz, 8GB RAM, Qlogic 2340 fiber channel
adapter. This system has been connected to EMC Clariion array for some time
mounting abount 15 LUNS of various sizes, two that are approximately 1TB each.
Recently we added an Apple Xserve RAID to the SAN and mapped a total of 6 LUNS
from this device (~365GB each) to this same system. We used lvm to create a
linear stripe set of three volumes into a VG and then create a LV of the entire
space giving an ext3 filesystem of approximately 1.1TB.
We then began copying the data from the old EMC 1TB volumes to the new 1.1TB LVM
volume. Initial performance seemed great, but approximately 15% into the copy
every thing slowed way down.
The files being consist of only about 24 40GB files for a total of just about
960GB. If we start the copy again there is about a 1 in 4 chance that the copy
will return back to normal speed. We have tried using various methods to perform
the copy, from normal cp, to rsync, to a tar pipe but the problem is consistent.
When rsync is being used the system can copy about 40-50MB/s, cp is typically much
faster, however, when the copy is slow we see only about 250KB/s. Other
interesting factors, when rsync is running quickly, the system shows about 25% CPU
utilization with a single CPU doing most of the work, and about 17% going to
rsync, a few % to IO wait, and a few to system, however, once the system slows
down the profile changes with <1% going to rsync, <1% to IO wait, and 25%
(basically an entire CPU) going to system.
We initially suspected this problem might be an LVM2/DM issue so we created the
volumes as older style MD linear devices (/dev/md0) and also tried LVM2 striping
and MD RAID 0 with the same basic result. We also tried the beta kernel 2.6.9-6.
37 and 2.6.9-6.40 as well as the hugemem kernel. We also tried the Qlogic 8.00.
03 from their website. None of this changed the symptom at all.
We compiled our own kernel from the src RPM package and installed the debuginfo
package. When the problem occurs the oprofile output changes significantly. I
have attached the oprofile output from a good run and a bad one. Basically, when
the copy is progressing normally most of the time is spent in rsync and in copy_to
and copy_from user. When the performance slows the profile shows large amounts of
time in init_ext3_attr and bh_lru_install and other routines mostly in ext3 and
We attempted to recreate this problem using benchmark tools like bonnie and iozone
but with those it seems we can hammer the array for hours without seeing any bad
performance, however, simply copying these very large files around we can usually
reproduce the problem in only a few minutes, although occasionally it will work
for hours before the problems shows itself.
We don't seem to be able to reproduce this problem with RHEL3. Also, we
eventually got the system to copy all of the files by simply mounting the exact
same volumes as ext2. So far we have not been able to reproduce the problem
mounted ext2 rather than ext3.
Please let us know any other information that we can provide as this is a major
problem for us.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create 1TB linear or striped volume of three disks
2. Format and mount ext3
3. Copy/rsync large (40GB +) files around.
Actual Results: Some copies perform well, others run at a VERY slow rate.
Expected Results: All copies should occur at roughtly the same speed.
Created attachment 113882 [details]
Oprofile output from system performing normally
Created attachment 113883 [details]
Oprofile output from system when copy is running slow
As you mention this is a major problem, you would be wise to contact our support
team, either by going to http://www.redhat.com/support and logging in, or by
Bugzilla is a bug tracking system, and not a support system, so is not suited
for urgent/immediate problems.
Right, we understand that. We plan to open an official support ticket this week.
We are working to setup an environment where we are sure we can reproduce the
Historically we've not had much luck with Redhat support. Especially on issues
like this that are difficult to reproduce 100%. Actually, the few times we've had
Redhat actually provide a fix to our problem it was always via a Bugzilla, while
the support request languished untouched in the ticket system. As a matter of
fact, I don't think we've ever had a support ticket succesfully resovled.
Because of this history we usually open the Bugzilla first because we find that
sometimes other users that are having similar problems chime in and post comments
that give us clues to our problem right away. We then open a support ticket and
reference the Bugzilla entry.
I guess basically, we open the Bugzilla for the community, we open the support
ticket for us.
From the oprofile output:
74044 20.7120 vmlinux vmlinux default_idle
29079 8.1341 ext3.ko ext3 init_ext3_xattr
That looks seriously suspect:
ext3_xattr_cache = mb_cache_create("ext3_xattr", NULL,
sizeof(struct mb_cache_entry) +
sizeof(struct mb_cache_entry_index), 1, 6);
This function is called once, on module __init, and is then thrown away by the
kernel (due to the "__init" attribute.) But the fact that you're seeing one CPU
pegged at 100% when the performance goes bad indicates that we should indeed be
seeing excessive CPU cycles somewhere in the kernel. The question is where they
are really being spent.
I'll admidt to being an oprofile newbie and I'll admidt that looks bogus,
perhaps we did something wrong. We couldn't find the actual debuginfo package
for the distribution kernel so we recompiled the kernel source RPM to get a new
kernel and kernel-debuginfo RPM. We then installed our custom RPM's (both
kernel-smp and kernel-debuginfo) and ran the oprofile.
Is there anything that would make oprofile be so wrong? Could it be a mistake
on our part?
The CPU is pegged at 100% system time, almost no user or io wait time.
We're trying to setup a lab environment so that will allow us to reproduce the
issue and be more valuable in solving the problem. We happen to have the spare
capacity to setup an identical 1TB array and attempt to copy back and forth. I
guess at that point we'll find out just how reproducible the issue really is.
It could be a mis-match between the running kernel and oprofile's lookup for the
symbols: I'd recommend explicitly telling oprofile where to look for the .ko
files, for example
opreport -p /lib/modules/2.6.9-5/kernel/ -l /ext3
If that still looks bogus, we can always add the "-d" option to log every single
oprofile hit on ext3, but that generates very verbose output indeed.
The command I actually ran was
opreport -l -p /lib/modules/2.6.9-5.0.5.EL.rootsmp
which seems pretty close to the same. The -l /ext3 appears to simply limit the
results to only the ext3 module. I'm quite confident the modules match the
running kernel as I compiled the RPM's from source myself and have the following
And am running 2.6.9-5.0.5.EL.rootsmp which is my custom compiled kernel. One
thing that is suspicious, should there be a kernel-smp-debuginfo package? Also,
the debuginfo RPM appears to contain a bunch of files like ext3.ko.debug in /usr/
lib/debug/lib/modules/<kernelversion>/kernel. If I get rid of these I get totally
different output from oprofile.
Perhaps you could point me to where I could get the actualy kernel debuginfo
package for the stock kernel and I'll try again without my "compiled from source"
We hope to open a support ticket tomorrow because we think we have this setup in a
There should not be a kernel-smp-debuginfo package: the kernel-debuginfo package
contains all of the debug objects for both the UP and the SMP builds.
If you run with the official errata package, what does "opreport -d /ext3" show?
I should be able to see the hot spots from that pretty easily, and match them
up to the disassembly of the ext3.ko to work out precisely where we're spending
all the time.
Just encountered the same bug. Kernel 2.6.9-5.0.5.EL on a P4 uniprocesor.
Have a 250GB Maxtor Maxline II+ disk with a boot partition and the rest of the
disk allocated as a single PV on a single VG and with a single LV holding an
'ext3' FS. Created with 'mke2fs -v -N 256000 -j -m 0 /dev/vg01/lv00'.
Expanding large two 100GB files via a 'rsh remote cat hugefile.gz | gzip -d
>hugefile'. 'iostat' shows things humming along nicely at 30MB/second until the
'ext3' FS reaches the 128GB mark (134217728 per 'df -k'). Suddenly the transfer
rate drops down to 2MB/second and the 'gzip -d' process starts eating 100% of
the CPU. 'top' reports that it is all system time. 'strace -p <pid> --c'
confirms that vast amounts of time are being spent in the 'write' system call,
about 100 times that of 'read'. A similar 'strace' run before the 128GB mark
shows 'read' using more time than 'write'.
A clue to the nature of the problem may lie in disk fragmentation. 'fsck -f'
shows an astonishing fragmentation level of 20%. Wow! I've never seen a disk
that fragmented ever before under any circumstances--much less as the result of
writing two sequential files immediately after 'mkfs' and 'mount'.
And now for the good news: The problem goes away entirely if the FS is mounted
as 'ext2' instead of 'ext3'. The performance wall-crash at 128GB and the
hideous fragmentation disappear, though I can't say I'm impressed by the 5.4%
non-contiguous value reported in this case. Solaris typically reports values
way under 1%.
So I'm using 'ext2' for the big disk while I wait for the fix.
A potentially important detail for someone trying to reproduce the above
scenario is that a custom version of 'gzip' was used. This 'gzip' is a
modification to 1.3.3 to write data in 256KB increments (though 'strace' reports
32KB 'write' calls). The normal version of 'gzip' writes in tiny increments,
particularly on compression operations, and as a result performs horribly on
VxFS under Solaris. I could have re-tested this with regular 'gzip', but I
don't care to spend further hours on the problem now that a work-around has been
I think it is unlikely to be a problem with your modified gzip. We've reproduced
the issue with many different tools. Our problems first started when we were
using a simple 'cp' to copy several 40GB files. We later tried using rsync, and
then eventualy a tar pipe. All three methods produce the issue.
We also tested on some smaller hardware without LVM and seem to show the same
issue. Actually, we think it is trivially reproducible by simply writing any
large file over and over on any SMP system (possible UP as well but we haven't
been able to make it fail there yet).
One thing we found is that sometimes if we continually cancelled and then
restarted the copy eventually performance would return to normal, at least for
that one file.
We also worked around our issue by mounting ext2.
We got a little busy last week with other issues but we do have a setup that
seems to be able to reproduce this issue consistently now so we will probably open
a support case Monday.
We are seeing the same problem as well. We have a java "ftp" server and randomly
some threads will get caught by the problem from strace i can see that writes take
a long time to complete (sys cpu goes up to %100) but the disk is idle (from
istat). The partition is 1.7TB and the transfers are all 1GB files at the
moment. I try different elevators but the problem is still there. I'll try to
get some oprofile numbers as soon as possible.
One other thing to try --- it's likely that the ext3 reservation code is
involved here; does "mount -o noreservation" help?
I will try this today. I can usually reproduce the issue in a matter of hours.
It does still occur with 2.6.9-16 from the beta.
Some other things that we seem to see in our observations, which may or may not
1. So far I've been unable to reproduce this on a UP system.
2. For some reason it seems easier to reproduce with more CPU's. The system
that normally exhibits the problem is a 4-way, however, it took MUCH longer for
us to get the problem to show up on a 2-way box although it did eventually.
OK, then that sounds like a possible SMP livelock issue. I've got a 4-way
that's busy doing some other ext3 testing right now, but should be able to get
that trying to reproduce this problem shortly.
Please note my entries above. The problem can be easily
reproduced on a single CPU system.
You said P4 uniprocessor. Is that a HT system? Those behave more like dual
processors. Of course I'm not saying it's impossible to reproduce on a single
CPU system, but I've probably passed 100TB of data through my system with my
simple test case (copying 40GB files around on two 1TB volumes) and so far
haven't been able to create a failure with a UP kernel on my box.
On the other hand my 4-way system usually fails somewhere around 4-6 hours into
It's a cheap Dell with a single 2.4GHz P4. Perhaps it depends
on the exact way the system is stressed (maybe the pattern of
CPU/write activity inherent in our custom 'gzip'), but the
problem appears consistently at 128GB whenever the files are
expanded. Running 'mkfs' beforehand might be relevant.
I can provide any system details of interest. Don't have time
right now to run further tests though.
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
stepping : 7
cpu MHz : 2392.563
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 4718.59
My tests systems have passed over 5TB of data since yesterday when I remounted
with the 'noreservation' option. Normally I would hit the issue well before this
point, usually around 500-600GB mark. I will continue the test over the weekend,
but at this moment things are looking good.
What are the negative impacts of this option?
Reservations are a new feature of ext3 in the 2.6 kernel intended to reduce
fragmentation: basically it's a way of remembering which bits of the disk
already have files growing into them, so that they don't step on each others' space.
The reservations are remembered in memory only --- they are not present on disk,
so there are no on-disk compatibility issues with having reservations enabled or
not. Without reservations, we are just using the old 2.4-style method where
each file is growing independently without regard for others; so using -o
noreservation will have no downside relative to the old 2.4 (RHEL-3) behaviour.
My system survived the weekend after copying over 20TB of data so the
noreservation option sure seems to be a good workaround for the issue. I guess
that also implies there is a small problem in the reservation code.
Thanks for the suggestion!
I've also been trying to reproduce the problem, but without success so far. I'm
going to try reformatting with artifically-small block groups; I suspect that
there's a scaling problem when the block groups start to fill up and we have to
walk many block groups for each IO. That will be exacerbated even on smaller
filesystems if we use very small block groups (the default is 128MB, but we can
go much lower.)
There are a couple of improvements to the reservation code that have gone into
upstream kernels. I haven't marked them as critical so far because they didn't
seem to be important enough, but it is quite possible that their impact becomes
much greater on larger filesystems where per-block-group overhead is very
important. Would people be willing to test a kernel with those changes applied?
Have no time to test right now. However I'm willing to make our
'gzip' source available and make available raw data files known
to cause the problem. The data is public-domain stock price
information, two 15GB .gz data sets that expand to about 80GB each.
Each data set consists of about 100 128MB segments that can be
downloaded from an Apache httpd server using 'curl' or 'wget'.
The problem occurs when a 'rsh RMTSRV cat /???/file.gz_??? | gzip -d >file'
is run from a freshly made 'ext3' file system. It is 100%
reproducible on my machine. Alternately any large mostly text
file over 128GB in size should produce the problem after it's
been compresses with 'gzip' and is expanded with our 'gzip'
variant. Perhaps the normal 1.3 'gzip' will also reproduce
I'm not arguing, but really just curious, if it's a scaling problem, wouldn't it
happen more consistently? In our case it seems totally random, we have about 50
files that are all about 40GB in size. Just copying these around is enough to
trigger the problem, but there seems to be no predicting which one will trigger
it (although it seems to happen pretty consistently around 500-600GB, so usually
on the 10-12th file). Sometimes we can just cancel the copy and start over and
it will just return to normal for the next 3 or 4 files and then suddenly start
going slow again.
We've formatted the ext3 volumes with various options, including block sizes and
the largefile and largefile4 options without being able to make significant
differences (I understand that's not exactly what you're talking about, but just
wanted to point this out as well).
I'd be more that happy to test any kernels. I have a system that's pretty much
dedicated to testing this issue.
Also, would an oprofile output of the problem occurring be any help? I can
provide that easily.
Yes, you are arguing.
This is my final entry, I am unsubscribing from this bug.
The last time I tried, I could reproduce the problem 100% (not
99%, not 98%, not even 97%--a perfect 100%) of the time at the
128GB mark with the approach I have described on an inexpensive
I do not have the time or interest to work on it further, but am
willing to provide quality information to those who are. Simply
read my entries and you know most of what I know.
If anyone is interested in any further information from me,
contact me directly.
No seriously, it was a question. It was "how could it be" not "no way, it can't
be", and that's also why it was accompanied by a statement of being willing to
test ANY patches. I've certainly been unable to fix it and am willing to
entertain and test any theories, especially from people much smarter than me, but
that doesn't mean I'm not curious about those theories.
Once again, I'm willing to test ANY patches on any of my test platforms to see if
they solve my issues.
OK, I have finally been able to reproduce this. I think the problem has been
that we've been testing with filesystems that have either been too small, or,
paradoxically, too large. To reproduce, we need a filesystem that has a lot of
block groups, but also one which is nearly full; tests on small filesystems did
not show the problem because they have few block groups, and so CPU time spent
scanning block groups does not add up to much; and tests on very large
filesystems did not get full enough to trigger the problem.
But I can reproduce with a relatively small 15GB filesystem simply by forcing an
artifically large number of block groups to be created, by formatting with
mke2fs -j -O^resize_inode -N 256 -g 256
Once the filesystem gets above about 96% full for normal writes, I see the same
problem of throughput dropping catastrophically and the kernel spending vast
amounts of system mode CPU time in ext3/jbd.
I have been able to narrow the problem down, and have built a test kernel which
solves the problem in my local testing. I have placed x86_64 and x86 kernels
with the fix up for testing at
This eliminates the problem entirely in my local test case. It is a merge of a
set of fixes from upstream, so has been well tested in that context. The
kernels have not been properly QAed and are not supported, but have survived an
SMP filesystem stress test for a couple of hours locally. It would be useful to
know if they solve the problem in your cases.
I will set this up on my test system today. Hopefully we should have good
results in a couple of days.
I'm a little concerned since in my test cases the filesystems were not what I
would consider "near full". In some cases they were less than 40% full. Also,
some cases working even when the drive was 90%+ full. Would there be any reason
for aparent randomness, perhaps due to the size of the stored files on the
volume or something?
Anyway, I will test this diligently over the next few days. Thanks for all your
Yes, it depends on many factors, including existing fragmentation on the
filesystem, number of files being copied at once, etc. The reproducer I found
was on a freshly formatted filesystem with no other activity; you would expect
to see different patterns on filesystems with existing data or with multiple
Just an update.
The system has survived 48 hours with excellent performance so far. With previous
kernels the problem would almost always show up in 8 hours, and never worked for
24 hours. I will continue the tests until Monday afternoon which will be 96 hours
of testing but so far these kernels are looking good.
Will report again on Monday afternoon.
OK, the system survived our 96 hours stress test without any hint of slowing
down. We averaged a sustained throughput of about 50MB/s on each thread and
processed a total of almost 16TB of data with no single thread showing more than
a single digit percentage difference from any other. Previously we would see
some threads that would takes 50-100x longer for their writes to complete.
I've also attempted to reproduce this issue using a few other test cases on
another hardware platform. That platform didn't show the problem as much, but
did usually show a slow writer at least once in 24 hours. That platform also
survived the weekend.
I would have to say that these patches appear to have solved the problem in all
of my test cases.
So now I've got both an acceptable workaround (noreservation option) and a
future supported fix. I'll assume this will eventually make it's way into
RHEL4, maybe U3?
Thanks so much for your help.
Yes, I'll queue this for U3 (subject to product/engineering review as usual.)
Performance of memory pressure I/O has improved greatly (in KB/sec)
Iozone to 32GB file on a system with 4 GB memory, using 1M transfers;
KB reclen write rewrite read reread
R4 U2 33554432 1024 43507 39022 46831 46811
R4 sct 33554432 1024 106093 81218 55216 56591
x86_64 for both
*** Bug 161203 has been marked as a duplicate of this bug. ***
*** Bug 175140 has been marked as a duplicate of this bug. ***
I understand what is "reservation" according to comment #24.
Would you mind to fill up "reservation" and "noreservation" section of
Documentation/filesystems/ext3.txt ? Then send it to LKML,too.
And man for mount(8) also should have an explanation about "reservation"
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
Just for the archives, I've seen exactly the same problem on a P3 UP machine
doing rsync from my laptop via 100mbps network. I just didn't understand what's
wrong because it has worked for years against my good old RedHat 7.2 server. I
then tried with scp, rsh/rcp, ftp and whatever and it just didn't work as
expected, always the same. Thanks to everybody involved in fixing it.