Bug 156437
Summary: Writing large file to 1TB ext3 volume sometimes very slow
Product: Red Hat Enterprise Linux 4
Reporter: Tom Sightler <ttsig>
Component: kernel
Assignee: Stephen Tweedie <sct>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 4.0
CC: alexey, alex, davej, drew.middlesworth, dshaks, jbaron, jon, joshua.bakerlepain, jts, k.georgiou, nobody+wcheng, tao
Hardware: i386
OS: Linux
Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Last Closed: 2006-03-07 18:56:45 UTC
Bug Blocks: 161203, 167231, 168429
Attachments: 113882 (oprofile output, system performing normally), 113883 (oprofile output, system when copy is running slow)
Description
Tom Sightler
2005-04-30 05:04:07 UTC
Created attachment 113882 [details]
Oprofile output from system performing normally
Created attachment 113883 [details]
Oprofile output from system when copy is running slow
As you mention this is a major problem, you would be wise to contact our support team, either by going to http://www.redhat.com/support and logging in, or by calling 800-REDHAT1. Bugzilla is a bug tracking system, not a support system, so it is not suited for urgent/immediate problems.

Right, we understand that. We plan to open an official support ticket this week. We are working to set up an environment where we are sure we can reproduce the problem consistently. Historically we've not had much luck with Red Hat support, especially on issues like this that are difficult to reproduce 100%. Actually, the few times Red Hat has actually provided a fix for our problem it was always via a Bugzilla, while the support request languished untouched in the ticket system. As a matter of fact, I don't think we've ever had a support ticket successfully resolved. Because of this history we usually open the Bugzilla first, because we find that sometimes other users who are having similar problems chime in and post comments that give us clues to our problem right away. We then open a support ticket and reference the Bugzilla entry. I guess, basically, we open the Bugzilla for the community and the support ticket for us. Thanks, Tom

From the oprofile output:

    74044  20.7120  vmlinux  vmlinux  default_idle
    29079   8.1341  ext3.ko  ext3     init_ext3_xattr

That looks seriously suspect:

    int __init init_ext3_xattr(void)
    {
            ext3_xattr_cache = mb_cache_create("ext3_xattr", NULL,
                    sizeof(struct mb_cache_entry) +
                    sizeof(struct mb_cache_entry_index), 1, 6);
            if (!ext3_xattr_cache)
                    return -ENOMEM;
            return 0;
    }

This function is called once, on module __init, and is then thrown away by the kernel (due to the "__init" attribute.) But the fact that you're seeing one CPU pegged at 100% when the performance goes bad indicates that we should indeed be seeing excessive CPU cycles somewhere in the kernel. The question is where they are really being spent.

I'll admit to being an oprofile newbie and I'll admit that looks bogus; perhaps we did something wrong. We couldn't find the actual debuginfo package for the distribution kernel, so we recompiled the kernel source RPM to get new kernel and kernel-debuginfo RPMs. We then installed our custom RPMs (both kernel-smp and kernel-debuginfo) and ran the oprofile. Is there anything that would make oprofile be so wrong? Could it be a mistake on our part? The CPU is pegged at 100% system time, with almost no user or iowait time. We're trying to set up a lab environment that will allow us to reproduce the issue and be more valuable in solving the problem. We happen to have the spare capacity to set up an identical 1TB array and attempt to copy back and forth. I guess at that point we'll find out just how reproducible the issue really is. Thanks, Tom

It could be a mismatch between the running kernel and oprofile's lookup for the symbols: I'd recommend explicitly telling oprofile where to look for the .ko files, for example

    opreport -p /lib/modules/2.6.9-5/kernel/ -l /ext3

If that still looks bogus, we can always add the "-d" option to log every single oprofile hit on ext3, but that generates very verbose output indeed.
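For anyone following along, the full oprofile cycle behind the opreport commands discussed here looks roughly like the sketch below. The vmlinux path is the usual kernel-debuginfo location and is an assumption, not something confirmed in this report:

    # Minimal oprofile session against the running kernel (sketch, not an exact recipe).
    # Assumes the kernel-debuginfo package matching the running kernel is installed.
    opcontrol --init
    opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
    opcontrol --start
    #   ... reproduce the slow copy here ...
    opcontrol --shutdown
    # Resolve module symbols against the running kernel's module tree, ext3 only:
    opreport -l -p /lib/modules/$(uname -r)/kernel /ext3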
The command I actually ran was

    opreport -l -p /lib/modules/2.6.9-5.0.5.EL.rootsmp

which seems pretty close to the same. The -l /ext3 appears to simply limit the results to only the ext3 module. I'm quite confident the modules match the running kernel, as I compiled the RPMs from source myself and have the following installed:

    kernel-smp-2.6.9-5.0.5.EL.root
    kernel-smp-devel-2.6.9-5.0.5.EL.root
    kernel-debuginfo-2.6.9-5.0.5.EL.root

and am running 2.6.9-5.0.5.EL.rootsmp, which is my custom-compiled kernel. One thing that is suspicious: should there be a kernel-smp-debuginfo package? Also, the debuginfo RPM appears to contain a bunch of files like ext3.ko.debug in /usr/lib/debug/lib/modules/<kernelversion>/kernel. If I get rid of these I get totally different output from oprofile. Perhaps you could point me to where I could get the actual kernel debuginfo package for the stock kernel and I'll try again without my "compiled from source" versions. We hope to open a support ticket tomorrow because we think we have this set up in a reproducible way. Thanks, Tom

There should not be a kernel-smp-debuginfo package: the kernel-debuginfo package contains all of the debug objects for both the UP and the SMP builds. If you run with the official errata package, what does "opreport -d /ext3" show? I should be able to see the hot spots from that pretty easily, and match them up to the disassembly of the ext3.ko to work out precisely where we're spending all the time.

Just encountered the same bug. Kernel 2.6.9-5.0.5.EL on a P4 uniprocessor. Have a 250GB Maxtor Maxline II+ disk with a boot partition and the rest of the disk allocated as a single PV on a single VG, with a single LV holding an 'ext3' FS. Created with 'mke2fs -v -N 256000 -j -m 0 /dev/vg01/lv00'. Expanding two large 100GB files via 'rsh remote cat hugefile.gz | gzip -d >hugefile'. 'iostat' shows things humming along nicely at 30MB/second until the 'ext3' FS reaches the 128GB mark (134217728 per 'df -k'). Suddenly the transfer rate drops down to 2MB/second and the 'gzip -d' process starts eating 100% of the CPU. 'top' reports that it is all system time. 'strace -c -p <pid>' confirms that vast amounts of time are being spent in the 'write' system call, about 100 times that of 'read'. A similar 'strace' run before the 128GB mark shows 'read' using more time than 'write'.
A clue to the nature of the problem may lie in disk fragmentation. 'fsck -f'
shows an astonishing fragmentation level of 20%. Wow! I've never seen a disk
that fragmented ever before under any circumstances--much less as the result of
writing two sequential files immediately after 'mkfs' and 'mount'.
And now for the good news: The problem goes away entirely if the FS is mounted
as 'ext2' instead of 'ext3'. The performance wall-crash at 128GB and the
hideous fragmentation disappear, though I can't say I'm impressed by the 5.4%
non-contiguous value reported in this case. Solaris typically reports values
way under 1%.
So I'm using 'ext2' for the big disk while I wait for the fix.
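For reference, the setup described in the comments above boils down to roughly the following sequence. The device name, LV size, and mount point are placeholders, and the 'rsh ... | gzip -d' pipeline simply mirrors the one quoted above:

    # Sketch of the reported reproduction, with placeholder device/VG/LV names.
    pvcreate /dev/sdb2
    vgcreate vg01 /dev/sdb2
    lvcreate -L 230G -n lv00 vg01
    mke2fs -v -N 256000 -j -m 0 /dev/vg01/lv00
    mount /dev/vg01/lv00 /bigfs

    # Expand the compressed data onto the fresh filesystem and watch throughput.
    rsh remote cat hugefile.gz | gzip -d > /bigfs/hugefile &
    iostat 5

    # Compare syscall times before and after the ~128GB mark is crossed.
    strace -c -p $(pgrep -n gzip)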
A potentially important detail for someone trying to reproduce the above scenario is that a custom version of 'gzip' was used. This 'gzip' is a modification of 1.3.3 to write data in 256KB increments (though 'strace' reports 32KB 'write' calls). The normal version of 'gzip' writes in tiny increments, particularly on compression operations, and as a result performs horribly on VxFS under Solaris. I could have re-tested this with regular 'gzip', but I don't care to spend further hours on the problem now that a work-around has been discovered.

I think it is unlikely to be a problem with your modified gzip. We've reproduced the issue with many different tools. Our problems first started when we were using a simple 'cp' to copy several 40GB files. We later tried using rsync, and then eventually a tar pipe. All three methods produce the issue. We also tested on some smaller hardware without LVM and it seems to show the same issue. Actually, we think it is trivially reproducible by simply writing any large file over and over on any SMP system (possibly UP as well, but we haven't been able to make it fail there yet). One thing we found is that sometimes, if we continually cancelled and then restarted the copy, performance would eventually return to normal, at least for that one file. We also worked around our issue by mounting ext2. We got a little busy last week with other issues, but we do have a setup that seems able to reproduce this issue consistently now, so we will probably open a support case Monday. Later, Tom

We are seeing the same problem as well. We have a Java "ftp" server and randomly some threads will get caught by the problem. From strace I can see that writes take a long time to complete (system CPU goes up to 100%) but the disk is idle (from iostat). The partition is 1.7TB and the transfers are all 1GB files at the moment. I tried different elevators but the problem is still there. I'll try to get some oprofile numbers as soon as possible.

One other thing to try --- it's likely that the ext3 reservation code is involved here; does "mount -o noreservation" help?

I will try this today. I can usually reproduce the issue in a matter of hours. It does still occur with 2.6.9-16 from the beta. Some other things that we seem to see in our observations, which may or may not be important:

1. So far I've been unable to reproduce this on a UP system.
2. For some reason it seems easier to reproduce with more CPUs. The system that normally exhibits the problem is a 4-way; however, it took MUCH longer for us to get the problem to show up on a 2-way box, although it did eventually.

OK, then that sounds like a possible SMP livelock issue. I've got a 4-way that's busy doing some other ext3 testing right now, but should be able to get that trying to reproduce this problem shortly.

Please note my entries above. The problem can be easily reproduced on a single-CPU system.

You said P4 uniprocessor. Is that a HT system? Those behave more like dual processors. Of course I'm not saying it's impossible to reproduce on a single-CPU system, but I've probably passed 100TB of data through my system with my simple test case (copying 40GB files around on two 1TB volumes) and so far haven't been able to create a failure with a UP kernel on my box. On the other hand, my 4-way system usually fails somewhere around 4-6 hours into the copies.
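As an aside on the "mount -o noreservation" suggestion above, trying it is roughly the following; the device name and mount point are placeholders, and whether a plain remount is enough to change the option is an assumption (an unmount/mount is the safe route):

    # Remount an existing ext3 filesystem without the in-memory block reservations.
    # Fall back to a full umount/mount if the remount does not take effect.
    mount -o remount,noreservation /dev/vg01/lv00 /bigfs
    grep bigfs /proc/mounts    # sanity check that the filesystem is still mounted

    # Or make it persistent across reboots with an /etc/fstab entry like:
    #   /dev/vg01/lv00  /bigfs  ext3  defaults,noreservation  1 2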
It's a cheap Dell with a single 2.4GHz P4. Perhaps it depends on the exact way the system is stressed (maybe the pattern of CPU/write activity inherent in our custom 'gzip'), but the problem appears consistently at 128GB whenever the files are expanded. Running 'mkfs' beforehand might be relevant. I can provide any system details of interest. Don't have time right now to run further tests though.

    $ cat /proc/cpuinfo
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 15
    model           : 2
    model name      : Intel(R) Pentium(R) 4 CPU 2.40GHz
    stepping        : 7
    cpu MHz         : 2392.563
    cache size      : 512 KB
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 2
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
    bogomips        : 4718.59

My test systems have passed over 5TB of data since yesterday, when I remounted with the 'noreservation' option. Normally I would hit the issue well before this point, usually around the 500-600GB mark. I will continue the test over the weekend, but at this moment things are looking good. What are the negative impacts of this option?

Reservations are a new feature of ext3 in the 2.6 kernel intended to reduce fragmentation: basically it's a way of remembering which bits of the disk already have files growing into them, so that they don't step on each others' space. The reservations are remembered in memory only --- they are not present on disk, so there are no on-disk compatibility issues with having reservations enabled or not. Without reservations, we are just using the old 2.4-style method where each file grows independently without regard for others; so using -o noreservation will have no downside relative to the old 2.4 (RHEL-3) behaviour.

My system survived the weekend after copying over 20TB of data, so the noreservation option sure seems to be a good workaround for the issue. I guess that also implies there is a small problem in the reservation code. Thanks for the suggestion!

I've also been trying to reproduce the problem, but without success so far. I'm going to try reformatting with artificially small block groups; I suspect that there's a scaling problem when the block groups start to fill up and we have to walk many block groups for each IO. That will be exacerbated even on smaller filesystems if we use very small block groups (the default is 128MB, but we can go much lower.) There are a couple of improvements to the reservation code that have gone into upstream kernels. I haven't marked them as critical so far because they didn't seem to be important enough, but it is quite possible that their impact becomes much greater on larger filesystems where per-block-group overhead is very important. Would people be willing to test a kernel with those changes applied?
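To put the per-block-group overhead in perspective, the group count and fill level of a filesystem can be inspected with dumpe2fs from e2fsprogs; the device name below is a placeholder. With the default 128MB groups, a ~1TB ext3 volume has on the order of 8000 block groups to scan:

    # Superblock summary: total blocks, blocks per group, free blocks.
    dumpe2fs -h /dev/vg01/lv00 | egrep 'Block count|Blocks per group|Free blocks'

    # Count the block groups (each "Group N:" line in the full dump is one group).
    dumpe2fs /dev/vg01/lv00 | grep -c '^Group'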
Have no time to test right now. However, I'm willing to make our 'gzip' source available, along with raw data files known to cause the problem. The data is public-domain stock price information: two 15GB .gz data sets that expand to about 80GB each. Each data set consists of about 100 128MB segments that can be downloaded from an Apache httpd server using 'curl' or 'wget'. The problem occurs when a 'rsh RMTSRV cat /???/file.gz_??? | gzip -d >file' is run against a freshly made 'ext3' file system. It is 100% reproducible on my machine. Alternately, any large, mostly-text file over 128GB in size should produce the problem after it has been compressed with 'gzip' and is expanded with our 'gzip' variant. Perhaps the normal 1.3 'gzip' will also reproduce the problem.

I'm not arguing, but really just curious: if it's a scaling problem, wouldn't it happen more consistently? In our case it seems totally random. We have about 50 files that are all about 40GB in size. Just copying these around is enough to trigger the problem, but there seems to be no predicting which one will trigger it (although it seems to happen pretty consistently around 500-600GB, so usually on the 10th-12th file). Sometimes we can just cancel the copy and start over and it will return to normal for the next 3 or 4 files, then suddenly start going slow again. We've formatted the ext3 volumes with various options, including block sizes and the largefile and largefile4 options, without being able to make a significant difference (I understand that's not exactly what you're talking about, but just wanted to point this out as well). I'd be more than happy to test any kernels. I have a system that's pretty much dedicated to testing this issue. Also, would an oprofile output of the problem occurring be any help? I can provide that easily.

Yes, you are arguing. This is my final entry; I am unsubscribing from this bug. The last time I tried, I could reproduce the problem 100% (not 99%, not 98%, not even 97%--a perfect 100%) of the time at the 128GB mark with the approach I have described on an inexpensive single-CPU Dell. I do not have the time or interest to work on it further, but am willing to provide quality information to those who are. Simply read my entries and you know most of what I know. If anyone is interested in any further information from me, contact me directly.

No, seriously, it was a question. It was "how could it be", not "no way, it can't be", and that's also why it was accompanied by a statement of being willing to test ANY patches. I've certainly been unable to fix it and am willing to entertain and test any theories, especially from people much smarter than me, but that doesn't mean I'm not curious about those theories. Once again, I'm willing to test ANY patches on any of my test platforms to see if they solve my issues.

OK, I have finally been able to reproduce this. I think the problem has been that we've been testing with filesystems that have either been too small or, paradoxically, too large. To reproduce, we need a filesystem that has a lot of block groups, but also one which is nearly full; tests on small filesystems did not show the problem because they have few block groups, so CPU time spent scanning block groups does not add up to much, and tests on very large filesystems did not get full enough to trigger the problem. But I can reproduce with a relatively small 15GB filesystem simply by forcing an artificially large number of block groups to be created, by formatting with

    mke2fs -j -O ^resize_inode -N 256 -g 256

Once the filesystem gets above about 96% full for normal writes, I see the same problem of throughput dropping catastrophically and the kernel spending vast amounts of system-mode CPU time in ext3/jbd.
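For anyone wanting to retrace that reproducer, a rough sketch is below. The loop device, image path, and mount point are placeholders, and filling the filesystem with dd is an assumption about how the "normal writes" were generated:

    # Build a small filesystem with an artificially large number of block groups.
    dd if=/dev/zero of=/var/tmp/bgtest.img bs=1M count=15360    # ~15GB scratch image
    losetup /dev/loop0 /var/tmp/bgtest.img
    mke2fs -j -O ^resize_inode -N 256 -g 256 /dev/loop0
    mkdir -p /mnt/bgtest
    mount /dev/loop0 /mnt/bgtest

    # Fill it past ~96% while watching throughput and system CPU time.
    dd if=/dev/zero of=/mnt/bgtest/fill bs=1M &
    vmstat 5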
I have been able to narrow the problem down, and have built a test kernel which solves the problem in my local testing. I have placed x86_64 and x86 kernels with the fix up for testing at http://people.redhat.com/sct/.private/test-kernels/kernel-2.6.9-22.EL.sct.1/ This eliminates the problem entirely in my local test case. It is a merge of a set of fixes from upstream, so has been well tested in that context. The kernels have not been properly QAed and are not supported, but have survived an SMP filesystem stress test for a couple of hours locally. It would be useful to know if they solve the problem in your cases.

Thanks, Stephen. I will set this up on my test system today. Hopefully we should have good results in a couple of days. I'm a little concerned, since in my test cases the filesystems were not what I would consider "near full". In some cases they were less than 40% full. Also, some cases worked even when the drive was 90%+ full. Would there be any reason for apparent randomness, perhaps due to the size of the stored files on the volume or something? Anyway, I will test this diligently over the next few days. Thanks for all your work. Later, Tom

Yes, it depends on many factors, including existing fragmentation on the filesystem, number of files being copied at once, etc. The reproducer I found was on a freshly formatted filesystem with no other activity; you would expect to see different patterns on filesystems with existing data or with multiple writers.

Just an update. The system has survived 48 hours with excellent performance so far. With previous kernels the problem would almost always show up within 8 hours, and never worked for 24 hours. I will continue the tests until Monday afternoon, which will be 96 hours of testing, but so far these kernels are looking good. Will report again on Monday afternoon. Thanks, Tom

OK, the system survived our 96-hour stress test without any hint of slowing down. We averaged a sustained throughput of about 50MB/s on each thread and processed a total of almost 16TB of data, with no single thread showing more than a single-digit percentage difference from any other. Previously we would see some threads that would take 50-100x longer for their writes to complete. I've also attempted to reproduce this issue using a few other test cases on another hardware platform. That platform didn't show the problem as much, but did usually show a slow writer at least once in 24 hours. That platform also survived the weekend. I would have to say that these patches appear to have solved the problem in all of my test cases. So now I've got both an acceptable workaround (the noreservation option) and a future supported fix. I'll assume this will eventually make its way into RHEL4, maybe U3? Thanks so much for your help. Tom

Yes, I'll queue this for U3 (subject to product/engineering review as usual.)

Performance of memory-pressure I/O has improved greatly (in KB/sec). Iozone to a 32GB file on a system with 4GB memory, using 1M transfers:

                KB        reclen  write   rewrite  read   reread
    R4 U2       33554432  1024    43507   39022    46831  46811
    R4 sct      33554432  1024    106093  81218    55216  56591

    x86_64 for both: 2.6.9-20.ELsmp vs. 2.6.9-22.EL.sct.1smp
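The exact iozone invocation is not given in this report; numbers like those above would typically come from something along the lines of the following, where the file size (in KB) and record length match the table and the target path is a placeholder:

    # Sequential write/rewrite and read/reread of a 32GB file with 1MB records.
    iozone -s 33554432 -r 1024 -i 0 -i 1 -f /bigfs/iozone.tmp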
*** Bug 161203 has been marked as a duplicate of this bug. ***

*** Bug 175140 has been marked as a duplicate of this bug. ***

I understand what "reservation" is, per comment #24. Would you mind filling in the "reservation" and "noreservation" sections of Documentation/filesystems/ext3.txt and sending that to LKML too? The man page for mount(8) should also have an explanation of "reservation" and "noreservation". Thanks.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html

Just for the archives, I've seen exactly the same problem on a P3 UP machine doing rsync from my laptop via a 100Mbps network. I just didn't understand what was wrong, because it had worked for years against my good old Red Hat 7.2 server. I then tried scp, rsh/rcp, ftp and whatever else, and it just didn't work as expected; always the same. Thanks to everybody involved in fixing it.