From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4.1) Gecko/20020314 Netscape6/6.2.2

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Start a 2.4.18-18-bigmem kernel on a 4-processor machine with 16GB of memory.
2. Start a cp command on a large number of files occupying several gigabytes, and wait for cache memory to reach about 15GB.
3. Observe the system load go up to 2-3, with kswapd using more processor time.
4. As the copy continues, the system becomes so slow it is unusable.

Actual Results:
By doing a large copy I can trigger a system slowdown in about 30-40 minutes. At the end of that time, kswapd starts to take a larger % of CPU and the system load is around 2-3. The system feels sluggish at an interactive shell, and it takes several seconds for a command like top to start displaying. If I let it go for another 30 minutes the system is unusable, to the point where simple commands can take 10 minutes or more. The copy never completes. If I abort the copy the system remains slow.

Expected Results:
No system slowdown.

Additional info:
I have also reported this as a RH service request. I have also posted this to the linux kernel mailing list. See the thread "Maybe a VM bug in 2.4.18-18 from RH 8.0?"

The system is a 4 processor PE6600 running RH 8.0 with latest errata. Note that I have upgraded to kernel 2.4.18-19.7.tg3.120bigmem, which I understand to be the latest RH8 errata kernel + patches to stop the tg3 hanging problem. This came from http://people.redhat.com/jgarzik/tg3/. I have also tried the latest RH errata kernel using the bcm5700 driver and it has the same problem.

The system slowdown can be avoided if the machine is placed under enough memory pressure to keep cache use low. I can supply a program to do this if required. When this is running the copy completes and there is no system slowdown.
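For anyone trying to reproduce this, a minimal sketch of how the cache growth in step 2 could be watched from a shell. It assumes /proc/meminfo carries "MemTotal:" and "Cached:" lines with values in kB; the function names meminfo_kb and cache_percent are made up for this illustration and are not part of any tool mentioned in this report.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# meminfo_kb FIELD [FILE]: print the value (in kB) of one meminfo field.
# FILE defaults to /proc/meminfo; a saved copy can be passed instead.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2; exit }' "${2:-/proc/meminfo}"
}

# cache_percent [FILE]: page cache as a whole-number percentage of MemTotal.
# Assumes both fields are present in FILE.
cache_percent() {
    total=$(meminfo_kb MemTotal "${1:-/proc/meminfo}")
    cached=$(meminfo_kb Cached "${1:-/proc/meminfo}")
    echo $(( cached * 100 / total ))
}
```

Calling cache_percent once a minute during the copy should show the percentage climbing towards the point where the slowdown starts. It can also be pointed at a saved meminfo file.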
Created attachment 87887 [details] Diff of /proc/slabinfo on a good and totally useless system
Created attachment 87888 [details] top output of a system just starting to slowdown
Can you get a cat /proc/meminfo of the system in trouble too? (Just to validate the fix we're working on.)
Created attachment 88124 [details] /proc/meminfo: good, slow and pumping mud
Can you try the test kernel at http://people.redhat.com/arjanv/testkernels/ and see if that improves things?
I tried:

uname -a
Linux alan.une.edu.au 2.4.18-19.1bigmem #1 SMP Mon Dec 9 10:02:07 EST 2002 i686 i686 i386 GNU/Linux

but it's not much better. The system seems to be more responsive, but then there is a sudden slowdown. It is not as severe as with the previous kernel. It happens with about the same amount copied (<8GB and 250,000 inodes) and in pretty much the same amount of time. I didn't take the system to destruction as I'm doing this test remotely and don't have the console. I will do the test again tomorrow morning until the system is unusable, just to make sure. I've attached the meminfo and slabinfo from this run.
Created attachment 88167 [details] meminfo and slabinfo from 2.4.18-19.1bigmem cp test
On a first look it looks like a slight improvement; it means that what I did needs doing more aggressively. I will try to get you a second kernel asap with more tuning.
So I verified that the system does indeed die with 2.4.18-19.1bigmem. I've attached the full log of the test. The log consists of the output of:

    #!/bin/sh
    while true
    do
        uptime
        df /dev/sdi1
        cat /proc/meminfo
        cat /proc/slabinfo
        sleep 60
    done

If you plot the used column of the df output you can see the progress the cp is making. It confirmed my impression that the cp dies off and does not seem to get much work done. I have attached that column of numbers. I look at it with:

    gnuplot
    plot "dfs"

Of course the time axis is distorted by the system slowdown. But that only favours the cp.
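On the distorted time axis: one way around it, sketched here as an assumption rather than what was actually run for the attached logs, is to write a timestamp at the start of every sample. The function name log_sample, the "TS" line prefix and the log file name cp-test.log are invented for this example. `date +%s` is assumed to be available (it is in GNU and BSD date).

```shell
#!/bin/sh
# Hypothetical variant of the logging loop, for illustration only.

# log_sample DEVICE: print one sample, led by a "TS <epoch seconds>" line
# so that plots can use real elapsed time instead of the sample count.
log_sample() {
    echo "TS $(date +%s)"
    uptime
    df "$1"
    cat /proc/meminfo
    cat /proc/slabinfo
}

# Example loop, left commented out so that running this file only
# defines the function:
# while true; do log_sample /dev/sdi1; sleep 60; done >> cp-test.log
```

With the TS lines in the log, the x value for each df sample can be taken from the preceding timestamp, so a sample that arrives late because the system is crawling is plotted late.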
Created attachment 88324 [details] uptime, df, meminfo, slabinfo log of 2.4.18-19.1bigmem cp test
Created attachment 88325 [details] Output of df used column showing cp dying
Updated my RH 8.0 + updates with:

    kernel-bigmem-2.4.20-2.2.i686.rpm
    modutils-2.4.22-1.i386.rpm
    mkinitrd-3.4.33-1.i386.rpm
    kudzu-0.99.83-1.i386.rpm
    hwdata-0.62-2.noarch.rpm

Things have somewhat improved with the 2.4.20 kernel. It's still not what you would want, however. After copying about 10G, free memory is low, cache is around 15G and the copy slows down. The system feels a little tacky but is still usable. The good news is that it does not deteriorate past this. 2.4.18 would die if left for too long after this stage.

The slowdown on the copy is a bit of a worry though. In the first 20 minutes of the test, around 10Gig was copied. In the next 20 minutes, around 1Gig was copied. I've attached the logs as above. See the copy die with the gnuplot command:

    plot "< awk '/sdi1/ {print $3}' 2.4.20-2.2bigmem.log"
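To put numbers on the slowdown rather than read it off the plot, the same df column can be turned into per-sample deltas. This is a sketch only: it assumes the log contains plain df data lines whose first field is the device name and whose third field is the used block count, as in the awk command above. The function name used_deltas is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.

# used_deltas LOG DEVICE: for each df sample of DEVICE after the first,
# print how many blocks were added since the previous sample.
used_deltas() {
    awk -v dev="$2" '
        $1 == dev {                    # df data line for the device
            if (seen) print $3 - prev  # growth since the previous sample
            prev = $3
            seen = 1
        }' "$1"
}

# Example (not run here): used_deltas 2.4.20-2.2bigmem.log /dev/sdi1
```

A healthy copy should give roughly constant deltas at a fixed sample interval; a copy that is dying shows the deltas falling towards zero.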
Created attachment 89046 [details] uptime, df, meminfo, slabinfo log of 2.4.20-2bigmem cp test
I've run into the same problem on RH AS 2.1 2.4.9-e.16 (and e.3, e.12 as well). Please update us on the current status of this bug.
Dan Norris: AS2.1 has a totally different VM; please file a separate bug.
We have been able to duplicate this problem on a PowerEdge 6600 running Red Hat Linux 9. We are using 2.4.20-18.9bigmem. A thread about this on lkml can be found at http://www.cs.helsinki.fi/linux/linux-kernel/2002-43/0123.html.
Created attachment 92488 [details] Patch to fix inode behavior for bigmem kernel

This patch fixes the problem for us. To apply it we had to disable one Red Hat patch. The patched spec file will be the next attachment. This is for 2.4.20-18.9.

Original Source: http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/10_inode-highmem-2.patch
Created attachment 92489 [details] Fix spec file for 2.4.20-18.9 to use the highmem-inode patch

This applies the highmem-inode patch and disables the Red Hat include inodes patch.
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/