79257 – VM bug in 2.4.18-18 bigmem kernel

Summary: VM bug in 2.4.18-18 bigmem kernel

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	8.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-12-09 01:48 UTC by Norman Gaywood
Modified:	2008-08-01 16:22 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-09-30 15:40:16 UTC
Embargoed:

Attachments
Diff of /proc/slabinfo on a good and totally useless system (5.72 KB, text/plain) 2002-12-09 01:53 UTC, Norman Gaywood
top output of a system just starting to slowdown (4.58 KB, text/plain) 2002-12-09 02:00 UTC, Norman Gaywood
/proc/meminfo: good, slow and pumping mud (2.18 KB, text/plain) 2002-12-10 01:28 UTC, Norman Gaywood
meminfo and slabinfo from 2.4.18-19.1bigmem cp test (5.01 KB, text/plain) 2002-12-10 11:47 UTC, Norman Gaywood
uptime, df, meminfo, slabinfo log of 2.4.18-19.1bigmem cp test (362.08 KB, text/plain) 2002-12-11 01:59 UTC, Norman Gaywood
Output of df used column showing cp dying (564 bytes, text/plain) 2002-12-11 02:01 UTC, Norman Gaywood
uptime, df, meminfo, slabinfo log of 2.4.20-2bigmem cp test (401.86 KB, text/plain) 2003-01-02 05:16 UTC, Norman Gaywood
Patch to fix inode behavior for bigmem kernel (3.94 KB, patch) 2003-06-19 15:33 UTC, Byron Clark
Fix spec file for 2.4.20-18.9 to use the highmem-inode patch (1.03 KB, patch) 2003-06-19 15:34 UTC, Byron Clark

Description Norman Gaywood 2002-12-09 01:48:31 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4.1) Gecko/20020314
Netscape6/6.2.2

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Start a 2.4.18-18-bigmem kernel on a 4 processor 16GB memory machine
2. Start a cp command on a large number of files occupying several gigabytes, and
wait for cache memory to reach about 15GB
3. Observe the system load go up to 2-3, with kswapd using more processor time.
4. As the copy continues, system becomes so slow it is unusable.
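
Step 2 depends on knowing when the page cache has grown to about 15GB. A minimal sketch of one way to check this from the shell, assuming the "Cached:" line of /proc/meminfo reports kilobytes. The function names cached_kb and cache_reached are made up for illustration and are not part of the original report:

```shell
#!/bin/sh

# cached_kb [FILE]: print the page cache size in kB from a single
# meminfo snapshot (defaults to /proc/meminfo). An exact match on the
# first field avoids picking up the "SwapCached:" line.
cached_kb() {
    awk '$1 == "Cached:" { print $2 }' "${1:-/proc/meminfo}"
}

# cache_reached THRESHOLD_KB [FILE]: succeed once the cache is at
# least THRESHOLD_KB kilobytes.
cache_reached() {
    [ "$(cached_kb "$2")" -ge "$1" ]
}
```

For example, `cache_reached 15728640 && echo "cache is at 15GB"` could be run periodically while the cp is in progress.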
	

Actual Results:  By doing a large copy I can trigger a system slowdown in about
30-40 minutes. At the end of that time, kswapd starts to take a larger share of
CPU and the system load is around 2-3. The system feels sluggish at an
interactive shell, and it takes several seconds before a command like top
starts to display. If I let it go for another 30 minutes the system is unusable:
it can take 10 minutes or more to run simple commands.

The copy never completes. If I abort the copy the system remains slow. 


Expected Results:  No system slowdown.

Additional info:

I have also reported this as a RH service request.

I have also posted this to the linux kernel mailing list. See the thread "Maybe
a VM bug in 2.4.18-18 from RH 8.0?"

The system is a 4 processor PE6600 running RH 8.0 with latest errata. Note that
I have upgraded to kernel 2.4.18-19.7.tg3.120bigmem which I understand to be the
latest RH8 errata kernel + patches to stop the tg3 hanging problem. This came
from http://people.redhat.com/jgarzik/tg3/. I have also tried the latest RH
errata kernel using the bcm5700 driver and it has the same problem.

The system slowdown can be avoided if the machine is placed under enough memory
pressure to keep cache use low. I can supply a program to do this if required. When
this is running the copy completes and there is no system slowdown.

Comment 1 Norman Gaywood 2002-12-09 01:53:33 UTC

Created attachment 87887
Diff of /proc/slabinfo on a good and totally useless system

Comment 2 Norman Gaywood 2002-12-09 02:00:34 UTC

Created attachment 87888
top output of a system just starting to slowdown

Comment 3 Arjan van de Ven 2002-12-09 11:29:33 UTC

Can you get a cat /proc/meminfo from the system in trouble too?
(just to validate the fix we're working on)

Comment 4 Norman Gaywood 2002-12-10 01:28:22 UTC

Created attachment 88124
/proc/meminfo: good, slow and pumping mud

Comment 5 Arjan van de Ven 2002-12-10 08:35:19 UTC

can you try the test kernel at http://people.redhat.com/arjanv/testkernels/
and see if that improves things?

Comment 6 Norman Gaywood 2002-12-10 11:45:37 UTC

I tried:

uname -a
Linux alan.une.edu.au 2.4.18-19.1bigmem #1 SMP Mon Dec 9 10:02:07 EST 2002 i686
i686 i386 GNU/Linux

but it's not much better.

The system seems to be more responsive but then there is a sudden
slowdown. It is not as severe as with the previous kernel. It happens with
about the same amount copied (<8GB and 250,000 inodes) and pretty much
in the same amount of time.

I didn't take the system to destruction as I'm doing this test remotely
and don't have the console.

I will do the test again tomorrow morning until the system is unusable,
just to make sure.

I've attached the meminfo and slabinfo from this run.

Comment 7 Norman Gaywood 2002-12-10 11:47:57 UTC

Created attachment 88167
meminfo and slabinfo from 2.4.18-19.1bigmem cp test

Comment 8 Arjan van de Ven 2002-12-10 11:53:28 UTC

On a first look it looks like a slight improvement; it means that what I did needs
doing more aggressively. I will try to get you a second kernel asap with more tuning.

Comment 9 Norman Gaywood 2002-12-11 01:57:05 UTC

So I verified that the system does indeed die with 2.4.18-19.1bigmem.
I've attached the full log of the test. The log consists of the output of:

#!/bin/sh

while true
do
  uptime
  df /dev/sdi1
  cat /proc/meminfo
  cat /proc/slabinfo
  sleep 60
done

If you plot the used column of the df output you can see the progress
the cp is making. It confirmed my impression that the cp dies off and
does not seem to get much work done. I have attached that column of
numbers. I look at it with:

gnuplot
plot "dfs"

Of course the time axis is distorted by the system slowdown. But that
only favours the cp.
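
The "dfs" file plotted above is the Used column of the df output in the log. One way it could be produced from the attached log is sketched here, assuming each df sample line starts with the device name and carries Used in the third field. The function name extract_used and the file name test.log are illustrative only:

```shell
#!/bin/sh

# extract_used LOG DEVICE: print the df "Used" value (1K blocks) of
# every sample for DEVICE found in LOG, one value per line.
extract_used() {
    awk -v dev="$2" '$1 == dev { print $3 }' "$1"
}
```

Usage would be along the lines of `extract_used test.log /dev/sdi1 > dfs`, after which the gnuplot command above reads the file.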

Comment 10 Norman Gaywood 2002-12-11 01:59:03 UTC

Created attachment 88324
uptime, df, meminfo, slabinfo log of 2.4.18-19.1bigmem cp test

Comment 11 Norman Gaywood 2002-12-11 02:01:13 UTC

Created attachment 88325
Output of df used column showing cp dying

Comment 12 Norman Gaywood 2003-01-02 05:13:14 UTC

Updated my RH 8.0 + updates with:

   kernel-bigmem-2.4.20-2.2.i686.rpm
   modutils-2.4.22-1.i386.rpm
   mkinitrd-3.4.33-1.i386.rpm
   kudzu-0.99.83-1.i386.rpm
   hwdata-0.62-2.noarch.rpm

Things have somewhat improved with the 2.4.20 kernel. It's still not what you
would want however. After copying about 10G, free memory is low, cache is around
15G and the copy slows down. The system feels a little tacky but is still
usable. The good news is that it does not deteriorate past this. 2.4.18 would
die if left for too long after this stage.

The slowdown on the copy is a bit of a worry though. In the first 20 minutes of
the test, around 10Gig was copied. In the next 20 minutes, around 1Gig was copied.

I've attached the logs as above. See the copy die with the gnuplot command:

  plot "< awk '/sdi1/ {print $3}' 2.4.20-2.2bigmem.log"
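
The slowdown can also be read as numbers rather than from the plot. A rough sketch that prints how many 1K blocks were added between consecutive df samples, assuming the same log layout as the awk command above. copy_rate is a hypothetical helper name, and because the samples are only nominally 60 seconds apart on a struggling system, the figures are approximate:

```shell
#!/bin/sh

# copy_rate LOG DEVICE: for each df sample of DEVICE after the first,
# print the growth of the "Used" column (1K blocks) since the
# previous sample.
copy_rate() {
    awk -v dev="$2" '
        $1 == dev {
            if (seen) printf "%d\n", $3 - prev
            prev = $3
            seen = 1
        }' "$1"
}
```

For example: `copy_rate 2.4.20-2.2bigmem.log /dev/sdi1`. A run of small values after a run of large ones corresponds to the point where the copy stalls.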

Comment 13 Norman Gaywood 2003-01-02 05:16:50 UTC

Created attachment 89046
uptime, df, meminfo, slabinfo log of 2.4.20-2bigmem cp test

Comment 14 Dan Norris 2003-05-08 14:26:48 UTC

I've run into the same problem on RH AS 2.1 with kernel 2.4.9-e.16 (and e.3, e.12 as well).
Please update us on the current status of this bug.

Comment 15 Arjan van de Ven 2003-05-08 14:28:39 UTC

Dan Norris: AS2.1 has a totally different VM; please file a separate bug.

Comment 16 Byron Clark 2003-06-17 22:04:52 UTC

We have been able to duplicate this problem on a PowerEdge 6600 running Red Hat
Linux 9 with kernel 2.4.20-18.9bigmem. A thread about this on lkml can be
found at http://www.cs.helsinki.fi/linux/linux-kernel/2002-43/0123.html.

Comment 17 Byron Clark 2003-06-19 15:33:24 UTC

Created attachment 92488
Patch to fix inode behavior for bigmem kernel

This patch fixes the problem for us. To apply it we had to disable one Red Hat
patch. The patched spec file will be the next attachment. This is for
2.4.20-18.9.

Original Source:
http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.21-pre4/10_inode-highmem-2.patch

Comment 18 Byron Clark 2003-06-19 15:34:46 UTC

Created attachment 92489
Fix spec file for 2.4.20-18.9 to use the highmem-inode patch

This applies the highmem-inode patch and disables the Red Hat include inodes
patch.
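
To compare kernels with and without the patch, the inode slab can be pulled out of a log like the ones attached earlier in this bug. This is only a sketch: it assumes the 2.4 /proc/slabinfo layout, where each line begins with the cache name followed by the active and total object counts, and the helper name inode_slab is invented here:

```shell
#!/bin/sh

# inode_slab LOG: for every slabinfo sample in LOG, print the active
# and total object counts of the inode_cache slab.
# Assumed line layout: name active_objs num_objs objsize ...
inode_slab() {
    awk '$1 == "inode_cache" { print $2, $3 }' "$1"
}
```

For example: `inode_slab 2.4.20-2.2bigmem.log`. Counts that keep growing and never shrink while the copy stalls would point at the inode cache.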

Comment 19 Bugzilla owner 2004-09-30 15:40:16 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
