Description of problem:
When performing some disk I/O stress tests overnight, we receive a kernel EIP and the following error message: "kernel BUG at kernel/exit.c:840!" Both times the process reported is "pdflush".

Version-Release number of selected component (if applicable):
2.6.9-5.ELsmp

How reproducible:
Run a disk I/O stress test (reads, writes, verify files on filesystems) against two disks. Datarel to sds, sdt (major, minor, #blocks, name):

  65  32  142573568  sds
  65  33          1  sds1
  65  37   68364544  sds5
  65  38   68364576  sds6
  65  48  142573568  sdt
  65  49          1  sdt1
  65  53   68364544  sdt5
  65  54   68364576  sdt6

CONTROLLER sd  BUS 0  TARGET 0  DISK s  SLICE 5 ext2-1024  SLICE 6 ext3-4096
CONTROLLER sd  BUS 0  TARGET 0  DISK t  SLICE 5 ext2-1024  SLICE 6 ext3-4096
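The Datarel workload parameters are not given in this report, so the following is only a minimal sketch of a read/write/verify stress loop of the kind described above. The mount points (/mnt/sds5 and so on), file size, and pass count are illustrative assumptions, not values from the original test.

#!/usr/bin/env python
# Minimal read/write/verify stress sketch; mount points, file size, and pass
# count are illustrative assumptions, not the original Datarel parameters.
import hashlib
import os

MOUNTS = ["/mnt/sds5", "/mnt/sds6", "/mnt/sdt5", "/mnt/sdt6"]  # assumed mount points
FILE_MB = 64      # size of each test file in MB (illustrative)
PASSES = 1000     # run long enough to cover an overnight test

def stress_pass(path, seq):
    block = os.urandom(1024 * 1024)
    expected = hashlib.md5(block).hexdigest()
    name = os.path.join(path, "stress_%d.dat" % seq)
    # write phase: stream the block to disk and force the dirty data out
    f = open(name, "wb")
    for _ in range(FILE_MB):
        f.write(block)
    f.flush()
    os.fsync(f.fileno())
    f.close()
    # verify phase: read the file back and compare checksums block by block
    f = open(name, "rb")
    for _ in range(FILE_MB):
        if hashlib.md5(f.read(1024 * 1024)).hexdigest() != expected:
            raise RuntimeError("verify failed on %s" % name)
    f.close()
    os.remove(name)

if __name__ == "__main__":
    for seq in range(PASSES):
        for mount in MOUNTS:
            stress_pass(mount, seq)

Several instances of a loop like this run in parallel against both disks keep pdflush busy with writeback, which is the condition under which the crashes above were reported.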
Created attachment 111809 [details] /var/log/messages with boot and crash
Created attachment 113468 [details] oops from console

We also experience this bug. The machine in question has heavy I/O, and is in production. We use the latest errata kernel (2.6.9-5.0.5.ELsmp). It would be great if a fix for this problem made it into the patch pool for update 1.
Can you try one of the test kernels at http://people.redhat.com/davej/kernels/RHEL4/RPMS.kernel/ please?
Yes, I've put the test kernel into production. As I said, this is a production box, so I can't run any stress testing, but the machine was recently installed and the bug has already bitten us at least once. Probably twice: it crashed a couple of days earlier too, but I didn't have a serial console attached, so I can't be sure. We'll let you know if we get the same oops. Thanks for the quick response.
I've taken over the case from Trond. We replaced the kernel four days ago, and the production server in question has stayed up since then. We recently increased the I/O load (ran backups, etc.), and the server survived this. I would say this looks good, and I will keep you updated on any new developments in the case.
I will install the kernel listed above and get back to you. I noticed issues with my SMB server yesterday after it had been running flawlessly. It turns out the culprit was pdflush. The behavior was slightly different: throughput went to the floor, but CPU usage stayed reasonable.
Well, the transfer rates have gone back up to better levels. Instead of dropping to sub-30 megabit speeds during large writes to the file server, the transfer rate is now hovering around 65 megabits, which is probably the limit of my hardware (though before this surfaced I would average about 70 with spikes to 80).
I have been using the kernel provided earlier and the problem has returned. Transfers to the server that cause writes now go in bursts of between 10 megabits and 65 megabits after running stable for a while. Load does not go through the roof, but writes are in the tank now.
I reverted to kernel 2.6.9-5.0.3.EL and SMB writes are now pegged at 60-80 megabits when sending files to the server.
The load average is still pegging really high even though transfer speeds are good. I am showing 1.75 right now, and ssh is very sluggish.
I believe I am having the same problem, but I am not sure whether it is identical or a new bug. I just upgraded from RHEL3 to RHEL4 and went from kernel 2.4.20-27 to 2.6.9-5.ELsmp. Everything worked fine for several months before the upgrade. I have several Oracle DBs on the system, and when I back up the large DB, which has some really large datafiles (12 GB), the backup seems to hang the system and we have to recycle power. Sometimes we get "out of memory" errors and sometimes we don't. Memory is in good shape on the system; there is always about 6 GB free (minus buffers and cache). I can't verify whether this is coming from pdflush or not, but I did get "kernel: kernel BUG at kernel/exit.c:840!" in /var/log/messages. I also see a very sluggish SSH response. This is a production system too. Is there a supported fix for this bug yet that we could try?
Created attachment 115001 [details] /var/log/messages including kernel BUG message and subsequent reboot

I'm pretty sure that this is the same problem. This has happened to me a couple of times. These crashes did not occur under heavy disk load.
Created attachment 115002 [details] /var/log/messages including kernel BUG message and subsequent reboot

The same thing again but under different conditions.
I just did some more testing on my system this weekend, and here is what I have found:

1) The bug has nothing to do with the USB drive.
2) The bug and the "out of memory" problem I have been experiencing can be controlled by setting swappiness to 0, which favors reclaiming cache over swapping.

Hmm... since everyone thinks this is pdflush-related and is caused by high disk I/O, I wonder if setting swappiness to 0 (the old Linux behavior, by the way) is a good workaround? We haven't been at that setting long, but it did get us through our backups, which we were unable to do previously. Anyone from Red Hat have any thoughts on this?
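For reference, vm.swappiness can be changed at runtime through /proc/sys/vm/swappiness (or with sysctl). A minimal sketch of the workaround described above, assuming it is run as root:

#!/usr/bin/env python
# Minimal sketch of the swappiness workaround described above.
# Writes 0 to the vm.swappiness sysctl via procfs and reads it back.
# Must be run as root; the setting does not persist across reboots
# unless "vm.swappiness = 0" is also added to /etc/sysctl.conf.

SWAPPINESS = "/proc/sys/vm/swappiness"

def set_swappiness(value):
    f = open(SWAPPINESS, "w")
    f.write(str(value))
    f.close()
    f = open(SWAPPINESS)
    current = int(f.read().strip())
    f.close()
    return current

if __name__ == "__main__":
    print("vm.swappiness is now %d" % set_swappiness(0))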
I am fairly certain that I am seeing the same issue as above. Is there a status update on this?
I upgraded my machine to a Celeron 1.1 GHz. The faster CPU has lessened the problem; however, with two people doing SMB transfers the load average still hovers around 1.5. This makes other operations (SSH, for example) sluggish even though the CPU is at most 30% usage. My swappiness is at 20, which I am lowering to zero. I will report back on how this works.
pdflush still causes higher-than-expected load averages when doing large SMB transfers. Here's a top snapshot at the peak of load:

top - 12:05:23 up 4 days, 36 min,  1 user,  load average: 1.87, 0.78, 0.32
Tasks:  40 total,   1 running,  39 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.3% us, 25.1% sy,  0.0% ni,  0.0% id, 54.8% wa,  8.1% hi,  9.7% si
Mem:    256060k total,   255048k used,     1012k free,      724k buffers
Swap:   521632k total,      144k used,   521488k free,   230372k cached

swappiness is at zero.
I am using Samba on a Dell SC240, P4 Celeron 2.54 GHz with 512 MB of RAM. While transferring 8.5 GB of data the load kept going up. The system did not get overly sluggish, as this was only a single user, but the load going up this much is consistent with my other entries:

top - 08:25:05 up 1 day, 10:43,  1 user,  load average: 1.46, 0.86, 0.36
Tasks:  54 total,   1 running,  53 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.0% us,  8.1% sy,  0.0% ni, 67.7% id,  8.1% wa,  5.1% hi,  9.1% si
Mem:    506580k total,   353344k used,   153236k free,     5928k buffers
Swap:  1052248k total,      144k used,  1052104k free,   256888k cached