Description of problem:
When performing some disk I/O stress tests overnight, we receive a kernel EIP and the following error message: "kernel BUG at kernel/exit.c:840!" Both times the process reported is "pdflush".

Version-Release number of selected component (if applicable):
2.6.9-5.ELsmp

How reproducible:
Run a disk I/O stress test (reads, writes, verify files on filesystems) against two disks. Datarel to sds, sdt (major, minor, #blocks, name):

  65  32  142573568  sds
  65  33          1  sds1
  65  37   68364544  sds5
  65  38   68364576  sds6
  65  48  142573568  sdt
  65  49          1  sdt1
  65  53   68364544  sdt5
  65  54   68364576  sdt6

CONTROLLER sd  BUS 0  TARGET 0  DISK s  SLICE 5 ext2-1024  SLICE 6 ext3-4096
CONTROLLER sd  BUS 0  TARGET 0  DISK t  SLICE 5 ext2-1024  SLICE 6 ext3-4096
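The Datarel workload parameters are not given in this report, so the following is only a minimal sketch of a read/write/verify stress loop of the kind described above. The mount points (/mnt/sds5 and so on), file size, and pass count are illustrative assumptions, not values from the original test.

#!/usr/bin/env python
# Minimal read/write/verify stress sketch; mount points, file size, and pass
# count are illustrative assumptions, not the original Datarel parameters.
import hashlib
import os

MOUNTS = ["/mnt/sds5", "/mnt/sds6", "/mnt/sdt5", "/mnt/sdt6"]  # assumed mount points
FILE_MB = 64      # size of each test file in MB (illustrative)
PASSES = 1000     # run long enough to cover an overnight test

def stress_pass(path, seq):
    block = os.urandom(1024 * 1024)
    expected = hashlib.md5(block).hexdigest()
    name = os.path.join(path, "stress_%d.dat" % seq)
    # write phase: stream the block to disk and force the dirty data out
    f = open(name, "wb")
    for _ in range(FILE_MB):
        f.write(block)
    f.flush()
    os.fsync(f.fileno())
    f.close()
    # verify phase: read the file back and compare checksums block by block
    f = open(name, "rb")
    for _ in range(FILE_MB):
        if hashlib.md5(f.read(1024 * 1024)).hexdigest() != expected:
            raise RuntimeError("verify failed on %s" % name)
    f.close()
    os.remove(name)

if __name__ == "__main__":
    for seq in range(PASSES):
        for mount in MOUNTS:
            stress_pass(mount, seq)

Several instances of a loop like this run in parallel against both disks keep pdflush busy with writeback, which is the condition under which the crashes above were reported.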
Created attachment 111809 [details] /var/log/messages with boot and crash
Created attachment 113468 [details] oops from console

We also experience this bug. The machine in question has heavy I/O, and is in production. We use the latest errata kernel (2.6.9-5.0.5.ELsmp). It would be great if a fix for this problem made it into the patch pool for update 1.
Can you try one of the test kernels at http://people.redhat.com/davej/kernels/RHEL4/RPMS.kernel/ please?
Yes, I've put the test kernel into production. As I said, this is a production box, so I can't run any stress testing, but the machine was recently installed and the bug has already bitten us at least once. Probably twice: it crashed a couple of days earlier too, but I didn't have a serial console attached, so I can't be sure. We'll let you know if we get the same oops. Thanks for the quick response.
I've taken over the case from Trond. We replaced the kernel four days ago, and the production server in question has stayed up since then. We recently increased the I/O load (ran backups, etc.), and the server survived this. I would say this looks good, and I will keep you updated on any new developments in the case.
I will install the kernel listed above and get back to you. I noticed issues with my SMB server yesterday after it had been running flawlessly. It turns out the culprit was pdflush. The behavior was slightly different: throughput went to the floor, but CPU usage stayed reasonable.
Well, the transfer rates have gone back up to better levels. Instead of dropping to sub-30 megabit speeds during large writes to the file server, the transfer rate is now hovering around 65 megabits, which is probably the limit of my hardware (though before this surfaced I would average about 70 with spikes to 80).
I have been using the kernel provided earlier and the problem has returned. Transfers to the server that cause writes now go in bursts of between 10 megabits and 65 megabits after running stable for a while. Load does not go through the roof, but writes are in the tank now.
I reverted to kernel 2.6.9-5.0.3.EL and SMB writes are now pegged at 60-80 megabits when sending files to the server.
The load average is still pegging really high even though transfer speeds are good. I am showing 1.75 right now, and ssh is very sluggish.
I believe I am having the same problem, but I am not sure whether it is identical or a new bug. I just upgraded from RHEL3 to RHEL4 and went from kernel 2.4.20-27 to 2.6.9-5.ELsmp. Everything worked fine for several months before the upgrade. I have several Oracle DBs on the system, and when I back up the large DB, which has some really large datafiles (12 GB), the backup seems to hang the system and we have to recycle power. Sometimes we get "out of memory" errors and sometimes we don't. Memory is in good shape on the system; there is always about 6 GB free (minus buffers and cache). I can't verify whether this is coming from pdflush or not, but I did get "kernel: kernel BUG at kernel/exit.c:840!" in /var/log/messages. I also see a very sluggish SSH response. This is a production system too. Is there a supported fix for this bug yet that we could try?
Created attachment 115001 [details] /var/log/messages including kernel BUG message and subsequent reboot

I'm pretty sure that this is the same problem. This has happened to me a couple of times. These crashes did not occur under heavy disk load.
Created attachment 115002 [details] /var/log/messages including kernel BUG message and subsequent reboot

The same thing again but under different conditions.
I just did some more testing on my system this weekend, and here is what I have found:

1) The bug has nothing to do with the USB drive.
2) The bug and the "out of memory" problem I have been experiencing can be controlled by setting swappiness to 0, which favors reclaiming cache over swapping.

Hmm... since everyone thinks this is pdflush-related and is caused by high disk I/O, I wonder if setting swappiness to 0 (the old Linux behavior, by the way) is a good workaround? We haven't been at that setting long, but it did get us through our backups, which we were unable to do previously. Anyone from Red Hat have any thoughts on this?
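For reference, vm.swappiness can be changed at runtime through /proc/sys/vm/swappiness (or with sysctl). A minimal sketch of the workaround described above, assuming it is run as root:

#!/usr/bin/env python
# Minimal sketch of the swappiness workaround described above.
# Writes 0 to the vm.swappiness sysctl via procfs and reads it back.
# Must be run as root; the setting does not persist across reboots
# unless "vm.swappiness = 0" is also added to /etc/sysctl.conf.

SWAPPINESS = "/proc/sys/vm/swappiness"

def set_swappiness(value):
    f = open(SWAPPINESS, "w")
    f.write(str(value))
    f.close()
    f = open(SWAPPINESS)
    current = int(f.read().strip())
    f.close()
    return current

if __name__ == "__main__":
    print("vm.swappiness is now %d" % set_swappiness(0))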
I am fairly certain that I am seeing the same issue as above. Is there a status update on this?
I upgraded my machine to a Celeron 1.1 GHz. The faster CPU has lessened the problem; however, with two people doing SMB transfers the load average still hovers around 1.5. This makes other operations (SSH, for example) sluggish even though the CPU is at most 30% usage. My swappiness is at 20, which I am lowering to zero. I will report back on how this works.
pdflush still causes higher-than-expected load averages when doing large SMB transfers. Here's a top snapshot at the peak of load:

top - 12:05:23 up 4 days, 36 min,  1 user,  load average: 1.87, 0.78, 0.32
Tasks:  40 total,   1 running,  39 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.3% us, 25.1% sy,  0.0% ni,  0.0% id, 54.8% wa,  8.1% hi,  9.7% si
Mem:    256060k total,   255048k used,     1012k free,      724k buffers
Swap:   521632k total,      144k used,   521488k free,   230372k cached

swappiness is at zero.
I am using Samba on a Dell SC240, P4 Celeron 2.54 GHz with 512 MB of RAM. While transferring 8.5 GB of data the load kept going up. The system did not get overly sluggish, as this was only a single user, but the load going up this much is consistent with my other entries:

top - 08:25:05 up 1 day, 10:43,  1 user,  load average: 1.46, 0.86, 0.36
Tasks:  54 total,   1 running,  53 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.0% us,  8.1% sy,  0.0% ni, 67.7% id,  8.1% wa,  5.1% hi,  9.1% si
Mem:    506580k total,   353344k used,   153236k free,     5928k buffers
Swap:  1052248k total,      144k used,  1052104k free,   256888k cached