Bug 64107 - Kernel Panic when running sendmail test
Summary: Kernel Panic when running sendmail test
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Public Beta
Classification: Retired
Component: kernel
Version: skipjack-beta2
Hardware: i686
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Stephen Tweedie
QA Contact: Brian Brock
URL:
Whiteboard:
Duplicates: 64549 64678
Depends On:
Blocks:
Reported: 2002-04-25 19:36 UTC by Danny Trinh
Modified: 2005-10-31 22:00 UTC
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2002-05-02 13:57:34 UTC
Embargoed:


Attachments
Serial consol log on Merlot (16.27 KB, text/plain)
2002-04-25 19:37 UTC, Danny Trinh
serial consol log on Jaguar (BIGMEM kernel) (14.92 KB, text/plain)
2002-04-25 19:38 UTC, Danny Trinh
SST test (227.72 KB, application/octet-stream)
2002-04-25 20:50 UTC, Danny Trinh
console trace, ext3 full debug enabled. (13.17 KB, text/plain)
2002-04-29 12:03 UTC, Stephen Tweedie
Trace with additional debugging enabled (11.42 KB, text/plain)
2002-04-30 12:55 UTC, Jay Turner
Another trace with additional debugging (14.61 KB, text/plain)
2002-04-30 12:56 UTC, Jay Turner
And another one with additional debugging (16.28 KB, text/plain)
2002-04-30 12:57 UTC, Jay Turner
This is the kernel log info for the period when we had the problem occur (91.50 KB, text/plain)
2002-06-13 15:46 UTC, Don Kozlowski
2.4.18-5smp BUG at journal.c:408 (3.54 KB, text/plain)
2002-08-16 11:49 UTC, Need Real Name


Links
System: Red Hat Product Errata
ID: RHBA-2002:085
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Kernel panic on SMP systems with ext3 file systems is now fixed.
Last Updated: 2002-05-07 04:00:00 UTC

Description Danny Trinh 2002-04-25 19:36:34 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20011019
Netscape6/6.2

Description of problem:
After setting up the sendmail service (Bigmem/SMP kernel), I ran the SST test.
After about 10 minutes, I got a kernel panic and the system hung.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. HW config: Jaguar/Merlot + Broadcom embedded NICs
2. Install Valhalla on Jaguar or Merlot
3. Boot server up to Bigmem or SMP kernel
4. Run SST server setup.
5. Run SST client setup.
6. Run SST test on client.

Actual Results:  Kernel panic when running SST against Broadcom NICs. This
test works OK with Intel 1000XT NICs.
Also, it seems to work fine when I switch to the UP kernel.

Expected Results:  SST test must work when using SMP/BIGMEM kernel.

Additional info:

Comment 1 Danny Trinh 2002-04-25 19:37:39 UTC
Created attachment 55344 [details]
Serial consol log on Merlot

Comment 2 Danny Trinh 2002-04-25 19:38:31 UTC
Created attachment 55345 [details]
serial consol log on Jaguar (BIGMEM kernel)

Comment 3 Danny Trinh 2002-04-25 20:50:13 UTC
Created attachment 55365 [details]
SST test

Comment 4 Arjan van de Ven 2002-04-25 20:53:20 UTC
Assertion failure in journal_write_metadata_buffer() at journal.c:406:
"buffer_jdirty(jh2bh(jh_in))"

Comment 5 Yil-Kyu Kang 2002-04-26 16:01:36 UTC
Arjan, can you elaborate on your previous statement?



Comment 6 Arjan van de Ven 2002-04-26 16:03:00 UTC
That's the key bit out of both attachments; just repeating it inline here ...

Comment 7 Preston Brown 2002-04-26 17:50:36 UTC
This is probably an FS issue, not a NIC issue.

Comment 8 Danny Trinh 2002-04-26 18:16:03 UTC
* HW config on Jaguar: 8GB memory + embedded Adaptec 7899 + embedded Broadcom NIC
I installed Hampton gold with the ext3 file system.
  - I ran Bonnie++ and cpcmp (copy & compare) using NFS and SMB services ---> OK
  - I ran SST test ---> kernel panic
  - I'm running NTTCP test right now
* HW config on Merlot: 1GB memory + qla2310 (unplugged) + qla2200 (unplugged) +
PERC3/QC (plugged to PV210) + Intel pro 100 + embedded Adaptec 7899 + embedded
Broadcom NIC.
 - First, I installed Hampton gold with the ext3 file system
   - I ran SST test ---> kernel panic
 - This morning, I reinstalled Hampton gold (using the same HW config) with the
ext2 file system.
   - I ran SST test ----> still running (more than 3 hrs)
  


Comment 9 Danny Trinh 2002-04-26 21:55:40 UTC
On Merlot system (same HW config):
 - I installed Hampton beta4, and ran SST test for more than 1 hour without 
kernel panic.


Comment 10 Stephen Tweedie 2002-04-29 12:02:15 UTC
Reproduced on an internal test box with ext3 debugging and buffer tracing enabled
(trace attached).  Booting with mem=512m shows no problems --- this may be a
highmem-specific fault.

Comment 11 Stephen Tweedie 2002-04-29 12:03:24 UTC
Created attachment 55764 [details]
console trace, ext3 full debug enabled.

Comment 12 Stephen Tweedie 2002-04-29 12:05:04 UTC
Update from Jay Turner:

  I rebooted the machine and unloaded the tg3 module
  in favor of the bcm5700 module (which also works with these cards.)  The
  thing still fell over after about a minute.

Definitely doesn't look like a NIC-specific problem.

Comment 13 Jay Turner 2002-04-30 12:55:36 UTC
Created attachment 55864 [details]
Trace with additional debugging enabled

Comment 14 Jay Turner 2002-04-30 12:56:44 UTC
Created attachment 55865 [details]
Another trace with additional debugging

Comment 15 Jay Turner 2002-04-30 12:57:44 UTC
Created attachment 55866 [details]
And another one with additional debugging

Comment 16 Jay Turner 2002-04-30 19:54:58 UTC
OK, the good news is that we have a kernel patch which appears to be working.  I
have been able to get through 1 hour of sst testing against a machine which used
to fall over in 5 minutes or less.  Am going to restart the test for several
hours just to make sure that nothing else odd is happening.

Comment 17 Stephen Tweedie 2002-04-30 20:25:10 UTC
I'm doing regression testing right now.  Will check in the fix in an hour or so
if there are no problems at that point, but we'll keep testing overnight.

Comment 18 Jay Turner 2002-05-01 11:15:03 UTC
SST ran for 12 hours on the machine in question without failure, so we are doing
pretty well in that respect.  Let's get the kernel pushed through the build
system and prepped for an errata.

Comment 19 Stephen Tweedie 2002-05-01 11:23:15 UTC
We're way ahead of you. :)  We did preliminary stress testing on the patch last
night, and then built kernel-2.4.18-3.1 with the fix in place.  I ran the full
Cerberus on it overnight on a 4-way 8GB machine with no problems.  We'll make it
available for wider testing shortly.

Comment 20 Stephen Tweedie 2002-05-01 14:27:36 UTC
The fault we found was an existing fault that has affected all versions of ext3
on 2.4 kernels.

Correct recovery of a journaled filesystem relies on precise control over the
order in which buffers are written to disk.  To prevent early writeback, the
journaling code clears the VM's BH_Dirty bit on dirty buffers, and stores the
dirty state in a private BH_JBDDirty bit instead.  That ensures that the buffer
will not be scheduled for writeback IO before its transaction has committed.
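
For illustration only, the dirty-bit handoff described above can be modelled in
a few lines of C. This is a toy sketch, not the JBD source: the struct, the
flag values and the toy_* helper names are all hypothetical, and only the bit
names BH_Dirty and BH_JBDDirty come from the description.

```c
#include <stdbool.h>

/* Toy stand-ins for the two state bits named in the description.
 * The numeric values are arbitrary and chosen only for this sketch. */
enum { BH_DIRTY = 1u << 0, BH_JBDDIRTY = 1u << 1 };

/* Hypothetical, heavily simplified buffer head: just a flag word. */
struct toy_buffer {
    unsigned state;
};

/* Journal takes over a dirty buffer: the dirty state is moved out of the
 * VM-visible bit and into the journal's private bit. */
void toy_journal_take(struct toy_buffer *b)
{
    if (b->state & BH_DIRTY) {
        b->state &= ~BH_DIRTY;
        b->state |= BH_JBDDIRTY;
    }
}

/* A bdflush-like writeback pass only ever looks at the VM-visible bit. */
bool toy_vm_would_write(const struct toy_buffer *b)
{
    return (b->state & BH_DIRTY) != 0;
}

/* The journal consults its private bit when it commits the transaction. */
bool toy_journal_dirty(const struct toy_buffer *b)
{
    return (b->state & BH_JBDDIRTY) != 0;
}
```

In this model a buffer that has been through toy_journal_take() is still dirty
as far as the journal is concerned, but toy_vm_would_write() reports false, so
a writeback pass would skip it until the journal releases it.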

It turns out that there has always been a race window of about a couple of dozen
instructions where, during the refiling of a buffer between different internal
journaling lists, the buffer dirty flag was temporarily restored to BH_Dirty. 
If the bdflush writeback code sees the buffer during this window, then it can
try to flush the buffer to disk, cleaning it in the process.  The journaling
code rapidly detects that the dirty state is inconsistent, and we get this
assert failure.
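
As a rough sketch of that window (again a toy model with hypothetical names,
not the real refile path), the race can be replayed deterministically by
splitting the refile into two halves and optionally running a writeback pass
between them, as a second CPU might:

```c
#include <stdbool.h>

/* Arbitrary flag values for this sketch; only the bit names are from the text. */
enum { BH_DIRTY = 1u << 0, BH_JBDDIRTY = 1u << 1 };

struct toy_buffer {
    unsigned state;
};

/* First half of the buggy refile: the private dirty state is temporarily
 * put back into the VM-visible bit. */
void buggy_refile_begin(struct toy_buffer *b)
{
    if (b->state & BH_JBDDIRTY) {
        b->state &= ~BH_JBDDIRTY;
        b->state |= BH_DIRTY;
    }
}

/* Second half: the dirty state is hidden from the VM again. */
void buggy_refile_end(struct toy_buffer *b)
{
    if (b->state & BH_DIRTY) {
        b->state &= ~BH_DIRTY;
        b->state |= BH_JBDDIRTY;
    }
}

/* bdflush-like pass: writes any VM-dirty buffer and marks it clean.
 * Returns true if it wrote the buffer. */
bool toy_writeback(struct toy_buffer *b)
{
    if (!(b->state & BH_DIRTY))
        return false;
    b->state &= ~BH_DIRTY;
    return true;
}

/* Replays one refile, with or without writeback landing in the window.
 * Returns whether the journal still sees the buffer as dirty afterwards,
 * which is the condition the failing assertion checks. */
bool jdirty_after_buggy_refile(bool writeback_in_window)
{
    struct toy_buffer b = { BH_JBDDIRTY };

    buggy_refile_begin(&b);
    if (writeback_in_window)
        toy_writeback(&b);   /* the other CPU sneaks in here */
    buggy_refile_end(&b);

    return (b.state & BH_JBDDIRTY) != 0;
}
```

Without the interleaved writeback the buffer ends up journal-dirty as expected;
with it, the dirty state is lost and a check equivalent to
buffer_jdirty(jh2bh(jh_in)) would fail.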

This bug is not new, but it is very timing-sensitive, and there was a locking
change required between beta4 and the final Hampton kernel which affected that
timing.  It appears that as a result of that change, two CPUs can be woken up on
the same buffer at the same time, and can proceed to the race window at the same
speed, exposing the race with high repeatability where the race has previously
been impossible to trigger in practice.  That's purely a timing change: we don't
believe that the Hampton kernel actually contains a regression, as the bug was
always present.

There is no such thing as a risk-free bug fix, but I believe the risk to be low
in this case --- the cure is simply to clear all the dirty state during the
buggy list transition, and restore it on exit.  The buffer remains spinlocked
with respect to the rest of the journaling system during that entire transition,
so the new state will not be visible to any other journaling code.  It *will* be
visible to the bdflush VM core, but that was the whole point of the bug in the
first place --- the new code never leaves the buffer dirty bits in that unsafe
state.
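
A minimal sketch of the shape of that cure, under the same toy model as the
description (hypothetical struct and function names; restoring the state into
the private bit on exit is an assumption of this sketch, not a statement about
the actual patch):

```c
#include <stdbool.h>

/* Arbitrary flag values for this sketch; only the bit names are from the text. */
enum { BH_DIRTY = 1u << 0, BH_JBDDIRTY = 1u << 1 };

struct toy_buffer {
    unsigned state;
};

/* Entry to the list transition: remember whether the buffer was dirty in
 * either sense, then clear ALL dirty state so nothing is VM-visible. */
bool fixed_refile_begin(struct toy_buffer *b)
{
    bool was_dirty = (b->state & (BH_DIRTY | BH_JBDDIRTY)) != 0;
    b->state &= ~(BH_DIRTY | BH_JBDDIRTY);
    return was_dirty;
}

/* Exit from the transition: restore the remembered dirty state.
 * Assumption: it goes back into the journal's private bit. */
void fixed_refile_end(struct toy_buffer *b, bool was_dirty)
{
    if (was_dirty)
        b->state |= BH_JBDDIRTY;
}

/* Same bdflush-like pass as before: only acts on the VM-visible bit. */
bool toy_writeback(struct toy_buffer *b)
{
    if (!(b->state & BH_DIRTY))
        return false;
    b->state &= ~BH_DIRTY;
    return true;
}

/* Replays one refile with the fix. *wrote reports whether the interleaved
 * writeback pass found anything to write. */
bool jdirty_after_fixed_refile(bool writeback_in_window, bool *wrote)
{
    struct toy_buffer b = { BH_JBDDIRTY };
    bool was_dirty;

    *wrote = false;
    was_dirty = fixed_refile_begin(&b);
    if (writeback_in_window)
        *wrote = toy_writeback(&b);   /* sees a clean buffer, does nothing */
    fixed_refile_end(&b, was_dirty);

    return (b.state & BH_JBDDIRTY) != 0;
}
```

In this model the writeback pass can still look at the buffer during the
transition, but it never finds the VM-visible dirty bit set, so the journal's
view of the buffer is the same whether or not the pass runs.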

Comment 21 Danny Trinh 2002-05-01 14:39:28 UTC
Stephen,
kernel-bigmem rpm gives me an error saying:
"error: unpacking of archive failed on file /boot/System.map-2.4.18-3.1bigmem;3ccfe2f7: cpio:read"

Would you rebuild the other?

Thanks,

Comment 22 Danny Trinh 2002-05-02 13:57:29 UTC
This kernel level seems to work ok for me. I can say bug fixed.

Comment 23 Matt Domsch 2002-05-02 14:47:12 UTC
Will the errata kernel also have the __module__bigmem fix to rhconfig.h applied?

Comment 24 Arjan van de Ven 2002-05-02 14:49:15 UTC
of course

Comment 25 Michael K. Johnson 2002-05-09 14:53:52 UTC
*** Bug 64678 has been marked as a duplicate of this bug. ***

Comment 26 Michael K. Johnson 2002-05-16 20:50:40 UTC
*** Bug 64549 has been marked as a duplicate of this bug. ***

Comment 27 Michael K. Johnson 2002-05-16 20:52:17 UTC
See http://rhn.redhat.com/errata/RHBA-2002-085.html for important
information on recovery.

Comment 28 Gordon Messmer 2002-05-19 20:33:15 UTC
I upgraded an SMP box (dual P100) to 2.4.18-4 from 2.4.9-31 Thursday evening. 
After rebooting to the new kernel, the kernel oops'd before it got all the way
to runlevel 3.  In order to recover, we had to move the disk into another
machine to run fsck.  The superblock was lost, and we had to use an alternate
(e2fsck suggested 8139).  All of / was corrupt, most of /bin was unrecoverable,
/lib/modules/* was corrupt, among other things.  It looked like all of the
directories that had been accessed during the boot sequence were corrupt; their
contents ended up in /lost+found.

I worry that this kernel problem is not entirely fixed, but I have no logs or
console output with any more information.

Comment 29 Stephen Tweedie 2002-05-20 20:30:27 UTC
This bug could open a very small window during which a crash might leave writes
present on disk in the wrong order, but that would be nothing which a fsck would
not completely correct.  Anything involving massive data corruption is almost
certainly a different problem.  Do you have a record of the oops message?  If
so, please open another bugzilla bug with as much information as you can.

Note that if you have got a small amount of latent disk corruption for any
reason (eg. bad memory), any large filesystem write of old files such as the
kernel boot images has the ability to corrupt significant chunks of the
filesystem.  That's not the filesystem's fault, but is an inevitable result of
never running fsck if there's background corruption.

Comment 30 Don Kozlowski 2002-06-12 23:36:17 UTC
I've just experienced this bug with version...
Linux version 2.4.18-4bigmem (bhcompile.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.3 2.96-110)) #1 SMP Thu May 2 18:06:05 EDT 2002

The log shows the error... 

Assertion failure in journal_write_metadata_buffer() at journal.c:406: "buffer_jdirty(jh2bh(jh_in))"

followed by a dozen or so lines of traceback information.

The machine in question has dual 1.4GHz PIIIs and 2GB of memory, with an onboard
RAID (aacraid) for mirroring. Could there still be a problem, or have I just
misinstalled something? Any help would be greatly appreciated.


Comment 31 Stephen Tweedie 2002-06-13 08:00:34 UTC
The only other person who has reported this against 2.4.18-4smp so far turned
out to have been accidentally running the 2.4.18-3smp kernel instead.  The full
traceback would be helpful in taking this further.


Comment 32 Don Kozlowski 2002-06-13 13:02:16 UTC
My first concern was that I might be running the wrong version of
the kernel, so I checked /proc/version before rebooting. I don't
know if an fsck had been run after the first time I had the problem.
Could this incident be caused by an already messy filesystem?

I was careful to ensure that an fsck was run this time, and I'm sure
that the 18-4 kernel is in use. If it occurs  again I'd be happy to include
any trace information you need if you could let me know how to generate
or find the information.

Comment 33 Stephen Tweedie 2002-06-13 13:13:13 UTC
You said that there were a dozen lines of traceback in the log: I really would
need that to take this any further.

You talk about "the first time I had the problem": has this happened more than
once?  It's really unclear from your problem report whether or not you checked
the version number immediately after you saw the problem, and just what the
sequence of bug, fsck and reboots was afterwards: if you can be as precise as
possible in your bug reports, that will help enormously.

Comment 34 Don Kozlowski 2002-06-13 15:46:10 UTC
Created attachment 60853 [details]
This is the kernel log info for the period when we had the problem occur

Comment 35 Don Kozlowski 2002-06-13 16:03:55 UTC
I've added the kernel log for all the time since we first had the problem appear. The first occurrence
was on May 14th with the 2.4.18-3bigmem kernel. After a second occurrence on May 16th, I installed
the 2.4.18-4bigmem kernel. If I recall correctly, I did not run fsck at that time.

We just had a failure yesterday, where the system applications stopped working.
The filesystem that fails for us is /var/spool, so the system remains functional, but the applications that
make use of /var/spool all hang.  For this reason I was able to login and check /proc/version before
rebooting the system. I subsequently rebooted with "shutdown -rF now" to have it perform the fsck on reboot.
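
For reference, the /proc/version check mentioned above could be scripted. The
following is a minimal C sketch; the helper names extract_release and
running_release are hypothetical, and the parsing assumes the line starts with
"Linux version " followed by the release, as in the output quoted earlier in
this report.

```c
#include <stdio.h>
#include <string.h>

/* Copies the release token that follows "Linux version " into out.
 * Returns 0 on success, -1 if the line has a different shape or the
 * token does not fit in out. */
int extract_release(const char *version_line, char *out, size_t out_len)
{
    static const char prefix[] = "Linux version ";
    size_t prefix_len = sizeof(prefix) - 1;
    const char *start;
    size_t len;

    if (strncmp(version_line, prefix, prefix_len) != 0)
        return -1;

    start = version_line + prefix_len;
    len = strcspn(start, " \t\n");      /* token ends at first whitespace */
    if (len == 0 || len >= out_len)
        return -1;

    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}

/* Reads the first line of /proc/version and extracts the release of the
 * kernel that is actually running. Returns 0 on success, -1 on failure. */
int running_release(char *out, size_t out_len)
{
    char line[512];
    char *got;
    FILE *f = fopen("/proc/version", "r");

    if (f == NULL)
        return -1;
    got = fgets(line, sizeof line, f);
    fclose(f);

    return got != NULL ? extract_release(line, out, out_len) : -1;
}
```

Calling running_release() before a reboot and recording the result alongside
the oops would show whether the -3 or the -4 kernel was the one that failed.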




Comment 36 Stephen Tweedie 2002-06-13 17:08:21 UTC
May 16 23:03:46 vs2 kernel: XD: Loaded as a module.
May 16 23:03:46 vs2 kernel: Trying to free nonexistent resource <00000320-00000323>
Jun 12 12:02:59 vs2 kernel: loop nfs lockd sunrpc autofs e1000 iptable_filter
ip_tables usb-ohci usbcore e
Jun 12 12:02:59 vs2 kernel: CPU:    1
Jun 12 12:02:59 vs2 kernel: EIP:    0010:[<f8840954>]    Tainted: P 

What module is tainting this kernel?  What's the "XD:" module?

Comment 37 Don Kozlowski 2002-06-13 21:02:58 UTC
The tainting is coming from the e1000.o module. This is the Intel(R) PRO/1000 Network Driver.
I notice that when I load it using insmod, I get a complaint about tainting the kernel as well as
a copyright notice from Intel.

The machine itself is a Dell PowerEdge 1650, and we're using the dual onboard NICs.

If I comment out the "alias eth0 e1000" line in /etc/modules.conf, lsmod shows an untainted
kernel, but then I have no network connectivity :-(

I have no idea where the "XD: Loaded as a module" or the "nonexistent resource" messages are
coming from. I get them as well on another box (Dell OptiPlex GX110) which doesn't use the e1000
driver. I'm not sure where to even begin looking for this.



Comment 38 Need Real Name 2002-08-16 11:47:56 UTC
I'm using kernel 2.4.18-5smp, and getting a very similar problem on dual
Pentium 4 2GHz Xeons (with hyperthreading). I had the problem with
2.4.18-3smp, updated to -5 and ran fsck everywhere, but the problem is still
occurring (on all four machines).

I've had panics on four different machines (all with the same hardware) when
they are doing heavy I/O (postfix receiving ~150k mails/hour).

I will attach a trace...

Comment 39 Need Real Name 2002-08-16 11:49:32 UTC
Created attachment 71064 [details]
2.4.18-5smp BUG at journal.c:408

Comment 40 Stephen Tweedie 2003-01-28 23:49:25 UTC
I found one more, much rarer, situation in which this could still occur in
2.4.18-5.  That case is fixed in all errata since then.

Comment 41 Tim Pepper 2003-03-16 05:27:53 UTC
Are these fixes also in the RHAS tree?  I've seen this panic on
2.4.9-e.10enterprise recently.

Comment 42 Arjan van de Ven 2003-03-16 09:54:48 UTC
2.4.9 didn't have the bug in the first place, so you saw something else; please
file a separate report.

Comment 43 Tim Pepper 2003-04-30 19:47:55 UTC
This fell off my list of things to do.  Just to follow up...the panic I'd seen
on 2.4.9-e.10enterprise was:

  kernel: Assertion failure in journal_write_metadata_buffer() at journal.c:372:
"buffer_jdirty(jh2bh(jh_in))"

It hasn't happened again (now running e.16) and we've had other reasons to think
there may have been some memory corruption going on elsewhere.

