Red Hat Bugzilla – Bug 64107
Kernel Panic when running sendmail test
Last modified: 2005-10-31 17:00:50 EST
Description of problem:
After setting up the sendmail service (Bigmem/SMP kernel), I ran the SST test. After
about 10 minutes, I got a kernel panic and the system hung.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. HW config: Jaguar/Merlot + Broadcom embedded NICs
2. Install Valhalla on Jaguar or Merlot
3. Boot server up to Bigmem or SMP kernel
4. Run SST server setup.
5. Run SST client setup.
6. Run SST test on client.
Actual Results: Kernel panic when running SST against Broadcom NICs. This
test works OK with Intel 1000XT NICs.
Also, it seems to work well when I switch to the UP kernel.
Expected Results: SST test must work when using SMP/BIGMEM kernel.
Created attachment 55344 [details]
Serial console log on Merlot
Created attachment 55345 [details]
Serial console log on Jaguar (BIGMEM kernel)
Created attachment 55365 [details]
Assertion failure in journal_write_metadata_buffer() at journal.c:406:
Arjan, can you elaborate on your previous statement?
That's the key bit out of both attachments; just repeating it inline here ...
This is probably a FS issue not a NIC issue
* HW config on Jaguar: 8GB memory + embedded Adaptec 7899 + embedded Broadcom NIC
I installed Hampton gold with the ext3 file system.
- I ran Bonnie ++, cpcmp (copy & compare) using NFS, SMB services ---> OK
- I ran SST test ---> kernel panic
- I'm running NTTCP test right now
* HW config on Merlot: 1GB memory + qla2310 (unplugged) + qla2200 (unplugged) +
PERC3/QC (plugged to PV210) + Intel pro 100 + embedded Adaptec 7899 + embedded
- First, I installed Hampton gold with the ext3 file system
- I ran SST test ---> kernel panic
- This morning, I reinstalled Hampton gold (using the same HW config) with the
ext2 file system.
- I ran SST test ----> still running (more than 3 hrs)
On Merlot system (same HW config):
- I installed Hampton beta4, and ran SST test for more than 1 hour without
Reproduced on an internal test box with ext3 debugging and buffer tracing enabled
(trace attached). Booting with mem=512m shows no problems --- this may be a
Created attachment 55764 [details]
console trace, ext3 full debug enabled.
Update from Jay Turner:
I rebooted the machine and unloaded the tg3 module
in favor of the bcm5700 module (which also works with these cards.) The
thing still fell over after about a minute.
Definitely doesn't look like a NIC-specific problem.
Created attachment 55864 [details]
Trace with additional debugging enabled
Created attachment 55865 [details]
Another trace with additional debugging
Created attachment 55866 [details]
And another one with additional debugging
OK, the good news is that we have a kernel patch which appears to be working. I
have been able to get through 1 hour of sst testing against a machine which used
to fall over in 5 minutes or less. Am going to restart the test for several
hours just to make sure that nothing else odd is happening.
I'm doing regression testing right now. Will check in the fix in an hour or so
if there are no problems at that point, but we'll keep testing overnight.
SST ran for 12 hours on the machine in question without failure, so we are doing
pretty well in that respect. Let's get the kernel pushed through the build
system and prepped for an errata.
We're way ahead of you. :) We did preliminary stress testing on the patch last
night, and then built kernel-2.4.18-3.1 with the fix in place. I ran the full
Cerberus on it overnight on a 4-way 8GB machine with no problems. We'll make it
available for wider testing shortly.
The fault we found was an existing fault that has affected all versions of ext3
on 2.4 kernels.
Correct recovery of a journaled filesystem relies on precise control over the
order in which buffers are written to disk. To prevent early writeback, the
journaling code clears the VM's BH_Dirty bit on dirty buffers, and stores the
dirty state in a private BH_JBDDirty bit instead. That ensures that the buffer
will not be scheduled for writeback IO before its transaction has committed.
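
As a rough illustration of the paragraph above, here is a toy Python model of the two dirty bits. This is not kernel code: the constants VM_DIRTY and JBD_DIRTY merely stand in for BH_Dirty and BH_JBDDirty, the bit positions are arbitrary, and the helper function names are hypothetical ones invented for this sketch.

```python
# Toy model only: these constants stand in for the kernel's BH_Dirty and
# BH_JBDDirty buffer-head bits; the bit positions are arbitrary.
VM_DIRTY = 1 << 0    # visible to the VM writeback code
JBD_DIRTY = 1 << 1   # private to the journaling code


def journal_claim_dirty(flags):
    """Hide a dirty buffer from VM writeback by moving its dirty state
    from the VM-visible bit to the journal-private bit."""
    if flags & VM_DIRTY:
        flags = (flags & ~VM_DIRTY) | JBD_DIRTY
    return flags


def vm_would_write_back(flags):
    """In this model the VM only writes back buffers whose VM bit is set."""
    return bool(flags & VM_DIRTY)


def journal_sees_dirty(flags):
    """Hypothetical stand-in for the journal's own dirty check."""
    return bool(flags & JBD_DIRTY)


flags = journal_claim_dirty(VM_DIRTY)
print(vm_would_write_back(flags), journal_sees_dirty(flags))
```

After the claim, the buffer is still dirty from the journal's point of view, but the VM writeback check no longer matches it, which is the ordering guarantee the paragraph describes.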
It turns out that there has always been a race window of about a couple of dozen
instructions where, during the refiling of a buffer between different internal
journaling lists, the buffer dirty flag was temporarily restored to BH_Dirty.
If the bdflush writeback code sees the buffer during this window, then it can
try to flush the buffer to disk, cleaning it in the process. The journaling
code rapidly detects that the dirty state is inconsistent, and we get this
This bug is not new, but it is very timing-sensitive, and there was a locking
change required between beta4 and the final Hampton kernel which affected that
timing. It appears that as a result of that change, two CPUs can be woken up on
the same buffer at the same time, and can proceed to the race window at the same
speed, exposing the race with high repeatability where the race has previously
been impossible to trigger in practice. That's purely a timing change: we don't
believe that the Hampton kernel actually contains a regression, as the bug was
There is no such thing as a risk-free bug fix, but I believe the risk to be low
in this case --- the cure is simply to clear all the dirty state during the
buggy list transition, and restore it on exit. The buffer remains spinlocked
with respect to the rest of the journaling system during that entire transition,
so the new state will not be visible to any other journaling code. It *will* be
visible to the bdflush VM core, but that was the whole point of the bug in the
first place --- the new code never leaves the buffer dirty bits in that unsafe
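
The race and the cure described in this comment can be sketched with a small deterministic Python model. It is only a sketch under stated assumptions, not the kernel patch: VM_DIRTY and JBD_DIRTY stand in for BH_Dirty and BH_JBDDirty, every function name is hypothetical, and the "other CPU" is modelled by a callback that runs inside the window rather than by real concurrency.

```python
# Toy model only, not the kernel patch. VM_DIRTY and JBD_DIRTY stand in for
# BH_Dirty and BH_JBDDirty; every function name here is hypothetical.
VM_DIRTY = 1 << 0
JBD_DIRTY = 1 << 1


def bdflush_peek(buf):
    """Model bdflush: if the VM-visible bit is set, write the buffer out
    and clean it. Returns True if a write happened."""
    if buf["flags"] & VM_DIRTY:
        buf["flags"] &= ~VM_DIRTY
        return True
    return False


def refile_racy(buf, in_window):
    """Old behaviour as described: dirtiness is briefly exposed as VM_DIRTY."""
    if buf["flags"] & JBD_DIRTY:
        buf["flags"] = (buf["flags"] & ~JBD_DIRTY) | VM_DIRTY
    in_window(buf)                      # another CPU may run here
    if buf["flags"] & VM_DIRTY:
        buf["flags"] = (buf["flags"] & ~VM_DIRTY) | JBD_DIRTY


def refile_fixed(buf, in_window):
    """Fix as described: clear all dirty state, then restore it on exit."""
    was_dirty = bool(buf["flags"] & (VM_DIRTY | JBD_DIRTY))
    buf["flags"] &= ~(VM_DIRTY | JBD_DIRTY)
    in_window(buf)                      # nothing dirty is visible here
    if was_dirty:
        buf["flags"] |= JBD_DIRTY


racy = {"flags": JBD_DIRTY}
refile_racy(racy, bdflush_peek)         # flush lands inside the window
fixed = {"flags": JBD_DIRTY}
refile_fixed(fixed, bdflush_peek)       # same interleaving, fixed refile
print(bool(racy["flags"] & JBD_DIRTY), bool(fixed["flags"] & JBD_DIRTY))
```

In the racy variant the buffer comes out of the refile with no journal dirty bit at all, which corresponds to the inconsistent state the assertion catches. In the fixed variant the same interleaving leaves the journal dirty bit intact.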
kernel-bigmem rpm gives me an error saying:
"error: unpacking of archive failed on file /boot/System.map-2.4.18-
Would you rebuild the other?
This kernel level seems to work OK for me. I can say the bug is fixed.
Will the errata kernel also have the __module__bigmem fix to rhconfig.h applied?
*** Bug 64678 has been marked as a duplicate of this bug. ***
*** Bug 64549 has been marked as a duplicate of this bug. ***
See http://rhn.redhat.com/errata/RHBA-2002-085.html for important
information on recovery.
I upgraded an SMP box (dual P100) to 2.4.18-4 from 2.4.9-31 Thursday evening.
After rebooting to the new kernel, the kernel oops'd before it got all the way
to runlevel 3. In order to recover, we had to move the disk into another
machine to run fsck. The superblock was lost, and we had to use an alternate
(e2fsck suggested 8139). All of / was corrupt, most of /bin was unrecoverable,
and /lib/modules/* was corrupt, among other things. It looked like all of the
directories that had been accessed during the boot sequence were corrupt; their
contents ended up in /lost+found.
I worry that this kernel problem is not entirely fixed, but I have no logs or
console output with any more information.
This bug could open a very small window during which a crash might leave writes
present on disk in the wrong order, but that would be nothing which a fsck would
not completely correct. Anything involving massive data corruption is almost
certainly a different problem. Do you have a record of the oops message? If
so, please open another bugzilla bug with as much information as you can.
Note that if you have got a small amount of latent disk corruption for any
reason (eg. bad memory), any large filesystem write of old files such as the
kernel boot images has the ability to corrupt significant chunks of the
filesystem. That's not the filesystem's fault, but is an inevitable result of
never running fsck if there's background corruption.
I've just experienced this bug with version...
Linux version 2.4.18-4bigmem (firstname.lastname@example.org) (gcc version 2.96 20000731 (Red Hat Linux 7.3 2.96-110)) #1 SMP Thu
May 2 18:06:05 EDT 2002
The log shows the error...
Assertion failure in journal_write_metadata_buffer() at journal.c:406: "buffer_jdirty(jh2bh(jh_in))"
followed by a dozen or so lines of traceback information.
The machine in question has dual 1.4GHz PIIIs and 2GB of memory, with an onboard RAID (aacraid) for mirroring. Could there still be a problem,
or have I just misinstalled something? Any help would be greatly appreciated.
The only other person who has reported this against 2.4.18-4smp so far turned
out to have been accidentally running the 2.4.18-3smp kernel instead. The full
traceback would be helpful in taking this further.
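
Since that earlier report came from a machine running the older kernel, a quick check of the running version can save a round trip. Below is a minimal Python sketch, assuming the "Linux version <base>-<release><flavour>" layout seen in the /proc/version line quoted above. The function names are hypothetical, and the default threshold of 2.4.18-3.1 is taken from the comment in this thread saying the fix was first built into kernel-2.4.18-3.1.

```python
import re

# Assumed layout: "Linux version <base>-<release><flavour> (...)",
# e.g. "Linux version 2.4.18-4bigmem (...)". Other layouts return None.
_VERSION_RE = re.compile(
    r"Linux version (\d+\.\d+\.\d+)-(\d+(?:\.\d+)*)([A-Za-z]*)"
)


def parse_kernel_release(version_line):
    """Return (base, release_tuple, flavour), or None if the line does
    not match the assumed layout."""
    m = _VERSION_RE.match(version_line)
    if m is None:
        return None
    release = tuple(int(part) for part in m.group(2).split("."))
    return m.group(1), release, m.group(3)


def includes_first_fix(version_line, base="2.4.18", first_fixed=(3, 1)):
    """True/False if the kernel is on the given base and at/after the
    first fixed release; None if it cannot be compared."""
    parsed = parse_kernel_release(version_line)
    if parsed is None or parsed[0] != base:
        return None
    return parsed[1] >= first_fixed


line = "Linux version 2.4.18-4bigmem (user@host) (gcc version 2.96) #1 SMP"
print(parse_kernel_release(line), includes_first_fix(line))
```

On a live system the input line would come from reading /proc/version. A later comment in this thread reports one rarer case still present in 2.4.18-5, so passing this check does not rule the bug out entirely.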
My first concern was that I might be running the wrong version of
the kernel, so I checked /proc/version before rebooting. I don't
know if an fsck had been run after the first time I had the problem.
Could this incident be caused by an already messy filesystem?
I was careful to ensure that an fsck was run this time, and I'm sure
that the 18-4 kernel is in use. If it occurs again I'd be happy to include
any trace information you need if you could let me know how to generate
or find the information.
You said that there were a dozen lines of traceback in the log: I really would
need that to take this any further.
You talk about "the first time I had the problem": has this happened more than
once? It's really unclear from your problem report whether or not you checked
the version number immediately after you saw the problem, and just what the
sequence of bug, fsck and reboots was afterwards: if you can be as precise as
possible in your bug reports, that will help enormously.
Created attachment 60853 [details]
This is the kernel log info for the period when we had the problem occur
I've added the kernel log for all the time since we first had the problem appear. The first occurrence
was on May 14th with the 2.4.18-3bigmem kernel. After a second occurrence on May 16th, I installed
the 2.4.18-4bigmem kernel. If I recall correctly I did not run fsck at that time.
We just had a failure yesterday, where the system applications stopped working.
The filesystem that fails for us is /var/spool, so the system remains functional, but the applications that
make use of /var/spool all hang. For this reason I was able to log in and check /proc/version before
rebooting the system. I subsequently rebooted with "shutdown -rF now" to have it perform the fsck on reboot.
May 16 23:03:46 vs2 kernel: XD: Loaded as a module.
May 16 23:03:46 vs2 kernel: Trying to free nonexistent resource <00000320-00000323>
Jun 12 12:02:59 vs2 kernel: loop nfs lockd sunrpc autofs e1000 iptable_filter
ip_tables usb-ohci usbcore e
Jun 12 12:02:59 vs2 kernel: CPU: 1
Jun 12 12:02:59 vs2 kernel: EIP: 0010:[<f8840954>] Tainted: P
What module is tainting this kernel? What's the "XD:" module?
The tainting is coming from the e1000.o module. This is the Intel(R) PRO/1000 Network Driver.
I notice that when I load it using insmod, I get a complaint about tainting the kernel as well as
a copyright notice from Intel.
The machine itself is a Dell PowerEdge 1650, and we're using the dual onboard NICs.
If I comment out the "alias eth0 e1000" line in /etc/modules.conf, lsmod shows an untainted
kernel, but then I have no network connectivity :-(
I have no idea where the "XD: Loaded as a module" or the "nonexistent resource" messages are
coming from. I get them as well on another box (Dell OptiPlex GX110) which doesn't use the e1000
driver. I'm not sure where to even begin looking for this.
I'm using kernel 2.4.18-5smp, and getting a very similar problem on dual
Pentium 4 2GHz Xeons (with hyperthreading). I had the problem with
2.4.18-3smp, updated to -5 and ran fsck everywhere, but the problem is still
occurring (on all four machines).
I've had panics on four different machines (all with the same hardware) when
they are doing heavy I/O (postfix receiving ~150k mails/hour).
I will attach a trace...
Created attachment 71064 [details]
2.4.18-5smp BUG at journal.c:408
I found one more, much rarer, situation in which this could still occur in
2.4.18-5. That case is fixed in all errata since then.
Are these fixes also in the RHAS tree? I've seen this panic on
2.4.9 didn't have the bug in the first place, so you saw something else; please
file a separate report.
This fell off my list of things to do. Just to follow up...the panic I'd seen
on 2.4.9-e.10enterprise was:
kernel: Assertion failure in journal_write_metadata_buffer() at journal.c:372:
It hasn't happened again (now running e.16) and we've had other reasons to think
there may have been some memory corruption going on elsewhere.