From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.4) Gecko/20011019 Netscape6/6.2 Description of problem: After setup sendmail service (Bigmem/SMP kernel), I run SST test. Then about 10 minutes or so, I got kernel panic. The system hung up. Version-Release number of selected component (if applicable): How reproducible: Always Steps to Reproduce: 1. HW config: Jaguar/Merlot + Broadcom embedded NICs 1.Install Valhalla on Jaguar or Merlot 2. Boot server up to Bigmem or SMP kernel 2. Run SST server setup. 3. Run SST client setup. 4. Run SST test on client. Actual Results: Kernel panic when running SST to against broadcom NICs. This test works OK with Intel 1000XT NICs. Also, It seems to work good when I switch it to UP kernel. Expected Results: SST test must work when using SMP/BIGMEM kernel. Additional info:
Created attachment 55344 [details] Serial consol log on Merlot
Created attachment 55345 [details] serial consol log on Jaguar (BIGMEM kernel)
Created attachment 55365 [details] SST test
Assertion failure in journal_write_metadata_buffer() at journal.c:406: "buffer_jdirty(jh2bh(jh_in))"
Arjan, can you elabotate on your previous statment?
that's the key bit out of both attachements; just repeating it inline here ...
This is probably a FS issue not a NIC issue
* HW config on Jaguar: 8GB memory + embedded Adaptec 7899 + embedded Broadcom NIC I install Hampton gold and use ext3 file system. - I ran Bonnie ++, cpcmp (copy & compare) using NFS, SMB services ---> OK - I ran SST test ---> kernel panic - I'm running NTTCP test right now * HW config on Merlot: 1GB memory + qla2310 (unplugged) + qla2200 (unplugged) + PERC3/QC (plugged to PV210) + Intel pro 100 + embedded Adaptec 7899 + embedded Broadcom NIC. - First, I installed Hampton gold and use ext3 file system - I ran SST test ---> kernel panic - This morning, I reinstalled Hampton gold (using the same Hw config) and use ext2 file sytem. - I ran SST test ----> still running (more than 3 hrs)
On Merlot system (same HW config): - I installed Hampton beta4, and ran SST test for more than 1 hour without kernel panic.
Reproduced on an internal test box with ext3 debugging and buffer tracing enabled (trace attached). Booting with mem=512m shows no problems --- this may be a highmem-specific fault.
Created attachment 55764 [details] console trace, ext3 full debug enabled.
Update from Jay Turner: I rebooted the machine and unloaded the tg3 module in favor of the bcm5700 module (which also works with these cards.) The thing still fell over after about a minute. Definitely doesn't look like a NIC-specific problem.
Created attachment 55864 [details] Trace with additional debugging enabled
Created attachment 55865 [details] Another trace with additional debugging
Created attachment 55866 [details] And another one with additional debugging
OK, the good news is that we have a kernel patch which appears to be working. I have been able to get through 1 hour of sst testing against a machine which used to fall over in 5 minutes or less. Am going to restart the test for several hours just to make sure that nothing else odd is happening.
I'm doing regression testing right now. Will check the fix in in an hour or so if there are no problems at that point, but we'll keep testing overnight.
SST ran for 12 hours on the machine in question without failure, so we are doing pretty good with that respect. Let's get the kernel pushed through the build system and prepped for an errata.
We're way ahead of you. :) We did preliminary stress testing on the patch last night, and then built kernel-2.4.18-3.1 with the fix in place. I ran the full Cerberus on it overnight on a 4-way 8GB machine with no problems. We'll make it available for wider testing shortly.
The fault we found was an existing fault that has affected all versions of ext3 on 2.4 kernels. Correct recovery of a journaled filesystem relies on precise control over the order in which buffers are written to disk. To prevent early writeback, the journaling code clears the VM's BH_Dirty bit on dirty buffers, and stores the dirty state in a private BH_JBDDirty bit instead. That ensures that the buffer will not be scheduled for writeback IO before its transaction has committed. It turns out that there has always been a race window of about a couple of dozen instructions where, during the refiling of a buffer between different internal journaling lists, the buffer dirty flag was temporarily restored to BH_Dirty. If the bdflush writeback code sees the buffer during this window, then it can try to flush the buffer to disk, cleaning it in the process. The journaling code rapidly detects that the dirty state is inconsistent, and we get this assert failure. This bug is not new, but it is very timing-sensitive, and there was a locking change required between beta4 and the final Hampton kernel which affected that timing. It appears that as a result of that change, two CPUs can be woken up on the same buffer at the same time, and can proceed to the race window at the same speed, exposing the race with high repeatability where the race has previously been impossible to trigger in practice. That's purely a timing change: we don't believe that the Hampton kernel actually contains a regression, as the bug was always present. There is no such thing as a risk-free bug fix, but I believe the risk to be low in this case --- the cure is simply to clear all the dirty state during the buggy list transition, and restore it on exit. The buffer remains spinlocked with respect to the rest of the journaling system during that entire transition, so the new state will not be visible to any other journaling code. It *will* be visible to the bdflush VM core, but that was the whole point of the bug in the first place --- the new code never leaves the buffer dirty bits in that unsafe state.
Stephen, kernel-bigmem rpm give me an error saying: "error: unpacking of archive failed on file /boot/System.map-2.4.18- 3.1bigmem;3ccfe2f7: cpio:read" Would you rebuild the other? Thanks,
This kernel level seems to work ok for me. I can say bug fixed.
Will the errata kernel also have the __module__bigmem fix to rhconfig.h applied?
of course
*** Bug 64678 has been marked as a duplicate of this bug. ***
*** Bug 64549 has been marked as a duplicate of this bug. ***
See http://rhn.redhat.com/errata/RHBA-2002-085.html for important information on recovery.
I upgraded an SMP box (dual P100) to 2.4.18-4 from 2.4.9-31 Thursday evening. After rebooting to the new kernel, the kernel oops'd before it got all the way to runlevel 3. In order to recover, we had to move the disk into another machine to run fsck. The superblock was lost, and we had to use an alternate (e2fsck suggested 8139). All of / was corrupt, most of /bin was unrecoverable /lib/modules/* was corrupt, among other things. It looked like all of the directories that had been accessed during the boot sequence were corrupt, their contents ended up in /lost+found. I worry that this kernel problem is not entirely fixed, but I have no logs or console output with any more information.
This bug could open a very small window during which a crash might leave writes present on disk in the wrong order, but that would be nothing which a fsck would not completely correct. Anything involving massive data corruption is almost certainly a different problem. Do you have a record of the oops message? If so, please open another bugzilla bug with as much information as you can. Note that if you have got a small amount of latent disk corruption for any reason (eg. bad memory), any large filesystem write of old files such as the kernel boot images has the ability to corrupt significant chunks of the filesystem. That's not the filesystem's fault, but is an inevitable result of never running fsck if there's background corruption.
I've just experienced this bug with version... Linux version 2.4.18-4bigmem (bhcompile.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.3 2.96-110)) #1 SMP Thu May 2 18:06:05 EDT 2002 [root@vs2 log]# The log shows the error... Assertion failure in journal_write_metadata_buffer() at journal.c:406: "buffer_jdirty(jh2bh(jh_in))" followed by a dozen or so lines of traceback information. This machine in question has dual 1.4Ghz PIII's, and 2GB of memory with a on board RAID (aacraid) for mirroring. Could there still be a problem, or have I just misinstalled something? Any help would be greatly appreciated.
The only other person who has reported this against 2.4.18-4smp so far turned out to have been accidentally running the 2.4.18-3smp kernel instead. The full traceback would be helpful in taking this further.
My first concern was that I might be running the wrong version of the kernel, so I checked /proc/version before rebooting. I don't know if an fsck had been run after the first time I had the problem. Could this incident be caused by an already messy filesystem? I was careful to ensure that an fsck was run this time, and I'm sure that the 18-4 kernel is in use. If it occurs again I'd be happy to include any trace information you need if you could let me know how to generate or find the information.
You said that there were a dozen lines of traceback in the log: I really would need that to take this any further. You talk about "the first time I had the problem": has this happened more than once? It's really unclear from your problem report whether or not you checked the version number immediately after you saw the problem, and just what the sequence of bug, fsck and reboots was afterwards: if you can be as precise as possible in your bug reports, that will help enormously.
Created attachment 60853 [details] This is the kernel log info for the period when we had the problem occur
I've added the kernel log for all the time since we first had the problem appear. The first occurrence was on May 14th with the 2.4.18-3bigmem kernel. After a second occurence on May 16th, I installed the 2.4.18-4bigmem kernel. If I recall correctly I did not run fsck at that time. We just had a failure yesterday, where the system applications stopped working. The filesystems that fails for us is /var/spool so the system remains functional, but the applications that make use of /var/spool all hang. For this reason I was able to login and check /proc/version before rebooting the system. I subsequently rebooted with "shutdown -rF now" to have it perform the fsck on reboot.
May 16 23:03:46 vs2 kernel: XD: Loaded as a module. May 16 23:03:46 vs2 kernel: Trying to free nonexistent resource <00000320-00000323> Jun 12 12:02:59 vs2 kernel: loop nfs lockd sunrpc autofs e1000 iptable_filter ip_tables usb-ohci usbcore e Jun 12 12:02:59 vs2 kernel: CPU: 1 Jun 12 12:02:59 vs2 kernel: EIP: 0010:[<f8840954>] Tainted: P What module is tainting this kernel? What's the "XD:" module?
The tainting is coming from the e1000.o module. This the Intel(R) PRO/1000 Network Driver. I notice that when I load it using insmod, I get a complaint about tainting the kernel as well as a copyright by Intel notice. The machine itself is the Dell Poweredge 1650, and we're using the dual onboard NIC's. If I comment out the "alias eth0 e1000" line in /etc/modules.conf, lsmod sjhows an untainted kernel, but then I have no network connectivity :-( I have no idea where the "XD: Loaded as a module" or the "nonexistent resource" messages are coming from. I get them as well on another box (Dell Opriplex GX110) which doesn't use the e1000 driver. I'm not sure where to even begin looking for this.
I'm using kernel 2.4.18-5smp, and getting a very similar problem on dual pentium 4 2GHz xeons (with hyperthreading). I had the problem with 2.4.18-3smp, updated to -5 and ran fsck everywhere, but the problem is still occuring (on all four machines) I've had panics on four different machines (all of the same hardware) when they are doing heavy i/o (postfix receiving ~150k mails/hour) I will attach a trace...
Created attachment 71064 [details] 2.4.18-5smp BUG at journal.c:408
I found one more, much rarer, situation in which this could still occur in 2.4.18-5. That case is fixed in all errata since then.
Are these fixes also in the RHAS tree? I've seen this panic on 2.4.9-e.10enterprise recently.
2.4.9 didn't have the bug in the first place, so you saw something else, please file a separate report.
This fell off my list of things to do. Just to follow up...the panic I'd seen on 2.4.9-e.10enterprise was: kernel: Assertion failure in journal_write_metadata_buffer() at journal.c:372: "buffer_jdirty(jh2bh(jh_in))" It hasn't happened again (now running e.16) and we've had other reasons to think there may have been some memory corruption going on elsewhere.