From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux 2.4.2-2 i686; en-US; rv:0.9.1) Gecko/20010607 Netscape6/6.1b1

Description of problem:
On a pe7150 w/ beta3 kernel 2.4.7-0.12.0smp, onboard scsi booting, 2GB ram, 2GB swap, bios X15. Cerberus 9-20 running for over 24 hours. Console reports:

Assertion failure in journal_write_metadata_buffer() at journal.c:365: "buffer_jdirty(jh2bh(jh_in))"

After Ctrl-C out of cerberus, the system is unresponsive. Text could be typed at the prompt, but commands were not executed. Could not log in on another console.

How reproducible:
Sometimes

Steps to Reproduce:
1. Install above configuration or equivalent.
2. Run cerberus for at least 24 hours.
3. Look for error at console.

Actual Results: No error

Expected Results: Error

Additional info:
This defect is considered MUST-FIX for Fairfax.
I'll send you a more recent kernel build with the current ext3, which has a few bugs fixed. If you can still reproduce with that kernel, then I'll build you a debugging version which produces extensive buffer tracing information when it detects such errors. Do you have a serial console set up so that you can trap such debug output?
Testing 2.4.7-2.9 w/ cerberus. Yes I can capture serial console output.
Reproduced in 2.4.7-2.9. I have serial console redirection set up.
http://people.redhat.com/sct/.ia64.debug/ has kernels with extensive ext3 debugging enabled. Could you try with this new kernel, please? Thanks
Looks like I need mkinitrd 3.2.2 to install the kernel rpms.
Any update on this? I've put the new mkinitrd into the same url directory as the kernel images, in case you missed Arjan's email about where to get it.
Problem reproduced with diagnostic kernel, but the only thing in the serial console is the error at the console.
There _should_ have been a [long] debugging history trace of the buffer concerned printed before the assert failure log message, if you're using a kernel dated later than August 28th. Is there nothing at all of that type in the console output?
I was using kernel 2.4.7-6smp, downloaded from http://people.redhat.com/sct/.ia64.debug/. There was no trace in the serial console log.
Hmm. I've had a look through the code and I can't see any way that the debug code in question would be inactive on a modular ext3 (as we configure it), nor any way in which it would be omitted from a display of that particular oops. Is the total log for that boot (from initial kernel version to final oops) short enough for you to mail to me so that I can try to pick out why it's being missed?
Created attachment 31920: Serial log of bootup and error
The only thing I can think of here is that the debug buffer tracing is being done at a log level which is not going to the console. The tracing is done at log level KERN_WARNING (level 4) by default. Could you try setting the console loglevel higher than that and see if you can capture the trace? Or see if /var/log/messages has captured the info (hidden console traffic may still show up in the /var/log files if the root filesystem is still working)? To raise the log level, try dmesg -n 7 or edit the LOGLEVEL in /etc/sysconfig/init. You can check the current loglevel in /proc/sys/kernel/printk: the first value in that pseudo-file is the current filter log level used to determine which messages get sent to console and which just get logged to /var/log/. If the trace has indeed been captured in /var/log/messages successfully then we won't need to worry about the serial console, but filesystem crashes may obviously prevent the /var/log spools from operating correctly.
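For reference, the loglevel check described above can be sketched as a small script. This is only an illustration, not part of the debug kernel or of any Red Hat tooling: the function names (`console_loglevel_from`, `reaches_console`, `report`) are made up here, and it assumes the usual kernel rule that a message is shown on the console only when its level is numerically lower than the console loglevel.

```python
KERN_WARNING = 4  # numeric level used by the buffer tracing, per the comment above


def console_loglevel_from(text):
    """Return the first value of /proc/sys/kernel/printk content.

    The first whitespace-separated field is the current console loglevel.
    """
    fields = text.split()
    if not fields:
        raise ValueError("empty printk content")
    return int(fields[0])


def reaches_console(msg_level, console_loglevel):
    """True if a message at msg_level passes the console filter.

    Assumption: the kernel prints a message to the console only when its
    level is numerically lower than the console loglevel.
    """
    return msg_level < console_loglevel


def report(path="/proc/sys/kernel/printk"):
    """Read the pseudo-file and say whether KERN_WARNING output is visible."""
    with open(path) as f:
        level = console_loglevel_from(f.read())
    if reaches_console(KERN_WARNING, level):
        return "console loglevel %d: KERN_WARNING messages reach the console" % level
    return "console loglevel %d: KERN_WARNING messages are filtered; try dmesg -n 7" % level
```

Calling `print(report())` on the test machine before starting the run would show whether the trace can appear on the serial console at all; after `dmesg -n 7` the first value should read 7.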
Changed log level to 7 with dmesg -n 7 and verified in /proc/sys/kernel/printk, but nothing further in serial console.
You mean you reproduced the fault after setting dmesg? Was there anything in /var/log/messages after the bug report, or was that not accessible? I've double-checked, and the debugging code is most definitely present in the kernel image I sent you, so if we really can't get it to trigger then I'll try doing fault-injection to trigger the assertion manually on a test box to make sure that the debug code is behaving as expected.
Did not occur with kernel 2.4.9-0.18smp and 2GB ram after 24 hours of newburn.
Has it ever survived that long before?
It had been occurring within a few hours, usually within an hour, so this looks promising. I am letting it run, and will continue to monitor it.