Bug 51644

Summary: "Journal_write_metadata_buffer" error running cerberus on IA64
Product: [Retired] Red Hat Linux
Reporter: Clay Cooper <clay_cooper>
Component: kernel
Assignee: Stephen Tweedie <sct>
Status: CLOSED CURRENTRELEASE
QA Contact: Brock Organ <borgan>
Severity: medium
Priority: medium
Version: 7.3
CC: afom_m, dale_kaisner, danny_trinh, dean_oliver, john_hull, joshua_giles, mark_rusk, matt_domsch, michael_e_brown, rogelio_noriega
Target Milestone: ---   
Target Release: ---   
Hardware: ia64   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2001-10-10 14:41:44 UTC
Attachments:
Description: Serial log of bootup and error
Flags: none

Description Clay Cooper 2001-08-13 15:11:11 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux 2.4.2-2 i686; en-US; rv:0.9.1)
Gecko/20010607 Netscape6/6.1b1

Description of problem:
On a pe7150 w/ beta3 kernel 2.4.7-0.12.0smp, onboard scsi booting, 2GB ram,
2GB swap, bios X15.

Cerberus 9-20 running for over 24 hours.  Console reports:

Assertion failure in journal_write_metadata_buffer() at journal.c:365:
"buffer_jdirty(jh2bh(jh_in))"

After Ctrl-C out of cerberus, the system was unresponsive.  Text could be typed
at the prompt, but commands were not executed.  Could not log in on another console.

How reproducible:
Sometimes

Steps to Reproduce:
1. Install the above configuration or equivalent.
2. Run cerberus for at least 24 hours.
3. Look for the error at the console.

Actual Results:  Error

Expected Results:  No error

Additional info:

Comment 1 Glen Foster 2001-08-13 19:04:47 UTC
This defect is considered MUST-FIX for Fairfax.

Comment 2 Stephen Tweedie 2001-08-25 16:25:53 UTC
I'll send you a more recent kernel build with the current ext3, which has a few
bugs fixed.  If you can still reproduce with that kernel, then I'll build you a
debugging version which produces extensive buffer tracing information when it
detects such errors.  Do you have a serial console set up so that you can trap
such debug output?

Comment 3 Clay Cooper 2001-08-27 15:00:47 UTC
Testing 2.4.7-2.9 w/ cerberus.  Yes, I can capture serial console output.

Comment 4 Clay Cooper 2001-08-28 14:03:05 UTC
Reproduced in 2.4.7-2.9.  I have serial console redirection set up.

Comment 5 Stephen Tweedie 2001-08-30 15:37:32 UTC
http://people.redhat.com/sct/.ia64.debug/

has kernels with extensive ext3 debugging enabled.  Could you try with this new
kernel, please?  

Thanks

Comment 6 Clay Cooper 2001-09-04 15:36:00 UTC
Looks like I need mkinitrd 3.2.2 to install the kernel rpms.

Comment 7 Stephen Tweedie 2001-09-10 12:08:26 UTC
Any update on this?  I've put the new mkinitrd into the same url directory as
the kernel images, in case you missed Arjan's email about where to get it.

Comment 8 Clay Cooper 2001-09-10 13:06:29 UTC
Problem reproduced with diagnostic kernel, but the only thing captured on the
serial console is the same error shown at the console.

Comment 9 Stephen Tweedie 2001-09-10 16:44:59 UTC
There _should_ have been a [long] debugging history trace of the buffer
concerned printed before the assert failure log message, if you're using a
kernel dated later than August 28th.  Is there nothing at all of that type in
the console output?
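
As a rough aid for checking a captured serial log, the sketch below pulls out the lines that precede the first assertion message, which is where the buffer trace would be expected to appear. The function name, the default context size, and the marker string are illustrative assumptions and not part of any tool mentioned in this report.

```python
def lines_before_assertion(log_text, context=40, marker="Assertion failure in"):
    """Return up to `context` lines preceding the first line containing `marker`.

    Returns None if the marker does not appear in the log at all.
    """
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if marker in line:
            # Clamp at the start of the log so short logs do not wrap around.
            start = max(0, index - context)
            return lines[start:index]
    return None
```

If the returned list holds only ordinary boot or test output, the trace most likely never reached the serial console. A result of None means the log does not contain the assertion line at all.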

Comment 10 Clay Cooper 2001-09-10 20:33:19 UTC
I was using kernel 2.4.7-6smp downloaded from
http://people.redhat.com/sct/.ia64.debug/

There was no trace in the serial console log.



Comment 11 Stephen Tweedie 2001-09-13 21:04:36 UTC
Hmm.  I've had a look through the code and I can't see any way that the debug
code in question would be inactive on a modular ext3 (as we configure it), nor
any way in which it would be omitted from a display of that particular oops.  Is
the total log for that boot (from initial kernel version to final oops) short
enough for you to mail to me so that I can try to pick out why it's being missed?

Comment 12 Clay Cooper 2001-09-17 19:55:17 UTC
Created attachment 31920 [details]
Serial log of bootup and error

Comment 13 Stephen Tweedie 2001-09-18 15:44:49 UTC
The only thing I can think of here is that the debug buffer tracing is being
done at a log level which is not going to the console.  The tracing is done at
log level KERN_WARNING (level 4) by default.  Could you try setting the console
loglevel higher than that and see if you can capture the trace?  Or see if
/var/log/messages has captured the info (hidden console traffic may still show
up in the /var/log files if the root filesystem is still working)?

To raise the log level, try

   dmesg -n 7

or edit the LOGLEVEL in /etc/sysconfig/init.  You can check the current loglevel
in /proc/sys/kernel/printk: the first value in that pseudo-file is the current
filter log level used to determine which messages get sent to console and which
just get logged to /var/log/.

If the trace has indeed been captured in /var/log/messages successfully then we
won't need to worry about the serial console, but filesystem crashes may
obviously prevent the /var/log spools from operating correctly.
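
The filtering rule can be sketched in a few lines of Python: a message is sent to the console only when its level is numerically lower than the console loglevel, which is the first field of /proc/sys/kernel/printk. The function names are illustrative assumptions; the value 4 for KERN_WARNING is the one given above.

```python
KERN_WARNING = 4  # level at which the buffer trace is printed by default


def console_loglevel(printk_text):
    """Parse the console loglevel (first field) from /proc/sys/kernel/printk content."""
    return int(printk_text.split()[0])


def reaches_console(message_level, printk_text):
    """True if a message at `message_level` would be sent to the console.

    The kernel prints a message to the console only when its level is
    numerically lower than the current console loglevel.
    """
    return message_level < console_loglevel(printk_text)
```

For example, passing the contents of /proc/sys/kernel/printk as the second argument with KERN_WARNING as the first shows whether the trace would be filtered. With a console loglevel of 4 it would be, and after `dmesg -n 7` it would not.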

Comment 14 Clay Cooper 2001-09-20 20:39:38 UTC
Changed log level to 7 with dmesg -n 7 and verified in /proc/sys/kernel/printk,
but nothing further in serial console.

Comment 15 Stephen Tweedie 2001-09-21 13:33:54 UTC
You mean you reproduced the fault after setting dmesg?  Was there anything in
/var/log/messages after the bug report, or was that not accessible?

I've double-checked, and the debugging code is most definitely present in the
kernel image I sent you, so if we really can't get it to trigger then I'll try
doing fault-injection to trigger the assertion manually on a test box to make
sure that the debug code is behaving as expected.

Comment 16 Clay Cooper 2001-10-10 14:27:15 UTC
Did not occur with kernel 2.4.9-0.18smp and 2GB ram after 24 hours of newburn.

Comment 17 Stephen Tweedie 2001-10-10 14:38:11 UTC
Has it ever survived that long before?

Comment 18 Clay Cooper 2001-10-10 14:41:38 UTC
It had been occurring within a few hours, usually within an hour, so this looks
promising.  I am letting it run, and will continue to monitor it.