51644 – "Journal_write_metadata_buffer" error running cerberus on IA64

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.3
Hardware:	ia64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Stephen Tweedie
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2001-08-13 15:11 UTC by Clay Cooper
Modified:	2007-04-18 16:35 UTC
CC List:	10 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2001-10-10 14:41:44 UTC
Embargoed:

Attachments
Serial log of bootup and error (9.56 KB, text/plain) 2001-09-17 19:55 UTC, Clay Cooper

Description Clay Cooper 2001-08-13 15:11:11 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux 2.4.2-2 i686; en-US; rv:0.9.1)
Gecko/20010607 Netscape6/6.1b1

Description of problem:
On a pe7150 w/ beta3 kernel 2.4.7-0.12.0smp, onboard scsi booting, 2GB ram,
2GB swap, bios X15.

Cerberus 9-20 running for over 24 hours.  Console reports:

Assertion failure in journal_write_metadata_buffer() at journal.c:365:
"buffer_jdirty(jh2bh(jh_in))"

After pressing Ctrl-C to exit cerberus, the system is unresponsive.  Text can
be typed at the prompt, but commands are not executed, and it is not possible
to log in on another console.

How reproducible:
Sometimes

Steps to Reproduce:
1. Install the above configuration or equivalent.
2. Run cerberus for at least 24 hours.
3. Look for the error at the console.

Actual Results:  Error

Expected Results:  No error

Additional info:

Comment 1 Glen Foster 2001-08-13 19:04:47 UTC

This defect is considered MUST-FIX for Fairfax.

Comment 2 Stephen Tweedie 2001-08-25 16:25:53 UTC

I'll send you a more recent kernel build with the current ext3, which has a few
bugs fixed.  If you can still reproduce with that kernel, then I'll build you a
debugging version which produces extensive buffer tracing information when it
detects such errors.  Do you have a serial console set up so that you can trap
such debug output?

Comment 3 Clay Cooper 2001-08-27 15:00:47 UTC

Testing 2.4.7-2.9 w/ cerberus.  Yes I can capture serial console output.

Comment 4 Clay Cooper 2001-08-28 14:03:05 UTC

Reproduced in 2.4.7-2.9.  I have serial console redirection set up.

Comment 5 Stephen Tweedie 2001-08-30 15:37:32 UTC

http://people.redhat.com/sct/.ia64.debug/

has kernels with extensive ext3 debugging enabled.  Could you try with this new
kernel, please?  

Thanks

Comment 6 Clay Cooper 2001-09-04 15:36:00 UTC

Looks like I need mkinitrd 3.2.2 to install the kernel rpms.

Comment 7 Stephen Tweedie 2001-09-10 12:08:26 UTC

Any update on this?  I've put the new mkinitrd into the same url directory as
the kernel images, in case you missed Arjan's email about where to get it.

Comment 8 Clay Cooper 2001-09-10 13:06:29 UTC

Problem reproduced with the diagnostic kernel, but the only thing in the serial
console log is the same assertion error that appears on the console.

Comment 9 Stephen Tweedie 2001-09-10 16:44:59 UTC

There _should_ have been a [long] debugging history trace of the buffer
concerned printed before the assert failure log message, if you're using a
kernel dated later than August 28th.  Is there nothing at all of that type in
the console output?

Comment 10 Clay Cooper 2001-09-10 20:33:19 UTC

I was using kernel 2.4.7-6smp downloaded from
http://people.redhat.com/sct/.ia64.debug/

There was no trace in the serial console log.

Comment 11 Stephen Tweedie 2001-09-13 21:04:36 UTC

Hmm.  I've had a look through the code and I can't see any way that the debug
code in question would be inactive on a modular ext3 (as we configure it), nor
any way in which it would be omitted from a display of that particular oops.  Is
the total log for that boot (from initial kernel version to final oops) short
enough for you to mail to me so that I can try to pick out why it's being missed?

Comment 12 Clay Cooper 2001-09-17 19:55:17 UTC

Created attachment 31920
Serial log of bootup and error

Comment 13 Stephen Tweedie 2001-09-18 15:44:49 UTC

The only thing I can think of here is that the debug buffer tracing is being
done at a log level which is not going to the console.  The tracing is done at
log level KERN_WARNING (level 4) by default.  Could you try setting the console
loglevel higher than that and see if you can capture the trace?  Or see if
/var/log/messages has captured the info (hidden console traffic may still show
up in the /var/log files if the root filesystem is still working)?

To raise the log level, try

   dmesg -n 7

or edit the LOGLEVEL in /etc/sysconfig/init.  You can check the current loglevel
in /proc/sys/kernel/printk: the first value in that pseudo-file is the current
filter log level used to determine which messages get sent to console and which
just get logged to /var/log/.

If the trace has indeed been captured in /var/log/messages successfully then we
won't need to worry about the serial console, but filesystem crashes may
obviously prevent the /var/log spools from operating correctly.
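
As a rough illustration of the log-level check described above, the following
shell sketch compares a message level against the console loglevel.  The helper
names console_loglevel and reaches_console are hypothetical, and the sketch
assumes the usual kernel rule that a message is printed on the console only
when its level is numerically lower than the console loglevel (the first field
of /proc/sys/kernel/printk).

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of any existing tool.

# Print the console loglevel: the first field of a printk settings line.
console_loglevel() {
    printf '%s\n' "$1" | awk '{ print $1 }'
}

# Succeed if a message at level $1 would be shown on a console whose
# loglevel is $2 (assumed rule: shown only when message level < loglevel).
reaches_console() {
    [ "$1" -lt "$2" ]
}

KERN_WARNING=4   # level used by the debug buffer tracing, per the text above

# Inspect the live setting only if the pseudo-file is readable.
if [ -r /proc/sys/kernel/printk ]; then
    current=$(console_loglevel "$(cat /proc/sys/kernel/printk)")
    if reaches_console "$KERN_WARNING" "$current"; then
        echo "console loglevel $current: KERN_WARNING messages reach the console"
    else
        echo "console loglevel $current: KERN_WARNING messages are filtered; try dmesg -n 7"
    fi
fi
```

Under that assumption, a console loglevel of 3 or 4 would hide level-4 trace
messages from the console, while a loglevel of 7 would let them through.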

Comment 14 Clay Cooper 2001-09-20 20:39:38 UTC

Changed log level to 7 with dmesg -n 7 and verified in /proc/sys/kernel/printk,
but nothing further in serial console.

Comment 15 Stephen Tweedie 2001-09-21 13:33:54 UTC

You mean you reproduced the fault after setting dmesg?  Was there anything in
/var/log/messages after the bug report, or was that not accessible?

I've double-checked, and the debugging code is most definitely present in the
kernel image I sent you, so if we really can't get it to trigger then I'll try
doing fault-injection to trigger the assertion manually on a test box to make
sure that the debug code is behaving as expected.

Comment 16 Clay Cooper 2001-10-10 14:27:15 UTC

Did not occur with kernel 2.4.9-0.18smp and 2GB ram after 24 hours of newburn.

Comment 17 Stephen Tweedie 2001-10-10 14:38:11 UTC

Has it ever survived that long before?

Comment 18 Clay Cooper 2001-10-10 14:41:38 UTC

It had been occurring within a few hours, usually within an hour, so this looks
promising.  I am letting it run, and will continue to monitor it.
