Bug 140361

Summary: Bad metadata assertion in trans.c: "meta_check_magic == GFS_MAGIC"
Product: [Retired] Red Hat Cluster Suite
Reporter: Corey Marthaler <cmarthal>
Component: gfs
Assignee: Wendy Cheng <nobody+wcheng>
Status: CLOSED WORKSFORME
QA Contact: GFS Bugs <gfs-bugs>
Severity: medium
Priority: medium
Version: 3
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-10-11 22:18:57 UTC

Description Corey Marthaler 2004-11-22 16:34:27 UTC
Description of problem:
This weekend I was running a regression load on morph-01 through morph-06 with the latest 6.0 RPMs:

GFS v6.0.0 (built Nov 18 2004 14:10:18) installed
Kernel 2.4.21-26.ELsmp on an i686

morph-03 hit this panic/assertion.

I/O load at the time:
iogen -f sync -i 30s -m random -s read,write,readv,writev
iogen -f sync -i 30s -m sequential -s read,write,readv,writev
iogen -f buffered -i 30s -m sequential -s read,write,readv,writev

Nov 20 04:42:15 morph-03 kernel: Bad metadata at 65130
Nov 20 04:42:15 morph-03 kernel:   mh_magic = 0x35303030
Nov 20 04:42:15 morph-03 kernel:   mh_type = 980250482
Nov 20 04:42:15 morph-03 kernel:   mh_generation = 8099773614866981999
Nov 20 04:42:15 morph-03 kernel:   mh_format = 1768892995
Nov 20 04:42:15 morph-03 kernel:   mh_incarn = 976303408
Nov 20 04:42:15 morph-03 kernel: Kernel panic: GFS: Assertion failed on line 318 of file trans.c
Nov 20 04:42:15 morph-03 kernel: GFS: assertion: "meta_check_magic == GFS_MAGIC"
Nov 20 04:42:15 morph-03 kernel: GFS: time = 1100947335
Nov 20 04:42:15 morph-03 kernel: GFS: fsid=morph-cluster:stripe-504K.0
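The five mh_* values in the dump can be packed back into the 24 bytes they were read from. The sketch below rests on two assumptions that this report does not state. First, the GFS 6.0 metadata header is laid out as u32 mh_magic, u32 mh_type, u64 mh_generation, u32 mh_format, u32 mh_incarn. Second, the printed values are big-endian on-disk fields converted to CPU order. Under those assumptions, every byte is printable ASCII. That suggests block 65130 held file data from the test load (the string contains "morph-03" and "doio") rather than a GFS metadata header. The report itself does not establish how that happened.

```python
import struct

# Values copied from the "Bad metadata at 65130" dump above.
mh_magic = 0x35303030
mh_type = 980250482
mh_generation = 8099773614866981999
mh_format = 1768892995
mh_incarn = 976303408


def repack_meta_header(magic, mtype, generation, fmt, incarn):
    """Pack the header fields big-endian, with no padding.

    Assumed layout (not taken from this report): u32 magic, u32 type,
    u64 generation, u32 format, u32 incarn, 24 bytes in total.
    """
    return struct.pack(">IIQII", magic, mtype, generation, fmt, incarn)


raw = repack_meta_header(mh_magic, mh_type, mh_generation, mh_format, mh_incarn)
print(len(raw))             # prints 24
print(raw.decode("ascii"))  # prints 5000:morph-03:doio*C:150
```

The dump from morph-01 further down shows the same five values, so both nodes read the same 24 bytes at block 65130.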



Shortly afterwards, morph-01 fences morph-03 and then hits the exact
same panic/assertion:

GFS: fsid=morph-cluster:stripe-504K.1: Joined cluster. Now mounting FS...
GFS: fsid=morph-cluster:stripe-504K.1: jid=1: Trying to acquire journal lock...
GFS: fsid=morph-cluster:stripe-504K.1: jid=1: Looking at journal...
GFS: fsid=morph-cluster:stripe-504K.1: jid=1: Done
lock_gulm: Checking for journals for node "morph-03.lab.msp.redhat.com"
GFS: fsid=morph-cluster:stripe-504K.1: jid=0: Trying to acquire journal lock...
GFS: fsid=morph-cluster:stripe-504K.1: jid=0: Busy
Bad metadata at 65130
  mh_magic = 0x35303030
  mh_type = 980250482
  mh_generation = 8099773614866981999
  mh_format = 1768892995
  mh_incarn = 976303408
Kernel panic: GFS: Assertion failed on line 318 of file trans.c
GFS: assertion: "meta_check_magic == GFS_MAGIC"
GFS: time = 1100947822
GFS: fsid=morph-cluster:stripe-504K.1
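The two "GFS: time =" values show how soon "shortly afterwards" was. This quick check assumes the field is seconds since the Unix epoch. On that basis, morph-03's value converts to 10:42:15 UTC. That matches its 04:42:15 syslog stamp if the nodes were on US Central time, which is also an assumption. morph-01 panicked 487 seconds (about 8 minutes) later.

```python
from datetime import datetime, timezone

# "GFS: time = ..." values from the two panic messages above.
PANIC_TIMES = {"morph-03": 1100947335, "morph-01": 1100947822}


def to_utc(epoch_seconds):
    """Interpret a GFS panic time as Unix epoch seconds (assumption)."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


for node, t in PANIC_TIMES.items():
    print(node, to_utc(t).isoformat())

gap = PANIC_TIMES["morph-01"] - PANIC_TIMES["morph-03"]
print("gap:", gap, "seconds")  # prints gap: 487 seconds
```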

How reproducible:
Sometimes

Comment 1 Corey Marthaler 2004-11-22 16:36:11 UTC
FWIW, morph-01 was the Gulm server

Comment 2 Corey Marthaler 2004-11-24 22:25:27 UTC
complete cmdlines:

iogen -f sync -i 30s -m random -s read,write,readv,writev -t 1b -T 40000b 40000b:rwransynclarge | doio -av

iogen -f sync -i 30s -m sequential -s read,write,readv,writev -t 1b -T 40000b 40000b:rwransynclarge | doio -av

iogen -f buffered -i 30s -m sequential -s read,write,readv,writev -t 1b -T 40000b 40000b:rwbuflarge | doio -av

You can reduce or increase the 40000b file size depending on the size of your GFS filesystem.
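To rerun the load at a different size, the three pipelines can be generated with the sizes as parameters. This is a sketch only. The helper names are hypothetical, and the iogen/doio flags are copied verbatim from the command lines above. Running the three pipelines concurrently is an assumption about how the load was applied. The original uses 40000b both for the file size and for the -T value, so both are exposed separately here rather than guessing how they relate.

```python
import subprocess

# (file type, access mode, file name) for each load, copied from the command lines above.
LOADS = [
    ("sync", "random", "rwransynclarge"),
    ("sync", "sequential", "rwransynclarge"),
    ("buffered", "sequential", "rwbuflarge"),
]


def build_pipelines(file_blocks=40000, max_xfer_blocks=40000):
    """Return the three 'iogen ... | doio -av' command lines (hypothetical helper)."""
    return [
        f"iogen -f {ftype} -i 30s -m {mode} -s read,write,readv,writev "
        f"-t 1b -T {max_xfer_blocks}b {file_blocks}b:{fname} | doio -av"
        for ftype, mode, fname in LOADS
    ]


def run_all(file_blocks=40000, max_xfer_blocks=40000):
    """Hypothetical driver: start all three pipelines at once and wait for them.

    Assumes iogen and doio are on PATH and that the current directory is on
    the GFS mount under test.
    """
    procs = [
        subprocess.Popen(cmd, shell=True)
        for cmd in build_pipelines(file_blocks, max_xfer_blocks)
    ]
    return [p.wait() for p in procs]
```

For example, `for cmd in build_pipelines(file_blocks=10000): print(cmd)` shows the command lines for a smaller filesystem without running anything.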

Comment 3 michael conrad tadpol tilstra 2004-11-29 16:39:05 UTC
For what it's worth, I've been playing with this setup and I don't get asserts, but the qla2200 driver sometimes dumps the following line to syslog:

Nov 29 10:34:08 va13 kernel: Invalid packet 21 count! 15



Comment 4 michael conrad tadpol tilstra 2004-11-29 20:01:23 UTC
Which FC cards do the morph machines have?

Comment 5 Corey Marthaler 2004-11-29 20:05:20 UTC
qla2300: morph-01, morph-02, morph-03, morph-06 
 
lpfc: morph-04, morph-05 

Comment 6 michael conrad tadpol tilstra 2004-11-29 20:35:10 UTC
Looking at qla2x00.c, that message is printed when there are more items in an array than the array is defined to hold. That's nice. And yet I'm not panicking? Weird.

I don't suppose you know whether your machines logged any messages like the one I posted above? (syslog might have caught them.)

Comment 7 Corey Marthaler 2004-11-29 20:46:05 UTC
It's possible, but the machines have been reimaged a few times since then, so everything in /var/log/messages from that run has long been lost. :(
 
Hopefully we'll reproduce this in our RHEL3 rpm testing coming up. 

Comment 8 michael conrad tadpol tilstra 2004-11-29 21:01:57 UTC
Reformatted my FC RAID to ext2, ran the iogen load, and got the same "Invalid packet" message. I'm really wondering whether this is actually a driver issue.

Will wait to see what your results are.


Comment 9 michael conrad tadpol tilstra 2004-11-29 21:49:57 UTC
For good measure, I took pool out of the path as well.  Still getting the
Invalid packet counts.

Comment 10 michael conrad tadpol tilstra 2004-11-29 22:04:35 UTC
Oh, for the tests I did with ext2, I was only using one node, so no cluster was needed. I used three filesystems, each 1/3T, and all three got the iogen load above.

Comment 11 michael conrad tadpol tilstra 2005-01-05 14:33:39 UTC
Waiting to see whether the QLogic driver on the morph nodes prints warnings or errors under the given load.

Comment 12 Cory Ranschau 2005-03-01 21:47:48 UTC
I have a qla2200 module loaded on a 2.4.21-27.0.2.ELsmp kernel that also gives me the message "Invalid packet 21 count! 15" when I write to the attached disk device. There don't appear to be any errors when writing, and performance is as expected. Any ideas?

Comment 14 Corey Marthaler 2005-10-11 22:18:57 UTC
Have not seen this bug in almost a year; will reopen if it is seen again.