Description of problem:

I was running a regression load this weekend on the latest 6.0 rpms, GFS v6.0.0 (built Nov 18 2004 14:10:18) installed on kernel 2.4.21-26.ELsmp, i686, on morph-01 - morph-06, and morph-03 hit this panic/assertion.

I/O load at the time:

iogen -f sync -i 30s -m random -s read,write,readv,writev
iogen -f sync -i 30s -m sequential -s read,write,readv,writev
iogen -f buffered -i 30s -m sequential -s read,write,readv,writev

Nov 20 04:42:15 morph-03 kernel: Bad metadata at 65130
Nov 20 04:42:15 morph-03 kernel: mh_magic = 0x35303030
Nov 20 04:42:15 morph-03 kernel: mh_type = 980250482
Nov 20 04:42:15 morph-03 kernel: mh_generation = 8099773614866981999
Nov 20 04:42:15 morph-03 kernel: mh_format = 1768892995
Nov 20 04:42:15 morph-03 kernel: mh_incarn = 976303408
Nov 20 04:42:15 morph-03 kernel: Kernel panic: GFS: Assertion failed on line 318 of file trans.c
Nov 20 04:42:15 morph-03 kernel: GFS: assertion: "meta_check_magic == GFS_MAGIC"
Nov 20 04:42:15 morph-03 kernel: GFS: time = 1100947335
Nov 20 04:42:15 morph-03 kernel: GFS: fsid=morph-cluster:stripe-504K.0

Shortly afterwards, morph-01 fences morph-03 and then hits the exact same panic/assertion:

GFS: fsid=morph-cluster:stripe-504K.1: Joined cluster. Now mounting FS...
GFS: fsid=morph-cluster:stripe-504K.1: jid=1: Trying to acquire journal lock...
GFS: fsid=morph-cluster:stripe-504K.1: jid=1: Looking at journal...
GFS: fsid=morph-cluster:stripe-504K.1: jid=1: Done
lock_gulm: Checking for journals for node "morph-03.lab.msp.redhat.com"
GFS: fsid=morph-cluster:stripe-504K.1: jid=0: Trying to acquire journal lock...
GFS: fsid=morph-cluster:stripe-504K.1: jid=0: Busy
Bad metadata at 65130
mh_magic = 0x35303030
mh_type = 980250482
mh_generation = 8099773614866981999
mh_format = 1768892995
mh_incarn = 976303408
Kernel panic: GFS: Assertion failed on line 318 of file trans.c
GFS: assertion: "meta_check_magic == GFS_MAGIC"
GFS: time = 1100947822
GFS: fsid=morph-cluster:stripe-504K.1

How reproducible:
Sometimes
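For context, the assertion is GFS's metadata-header sanity check: every metadata block must begin with the GFS magic number, and here the block holds what looks like ASCII file data instead (0x35303030 read big-endian is the ASCII string "5000"). Below is a minimal userspace sketch of that style of check. The field names are taken from the panic output, but the struct layout, the GFS_MAGIC value, and check_meta() are illustrative assumptions, not the actual trans.c source.

/*
 * Hedged sketch of a metadata-magic sanity check like the one that
 * fires in trans.c. Field names come from the panic output; the
 * struct layout, GFS_MAGIC value, and check_meta() are illustrative
 * assumptions, not the actual GFS source.
 */
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

#define GFS_MAGIC 0x01161970u  /* assumed magic; the bad block held 0x35303030 instead */

struct gfs_meta_header {
    uint32_t mh_magic;       /* must be GFS_MAGIC for any valid metadata block */
    uint32_t mh_type;        /* metadata block type */
    uint64_t mh_generation;
    uint32_t mh_format;
    uint32_t mh_incarn;
};

/* Mirrors the "Bad metadata at ..." dump followed by the assertion panic. */
static void check_meta(const struct gfs_meta_header *mh, uint64_t blkno)
{
    if (mh->mh_magic != GFS_MAGIC) {
        fprintf(stderr, "Bad metadata at %" PRIu64 "\n", blkno);
        fprintf(stderr, "  mh_magic = 0x%" PRIx32 "\n", mh->mh_magic);
        fprintf(stderr, "  mh_type = %" PRIu32 "\n", mh->mh_type);
        fprintf(stderr, "  mh_generation = %" PRIu64 "\n", mh->mh_generation);
        fprintf(stderr, "  mh_format = %" PRIu32 "\n", mh->mh_format);
        fprintf(stderr, "  mh_incarn = %" PRIu32 "\n", mh->mh_incarn);
        abort();  /* the kernel equivalent is the GFS assertion panic */
    }
}

int main(void)
{
    /* Example using the values from the logged panic: a block whose
     * first bytes are ASCII "5000" -- file data where metadata was
     * expected. */
    struct gfs_meta_header bad = {
        .mh_magic = 0x35303030u,
        .mh_type = 980250482u,
        .mh_generation = 8099773614866981999ULL,
        .mh_format = 1768892995u,
        .mh_incarn = 976303408u,
    };
    check_meta(&bad, 65130);
    return 0;
}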
FWIW, morph-01 was the Gulm server
Complete cmdlines:

iogen -f sync -i 30s -m random -s read,write,readv,writev -t 1b -T 40000b 40000b:rwransynclarge | doio -av
iogen -f sync -i 30s -m sequential -s read,write,readv,writev -t 1b -T 40000b 40000b:rwransynclarge | doio -av
iogen -f buffered -i 30s -m sequential -s read,write,readv,writev -t 1b -T 40000b 40000b:rwbuflarge | doio -av

You can reduce/increase the 40000b filesize depending on your GFS size.
For what it's worth, I've been playing with this setup and I don't get asserts, but the qla2200 driver sometimes dumps the following line to syslog:

Nov 29 10:34:08 va13 kernel: Invalid packet 21 count! 15
Which FC card do the morph machines have?
qla2300: morph-01, morph-02, morph-03, morph-06
lpfc: morph-04, morph-05
Looking at the qla2x00.c file, that message is printed when there are more items in an array than the array is defined to hold. That's nice. And yet I'm not panicking? Weird. I don't suppose you know if there were any messages like the one I posted above on your machines? (syslog might have caught them.)
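To illustrate the guard just described: a minimal sketch of a count-versus-capacity check that warns and clamps, under the assumption that this is roughly what the driver is doing. All names, sizes, and message text here are made up for illustration; this is not the qla2x00.c code, and which of the two numbers in the real message is the request and which is the limit isn't clear from the message alone.

/*
 * Hedged sketch of a bounds guard: warn when a request carries more
 * entries than the receiving array can hold, then clamp. Illustrative
 * only -- not the qla2x00.c source.
 */
#include <stdio.h>

#define MAX_ENTRIES 15  /* assumed array capacity */

static int copy_entries(int dst[MAX_ENTRIES], const int *src, int count)
{
    if (count > MAX_ENTRIES) {
        /* analogous to the driver's "Invalid packet ... count!" line:
         * more items offered than the array is defined to hold */
        printf("invalid count %d! clamping to %d\n", count, MAX_ENTRIES);
        count = MAX_ENTRIES;
    }
    for (int i = 0; i < count; i++)
        dst[i] = src[i];
    return count;  /* number of entries actually copied */
}

int main(void)
{
    int src[21], dst[MAX_ENTRIES];
    for (int i = 0; i < 21; i++)
        src[i] = i;
    int copied = copy_entries(dst, src, 21);  /* triggers the warning */
    printf("copied %d entries\n", copied);
    return 0;
}

Under that reading the warning would be benign (the excess entries are simply not processed), which would fit the later observation that writes complete without errors and performance is normal.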
It's possible, but the machines have been reimaged a few times since then so all data in /var/log/messages has long been lost. :( Hopefully we'll reproduce this in our RHEL3 rpm testing coming up.
Reformatted my FC raid to ext2, ran the iogen load, and got the same Invalid packet message. I'm very much wondering if this is actually a driver issue. Will wait to see what your results are.
For good measure, I took pool out of the path as well. Still getting the Invalid packet counts.
Oh, for the tests I did with ext2, I was only using one node, so no cluster was needed. There were three filesystems, each 1/3 TB; all three got the iogen load above.
Waiting to see if the QLogic driver on the morph nodes is printing out warnings or errors under the given load.
I have a qla2200 module loaded on a 2.4.21-27.0.2.ELsmp kernel that is also giving me the message "Invalid packet 21 count! 15" when I write to the attached disk device. There don't appear to be any errors when writing, and performance is as expected. Any ideas?
Have not seen this bug in almost a year; will reopen if seen again.