Description of problem:
Create and mount a > ~4.2 TB filesystem. (Last time we saw this was on a 6.6 TB fs.) Do a df, and you'll get the following:

[root@tank-01 ~]# df -H
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                        17G  2.2G   14G  14% /
/dev/hda5              104M   13M   86M  13% /boot
none                   2.2G     0  2.2G   0% /dev/shm
/dev/hda1              104M   16M   83M  16% /rhel3boot

GFS: fsid=tank-cluster:gfs.0: fatal: invalid metadata block
GFS: fsid=tank-cluster:gfs.0:   bh = 1429076260 (magic)
GFS: fsid=tank-cluster:gfs.0:   function = gfs_rgrp_read
GFS: fsid=tank-cluster:gfs.0:   file = /usr/src/build/583472-i686/BUILD/smp/src/gfs/rgrp.c, line = 830
GFS: fsid=tank-cluster:gfs.0:   time = 1121354655
GFS: fsid=tank-cluster:gfs.0: about to withdraw from the cluster
GFS: fsid=tank-cluster:gfs.0: waiting for outstanding I/O
GFS: fsid=tank-cluster:gfs.0: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=tank-cluster:gfs.0: withdrawn
df: `/mnt/gfs0': Input/output error

If you mount with -o debug, the resulting stack looks like:

Mar  3 19:19:52 tank-02 kernel: ------------[ cut here ]------------
Mar  3 19:19:52 tank-02 kernel: kernel BUG at /usr/src/build/583472-i686/BUILD/smp/src/gfs/lm.c:190!
Mar  3 19:19:52 tank-02 kernel: invalid operand: 0000 [#1]
Mar  3 19:19:52 tank-02 kernel: SMP
Mar  3 19:19:52 tank-02 kernel: Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Mar  3 19:19:52 tank-02 kernel: CPU:    0
Mar  3 19:19:52 tank-02 kernel: EIP:    0060:[<f8cce3a3>]    Not tainted VLI
Mar  3 19:19:52 tank-02 kernel: EFLAGS: 00010202   (2.6.9-11.ELsmp)
Mar  3 19:19:52 tank-02 kernel: EIP is at gfs_lm_withdraw+0x51/0xc0 [gfs]
Mar  3 19:19:52 tank-02 kernel: eax: 0000003b   ebx: f8c8572c   ecx: f5a5cc74   edx: f8ce9f1f
Mar  3 19:19:52 tank-02 kernel: esi: f8c61000   edi: f5a6c800   ebp: ea093000   esp: f5a5cc88
Mar  3 19:19:52 tank-02 kernel: ds: 007b   es: 007b   ss: 0068
Mar  3 19:19:52 tank-02 kernel: Process df (pid: 8673, threadinfo=f5a5c000 task=f746e930)
Mar  3 19:19:52 tank-02 kernel: Stack: f8c8572c 00000004 f8ce69de f8c61000 f8ced4c9 f8c8572c f8c8572c 1dccf233
Mar  3 19:19:52 tank-02 kernel:        00000000 f8c8572c f8ce7c5f f8c8572c f8cec54c 0000033e f8c8572c 3c82cbb8
Mar  3 19:19:52 tank-02 kernel:        00000003 f8ce2836 f8cec54c 0000033e ea085a3c 00000005 f5696c60 f8c61000
Mar  3 19:19:52 tank-02 kernel: Call Trace:
Mar  3 19:19:52 tank-02 kernel:  [<f8ce69de>] gfs_meta_check_ii+0x2c/0x37 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8ce2836>] gfs_rgrp_read+0x132/0x214 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8cc4a73>] glock_wait_internal+0x168/0x1ef [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8cc4e58>] gfs_glock_nq+0xe3/0x116 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8cc53cb>] gfs_glock_nq_init+0x13/0x26 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8ce2a42>] gfs_rgrp_lvb_init+0x1a/0x38 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8ce51d7>] stat_gfs_sync+0x63/0xa0 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8ce524d>] gfs_stat_gfs+0x39/0x4e [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<c02111c3>] serial8250_start_tx+0x28/0x52
Mar  3 19:19:52 tank-02 kernel:  [<f8cdd427>] gfs_statfs+0x26/0xc7 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<c01542b5>] vfs_statfs+0x41/0x59
Mar  3 19:19:52 tank-02 kernel:  [<c01543ab>] vfs_statfs64+0xe/0x28
Mar  3 19:19:52 tank-02 kernel:  [<c01626e3>] __user_walk+0x4a/0x51
Mar  3 19:19:52 tank-02 kernel:  [<c01544b6>] sys_statfs64+0x52/0xb2
Mar  3 19:19:52 tank-02 kernel:  [<c01f1ac3>] tty_write+0x252/0x25c
Mar  3 19:19:52 tank-02 kernel:  [<c01f62f8>] write_chan+0x0/0x1d0
Mar  3 19:19:52 tank-02 kernel:  [<c01561e0>] vfs_write+0xda/0xe2
Mar  3 19:19:52 tank-02 kernel:  [<c0156286>] sys_write+0x3c/0x62
Mar  3 19:19:52 tank-02 kernel:  [<c02c7377>] syscall_call+0x7/0xb
Mar  3 19:19:52 tank-02 kernel: Code: ff 74 24 14 e8 66 36 45 c7 53 68 ed 9e ce f8 e8 4a 36 45 c7 53 68 1f 9f ce f8 e8 3f 36 45 c7 83 c4 18 83 be 34 02 00 00 00 74 08 <0f> 0b be 00 36 9e ce f8 8b 86 0c 47 02 00 85 c0 74 1c ba 02 00

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Does data written to a logical volume that large verify?
Mike looked at it, so reassigning.
Mike, can you update this with the latest testing/findings you have done, specifically the problems with the underlying volumes/storage when trying to access more than 4.2TB? Thanks, Kevin
I haven't done any testing. Three things. First, in the past this type of backtrace has typically been faulty hardware, a device driver, or some such. Second, it is odd that everything works when the volume is less than 4.2TB, but magic checks fail when it is over 4.2TB. Third, talking with aj, he mentioned various problems with lvm at that size (though we both thought that had been fixed and tested). Based on those points, I asked in comment #1 to have the lvm volume checked. What is needed right now is to determine whether this is gfs or not. A simple write/verify test on a raw lvm volume that is over 4.2TB will tell us if this is really a gfs problem or something deeper.
"verify-data" is an ideal tool for such a write/verify test: http://people.redhat.com/sct/src/verify-data/ The included man page is pretty self-explanatory. A full (skip==0) test will take quite a while on a 4.2TB volume, though.
O.k., I've tried a bunch of different configurations with the winchester, and none of them causes this. So I'm guessing that if this is a bug, it isn't dependent on size alone. When we saw the bug, it was with lvm creating a volume over multiple storage arrays, which also means that if this is a bug, it's probably in device-mapper, not GFS, since GFS doesn't care how the device is assembled, only what its size is. The best chance we have of fixing this bug is to go back to the original setup that caused it. So, whenever we have the free cycles to set that up, let me know and I'll look at this again.
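For reference, recreating that multi-array layout is just standard LVM2 setup; the sketch below uses hypothetical device names (/dev/sdb etc., one or more LUNs per array) and VG/LV names, and defaults to a dry run that only prints the commands:

```shell
#!/bin/sh
# Sketch for rebuilding a single LV that spans LUNs from multiple arrays.
# Device, VG, and LV names are hypothetical; set DRY_RUN=0 to actually run.
DRY_RUN=${DRY_RUN:-1}
run() { [ "$DRY_RUN" = 1 ] && echo "would run: $*" || "$@"; }

run pvcreate /dev/sdb /dev/sdc /dev/sdd
run vgcreate tankvg /dev/sdb /dev/sdc /dev/sdd
# Use all extents so the LV crosses array boundaries and the ~4.2TB mark.
run lvcreate -l 100%FREE -n tankvol tankvg
```

A raw write/verify pass on the resulting LV, before gfs_mkfs, would then show whether device-mapper or GFS is at fault.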
This is the same error message as in bz #175589, and apparently 175589 could be hit by running df. There aren't enough similarities to mark this as a duplicate, but when 175589 gets fixed, this one may just go away. Of course if we can never reproduce it, that is neither here nor there.
As far as I know, this hasn't been seen in a while. Since we can't reliably reproduce it, we don't even know what was broken, much less whether it has been fixed.