Description of problem:
Create and mount a > ~4.2 TB filesystem. (Last time we saw this was on a 6.6 TB fs.) Do a df, and you'll get the following:

[root@tank-01 ~]# df -H
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                        17G  2.2G   14G  14% /
/dev/hda5              104M   13M   86M  13% /boot
none                   2.2G     0  2.2G   0% /dev/shm
/dev/hda1              104M   16M   83M  16% /rhel3boot

GFS: fsid=tank-cluster:gfs.0: fatal: invalid metadata block
GFS: fsid=tank-cluster:gfs.0:   bh = 1429076260 (magic)
GFS: fsid=tank-cluster:gfs.0:   function = gfs_rgrp_read
GFS: fsid=tank-cluster:gfs.0:   file = /usr/src/build/583472-i686/BUILD/smp/src/gfs/rgrp.c, line = 830
GFS: fsid=tank-cluster:gfs.0:   time = 1121354655
GFS: fsid=tank-cluster:gfs.0: about to withdraw from the cluster
GFS: fsid=tank-cluster:gfs.0: waiting for outstanding I/O
GFS: fsid=tank-cluster:gfs.0: telling LM to withdraw
lock_dlm: withdraw abandoned memory
GFS: fsid=tank-cluster:gfs.0: withdrawn
df: `/mnt/gfs0': Input/output error

If you mount with -o debug, the resulting stack looks like:

Mar  3 19:19:52 tank-02 kernel: ------------[ cut here ]------------
Mar  3 19:19:52 tank-02 kernel: kernel BUG at /usr/src/build/583472-i686/BUILD/smp/src/gfs/lm.c:190!
Mar  3 19:19:52 tank-02 kernel: invalid operand: 0000 [#1]
Mar  3 19:19:52 tank-02 kernel: SMP
Mar  3 19:19:52 tank-02 kernel: Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U) lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Mar  3 19:19:52 tank-02 kernel: CPU:    0
Mar  3 19:19:52 tank-02 kernel: EIP:    0060:[<f8cce3a3>]    Not tainted VLI
Mar  3 19:19:52 tank-02 kernel: EFLAGS: 00010202   (2.6.9-11.ELsmp)
Mar  3 19:19:52 tank-02 kernel: EIP is at gfs_lm_withdraw+0x51/0xc0 [gfs]
Mar  3 19:19:52 tank-02 kernel: eax: 0000003b   ebx: f8c8572c   ecx: f5a5cc74   edx: f8ce9f1f
Mar  3 19:19:52 tank-02 kernel: esi: f8c61000   edi: f5a6c800   ebp: ea093000   esp: f5a5cc88
Mar  3 19:19:52 tank-02 kernel: ds: 007b   es: 007b   ss: 0068
Mar  3 19:19:52 tank-02 kernel: Process df (pid: 8673, threadinfo=f5a5c000 task=f746e930)
Mar  3 19:19:52 tank-02 kernel: Stack: f8c8572c 00000004 f8ce69de f8c61000 f8ced4c9 f8c8572c f8c8572c 1dccf233
Mar  3 19:19:52 tank-02 kernel:        00000000 f8c8572c f8ce7c5f f8c8572c f8cec54c 0000033e f8c8572c 3c82cbb8
Mar  3 19:19:52 tank-02 kernel:        00000003 f8ce2836 f8cec54c 0000033e ea085a3c 00000005 f5696c60 f8c61000
Mar  3 19:19:52 tank-02 kernel: Call Trace:
Mar  3 19:19:52 tank-02 kernel:  [<f8ce69de>] gfs_meta_check_ii+0x2c/0x37 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8ce2836>] gfs_rgrp_read+0x132/0x214 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8cc4a73>] glock_wait_internal+0x168/0x1ef [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8cc4e58>] gfs_glock_nq+0xe3/0x116 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8cc53cb>] gfs_glock_nq_init+0x13/0x26 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8ce2a42>] gfs_rgrp_lvb_init+0x1a/0x38 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8ce51d7>] stat_gfs_sync+0x63/0xa0 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<f8ce524d>] gfs_stat_gfs+0x39/0x4e [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<c02111c3>] serial8250_start_tx+0x28/0x52
Mar  3 19:19:52 tank-02 kernel:  [<f8cdd427>] gfs_statfs+0x26/0xc7 [gfs]
Mar  3 19:19:52 tank-02 kernel:  [<c01542b5>] vfs_statfs+0x41/0x59
Mar  3 19:19:52 tank-02 kernel:  [<c01543ab>] vfs_statfs64+0xe/0x28
Mar  3 19:19:52 tank-02 kernel:  [<c01626e3>] __user_walk+0x4a/0x51
Mar  3 19:19:52 tank-02 kernel:  [<c01544b6>] sys_statfs64+0x52/0xb2
Mar  3 19:19:52 tank-02 kernel:  [<c01f1ac3>] tty_write+0x252/0x25c
Mar  3 19:19:52 tank-02 kernel:  [<c01f62f8>] write_chan+0x0/0x1d0
Mar  3 19:19:52 tank-02 kernel:  [<c01561e0>] vfs_write+0xda/0xe2
Mar  3 19:19:52 tank-02 kernel:  [<c0156286>] sys_write+0x3c/0x62
Mar  3 19:19:52 tank-02 kernel:  [<c02c7377>] syscall_call+0x7/0xb
Mar  3 19:19:52 tank-02 kernel: Code: ff 74 24 14 e8 66 36 45 c7 53 68 ed 9e ce f8 e8 4a 36 45 c7 53 68 1f 9f ce f8 e8 3f 36 45 c7 83 c4 18 83 be 34 02 00 00 00 74 08 <0f> 0b be 00 36 9e ce f8 8b 86 0c 47 02 00 85 c0 74 1c ba 02 00

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Does data written to a logical volume that large verify?
Mike looked at it, so reassigning.
Mike, can you update this with the latest testing/findings you have done, specifically the problems with the underlying volumes/storage when trying to access more than 4.2TB? Thanks, Kevin
I haven't done any testing. Three things. First, in the past this type of backtrace has typically been faulty hardware, a device driver, or some such. Second, it is odd that everything works when the volume is less than 4.2TB, but magic checks fail when it is over 4.2TB. Third, talking with aj, he mentioned various problems with lvm at that size (though we both thought that had been fixed and tested). Based on those points, I asked in comment #1 to have the lvm volume checked. What is needed right now is to determine whether this is gfs or not. A simple write/verify test on a raw lvm volume that is over 4.2TB will tell us if this is really a gfs problem or something deeper.
"verify-data" is an ideal tool for such a write/verify test: http://people.redhat.com/sct/src/verify-data/ The included man page is pretty self-explanatory. A full (skip==0) test will take quite a while on a 4.2TB volume, though.
O.k., I've tried a bunch of different configurations with the winchester, and none of them causes this. So I'm guessing that if this is a bug, it isn't dependent on size alone. When we saw the bug, it was with lvm creating a volume over multiple storage arrays, which also means that if this is a bug, it's probably in device-mapper, not GFS, since GFS doesn't care how the device is assembled, only what its size is. The best chance we have of fixing this bug is to go back to the original setup that caused it. So, whenever we have the free cycles to set that up, let me know and I'll look at this again.
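For reference, recreating that multi-array layout is just standard LVM2 setup; the sketch below uses hypothetical device names (/dev/sdb etc., one or more LUNs per array) and VG/LV names, and defaults to a dry run that only prints the commands:

```shell
#!/bin/sh
# Sketch for rebuilding a single LV that spans LUNs from multiple arrays.
# Device, VG, and LV names are hypothetical; set DRY_RUN=0 to actually run.
DRY_RUN=${DRY_RUN:-1}
run() { [ "$DRY_RUN" = 1 ] && echo "would run: $*" || "$@"; }

run pvcreate /dev/sdb /dev/sdc /dev/sdd
run vgcreate tankvg /dev/sdb /dev/sdc /dev/sdd
# Use all extents so the LV crosses array boundaries and the ~4.2TB mark.
run lvcreate -l 100%FREE -n tankvol tankvg
```

A raw write/verify pass on the resulting LV, before gfs_mkfs, would then show whether device-mapper or GFS is at fault.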
This is the same error message as in bz #175589, and apparently 175589 could be hit by running df. There aren't enough similarities to mark this as a duplicate, but when 175589 gets fixed, this one may just go away. Of course if we can never reproduce it, that is neither here nor there.
As far as I know, this hasn't been seen in a while. Since we can't reliably reproduce it, we don't even know what was broken, much less whether it has been fixed.