Description of problem:
During tests of bug 586006 I found a different problem:

Kernel OOPS:
GFS2: fsid=a_cluster:vedder0.2: fatal: invalid metadata block
GFS2: fsid=a_cluster:vedder0.2: bh = 35502140 (magic number)
GFS2: fsid=a_cluster:vedder0.2: function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 334
GFS2: fsid=a_cluster:vedder0.2: about to withdraw this file system
kernel BUG at fs/gfs2/lm.c:109!
pdflush[143]: bugcheck! 0 [1]

Version-Release number of selected component (if applicable):
2.6.18-194.3.1.el5

How reproducible:
20%
I could not reproduce this with a kernel that has the fix for bug 586006, so maybe it's somehow related. I haven't hit this without the quota=on option either.

Steps to Reproduce:
1. create cluster + gfs2 FS with -o quota=on option
2. run reproducer for a couple of mins
3. see the crash. Many times it will actually be the oops from bug 586006 instead.

Actual results:
oops, metadata corrupted

Expected results:
no oops

Additional info:
GFS2: fsid=a_cluster:vedder0.2: fatal: invalid metadata block
GFS2: fsid=a_cluster:vedder0.2: bh = 35502140 (magic number)
GFS2: fsid=a_cluster:vedder0.2: function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 334
GFS2: fsid=a_cluster:vedder0.2: about to withdraw this file system
kernel BUG at fs/gfs2/lm.c:109!
pdflush[143]: bugcheck! 0 [1]
Modules linked in: nfs fscache nfs_acl lock_dlm gfs2 dlm configfs autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api vfat fat dm_multipath scsi_dh wmi power_meter hwmon button parport_pc lp parport sg lpfc scsi_transport_fc ide_cd e1000 cdrom dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 143, CPU 1, comm: pdflush
psr : 00001010085a6010 ifs : 800000000000060f ip : [<a0000002031060d0>] Not tainted (2.6.18-194.3.1.el5)
ip is at gfs2_lm_withdraw+0x190/0x2a0 [gfs2]
unat: 0000000000000000 pfs : 000000000000060f rsc : 0000000000000003
rnat: a000000100b23668 bsps: 0000000000000004 pr : 000000000000a541
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000002031060d0 b6 : a000000100011000 b7 : a0000001002b1c00
f6 : 1003e00000000000000a0 f7 : 1003e20c49ba5e353f7cf
f8 : 1003e00000000000004e2 f9 : 1003e000000000fa00000
f10 : 1003e000000003b9aca00 f11 : 1003e431bde82d7b634db
r1 : a000000100c478a0 r2 : a000000100a60750 r3 : a00000010098c060
r8 : 0000000000000023 r9 : a000000100a60780 r10 : a000000100a60780
r11 : 0000000000000000 r12 : e0000001feb1fb20 r13 : e0000001feb18000
r14 : a000000100a60750 r15 : 0000000000000000 r16 : a00000010098c068
r17 : e000000106767e18 r18 : 0000000000000000 r19 : 0000000000000000
r20 : a000000100889300 r21 : a000000100a47f20 r22 : a000000100a60758
r23 : a000000100a60758 r24 : a00000010080d054 r25 : 0000000000000000
r26 : a00000010080d05c r27 : a00000010080d040 r28 : a00000010080c008
r29 : 0000063ff9c00000 r30 : 0000000000000000 r31 : 0000000000000000

Call Trace:
 [<a000000100013b40>] show_stack+0x40/0xa0
     sp=e0000001feb1f6b0 bsp=e0000001feb19498
 [<a000000100014470>] show_regs+0x870/0x8c0
     sp=e0000001feb1f880 bsp=e0000001feb19440
 [<a000000100037e20>] die+0x1c0/0x2c0
Created attachment 425683: reproducer
Created attachment 425684: metadata from broken FS
gfs2_fsck is not able to fix the filesystem:

gfs2_fsck -yvvvvvv /dev/vedder/vedder0
Initializing fsck
Initializing lists...
jid=0: Looking at journal...
jid=0: Journal is clean.
jid=1: Looking at journal...
jid=1: Journal is clean.
jid=2: Looking at journal...
jid=2: Replaying journal...
jid=2: Failed
Recovering journals (this may take a while) (initialize.c:401)
<backtrace> - initialize()
#
Logs from a2 show that the node was fenced and the journal replayed:

Jun 21 10:26:18 a2 fenced[4551]: fence "a1" success
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Trying to acquire journal lock...
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Looking at journal...
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Acquiring the transaction lock...
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Replaying journal...
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Replayed 5344 of 5345 blocks
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Found 1 revoke tags
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Journal replayed in 1s
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Done
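For anyone comparing replay results across nodes, here is a rough sketch (not part of the original report) that pulls the "Replayed X of Y blocks" figures out of syslog lines like the ones above. The helper name replay_summary and the regular expression are my own assumptions, based only on the message format seen in this log.

```python
import re

# Assumed pattern, derived from the "Replayed X of Y blocks" line in the log above.
REPLAY_RE = re.compile(
    r"GFS2: fsid=(?P<fsid>\S+): jid=(?P<jid>\d+): "
    r"Replayed (?P<done>\d+) of (?P<total>\d+) blocks"
)

def replay_summary(lines):
    """Return (fsid, jid, replayed, total) for every replay line found."""
    found = []
    for line in lines:
        m = REPLAY_RE.search(line)
        if m:
            found.append(
                (m["fsid"], int(m["jid"]), int(m["done"]), int(m["total"]))
            )
    return found

# Two lines copied from the a2 log above.
sample = [
    "Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Replaying journal...",
    "Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Replayed 5344 of 5345 blocks",
]
print(replay_summary(sample))
```

In practice the lines would come from /var/log/messages on each node rather than from a hard-coded list.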
I highly suspect this is a duplicate of Abhi's quota bug, but I'm reassigning the bug to him to make that assessment.
I believe this is a duplicate of bug 586008. I'm requesting needinfo so that the reporter can verify that this is the case. According to bug 586006, the fix went into 2.6.18-194.4.1.el5.
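A quick way to check whether a given node is already on a kernel at or past that release is sketched below. This is only an illustration: release_key and has_fix are hypothetical helper names, and the comparison is a plain numeric one on the fields before ".el", not a full RPM version comparison.

```python
import re

# Release named in the comment above as containing the fix.
FIXED_IN = "2.6.18-194.4.1.el5"

def release_key(release):
    """Turn '2.6.18-194.3.1.el5' into (2, 6, 18, 194, 3, 1)."""
    numeric_part = release.split(".el")[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric_part))

def has_fix(release, fixed_in=FIXED_IN):
    """True if 'release' is numerically at or past 'fixed_in'."""
    return release_key(release) >= release_key(fixed_in)

# The kernel from the original report predates the fix.
print(has_fix("2.6.18-194.3.1.el5"))
# The fixed release itself passes the check.
print(has_fix("2.6.18-194.4.1.el5"))
```

The string to test would normally come from `uname -r` on the node (with any architecture suffix stripped).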
I can't reproduce it on 2.6.18-194.8.1.el5 (current RHN 5.5), ia64. The symptoms were most probably related, so I'd suggest closing this as a duplicate; I will reopen it if it pops up again.
Closing as a duplicate of bug 586008, which has already been fixed.

*** This bug has been marked as a duplicate of bug 586008 ***