Bug 606447

Summary: GFS2 - kernel BUG at fs/gfs2/lm.c:109
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: low
Target Milestone: rc
Reporter: Jaroslav Kortus <jkortus>
Assignee: Abhijith Das <adas>
QA Contact: Cluster QE <mspqa-list>
CC: adas, bmarzins, rpeterso, swhiteho
Doc Type: Bug Fix
Last Closed: 2010-07-02 20:23:30 UTC
Attachments:
- reproducer (flags: none)
- metadata from broken FS (flags: none)

Description Jaroslav Kortus 2010-06-21 16:05:40 UTC
Description of problem:
During testing for bug 586006 I found a different problem:
Kernel oops:
GFS2: fsid=a_cluster:vedder0.2: fatal: invalid metadata block
GFS2: fsid=a_cluster:vedder0.2:   bh = 35502140 (magic number)
GFS2: fsid=a_cluster:vedder0.2:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 334
GFS2: fsid=a_cluster:vedder0.2: about to withdraw this file system
kernel BUG at fs/gfs2/lm.c:109!
pdflush[143]: bugcheck! 0 [1]


Version-Release number of selected component (if applicable):
2.6.18-194.3.1.el5

How reproducible:
20%
I could not reproduce this with a kernel containing the fix for bug 586006, so the two may be related. I haven't hit this without the quota=on option either.

Steps to Reproduce:
1. Create a cluster and a GFS2 filesystem mounted with the -o quota=on option (see the sketch after this list)
2. Run the reproducer for a couple of minutes
3. Observe the crash. In many runs it is actually the oops from bug 586006.
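(For step 1, a minimal sketch of the setup, using the names that appear in the logs below: lock table a_cluster:vedder0, device /dev/vedder/vedder0, three journals; /mnt/vedder0 is an assumed mount point, and the actual workload is the attached reproducer:)

# make the GFS2 filesystem on the shared logical volume (names taken from the logs)
mkfs.gfs2 -p lock_dlm -t a_cluster:vedder0 -j 3 /dev/vedder/vedder0
# mount it on each cluster node with quota enforcement enabled
mount -t gfs2 -o quota=on /dev/vedder/vedder0 /mnt/vedder0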
  
Actual results:
oops, metadata corrupted

Expected results:
no oops

Additional info:
 GFS2: fsid=a_cluster:vedder0.2: fatal: invalid metadata block
GFS2: fsid=a_cluster:vedder0.2:   bh = 35502140 (magic number)
GFS2: fsid=a_cluster:vedder0.2:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 334
GFS2: fsid=a_cluster:vedder0.2: about to withdraw this file system
kernel BUG at fs/gfs2/lm.c:109!
pdflush[143]: bugcheck! 0 [1]
Modules linked in: nfs fscache nfs_acl lock_dlm gfs2 dlm configfs autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api vfat fat dm_multipath scsi_dh wmi power_meter hwmon button parport_pc lp parport sg lpfc scsi_transport_fc ide_cd e1000 cdrom dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 143, CPU 1, comm:              pdflush
psr : 00001010085a6010 ifs : 800000000000060f ip  : [<a0000002031060d0>]    Not tainted (2.6.18-194.3.1.el5)
ip is at gfs2_lm_withdraw+0x190/0x2a0 [gfs2]
unat: 0000000000000000 pfs : 000000000000060f rsc : 0000000000000003
rnat: a000000100b23668 bsps: 0000000000000004 pr  : 000000000000a541
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000002031060d0 b6  : a000000100011000 b7  : a0000001002b1c00
f6  : 1003e00000000000000a0 f7  : 1003e20c49ba5e353f7cf
f8  : 1003e00000000000004e2 f9  : 1003e000000000fa00000
f10 : 1003e000000003b9aca00 f11 : 1003e431bde82d7b634db
r1  : a000000100c478a0 r2  : a000000100a60750 r3  : a00000010098c060
r8  : 0000000000000023 r9  : a000000100a60780 r10 : a000000100a60780
r11 : 0000000000000000 r12 : e0000001feb1fb20 r13 : e0000001feb18000
r14 : a000000100a60750 r15 : 0000000000000000 r16 : a00000010098c068
r17 : e000000106767e18 r18 : 0000000000000000 r19 : 0000000000000000
r20 : a000000100889300 r21 : a000000100a47f20 r22 : a000000100a60758
r23 : a000000100a60758 r24 : a00000010080d054 r25 : 0000000000000000
r26 : a00000010080d05c r27 : a00000010080d040 r28 : a00000010080c008
r29 : 0000063ff9c00000 r30 : 0000000000000000 r31 : 0000000000000000
Call Trace:
 [<a000000100013b40>] show_stack+0x40/0xa0
                                sp=e0000001feb1f6b0 bsp=e0000001feb19498
 [<a000000100014470>] show_regs+0x870/0x8c0
                                sp=e0000001feb1f880 bsp=e0000001feb19440
 [<a000000100037e20>] die+0x1c0/0x2c0

Comment 1 Jaroslav Kortus 2010-06-21 16:06:29 UTC
Created attachment 425683 [details]
reproducer

Comment 2 Jaroslav Kortus 2010-06-21 16:07:41 UTC
Created attachment 425684 [details]
metadata from broken FS
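(The attachment is a metadata-only image; such images are typically captured and re-created with gfs2_edit's savemeta/restoremeta, roughly as sketched here, assuming the device path used elsewhere in this bug and a hypothetical scratch device:)

# capture the filesystem metadata (no file data) into a portable image
gfs2_edit savemeta /dev/vedder/vedder0 vedder0.meta
# re-create the metadata on a scratch device for offline analysis
gfs2_edit restoremeta vedder0.meta /dev/scratch/vedder0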

Comment 3 Jaroslav Kortus 2010-06-21 16:09:14 UTC
gfs2_fsck is not able to fix the filesystem:

gfs2_fsck -yvvvvvv /dev/vedder/vedder0 
Initializing fsck
Initializing lists...
jid=0: Looking at journal...
jid=0: Journal is clean.
jid=1: Looking at journal...
jid=1: Journal is clean.
jid=2: Looking at journal...
jid=2: Replaying journal...
jid=2: Failed
Recovering journals (this may take a while)
(initialize.c:401)      <backtrace> - initialize()
#

Comment 4 Jaroslav Kortus 2010-06-21 16:13:15 UTC
Logs from a2 show that the node was fenced and its journal was replayed:

Jun 21 10:26:18 a2 fenced[4551]: fence "a1" success 
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Trying to acquire journal lock...
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Looking at journal...
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Acquiring the transaction lock...
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Replaying journal...
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Replayed 5344 of 5345 blocks
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Found 1 revoke tags
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Journal replayed in 1s
Jun 21 10:26:18 a2 kernel: GFS2: fsid=a_cluster:vedder0.0: jid=2: Done

Comment 5 Robert Peterson 2010-06-22 15:45:12 UTC
I strongly suspect this is a duplicate of Abhi's quota bug, so I'm reassigning it to him to make that assessment.

Comment 7 Abhijith Das 2010-07-02 12:50:54 UTC
I believe this is a duplicate of bug 586008. I'm requesting needinfo so the reporter can verify that this is the case. According to bug 586006, the fix went into 2.6.18-194.4.1.el5.

Comment 8 Jaroslav Kortus 2010-07-02 19:16:02 UTC
I can't reproduce it on 2.6.18-194.8.1.el5 (current RHN 5.5), ia64. These symptoms were most probably related, so I'd suggest closing this as a duplicate; I will reopen it if it pops up again.

Comment 9 Abhijith Das 2010-07-02 20:23:30 UTC
Closing as a duplicate of bug 586008, which has already been fixed.

*** This bug has been marked as a duplicate of bug 586008 ***