Bug 437798

Summary: GFS2: xen virtual machine crash, gfs2 kernel bug on glock
Product: Red Hat Enterprise Linux 5
Reporter: Maurizio Rottin <maurizio.rottin>
Component: kernel
Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Docs Contact:
Priority: low
Version: 5.1
CC: cluster-maint, edamato, rpeterso, rwheeler, swhiteho
Target Milestone: rc
Target Release: ---
Hardware: i686
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-12-03 10:01:51 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version: ---
Embargoed:

Description Maurizio Rottin 2008-03-17 14:30:40 UTC
Description of problem:
xen virtual machine crashes with 0002744: kernel BUG at fs/gfs2/glock.c:1131!

Version-Release number of selected component (if applicable):
Using CentOS 5.1 for testing purposes
kernel-xen-2.6.18-53.1.6.el5
kernel-headers-2.6.18-53.1.6.el5
gfs2-utils-0.1.38-1.el5
gfs-utils-0.1.12-1.el5
kmod-gfs-xen-0.1.19-7.el5_1.1
kmod-gfs2-xen-1.52-1.16.el5


How reproducible:
I set up a 6-node cluster, using gfs2 on 5 nodes (but it also happens when only
one node is mounting the fs).
The nodes are xen virtual machines mounting an lvm partition with the "w!" directive
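For reference, a setup along these lines would look roughly as follows. This is a sketch, not the reporter's exact commands: the cluster/fs name (Mycluster:gfs2http) and the device path are taken from the logs below, but the journal count, guest device name, and mount point are assumptions.

```shell
# In the xen domU config, the LV is exported shared-writable via "w!"
# (xvdb is a hypothetical guest device name):
#   disk = [ 'phy:/dev/mapper/VGhttp-LVhttp,xvdb,w!' ]

# Create the clustered filesystem; -t must match <cluster>:<fsname>,
# -j gives one journal per mounting node (6 is an assumption here).
mkfs.gfs2 -p lock_dlm -t Mycluster:gfs2http -j 6 /dev/mapper/VGhttp-LVhttp

# Mount with quota enabled, as in the first trace below:
mount -t gfs2 -o quota=on /dev/mapper/VGhttp-LVhttp /home
```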

Under no load at all, a machine randomly crashes and I get this on the console:
kernel BUG at fs/gfs2/glock.c:1131!
invalid opcode: 0000 [0000001]
SMP
last sysfs file: /fs/gfs2/Mycluster:gfs2http/lock_module/recover_done
Modules linked in: lock_dlm gfs2 dlm configfs xennet ipv6 dm_multipath
parport_pc lp parport pcspkr dm_snapshot dm_zero dm_mirror dm_mod xenblk ext3
jbd ehci_hcd ohci_hcd uhci_hcd
CPU: 0
EIP: 0061:[<ee39a02f>] Not tainted VLI
EFLAGS: 00010296 (2.6.18-53.1.6.el5xen 0000001)
EIP is at gfs2_glock_nq+0xec/0x18e [gfs2]
eax: 00000020 ebx: ed781cf8 ecx: 00000001 edx: f5416000
esi: ed781cf8 edi: ebf4fc34 ebp: ebf4fc34 esp: ebf21de8
ds: 007b es: 007b ss: 0069
Process gfs2_quotad (pid: 2458, ti=ebf21000 task=ed31b000 task.ti=ebf21000)
Stack: ee3af190 00000002 00000003 ee3af183 0000099a ee3af190 00000002 00000003
       ee3af183 0000099a ebce1000 00000000 ebf4fc34 ebce1000 ec1fa364 ebcdd688
       ee3a99ed ed781cf8 00020050 c0669780 ed781cf8 00000001 ec8792ec 00000001
Call Trace:
 [<ee3a99ed>] gfs2_rindex_hold+0x31/0x188 [gfs2]
 [<ee399f32>] glock_wait_internal+0x1db/0x1ec [gfs2]
 [<c044d4af>] __alloc_pages+0x57/0x282
 [<ee3aa34b>] gfs2_inplace_reserve_i+0xa3/0x57d [gfs2]
 [<ee399f32>] glock_wait_internal+0x1db/0x1ec [gfs2]
 [<c042f749>] down_read+0x8/0x11
 [<ee39e61b>] gfs2_log_reserve+0x11a/0x171 [gfs2]
 [<ee3ad294>] gfs2_do_trans_begin+0xe3/0x119 [gfs2]
 [<ee3a7550>] do_sync+0x2bb/0x5aa [gfs2]
 [<ee39fe8c>] getbuf+0xfc/0x106 [gfs2]
 [<ee3a7395>] do_sync+0x100/0x5aa [gfs2]
 [<ee3a80eb>] gfs2_quota_sync+0x200/0x26a [gfs2]
 [<ee392624>] gfs2_quotad+0x0/0x12c [gfs2]
 [<ee3926cd>] gfs2_quotad+0xa9/0x12c [gfs2]
 [<c042cc71>] kthread+0xc0/0xeb
 [<c042cbb1>] kthread+0x0/0xeb
 [<c0403005>] kernel_thread_helper+0x5/0xb
 =======================
Code: d2 8b 56 20 b8 b3 f1 3a ee e8 6b c4 09 d2 ff 76 0c 68 83 f1 3a ee e8 7b 34
08 d2 ff 77 20 ff 77 14 68 90 f1 3a ee e8 6b 34 08 d2 <0f> 0b 6b 04 87 ef 3a ee
83 c4 28 8b 5e 0c 8d 4f 48 8b 47 48 eb
EIP: [<ee39a02f>] gfs2_glock_nq+0xec/0x18e [gfs2] SS:ESP 0069:ebf21de8
 <0>Kernel panic - not syncing: Fatal exception


and it seems that with a freshly made gfs2, before setting any quota or mounting
with the "quota=on" option, I get:
original: gfs2_unlink+0x8c/0x160 [gfs2]
new: gfs2_unlink+0x8c/0x160 [gfs2]
------------[ cut here ]------------
kernel BUG at fs/gfs2/glock.c:1131!
invalid opcode: 0000 [0000001]
SMP
last sysfs file: /kernel/dlm/gfs2http/control
Modules linked in: lock_dlm gfs2 dlm configfs ipv6 xennet dm_mirror dm_multipath
dm_mod parport_pc lp parport pcspkr xenblk ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU: 0
EIP: 0061:[<ee35f02f>] Not tainted VLI
EFLAGS: 00010296 (2.6.18-53.1.6.el5xen 0000001)
EIP is at gfs2_glock_nq+0xec/0x18e [gfs2]
eax: 00000020 ebx: ed39feec ecx: 00000001 edx: f5416000
esi: ed39feec edi: ecac79ac ebp: ecac79ac esp: ed39fe58
ds: 007b es: 007b ss: 0069
Process tar (pid: 5447, ti=ed39f000 task=c03e6000 task.ti=ed39f000)
Stack: ee374190 00000003 00000001 ee374183 00001547 ee374190 00000003 00000001
       ee374183 00001547 eccf1000 00000000 ed39fea4 00000000 ec029464 eaf941ac
       ee369715 eb325e60 eccf1000 ebfe890c ebfe890c ebfe88d4 00001547 00000001
Call Trace:
 [<ee369715>] gfs2_unlink+0xb8/0x160 [gfs2]
 [<ee3696a9>] gfs2_unlink+0x4c/0x160 [gfs2]
 [<ee3696c0>] gfs2_unlink+0x63/0x160 [gfs2]
 [<ee3696e9>] gfs2_unlink+0x8c/0x160 [gfs2]
 [<ee36e9e4>] gfs2_rindex_hold+0x28/0x188 [gfs2]
 [<c04738a1>] vfs_unlink+0xa3/0xd9
 [<c0475423>] do_unlinkat+0x85/0x10e
 [<c04079fe>] do_syscall_trace+0xab/0xb1
 [<c040534f>] syscall_call+0x7/0xb
 =======================
Code: d2 8b 56 20 b8 b3 41 37 ee e8 6b 74 0d d2 ff 76 0c 68 83 41 37 ee e8 7b e4
0b d2 ff 77 20 ff 77 14 68 90 41 37 ee e8 6b e4 0b d2 <0f> 0b 6b 04 87 3f 37 ee
83 c4 28 8b 5e 0c 8d 4f 48 8b 47 48 eb
EIP: [<ee35f02f>] gfs2_glock_nq+0xec/0x18e [gfs2] SS:ESP 0069:ed39fe58
 <0>Kernel panic - not syncing: Fatal exception 


Then I added some debug output to clusterfs.sh and got:
-----begin------
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=/dev/xvda1
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=proc
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=sysfs
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=devpts
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=tmpfs
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=/dev/mapper/VGhttp-LVhttp
Thu Mar 13 15:56:38 CET 2008
retval=0 di isMounted /dev/mapper/VGhttp-LVhttp /home
isAlive: OCF_CHECK_LEVEL=20 -lt 10?
isAlive: depth 10 test /home
isAlive: OCF_CHECK_LEVEL=20 -lt 20?
retval=0 di isAlive /home
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=/dev/xvda1
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=proc
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=sysfs
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=devpts
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=tmpfs
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=/dev/mapper/VGhttp-LVhttp
Thu Mar 13 15:56:48 CET 2008
retval=0 di isMounted /dev/mapper/VGhttp-LVhttp /home
isAlive: OCF_CHECK_LEVEL=10 -lt 10?
isAlive: depth 10 test /home
isAlive: OCF_CHECK_LEVEL=10 -lt 20?
retval=0 di isAlive /home
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=/dev/xvda1
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=proc
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=sysfs
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=devpts
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=tmpfs
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none
no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=/dev/mapper/VGhttp-LVhttp
Thu Mar 13 15:56:58 CET 2008
retval=0 di isMounted /dev/mapper/VGhttp-LVhttp /home
isAlive: OCF_CHECK_LEVEL=0 -lt 10?
retval=0 di isAlive /home
stop: force_umount=0
stop: clubufflush -f /dev/mapper/VGhttp-LVhttp
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp
tmp_dev=/dev/xvda1
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=proc
stop: force_umount=0
stop: clubufflush -f /dev/mapper/VGhttp-LVhttp
------end-----

Then the machine crashes (usually after about a minute) and reboots (thanks to fencing?).

Comment 1 Abhijith Das 2008-03-24 03:32:50 UTC
I'm using the same bits that you're using (except that I've got RHEL and I'm not
using xen). I've tried a few things (including leaving the fs alone), but
haven't been able to reproduce this problem.

a) Is it possible for you to hit this on a freshly mkfs.gfs2'ed filesystem? Can
you hit it on a standalone (mkfs'ed with -p lock_nolock) gfs2?
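The standalone variant suggested here can be sketched as follows; the device and mount point are placeholders, not paths from this report. With lock_nolock no cluster locking stack (dlm) is involved, so a crash there would narrow the bug down to gfs2 itself.

```shell
# Hypothetical single-node test with the nolock protocol; no cluster
# infrastructure is needed. /dev/vg0/gfs2test and /mnt/gfs2test are
# placeholder names.
mkfs.gfs2 -p lock_nolock -j 1 /dev/vg0/gfs2test
mount -t gfs2 /dev/vg0/gfs2test /mnt/gfs2test

# Exercise the unlink path seen in the second trace:
cp -r /usr/share/doc /mnt/gfs2test/ && rm -rf /mnt/gfs2test/doc
```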

b) What operations do you run until you hit this BUG? The second stack trace you
posted starts with an unlink operation, but the first one comes from
gfs2_quotad. (These may be two similar bugs.)

I'll try to reproduce this and figure out what's going on, but it'd be ideal if
you can hit it reliably and share the steps.

Thanks!

Comment 2 Maurizio Rottin 2008-03-25 10:15:05 UTC
Hi, I went back to gfs (even if it's slower), and I have no spare server to try with now.

Anyway, could the problem be xen? I never tried without xen, but I saw this bug
trace many times during one week. Could it be related to xen or to some
hardware I am using?

a) The second trace is from a freshly made gfs2 (mkfs.gfs2), but I never tried
with lock_nolock. I remember that this happened often, also with only one node
mounting the filesystem.
The first trace is from a filesystem mounted with quota=on, but the quota was
never initialized by me.


b) Actually I did nothing. I believed the problem was the clusterfs.sh script,
so I mounted the fs manually, but the problem was still there.



Comment 4 Steve Whitehouse 2008-06-30 11:44:30 UTC
The glock code for RHEL 5.2 and up has been extensively cleaned up, so it's
very unlikely that this still applies to more recent kernels. Please upgrade
and let us know if it's still a problem.


Comment 5 Maurizio Rottin 2008-06-30 19:09:26 UTC
I'm very busy right now and must set up the same environment again; I believe
I'll be able to recreate the scenario in the next week, please wait until then.
Thank you.

Comment 6 Steve Whitehouse 2008-07-09 13:55:16 UTC
It's not a problem if you can't look at this right away, but we need to leave the
bug in NEEDINFO in the meantime so that it doesn't mess up our stats.

Comment 8 Steve Whitehouse 2008-11-20 16:05:32 UTC
This was originally reported against CentOS 5.1; since that code is very old indeed, I'd be very surprised if this bug still exists. I've not seen anything similar reported recently.

If we don't hear any more information in the next few weeks, we'll close this bug as cannot reproduce.