Description of problem:
Xen virtual machine crashes with 0002744: kernel BUG at fs/gfs2/glock.c:1131!

Version-Release number of selected component (if applicable):
Using CentOS 5.1 for testing purposes:
kernel-xen-2.6.18-53.1.6.el5
kernel-headers-2.6.18-53.1.6.el5
gfs2-utils-0.1.38-1.el5
gfs-utils-0.1.12-1.el5
kmod-gfs-xen-0.1.19-7.el5_1.1
kmod-gfs2-xen-1.52-1.16.el5

How reproducible:
I set up a 6-node cluster, using gfs2 on 5 nodes (but it also happens when only one node is mounting the fs). The nodes are Xen virtual machines mounting an LVM partition with the "w!" directive. Under no load at all, a machine randomly crashes and I get this on the console:

kernel BUG at fs/gfs2/glock.c:1131!
invalid opcode: 0000 [0000001]
SMP
last sysfs file: /fs/gfs2/Mycluster:gfs2http/lock_module/recover_done
Modules linked in: lock_dlm gfs2 dlm configfs xennet ipv6 dm_multipath parport_pc lp parport pcspkr dm_snapshot dm_zero dm_mirror dm_mod xenblk ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU: 0
EIP: 0061:[<ee39a02f>] Not tainted VLI
EFLAGS: 00010296 (2.6.18-53.1.6.el5xen 0000001)
EIP is at gfs2_glock_nq+0xec/0x18e [gfs2]
eax: 00000020  ebx: ed781cf8  ecx: 00000001  edx: f5416000
esi: ed781cf8  edi: ebf4fc34  ebp: ebf4fc34  esp: ebf21de8
ds: 007b  es: 007b  ss: 0069
Process gfs2_quotad (pid: 2458, ti=ebf21000 task=ed31b000 task.ti=ebf21000)
Stack: ee3af190 00000002 00000003 ee3af183 0000099a ee3af190 00000002 00000003
       ee3af183 0000099a ebce1000 00000000 ebf4fc34 ebce1000 ec1fa364 ebcdd688
       ee3a99ed ed781cf8 00020050 c0669780 ed781cf8 00000001 ec8792ec 00000001
Call Trace:
 [<ee3a99ed>] gfs2_rindex_hold+0x31/0x188 [gfs2]
 [<ee399f32>] glock_wait_internal+0x1db/0x1ec [gfs2]
 [<c044d4af>] __alloc_pages+0x57/0x282
 [<ee3aa34b>] gfs2_inplace_reserve_i+0xa3/0x57d [gfs2]
 [<ee399f32>] glock_wait_internal+0x1db/0x1ec [gfs2]
 [<c042f749>] down_read+0x8/0x11
 [<ee39e61b>] gfs2_log_reserve+0x11a/0x171 [gfs2]
 [<ee3ad294>] gfs2_do_trans_begin+0xe3/0x119 [gfs2]
 [<ee3a7550>] do_sync+0x2bb/0x5aa [gfs2]
 [<ee39fe8c>] getbuf+0xfc/0x106 [gfs2]
 [<ee3a7395>] do_sync+0x100/0x5aa [gfs2]
 [<ee3a80eb>] gfs2_quota_sync+0x200/0x26a [gfs2]
 [<ee392624>] gfs2_quotad+0x0/0x12c [gfs2]
 [<ee3926cd>] gfs2_quotad+0xa9/0x12c [gfs2]
 [<c042cc71>] kthread+0xc0/0xeb
 [<c042cbb1>] kthread+0x0/0xeb
 [<c0403005>] kernel_thread_helper+0x5/0xb
=======================
Code: d2 8b 56 20 b8 b3 f1 3a ee e8 6b c4 09 d2 ff 76 0c 68 83 f1 3a ee e8 7b 34 08 d2 ff 77 20 ff 77 14 68 90 f1 3a ee e8 6b 34 08 d2 <0f> 0b 6b 04 87 ef 3a ee 83 c4 28 8b 5e 0c 8d 4f 48 8b 47 48 eb
EIP: [<ee39a02f>] gfs2_glock_nq+0xec/0x18e [gfs2] SS:ESP 0069:ebf21de8
<0>Kernel panic - not syncing: Fatal exception

It also seems that with a freshly made gfs2, before setting any quota or mounting with the "quota=on" option, I get:

original: gfs2_unlink+0x8c/0x160 [gfs2]
new: gfs2_unlink+0x8c/0x160 [gfs2]
------------[ cut here ]------------
kernel BUG at fs/gfs2/glock.c:1131!
invalid opcode: 0000 [0000001]
SMP
last sysfs file: /kernel/dlm/gfs2http/control
Modules linked in: lock_dlm gfs2 dlm configfs ipv6 xennet dm_mirror dm_multipath dm_mod parport_pc lp parport pcspkr xenblk ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU: 0
EIP: 0061:[<ee35f02f>] Not tainted VLI
EFLAGS: 00010296 (2.6.18-53.1.6.el5xen 0000001)
EIP is at gfs2_glock_nq+0xec/0x18e [gfs2]
eax: 00000020  ebx: ed39feec  ecx: 00000001  edx: f5416000
esi: ed39feec  edi: ecac79ac  ebp: ecac79ac  esp: ed39fe58
ds: 007b  es: 007b  ss: 0069
Process tar (pid: 5447, ti=ed39f000 task=c03e6000 task.ti=ed39f000)
Stack: ee374190 00000003 00000001 ee374183 00001547 ee374190 00000003 00000001
       ee374183 00001547 eccf1000 00000000 ed39fea4 00000000 ec029464 eaf941ac
       ee369715 eb325e60 eccf1000 ebfe890c ebfe890c ebfe88d4 00001547 00000001
Call Trace:
 [<ee369715>] gfs2_unlink+0xb8/0x160 [gfs2]
 [<ee3696a9>] gfs2_unlink+0x4c/0x160 [gfs2]
 [<ee3696c0>] gfs2_unlink+0x63/0x160 [gfs2]
 [<ee3696e9>] gfs2_unlink+0x8c/0x160 [gfs2]
 [<ee36e9e4>] gfs2_rindex_hold+0x28/0x188 [gfs2]
 [<c04738a1>] vfs_unlink+0xa3/0xd9
 [<c0475423>] do_unlinkat+0x85/0x10e
 [<c04079fe>] do_syscall_trace+0xab/0xb1
 [<c040534f>] syscall_call+0x7/0xb
=======================
Code: d2 8b 56 20 b8 b3 41 37 ee e8 6b 74 0d d2 ff 76 0c 68 83 41 37 ee e8 7b e4 0b d2 ff 77 20 ff 77 14 68 90 41 37 ee e8 6b e4 0b d2 <0f> 0b 6b 04 87 3f 37 ee 83 c4 28 8b 5e 0c 8d 4f 48 8b 47 48 eb
EIP: [<ee35f02f>] gfs2_glock_nq+0xec/0x18e [gfs2] SS:ESP 0069:ed39fe58
<0>Kernel panic - not syncing: Fatal exception

Then I tried to put some debug output in clusterfs.sh, and I get:

-----begin------
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=/dev/xvda1
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=proc no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=sysfs no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=devpts no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=tmpfs no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=/dev/mapper/VGhttp-LVhttp
Thu Mar 13 15:56:38 CET 2008 retval=0 di isMounted /dev/mapper/VGhttp-LVhttp /home
isAlive: OCF_CHECK_LEVEL=20 -lt 10?
isAlive: depth 10 test /home
isAlive: OCF_CHECK_LEVEL=20 -lt 20?
retval=0 di isAlive /home
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=/dev/xvda1
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=proc no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=sysfs no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=devpts no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=tmpfs no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=/dev/mapper/VGhttp-LVhttp
Thu Mar 13 15:56:48 CET 2008 retval=0 di isMounted /dev/mapper/VGhttp-LVhttp /home
isAlive: OCF_CHECK_LEVEL=10 -lt 10?
isAlive: depth 10 test /home
isAlive: OCF_CHECK_LEVEL=10 -lt 20?
retval=0 di isAlive /home
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=/dev/xvda1
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=proc no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=sysfs no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=devpts no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=tmpfs no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=none no real_device
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=/dev/mapper/VGhttp-LVhttp
Thu Mar 13 15:56:58 CET 2008 retval=0 di isMounted /dev/mapper/VGhttp-LVhttp /home
isAlive: OCF_CHECK_LEVEL=0 -lt 10?
retval=0 di isAlive /home
stop: force_umount=0
stop: clubufflush -f /dev/mapper/VGhttp-LVhttp
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=/dev/xvda1
isMounted: spec=/dev/mapper/VGhttp-LVhttp dev=/dev/mapper/VGhttp-LVhttp tmp_dev=proc
stop: force_umount=0
stop: clubufflush -f /dev/mapper/VGhttp-LVhttp
------end-----

Then the machine crashes (usually after about a minute) and reboots (thanks to fencing?).
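For context, the setup in the report can be sketched roughly as follows. The device path, lock table name (Mycluster:gfs2http), and mount point are taken from the traces and logs above; the journal count and exact mount options are my assumptions, not confirmed by the reporter.

```shell
# Rough sketch of the reported setup (not verbatim from the reporter).
# Lock table and device names come from the traces above; the journal
# count (-j 6, one per node) and mount options are assumptions.
mkfs.gfs2 -p lock_dlm -t Mycluster:gfs2http -j 6 /dev/mapper/VGhttp-LVhttp

# The first trace fired in gfs2_quotad, i.e. with quotas enabled at mount time:
mount -t gfs2 -o quota=on /dev/mapper/VGhttp-LVhttp /home
```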
I'm using the same bits that you're using (except that I've got RHEL and I'm not using Xen). I've tried a few things (including leaving the fs alone), but I haven't been able to reproduce this problem.

a) Is it possible for you to hit this on a freshly mkfs.gfs2'ed filesystem? Can you hit it on a standalone (mkfs'ed with -p lock_nolock) gfs2?

b) What operations do you run before you hit this BUG? The second stack trace you posted starts with an unlink operation, but the first one comes from gfs2_quotad. (These may be two similar bugs.)

I'll try to reproduce this and figure out what's going on, but it'd be ideal if you can hit it reliably and share the steps. Thanks!
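A standalone check along the lines of (a) could look like the sketch below; lock_nolock needs no cluster stack or lock table, so the fs can be exercised on a single node. The device path and mount point here are hypothetical.

```shell
# Standalone gfs2 test, no dlm/cluster involved.
# Device path and mount point are hypothetical placeholders.
mkfs.gfs2 -p lock_nolock -j 1 /dev/xvdb1
mkdir -p /mnt/gfs2test
mount -t gfs2 /dev/xvdb1 /mnt/gfs2test
```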
Hi, I went back to gfs (even if slower), and I have no spare server left to try with now. Anyway, could the problem be Xen? I never tried without Xen, but I saw this bug trace many times over one week. Could it be related to Xen or to some hardware I am using?

a) The second trace is from a freshly made mkfs.gfs2 filesystem, but I never tried with lock_nolock. I remember that this happened often, also with only one node mounting the filesystem. The first trace is from a filesystem mounted with quota=on, but the quotas were never initialized by me.

b) Actually, I did nothing. I believed the problem was the clusterfs.sh script, so I mounted the fs manually, but the problem was still there.
The glock code for RHEL 5.2 and up has been extensively cleaned up, so it's very unlikely that this still applies to more recent kernels. Please upgrade and let us know if it's still a problem.
I'm very busy right now and I must set up the same environment again. I believe I'll be able to recreate the scenario within the next week; please wait until then. Thank you.
It's not a problem if you can't look at this right away, but we need to leave the bug in NEEDINFO in the meantime so that it doesn't mess up our stats.
This was originally reported against CentOS 5.1. Since that code is very old indeed, I'd be very surprised if this bug still exists, and I've not seen anything similar reported recently. If we don't hear any more information in the next few weeks, we'll close this bug as cannot reproduce.