Bug 128403 - kernel BUG at /usr/src/cluster/gfs-kernel/src/dlm/lock.c:388!
Summary: kernel BUG at /usr/src/cluster/gfs-kernel/src/dlm/lock.c:388!
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-07-22 16:02 UTC by Derek Anderson
Modified: 2010-01-12 02:54 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-03-02 14:20:02 UTC
Embargoed:


Description Derek Anderson 2004-07-22 16:02:31 UTC
Description of problem:
Running a 3-node cluster.  One node was running tar-untar operations
on the 2.6.7 source and a second was continuously mounting/umounting
the filesystem.  The third node was doing nothing.

The node running the IO tripped the following assertion.  I will put
full logs in ~danderso/bugs/<this_bug_#>.

lock_dlm:  Assertion failed on line 388 of file /usr/src/cluster/gfs-kernel/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 1649496
data1: num=2,18 err=-22 cur=0 req=5 lkf=414

------------[ cut here ]------------
kernel BUG at /usr/src/cluster/gfs-kernel/src/dlm/lock.c:388!
invalid operand: 0000 [#1]
Modules linked in: gfs lock_dlm dlm cman lock_harness ipv6 parport_pc
lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd
ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<e03e1897>]    Not tainted
EFLAGS: 00010286   (2.6.7)
EIP is at do_dlm_lock+0x1b7/0x1d0 [lock_dlm]
eax: 00000001   ebx: ffffffea   ecx: 00000000   edx: c5309f24
esi: e03e1c30   edi: df74f238   ebp: c7b54958   esp: c5309f20
ds: 007b   es: 007b   ss: 0068
Process lock_dlm (pid: 3431, threadinfo=c5308000 task=c3c9b6b0)
Stack: e03e5a41 c678bf08 00000002 00000018 00000000 ffffffea 00000000 00000005
       00000414 20202020 32202020 20202020 20202020 20202020 38312020 00000018
       b11de200 c7b54958 df74f238 df74f268 c7b54958 e03e1c26 c3c9b858 c5308000
Call Trace:
 [<e03e1c26>] process_submit+0x36/0x40 [lock_dlm]
 [<e03e4e4b>] dlm_async+0x16b/0x220 [lock_dlm]
 [<c0118850>] default_wake_function+0x0/0x10
 [<c0118850>] default_wake_function+0x0/0x10
 [<e03e4ce0>] dlm_async+0x0/0x220 [lock_dlm]
 [<c010429d>] kernel_thread_helper+0x5/0x18

Code: 0f 0b 84 01 d8 53 3e e0 c7 04 24 04 54 3e e0 e8 45 98 d3 df


Comment 1 David Teigland 2004-07-23 06:18:09 UTC
This should now be fixed.  The key was "lkf=414", which shows two
incompatible flags being used together; the resulting error (err=-22,
i.e. -EINVAL) is what trips the "!error" assert.  The rest of the
lock_dlm debug dump was also useful in verifying what was happening.
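
For readers who want to see which flags "lkf=414" might correspond to,
here is a minimal sketch that decodes the value as a bitmask.  It rests
on two assumptions that this report does not confirm: that lkf is
printed in hex, and that the DLM_LKF_* bit values match those in the
mainline Linux DLM headers (the 2004 cluster tree may have differed).
The helper name decode_lkf is hypothetical.

```python
# Assumption: bit values copied from the mainline Linux DLM flag
# definitions; the 2004 cluster tree may have used different values.
DLM_LKF = {
    0x00001: "DLM_LKF_NOQUEUE",
    0x00002: "DLM_LKF_CANCEL",
    0x00004: "DLM_LKF_CONVERT",
    0x00008: "DLM_LKF_VALBLK",
    0x00010: "DLM_LKF_QUECVT",
    0x00020: "DLM_LKF_IVVALBLK",
    0x00040: "DLM_LKF_CONVDEADLK",
    0x00080: "DLM_LKF_PERSISTENT",
    0x00100: "DLM_LKF_NODLCKWT",
    0x00200: "DLM_LKF_NODLCKBLK",
    0x00400: "DLM_LKF_EXPEDITE",
    0x00800: "DLM_LKF_NOQUEUEBAST",
    0x01000: "DLM_LKF_HEADQUE",
    0x02000: "DLM_LKF_NOORDER",
    0x04000: "DLM_LKF_ORPHAN",
    0x08000: "DLM_LKF_ALTPR",
    0x10000: "DLM_LKF_ALTCW",
    0x20000: "DLM_LKF_FORCEUNLOCK",
    0x40000: "DLM_LKF_TIMEOUT",
}


def decode_lkf(lkf):
    """Return the names of all known flag bits set in lkf, lowest bit first."""
    return [name for bit, name in sorted(DLM_LKF.items()) if lkf & bit]


if __name__ == "__main__":
    # Assumption: the "414" in the debug line is hexadecimal.
    lkf = int("414", 16)
    print("lkf=0x%x -> %s" % (lkf, " | ".join(decode_lkf(lkf))))
```

Under those assumptions the value decodes to CONVERT, QUECVT and
EXPEDITE.  In the mainline DLM, EXPEDITE is only valid on new NL-mode
requests, which would be consistent with the -EINVAL above, but the
report itself does not say which pair of flags the fix addressed.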

Comment 2 Dean Jansa 2004-09-03 13:59:21 UTC
I ran this last evening and hit this on the node doing IO.  The bad
news is that there is no stack, just what little is left in
/var/log/messages.  The node reboots after that little gasp.

I will try to reproduce this again in hopes of getting some useful
info.
 
Sep  2 18:21:22 tank-01 kernel: CMAN: killed by STARTTRANS or NOMINATE
Sep  2 18:21:22 tank-01 kernel: CMAN: we are leaving the cluster
Sep  2 18:21:22 tank-01 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000004
Sep  2 18:21:22 tank-01 kernel:  printing eip:
Sep  2 18:21:22 tank-01 kernel: f8cf51a6
Sep  2 18:21:22 tank-01 kernel: *pde = 00000000
 

Comment 3 Kiersten (Kerri) Anderson 2004-11-16 19:09:53 UTC
Updating version to the right level in the defects.  Sorry for the storm.

Comment 4 Derek Anderson 2005-03-02 14:20:02 UTC
Verified with 2/28/2005 build.  Ran overnight with an additional node
running heavy traffic.

