Bug 128403 - kernel BUG at /usr/src/cluster/gfs-kernel/src/dlm/lock.c:388!
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2004-07-22 12:02 EDT by Derek Anderson
Modified: 2010-01-11 21:54 EST (History)
Doc Type: Bug Fix
Last Closed: 2005-03-02 09:20:02 EST

Description Derek Anderson 2004-07-22 12:02:31 EDT
Description of problem:
Running a 3-node cluster.  One node was running tar-untar operations
on the 2.6.7 source and the other was continuously mounting/umounting
the filesystem.  The third node was doing nothing.

The node running the IO tripped the following assertion.  I will put
full logs in ~danderso/bugs/<this_bug_#>.

lock_dlm:  Assertion failed on line 388 of file
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 1649496
data1: num=2,18 err=-22 cur=0 req=5 lkf=414

------------[ cut here ]------------
kernel BUG at /usr/src/cluster/gfs-kernel/src/dlm/lock.c:388!
invalid operand: 0000 [#1]
Modules linked in: gfs lock_dlm dlm cman lock_harness ipv6 parport_pc
lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd
ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<e03e1897>]    Not tainted
EFLAGS: 00010286   (2.6.7)
EIP is at do_dlm_lock+0x1b7/0x1d0 [lock_dlm]
eax: 00000001   ebx: ffffffea   ecx: 00000000   edx: c5309f24
esi: e03e1c30   edi: df74f238   ebp: c7b54958   esp: c5309f20
ds: 007b   es: 007b   ss: 0068
Process lock_dlm (pid: 3431, threadinfo=c5308000 task=c3c9b6b0)
Stack: e03e5a41 c678bf08 00000002 00000018 00000000 ffffffea 00000000
       00000414 20202020 32202020 20202020 20202020 20202020 38312020
       b11de200 c7b54958 df74f238 df74f268 c7b54958 e03e1c26 c3c9b858
Call Trace:
 [<e03e1c26>] process_submit+0x36/0x40 [lock_dlm]
 [<e03e4e4b>] dlm_async+0x16b/0x220 [lock_dlm]
 [<c0118850>] default_wake_function+0x0/0x10
 [<c0118850>] default_wake_function+0x0/0x10
 [<e03e4ce0>] dlm_async+0x0/0x220 [lock_dlm]
 [<c010429d>] kernel_thread_helper+0x5/0x18

Code: 0f 0b 84 01 d8 53 3e e0 c7 04 24 04 54 3e e0 e8 45 98 d3 df
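
As a reading aid for the dump above, here is a minimal sketch (not part of the original report) that pulls the fields out of the "data1:" line and checks that err=-22 and the ebx value ffffffea are the same number. The helper names are hypothetical. Treating lkf as hexadecimal is an assumption, although the stack word 00000414 is consistent with it.

```python
import errno
import re

# The debug line exactly as it appears in the report.
DEBUG_LINE = "data1: num=2,18 err=-22 cur=0 req=5 lkf=414"


def parse_debug_line(line):
    """Return the key=value fields of a lock_dlm debug line as strings."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def to_signed32(value):
    """Interpret an unsigned 32-bit register value as two's complement."""
    return value - (1 << 32) if value & 0x80000000 else value


fields = parse_debug_line(DEBUG_LINE)

err = int(fields["err"])                          # -22
err_name = errno.errorcode.get(-err, "unknown")   # symbolic errno name
ebx = to_signed32(0xFFFFFFEA)                     # ebx from the register dump
lkf = int(fields["lkf"], 16)                      # assumption: lkf is printed in hex

print(f"err={err} ({err_name}), ebx as signed={ebx}, lkf={lkf:#x}")
```

On Linux, errno 22 is EINVAL, so the lock request was rejected as invalid rather than failing for lack of resources.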

Comment 1 David Teigland 2004-07-23 02:18:09 EDT
This should now be fixed.  The key was "lkf=414", which shows two
incompatible flags being used together; that combination is what
triggers the assert.  The rest of the lock_dlm debug dump was also
useful in verifying what was happening.
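
To illustrate how "lkf=414" can be read as a set of flags, here is a minimal sketch that decodes the value bit by bit. The flag values are an assumption: they are taken from the upstream Linux DLM header (dlmconstants.h), and the 2004 cluster tree may have used different values. The comment does not name the incompatible pair, so the result is a candidate reading only. The function name decode_lkf is hypothetical.

```python
# Assumed flag values, taken from the upstream Linux DLM header
# (dlmconstants.h). The 2004 cluster tree may have used different values.
DLM_LKF_FLAGS = {
    0x0001: "DLM_LKF_NOQUEUE",
    0x0002: "DLM_LKF_CANCEL",
    0x0004: "DLM_LKF_CONVERT",
    0x0008: "DLM_LKF_VALBLK",
    0x0010: "DLM_LKF_QUECVT",
    0x0020: "DLM_LKF_IVVALBLK",
    0x0040: "DLM_LKF_CONVDEADLK",
    0x0080: "DLM_LKF_PERSISTENT",
    0x0100: "DLM_LKF_NODLCKWT",
    0x0200: "DLM_LKF_NODLCKBLK",
    0x0400: "DLM_LKF_EXPEDITE",
    0x0800: "DLM_LKF_NOQUEUEBAST",
}


def decode_lkf(value):
    """Return the names of the known flag bits set in value, lowest bit first."""
    names = [name for bit, name in sorted(DLM_LKF_FLAGS.items()) if value & bit]
    # Report any bits outside the table as a single hex remainder.
    unknown = value & ~sum(DLM_LKF_FLAGS)
    if unknown:
        names.append(hex(unknown))
    return names


# Assumption: the debug line prints lkf in hexadecimal.
names = decode_lkf(int("414", 16))
print(names)
```

Under these assumed values, 0x414 decodes to CONVERT, QUECVT and EXPEDITE. The upstream header documents EXPEDITE as not usable with conversions, which would be one plausible reading of the "two incompatible flags" and of the err=-22 result.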
Comment 2 Dean Jansa 2004-09-03 09:59:21 EDT
I ran this last evening and hit this on the node doing IO.  The bad
news is that there is no stack, just what little is left in
/var/log/messages.  The node reboots after that little gasp.
I will try to reproduce this again in hopes of getting some useful 
Sep  2 18:21:22 tank-01 kernel: CMAN: killed by STARTTRANS or 
Sep  2 18:21:22 tank-01 kernel: CMAN: we are leaving the cluster 
Sep  2 18:21:22 tank-01 kernel: Unable to handle kernel NULL pointer 
dereference at virtual address 00000004 
Sep  2 18:21:22 tank-01 kernel:  printing eip: 
Sep  2 18:21:22 tank-01 kernel: f8cf51a6 
Sep  2 18:21:22 tank-01 kernel: *pde = 00000000 
Comment 3 Kiersten (Kerri) Anderson 2004-11-16 14:09:53 EST
Updating version to the right level in the defects.  Sorry for the storm.
Comment 4 Derek Anderson 2005-03-02 09:20:02 EST
Verified with 2/28/2005 build.  Ran overnight with an additional node
running heavy traffic.
