Description of problem:
Running a 3-node cluster. One node was running tar/untar operations on the 2.6.7 source tree, another was continuously mounting/unmounting the filesystem, and the third node was idle. The node running the I/O tripped the following assertion. I will put full logs in ~danderso/bugs/<this_bug_#>.

lock_dlm: Assertion failed on line 388 of file /usr/src/cluster/gfs-kernel/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 1649496
data1: num=2,18 err=-22 cur=0 req=5 lkf=414

------------[ cut here ]------------
kernel BUG at /usr/src/cluster/gfs-kernel/src/dlm/lock.c:388!
invalid operand: 0000 [#1]
Modules linked in: gfs lock_dlm dlm cman lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<e03e1897>]    Not tainted
EFLAGS: 00010286   (2.6.7)
EIP is at do_dlm_lock+0x1b7/0x1d0 [lock_dlm]
eax: 00000001   ebx: ffffffea   ecx: 00000000   edx: c5309f24
esi: e03e1c30   edi: df74f238   ebp: c7b54958   esp: c5309f20
ds: 007b   es: 007b   ss: 0068
Process lock_dlm (pid: 3431, threadinfo=c5308000 task=c3c9b6b0)
Stack: e03e5a41 c678bf08 00000002 00000018 00000000 ffffffea 00000000 00000005
       00000414 20202020 32202020 20202020 20202020 20202020 38312020 00000018
       b11de200 c7b54958 df74f238 df74f268 c7b54958 e03e1c26 c3c9b858 c5308000
Call Trace:
 [<e03e1c26>] process_submit+0x36/0x40 [lock_dlm]
 [<e03e4e4b>] dlm_async+0x16b/0x220 [lock_dlm]
 [<c0118850>] default_wake_function+0x0/0x10
 [<c0118850>] default_wake_function+0x0/0x10
 [<e03e4ce0>] dlm_async+0x0/0x220 [lock_dlm]
 [<c010429d>] kernel_thread_helper+0x5/0x18
Code: 0f 0b 84 01 d8 53 3e e0 c7 04 24 04 54 3e e0 e8 45 98 d3 df

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This should now be fixed. The key was "lkf=414", which shows two incompatible flags being used together, causing the assert. The rest of the lock_dlm debug dump was also useful in verifying what was happening.
I ran this last evening and hit it again on the node doing I/O. The bad news is there is no stack trace, just what little is left in /var/log/messages; the node reboots after that little gasp. I will try to reproduce this again in hopes of getting more useful info.

Sep  2 18:21:22 tank-01 kernel: CMAN: killed by STARTTRANS or NOMINATE
Sep  2 18:21:22 tank-01 kernel: CMAN: we are leaving the cluster
Sep  2 18:21:22 tank-01 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000004
Sep  2 18:21:22 tank-01 kernel:  printing eip:
Sep  2 18:21:22 tank-01 kernel: f8cf51a6
Sep  2 18:21:22 tank-01 kernel: *pde = 00000000
Updating version to the right level in the defects. Sorry for the storm.
Verified with the 2/28/2005 build. Ran overnight with an additional node generating heavy traffic.