Bug 216113 - dlm: add_to_waiters error 1
Status: CLOSED NEXTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: David Teigland
QA Contact: Cluster QE
Reported: 2006-11-17 06:04 EST by Christine Caulfield
Modified: 2007-11-30 17:07 EST
CC: 3 users

Doc Type: Bug Fix
Last Closed: 2007-05-14 09:55:05 EDT


Attachments
patch posted to rhkernel on 04/19 (959 bytes, text/plain), 2007-04-19 13:27 EDT, David Teigland
patch posted to rhkernel on 04/19 (40.28 KB, text/plain), 2007-04-19 13:31 EDT, David Teigland

Description Christine Caulfield 2006-11-17 06:04:38 EST
Description of problem:
Running two copies of "flood", then stopping them, results in the message
"dlm: add_to_waiters error 1" being sent endlessly to the console.


Version-Release number of selected component (if applicable):
gfs2-2.6-fixes as of 17th November 2006

How reproducible:
Easily

Steps to Reproduce:
1. Start up a 2 node cluster
2. Compile flood from the STABLE branch: dlm/tests/usertest/flood.c
3. Run it on two nodes as ./flood -i10000
4. Stop one, then stop the other

Actual results:
One node will endlessly emit the above message to the console.

Expected results:
No message

Additional info:
Comment 1 Christine Caulfield 2006-11-17 08:22:20 EST
After a while the add_to_waiters messages seem to stop and the node
oopses like this:

BUG: unable to handle kernel paging request at virtual address cae53f8c
 printing eip:
c01f9982
*pde = 0002c067
*pte = 0ae53000
Oops: 0000 [#1]
SMP DEBUG_PAGEALLOC
Modules linked in: lock_nolock lock_dlm dlm gfs2 configfs sctp ipv6
dm_round_robin iscsi_tcp libiscsi scsi_transport_iscsi dm_multipath
CPU:    0
EIP:    0060:[<c01f9982>]    Not tainted VLI
EFLAGS: 00010286   (2.6.19-rc4 #4)
EIP is at _raw_spin_lock+0xc/0x10b
eax: cbd0d5b0   ebx: cae53f88   ecx: 00000000   edx: 00000000
esi: cc045c30   edi: cae53f88   ebp: cbd27d5c   esp: cbd27d30
ds: 007b   es: 007b   ss: 0068
Process dlm_recvd (pid: 3900, ti=cbd26000 task=cbd0d5b0 task.ti=cbd26000)
Stack: cbd27d4c 00000246 cc045c30 cae53f78 cbd27d5c 00000246 00000000 00000002
       cae53f88 cc045c30 cae53f78 cbd27d80 c032906e cae53f88 00000000 00000000
       00000000 00000002 e02ec45c c8dc9998 cbd27db0 e02ec45c 00000002 00000001
Call Trace:
 [<c0103d85>] show_trace_log_lvl+0x26/0x3c
 [<c0103e38>] show_stack_log_lvl+0x9d/0xa5
 [<c01041e8>] show_registers+0x1af/0x249
 [<c01045a9>] die+0x1dd/0x2c6
 [<c0111932>] do_page_fault+0x488/0x562
 [<c0329841>] error_code+0x39/0x40
 [<c032906e>] _spin_lock+0x43/0x49
 [<e02ec45c>] dlm_user_add_ast+0xb0/0x199 [dlm]
 [<e02e008c>] dlm_add_ast+0x1f/0xdd [dlm]
 [<e02e1e82>] queue_cast+0xdb/0xe1 [dlm]
 [<e02e449a>] _receive_unlock_reply+0x52/0x86 [dlm]
 [<e02e68cf>] dlm_receive_message+0x64e/0xb7f [dlm]
 [<e02ea10e>] dlm_process_incoming_buffer+0x121/0x1a3 [dlm]
Comment 2 Kiersten (Kerri) Anderson 2006-12-05 12:57:48 EST
Further investigation shows that an extraordinary load is required to
trigger the problem.  Deferring this defect until 5.1 as our customers
are unlikely to hit it.
Comment 3 David Teigland 2006-12-12 16:45:45 EST
When an lkb is on the waiters list for some remote operation, a
force-unlock or a cancel causes problems.  We end up with two
outstanding remote operations on the lkb: the original operation,
plus the unlock/cancel as the second.  The dlm can't deal with two
outstanding remote ops on an lkb.  An obvious sign that this is
happening is the "add_to_waiters error" message, printed when the
dlm tries to add the lkb to the waiters list a second time.
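
To make that failure mode concrete, below is a minimal userspace model of
the one-outstanding-operation rule.  It is a sketch only, not the
fs/dlm/lock.c code: the names mirror the kernel's, and the message-type
values are assumptions, with DLM_MSG_REQUEST = 1 chosen to match the
"error 1" seen on the console.

/*
 * Userspace illustration of the waiters-list rule described above.
 * Not the kernel source; names and values are illustrative.
 */
#include <stdio.h>

enum { DLM_MSG_REQUEST = 1, DLM_MSG_UNLOCK = 3 };  /* assumed values */

struct lkb {
	int wait_type;  /* non-zero while a reply from a remote node is pending */
};

/* Only one remote operation may be outstanding per lkb at a time. */
static int add_to_waiters(struct lkb *lkb, int mstype)
{
	if (lkb->wait_type) {
		printf("dlm: add_to_waiters error %d\n", lkb->wait_type);
		return -1;
	}
	lkb->wait_type = mstype;
	return 0;
}

int main(void)
{
	struct lkb lkb = { 0 };

	add_to_waiters(&lkb, DLM_MSG_REQUEST); /* remote request sent, reply pending */
	add_to_waiters(&lkb, DLM_MSG_UNLOCK);  /* forced unlock arrives: rejected */
	return 0;
}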
Comment 4 Rob Kenna 2006-12-13 09:49:14 EST
Given this further understanding, does it make sense to move a fix back into 5.0?
Comment 5 David Teigland 2006-12-13 10:01:42 EST
We never came up with any good ideas on how to fix it--it's going
to be difficult, so it should remain a 5.1 project.  It's in the
same ballpark as two other difficult problems on my list.  It may
be that all three get solved together.

One of the other issues that's definitely related is:
- dlm: force unlock (or cancel?) of an lkb that's still doing lookup
  e.g. unlock_proc_lock/_unlock_lock/is_remote assertion failure nodeid -1
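
As a hedged illustration of that related failure, here is a small
userspace model of the lookup race (again not the kernel helper itself):
res_nodeid is taken to be -1 while the directory lookup for a resource is
still in flight, so forcing an unlock or cancel through the
remote-vs-local check before the lookup reply arrives trips an assertion,
matching the "nodeid -1" signature in comment 10 below.

/*
 * Userspace sketch of the lookup race; assumptions: res_nodeid is -1
 * while the lookup is in progress, 0 for a local master, >0 for a
 * remote master node.
 */
#include <assert.h>

struct rsb {
	int res_nodeid;  /* -1: lookup in progress, 0: local, >0: remote node */
};

static int is_remote(const struct rsb *r)
{
	assert(r->res_nodeid >= 0);  /* fires if called before the lookup reply */
	return r->res_nodeid != 0;
}

int main(void)
{
	struct rsb r = { .res_nodeid = -1 };  /* unlock/cancel forced mid-lookup */
	return is_remote(&r);                 /* trips the assertion */
}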
Comment 6 David Teigland 2007-03-28 11:12:15 EDT
The big patch for this is done and sent upstream.  Will send to rhkernel soon.
Comment 7 David Teigland 2007-04-19 13:27:41 EDT
Created attachment 153030 [details]
patch posted to rhkernel on 04/19
Comment 8 David Teigland 2007-04-19 13:31:59 EDT
Created attachment 153032 [details]
patch posted to rhkernel on 04/19

(attached the wrong patch above)
Comment 9 Don Zickus 2007-05-01 14:07:21 EDT
in 2.6.18-17.el5
Comment 10 Corey Marthaler 2007-05-14 15:42:45 EDT
Just a note that I saw the following assert in 2.6.18-8.1.4.el5...


DLM:  Assertion failed on line 238 of file fs/dlm/lock.c
DLM:  assertion:  "r->res_nodeid >= 0"
DLM:  time = 4305123101
dlm: closing connection to node 4
dlm: closing connection to node 3
dlm: closing connection to node 2
dlm: closing connection to node 1
rsb: nodeid -1 flags 0 first 100e7 rlc 0 name V_VolGroup00

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/dlm/lock.c:238
invalid opcode: 0000 [1] SMP
last sysfs file: /block/ram0/dev
CPU 0
Modules linked in: gnbd(U) lock_nolock gfs(U) lock_dlm gfs2 dlm configfs autofs4
hidp rfcomm l2cap blud
Pid: 9768, comm: clvmd Not tainted 2.6.18-8.1.4.el5 #1
RIP: 0010:[<ffffffff88466843>]  [<ffffffff88466843>] :dlm:unlock_lock+0x7d/0xd1
RSP: 0018:ffff81020ebdbe08  EFLAGS: 00010286
RAX: 0000000000000004 RBX: 0000000000000000 RCX: ffffffff8042d520
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802da65c
RBP: ffff8102173c67c0 R08: 000000000000000d R09: 000000000000003c
R10: ffffffff803b7480 R11: 0000000000000000 R12: ffff810204dfaec0
R13: ffff8102173c67d0 R14: ffff810210818800 R15: ffff810210818c80
FS:  00002aaaaaabc2c0(0000) GS:ffffffff8038a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000006990d8 CR3: 000000020a046000 CR4: 00000000000006e0
Process clvmd (pid: 9768, threadinfo ffff81020ebda000, task ffff81021c188080)
Stack:  ffff81020a3270c0 ffff81020ebdbe48 ffff810204dfaec0 ffff8102031650c0
 ffff81020a327060 ffffffff884677c4 ffff81020a3270c0 ffff810210818bf0
 0000000000020000 ffff81021c188080 ffff8102031650c0 ffff81020ebdbdf0
Call Trace:
 [<ffffffff884677c4>] :dlm:dlm_clear_proc_locks+0x140/0x192
 [<ffffffff8846d33b>] :dlm:device_close+0x55/0x99
 [<ffffffff80012281>] __fput+0xae/0x198
 [<ffffffff80023647>] filp_close+0x5c/0x64
 [<ffffffff8001d5cd>] sys_close+0x88/0xa2
 [<ffffffff8005b2c1>] tracesys+0xd1/0xdc


Code: 0f 0b 68 a4 0e 47 88 c2 ee 00 48 c7 c7 0e 0f 47 88 31 c0 e8
RIP  [<ffffffff88466843>] :dlm:unlock_lock+0x7d/0xd1
 RSP <ffff81020ebdbe08>
 <0>Kernel panic - not syncing: Fatal exception
