128325 – recovery assertion/panic in cluster/dlm/rsb.c during recovery


Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	gfs
Sub Component:
Version:	4
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	David Teigland
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-07-21 18:52 UTC by Corey Marthaler
Modified:	2010-01-12 02:54 UTC
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-10-29 21:00:05 UTC
Embargoed:


Description Corey Marthaler 2004-07-21 18:52:46 UTC

Description of problem: 
Basic recovery scenario again: a healthy cluster running I/O. Two nodes 
(morph-01 and morph-03) are shot, which causes morph-06 to assert 
and then panic:  
 
foobar0 move flags 0,0,1 ids 7,19,19 
foobar0 process held requests 
foobar0 processed 0 requests 
foobar0 resend marked requests 
foobar0 resend 20389 lq 4 flg 184000 node 3/-1 "       2 
foobar0 unlock done 20389 
foobar0 resent 1 requests 
foobar0 recover event 19 finished 
foobar0 release lkb with status 2 
 
DLM:  Assertion failed on line 64 of file cluster/dlm/rsb.c 
DLM:  assertion:  "list_empty(&r->res_grantqueue)" 
DLM:  time = 495631 
dlm: rsb 
name "       2              18" 
nodeid 4294967295 
ref 0 
 
------------[ cut here ]------------ 
kernel BUG at cluster/dlm/rsb.c:64! 
invalid operand: 0000 [#1] 
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs 
lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy 
sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac 
ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod 
CPU:    0 
EIP:    0060:[<e0304a60>]    Not tainted 
EFLAGS: 00010246   (2.6.7) 
EIP is at release_rsb+0x240/0x260 [dlm] 
eax: 00000001   ebx: cd1ec3c8   ecx: c03150f0   edx: da1fdf44 
esi: da976d38   edi: da976d38   ebp: cd1ec3c8   esp: da1fdf40 
ds: 007b   es: 007b   ss: 0068 
Process dlm_astd (pid: 3830, threadinfo=da1fc000 task=da7ce8b0) 
Stack: e0306da8 00000040 e0306d96 e0308950 0007900f da976dac 
da976d38 c58ea57c 
       e02f3473 d89a97b0 00000000 7263bf00 000f42a0 da7cea58 
e030dd54 da1fc000 
       da1fdfa4 da1fdfb0 e02f3f2a e0305ef8 00000000 da7ce8b0 
c0118850 00000000 
Call Trace: 
 [<e02f3473>] process_asts+0xc3/0x190 [dlm] 
 [<e02f3f2a>] dlm_astd+0x26a/0x280 [dlm] 
 [<c0118850>] default_wake_function+0x0/0x10 
 [<c011839a>] schedule_tail+0x1a/0x60 
 [<c0118850>] default_wake_function+0x0/0x10 
 [<e02f3cc0>] dlm_astd+0x0/0x280 [dlm] 
 [<e02f3cc0>] dlm_astd+0x0/0x280 [dlm] 
 [<c010429d>] kernel_thread_helper+0x5/0x18 
 
Code: 0f 0b 40 00 96 6d 30 e0 e9 43 ff ff ff 8d 76 00 8b 5c 24 14 
 <4>CMAN: no HELLO from morph-05.lab.msp.redhat.com, removing from 
the cluster 
dlm: got connection from 4 
dlm: got connection from 2 
Jul 21 13:04:32 Unable to handle kernel paging requestmorph-06 
kernel: at virtual address 00100104 
 dlm: clvmd: mar printing eip: 
k waiting requese02fbfa6 
ts 
Jul 21 13:04*pde = 00000000 
:32 morph-06 kerOops: 0002 [#2] 
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs 
lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy 
sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac 
ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod 
CPU:    0 
EIP:    0060:[<e02fbfa6>]    Not tainted 
EFLAGS: 00010287   (2.6.7) 
EIP is at process_sockets+0x36/0xa0 [dlm] 
eax: 00200200   ebx: dc9c1ad8   ecx: 00100100   edx: dc9c1aec 
esi: 00100100   edi: da292000   ebp: 00000000   esp: da293fc8 
ds: 007b   es: 007b   ss: 0068 
Process dlm_recvd (pid: 3831, threadinfo=da292000 task=da7cf3b0) 
Stack: da292000 00000000 00000000 e02fc25e e030653b 00000000 
0000007b 0000007b 
       ffffffff e02fc1c0 c010429d 00000000 00000000 00000000 
Call Trace: 
 [<e02fc25e>] dlm_recvd+0x9e/0xf0 [dlm] 
 [<e02fc1c0>] dlm_recvd+0x0/0xf0 [dlm] 
 [<c010429d>] kernel_thread_helper+0x5/0x18 
 
Code: 89 41 04 89 08 c7 42 04 00 02 20 00 c7 02 00 01 10 00 0f ba 
 nel: dlm: clvmd:<0>Kernel panic: Fatal exception in interrupt 
In interrupt handler - not syncing 
 marked 0 reques ts
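
The register dump of the second oops (in process_sockets) can be checked against the kernel's list poison constants. The sketch below is a hypothetical helper, not part of the dlm code or of this report's tooling; it assumes the 2.6-era 32-bit values of LIST_POISON1 and LIST_POISON2 from include/linux/list.h, and the function name poisoned_registers is made up for illustration.

```python
import re

# Assumed values of LIST_POISON1 / LIST_POISON2 (include/linux/list.h,
# 2.6-era 32-bit kernels). list_del() stores them in the removed
# entry's next and prev pointers.
LIST_POISON1 = 0x00100100
LIST_POISON2 = 0x00200200

# Matches "eax: 00200200"-style fields in an i386 oops register dump.
REG_RE = re.compile(r"\b(e(?:ax|bx|cx|dx|si|di|bp|sp)):\s*([0-9a-f]{8})\b")


def poisoned_registers(oops_text):
    """Return {register: poison name} for registers holding a list poison value."""
    hits = {}
    for name, value in REG_RE.findall(oops_text):
        number = int(value, 16)
        if number == LIST_POISON1:
            hits[name] = "LIST_POISON1"
        elif number == LIST_POISON2:
            hits[name] = "LIST_POISON2"
    return hits


# Register lines copied from the second oops above.
dump = """eax: 00200200   ebx: dc9c1ad8   ecx: 00100100   edx: dc9c1aec
esi: 00100100   edi: da292000   ebp: 00000000   esp: da293fc8"""

print(poisoned_registers(dump))
```

For this dump the helper flags eax, ecx and esi. The faulting address 00100104 is LIST_POISON1 + 4, which would be consistent with a list entry being deleted a second time; that reading is an inference from the dump and is not confirmed anywhere in this report.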

Comment 1 David Teigland 2004-07-22 03:42:44 UTC

I'm glad you ran into this so quickly.  In the process of fixing
another problem yesterday (that I could reproduce) I fixed a second
related problem that I couldn't actually trigger in my test (so I
couldn't verify the second fix was actually correct.)  You've created
the condition where the second fix is exercised and found that I
missed a minor part.  The debug output (unlock done 20389) was key in
showing what was happening.

I have now reproduced this condition and the fix works in my own test.

Note: I consider everything after the first assertion failure and
panic to be noise, because Linux tries to keep running even after a
panic.  Bugs that appear in this post-panic context are usually
invalid.
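
The triage rule above (look only at the first failure in a console log) can be sketched as a small script. This is a minimal illustration, not an existing tool: the function name first_fatal_event and the marker list are assumptions, with the marker strings taken from the console output in this report.

```python
import re

# Fatal markers as they appear in the console log of this report.
# The list is illustrative and not exhaustive.
FATAL_MARKERS = (
    re.compile(r"Assertion failed on line \d+ of file \S+"),
    re.compile(r"kernel BUG at \S+"),
    re.compile(r"Unable to handle kernel paging request"),
    re.compile(r"Oops: [0-9a-f]{4} \[#\d+\]"),
)


def first_fatal_event(log_lines):
    """Return (index, line) for the earliest fatal marker, or None if there is none."""
    for index, line in enumerate(log_lines):
        if any(marker.search(line) for marker in FATAL_MARKERS):
            return index, line.strip()
    return None


# A few lines copied from the description, in their original order.
sample = [
    "foobar0 recover event 19 finished ",
    "DLM:  Assertion failed on line 64 of file cluster/dlm/rsb.c ",
    "kernel BUG at cluster/dlm/rsb.c:64! ",
    ":32 morph-06 kerOops: 0002 [#2]",
]

print(first_fatal_event(sample))
```

Run against the sample, the helper reports the DLM assertion in rsb.c and ignores the later oops, which matches how this bug was triaged.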

Comment 2 Corey Marthaler 2004-10-29 21:00:05 UTC

Unable to reproduce; marking as fixed.

Comment 3 Kiersten (Kerri) Anderson 2004-11-16 19:10:49 UTC

Updating version to the right level in the defects.  Sorry for the storm.
