Description of problem:
Basic recovery scenario again: a healthy cluster running I/O. Two nodes are shot (morph-01 and morph-03), and that causes morph-06 to assert and then panic:

foobar0 move flags 0,0,1 ids 7,19,19
foobar0 process held requests
foobar0 processed 0 requests
foobar0 resend marked requests
foobar0 resend 20389 lq 4 flg 184000 node 3/-1 " 2
foobar0 unlock done 20389
foobar0 resent 1 requests
foobar0 recover event 19 finished
foobar0 release lkb with status 2
DLM: Assertion failed on line 64 of file cluster/dlm/rsb.c
DLM: assertion: "list_empty(&r->res_grantqueue)"
DLM: time = 495631
dlm: rsb name " 2 18" nodeid 4294967295 ref 0
------------[ cut here ]------------
kernel BUG at cluster/dlm/rsb.c:64!
invalid operand: 0000 [#1]
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<e0304a60>] Not tainted
EFLAGS: 00010246 (2.6.7)
EIP is at release_rsb+0x240/0x260 [dlm]
eax: 00000001 ebx: cd1ec3c8 ecx: c03150f0 edx: da1fdf44
esi: da976d38 edi: da976d38 ebp: cd1ec3c8 esp: da1fdf40
ds: 007b es: 007b ss: 0068
Process dlm_astd (pid: 3830, threadinfo=da1fc000 task=da7ce8b0)
Stack: e0306da8 00000040 e0306d96 e0308950 0007900f da976dac da976d38 c58ea57c
       e02f3473 d89a97b0 00000000 7263bf00 000f42a0 da7cea58 e030dd54 da1fc000
       da1fdfa4 da1fdfb0 e02f3f2a e0305ef8 00000000 da7ce8b0 c0118850 00000000
Call Trace:
 [<e02f3473>] process_asts+0xc3/0x190 [dlm]
 [<e02f3f2a>] dlm_astd+0x26a/0x280 [dlm]
 [<c0118850>] default_wake_function+0x0/0x10
 [<c011839a>] schedule_tail+0x1a/0x60
 [<c0118850>] default_wake_function+0x0/0x10
 [<e02f3cc0>] dlm_astd+0x0/0x280 [dlm]
 [<e02f3cc0>] dlm_astd+0x0/0x280 [dlm]
 [<c010429d>] kernel_thread_helper+0x5/0x18
Code: 0f 0b 40 00 96 6d 30 e0 e9 43 ff ff ff 8d 76 00 8b 5c 24 14
<4>CMAN: no HELLO from morph-05.lab.msp.redhat.com, removing from the cluster
dlm: got connection from 4
dlm: got connection from 2
Jul 21 13:04:32 Unable to handle kernel paging requestmorph-06 kernel: at virtual address 00100104 dlm: clvmd: mar printing eip: k waiting requese02fbfa6 ts Jul 21 13:04*pde = 00000000 :32 morph-06 kerOops: 0002 [#2]
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<e02fbfa6>] Not tainted
EFLAGS: 00010287 (2.6.7)
EIP is at process_sockets+0x36/0xa0 [dlm]
eax: 00200200 ebx: dc9c1ad8 ecx: 00100100 edx: dc9c1aec
esi: 00100100 edi: da292000 ebp: 00000000 esp: da293fc8
ds: 007b es: 007b ss: 0068
Process dlm_recvd (pid: 3831, threadinfo=da292000 task=da7cf3b0)
Stack: da292000 00000000 00000000 e02fc25e e030653b 00000000 0000007b 0000007b
       ffffffff e02fc1c0 c010429d 00000000 00000000 00000000
Call Trace:
 [<e02fc25e>] dlm_recvd+0x9e/0xf0 [dlm]
 [<e02fc1c0>] dlm_recvd+0x0/0xf0 [dlm]
 [<c010429d>] kernel_thread_helper+0x5/0x18
Code: 89 41 04 89 08 c7 42 04 00 02 20 00 c7 02 00 01 10 00 0f ba nel: dlm: clvmd:<0>Kernel panic: Fatal exception in interrupt
In interrupt handler - not syncing marked 0 reques ts
I'm glad you ran into this so quickly. While fixing another problem yesterday (one I could reproduce), I also fixed a second, related problem that I couldn't trigger in my own test, so I couldn't verify that the second fix was actually correct. You've created the condition that exercises the second fix and found that I missed a minor part. The debug output (unlock done 20389) was key in showing what was happening. I have now reproduced this condition, and the fix works in my own test.

Note: I consider everything after the first assert panic to be noise caused by the fact that Linux tries to keep running even after a panic. Bugs that appear in this post-panic context are usually invalid.
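For anyone triaging similar logs, the rule of thumb above (trust only the first failure) can be sketched as a small script. This is an illustration only, not a tool used in this bug: it assumes the console log has been split into lines, and it relies on the oops counter the kernel prints in brackets ([#1], [#2]) as seen in the report. The function name oops_events and the labels "primary" and "post-panic" are made up for this example.

```python
import re

# The kernel prints a running oops counter such as "[#1]" or "[#2]".
# Module tags like "[dlm]" do not match because they lack the "#".
OOPS_COUNTER = re.compile(r"\[#(\d+)\]")

def oops_events(log_lines):
    """Return (counter, label, line) for every line carrying an oops counter.

    The first oops (#1) is labeled "primary"; any later one is labeled
    "post-panic" and should usually be treated as noise.
    """
    events = []
    for line in log_lines:
        match = OOPS_COUNTER.search(line)
        if match:
            counter = int(match.group(1))
            label = "primary" if counter == 1 else "post-panic"
            events.append((counter, label, line.strip()))
    return events

# A few lines taken from the report above.
sample = [
    "kernel BUG at cluster/dlm/rsb.c:64!",
    "invalid operand: 0000 [#1]",
    "EIP is at release_rsb+0x240/0x260 [dlm]",
    "Oops: 0002 [#2]",
    "EIP is at process_sockets+0x36/0xa0 [dlm]",
]

for counter, label, text in oops_events(sample):
    print(counter, label, text)
```

On the sample lines, only the release_rsb assertion is labeled primary; the later process_sockets oops falls into the post-panic group.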
Unable to reproduce; marking fixed.
Updating version to the right level in the defects. Sorry for the storm.