Bug 126604

Summary:	recovery panic in dlm_recoverd
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Corey Marthaler <cmarthal>
Component:	gfs	Assignee:	David Teigland <teigland>
Status:	CLOSED WORKSFORME	QA Contact:	Derek Anderson <danderso>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4	CC:	ccaulfie, djansa
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-01-05 22:42:20 UTC	Type:	---

Description Corey Marthaler 2004-06-23 18:22:26 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux)

Description of problem:
Had a GFS filesystem mounted on all 6 nodes (morph-01 - morph-06) with no I/O running, then took down one of the nodes. All the other nodes panicked instantly, in either dlm_recoverd or swapper.


morph-03:

dlm: clvmd: recover event 12
dlm: clvmd: remove node 1
Jun 23 13:00:30 morph-03 kernel: dlm: clvmd: recover event 12
Jun 23 13:00:30 morph-03 kernel: dlm: clvmd: remove node 1
dlm: clvmd: total nodes 5
dlm: clvmd: nodes_reconfig failed 1
dlm: clvmd: recover event 12 error 1
Unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
f8a7cfc4
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<f8a7cfc4>]    Not tainted
EFLAGS: 00010286   (2.6.7)
EIP is at next_move+0x9a4/0xa10 [dlm]
eax: f7fff080   ebx: f8a7f8fa   ecx: 00000023   edx: 0000000f
esi: f69de000   edi: f6d11f54   ebp: 00000001   esp: f6d11f10
ds: 007b   es: 007b   ss: 0068
Process dlm_recoverd (pid: 3655, threadinfo=f6d10000 task=c22a8eb0)
Stack: f69de000 f8a81284 00000001 00000001 00000000 0000000c 0000000d 0000000a
       00000000 0000000a 0000000d 0000000c 00000000 00000001 00000000 f6d11f88
       f6d11f84 f6bffe28 f6bffe28 f6d10000 f6d11fc0 f69de000 f6d11fa0 f8a7d05a
Call Trace:
 [<f8a7d05a>] do_ls_recovery+0x2a/0x410 [dlm]
 [<c0118850>] default_wake_function+0x0/0x10
 [<f8a7d568>] dlm_recoverd+0x128/0x160 [dlm]
 [<c0118850>] default_wake_function+0x0/0x10
 [<c0105c12>] ret_from_fork+0x6/0x14
 [<c0118850>] default_wake_function+0x0/0x10
 [<f8a7d440>] dlm_recoverd+0x0/0x160 [dlm]
 [<c010429d>] kernel_thread_helper+0x5/0x18

Code: a1 10 00 00 00 89 5c 24 04 89 44 24 08 e9 83 f7 ff ff e8 c5


morph-04:

CMAN: node morph-05.lab.msp.redhat.com is not responding - removing from the cluster
Jun 23 13:01:11 morph-04 kernel: CMAN: node morph-05.lab.msp.redhat.com is not responding - removing from the cluster
dlm: gfs0: total nodes 5
dlm: gfs0: nodes_reconfig failed 1
dlm: gfs0: recover event 11 error 1
dlm: gfs0: recover event 12
dlm: gfs0: remove node 6
Jun 23 13:01:15 morph-04 kernel: dlm: gfs0: total nodes 5
Jun 23 13:01:15 morph-04 kernel: dlm: gfs0: nodes_reconfig failed 1
Jun 23 13:01:15 morph-04 kernel: dlm: gfs0: recover event 11 error 1
Jun 23 13:01:15 morph-04 kernel: dlm: gfs0: recover event 12
Jun 23 13:01:15 morph-04 kernel: dlm: gfs0: remove node 6
------------[ cut here ]------------
kernel BUG at kernel/timer.c:405!
invalid operand: 0000 [#1]
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c0121b10>]    Not tainted
EFLAGS: 00010002   (2.6.7)
EIP is at cascade+0x40/0x50
eax: f737f210   ebx: c03b5800   ecx: c03b5800   edx: c03b5800
esi: c03b5b68   edi: c03b5180   ebp: 0000003c   esp: c0367f40
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c0366000 task=c0312a40)
Stack: 00000000 c03b4ea8 c0367f54 c0367f54 c01220d1 c0367f54 c0367f54 c0122217
       00000001 c03b4ea8 0000000a c0314e24 c011e809 00000046 c0364a00 00000000
       c011e837 00000000 c01077c5 00000000 c0367fac c0314e24 c0366000 00099100
Call Trace:
 [<c01220d1>] run_timer_softirq+0xe1/0x150
 [<c0122217>] do_timer+0xc7/0xd0
 [<c011e809>] __do_softirq+0x79/0x80
 [<c011e837>] do_softirq+0x27/0x30
 [<c01077c5>] do_IRQ+0xd5/0x110
 [<c0105e6c>] common_interrupt+0x18/0x20
 [<c0104053>] default_idle+0x23/0x40
 [<c01040e4>] cpu_idle+0x34/0x40
 [<c03685e2>] start_kernel+0x162/0x1a0
 [<c0368330>] unknown_bootoption+0x0/0x120

Code: 0f 0b 95 01 ee e4 2d c0 eb dd 8d b6 00 00 00 00 56 53 83 ec
 <0>Kernel panic: Fatal exception in interrupt
In interrupt handler - not syncing


Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. get cluster up
2. mount gfs on all nodes
3. kill one node
    

Additional info:

Comment 1 David Teigland 2004-06-25 07:54:25 UTC

AFAICS this was caused by a new debugging statement that referenced
a null pointer.  It was fixed the following day in changeset 1.1682.
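
For illustration only, here is a minimal sketch of how a debug statement that reads a field through a NULL struct pointer produces a fault at a small address such as the 00000010 in the oops above (the leading code bytes a1 10 00 00 00 are a load from absolute address 0x10). The struct layout and the names example_node, fault_address and debug_nodeid are hypothetical and are not taken from the dlm source or from changeset 1.1682.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout, NOT the real dlm structure: four 32-bit fields
 * precede the one the debug statement reads, so it sits at offset 0x10. */
struct example_node {
    uint32_t a, b, c, d;
    uint32_t nodeid;
};

/* Address the CPU would touch when reading p->nodeid.
 * With p == NULL this is simply the field offset. */
static uintptr_t fault_address(const struct example_node *p)
{
    return (uintptr_t)p + offsetof(struct example_node, nodeid);
}

/* Guarded debug helper: formats a placeholder instead of
 * dereferencing a NULL pointer. */
static int debug_nodeid(const struct example_node *p, char *buf, size_t len)
{
    if (p == NULL)
        return snprintf(buf, len, "remove node <null>");
    return snprintf(buf, len, "remove node %u", (unsigned)p->nodeid);
}
```

Under these assumptions, fault_address(NULL) evaluates to 0x10, which is why a NULL dereference inside a log statement shows up as a fault at a low address rather than at address 0.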

Comment 2 Corey Marthaler 2004-06-29 21:02:44 UTC

Is this bug fixed in CVS? I've upgraded many times since this was
"fixed" and continue to see this bug.

Comment 3 David Teigland 2004-06-30 02:13:22 UTC

This certainly doesn't happen for me (and I don't think for Patrick).
Maybe it requires 6 nodes to show up (I have 4, Patrick 5).  I'm sure
it will be simple to fix if we can reproduce it.  For now, please
add #define DLM_DEBUG_ALL after DLM_DEBUG in dlm_internal.h and
collect the console output from the crash.
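
A sketch of the requested change, assuming dlm_internal.h already carries a plain "#define DLM_DEBUG" line. The exact form of that existing line and the rest of the header are not shown in this report, so treat the excerpt as an assumption:

```c
/* dlm_internal.h (excerpt; surrounding lines are not shown in this report) */
#define DLM_DEBUG       /* assumed to be present already */
#define DLM_DEBUG_ALL   /* added as requested, for fuller console output */
```

The dlm module then needs to be rebuilt and reloaded on the nodes before reproducing the crash, so that the extra console output is captured.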

Comment 4 Dean Jansa 2004-07-14 20:30:16 UTC

Hit it again (July 13 cvs tree, but didn't have DLM_DEBUG_ALL turned on...)
 
------------[ cut here ]------------ 
kernel BUG at kernel/timer.c:405! 
invalid operand: 0000 [#1] 
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs 
lock_harnesd 
CPU:    0 
EIP:    0060:[<c0121b10>]    Not tainted 
EFLAGS: 00010002   (2.6.7) 
EIP is at cascade+0x40/0x50 
eax: f73d6290   ebx: c03b59f8   ecx: c03b59f8   edx: c03b59f8 
esi: c03b59f0   edi: c03b5180   ebp: 0000000d   esp: c0367f40 
ds: 007b   es: 007b   ss: 0068 
Process swapper (pid: 0, threadinfo=c0366000 task=c0312a40) 
Stack: 00000000 c03b4ea8 c0367f54 c0367f54 c01220d1 c0367f54 c0367f54 c0122217
       00000001 c03b4ea8 0000000a c0314e24 c011e809 00000046 c0364a00 00000000
       c011e837 00000000 c01077c5 00000000 c0367fac c0314e24 c0366000 00099100
Call Trace: 
 [<c01220d1>] run_timer_softirq+0xe1/0x150 
 [<c0122217>] do_timer+0xc7/0xd0 
 [<c011e809>] __do_softirq+0x79/0x80 
 [<c011e837>] do_softirq+0x27/0x30 
 [<c01077c5>] do_IRQ+0xd5/0x110 
 [<c0105e6c>] common_interrupt+0x18/0x20 
 [<c0104053>] default_idle+0x23/0x40 
 [<c01040e4>] cpu_idle+0x34/0x40 
 [<c03685e2>] start_kernel+0x162/0x1a0 
 [<c0368330>] unknown_bootoption+0x0/0x120 
 
Code: 0f 0b 95 01 ea e4 2d c0 eb dd 8d b6 00 00 00 00 56 53 83 ec 
 <0>Kernel panic: Fatal exception in interrupt 
In interrupt handler - not syncing

Comment 5 David Teigland 2004-07-15 02:00:42 UTC

I think I'll need to get a six+ node cluster from somewhere to try
this on. I don't think it'll happen on my four node cluster, at least
I've never seen it.

Comment 6 Kiersten (Kerri) Anderson 2004-11-16 19:11:59 UTC

Updating version to the right level in the defects.  Sorry for the storm.

Comment 7 Corey Marthaler 2005-01-05 22:42:20 UTC

Hasn't been seen in almost 6 months.