Bug 154445

Summary: oops in dlm_sendd after removing nodes
Product: [Retired] Red Hat Cluster Suite
Component: dlm
Version: 4
Hardware: All
OS: Linux
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
Reporter: Corey Marthaler <cmarthal>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, cluster-maint
Doc Type: Bug Fix
Last Closed: 2006-05-04 16:50:22 UTC

Description Corey Marthaler 2005-04-11 19:08:52 UTC
Description of problem:
revolver had just shot 3 (tank-04, tank-02, tank-01) out of 6 nodes in the
cluster when tank-06 hit this oops:

[...]
CMAN: removing node tank-04.lab.msp.redhat.com from the cluster : Missed too many heartbeats
CMAN: removing node tank-01.lab.msp.redhat.com from the cluster : No response to messages
dlm: gfs4: remote_stage error -105 230058
dlm: gfs0: remote_stage error -105 270182
Unable to handle kernel NULL pointer dereference at virtual address 00000005
 printing eip:
22318adf
*pde = 00004001
Oops: 0002 [#1]
SMP
Modules linked in: gnbd(U) lock_nolock(U) gfs(U) lock_dlm(U) dlm(U) cman(U)
lock_harness(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc button battery ac
uhci_hcd ehci_hcd e1000 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<22318adf>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-6.37.ELhugemem)
EIP is at send_to_sock+0xef/0x212 [dlm]
eax: 00000001   ebx: 1bf0b4c0   ecx: 15411500   edx: 00000000
esi: 211d9340   edi: 00000000   ebp: 00000000   esp: 18049fa0
ds: 007b   es: 007b   ss: 0068
Process dlm_sendd (pid: 5657, threadinfo=18049000 task=182e6db0)
Stack: 0228a763 00000000 1bf0b4c0 2232b68c 00000000 22318f62 22318dac 18049000
       00000000 22318fec 18049000 1f30fea4 02131cd9 fffffffc ffffffff ffffffff
       02131c66 00000000 00000000 00000000 021041f1 1f30fe9c 00000000 00000000
Call Trace:
 [<0228a763>] tcp_sendpage+0x0/0x5e
 [<22318f62>] dlm_sendd+0x0/0x9a [dlm]
 [<22318dac>] process_output_queue+0x4d/0x67 [dlm]
 [<22318fec>] dlm_sendd+0x8a/0x9a [dlm]
 [<02131cd9>] kthread+0x73/0x9b
 [<02131c66>] kthread+0x0/0x9b
 [<021041f1>] kernel_thread_helper+0x5/0xb
Code: 58 fa df 8b 44 24 04 01 46 0c 8b 46 10 2b 44 24 04 85 c0 89 46 10 0f 85 6f
ff ff ff 83 7e 18 00 0f 85 65 ff ff ff 8b 06 8b 56 04 <89> 50 04 89 02 89 f0 c7
06 00 01 10 00 c7 46 04 00 02 20 00 e8
 <0>Fatal exception: panic in 5 seconds
gfs4 total nodes 6
gfs4 rebuild resource directory
gfs4 rebuilt 515 resources
gfs4 purge requests
gfs4 purged 0 requests
gfs4 mark waiting requests
gfs4 marked 0 requests
gfs4 recover event 26 done
gfs4 move flags 0,0,1 ids 15,26,26
gfs4 process held requests
gfs4 processed 0 requests
gfs4 resend marked requests
gfs4 resent 0 requests
gfs4 recover event 26 finished
gfs5 move flags 1,0,0 ids 15,15,15
gfs5 add_to_requestq cmd 1 fr 1
gfs5 move flags 0,1,0 ids 15,28,15
gfs5 move use event 28
gfs5 recover event 28
gfs5 add node 2
gfs5 add_to_requestq cmd 9 fr 3
gfs5 add_to_requestq cmd 1 fr 3
gfs5 total nodes 6
gfs5 rebuild resource directory
gfs5 rebuilt 1038 resources
gfs5 purge requests
gfs5 purged 3 requests
gfs5 mark waiting requests
gfs5 marked 0 requests
gfs5 recover event 28 done
gfs5 move flags 0,0,1 ids 15,28,28
gfs5 process held requests
gfs5 processed 0 requests
gfs5 resend marked requests
gfs5 resent 0 requests
gfs5 recover event 28 finished
gfs4 remote_stage error -105 230058
gfs0 remote_stage error -105 270182
655 qc 11,2fe5c98 0,5 id 302a9 sts 0 0
[...]


The other two nodes were left hung:
Apr 11 10:52:17 tank-03 kernel: CMAN: node tank-02.lab.msp.redhat.com has been removed from the cluster : Missed too many heartbeats
Apr 11 10:52:17 tank-03 kernel: CMAN: node tank-04.lab.msp.redhat.com has been removed from the cluster : Missed too many heartbeats
Apr 11 10:52:25 tank-03 kernel: CMAN: node tank-01.lab.msp.redhat.com has been removed from the cluster : No response to messages
Apr 11 10:52:36 tank-03 kernel: CMAN: removing node tank-06.lab.msp.redhat.com from the cluster : No response to messages
Apr 11 10:52:37 tank-03 kernel: CMAN: quorum lost, blocking activity



Apr 11 10:51:50 tank-05 kernel: CMAN: removing node tank-02.lab.msp.redhat.com from the cluster : Missed too many heartbeats
Apr 11 10:51:54 tank-05 kernel: CMAN: removing node tank-01.lab.msp.redhat.com from the cluster : No response to messages
Apr 11 10:51:58 tank-05 kernel: CMAN: removing node tank-04.lab.msp.redhat.com from the cluster : No response to messages
Apr 11 10:52:09 tank-05 kernel: CMAN: node tank-06.lab.msp.redhat.com has been removed from the cluster : No response to messages
Apr 11 10:52:10 tank-05 kernel: CMAN: quorum lost, blocking activity


Version-Release number of selected component (if applicable):
CMAN 2.6.9-32.0 (built Apr  5 2005 11:57:25) installed
DLM 2.6.9-30.1 (built Mar 29 2005 18:29:40) installed

Comment 1 David Teigland 2005-04-12 02:53:45 UTC
We've seen the -105 (ENOBUFS) error before as a symptom of bz 139738.
I'm not sure whether that's what's happening here; when it is, we
usually see some other CMAN message indicating as much.  It doesn't
appear that this version of cman (built Apr 5) includes the latest
fixes for bz 139738.


Comment 2 Christine Caulfield 2005-04-12 12:51:55 UTC
Unless there are some interesting messages missing from that console log, that
looks like it might be a genuine OOM condition. There's nothing in there that
suggests that tank-06 has been kicked out of the cluster by anyone else.

(slightly later) It looks like it might be the case that sendpage() can't
send highmem pages (there's some code in NFS to trap for this), which
could explain the oops.
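
As a rough sketch of the kind of trap I mean (along the lines of what
NFS does), something like this could guard the send path. The function
name and calling convention here are made up for illustration; this is
not the real lowcomms code:

#include <linux/net.h>
#include <linux/socket.h>
#include <linux/highmem.h>
#include <linux/uio.h>

/*
 * Illustrative only: if the page we're about to send came from
 * highmem, fall back to a copying send through a temporary kernel
 * mapping instead of handing the page to sendpage() directly.
 */
static int send_page_safely(struct socket *sock, struct page *page,
                            int offset, int len)
{
        struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
        struct kvec iov;
        void *kaddr;
        int ret;

        if (!PageHighMem(page))
                /* lowmem page: sendpage() can take it as-is */
                return sock->ops->sendpage(sock, page, offset, len,
                                           MSG_DONTWAIT);

        /* highmem page: kmap() it and do an ordinary copying send */
        kaddr = kmap(page);
        iov.iov_base = kaddr + offset;
        iov.iov_len = len;
        ret = kernel_sendmsg(sock, &msg, &iov, 1, len);
        kunmap(page);
        return ret;
}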

lowcomms buffers have two parts: a kmalloced bit and a page_alloced bit.
So it could be that when low memory ran out, the page was allocated from
highmem and the sendpage oopsed. The next kmalloc call then failed
because low memory had run out.
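
To make the layout concrete, the buffer pair looks roughly like this
(the names are invented for illustration, not the real lowcomms
structures). Allocating the page with plain GFP_KERNEL, i.e. without
__GFP_HIGHMEM, would keep it in low memory and sidestep the sendpage()
problem, though both allocations would then be competing for the same
low memory:

#include <linux/slab.h>
#include <linux/mm.h>

/* illustrative stand-in for the real lowcomms write-queue entry */
struct wq_entry {
        struct page *page;      /* the page_alloced bit, fed to sendpage() */
        int offset;
        int len;
};

static struct wq_entry *wq_entry_alloc(void)
{
        /* the kmalloced bit; this is the call that fails once lowmem is gone */
        struct wq_entry *e = kmalloc(sizeof(*e), GFP_KERNEL);
        if (!e)
                return NULL;

        /* no __GFP_HIGHMEM in the mask, so the page comes from lowmem */
        e->page = alloc_page(GFP_KERNEL);
        if (!e->page) {
                kfree(e);
                return NULL;
        }
        e->offset = 0;
        e->len = 0;
        return e;
}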

All hypothesis, of course. The sendpage oops I can get round if highmem
is the cause; the rest is harder...

Comment 3 David Teigland 2006-05-04 16:50:22 UTC
This has either been fixed or the machine just ran out of memory.