Bug 148014 - Nodes are oopsing in tcp code, possibly due to cman?
Status: CLOSED CURRENTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: dlm
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Reported: 2005-02-14 14:33 EST by Adam "mantis" Manthei
Modified: 2009-04-24 10:42 EDT
CC: 1 user

Doc Type: Bug Fix
Last Closed: 2009-04-24 10:42:52 EDT


Attachments
Test script to show DLM socket leak (253 bytes, text/plain)
2005-02-15 11:53 EST, Christine Caulfield

Description Adam "mantis" Manthei 2005-02-14 14:33:19 EST
Description of problem:
I stopped cman (I don't think that I unloaded the module) and let the node sit
idle over the weekend.  When I came back into the office Monday morning, the
node had panicked.  I don't know that this is CMAN's fault, but I am logging it
here.  Further testing will be needed to narrow down the minimum steps to
reproduce.

Version-Release number of selected component (if applicable):
cman-kernel-2.6.9-18.0
cman-kernheaders-2.6.9-18.0
cman-1.0-0.pre21.0

(http://people.redhat.com/cfeist/cluster/RHEL4/alpha/cluster-2005-02-11-1100/cluster-i686-2005-02-11-1100.tar)
How reproducible:


Steps to Reproduce:
1. start ccsd, cman, fenced, clvmd and gfs
2. stop gfs, clvmd, fenced.
3. I think that cman was stopped, but it's possible that it had not been, due
to a regression in the init scripts introduced by Dave's latest nodename
changes (Bug #147828)
  
Actual results:

Unable to handle kernel paging request at virtual address e036f418
 printing eip:
e036f418
*pde = 1f70c067
Oops: 0000 [#1]
Modules linked in: autofs4 cman(U) sunrpc md5 ipv6 microcode button battery ac
uhci_hcd ehci_hcd e1000 floppy ext3 jbd dm_mod qla2300 qla2xxx
scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<e036f418>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-5.EL)
EIP is at 0xe036f418
eax: dc31b080   ebx: dc31b080   ecx: c034d540   edx: dbc5f1d0
esi: 00000007   edi: dc31b080   ebp: dc31b2e8   esp: c03d2ed0
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, threadinfo=c03d2000 task=c0349bc0)
Stack: c02ce6eb 00000001 27e33c0f c02d0dab 00000000 db53b034 c6463080 00001000
       c6463080 dc31b080 dc31b2e8 00000000 c02d849e 00000014 dc31b080 c6463080
       c02d899e dc31b250 c02a6b63 00000286 00000246 00004000 df584400 00000002
Call Trace:
 [<c02ce6eb>] tcp_reset+0xf8/0x106
 [<c02d0dab>] tcp_rcv_state_process+0x280/0x85d
 [<c02d849e>] tcp_v4_do_rcv+0x96/0xd0
 [<c02d899e>] tcp_v4_rcv+0x4c6/0x7c5
 [<c02a6b63>] dev_queue_xmit+0x40b/0x413
 [<c02c086e>] ip_local_deliver+0xf7/0x1a5
 [<c02c0d88>] ip_rcv+0x31d/0x399
 [<c02a70b2>] netif_receive_skb+0x1db/0x208
 [<e015eaa7>] e1000_clean_rx_irq+0x3ab/0x40c [e1000]
 [<e015e337>] e1000_clean+0x3d/0xae [e1000]
 [<c02a7211>] net_rx_action+0x59/0xc1
 [<c0124df5>] __do_softirq+0x35/0x79
 [<c0109027>] do_softirq+0x46/0x4d
 =======================
 [<c01085ee>] do_IRQ+0x239/0x242
 [<c0301d40>] common_interrupt+0x18/0x20
 [<c010403b>] default_idle+0x23/0x26
 [<c010408c>] cpu_idle+0x1f/0x34
 [<c03a76b4>] start_kernel+0x20f/0x211
Code:  Bad EIP value.
 <0>Kernel panic - not syncing: Fatal exception in interrupt

Additional info:

These are the cman related messages in the syslog.
[void] grep cman -i trin-04.messages 
Feb 11 17:57:59 trin-04 cman: Starting cman:
Feb 11 17:57:59 trin-04 kernel: CMAN 2.6.9-18.0 (built Feb 11 2005 11:18:55) installed
Feb 11 17:58:00 trin-04 kernel: CMAN: Waiting to join or form a Linux-cluster
Feb 11 17:58:00 trin-04 ccsd[8297]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1
Feb 11 17:58:32 trin-04 kernel: CMAN: forming a new cluster
Feb 11 17:58:45 trin-04 kernel: CMAN: quorum regained, resuming activity
Feb 11 18:03:00 trin-04 cman:  failed
Feb 11 18:03:00 trin-04 cman: [
Feb 11 18:03:00 trin-04 cman: 
Feb 11 18:03:00 trin-04 rc: Starting cman:  failed
Feb 12 18:22:23 trin-04 cman:  failed
Feb 12 18:29:42 trin-04 cman:  failed
Feb 12 18:47:52 trin-04 cman:  failed

I think that cman was started by fence_tool.  It would appear that cman was
still logged in when the oops occurred.
Comment 1 Christine Caulfield 2005-02-15 03:28:32 EST
This should be assigned to dlm (my fault, I didn't make it clear on
IRC) as CMAN doesn't use TCP and DLM does.
Comment 2 Christine Caulfield 2005-02-15 11:53:24 EST
Created attachment 111091 [details]
Test script to show DLM socket leak
Comment 3 Christine Caulfield 2005-02-15 11:55:02 EST
If a primary connection was closed due to EOF on the socket (e.g. the
remote end closed down), then the secondary connection would not get closed
when the DLM shut down, causing a socket leak.

Checking in lowcomms.c;
/cvs/cluster/cluster/dlm-kernel/src/lowcomms.c,v  <--  lowcomms.c
new revision: 1.22.2.7; previous revision: 1.22.2.6
done
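
For illustration only, here is a minimal userspace C sketch of the leak
pattern described above. The names (struct connection, othercon, handle_eof,
shutdown_buggy, shutdown_fixed) are assumptions made up for this example,
loosely modelled on the description; they are not the actual dlm-kernel
lowcomms.c code or the checked-in fix.

#include <unistd.h>

/* Hypothetical connection pair: a primary and an optional secondary. */
struct connection {
    int fd;                      /* socket fd, -1 once closed      */
    struct connection *othercon; /* secondary connection, or NULL  */
};

/* When the remote end closes (EOF on the socket), only the primary
 * is closed and marked gone. */
static void handle_eof(struct connection *con)
{
    if (con->fd >= 0) {
        close(con->fd);
        con->fd = -1;
    }
}

/* Buggy shutdown: bails out early when the primary is already closed,
 * so a still-open secondary is never closed and lingers in CLOSE_WAIT. */
static void shutdown_buggy(struct connection *con)
{
    if (con->fd < 0)
        return;                  /* leak: con->othercon is skipped */
    close(con->fd);
    con->fd = -1;
}

/* Fixed shutdown: always close the secondary, whether or not the
 * primary was already closed by EOF. */
static void shutdown_fixed(struct connection *con)
{
    if (con->othercon && con->othercon->fd >= 0) {
        close(con->othercon->fd);
        con->othercon->fd = -1;
    }
    if (con->fd >= 0) {
        close(con->fd);
        con->fd = -1;
    }
}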
Comment 4 Christine Caulfield 2005-02-15 11:57:05 EST
Comment on attachment 111091 [details]
Test script to show DLM socket leak

Run this on several nodes simultaneously. One or more instances will bomb out
with a socket in CLOSE_WAIT.
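
If you want to watch for the leak without the script, a quick check is to
count sockets stuck in CLOSE_WAIT. The following standalone C sketch scans
/proc/net/tcp for them; the default port 21064 is an assumption (the usual
DLM lowcomms TCP port) and can be overridden on the command line.

#include <stdio.h>
#include <stdlib.h>

#define TCP_CLOSE_WAIT 0x08   /* kernel state code for CLOSE_WAIT */

int main(int argc, char **argv)
{
    /* Port 21064 is assumed to be the DLM lowcomms port; override
     * with the first argument if your setup differs. */
    unsigned long port = (argc > 1) ? strtoul(argv[1], NULL, 10) : 21064;
    FILE *f = fopen("/proc/net/tcp", "r");
    char line[512];
    int count = 0;

    if (!f) {
        perror("/proc/net/tcp");
        return 1;
    }
    fgets(line, sizeof(line), f);          /* skip the header row */
    while (fgets(line, sizeof(line), f)) {
        unsigned laddr, lport, raddr, rport, state;
        /* Each row: "sl: local_addr:port rem_addr:port st ..." (hex) */
        if (sscanf(line, "%*d: %x:%x %x:%x %x",
                   &laddr, &lport, &raddr, &rport, &state) != 5)
            continue;
        if (state == TCP_CLOSE_WAIT && (lport == port || rport == port)) {
            count++;
            printf("CLOSE_WAIT: %s", line);
        }
    }
    fclose(f);
    printf("%d socket(s) in CLOSE_WAIT on port %lu\n", count, port);
    return count ? 2 : 0;
}

Compile with "gcc -o closewait closewait.c" and run it on each node after
stopping the cluster services; a non-zero count reproduces the symptom
described above.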
