133142 – cman oops while trying to clean up sockets

Summary: cman oops while trying to clean up sockets

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	gfs
Sub Component:
Version:	4
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Christine Caulfield
QA Contact:	GFS Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-09-21 21:14 UTC by Corey Marthaler
Modified:	2010-01-12 02:58 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-09-22 22:32:02 UTC
Embargoed:


Description Corey Marthaler 2004-09-21 21:14:43 UTC

Description of problem:
I continually see this oops/panic when running the following loop:

modprobe cman
ccsd
cman_tool join
sleep 45
cman_tool leave
killall ccsd
rmmod cman

It always seems to occur either during the modprobe or rmmod of cman.

Also, there must be a state in which the cman module should not be
removable but the check allows it anyway. When everything is running
fine with a quorate cluster, an rmmod isn't allowed:

[root@morph-06 root]# modprobe -r cman
FATAL: Module cman is in use.
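
To see what the "in use" check is working from, the module's use count can be read just before each rmmod. The following is a minimal sketch, not part of the original report: the helper name module_refcount and its optional second argument are hypothetical, and it assumes the usual /proc/modules layout in which the third field is the use count.

```shell
#!/bin/sh
# Print the use count the kernel reports for a module, or nothing if the
# module is not loaded. Assumes the third field of /proc/modules is the
# use count. The optional second argument (a file in the same format) is
# a hypothetical convenience for trying the helper on saved output.
module_refcount() {
    awk -v mod="$1" '$1 == mod { print $3 }' "${2:-/proc/modules}"
}
```

Calling `module_refcount cman` right before `rmmod cman` in the loop would show whether the count has already dropped to zero while a user-space process is possibly still holding a cluster socket.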


Here is the oops:

Sep 21 16:03:32 morph-06 kernel: CMAN: node morph-04 rejoining
slab error in kmem_cache_destroy(): cache `cluster_sock': Can't free all objects
 [<c01467a1>] kmem_cache_destroy+0xd1/0x120
 [<c01375d1>] sys_delete_module+0x121/0x170
 [<c015007a>] unmap_vma_list+0x1a/0x30
 [<c0150408>] do_munmap+0x118/0x160
 [<c011b950>] do_page_fault+0x0/0x4fc
 [<c0105e4d>] sysenter_past_esp+0x52/0x71
CMAN <CVS> (built Sep 21 2004 09:33:42) installed
kmem_cache_create: duplicate cache cluster_sock
------------[ cut here ]------------
kernel BUG at mm/slab.c:1382!
invalid operand: 0000 [#1]
SMP
Modules linked in: cman ipv6 parport_pc lp parport autofs4 sunrpc
e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery
asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c01463a6>]    Not tainted
EFLAGS: 00010202   (2.6.8.1)
EIP is at kmem_cache_create+0x426/0x5b0
eax: 00000030   ebx: da903f10   ecx: c0413798   edx: 000048c4
esi: e02c1a78   edi: e02c1a78   ebp: da903d80   esp: db5f1f5c
ds: 007b   es: 007b   ss: 0068
Process modprobe (pid: 2319, threadinfo=db5f0000 task=db5360b0)
Stack: c0309d24 e02c1a6b da903dd8 0000000a c0000000 ffffff80 00000080 e02c1a6b
       00000080 c03432c0 e02c9a00 c03432a4 c03432a4 e00f6060 00002000 00000000
       00000000 e02c1a50 c0139327 c14da320 00000000 4000b008 0807a0e0 00c4e09d
Call Trace:
 [<e00f6060>] cluster_init+0x60/0x3f9 [cman]
 [<c0139327>] sys_init_module+0x107/0x220
 [<c0105e4d>] sysenter_past_esp+0x52/0x71
Code: 0f 0b 66 05 20 96 30 c0 8b 0b e9 5b ff ff ff 8b 47 50 c7 04
 <1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
e02ad04d
*pde = 00000000
Oops: 0000 [#2]
SMP
Modules linked in: cman ipv6 parport_pc lp parport autofs4 sunrpc
e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery
asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<e02ad04d>]    Not tainted
EFLAGS: 00010282   (2.6.8.1)
EIP is at cnxman_data_ready+0xd/0xb0 [cman]
eax: dd143300   ebx: dd143300   ecx: dd143350   edx: 00000000
esi: dd143300   edi: 00000000   ebp: 00000020   esp: db5f1c68
ds: 007b   es: 007b   ss: 0068
Process modprobe (pid: 2319, threadinfo=db5f0000 task=db5360b0)
Stack: c013a024 dd143300 dd16c780 c02ced37 00000052 dd143308 00000000 00001a99
       dd16c780 c02cef40 00000004 0000991a 00000004 dd143300 292ca8c0 dd384034
       dd16c780 00000020 dd500c80 00000020 dd16c780 c02cf25d ff2fa8c0 ff2fa8c0
Call Trace:
 [<c013a024>] __print_symbol+0x84/0xf0
 [<c02ced37>] udp_queue_rcv_skb+0x1b7/0x290
 [<c02cef40>] udp_v4_mcast_deliver+0x130/0x2d0
 [<c02cf25d>] udp_rcv+0xcd/0x3b0
 [<c02ac674>] ip_local_deliver+0xe4/0x230
 [<c02acb33>] ip_rcv+0x373/0x4c0
 [<c0294ec4>] netif_receive_skb+0x1b4/0x220
 [<e01f3449>] e1000_clean_rx_irq+0x3f9/0x470 [e1000]
 [<c012a646>] update_wall_time+0x16/0x40
 [<c012aa0f>] do_timer+0xaf/0xc0
 [<e01f2db4>] e1000_clean+0x34/0xb0 [e1000]
 [<c02950cf>] net_rx_action+0x7f/0x110
 [<c01268d4>] __do_softirq+0xb4/0xc0
 [<c012690d>] do_softirq+0x2d/0x30
 [<c0118b6c>] smp_apic_timer_interrupt+0xcc/0x130
 [<c010688e>] apic_timer_interrupt+0x1a/0x20
 [<c0106f7e>] die+0x9e/0x100
 [<c01072a0>] do_invalid_op+0x0/0xb0
 [<c010734c>] do_invalid_op+0xac/0xb0
 [<c01463a6>] kmem_cache_create+0x426/0x5b0
 [<c010680c>] common_interrupt+0x18/0x20
 [<c021456d>] serial_in+0x1d/0x40
 [<c01c4a09>] __delay+0x9/0x10
 [<c02169a9>] serial8250_console_write+0x129/0x200
 [<c0216880>] serial8250_console_write+0x0/0x200
 [<c0216880>] serial8250_console_write+0x0/0x200
 [<c0122e57>] __call_console_drivers+0x57/0x60
 [<c0106909>] error_code+0x2d/0x38
 [<c012007b>] migration_call+0x3b/0xd0
 [<c01463a6>] kmem_cache_create+0x426/0x5b0
 [<e00f6060>] cluster_init+0x60/0x3f9 [cman]
 [<c0139327>] sys_init_module+0x107/0x220
 [<c0105e4d>] sysenter_past_esp+0x52/0x71
Code: 8b 0a 0f 18 01 90 81 fa 84 aa 2c e0 74 22 90 8d 74 26 00 8b
 <0>Kernel panic: Fatal exception in interrupt
In interrupt handler - not syncing

How reproducible:
Always

Comment 1 Christine Caulfield 2004-09-22 10:11:18 UTC

I suspect this is caused by ccsd not being fully dead by the time the
rmmod is run. 

For some reason the socket ops did not have a module owner field set,
so that would have allowed cman to be unloaded while ccsd still had a
socket open. oops.

I've also fixed a couple of /proc file instances of the same thing
both in cman and the DLM.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.23; previous revision: 1.22
done
Checking in proc.c;
/cvs/cluster/cluster/cman-kernel/src/proc.c,v  <--  proc.c
new revision: 1.4; previous revision: 1.3
done
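
Comment 1 suspects that ccsd is not fully dead when rmmod runs. A test loop can rule that timing out by waiting for the ccsd processes to disappear before unloading. The sketch below is an illustration only, not the fix that was checked in: the function names wait_for_pid_exit and run_cycle and the 30-second timeout are assumptions, and it assumes the loop runs as root (as modprobe and rmmod already require) so that `kill -0` reliably reports whether a PID exists.

```shell
#!/bin/sh
# Wait until the given PID no longer exists, polling once per second.
# Returns 0 once the process is gone, 1 if it is still present after the
# given number of polls (default 30). Assumes the caller may signal the
# process, e.g. because the loop runs as root.
wait_for_pid_exit() {
    pid=$1
    tries=${2:-30}
    while kill -0 "$pid" 2>/dev/null; do
        [ "$tries" -gt 0 ] || return 1
        tries=$((tries - 1))
        sleep 1
    done
    return 0
}

# One pass of the reproduction loop from the description, with the wait
# for ccsd added between killall and rmmod.
run_cycle() {
    modprobe cman
    ccsd
    cman_tool join
    sleep 45
    cman_tool leave
    ccsd_pids=$(pidof ccsd)    # remember the PIDs before signalling them
    killall ccsd
    for p in $ccsd_pids; do
        wait_for_pid_exit "$p" 30 || return 1
    done
    rmmod cman
}
```

Running `while run_cycle; do :; done` repeats the cycle until a step fails. If the oops no longer appears with the wait in place, that would be consistent with the suspicion above, although the wait only avoids the race in the test and does not stop the module from being unloaded while a socket is open.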

Comment 2 Corey Marthaler 2004-09-22 22:32:02 UTC

Fix verified.

Comment 3 Kiersten (Kerri) Anderson 2004-11-16 19:13:37 UTC

Updating version to the right level in the defects.  Sorry for the storm.
