Bug 133142

Summary: cman oops while trying to clean up sockets
Product: [Retired] Red Hat Cluster Suite
Reporter: Corey Marthaler <cmarthal>
Component: gfs
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED CURRENTRELEASE
QA Contact: GFS Bugs <gfs-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 4
Target Milestone: ---
Target Release: ---
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-09-22 22:32:02 UTC

Description Corey Marthaler 2004-09-21 21:14:43 UTC
Description of problem:
I continually see this oops/panic when running the following loop:

modprobe cman
ccsd
cman_tool join
sleep 45
cman_tool leave
killall ccsd
rmmod cman
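
The steps above can be wrapped in a small script for repeated runs. This is only a sketch: the run and cycle helpers and the DRY_RUN and ITERATIONS variables are hypothetical names, not part of cman or ccsd, and the script defaults to printing the commands rather than executing them.

```shell
#!/bin/sh
# Hypothetical wrapper around the steps listed in this report.
# DRY_RUN=1 (the default here) only prints each command;
# set DRY_RUN=0 on a disposable test node to actually run them.
DRY_RUN=${DRY_RUN:-1}
ITERATIONS=${ITERATIONS:-3}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# One pass through the load/join/leave/unload sequence.
cycle() {
    run modprobe cman
    run ccsd
    run cman_tool join
    run sleep 45
    run cman_tool leave
    run killall ccsd
    run rmmod cman
}

i=1
while [ "$i" -le "$ITERATIONS" ]; do
    echo "iteration $i"
    cycle
    i=$((i + 1))
done
```

Running it as DRY_RUN=0 ITERATIONS=20 would execute the real commands twenty times; since the failure ends in a kernel panic, that is only appropriate on a machine set aside for testing.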

It always seems to occur either during the modprobe or rmmod of cman.

Also, there must be a state in which the cman module shouldn't be
removable but the check allows it anyway. When everything is running
fine with a quorate cluster, an rmmod isn't allowed:

[root@morph-06 root]# modprobe -r cman
FATAL: Module cman is in use.
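
One way to see whether the kernel considers the module in use is to read its use count, which is the third field of the module's line in /proc/modules (the same number lsmod reports). A minimal sketch, with module_refcount as a hypothetical helper name:

```shell
#!/bin/sh
# module_refcount NAME [FILE]
# Prints the use count of module NAME (third field of its line),
# or nothing if the module is not listed.
# FILE defaults to /proc/modules; the parameter exists only so the
# function can be tried against a saved copy of that file.
module_refcount() {
    awk -v m="$1" '$1 == m { print $3 }' "${2:-/proc/modules}"
}
```

For example, module_refcount cman prints the current count for cman. A count of 0 means a plain rmmod will be allowed.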


Here is the oops:

Sep 21 16:03:32 morph-06 kernel: CMAN: node morph-04 rejoining
slab error in kmem_cache_destroy(): cache `cluster_sock': Can't free
all objects
 [<c01467a1>] kmem_cache_destroy+0xd1/0x120
 [<c01375d1>] sys_delete_module+0x121/0x170
 [<c015007a>] unmap_vma_list+0x1a/0x30
 [<c0150408>] do_munmap+0x118/0x160
 [<c011b950>] do_page_fault+0x0/0x4fc
 [<c0105e4d>] sysenter_past_esp+0x52/0x71
CMAN <CVS> (built Sep 21 2004 09:33:42) installed
kmem_cache_create: duplicate cache cluster_sock
------------[ cut here ]------------
kernel BUG at mm/slab.c:1382!
invalid operand: 0000 [#1]
SMP
Modules linked in: cman ipv6 parport_pc lp parport autofs4 sunrpc
e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery
asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c01463a6>]    Not tainted
EFLAGS: 00010202   (2.6.8.1)
EIP is at kmem_cache_create+0x426/0x5b0
eax: 00000030   ebx: da903f10   ecx: c0413798   edx: 000048c4
esi: e02c1a78   edi: e02c1a78   ebp: da903d80   esp: db5f1f5c
ds: 007b   es: 007b   ss: 0068
Process modprobe (pid: 2319, threadinfo=db5f0000 task=db5360b0)
Stack: c0309d24 e02c1a6b da903dd8 0000000a c0000000 ffffff80 00000080
e02c1a6b
       00000080 c03432c0 e02c9a00 c03432a4 c03432a4 e00f6060 00002000
00000000
       00000000 e02c1a50 c0139327 c14da320 00000000 4000b008 0807a0e0
00c4e09d
Call Trace:
 [<e00f6060>] cluster_init+0x60/0x3f9 [cman]
 [<c0139327>] sys_init_module+0x107/0x220
 [<c0105e4d>] sysenter_past_esp+0x52/0x71
Code: 0f 0b 66 05 20 96 30 c0 8b 0b e9 5b ff ff ff 8b 47 50 c7 04
 <1>Unable to handle kernel NULL pointer dereference at virtual
address 00000000
 printing eip:
e02ad04d
*pde = 00000000
Oops: 0000 [#2]
SMP
Modules linked in: cman ipv6 parport_pc lp parport autofs4 sunrpc
e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery
asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<e02ad04d>]    Not tainted
EFLAGS: 00010282   (2.6.8.1)
EIP is at cnxman_data_ready+0xd/0xb0 [cman]
eax: dd143300   ebx: dd143300   ecx: dd143350   edx: 00000000
esi: dd143300   edi: 00000000   ebp: 00000020   esp: db5f1c68
ds: 007b   es: 007b   ss: 0068
Process modprobe (pid: 2319, threadinfo=db5f0000 task=db5360b0)
Stack: c013a024 dd143300 dd16c780 c02ced37 00000052 dd143308 00000000
00001a99
       dd16c780 c02cef40 00000004 0000991a 00000004 dd143300 292ca8c0
dd384034
       dd16c780 00000020 dd500c80 00000020 dd16c780 c02cf25d ff2fa8c0
ff2fa8c0
Call Trace:
 [<c013a024>] __print_symbol+0x84/0xf0
 [<c02ced37>] udp_queue_rcv_skb+0x1b7/0x290
 [<c02cef40>] udp_v4_mcast_deliver+0x130/0x2d0
 [<c02cf25d>] udp_rcv+0xcd/0x3b0
 [<c02ac674>] ip_local_deliver+0xe4/0x230
 [<c02acb33>] ip_rcv+0x373/0x4c0
 [<c0294ec4>] netif_receive_skb+0x1b4/0x220
 [<e01f3449>] e1000_clean_rx_irq+0x3f9/0x470 [e1000]
 [<c012a646>] update_wall_time+0x16/0x40
 [<c012aa0f>] do_timer+0xaf/0xc0
 [<e01f2db4>] e1000_clean+0x34/0xb0 [e1000]
 [<c02950cf>] net_rx_action+0x7f/0x110
 [<c01268d4>] __do_softirq+0xb4/0xc0
 [<c012690d>] do_softirq+0x2d/0x30
 [<c0118b6c>] smp_apic_timer_interrupt+0xcc/0x130
 [<c010688e>] apic_timer_interrupt+0x1a/0x20
 [<c0106f7e>] die+0x9e/0x100
 [<c01072a0>] do_invalid_op+0x0/0xb0
 [<c010734c>] do_invalid_op+0xac/0xb0
 [<c01463a6>] kmem_cache_create+0x426/0x5b0
 [<c010680c>] common_interrupt+0x18/0x20
 [<c021456d>] serial_in+0x1d/0x40
 [<c01c4a09>] __delay+0x9/0x10
 [<c02169a9>] serial8250_console_write+0x129/0x200
 [<c0216880>] serial8250_console_write+0x0/0x200
 [<c0216880>] serial8250_console_write+0x0/0x200
 [<c0122e57>] __call_console_drivers+0x57/0x60
 [<c0106909>] error_code+0x2d/0x38
 [<c012007b>] migration_call+0x3b/0xd0
 [<c01463a6>] kmem_cache_create+0x426/0x5b0
 [<e00f6060>] cluster_init+0x60/0x3f9 [cman]
 [<c0139327>] sys_init_module+0x107/0x220
 [<c0105e4d>] sysenter_past_esp+0x52/0x71
Code: 8b 0a 0f 18 01 90 81 fa 84 aa 2c e0 74 22 90 8d 74 26 00 8b
 <0>Kernel panic: Fatal exception in interrupt
In interrupt handler - not syncing

How reproducible:
Always

Comment 1 Christine Caulfield 2004-09-22 10:11:18 UTC
I suspect this is caused by ccsd not being fully dead by the time the
rmmod is run. 

For some reason the socket ops did not have a module owner field set,
so that would have allowed cman to be unloaded while ccsd still had a
socket open. oops.
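
If ccsd is still exiting when rmmod runs, a test script could wait for the process to disappear before unloading the module. The sketch below is a guard for the test loop only, not the fix described in this comment (which was made in the kernel module). wait_for_exit is a hypothetical helper name, and the sketch assumes pgrep from procps is available.

```shell
#!/bin/sh
# wait_for_exit NAME [TIMEOUT_SECONDS]
# Hypothetical helper: polls once per second until no process whose
# name is exactly NAME remains. Returns 0 once it is gone, or 1 if it
# is still present after TIMEOUT_SECONDS (default 30).
wait_for_exit() {
    name=$1
    remaining=${2:-30}
    while pgrep -x "$name" >/dev/null 2>&1; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Intended use in the teardown part of the loop (not executed here):
#   killall ccsd
#   wait_for_exit ccsd 30 && rmmod cman
```

With this guard, rmmod is only attempted once ccsd has actually gone, which separates the timing problem in the test from the missing owner field in the module.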

I've also fixed a couple of /proc file instances of the same thing
both in cman and the DLM.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.23; previous revision: 1.22
done
Checking in proc.c;
/cvs/cluster/cluster/cman-kernel/src/proc.c,v  <--  proc.c
new revision: 1.4; previous revision: 1.3
done


Comment 2 Corey Marthaler 2004-09-22 22:32:02 UTC
Fix verified.

Comment 3 Kiersten (Kerri) Anderson 2004-11-16 19:13:37 UTC
Updating version to the right level in the defects.  Sorry for the storm.