Description of problem:
I continually see this Oops/panic when running the following loop:

modprobe cman
ccsd
cman_tool join
sleep 45
cman_tool leave
killall ccsd
rmmod cman

It always seems to occur during either the modprobe or the rmmod of cman. There must also be a state in which the cman module should not be removable but the check allows it anyway. When everything is running fine with a quorate cluster, an rmmod isn't allowed:

[root@morph-06 root]# modprobe -r cman
FATAL: Module cman is in use.

Here is the Oops:

Sep 21 16:03:32 morph-06 kernel: CMAN: node morph-04 rejoining
slab error in kmem_cache_destroy(): cache `cluster_sock': Can't free all objects
 [<c01467a1>] kmem_cache_destroy+0xd1/0x120
 [<c01375d1>] sys_delete_module+0x121/0x170
 [<c015007a>] unmap_vma_list+0x1a/0x30
 [<c0150408>] do_munmap+0x118/0x160
 [<c011b950>] do_page_fault+0x0/0x4fc
 [<c0105e4d>] sysenter_past_esp+0x52/0x71
CMAN <CVS> (built Sep 21 2004 09:33:42) installed
kmem_cache_create: duplicate cache cluster_sock
------------[ cut here ]------------
kernel BUG at mm/slab.c:1382!
invalid operand: 0000 [#1]
SMP
Modules linked in: cman ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<c01463a6>] Not tainted
EFLAGS: 00010202 (2.6.8.1)
EIP is at kmem_cache_create+0x426/0x5b0
eax: 00000030 ebx: da903f10 ecx: c0413798 edx: 000048c4
esi: e02c1a78 edi: e02c1a78 ebp: da903d80 esp: db5f1f5c
ds: 007b es: 007b ss: 0068
Process modprobe (pid: 2319, threadinfo=db5f0000 task=db5360b0)
Stack: c0309d24 e02c1a6b da903dd8 0000000a c0000000 ffffff80 00000080 e02c1a6b
       00000080 c03432c0 e02c9a00 c03432a4 c03432a4 e00f6060 00002000 00000000
       00000000 e02c1a50 c0139327 c14da320 00000000 4000b008 0807a0e0 00c4e09d
Call Trace:
 [<e00f6060>] cluster_init+0x60/0x3f9 [cman]
 [<c0139327>] sys_init_module+0x107/0x220
 [<c0105e4d>] sysenter_past_esp+0x52/0x71
Code: 0f 0b 66 05 20 96 30 c0 8b 0b e9 5b ff ff ff 8b 47 50 c7 04
<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
e02ad04d
*pde = 00000000
Oops: 0000 [#2]
SMP
Modules linked in: cman ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<e02ad04d>] Not tainted
EFLAGS: 00010282 (2.6.8.1)
EIP is at cnxman_data_ready+0xd/0xb0 [cman]
eax: dd143300 ebx: dd143300 ecx: dd143350 edx: 00000000
esi: dd143300 edi: 00000000 ebp: 00000020 esp: db5f1c68
ds: 007b es: 007b ss: 0068
Process modprobe (pid: 2319, threadinfo=db5f0000 task=db5360b0)
Stack: c013a024 dd143300 dd16c780 c02ced37 00000052 dd143308 00000000 00001a99
       dd16c780 c02cef40 00000004 0000991a 00000004 dd143300 292ca8c0 dd384034
       dd16c780 00000020 dd500c80 00000020 dd16c780 c02cf25d ff2fa8c0 ff2fa8c0
Call Trace:
 [<c013a024>] __print_symbol+0x84/0xf0
 [<c02ced37>] udp_queue_rcv_skb+0x1b7/0x290
 [<c02cef40>] udp_v4_mcast_deliver+0x130/0x2d0
 [<c02cf25d>] udp_rcv+0xcd/0x3b0
 [<c02ac674>] ip_local_deliver+0xe4/0x230
 [<c02acb33>] ip_rcv+0x373/0x4c0
 [<c0294ec4>] netif_receive_skb+0x1b4/0x220
 [<e01f3449>] e1000_clean_rx_irq+0x3f9/0x470 [e1000]
 [<c012a646>] update_wall_time+0x16/0x40
 [<c012aa0f>] do_timer+0xaf/0xc0
 [<e01f2db4>] e1000_clean+0x34/0xb0 [e1000]
 [<c02950cf>] net_rx_action+0x7f/0x110
 [<c01268d4>] __do_softirq+0xb4/0xc0
 [<c012690d>] do_softirq+0x2d/0x30
 [<c0118b6c>] smp_apic_timer_interrupt+0xcc/0x130
 [<c010688e>] apic_timer_interrupt+0x1a/0x20
 [<c0106f7e>] die+0x9e/0x100
 [<c01072a0>] do_invalid_op+0x0/0xb0
 [<c010734c>] do_invalid_op+0xac/0xb0
 [<c01463a6>] kmem_cache_create+0x426/0x5b0
 [<c010680c>] common_interrupt+0x18/0x20
 [<c021456d>] serial_in+0x1d/0x40
 [<c01c4a09>] __delay+0x9/0x10
 [<c02169a9>] serial8250_console_write+0x129/0x200
 [<c0216880>] serial8250_console_write+0x0/0x200
 [<c0216880>] serial8250_console_write+0x0/0x200
 [<c0122e57>] __call_console_drivers+0x57/0x60
 [<c0106909>] error_code+0x2d/0x38
 [<c012007b>] migration_call+0x3b/0xd0
 [<c01463a6>] kmem_cache_create+0x426/0x5b0
 [<e00f6060>] cluster_init+0x60/0x3f9 [cman]
 [<c0139327>] sys_init_module+0x107/0x220
 [<c0105e4d>] sysenter_past_esp+0x52/0x71
Code: 8b 0a 0f 18 01 90 81 fa 84 aa 2c e0 74 22 90 8d 74 26 00 8b
<0>Kernel panic: Fatal exception in interrupt
In interrupt handler - not syncing

How reproducible:
Always
I suspect this is caused by ccsd not being fully dead by the time the rmmod is run.

For some reason the socket ops did not have a module owner field set, which would have allowed cman to be unloaded while ccsd still had a socket open. Oops.

I've also fixed a couple of /proc file instances of the same thing, in both cman and the DLM.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.23; previous revision: 1.22
done
Checking in proc.c;
/cvs/cluster/cluster/cman-kernel/src/proc.c,v  <--  proc.c
new revision: 1.4; previous revision: 1.3
done
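
For anyone re-running the loop on a build without the owner fix, the suspected race (rmmod running before ccsd has really exited) can probably be sidestepped in the test script itself. A rough sketch only: wait_for_exit is a made-up helper name, not part of cman or ccs, and it assumes pgrep is installed on the test node.

```shell
#!/bin/sh
# Hypothetical helper (not part of cman/ccs): poll until no process with the
# given name is left, or give up after a timeout given in seconds.
wait_for_exit() {
    name=$1
    timeout=${2:-30}
    waited=0
    # pgrep -x matches the exact process name and exits non-zero when
    # nothing matches (assumes pgrep from procps is available).
    while pgrep -x "$name" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $name to exit" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Intended use in the teardown half of the loop from the report:
#   cman_tool leave
#   killall ccsd
#   wait_for_exit ccsd 30 && rmmod cman
```

This only hides the symptom in the test loop. The module should still refuse to unload while a socket is open, which is what the owner field change is for.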
fix verified.
Updating version to the right level in the defects. Sorry for the storm.