Description of problem:
I was continually joining and leaving cluster membership on all nodes in a 6-node cluster. I eventually hit this Oops on morph-06 after starting ccsd again and then running cman_tool join on all nodes.

Dec 21 17:43:18 morph-06 ccsd[4215]: Starting ccsd DEVEL.1103653322:
Dec 21 17:43:18 morph-06 sshd(pam_unix)[4205]: session closed for user root
Unable to handle kernel NULL pointer dereference at virtual address 00000159
 printing eip:
e029ce44
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: cman ipv6 parport_pc lp parport autofs4 sunrpc e1000 microcode dm_mod uhci_hcd ehci_hcd button battery ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<e029ce44>]    Not tainted VLI
EFLAGS: 00010282   (2.6.9)
EIP is at __sendmsg+0xb4/0x680 [cman]
eax: 00000001   ebx: 00000000   ecx: 00000000   edx: 00000000
esi: da6f6380   edi: daa05f98   ebp: daa05f60   esp: daa05e84
ds: 007b   es: 007b   ss: 0068
Process cman_comms (pid: 4227, threadinfo=daa04000 task=da620430)
Stack: 00000001 daa05eac c011f419 00000001 00000000 00000001 00000001 db2ccb70
       db2ccb70 00000000 00000001 00000000 00000000 daa05e90 daa05ea4 00000000
       00200000 01000000 daa05f74 daa05f7c 00000000 00000000 c011d467 00000001
Call Trace:
 [<c011f419>] __wake_up_sync+0x49/0x70
 [<c011d467>] recalc_task_prio+0x97/0x190
 [<c014cfbb>] zap_pmd_range+0x4b/0x70
 [<c0120ae0>] autoremove_wake_function+0x0/0x50
 [<c028f4c6>] kernel_recvmsg+0x36/0x50
 [<e029a150>] receive_message+0x70/0xe0 [cman]
 [<e029d79b>] send_queued_message+0x9b/0xa0 [cman]
 [<e029a3be>] cluster_kthread+0x1fe/0x300 [cman]
 [<c011f2d0>] default_wake_function+0x0/0x10
 [<e029a1c0>] cluster_kthread+0x0/0x300 [cman]
 [<c01042b5>] kernel_thread_helper+0x5/0x10
Code: ff ff ff 8b 40 04 89 85 50 ff ff ff 8b 95 70 ff ff ff 8b 52 04 85 d2 0f 85 9a 05 00 00 c6 85 57 ff ff ff 00 85 f6 74 10 8b 46 14 <0f> b6 80 58 01 00 00 88 85 57 ff ff ff ba 6b 00 00 00 b8 fb f0
Dec 21 17:43:25 morph-06 sshd(pam_unix)[4217]: session opened for user root by (uid=0)
Dec 21 17:43:26 morph-06 kernel: CMAN: Waiting to join or form a Linux-cluster

Version-Release number of selected component (if applicable):
cman_tool DEVEL.1103653244 (built Dec 21 2004 12:21:57)

How reproducible:
Didn't try
The queued_messages list was not being cleaned up at shutdown, so it was possible for cman to try to send a queued message left over from a previous incarnation, with hilarious results.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.44; previous revision: 1.43
done
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.1; previous revision: 1.42
done
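For anyone reading this later, here is a rough userspace sketch of the kind of cleanup described above: drop every pending message when the node shuts down, so nothing queued during one incarnation can be sent in the next. This is not the code from cnxman.c. The names struct queued_msg, struct msg_queue, queue_init, queue_push and queue_purge are hypothetical, and the real module uses kernel lists and locking that are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pending outbound message. */
struct queued_msg {
    struct queued_msg *next;
    int incarnation;            /* cluster "life" in which it was queued */
};

/* Hypothetical stand-in for the queued_messages list. */
struct msg_queue {
    struct queued_msg *head;
    struct queued_msg *tail;
    size_t len;
};

static void queue_init(struct msg_queue *q)
{
    assert(q != NULL);
    q->head = NULL;
    q->tail = NULL;
    q->len = 0;
}

/* Append a message; returns 0 on success, -1 if allocation fails. */
static int queue_push(struct msg_queue *q, int incarnation)
{
    struct queued_msg *m = malloc(sizeof(*m));

    if (m == NULL)
        return -1;
    m->next = NULL;
    m->incarnation = incarnation;
    if (q->tail != NULL)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
    q->len++;
    return 0;
}

/*
 * Shutdown path: free every pending message and reset the queue.
 * Returns the number of messages dropped. Skipping this step leaves
 * stale entries behind for the next join to trip over.
 */
static size_t queue_purge(struct msg_queue *q)
{
    size_t dropped = 0;
    struct queued_msg *m = q->head;

    while (m != NULL) {
        struct queued_msg *next = m->next;  /* save before freeing */
        free(m);
        m = next;
        dropped++;
    }
    queue_init(q);
    return dropped;
}
```

In this sketch the leave path would call queue_purge() before the next join, so the sender only ever sees messages queued in the current incarnation.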
fix verified.