Running:

    while true
    do
        sleep 1
        cman_tool leave
        sleep 1
        cman_tool join
    done

After several iterations the above loop will hang, at which point all further attempts to run cman_tool fail with the error:

    can't open cluster socket: Device or resource busy

(which makes sense, as there is a hung cman_tool.)

[root@tank-06 root]# uname -ar
Linux tank-06.lab.msp.redhat.com 2.6.8.1 #1 SMP Tue Aug 24 14:47:46 CDT 2004 i686 i686 i386 GNU/Linux

Version-Release number of selected component (if applicable):
CMAN <CVS> (built Aug 24 2004 14:55:41) installed

How reproducible:
Always
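
For anyone retesting, the leave/join loop from the report could be wrapped so that it stops and reports which step hung, which is one of the questions asked in the follow-up. This is only a sketch under assumptions: it relies on the coreutils `timeout` command being installed, and `CMAN_TOOL`, `PAUSE`, `LIMIT` and `leave_join_loop` are hypothetical names chosen for illustration, not part of cman. Note that if the hung process is stuck in uninterruptible sleep inside the kernel, `timeout` may not be able to kill it and may itself fail to return.

```shell
#!/bin/sh
# Hypothetical wrapper around the leave/join loop from the report.
# CMAN_TOOL, PAUSE and LIMIT are illustrative knobs, not cman settings.
CMAN_TOOL=${CMAN_TOOL:-cman_tool}   # command to run
PAUSE=${PAUSE:-1}                   # seconds to sleep before each step
LIMIT=${LIMIT:-30}                  # seconds before a step counts as hung

# leave_join_loop N: run up to N leave/join cycles and stop at the
# first step that fails or does not exit within LIMIT seconds.
leave_join_loop() {
    max=$1
    i=1
    while [ "$i" -le "$max" ]; do
        for action in leave join; do
            sleep "$PAUSE"
            timeout "$LIMIT" "$CMAN_TOOL" "$action"
            status=$?
            if [ "$status" -eq 124 ]; then
                # coreutils timeout exits with 124 when the limit expires
                echo "iteration $i: $action hung (no exit within ${LIMIT}s)"
                return 1
            elif [ "$status" -ne 0 ]; then
                echo "iteration $i: $action failed with status $status"
                return 1
            fi
        done
        i=$((i + 1))
    done
    echo "completed $max iterations without a hang"
}
```

Calling `leave_join_loop 1000` would then either finish or name the iteration and the step (leave or join) at which things stopped.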
Annoyingly, I can't reproduce this on my UP or SMP systems, unless it needs many more iterations than I infer from "several". Can you tell me whether cman_tool hangs on join or on leave (the latest dmesg output would also be handy), and which cman daemons are running at the point of the hang? Also, how many other machines are in the cluster at the time?
OK, I've managed to reproduce this now.
I've added a missing wake call which might have been causing this. It's probably worth testing again for that; it has been running OK here for a few hours now. cman really needs to be updated to use the new kthreads interface at some point.
Patrick -- Running another test I hit the following. I was doing leave/joins as well as GETMEMBERS ioctls when this happened. Looks like it may be related to the above issue? If not, let's open a new bug.

6 node cluster, taking 1 or 2 members out and then checking everyone sees the same cluster view. tank-03 did a join to rejoin the cluster and:

Unable to handle kernel NULL pointer dereference at virtual address 00000001
 printing eip:
00000001
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: gnbd lock_gulm lock_nolock lock_dlm dlm cman gfs lock_harness ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<00000001>] Not tainted
EFLAGS: 00010087 (2.6.8.1)
EIP is at 0x1
eax: f5b83e50 ebx: f5b83e50 ecx: 00000000 edx: 00000001
esi: 00000000 edi: 8398d040 ebp: f5b83eec esp: f5b83ecc
ds: 007b es: 007b ss: 0068
Process cman_comms (pid: 3873, threadinfo=f5b82000 task=f5c8f310)
Stack: c011eff7 00000000 f8a83684 00000001 00000001 00000282 f8a83680 00000014
       f5b83f04 c011f05f 00000000 00000000 00000000 f759fd80 f5f68800 f8a6701e
       00000000 f8a67afd f5c8f310 c2019ca0 c201a600 f759fd80 f5f68800 00000014
Call Trace:
 [<c011eff7>] __wake_up_common+0x37/0x70
 [<c011f05f>] __wake_up+0x2f/0x40
 [<f8a6701e>] unjam+0x1e/0x40 [cman]
 [<f8a67afd>] send_to_userport+0xad/0x560 [cman]
 [<f8a671bc>] receive_message+0xcc/0xf0 [cman]
 [<f8a67359>] cluster_kthread+0x179/0x320 [cman]
 [<c011efb0>] default_wake_function+0x0/0x10
 [<f8a671e0>] cluster_kthread+0x0/0x320 [cman]
 [<c01042b5>] kernel_thread_helper+0x5/0x10
Code: Bad EIP value.
I hit another stack running the same test: one or more nodes leave the cluster, all other nodes check that they agree on the cluster view, then the nodes rejoin. Repeat.

New stack (triggered after a cman_tool leave):

Unable to handle kernel NULL pointer dereference at virtual address 00000001
 printing eip:
00000001
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: cman ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU: 0
EIP: 0060:[<00000001>] Not tainted
EFLAGS: 00010086 (2.6.8.1)
EIP is at 0x1
eax: fffffff2 ebx: c011eff7 ecx: 00000000 edx: fffffff2
esi: 00000000 edi: f8a43684 ebp: 00000001 esp: f550ff60
ds: 007b es: 007b ss: 0068
Process cman_comms (pid: 2331, threadinfo=f550e000 task=c2362b70)
Stack: 00000286 f8a43680 f7102800 f550ff84 c011f05f 00000000 00000000 f727cc80
       f8a449c8 f550e000 f8a2701e 00000000 f8a2ab70 f8a39fc4 0136001e f727cc80
       f8a449c8 f7102800 f550e000 f8a27443 f8a3b8e0 c2362b70 0000001f 00000000
Call Trace:
 [<c011f05f>] __wake_up+0x2f/0x40
 [<f8a2701e>] unjam+0x1e/0x40 [cman]
 [<f8a2ab70>] node_shutdown+0x20/0x330 [cman]
 [<f8a27443>] cluster_kthread+0x263/0x320 [cman]
 [<c011efb0>] default_wake_function+0x0/0x10
 [<f8a271e0>] cluster_kthread+0x0/0x320 [cman]
 [<c01042b5>] kernel_thread_helper+0x5/0x10
Code: Bad EIP value.
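
The "all nodes agree on the cluster view" check in this test could look roughly like the sketch below. The thread does not say how the views were gathered or compared beyond the GETMEMBERS ioctls, so everything here is an assumption: each node's member list is presumed to have been saved to its own file, one node name per line, and `views_agree` is a hypothetical helper name.

```shell
#!/bin/sh
# views_agree FILE...
# Each FILE is assumed to hold one node's view of the cluster
# membership, one member name per line (hypothetical format).
# Order within a file is ignored; every file is compared to the first.
views_agree() {
    reference=$(sort "$1")
    for view in "$@"; do
        if [ "$(sort "$view")" != "$reference" ]; then
            echo "mismatch: $view differs from $1"
            return 1
        fi
    done
    echo "all $# views agree"
}
```

A test harness would call something like `views_agree tank-01.view tank-02.view tank-04.view` after each leave or join, where the file names are again only illustrative.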
It's definitely a different bug. If this morning's checkin doesn't fix it then raise a new bug report.
I think this is worth a retest; bear in mind my previous comment.
I have not seen the hang after several hours of running, nor the above stacks. Did you make a change which would have gotten rid of the oops from comments 4 and 5?
I have, yes. Apologies for not making that clear.
OK -- In that case I'll mark this as fixed. Thanks!
Updating version to the right level in the defects. Sorry for the storm.