Description of problem:
I started `cman_tool join` on all the nodes in the cluster (I have 8 nodes).
My environment is set up to start cman on startup. I rebooted all my nodes at
about the same time. A few of the nodes were unable to join the cluster (I've
not yet been able to figure out how to get any useful diagnostics...
suggestions are welcome). While trying to figure out why one of the nodes was
not joining, I ran `cat /proc/cluster/status` and it segfaulted due to a
kernel oops. Is this related to bug #142853?

Output from console:

CMAN: node trin-09 is not responding - removing from the cluster
CMAN: node trin-09 is not responding - removing from the cluster
CMAN: node trin-04 is not responding - removing from the cluster
Unable to handle kernel paging request at virtual address 0000dd86
 printing eip:
c01d821c
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core cman(U) md5 ipv6 sunrpc dm_mod button battery ac uhci_hcd ehci_hcd hw_random e1000 floppy ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c01d821c>]    Not tainted VLI
EFLAGS: 00010297   (2.6.9-1.906_EL)
EIP is at vsnprintf+0x2c7/0x488
eax: 0000dd86   ebx: de2b3e64   ecx: 0000dd86   edx: fffffffe
esi: de2b3de4   edi: 0000000a   ebp: ffffffff   esp: de2b3d9c
ds: 007b   es: 007b   ss: 0068
Process cat (pid: 8855, threadinfo=de2b3000 task=dbe7d830)
Stack: ffffffff ffffffff 00000000 ffffffff 21d4c1b8 de2b3e48 e0307815 de2b3e48
       00000000 00000000 00000018 c01d843f de2b3de0 c01d8452 e02ff854 de2b3e48
       e03077f8 0000dd86 ffffffff 00000002 ffffffff 242c4000 dbd3c000 e030797e
Call Trace:
 [<c01d843f>] vsprintf+0xd/0xf
 [<c01d8452>] sprintf+0x11/0x12
 [<e02ff854>] membership_state+0x93/0x9f [cman]
 [<e02ffa73>] proc_cluster_status+0x33/0x2a0 [cman]
 [<c019c3e1>] proc_alloc_inode+0x3c/0x54
 [<c017e6da>] alloc_inode+0xf6/0x17f
 [<c017c9ef>] d_instantiate+0x12e/0x131
 [<c019fd82>] proc_lookup+0x1a0/0x1aa
 [<c01710a1>] real_lookup+0x73/0xde
 [<c01713d1>] do_lookup+0x56/0x8f
 [<c017ae23>] dput+0x33/0x417
 [<c01720de>] link_path_walk+0xcd4/0xd8c
 [<c016d0f9>] cp_new_stat64+0x124/0x139
 [<c0145c9f>] buffered_rmqueue+0x1c4/0x1e7
 [<c0145d76>] __alloc_pages+0xb4/0x298
 [<c019f585>] proc_file_read+0x97/0x225
 [<c01621fe>] vfs_read+0xb6/0xe2
 [<c0162411>] sys_read+0x3c/0x62
 [<c0301bfb>] syscall_call+0x7/0xb
Code: 01 00 00 3b 5c 24 0c 77 f0 c6 03 20 eb eb 89 f0 83 c6 04 8b 08 b8 0f a1 31 c0 8b 54 24 04 81 f9 ff 0f 00 00 0f 46 c8 89 c8 eb 06 <80> 38 00 74 07 40 4a 83 fa ff 75 f4 29 c8 f6 44 24 08 10 89 c7
<4>CMAN: node trin-09 is not responding - removing from the cluster
CMAN: nmembers in HELLO message from 5 does not match our view (got 4, exp 5)
CMAN: too many transition restarts - will die
CMAN: we are leaving the cluster. Reason is 5

Version-Release number of selected component (if applicable):
cman-kernel-2.6.9-3.3
cman-1.0-0.pre5.0

How reproducible:
I've not really tried yet.

Steps to Reproduce:
I've not tried to reproduce it yet.
proc_cluster_status allocates a 255-byte buffer - that might be enough to kill the stack under some circumstances, so I've fixed that. Looking at that code it's hard to see what else it could be.

Checking in proc.c;
/cvs/cluster/cluster/cman-kernel/src/proc.c,v  <--  proc.c
new revision: 1.10; previous revision: 1.9
done
As this has been in CVS (now git) since 2004, I think there's a good chance it's in a release!