Description of problem:
Attempting to specify the nodeids for each node on the command line. Started ccsd on all, then issued "cman_tool join -N 1000" on node link-10. Issued "cman_tool join -N 2000" on link-11 and got this Oops on link-10.

Unable to handle kernel paging request at virtual address e00f7f40
 printing eip:
e02b56d0
*pde = 014ef067
Oops: 0002 [#1]
SMP
Modules linked in: dlm loop cman ipv6 parport_pc lp parport autofs4 sunrpc e1000 floppy sg microcode dm_mod uhci_hcd ehci_hcd button battery asus_acpi ac ext3 jbd qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
CPU:    0
EIP:    0060:[<e02b56d0>]    Not tainted
EFLAGS: 00010246   (2.6.8.1)
EIP is at add_new_node+0x1a0/0x310 [cman]
eax: e00f6000   ebx: e00f6000   ecx: 000007d0   edx: 00000077
esi: e00e6200   edi: e00f6228   ebp: db348c00   esp: dcabbedc
ds: 007b   es: 007b   ss: 0068
Process cman_memb (pid: 5115, threadinfo=dcaba000 task=df19a8d0)
Stack: 00000000 0000007b 0000008a 00000003 e02caf40 e02caf00 e02caf48 e02cadc8
       dcabbf78 e02b691f 000007d0 00000001 e02caf40 e02caf30 00000000 dcabbf70
       00000000 00000048 df19a8d0 e02caf00 dcabbf78 00000048 ffffffff e02b48fa
Call Trace:
 [<e02b691f>] do_process_joinreq+0xff/0x1b0 [cman]
 [<e02b48fa>] do_membership_packet+0x4a/0x1e0 [cman]
 [<e02b739a>] dispatch_messages+0xda/0x100 [cman]
 [<e02b3744>] membership_kthread+0x194/0x400 [cman]
 [<c0105db2>] ret_from_fork+0x6/0x14
 [<c011efb0>] default_wake_function+0x0/0x10
 [<e02b35b0>] membership_kthread+0x0/0x400 [cman]
 [<c01042b5>] kernel_thread_helper+0x5/0x10
Code: 89 2c 88 b0 01 86 05 20 bb 2c e0 bb 48 ac 2c e0 ba 77 00 00
Sep 29 14:16:08 link-10 kernel: CMAN: forming a new cluster

Version-Release number of selected component (if applicable):
[root@link-11 /]# cman_tool -V
cman_tool DEVEL.1096386815 (built Sep 28 2004 10:54:45)
Copyright (C) Red Hat, Inc. 2004 All rights reserved.

How reproducible:
Will try again after I file bug.

Steps to Reproduce:
1. Start ccsd on two nodes
2. Run 'cman_tool join -N 1000' on one
3. Run 'cman_tool join -N 2000' on the other.

Actual results:
First node panics.

Expected results:
Cluster formation or graceful failure.

Additional info:
Recreated with -N 1000 and -N 2000. It did not Oops with -N 10 and -N 20, so those ints must be too big for something.
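The register dump in the Oops is consistent with that guess. One reading, offered as an assumption rather than a confirmed diagnosis: eax (e00f6000) looks like the base of an array of 4-byte entries, ecx (000007d0, decimal 2000) looks like the node ID passed to -N on link-11, and the first three Code bytes (89 2c 88) decode to a store through eax + ecx*4. The small C sketch below only redoes that arithmetic; the function names are made up for illustration and are not taken from cman.

```c
#include <assert.h>
#include <stdint.h>

/* Address touched by a store to slot `index` of an array of 4-byte
 * entries starting at `base` (32-bit x86, as on the machine that Oopsed).
 * Hypothetical helper, not a cman function. */
uint32_t slot_address(uint32_t base, uint32_t index)
{
    return base + index * (uint32_t)sizeof(uint32_t);
}

/* Plug in the values from the Oops: eax as the base, ecx as the index.
 * The result matches the faulting virtual address in the report. */
void check_against_oops(void)
{
    assert(slot_address(0xe00f6000u, 0x000007d0u) == 0xe00f7f40u);
}
```

With -N 10 and -N 20 the same calculation stays within a few dozen bytes of the base, which would explain why those runs did not Oops.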
This affected DLM too. Both keep an array of nodeIDs for fast access on the (previously correct) assumption that nodeIDs would be contiguous. Starting a node with an ID of 1000 went way off the end of that array. Anyway, both now do rather smarter allocation of node information. I've also put a limit on the size of a nodeID, simply to prevent the kernel running out of memory if someone tries to make a nodeID that's really, really big.
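For readers who want a picture of what "smarter allocation" plus an upper limit can look like, here is a minimal userspace sketch. It is not the actual cman or DLM change: the struct and function names are hypothetical, it uses realloc() where kernel code would use kernel allocators, and the 4096 ceiling is taken from the cman_tool error message in the verification comment of this bug.

```c
#include <stdlib.h>

#define MAX_NODE_ID 4096        /* ceiling reported by cman_tool in this bug */

struct node_info { int id; };   /* hypothetical stand-in for per-node data */

struct node_table {
    struct node_info **slots;   /* indexed by node ID; NULL means unused */
    size_t size;                /* number of slots currently allocated */
};

/* Store `node` under `id`, growing the table on demand.
 * Returns 0 on success, -1 if the ID is out of range or allocation fails. */
int node_table_set(struct node_table *t, int id, struct node_info *node)
{
    if (id < 1 || id > MAX_NODE_ID)
        return -1;              /* reject instead of writing past the array */

    if ((size_t)id >= t->size) {
        size_t new_size = (size_t)id + 1;
        struct node_info **grown = realloc(t->slots, new_size * sizeof(*grown));
        if (grown == NULL)
            return -1;
        /* Mark the newly added slots as unused. */
        for (size_t i = t->size; i < new_size; i++)
            grown[i] = NULL;
        t->slots = grown;
        t->size = new_size;
    }
    t->slots[id] = node;
    return 0;
}

/* Look up a node by ID; returns NULL for unknown or out-of-range IDs. */
struct node_info *node_table_get(const struct node_table *t, int id)
{
    if (id < 1 || (size_t)id >= t->size)
        return NULL;
    return t->slots[id];
}
```

Starting from a zero-initialised table, setting IDs 1000 and 2000 grows the array to fit, and an ID of 5000 is refused up front. The cap keeps the worst-case table size bounded, which is the memory concern mentioned above.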
Verified: No Oops and no garbage-in. I approve this message.

[root@link-10 root]# cman_tool join -N 5000
cman_tool: Node id must be between 1 and 4096
[root@link-10 root]# echo $?
1
[root@link-10 root]#
Updating version to the right level in the defects. Sorry for the storm.