RRP is still TP in 6.2, with planned full support for 6.3. There are several bits and pieces that need to be integrated correctly:

1) automatic multiple multicast addresses (one per ring)
2) IPv4 vs IPv6 (one ring on one protocol, the other on another?)
3) cman configuration for RRP is at best annoying (the mcast address is not global but per node)
4) cman needs to improve error handling for invalid RRP configs (more than 2 rings, for example)

Since the RRP code in cman has barely been tested, we will need some work there to make sure it is robust and easy to use (and I need a bug to do the work and commit it).
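For orientation, a minimal two-ring (RRP) configuration only adds one <altname> tag per node. This is just a sketch with placeholder hostnames (the altname must resolve to the node's address on the second network); the real test configs are in the report further down:

<cluster name="example" config_version="1">
  <clusternodes>
    <clusternode name="node1-net1" votes="1" nodeid="1">
      <!-- ring 1: this node's name on the second network -->
      <altname name="node1-net2"/>
    </clusternode>
    <clusternode name="node2-net1" votes="1" nodeid="2">
      <altname name="node2-net2"/>
    </clusternode>
  </clusternodes>
</cluster>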
#2 won't work and seems like something we couldn't effectively support.
(In reply to comment #1)
> #2 won't work and seems like something we couldn't effectively support.

That is exactly why we need to track the combinations. This BZ is not meant to extend what we support, but to make sure cman does the correct configuration checks based on what we want to support, and to keep features in sync between the corosync and cman configs.
https://www.redhat.com/archives/cluster-devel/2011-November/msg00136.html

On the assumption that the patch should introduce no regressions in the standard default config (tested, but not reported here):

1) don't allow more than 2 rings in the config

<clusternodes>
  <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node1"/>
      </method>
    </fence>
    <altname name="clusternet-node1-eth2"/>
    <altname name="clusternet-node1-eth3"/>
  </clusternode>

[root@clusternet-node1 ~]# ccs_config_validate
Configuration of more than 2 rings is not supported
Unable to get the configuration

(forcefully disable ccs_config_validate)

[root@clusternet-node1 ~]# cman_tool -d join -DNONE
Starting /usr/sbin/corosync
corosync -f
CMAN_DEBUG=255
COROSYNC_DEFAULT_CONFIG_IFACE=xmlconfig:cmanpreconfig:openaisserviceenablestable
CMAN_PIPE=4
corosync [MAIN ] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp
corosync [MAIN ] Successfully read config from /etc/cluster/cluster.conf
corosync [MAIN ] Configuration of more than 2 rings is not supported
corosync [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1659.
forked process ID is 4634
Configuration of more than 2 rings is not supported
cman_tool: corosync daemon didn't start

2) enforce either a different mcast address or different ports on each ring to avoid conflicts

NOTE: corosync uses the specified port and port-1. With 2 rings, none of the 4 resulting ports may conflict when the same mcast address is in use: for example, ring 0 on port 666 binds 666 and 665, so putting ring 1 on port 665 (binding 665 and 664) still collides on 665. The default config enforces 2 different mcast addresses on the same port.

2a) same mcast / same port

<cman port="666">
  <multicast addr="239.192.99.74"/>
</cman>
<clusternodes>
  <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node1"/>
      </method>
    </fence>
    <altname name="clusternet-node1-eth2" mcast="239.192.99.74" port="666"/>
  </clusternode>

[root@clusternet-node2 ~]# ccs_config_validate
Alternate communication channel (mcast: 239.192.99.74 ports: 666,665) cannot use same address and ports of primary channel (mcast: 239.192.99.74 ports: 666,665)
Unable to get the configuration

[root@clusternet-node2 ~]# cman_tool join -DNONE
corosync [MAIN ] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp
corosync [MAIN ] Successfully read config from /etc/cluster/cluster.conf
corosync [MAIN ] Alternate communication channel (mcast: 239.192.99.74 ports: 666,665) cannot use same address and ports of primary channel (mcast: 239.192.99.74 ports: 666,665)
corosync [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1659.
Alternate communication channel (mcast: 239.192.99.74 ports: 666,665) cannot use same address and ports of primary channel (mcast: 239.192.99.74 ports: 666,665)
cman_tool: corosync daemon didn't start

2b) same mcast / ports overlap

<cman port="666">
  <multicast addr="239.192.99.74"/>
</cman>
<clusternodes>
  <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node1"/>
      </method>
    </fence>
    <altname name="clusternet-node1-eth2" mcast="239.192.99.74" port="665"/>
  </clusternode>

[root@clusternet-node2 ~]# ccs_config_validate
Alternate communication channel (mcast: 239.192.99.74 ports: 665,664) cannot use same address and ports of primary channel (mcast: 239.192.99.74 ports: 666,665)
Unable to get the configuration

[root@clusternet-node2 ~]# cman_tool join -DNONE
corosync [MAIN ] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp
corosync [MAIN ] Successfully read config from /etc/cluster/cluster.conf
corosync [MAIN ] Alternate communication channel (mcast: 239.192.99.74 ports: 665,664) cannot use same address and ports of primary channel (mcast: 239.192.99.74 ports: 666,665)
corosync [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1659.
Alternate communication channel (mcast: 239.192.99.74 ports: 665,664) cannot use same address and ports of primary channel (mcast: 239.192.99.74 ports: 666,665)
cman_tool: corosync daemon didn't start

2c) same as 2b but backwards

<cman port="665">
  <multicast addr="239.192.99.74"/>
</cman>
<clusternodes>
  <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node1"/>
      </method>
    </fence>
    <altname name="clusternet-node1-eth2" mcast="239.192.99.74" port="666"/>
  </clusternode>

[root@clusternet-node2 ~]# ccs_config_validate
Alternate communication channel (mcast: 239.192.99.74 ports: 666,665) cannot use same address and ports of primary channel (mcast: 239.192.99.74 ports: 665,664)
Unable to get the configuration

[root@clusternet-node2 ~]# cman_tool join -DNONE
corosync [MAIN ] Corosync Cluster Engine ('1.4.2'): started and ready to provide service.
corosync [MAIN ] Corosync built-in features: nss dbus rdma snmp
corosync [MAIN ] Successfully read config from /etc/cluster/cluster.conf
corosync [MAIN ] Alternate communication channel (mcast: 239.192.99.74 ports: 666,665) cannot use same address and ports of primary channel (mcast: 239.192.99.74 ports: 665,664)
corosync [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1659.
Alternate communication channel (mcast: 239.192.99.74 ports: 666,665) cannot use same address and ports of primary channel (mcast: 239.192.99.74 ports: 665,664)
cman_tool: corosync daemon didn't start

2d) different mcast

<cman port="666">
  <multicast addr="239.192.99.74"/>
</cman>
<clusternodes>
  <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node1"/>
      </method>
    </fence>
    <altname name="clusternet-node1-eth2" mcast="239.192.99.75" port="666"/>
  </clusternode>

[root@clusternet-node2 ~]# ccs_config_validate
Configuration validates

cman starts fine

2e) different ports

<cman port="666">
  <multicast addr="239.192.99.74"/>
</cman>
<clusternodes>
  <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node1"/>
      </method>
    </fence>
    <altname name="clusternet-node1-eth2" mcast="239.192.99.74" port="777"/>
  </clusternode>

[root@clusternet-node2 ~]# ccs_config_validate
Configuration validates

cman starts fine

3) both rings must be on the same protocol (v4 / v6)

Set up both rings to support v4 and v6. For simplicity:

3ffe::1 clusternet-node1-eth2-v6
3ffe::2 clusternet-node2-eth2-v6
4000::1 clusternet-node1-eth1-v6
4000::2 clusternet-node2-eth1-v6

3a) both rings on v4

<cluster name="fabbione" config_version="3">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
      <fence>
        <method name="single">
          <device name="xvm" domain="clusternet-node1"/>
        </method>
      </fence>
      <altname name="clusternet-node1-eth2"/>
    </clusternode>

[root@clusternet-node2 ~]# ccs_config_validate
Configuration validates

[root@clusternet-node2 daemon]# cman_tool status
Version: 6.2.0
Config Version: 3
Cluster Name: fabbione
Cluster Id: 25573
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: clusternet-node2-eth1
Node ID: 2
Multicast addresses: 239.192.99.73 239.192.99.74
Node addresses: 192.168.4.2 192.168.5.2

3b) ring 0 on v6, ring 1 on v4

<cluster name="fabbione" config_version="3">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="clusternet-node1-eth1-v6" votes="1" nodeid="1">
      <fence>
        <method name="single">
          <device name="xvm" domain="clusternet-node1"/>
        </method>
      </fence>
      <altname name="clusternet-node1-eth2"/>
    </clusternode>

[root@clusternet-node2 ~]# ccs_config_validate
Node address family does not match multicast address family
Unable to get the configuration

[root@clusternet-node2 ~]# cman_tool join
Node address family does not match multicast address family
Unable to get the configuration
cman_tool: Not joining, configuration is not valid

3c) ring 0 on v4, ring 1 on v6

<cluster name="fabbione" config_version="3">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
      <fence>
        <method name="single">
          <device name="xvm" domain="clusternet-node1"/>
        </method>
      </fence>
      <altname name="clusternet-node1-eth2-v6"/>
    </clusternode>

[root@clusternet-node2 ~]# ccs_config_validate
Node address family does not match multicast address family
Unable to get the configuration

[root@clusternet-node2 ~]# cman_tool join
Node address family does not match multicast address family
Unable to get the configuration
cman_tool: Not joining, configuration is not valid

3d) both rings on v6

<cluster name="fabbione" config_version="3">
  <logging debug="on"/>
  <clusternodes>
    <clusternode name="clusternet-node1-eth1-v6" votes="1" nodeid="1">
      <fence>
        <method name="single">
          <device name="xvm" domain="clusternet-node1"/>
        </method>
      </fence>
      <altname name="clusternet-node1-eth2-v6"/>
    </clusternode>

[root@clusternet-node2 ~]# ccs_config_validate
Configuration validates

[root@clusternet-node2 daemon]# cman_tool status
Version: 6.2.0
Config Version: 3
Cluster Name: fabbione
Cluster Id: 25573
Cluster Member: Yes
Cluster Generation: 8
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: clusternet-node2-eth1-v6
Node ID: 2
Multicast addresses: ff15::63e5 ff15::63e6
Node addresses: 4000::2 3ffe::2

4) simplify the alt config

The current config for a fully customized alt mcast address and port is rather insane. Take the extreme case where the cman port/ttl/mcast address and the alt port/ttl/mcast address all need to be customized:

<cman port="666">
  <multicast addr="239.192.99.73" ttl="2"/>
</cman>
<clusternodes>
  <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
    <altname name="clusternet-node1-eth2" mcast="239.192.99.88" port="888" ttl="3"/>
  </clusternode>
  <clusternode name="clusternet-node2-eth1" votes="1" nodeid="2">
    <altname name="clusternet-node2-eth2" mcast="239.192.99.88" port="888" ttl="3"/>
  </clusternode>

first issue: cman's network data is split across two tags (the port on <cman>, the address and ttl on <multicast>):

<cman port="666">
  <multicast addr="239.192.99.73" ttl="2"/>
</cman>

second issue: the altname settings

<altname name="clusternet-node2-eth2" mcast="239.192.99.88" port="888" ttl="3"/>

need to be repeated for every single node in the cluster.

The resulting totem keys:

totem.interface.ringnumber=0
totem.interface.bindnetaddr=192.168.4.2
totem.interface.mcastaddr=239.192.99.73
totem.interface.mcastport=666
totem.interface.ttl=2
totem.interface.ringnumber=1
totem.interface.bindnetaddr=192.168.5.2
totem.interface.mcastaddr=239.192.99.88
totem.interface.mcastport=888
totem.interface.ttl=3

after the patch:

<cman>
  <multicast addr="239.192.99.73" port="666" ttl="2"/>
  <altmulticast addr="239.192.99.88" port="888" ttl="3"/>
</cman>
<clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
  <altname name="clusternet-node1-eth2"/>
<clusternode name="clusternet-node2-eth1" votes="1" nodeid="2">
  <altname name="clusternet-node2-eth2"/>

totem.interface.ringnumber=0
totem.interface.bindnetaddr=192.168.4.2
totem.interface.mcastaddr=239.192.99.73
totem.interface.mcastport=666
totem.interface.ttl=2
totem.interface.ringnumber=1
totem.interface.bindnetaddr=192.168.5.2
totem.interface.mcastaddr=239.192.99.88
totem.interface.mcastport=888
totem.interface.ttl=3

backward compatibility (the old config model overrides the new one):

<cman port="777">
  <multicast addr="239.192.99.73" port="666" ttl="2"/>
  <altmulticast addr="239.192.99.88" port="888" ttl="3"/>
</cman>

the cman port has higher priority than the multicast port:

totem.interface.ringnumber=0
totem.interface.bindnetaddr=192.168.4.2
totem.interface.mcastaddr=239.192.99.73
totem.interface.mcastport=777

the altname mcast has higher priority than altmulticast:

<cman port="777">
  <multicast addr="239.192.99.73" port="666" ttl="2"/>
  <altmulticast addr="239.192.99.88" port="888" ttl="3"/>
</cman>
<clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
  <altname name="clusternet-node1-eth2" mcast="239.192.99.99"/>

totem.interface.ringnumber=1
totem.interface.bindnetaddr=192.168.5.2
totem.interface.mcastaddr=239.192.99.99
totem.interface.mcastport=888
totem.interface.ttl=3

the altname port has higher priority than the altmulticast port:

<cman port="777">
  <multicast addr="239.192.99.73" port="666" ttl="2"/>
  <altmulticast addr="239.192.99.88" port="888" ttl="3"/>
</cman>
<clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
  <altname name="clusternet-node1-eth2" mcast="239.192.99.99" port="999"/>

totem.interface.ringnumber=1
totem.interface.bindnetaddr=192.168.5.2
totem.interface.mcastaddr=239.192.99.99
totem.interface.mcastport=999
totem.interface.ttl=3

the altname ttl has higher priority than the altmulticast ttl:

<cman port="777">
  <multicast addr="239.192.99.73" port="666" ttl="2"/>
  <altmulticast addr="239.192.99.88" port="888" ttl="3"/>
</cman>
<clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
  <altname name="clusternet-node1-eth2" mcast="239.192.99.99" port="999" ttl="20"/>

totem.interface.ringnumber=1
totem.interface.bindnetaddr=192.168.5.2
totem.interface.mcastaddr=239.192.99.99
totem.interface.mcastport=999
totem.interface.ttl=20

5) NOT RELEVANT FOR QE - fix broadcast default ports

broadcast enforces address 255.255.255.255 on all interfaces and therefore needs 2 different sets of ports: ring0 gets DEFAULT_PORT and ring1 gets DEFAULT_PORT + 2, unless a manual override is in place.

<cman broadcast="yes"> </cman>
<clusternodes>
  <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node1"/>
      </method>
    </fence>
    <altname name="clusternet-node1-eth2"/>
  </clusternode>
  <clusternode name="clusternet-node2-eth1" votes="1" nodeid="2">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node2"/>
      </method>
    </fence>
    <altname name="clusternet-node2-eth2"/>
  </clusternode>
</clusternodes>

totem.interface.ringnumber=0
totem.interface.bindnetaddr=192.168.4.2
totem.interface.broadcast=yes
totem.interface.mcastport=5405
totem.interface.ringnumber=1
totem.interface.bindnetaddr=192.168.5.2
totem.interface.broadcast=yes
totem.interface.mcastport=5407

<cman broadcast="yes" port="999"> </cman>
<clusternodes>
  <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node1"/>
      </method>
    </fence>
    <altname name="clusternet-node1-eth2"/>
  </clusternode>

totem.interface.ringnumber=0
totem.interface.bindnetaddr=192.168.4.2
totem.interface.broadcast=yes
totem.interface.mcastport=999
totem.interface.ringnumber=1
totem.interface.bindnetaddr=192.168.5.2
totem.interface.broadcast=yes
totem.interface.mcastport=5407

<cman broadcast="yes" port="999"> </cman>
<clusternodes>
  <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node1"/>
      </method>
    </fence>
    <altname name="clusternet-node1-eth2" port="1000"/>
  </clusternode>

[root@clusternet-node1 ~]# cman_tool join
Alternate communication channel (mcast: 255.255.255.255 ports: 1000,999) cannot use same address and ports of primary channel (mcast: 255.255.255.255 ports: 999,998)
Unable to get the configuration
cman_tool: Not joining, configuration is not valid

<cman broadcast="yes" port="999"> </cman>
<clusternodes>
  <clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node1"/>
      </method>
    </fence>
    <altname name="clusternet-node1-eth2" port="1001"/>
  </clusternode>
  <clusternode name="clusternet-node2-eth1" votes="1" nodeid="2">
    <fence>
      <method name="single">
        <device name="xvm" domain="clusternet-node2"/>
      </method>
    </fence>
    <altname name="clusternet-node2-eth2" port="1001"/>
  </clusternode>
</clusternodes>

totem.interface.ringnumber=0
totem.interface.bindnetaddr=192.168.4.2
totem.interface.broadcast=yes
totem.interface.mcastport=999
totem.interface.ringnumber=1
totem.interface.bindnetaddr=192.168.5.2
totem.interface.broadcast=yes
totem.interface.mcastport=1001
Fabio,

Great work! I have general concerns about mixing IPv4 and IPv6 in RRP. This is not something we should support.

Regards
-steve
(In reply to comment #5)
> Fabio,
>
> Great work!

Thanks :)

> I have general concerns about mixing IPv4 and IPv6 in RRP. This
> is not something we should support.

The patch already protects against it. Test case #3 shows all 4 possible combinations, and the 2 mixed ones fail to start with an error.
This looks great, Fabio -- the only thing that we'll need to work out is which configuration(s) we're going to support from luci/ccs.
(In reply to comment #7)
> This looks great, Fabio -- the only thing that we'll need to work out is which
> configuration(s) we're going to support from luci/ccs.

I think luci/ccs should create the config in the new format, but it will still need to be able to understand the old one and eventually convert it. The old format is heavy to read and maintain. A sketch of such a conversion is below.
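Purely as an illustration (reusing the addresses from test case 4 above; this is not something luci generates today), a converter would lift the per-node altname attributes into the global <altmulticast> tag:

old format (per-node attributes, repeated on every node):

<cman port="666">
  <multicast addr="239.192.99.73" ttl="2"/>
</cman>
<clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
  <altname name="clusternet-node1-eth2" mcast="239.192.99.88" port="888" ttl="3"/>
</clusternode>

new format (stated once, globally):

<cman>
  <multicast addr="239.192.99.73" port="666" ttl="2"/>
  <altmulticast addr="239.192.99.88" port="888" ttl="3"/>
</cman>
<clusternode name="clusternet-node1-eth1" votes="1" nodeid="1">
  <altname name="clusternet-node1-eth2"/>
</clusternode>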
respin with correct patch
I tested this on my mixed-version cluster, gfs-a16c-0{1,2,3,4}.mpc.lab.eng.bos.redhat.com, and it works.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: cman was not able to configure the Redundant Ring Protocol (RRP) correctly in corosync.
Consequence: RRP deployments would not work properly.
Fix: cman has been improved to both configure RRP properly and perform extra sanity checks on user configurations.
Result: it is now easier to deploy a cluster with RRP, and error reports are more extensive.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0861.html