Description of problem:

When a node is abruptly killed in a cluster, the number of votes needed to achieve quorum changes when running qdisk. The number of votes decreases by 1. The number of votes needed to achieve quorum should not change unless a node gracefully leaves the cluster. In this example the node is killed with a sysrq crash.

* Here is the cluster status before the crash:

root@rh5node1:bin$ cman_tool status
Version: 6.2.0
Config Version: 10
Cluster Name: rh5cluster1
Cluster Id: 13721
Cluster Member: Yes
Cluster Generation: 1068
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Quorum device votes: 1
Total votes: 4
Quorum: 3
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: rh5node1.examplerh.com
Node ID: 1
Multicast addresses: 239.1.5.1
Node addresses: 192.168.1.151

* This node was crashed:

root@rh5node3:~$ echo c > /proc/sysrq-trigger

* The crashed node was fenced off correctly:

root@rh5node1:bin$ tail -n 1 /var/log/messages
May 19 15:28:46 rh5node1 fenced[2017]: fence "rh5node3.examplerh.com" success

* Here is the cluster status after the crash:

root@rh5node1:bin$ cman_tool status
Version: 6.2.0
Config Version: 10
Cluster Name: rh5cluster1
Cluster Id: 13721
Cluster Member: Yes
Cluster Generation: 1072
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Quorum: 2
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: rh5node1.examplerh.com
Node ID: 1
Multicast addresses: 239.1.5.1
Node addresses: 192.168.1.151
root@rh5node1:bin$

Version-Release number of selected component (if applicable):
cman-2.0.115-34.el5

How reproducible:
Every time

Steps to Reproduce:
1. Set up a cluster with qdisk on all nodes, check $(cman_tool status) for QUORUM
2. Kill a node $(echo c > /proc/sysrq-trigger)
3. Check $(cman_tool status) for QUORUM

Actual results:
The number of votes needed for QUORUM is recalculated when a node abruptly dies.

Expected results:
The number of votes needed for QUORUM should not be recalculated when a node abruptly dies.

Additional info:
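As a rough illustration of the symptom only (assuming, for simplicity, that quorum is a strict majority of the votes currently counted; this is a simplification, not cman's actual calculation, which also takes expected votes and the quorum device into account):

#include <stdio.h>

/* Simplified illustration only -- not cman's real quorum algorithm. */
static int simple_majority_quorum(int total_votes)
{
    return total_votes / 2 + 1;
}

int main(void)
{
    /* Before the crash: 3 node votes + 1 qdisk vote -> quorum 3. */
    printf("before: %d\n", simple_majority_quorum(3 + 1));

    /* After the crash the dead node's vote is (wrongly) dropped from the
     * count, so the requirement falls to 2 instead of staying at 3. */
    printf("after:  %d\n", simple_majority_quorum(2 + 1));
    return 0;
}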
Created attachment 416140 [details]
Patch to fix bitwise ops

This is an untested patch that fixes the use of the leave_reason member variable.
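To illustrate the class of bug such a patch addresses, here is a hypothetical sketch (the names and values below are made up for illustration and are not cman's actual definitions): if leave_reason holds an enumerated reason code but is tested with a bitwise AND as though it were a single-bit flag, a code whose bits happen to overlap the "removed" value makes an abruptly dead node take the same path as one removed with "cman_tool leave remove", and quorum gets recalculated downwards.

/* Hypothetical reason codes -- illustrative only, not cman's definitions. */
#define LEAVE_REASON_DOWN        0x1  /* clean shutdown                  */
#define LEAVE_REASON_REMOVED     0x2  /* "cman_tool leave remove"        */
#define LEAVE_REASON_NORESPONSE  0x3  /* abrupt death, e.g. sysrq crash  */

/* Buggy: bitwise AND against a value that is not a single-bit flag.
 * LEAVE_REASON_NORESPONSE (0x3) has the 0x2 bit set, so a crashed node
 * is treated as removed and quorum is lowered. */
static int lower_quorum_buggy(unsigned int leave_reason)
{
    return (leave_reason & LEAVE_REASON_REMOVED) != 0;
}

/* Fixed: the reason codes are enumerated values, so compare for equality
 * (or mask the field correctly before testing). */
static int lower_quorum_fixed(unsigned int leave_reason)
{
    return leave_reason == LEAVE_REASON_REMOVED;
}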
lon: do you have time to test that patch?
Ok, reproduced on 2.0.115-44.el5; now to try with patch.
[root@molly ~]# cman_tool status
Version: 6.2.0
Config Version: 2822
Cluster Name: lolcats
Cluster Id: 13719
Cluster Member: Yes
Cluster Generation: 1860
Membership state: Cluster-Member
Nodes: 2
Expected votes: 6
Quorum device votes: 1
Total votes: 5
Quorum: 4
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: molly
Node ID: 1
Multicast addresses: 225.0.0.13
Node addresses: 192.168.122.4

[root@molly ~]# cman_tool status
Version: 6.2.0
Config Version: 2822
Cluster Name: lolcats
Cluster Id: 13719
Cluster Member: Yes
Cluster Generation: 1864
Membership state: Cluster-Member
Nodes: 1
Expected votes: 6
Quorum device votes: 1
Total votes: 5
Quorum: 3
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: molly
Node ID: 1
Multicast addresses: 225.0.0.13
Node addresses: 192.168.122.4
The above did not happen when I ran with the patch applied:

[root@molly ~]# cman_tool status
Version: 6.2.0
Config Version: 2822
Cluster Name: lolcats
Cluster Id: 13719
Cluster Member: Yes
Cluster Generation: 1884
Membership state: Cluster-Member
Nodes: 1
Expected votes: 6
Quorum device votes: 1
Total votes: 5
Quorum: 4
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: molly
Node ID: 1
Multicast addresses: 225.0.0.13
Node addresses: 192.168.122.4
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=efcafee5e61ee01748d9f1d2d971f72def2ce089
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0036.html