Description of problem:

When a node is abruptly killed in a cluster, the number of votes needed to achieve quorum changes when running qdisk. The number of votes decreases by 1. The number of votes needed to achieve quorum should not change unless a node gracefully leaves the cluster. In this example the node is killed with a sysrq crash.

* Here is the cluster status before the crash:

root@rh5node1:bin$ cman_tool status
Version: 6.2.0
Config Version: 10
Cluster Name: rh5cluster1
Cluster Id: 13721
Cluster Member: Yes
Cluster Generation: 1068
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Quorum device votes: 1
Total votes: 4
Quorum: 3
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: rh5node1.examplerh.com
Node ID: 1
Multicast addresses: 239.1.5.1
Node addresses: 192.168.1.151

* This node was crashed:

root@rh5node3:~$ echo c > /proc/sysrq-trigger

* The crashed node was fenced off correctly:

root@rh5node1:bin$ tail -n 1 /var/log/messages
May 19 15:28:46 rh5node1 fenced[2017]: fence "rh5node3.examplerh.com" success

* Here is the cluster status after the crash:

root@rh5node1:bin$ cman_tool status
Version: 6.2.0
Config Version: 10
Cluster Name: rh5cluster1
Cluster Id: 13721
Cluster Member: Yes
Cluster Generation: 1072
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Quorum device votes: 1
Total votes: 3
Quorum: 2
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: rh5node1.examplerh.com
Node ID: 1
Multicast addresses: 239.1.5.1
Node addresses: 192.168.1.151
root@rh5node1:bin$

Version-Release number of selected component (if applicable):
cman-2.0.115-34.el5

How reproducible:
Every time

Steps to Reproduce:
1. Set up a cluster with qdisk on all nodes, check $(cman_tool status) for QUORUM
2. Kill a node $(echo c > /proc/sysrq-trigger)
3. Check $(cman_tool status) for QUORUM

Actual results:
The number of votes needed for QUORUM is recalculated when a node abruptly dies.

Expected results:
The number of votes needed for QUORUM should not be recalculated when a node abruptly dies.

Additional info:
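As a rough illustration of the symptom only (assuming, for simplicity, that quorum is a strict majority of the votes currently counted; this is a simplification, not cman's actual calculation, which also takes expected votes and the quorum device into account):

#include <stdio.h>

/* Simplified illustration only -- not cman's real quorum algorithm. */
static int simple_majority_quorum(int total_votes)
{
    return total_votes / 2 + 1;
}

int main(void)
{
    /* Before the crash: 3 node votes + 1 qdisk vote -> quorum 3. */
    printf("before: %d\n", simple_majority_quorum(3 + 1));

    /* After the crash the dead node's vote is (wrongly) dropped from the
     * count, so the requirement falls to 2 instead of staying at 3. */
    printf("after:  %d\n", simple_majority_quorum(2 + 1));
    return 0;
}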
Created attachment 416140 [details]
Patch to fix bitwise ops

This is an untested patch that fixes the use of the leave_reason member variable.
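To illustrate the class of bug such a patch addresses, here is a hypothetical sketch (the names and values below are made up for illustration and are not cman's actual definitions): if leave_reason holds an enumerated reason code but is tested with a bitwise AND as though it were a single-bit flag, a code whose bits happen to overlap the "removed" value makes an abruptly dead node take the same path as one removed with "cman_tool leave remove", and quorum gets recalculated downwards.

/* Hypothetical reason codes -- illustrative only, not cman's definitions. */
#define LEAVE_REASON_DOWN        0x1  /* clean shutdown                  */
#define LEAVE_REASON_REMOVED     0x2  /* "cman_tool leave remove"        */
#define LEAVE_REASON_NORESPONSE  0x3  /* abrupt death, e.g. sysrq crash  */

/* Buggy: bitwise AND against a value that is not a single-bit flag.
 * LEAVE_REASON_NORESPONSE (0x3) has the 0x2 bit set, so a crashed node
 * is treated as removed and quorum is lowered. */
static int lower_quorum_buggy(unsigned int leave_reason)
{
    return (leave_reason & LEAVE_REASON_REMOVED) != 0;
}

/* Fixed: the reason codes are enumerated values, so compare for equality
 * (or mask the field correctly before testing). */
static int lower_quorum_fixed(unsigned int leave_reason)
{
    return leave_reason == LEAVE_REASON_REMOVED;
}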
lon: do you have time to test that patch?
Ok, reproduced on 2.0.115-44.el5; now to try with patch.
[root@molly ~]# cman_tool status
Version: 6.2.0
Config Version: 2822
Cluster Name: lolcats
Cluster Id: 13719
Cluster Member: Yes
Cluster Generation: 1860
Membership state: Cluster-Member
Nodes: 2
Expected votes: 6
Quorum device votes: 1
Total votes: 5
Quorum: 4
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: molly
Node ID: 1
Multicast addresses: 225.0.0.13
Node addresses: 192.168.122.4

[root@molly ~]# cman_tool status
Version: 6.2.0
Config Version: 2822
Cluster Name: lolcats
Cluster Id: 13719
Cluster Member: Yes
Cluster Generation: 1864
Membership state: Cluster-Member
Nodes: 1
Expected votes: 6
Quorum device votes: 1
Total votes: 5
Quorum: 3
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: molly
Node ID: 1
Multicast addresses: 225.0.0.13
Node addresses: 192.168.122.4
The above did not happen when I ran with the patch applied:

[root@molly ~]# cman_tool status
Version: 6.2.0
Config Version: 2822
Cluster Name: lolcats
Cluster Id: 13719
Cluster Member: Yes
Cluster Generation: 1884
Membership state: Cluster-Member
Nodes: 1
Expected votes: 6
Quorum device votes: 1
Total votes: 5
Quorum: 4
Active subsystems: 8
Flags: Dirty
Ports Bound: 0
Node name: molly
Node ID: 1
Multicast addresses: 225.0.0.13
Node addresses: 192.168.122.4
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=efcafee5e61ee01748d9f1d2d971f72def2ce089
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0036.html