Bug 620679

Summary: qdiskd should stop voting if no <quorumd/> config is available
Product: Red Hat Enterprise Linux 6
Component: cluster
Version: 6.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: low
Reporter: Fabio Massimo Di Nitto <fdinitto>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: bbrock, ccaulfie, cluster-maint, jkortus, lhh, rpeterso, ssaha, teigland
Target Milestone: rc
Fixed In Version: cluster-3.0.12-27.el6
Doc Type: Bug Fix
Clone Of: 615926
Last Closed: 2011-05-19 12:53:27 UTC
Bug Depends On: 615926

Attachments: Fix

Comment 1 Fabio Massimo Di Nitto 2010-08-03 09:16:33 UTC
When testing bz 615926 I also tested the other direction:

2-node cluster with qdiskd running

remove qdiskd from the configuration

The qdiskd daemon is not killed in this case; the configuration change is dispatched, but qdiskd keeps happily voting:

<?xml version="1.0"?>
<cluster config_version="3" name="rhel6">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="rhel6-node1" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="rhel6-node2" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

[root@rhel6-node2 cluster]# ps ax|grep qdiskd
 2794 ?        SLsl   0:01 qdiskd -Q

[root@rhel6-node2 cluster]# cman_tool status
Version: 6.2.0
Config Version: 3
Cluster Name: rhel6
Cluster Id: 60348
Cluster Member: Yes
Cluster Generation: 8
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Quorum device votes: 1
Total votes: 3
Node votes: 1
Quorum: 1  
Active subsystems: 11
Flags: 2node 
Ports Bound: 0 11 177  
Node name: rhel6-node2
Node ID: 2
Multicast addresses: 239.192.235.168 
Node addresses: 192.168.2.66 

[root@rhel6-node2 cluster]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2010-08-03 11:09:36  /dev/block/252:17
   1   M      8   2010-08-03 11:07:25  rhel6-node1
   2   M      4   2010-08-03 11:07:25  rhel6-node2

The expected behavior is for qdiskd to go idle if it is not configured in cluster.conf.

Comment 4 RHEL Program Management 2010-08-03 09:47:49 UTC
This issue has been proposed at a time when we are only considering
blocker issues for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 5 Lon Hohberger 2010-08-03 14:02:03 UTC
This requires careful consideration.

The reconfiguration order is important.  Consider the case of a 4-node cluster plus qdiskd.

Expected votes is 7 (four node votes plus three quorum device votes).

If you remove qdiskd from cluster.conf, cman will recalculate expected votes down to 4.  Then, before qdiskd processes the config change, it calls cman_poll_quorum_device(), which bumps expected votes back up to 7.

Then qdiskd exits.
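
For illustration, the ordering hazard looks roughly like the following schematic main-loop fragment (not qdiskd's actual code; process_pending_config_change() is a hypothetical stand-in for qdiskd's config-change handling):

#include <libcman.h>

extern void process_pending_config_change(void);  /* hypothetical */

void quorum_loop(cman_handle_t ch)
{
	for (;;) {
		/* cman has already recalculated expected_votes down to 4
		 * after the cluster.conf change was dispatched... */

		/* ...but polling the quorum device re-asserts its votes,
		 * pushing expected_votes back up to 7. */
		cman_poll_quorum_device(ch, 1 /* device available */);

		/* Only after this does qdiskd process the config change and
		 * exit, leaving expected_votes stuck at 7. */
		process_pending_config_change();
	}
}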

For now, it's much safer to:

(a) ensure all nodes are in the cluster
(b) kill qdiskd with SIGTERM on all nodes
(c) remove qdiskd from cluster.conf

Comment 6 Fabio Massimo Di Nitto 2010-08-03 14:53:47 UTC
So I understand all the issues described above.

In this specific case, (a) and (c) are already true: qdiskd is already gone from cluster.conf, and all nodes are in the cluster and active.

We don't have a way to tell qdiskd to die.

Doesn't cman_poll_quorum_device() recalculate every time based on qdiskd's votes? If so, the votes from qdiskd would go down to 0 (no config, no votes ;)) and expected votes would be recalculated.

Comment 7 Lon Hohberger 2010-08-03 16:38:30 UTC
No, cman_poll_quorum_device does not recalculate; you have to tell cman to drop the votes.

I was already working on a patch.  What it does is:

 - if previously configured and device & label are no longer present:
   - print a log message
   - reregister with 0 votes (causes recalculate_quorum())
   - clean shutdown e.g.:
     - write logout message to quorum disk
     - cman_unregister_quorum_device()
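
Roughly, in libcman terms, the handling described above might look like the following sketch (illustrative only, not the actual patch; quorumd_still_configured() is a hypothetical helper standing in for qdiskd's re-read of the <quorumd/> configuration):

#include <stdio.h>
#include <stdlib.h>
#include <libcman.h>

extern int quorumd_still_configured(void);  /* hypothetical config re-check */

/* Called after a configuration change has been dispatched. */
static void maybe_shutdown(cman_handle_t ch, char *label)
{
	if (quorumd_still_configured())
		return;

	fprintf(stderr, "Quorum device removed from the configuration.  "
			"Shutting down.\n");

	/* Re-register with 0 votes; this makes cman run recalculate_quorum()
	 * so the quorum device votes are dropped from expected_votes. */
	cman_register_quorum_device(ch, label, 0);

	/* ... write a logout message to the quorum disk here ... */

	/* Then drop the registration entirely and exit cleanly. */
	cman_unregister_quorum_device(ch);
	exit(0);
}

The syslog excerpt below shows the behavior with this change in place: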

Aug  3 12:36:30 crackle modcluster: Updating cluster.conf
Aug  3 12:36:32 crackle corosync[1262]:   [QUORUM] Members[2]: 1 2
Aug  3 12:36:32 crackle corosync[1262]:   [CMAN  ] quorum device re-registered
Aug  3 12:36:32 crackle qdiskd[15384]: Quorum device removed from the configuration.  Shutting down.
Aug  3 12:36:43 crackle corosync[1262]:   [CMAN  ] lost contact with quorum device
Aug  3 12:36:43 crackle corosync[1262]:   [QUORUM] Members[2]: 1 2

Note, however, that because qdiskd was previously a member, it will still appear in both 'clustat' and 'cman_tool nodes' output.

[root@crackle ~]# cman_tool status
Version: 6.2.0
Config Version: 25
Cluster Name: cereal
Cluster Id: 27600
Cluster Member: Yes
Cluster Generation: 1248
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 7
Flags: 
Ports Bound: 0  
Node name: crackle
Node ID: 2
Multicast addresses: 239.192.107.60 
Node addresses: 192.168.122.21 

(I used a two-node cluster to illustrate that the fix works; if it didn't, expected votes would still be 3.)

Comment 8 Lon Hohberger 2010-08-03 16:39:13 UTC
Created attachment 436324 [details]
Fix

Patch not applied to any branches at this point.

Comment 9 Lon Hohberger 2010-08-03 16:57:25 UTC
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=e118d34dce64325a93c92833b1e074fbabb1a516

Updated patch posted to upstream STABLE3 branch.

Comment 10 Lon Hohberger 2010-08-03 16:58:52 UTC
Logs from updated patch:

Aug  3 12:47:56 snap modcluster: Updating cluster.conf
Aug  3 12:47:57 snap corosync[3446]:   [QUORUM] Members[2]: 1 2
Aug  3 12:47:57 snap corosync[3446]:   [CMAN  ] quorum device re-registered
Aug  3 12:47:57 snap corosync[3446]:   [QUORUM] Members[2]: 1 2
Aug  3 12:47:57 snap qdiskd[5751]: Quorum device removed from the configuration.  Shutting down.
Aug  3 12:47:57 snap qdiskd[5751]: Unregistering quorum device.
Aug  3 12:48:10 snap corosync[3446]:   [CMAN  ] lost contact with quorum device
Aug  3 12:48:10 snap corosync[3446]:   [QUORUM] Members[2]: 1 2

Comment 13 Fabio Massimo Di Nitto 2010-11-22 18:11:57 UTC
devel_ack; we already have the fix.

Comment 17 errata-xmlrpc 2011-05-19 12:53:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html