Bug 620679 - qdiskd should stop voting if no <quorumd config is available
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Depends On: 615926
Blocks:
Reported: 2010-08-03 05:14 EDT by Fabio Massimo Di Nitto
Modified: 2016-04-26 12:40 EDT (History)
8 users

See Also:
Fixed In Version: cluster-3.0.12-27.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 615926
Environment:
Last Closed: 2011-05-19 08:53:27 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Fix (1.99 KB, patch)
2010-08-03 12:39 EDT, Lon Hohberger


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0537 normal SHIPPED_LIVE cluster and gfs2-utils bug fix update 2011-05-18 13:57:40 EDT

Comment 1 Fabio Massimo Di Nitto 2010-08-03 05:16:33 EDT
When testing bz 615926 I also tested the other direction:

2-node cluster with qdiskd running

remove qdiskd from the configuration

The qdiskd daemon is not killed in this case; the configuration change is dispatched, but qdiskd keeps happily voting:

<?xml version="1.0"?>
<cluster config_version="3" name="rhel6">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="rhel6-node1" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="rhel6-node2" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

[root@rhel6-node2 cluster]# ps ax|grep qdiskd
 2794 ?        SLsl   0:01 qdiskd -Q

[root@rhel6-node2 cluster]# cman_tool status
Version: 6.2.0
Config Version: 3
Cluster Name: rhel6
Cluster Id: 60348
Cluster Member: Yes
Cluster Generation: 8
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Quorum device votes: 1
Total votes: 3
Node votes: 1
Quorum: 1  
Active subsystems: 11
Flags: 2node 
Ports Bound: 0 11 177  
Node name: rhel6-node2
Node ID: 2
Multicast addresses: 239.192.235.168 
Node addresses: 192.168.2.66 

[root@rhel6-node2 cluster]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2010-08-03 11:09:36  /dev/block/252:17
   1   M      8   2010-08-03 11:07:25  rhel6-node1
   2   M      4   2010-08-03 11:07:25  rhel6-node2

The expected behavior from qdiskd is to go idle (stop voting) if it is not configured in cluster.conf.
Comment 4 RHEL Product and Program Management 2010-08-03 05:47:49 EDT
This issue has been proposed at a time when we are only considering blocker
issues for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Comment 5 Lon Hohberger 2010-08-03 10:02:03 EDT
This requires careful consideration.

The reconfiguration order is important.  Consider the case of a 4-node cluster + qdiskd.

Expected votes is 7

If you remove qdiskd from cluster.conf, cman will recalculate expected votes to 4.  Then, before qdiskd processes the config change, it calls cman_poll_quorum_device(), which bumps expected votes back to 7.

Then qdiskd exits.
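A minimal sketch of that ordering problem, modeled as plain vote arithmetic (the variable names and structure here are illustrative, not the actual cman/qdiskd code):

```python
# Toy model of the expected-votes race described above (4 nodes + qdiskd).
# All names are illustrative; the real logic lives in cman and qdiskd (C).

NODE_VOTES = 4      # four nodes, one vote each
QDISK_VOTES = 3     # quorum device contributes 3 votes
expected = NODE_VOTES + QDISK_VOTES   # 7 while qdiskd is configured

# cman processes the config change first and recalculates:
expected = NODE_VOTES                 # down to 4

# ...but qdiskd has not yet processed the change and polls the device,
# which re-adds its votes before qdiskd finally exits:
expected += QDISK_VOTES               # bumped back to 7

print(expected)  # stale expected-votes count survives qdiskd's exit
```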

For now, it's much safer to:

(a) ensure all nodes are in the cluster
(b) kill qdiskd with SIGTERM on all nodes
(c) remove qdiskd from cluster.conf
Comment 6 Fabio Massimo Di Nitto 2010-08-03 10:53:47 EDT
So, I understand all the issues described above.

In this specific case, (a) and (c) are already true: qdiskd is already gone from cluster.conf, and all nodes are in the cluster and active.

We don't have a way to tell qdiskd to die.

Doesn't cman_poll_quorum_device() recalculate every time based on qdiskd's votes? If so, the votes from qdiskd would drop to 0 (no config, no votes ;)) and expected votes would be recalculated.
Comment 7 Lon Hohberger 2010-08-03 12:38:30 EDT
No, cman_poll_quorum_device does not recalculate; you have to tell cman to drop the votes.

I was already working on a patch.  What it does is:

 - if previously configured and device & label are no longer present:
   - print a log message
   - reregister with 0 votes (causes recalculate_quorum())
   - clean shutdown e.g.:
     - write logout message to quorum disk
     - cman_unregister_quorum_device()
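The decision the patch makes on reconfiguration can be sketched as follows. This is an illustrative Python model only; the real fix is C code in qdiskd, and the cman_* operations below are stand-in stubs, not the actual libcman calls:

```python
# Illustrative model of the fix's reconfiguration check.  The stubs below
# stand in for the real cman operations named in the steps above.

def reregister(votes):
    """Stub: re-register the quorum device with the given votes
    (re-registering with 0 votes triggers a quorum recalculation)."""
    return votes

def unregister():
    """Stub: stands in for cman_unregister_quorum_device()."""
    return "unregistered"

def on_config_change(was_configured, device_present, label_present):
    """Return the actions qdiskd takes when a config change is dispatched."""
    actions = []
    if was_configured and not (device_present or label_present):
        actions.append("log: Quorum device removed from the configuration")
        actions.append(("reregister", reregister(0)))   # 0 votes -> recalc
        actions.append("write logout message to quorum disk")
        actions.append(("unregister", unregister()))    # clean shutdown
    return actions

acts = on_config_change(True, False, False)
```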

Aug  3 12:36:30 crackle modcluster: Updating cluster.conf
Aug  3 12:36:32 crackle corosync[1262]:   [QUORUM] Members[2]: 1 2
Aug  3 12:36:32 crackle corosync[1262]:   [CMAN  ] quorum device re-registered
Aug  3 12:36:32 crackle qdiskd[15384]: Quorum device removed from the configuration.  Shutting down.
Aug  3 12:36:43 crackle corosync[1262]:   [CMAN  ] lost contact with quorum device
Aug  3 12:36:43 crackle corosync[1262]:   [QUORUM] Members[2]: 1 2

Note, however, that because qdiskd was previously a member, it will still appear in both 'clustat' and 'cman_tool nodes' output.

[root@crackle ~]# cman_tool status
Version: 6.2.0
Config Version: 25
Cluster Name: cereal
Cluster Id: 27600
Cluster Member: Yes
Cluster Generation: 1248
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 7
Flags: 
Ports Bound: 0  
Node name: crackle
Node ID: 2
Multicast addresses: 239.192.107.60 
Node addresses: 192.168.122.21 

(I used a two-node cluster to illustrate that the fix works; if it didn't, expected votes would still be 3.)
Comment 8 Lon Hohberger 2010-08-03 12:39:13 EDT
Created attachment 436324 [details]
Fix

Patch not applied to any branches at this point.
Comment 9 Lon Hohberger 2010-08-03 12:57:25 EDT
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=e118d34dce64325a93c92833b1e074fbabb1a516

Updated patch posted to upstream STABLE3 branch.
Comment 10 Lon Hohberger 2010-08-03 12:58:52 EDT
Logs from updated patch:

Aug  3 12:47:56 snap modcluster: Updating cluster.conf
Aug  3 12:47:57 snap corosync[3446]:   [QUORUM] Members[2]: 1 2
Aug  3 12:47:57 snap corosync[3446]:   [CMAN  ] quorum device re-registered
Aug  3 12:47:57 snap corosync[3446]:   [QUORUM] Members[2]: 1 2
Aug  3 12:47:57 snap qdiskd[5751]: Quorum device removed from the configuration.  Shutting down.
Aug  3 12:47:57 snap qdiskd[5751]: Unregistering quorum device.
Aug  3 12:48:10 snap corosync[3446]:   [CMAN  ] lost contact with quorum device
Aug  3 12:48:10 snap corosync[3446]:   [QUORUM] Members[2]: 1 2
Comment 13 Fabio Massimo Di Nitto 2010-11-22 13:11:57 EST
devel_ack, we already have the fix
Comment 17 errata-xmlrpc 2011-05-19 08:53:27 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html
