Bug 620679 - qdiskd should stop voting if no <quorumd config is available
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Depends On: 615926
Blocks:
Reported: 2010-08-03 05:14 EDT by Fabio Massimo Di Nitto
Modified: 2016-04-26 12:40 EDT (History)
8 users

See Also:
Fixed In Version: cluster-3.0.12-27.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 615926
Environment:
Last Closed: 2011-05-19 08:53:27 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Fix (1.99 KB, patch)
2010-08-03 12:39 EDT, Lon Hohberger


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0537 normal SHIPPED_LIVE cluster and gfs2-utils bug fix update 2011-05-18 13:57:40 EDT

Comment 1 Fabio Massimo Di Nitto 2010-08-03 05:16:33 EDT
When testing bz 615926 I also tested the other direction:

2-node cluster with qdiskd running

remove qdiskd from the configuration

The qdiskd daemon is not killed in this case; the configuration change is dispatched, but qdiskd keeps happily voting:

<?xml version="1.0"?>
<cluster config_version="3" name="rhel6">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="rhel6-node1" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="rhel6-node2" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

[root@rhel6-node2 cluster]# ps ax|grep qdiskd
 2794 ?        SLsl   0:01 qdiskd -Q

[root@rhel6-node2 cluster]# cman_tool status
Version: 6.2.0
Config Version: 3
Cluster Name: rhel6
Cluster Id: 60348
Cluster Member: Yes
Cluster Generation: 8
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Quorum device votes: 1
Total votes: 3
Node votes: 1
Quorum: 1  
Active subsystems: 11
Flags: 2node 
Ports Bound: 0 11 177  
Node name: rhel6-node2
Node ID: 2
Multicast addresses: 239.192.235.168 
Node addresses: 192.168.2.66 

[root@rhel6-node2 cluster]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2010-08-03 11:09:36  /dev/block/252:17
   1   M      8   2010-08-03 11:07:25  rhel6-node1
   2   M      4   2010-08-03 11:07:25  rhel6-node2

The expected behavior from qdiskd is to go idle (stop voting) if it is not configured in cluster.conf.
Comment 4 RHEL Product and Program Management 2010-08-03 05:47:49 EDT
This issue has been proposed at a time when we are only considering blocker
issues for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **
Comment 5 Lon Hohberger 2010-08-03 10:02:03 EDT
This requires careful consideration.

The reconfiguration order is important.  Consider the case of a 4-node cluster + qdiskd.

Expected votes is 7

If you remove qdiskd from cluster.conf, cman will recalculate expected votes to 4.  Then, before qdiskd processes the config change, it calls cman_poll_quorum_device(), which bumps expected votes back to 7.

Then qdiskd exits.
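A minimal sketch of that ordering problem, modeled as plain vote arithmetic (the variable names and structure here are illustrative, not the actual cman/qdiskd code):

```python
# Toy model of the expected-votes race described above (4 nodes + qdiskd).
# All names are illustrative; the real logic lives in cman and qdiskd (C).

NODE_VOTES = 4      # four nodes, one vote each
QDISK_VOTES = 3     # quorum device contributes 3 votes
expected = NODE_VOTES + QDISK_VOTES   # 7 while qdiskd is configured

# cman processes the config change first and recalculates:
expected = NODE_VOTES                 # down to 4

# ...but qdiskd has not yet processed the change and polls the device,
# which re-adds its votes before qdiskd finally exits:
expected += QDISK_VOTES               # bumped back to 7

print(expected)  # stale expected-votes count survives qdiskd's exit
```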

For now, it's much safer to:

(a) ensure all nodes are in the cluster
(b) kill qdiskd with SIGTERM on all nodes
(c) remove qdiskd from cluster.conf
Comment 6 Fabio Massimo Di Nitto 2010-08-03 10:53:47 EDT
So, I understand all the issues described above.

In this specific case, (a) and (c) are already true: qdiskd is already gone from cluster.conf, and all nodes are in the cluster and active.

We don't have a way to tell qdiskd to die.

Doesn't cman_poll_quorum_device() recalculate every time based on qdiskd's votes? If so, the votes from qdiskd would drop to 0 (no config, no votes ;)) and expected votes would be recalculated.
Comment 7 Lon Hohberger 2010-08-03 12:38:30 EDT
No, cman_poll_quorum_device does not recalculate; you have to tell cman to drop the votes.

I was already working on a patch.  What it does is:

 - if previously configured and device & label are no longer present:
   - print a log message
   - reregister with 0 votes (causes recalculate_quorum())
   - clean shutdown e.g.:
     - write logout message to quorum disk
     - cman_unregister_quorum_device()
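The decision the patch makes on reconfiguration can be sketched as follows. This is an illustrative Python model only; the real fix is C code in qdiskd, and the cman_* operations below are stand-in stubs, not the actual libcman calls:

```python
# Illustrative model of the fix's reconfiguration check.  The stubs below
# stand in for the real cman operations named in the steps above.

def reregister(votes):
    """Stub: re-register the quorum device with the given votes
    (re-registering with 0 votes triggers a quorum recalculation)."""
    return votes

def unregister():
    """Stub: stands in for cman_unregister_quorum_device()."""
    return "unregistered"

def on_config_change(was_configured, device_present, label_present):
    """Return the actions qdiskd takes when a config change is dispatched."""
    actions = []
    if was_configured and not (device_present or label_present):
        actions.append("log: Quorum device removed from the configuration")
        actions.append(("reregister", reregister(0)))   # 0 votes -> recalc
        actions.append("write logout message to quorum disk")
        actions.append(("unregister", unregister()))    # clean shutdown
    return actions

acts = on_config_change(True, False, False)
```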

Aug  3 12:36:30 crackle modcluster: Updating cluster.conf
Aug  3 12:36:32 crackle corosync[1262]:   [QUORUM] Members[2]: 1 2
Aug  3 12:36:32 crackle corosync[1262]:   [CMAN  ] quorum device re-registered
Aug  3 12:36:32 crackle qdiskd[15384]: Quorum device removed from the configuration.  Shutting down.
Aug  3 12:36:43 crackle corosync[1262]:   [CMAN  ] lost contact with quorum device
Aug  3 12:36:43 crackle corosync[1262]:   [QUORUM] Members[2]: 1 2

Note, however, that because qdiskd was previously a member, it will still appear in both 'clustat' and 'cman_tool nodes' output.

[root@crackle ~]# cman_tool status
Version: 6.2.0
Config Version: 25
Cluster Name: cereal
Cluster Id: 27600
Cluster Member: Yes
Cluster Generation: 1248
Membership state: Cluster-Member
Nodes: 2
Expected votes: 2
Total votes: 2
Node votes: 1
Quorum: 2  
Active subsystems: 7
Flags: 
Ports Bound: 0  
Node name: crackle
Node ID: 2
Multicast addresses: 239.192.107.60 
Node addresses: 192.168.122.21 

(I used a two-node cluster to illustrate that the fix works; if it didn't, expected votes would still be 3.)
Comment 8 Lon Hohberger 2010-08-03 12:39:13 EDT
Created attachment 436324 [details]
Fix

Patch not applied to any branches at this point.
Comment 9 Lon Hohberger 2010-08-03 12:57:25 EDT
http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=e118d34dce64325a93c92833b1e074fbabb1a516

Updated patch posted to upstream STABLE3 branch.
Comment 10 Lon Hohberger 2010-08-03 12:58:52 EDT
Logs from updated patch:

Aug  3 12:47:56 snap modcluster: Updating cluster.conf
Aug  3 12:47:57 snap corosync[3446]:   [QUORUM] Members[2]: 1 2
Aug  3 12:47:57 snap corosync[3446]:   [CMAN  ] quorum device re-registered
Aug  3 12:47:57 snap corosync[3446]:   [QUORUM] Members[2]: 1 2
Aug  3 12:47:57 snap qdiskd[5751]: Quorum device removed from the configuration.  Shutting down.
Aug  3 12:47:57 snap qdiskd[5751]: Unregistering quorum device.
Aug  3 12:48:10 snap corosync[3446]:   [CMAN  ] lost contact with quorum device
Aug  3 12:48:10 snap corosync[3446]:   [QUORUM] Members[2]: 1 2
Comment 13 Fabio Massimo Di Nitto 2010-11-22 13:11:57 EST
devel_ack, we already have the fix
Comment 17 errata-xmlrpc 2011-05-19 08:53:27 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0537.html
