Bug 766586

Summary: corosync cfg stops working after one membership change (master)
Product: [Retired] Corosync Cluster Engine
Component: unknown
Version: 1.4
Status: CLOSED UPSTREAM
Severity: high
Priority: high
Hardware: Unspecified
OS: Unspecified
Reporter: Fabio Massimo Di Nitto <fdinitto>
Assignee: Angus Salkeld <asalkeld>
CC: asalkeld, jfriesse, sdake
Doc Type: Bug Fix
Last Closed: 2012-01-11 07:36:35 UTC

Description Fabio Massimo Di Nitto 2011-12-12 11:29:19 UTC
This is a rather easy one to reproduce.

Setup: 2 nodes running pure corosync (no cman or anything else).

corosync.conf has the usual interface section with debugging turned on, plus:

quorum {
    provider: corosync_quorum_ykd
    expected_votes: 2
    votes: 1
    quorumdev_poll: 0
    leaving_timeout: 2
    disallowed: 0
    quorate: 1
    two_node: 1
}

Take corosync-quorumtool from topic-quorum-fabbione (commit eceaf9ac0695e72d3115e7f844aa59d33b3f9129).

Run it against corosync flatiron-1.4 (expected and working behaviour):

[root@fedora-master-node1 tools]# ./corosync-quorumtool -m
Version:          1.8.0pre.331-1304-dirty
Nodes:            2
Ring ID:          240
Quorum type:      corosync_quorum_ykd
Quorate:          Yes
starting monitoring loop

date: Mon Dec 12 12:15:25 2011
Nodes:            2
Ring ID:          240
Quorate:          Yes
Nodeid  Name
3238176960      fedora-master-node1.int.fabbione.net
3254954176      fedora-master-node2.int.fabbione.net

date: Mon Dec 12 12:15:28 2011
Nodes:            1
Ring ID:          244
Quorate:          No
Nodeid  Name
3238176960      fedora-master-node1.int.fabbione.net

date: Mon Dec 12 12:15:28 2011
Nodes:            1
Ring ID:          244
Quorate:          Yes
Nodeid  Name
3238176960      fedora-master-node1.int.fabbione.net

date: Mon Dec 12 12:15:36 2011
Nodes:            2
Ring ID:          248
Quorate:          No
Nodeid  Name
3238176960      fedora-master-node1.int.fabbione.net
3254954176      fedora-master-node2.int.fabbione.net

^^^^ The node names are resolved by the node_name() function in corosync-quorumtool.c, which calls

err = corosync_cfg_get_node_addrs(c_handle, nodeid, INTERFACE_MAX, &numaddrs, addrs);

on essentially every membership change.

Running the same tool against master branch or topic-quorum-fabbione:

-----------------------

[root@fedora-master-node1 tools]# ./corosync-quorumtool -m
Version:          1.8.0pre.331-1304-dirty
Nodes:            2
Ring ID:          256
Quorum type:      corosync_quorum_ykd
Quorate:          Yes
starting monitoring loop

date: Mon Dec 12 12:17:40 2011
Nodes:            2
Ring ID:          256
Quorate:          Yes
Nodeid  Name
3238176960      fedora-master-node1.int.fabbione.net
3254954176      fedora-master-node2.int.fabbione.net

date: Mon Dec 12 12:17:41 2011
Nodes:            1
Ring ID:          260
Quorate:          No
Nodeid  Name
Unable to get node address for nodeid 3238176960: 6
3238176960

date: Mon Dec 12 12:17:41 2011
Nodes:            1
Ring ID:          260
Quorate:          Yes
Nodeid  Name
Unable to get node address for nodeid 3238176960: 6
3238176960

It appears that the same call is sending back a TRYAGAIN (error code 6), which doesn't look correct to me at all...
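
For readers decoding the trailing `6` in the "Unable to get node address" lines: it is a numeric `cs_error_t` value. The sketch below mirrors the first few values locally so it compiles without the corosync headers; the numbers are assumed to match <corosync/corotypes.h> and should be checked against the installed header. `cs_err_name` and the `DEMO_` enum are hypothetical names used only for illustration.

```c
#include <stddef.h>

/*
 * Local mirror of a few cs_error_t values. Assumption: these match
 * <corosync/corotypes.h>; verify against the installed header.
 */
enum demo_cs_error {
    DEMO_CS_OK            = 1,
    DEMO_CS_ERR_LIBRARY   = 2,
    DEMO_CS_ERR_VERSION   = 3,
    DEMO_CS_ERR_INIT      = 4,
    DEMO_CS_ERR_TIMEOUT   = 5,
    DEMO_CS_ERR_TRY_AGAIN = 6
};

/* Hypothetical helper: map a numeric error code to a readable name. */
static const char *cs_err_name(int err)
{
    switch (err) {
    case DEMO_CS_OK:            return "CS_OK";
    case DEMO_CS_ERR_LIBRARY:   return "CS_ERR_LIBRARY";
    case DEMO_CS_ERR_VERSION:   return "CS_ERR_VERSION";
    case DEMO_CS_ERR_INIT:      return "CS_ERR_INIT";
    case DEMO_CS_ERR_TIMEOUT:   return "CS_ERR_TIMEOUT";
    case DEMO_CS_ERR_TRY_AGAIN: return "CS_ERR_TRY_AGAIN";
    default:                    return "UNKNOWN";
    }
}
```

Under that assumption, `cs_err_name(6)` gives `CS_ERR_TRY_AGAIN`, which is the TRYAGAIN referred to above.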

Comment 1 Angus Salkeld 2012-01-11 06:15:14 UTC
Fabio, this should be fixed now.