Bug 766586 - corosync cfg stops working after one membership change (master)
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Corosync Cluster Engine
Classification: Retired
Component: unknown
Version: 1.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Angus Salkeld
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-12-12 11:29 UTC by Fabio Massimo Di Nitto
Modified: 2012-01-11 07:36 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-01-11 07:36:35 UTC



Description Fabio Massimo Di Nitto 2011-12-12 11:29:19 UTC
This is a rather easy one to reproduce.

Setup: 2 nodes running pure corosync (no cman or anything else).

corosync.conf has the usual interface settings with debugging turned on, plus:

quorum {
    provider: corosync_quorum_ykd
    expected_votes: 2
    votes: 1
    quorumdev_poll: 0
    leaving_timeout: 2
    disallowed: 0
    quorate: 1
    two_node: 1
}

Take corosync-quorumtool from topic-quorum-fabbione (commit eceaf9ac0695e72d3115e7f844aa59d33b3f9129).

Run it against corosync flatiron-1.4 (expected and working behaviour):

[root@fedora-master-node1 tools]# ./corosync-quorumtool -m
Version:          1.8.0pre.331-1304-dirty
Nodes:            2
Ring ID:          240
Quorum type:      corosync_quorum_ykd
Quorate:          Yes
starting monitoring loop

date: Mon Dec 12 12:15:25 2011
Nodes:            2
Ring ID:          240
Quorate:          Yes
Nodeid  Name
3238176960      fedora-master-node1.int.fabbione.net
3254954176      fedora-master-node2.int.fabbione.net

date: Mon Dec 12 12:15:28 2011
Nodes:            1
Ring ID:          244
Quorate:          No
Nodeid  Name
3238176960      fedora-master-node1.int.fabbione.net

date: Mon Dec 12 12:15:28 2011
Nodes:            1
Ring ID:          244
Quorate:          Yes
Nodeid  Name
3238176960      fedora-master-node1.int.fabbione.net

date: Mon Dec 12 12:15:36 2011
Nodes:            2
Ring ID:          248
Quorate:          No
Nodeid  Name
3238176960      fedora-master-node1.int.fabbione.net
3254954176      fedora-master-node2.int.fabbione.net

^^^^ the node names are resolved by the node_name function in corosync-quorumtool.c, which calls

err = corosync_cfg_get_node_addrs(c_handle, nodeid, INTERFACE_MAX, &numaddrs, addrs);

on each membership change.
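
For readers unfamiliar with this code path, here is a minimal sketch of the fallback behaviour visible in the output further down: when no address is available for a node, only the numeric nodeid is printed. The helper format_node_name and its parameters are hypothetical and not taken from corosync-quorumtool.c. The sketch assumes the caller passes the first address obtained from the cfg lookup, or NULL when that lookup failed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <inttypes.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (illustration only).
 * sa/salen: first address known for the node, or NULL/0 if the
 *           address lookup failed.
 * ni_flags: passed through to getnameinfo(); 0 requests a host name,
 *           NI_NUMERICHOST requests the numeric address.
 * Writes the display name into buf and returns buf.
 */
static const char *format_node_name(uint32_t nodeid,
                                    const struct sockaddr *sa,
                                    socklen_t salen, int ni_flags,
                                    char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    /* Prefer a resolved name when an address is available. */
    if (sa != NULL &&
        getnameinfo(sa, salen, buf, (socklen_t)buflen,
                    NULL, 0, ni_flags) == 0) {
        return buf;
    }

    /* Otherwise fall back to the bare numeric nodeid. */
    snprintf(buf, buflen, "%" PRIu32, nodeid);
    return buf;
}
```

Called with a NULL address, the helper yields only the nodeid string, which corresponds to the bare "3238176960" lines in the failing run.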

Running the same tool against master branch or topic-quorum-fabbione:

-----------------------

[root@fedora-master-node1 tools]# ./corosync-quorumtool -m
Version:          1.8.0pre.331-1304-dirty
Nodes:            2
Ring ID:          256
Quorum type:      corosync_quorum_ykd
Quorate:          Yes
starting monitoring loop

date: Mon Dec 12 12:17:40 2011
Nodes:            2
Ring ID:          256
Quorate:          Yes
Nodeid  Name
3238176960      fedora-master-node1.int.fabbione.net
3254954176      fedora-master-node2.int.fabbione.net

date: Mon Dec 12 12:17:41 2011
Nodes:            1
Ring ID:          260
Quorate:          No
Nodeid  Name
Unable to get node address for nodeid 3238176960: 6
3238176960

date: Mon Dec 12 12:17:41 2011
Nodes:            1
Ring ID:          260
Quorate:          Yes
Nodeid  Name
Unable to get node address for nodeid 3238176960: 6
3238176960

It appears that the same call is now returning error 6 (TRYAGAIN), which doesn't look correct to me at all...
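
One way to tell a transient TRYAGAIN from the persistent one seen here is a bounded retry around the lookup. The following is only a sketch under stated assumptions: retry_lookup, lookup_fn and the two stub functions are hypothetical, and the numeric codes are local constants assumed to mirror CS_OK (1) and CS_ERR_TRY_AGAIN (6) from corosync's headers so the example builds without them. It would not fix the bug; it only makes a permanent failure distinguishable from a short-lived one.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins, assumed to match CS_OK and CS_ERR_TRY_AGAIN. */
enum { SKETCH_OK = 1, SKETCH_ERR_TRY_AGAIN = 6 };

/* Hypothetical callback wrapping one lookup attempt. */
typedef int (*lookup_fn)(void *ctx);

/*
 * Calls fn up to max_attempts times, stopping as soon as the result
 * is anything other than TRY_AGAIN. Returns the last result and
 * stores the number of calls made in *attempts_used (if non-NULL).
 */
static int retry_lookup(lookup_fn fn, void *ctx, int max_attempts,
                        int *attempts_used)
{
    int err = SKETCH_ERR_TRY_AGAIN;
    int n = 0;

    assert(fn != NULL && max_attempts > 0);

    while (n < max_attempts) {
        err = fn(ctx);
        n++;
        if (err != SKETCH_ERR_TRY_AGAIN)
            break;
        /* A real caller would pause briefly here before retrying. */
    }

    if (attempts_used != NULL)
        *attempts_used = n;
    return err;
}

/* Stub: models the behaviour reported in this bug. */
static int always_try_again(void *ctx)
{
    (void)ctx;
    return SKETCH_ERR_TRY_AGAIN;
}

/* Stub: models a transient condition that clears on the third call. */
static int ok_on_third(void *ctx)
{
    int *calls = ctx;
    return ++*calls >= 3 ? SKETCH_OK : SKETCH_ERR_TRY_AGAIN;
}
```

With always_try_again the wrapper exhausts every attempt and still returns the TRY_AGAIN code, which is the pattern this report describes; with ok_on_third it returns success after three calls.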

Comment 1 Angus Salkeld 2012-01-11 06:15:14 UTC
Fabio, this should be fixed now.

