Bug 1374218

Summary: qdevice lms algorithm gets stuck in a loop on network failure causing cluster split
Product: Red Hat Enterprise Linux 7
Reporter: Roman Bednář <rbednar>
Component: corosync
Assignee: Jan Friesse <jfriesse>
Status: CLOSED NOTABUG
QA Contact: cluster-qe <cluster-qe>
Severity: unspecified
Priority: unspecified
Version: 7.3
CC: ccaulfie, cluster-maint, jfriesse, jkortus, rbednar
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: If docs needed, set a value
Last Closed: 2016-09-14 11:59:43 UTC
Type: Bug
Attachments:
	logs (flags: none)
	logs2 (flags: none)
	new_log (flags: none)

Description Roman Bednář 2016-09-08 09:30:31 UTC
Created attachment 1198957 [details]
logs

It seems that the qdevice lms algorithm does not handle network failures correctly.
In this case it appears to get stuck in a loop while a network failure is simulated
on two nodes at once. The symptoms are similar when trying different combinations
of failures, such as cutting off only one node, or doing the same while the
connection to qnetd is NOT interrupted. In those cases I was able to observe that
the 'Wait for reply' vote from qnetd was applied to one extra node that was left
untouched during tests on the other nodes, although that might be a result of the
same bug.

Reproducer:
1) have 4 nodes, with qnetd running on a separate 5th node, using the lms algorithm

2) simulate a network failure on two nodes at once using iptables
   (IPv6 was disabled on all nodes in advance):

   NODE3 and NODE4:
	# iptables -L -n
	Chain INPUT (policy ACCEPT)
	target     prot opt source               destination         
	DROP       all  --  <IP1_NODE1>          0.0.0.0/0           
	DROP       all  --  <IP2_NODE1>          0.0.0.0/0           
	DROP       all  --  <IP1_NODE2>          0.0.0.0/0           
	DROP       all  --  <IP2_NODE2>          0.0.0.0/0           
	DROP       all  --  <IP1_QNETD>          0.0.0.0/0           
	DROP       all  --  <IP2_QNETD>          0.0.0.0/0           

	Chain FORWARD (policy ACCEPT)
	target     prot opt source               destination         

	Chain OUTPUT (policy ACCEPT)
	target     prot opt source               destination         
	DROP       all  --  <IP1_NODE1>          0.0.0.0/0           
	DROP       all  --  <IP2_NODE1>          0.0.0.0/0           
	DROP       all  --  <IP1_NODE2>          0.0.0.0/0           
	DROP       all  --  <IP2_NODE2>          0.0.0.0/0           
	DROP       all  --  <IP1_QNETD>          0.0.0.0/0           
	DROP       all  --  <IP2_QNETD>          0.0.0.0/0         
	
   Doing this we block all communication from nodes 3 and 4 to nodes 1 and 2 and the qnetd server.
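   For reference, a ruleset like the one listed above would typically be built
   with source-based rules of this form (placeholder addresses, same naming as
   in the listing):

	for ip in <IP1_NODE1> <IP2_NODE1> <IP1_NODE2> <IP2_NODE2> <IP1_QNETD> <IP2_QNETD>; do
		iptables -A INPUT  -s "$ip" -j DROP   # drop inbound traffic coming from the peer
		iptables -A OUTPUT -s "$ip" -j DROP   # drop outbound traffic whose source is the peer
	done

   Note that in the OUTPUT chain these rules match on the source address rather
   than the destination.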

3) observe the loop of qdevice status:


# pcs qdevice status net
QNetd address:			*:5403
TLS:				Supported (client certificate required)
Connected clients:		2
Connected clusters:		1
Cluster "STSRHTS16137":
    Algorithm:		LMS
    Tie-breaker:	Node with lowest node ID
    Node ID 1:
        Client address:		::ffff:10.34.70.161:33042
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			ACK (ACK)
    Node ID 2:
        Client address:		::ffff:10.34.70.162:54548
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			ACK (ACK)

# pcs qdevice status net
QNetd address:			*:5403
TLS:				Supported (client certificate required)
Connected clients:		2
Connected clusters:		1
Cluster "STSRHTS16137":
    Algorithm:		LMS
    Tie-breaker:	Node with lowest node ID
    Node ID 1:
        Client address:		::ffff:10.34.70.161:33042
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			Wait for reply (ACK)
    Node ID 2:
        Client address:		::ffff:10.34.70.162:54548
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			ACK (ACK)

# pcs qdevice status net
QNetd address:			*:5403
TLS:				Supported (client certificate required)
Connected clients:		2
Connected clusters:		1
Cluster "STSRHTS16137":
    Algorithm:		LMS
    Tie-breaker:	Node with lowest node ID
    Node ID 1:
        Client address:		::ffff:10.34.70.161:33042
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			ACK (ACK)
    Node ID 2:
        Client address:		::ffff:10.34.70.162:54548
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			ACK (ACK)
==============================================
Logs and corosync.conf are in attachments.
Debugging set in /etc/sysconfig/corosync-qnetd:
COROSYNC_QNETD_OPTIONS="-dd"

and corosync.conf:
logging {
...
        logger_subsys {
                subsys: QDEVICE
                debug: on
        }
...
}
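
For context: the quorum device itself is configured in the quorum section of
corosync.conf (the full file is in the attachments); for this kind of setup the
section typically looks roughly like the following (host name and vote count are
placeholders, not copied from the attachment):

quorum {
        provider: corosync_votequorum
        device {
                model: net
                votes: 3                  # LMS typically provides N-1 votes (3 for a 4-node cluster)
                net {
                        host: <qnetd-host>
                        algorithm: lms
                }
        }
}

With pcs, such a setup is normally created with something along the lines of
"pcs quorum device add model net host=<qnetd-host> algorithm=lms".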


NOTES: 
while doing the above it seems that on the 'surviving' partition,
corosync commands (and pcs commands that use them) are blocked. E.g.:
# corosync-quorumtool
(does nothing at all)

# pcs status
Cluster name: STSRHTS16137
(stops here)

Also tested the following (might have a common cause):
1) reopen communication on node 4 and reboot it; the qnetd vote is still NACK
2) perform the failure on only one node out of 4 - this gives similar results,
   usually ending up with 'Wait for reply' from qnetd for one of the nodes
   that DID NOT get disconnected
3) also tried the same as above but without blocking the qnetd server connection;
   it again ends up with 'Wait for reply' on two nodes instead of one




=================================
corosynclib-2.4.0-4.el7.x86_64
corosync-2.4.0-4.el7.x86_64
corosync-qdevice-2.4.0-4.el7.x86_64
corosync-qnetd-2.4.0-4.el7.x86_64
pcs-0.9.152-8.el7.x86_64

Comment 2 Christine Caulfield 2016-09-08 13:42:36 UTC
I can't reproduce this on my VMs. Can you attach less abbreviated logs please?

Comment 3 Roman Bednář 2016-09-08 14:53:06 UTC
Created attachment 1199138 [details]
logs2

Comment 4 Roman Bednář 2016-09-08 14:59:06 UTC
Sure. Added the logs here (they were zeroed right before "disconnecting" the node).

Comment 5 Christine Caulfield 2016-09-08 15:00:32 UTC
Thanks. I'm off tomorrow but I'll look at them on Monday

Comment 6 Jan Friesse 2016-09-08 17:06:41 UTC
Does the same happen with ffsplit as well? This question applies in general to any bug filed in the future (where it makes sense for ffsplit).

Comment 7 Roman Bednář 2016-09-12 10:04:55 UTC
With ffsplit it does not end up in a loop, but otherwise it seems similar. Corosync/pcs commands are blocked on the surviving partition too, and qdevice status shows this:

# pcs qdevice status net
QNetd address:			*:5403
TLS:				Supported (client certificate required)
Connected clients:		2
Connected clusters:		1
Cluster "STSRHTS114":
    Algorithm:		Fifty-Fifty split
    Tie-breaker:	Node with lowest node ID
    Node ID 2:
        Client address:		::ffff:10.34.70.170:49356
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			ACK (ACK)
    Node ID 1:
        Client address:		::ffff:10.34.70.149:50492
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			No change (ACK)

And cluster messages log:
corosync[2437]:  [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]:  [TOTEM ] A new membership (10.34.70.149:660) was formed. Members
corosync[2437]:  [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]:  [TOTEM ] A new membership (10.34.70.149:664) was formed. Members
corosync[2437]:  [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]:  [TOTEM ] A new membership (10.34.70.149:668) was formed. Members
corosync[2437]:  [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]:  [TOTEM ] A new membership (10.34.70.149:672) was formed. Members
corosync[2437]:  [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]:  [TOTEM ] A new membership (10.34.70.149:676) was formed. Members
corosync[2437]:  [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]:  [TOTEM ] A new membership (10.34.70.149:680) was formed. Members
corosync[2437]:  [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)


Additionally I am observing another problem: high corosync CPU load during these tests (spikes up to 95%). This happens with both lms and ffsplit.

Comment 8 Jan Friesse 2016-09-12 12:40:27 UTC
@rbednar:
I'm lost. What do you mean by loop? What do you mean by blocked? During sync, and before qdevice gets an ACK/NACK, votequorum blocks other connections for 30 seconds (with the default configuration); that is normal and expected behavior.
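For reference, that window corresponds to the quorum device timeouts in
corosync.conf; assuming stock defaults, the relevant knobs look like this:

quorum {
        device {
                timeout: 10000        # regular poll timeout, in milliseconds
                sync_timeout: 30000   # timeout used during sync, the 30 sec mentioned above
        }
}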

Comment 9 Jan Friesse 2016-09-12 12:41:52 UTC
Also, are you sure you really block NODE3 and NODE4 on nodes 1 and 2? Because if not, you create a byzantine failure.

Comment 10 Jan Friesse 2016-09-12 12:46:46 UTC
Ok I see it.


	Chain OUTPUT (policy ACCEPT)
	target     prot opt source               destination         
	DROP       all  --  <IP1_NODE1>          0.0.0.0/0           
	DROP       all  --  <IP2_NODE1>          0.0.0.0/0           
	DROP       all  --  <IP1_NODE2>          0.0.0.0/0           
	DROP       all  --  <IP2_NODE2>          0.0.0.0/0           
	DROP       all  --  <IP1_QNETD>          0.0.0.0/0           
	DROP       all  --  <IP2_QNETD>          0.0.0.0/0     

is wrong. It blocks output whose source is IP_NODE1. You should block source 0.0.0.0/0 and destination IP_NODE1.
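
A destination-based variant of the blocking, matching the naming used above,
would look roughly like this (placeholder addresses):

	for ip in <IP1_NODE1> <IP2_NODE1> <IP1_NODE2> <IP2_NODE2> <IP1_QNETD> <IP2_QNETD>; do
		iptables -A INPUT  -s "$ip" -j DROP   # inbound: match on source, as before
		iptables -A OUTPUT -d "$ip" -j DROP   # outbound: match on destination
	done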

Please fix your test and try it again.

Comment 11 Roman Bednář 2016-09-13 15:57:58 UTC
Sorry, my bad, that was not intended. I tried again with corrected iptables rules:

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
DROP       all  --  10.34.70.138         0.0.0.0/0           
DROP       all  --  192.168.0.12         0.0.0.0/0           
DROP       all  --  10.34.70.139         0.0.0.0/0           
DROP       all  --  192.168.0.13         0.0.0.0/0           
DROP       all  --  10.34.70.137         0.0.0.0/0           
DROP       all  --  192.168.0.11         0.0.0.0/0           

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
DROP       all  --  0.0.0.0/0            10.34.70.138        
DROP       all  --  0.0.0.0/0            192.168.0.12        
DROP       all  --  0.0.0.0/0            10.34.70.139        
DROP       all  --  0.0.0.0/0            192.168.0.13        
DROP       all  --  0.0.0.0/0            10.34.70.137        
DROP       all  --  0.0.0.0/0            192.168.0.11   

Now it seems OK when done on two nodes (4-node cluster). But when attempted on only one node I am still observing similar issues such as high CPU load and unstable qdevice votes, as shown below:

# pcs qdevice status net
QNetd address:			*:5403
TLS:				Supported (client certificate required)
Connected clients:		3
Connected clusters:		1
Cluster "STSRHTS11405":
    Algorithm:		LMS
    Tie-breaker:	Node with lowest node ID
    Node ID 4:
        Client address:		::ffff:10.34.70.155:38968
        Configured node list:	1, 2, 3, 4
        Membership node list:	3, 4
        Vote:			NACK (NACK)
    Node ID 1:
        Client address:		::ffff:10.34.70.138:57178
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			Wait for reply (ACK)
    Node ID 2:
        Client address:		::ffff:10.34.70.139:35488
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			ACK (ACK)

# pcs qdevice status net
QNetd address:			*:5403
TLS:				Supported (client certificate required)
Connected clients:		3
Connected clusters:		1
Cluster "STSRHTS11405":
    Algorithm:		LMS
    Tie-breaker:	Node with lowest node ID
    Node ID 4:
        Client address:		::ffff:10.34.70.155:38968
        Configured node list:	1, 2, 3, 4
        Membership node list:	3, 4
        Vote:			NACK (NACK)
    Node ID 1:
        Client address:		::ffff:10.34.70.138:57178
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			ACK (ACK)
    Node ID 2:
        Client address:		::ffff:10.34.70.139:35488
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			ACK (ACK)



# pcs qdevice status net
QNetd address:			*:5403
TLS:				Supported (client certificate required)
Connected clients:		3
Connected clusters:		1
Cluster "STSRHTS11405":
    Algorithm:		LMS
    Tie-breaker:	Node with lowest node ID
    Node ID 4:
        Client address:		::ffff:10.34.70.155:38968
        Configured node list:	1, 2, 3, 4
        Membership node list:	3, 4
        Vote:			NACK (NACK)
    Node ID 1:
        Client address:		::ffff:10.34.70.138:57178
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			Wait for reply (ACK)
    Node ID 2:
        Client address:		::ffff:10.34.70.139:35488
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2
        Vote:			ACK (ACK)


The 'disconnected' node is ID 3 in this case, so the NACK on node ID 4 does not seem OK either.
Tried with ffsplit too, similar result:

# pcs qdevice status net
QNetd address:			*:5403
TLS:				Supported (client certificate required)
Connected clients:		3
Connected clusters:		1
Cluster "STSRHTS11405":
    Algorithm:		Fifty-Fifty split
    Tie-breaker:	Node with lowest node ID
    Node ID 1:
        Client address:		::ffff:10.34.70.138:57996
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2, 4
        Vote:			No change (ACK)
    Node ID 2:
        Client address:		::ffff:10.34.70.139:47830
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2, 4
        Vote:			ACK (ACK)
    Node ID 4:
        Client address:		::ffff:10.34.70.155:41912
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2, 4
        Vote:			ACK (ACK)

# pcs qdevice status net
QNetd address:			*:5403
TLS:				Supported (client certificate required)
Connected clients:		3
Connected clusters:		1
Cluster "STSRHTS11405":
    Algorithm:		Fifty-Fifty split
    Tie-breaker:	Node with lowest node ID
    Node ID 1:
        Client address:		::ffff:10.34.70.138:57996
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2, 4
        Vote:			No change (ACK)
    Node ID 2:
        Client address:		::ffff:10.34.70.139:47830
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2, 4
        Vote:			Wait for reply (ACK)
    Node ID 4:
        Client address:		::ffff:10.34.70.155:41912
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2, 4
        Vote:			Wait for reply (ACK)

# pcs qdevice status net
QNetd address:			*:5403
TLS:				Supported (client certificate required)
Connected clients:		3
Connected clusters:		1
Cluster "STSRHTS11405":
    Algorithm:		Fifty-Fifty split
    Tie-breaker:	Node with lowest node ID
    Node ID 1:
        Client address:		::ffff:10.34.70.138:57996
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2, 4
        Vote:			No change (ACK)
    Node ID 2:
        Client address:		::ffff:10.34.70.139:47830
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2, 4
        Vote:			ACK (ACK)
    Node ID 4:
        Client address:		::ffff:10.34.70.155:41912
        Configured node list:	1, 2, 3, 4
        Membership node list:	1, 2, 4
        Vote:			ACK (ACK)


Adding a new cluster log; let me know if any more are needed.

Comment 12 Roman Bednář 2016-09-13 15:58:37 UTC
Created attachment 1200555 [details]
new_log

Comment 13 Jan Friesse 2016-09-14 06:55:01 UTC
I strongly believe it is again something wrong on the iptables side (you did not paste the single failed node case). Basically, if you see:

Sep 13 16:35:16 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]:  [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Sep 13 16:35:19 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]:  [TOTEM ] A new membership (10.34.70.138:1008) was formed. Members
Sep 13 16:35:19 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]:  [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Sep 13 16:35:22 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]:  [TOTEM ] A new membership (10.34.70.138:1012) was formed. Members
Sep 13 16:35:22 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]:  [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Sep 13 16:35:25 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]:  

Repeating over and over again, it means you created a byzantine failure, and high CPU load is pretty normal in that case. If you want to isolate a node, you really need to make sure it can neither receive nor send messages.

Comment 14 Roman Bednář 2016-09-14 10:13:19 UTC
The problem seems to be that while testing the single-node failure I did not block one of the nodes, which really does not make sense in the end.

Everything seems to work fine when attempted with a cleaner way of blocking (instead of specifying individual IPs), like:

iptables -A INPUT ! -i lo -j DROP && iptables -A OUTPUT ! -o lo -j DROP
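
After the test, connectivity can be restored by flushing the rules again, for example
(this removes all rules in those chains on the test node):

iptables -F INPUT && iptables -F OUTPUT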

We can close this as 'not a bug'.

Comment 15 Jan Friesse 2016-09-14 11:59:43 UTC
@rbednar: good that you found and fixed the problem in the tests. Closing this bz.