| Summary: | qdevice lms algorithm gets stuck in a loop on network failure causing cluster split | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Roman Bednář <rbednar> |
| Component: | corosync | Assignee: | Jan Friesse <jfriesse> |
| Status: | CLOSED NOTABUG | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.3 | CC: | ccaulfie, cluster-maint, jfriesse, jkortus, rbednar |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-09-14 11:59:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | logs (1198957), logs2 (1199138), new_log (1200555) -- see comments | | |
I can't reproduce this on my VMs. Can you attach less abbreviated logs, please?

Created attachment 1199138 [details]
logs2

Sure. I added the logs here (zeroed before "disconnecting" the node).

Thanks. I'm off tomorrow, but I'll look at them on Monday.

Does the same also happen with ffsplit? This question applies in general to any bug filed in the future (where it makes sense for ffsplit).

With ffsplit it does not end up in a loop, but otherwise it looks similar. Corosync/pcs commands are blocked on the surviving partition too, and qdevice status tells us this:
# pcs qdevice status net
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 2
Connected clusters: 1
Cluster "STSRHTS114":
Algorithm: Fifty-Fifty split
Tie-breaker: Node with lowest node ID
Node ID 2:
Client address: ::ffff:10.34.70.170:49356
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: ACK (ACK)
Node ID 1:
Client address: ::ffff:10.34.70.149:50492
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: No change (ACK)
And cluster messages log:
corosync[2437]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]: [TOTEM ] A new membership (10.34.70.149:660) was formed. Members
corosync[2437]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]: [TOTEM ] A new membership (10.34.70.149:664) was formed. Members
corosync[2437]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]: [TOTEM ] A new membership (10.34.70.149:668) was formed. Members
corosync[2437]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]: [TOTEM ] A new membership (10.34.70.149:672) was formed. Members
corosync[2437]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]: [TOTEM ] A new membership (10.34.70.149:676) was formed. Members
corosync[2437]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
corosync[2437]: [TOTEM ] A new membership (10.34.70.149:680) was formed. Members
corosync[2437]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Additionally, I'm observing another problem: high corosync CPU load during these tests (spikes up to 95%). It happens with both lms and ffsplit.
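(For anyone retracing this, the 30000 ms ceiling in the VOTEQ messages above is the interval votequorum will wait for the qdevice poll during a membership change. A minimal sketch for checking what is actually configured and how often the cycle repeats; the log path is an assumption, adjust to wherever corosync logs on your setup:

# Dump the quorum/device keys corosync is running with
# (quorum.device.* keys only show up if they are set in corosync.conf).
corosync-cmapctl | grep -i 'quorum\.'

# Current votequorum/quorum view, including qdevice information.
corosync-quorumtool -s

# Rough count of how many times the "waiting for quorum device" cycle repeated;
# /var/log/cluster/corosync.log is an assumed log location.
grep -c 'waiting for quorum device' /var/log/cluster/corosync.log
)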
@rbednar: I'm lost. What do you mean by "loop"? What do you mean by "blocked"? During sync, and before qdevice gets an ACK/NACK, votequorum blocks other connections for (in the default configuration) 30 seconds; that is normal and expected behavior. Also, are you sure you really block NODE 3 and NODE 4 on nodes 1 and 2? Because if not, you create a byzantine failure.

Ok, I see it. The chain

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
DROP all -- <IP1_NODE1> 0.0.0.0/0
DROP all -- <IP2_NODE1> 0.0.0.0/0
DROP all -- <IP1_NODE2> 0.0.0.0/0
DROP all -- <IP2_NODE2> 0.0.0.0/0
DROP all -- <IP1_QNETD> 0.0.0.0/0
DROP all -- <IP2_QNETD> 0.0.0.0/0

is wrong. You block output whose *source* is IP_NODE1. You should block source 0.0.0.0/0 and destination IP_NODE1. Please fix your test and try it again.

Sorry, my bad, this was not intended. I tried again with correct iptables rules:
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP all -- 10.34.70.138 0.0.0.0/0
DROP all -- 192.168.0.12 0.0.0.0/0
DROP all -- 10.34.70.139 0.0.0.0/0
DROP all -- 192.168.0.13 0.0.0.0/0
DROP all -- 10.34.70.137 0.0.0.0/0
DROP all -- 192.168.0.11 0.0.0.0/0
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
DROP all -- 0.0.0.0/0 10.34.70.138
DROP all -- 0.0.0.0/0 192.168.0.12
DROP all -- 0.0.0.0/0 10.34.70.139
DROP all -- 0.0.0.0/0 192.168.0.13
DROP all -- 0.0.0.0/0 10.34.70.137
DROP all -- 0.0.0.0/0 192.168.0.11
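(For completeness, a sketch of the commands that would produce the corrected rules above on each node being isolated, assuming the same two addresses per peer as listed, that 10.34.70.137/192.168.0.11 is the qnetd host, and that firewalld is not managing the chains:

# Run on each node that should be cut off (nodes 3 and 4 here):
# drop everything arriving from the peers/qnetd and everything addressed to them.
for addr in 10.34.70.138 192.168.0.12 10.34.70.139 192.168.0.13 10.34.70.137 192.168.0.11; do
    iptables -A INPUT  -s "$addr" -j DROP
    iptables -A OUTPUT -d "$addr" -j DROP
done
)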
Now it seems OK when done on two nodes (4-node cluster). But when attempted on only one node, I'm still observing similar issues such as high CPU load and unstable qdevice votes, as shown below:
# pcs qdevice status net
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 3
Connected clusters: 1
Cluster "STSRHTS11405":
Algorithm: LMS
Tie-breaker: Node with lowest node ID
Node ID 4:
Client address: ::ffff:10.34.70.155:38968
Configured node list: 1, 2, 3, 4
Membership node list: 3, 4
Vote: NACK (NACK)
Node ID 1:
Client address: ::ffff:10.34.70.138:57178
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: Wait for reply (ACK)
Node ID 2:
Client address: ::ffff:10.34.70.139:35488
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: ACK (ACK)
# pcs qdevice status net
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 3
Connected clusters: 1
Cluster "STSRHTS11405":
Algorithm: LMS
Tie-breaker: Node with lowest node ID
Node ID 4:
Client address: ::ffff:10.34.70.155:38968
Configured node list: 1, 2, 3, 4
Membership node list: 3, 4
Vote: NACK (NACK)
Node ID 1:
Client address: ::ffff:10.34.70.138:57178
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: ACK (ACK)
Node ID 2:
Client address: ::ffff:10.34.70.139:35488
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: ACK (ACK)
# pcs qdevice status net
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 3
Connected clusters: 1
Cluster "STSRHTS11405":
Algorithm: LMS
Tie-breaker: Node with lowest node ID
Node ID 4:
Client address: ::ffff:10.34.70.155:38968
Configured node list: 1, 2, 3, 4
Membership node list: 3, 4
Vote: NACK (NACK)
Node ID 1:
Client address: ::ffff:10.34.70.138:57178
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: Wait for reply (ACK)
Node ID 2:
Client address: ::ffff:10.34.70.139:35488
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: ACK (ACK)
The 'disconnected' node ID is 3 in this case, so the NACK on node ID 4 does not seem OK either.
Tried with ffsplit too, similar result:
# pcs qdevice status net
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 3
Connected clusters: 1
Cluster "STSRHTS11405":
Algorithm: Fifty-Fifty split
Tie-breaker: Node with lowest node ID
Node ID 1:
Client address: ::ffff:10.34.70.138:57996
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2, 4
Vote: No change (ACK)
Node ID 2:
Client address: ::ffff:10.34.70.139:47830
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2, 4
Vote: ACK (ACK)
Node ID 4:
Client address: ::ffff:10.34.70.155:41912
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2, 4
Vote: ACK (ACK)
# pcs qdevice status net
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 3
Connected clusters: 1
Cluster "STSRHTS11405":
Algorithm: Fifty-Fifty split
Tie-breaker: Node with lowest node ID
Node ID 1:
Client address: ::ffff:10.34.70.138:57996
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2, 4
Vote: No change (ACK)
Node ID 2:
Client address: ::ffff:10.34.70.139:47830
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2, 4
Vote: Wait for reply (ACK)
Node ID 4:
Client address: ::ffff:10.34.70.155:41912
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2, 4
Vote: Wait for reply (ACK)
# pcs qdevice status net
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 3
Connected clusters: 1
Cluster "STSRHTS11405":
Algorithm: Fifty-Fifty split
Tie-breaker: Node with lowest node ID
Node ID 1:
Client address: ::ffff:10.34.70.138:57996
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2, 4
Vote: No change (ACK)
Node ID 2:
Client address: ::ffff:10.34.70.139:47830
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2, 4
Vote: ACK (ACK)
Node ID 4:
Client address: ::ffff:10.34.70.155:41912
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2, 4
Vote: ACK (ACK)
Adding new cluster log, let me know if any more are needed.
Created attachment 1200555 [details]
new_log
I strongly believe it is again something wrong on the iptables side (you didn't paste the one-failed-node case). Basically, if you see:

Sep 13 16:35:16 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Sep 13 16:35:19 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]: [TOTEM ] A new membership (10.34.70.138:1008) was formed. Members
Sep 13 16:35:19 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Sep 13 16:35:22 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]: [TOTEM ] A new membership (10.34.70.138:1012) was formed. Members
Sep 13 16:35:22 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]: [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
Sep 13 16:35:25 virt-011.cluster-qe.lab.eng.brq.redhat.com corosync[2643]:

repeating over and over again, it means you created a byzantine failure, and the high CPU load is pretty normal in that case. If you want to isolate a node, you really need to make sure it can neither receive nor send messages.

The problem seems to be that while testing the one-node failure I did not block one of the nodes, which really does not make sense in the end. Everything seems to work fine when attempted with a cleaner way of blocking (instead of specifying IPs), like:

iptables -A INPUT ! -i lo -j DROP && iptables -A OUTPUT ! -o lo -j DROP

We can close this as 'not a bug'.

@rbednar: good that you found and fixed the problem in the tests. Closing this bz.
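(The one-liner above leaves no obvious way back, so a small isolate/restore sketch may help when repeating the test; it assumes the INPUT/OUTPUT chains hold nothing else and that firewalld is not managing them:

# Isolate this node: drop all non-loopback traffic in both directions.
iptables -A INPUT  ! -i lo -j DROP
iptables -A OUTPUT ! -o lo -j DROP

# ... observe cluster/qdevice behaviour from the surviving partition ...

# Restore connectivity by deleting exactly the two rules added above
# (iptables -F would also work if the chains contained nothing else).
iptables -D INPUT  ! -i lo -j DROP
iptables -D OUTPUT ! -o lo -j DROP
)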
Created attachment 1198957 [details]
logs

It seems that the qdevice lms algorithm does not handle network failures correctly. In this case it appears to have gotten stuck in a loop while a network failure was simulated on two nodes at once. The symptoms are similar when trying different combinations of failures, such as cutting off only one node, or doing the same while the connection to qnetd is NOT interrupted. In those cases I was able to observe that the qnetd vote 'Wait for reply' was applied to one extra node that was left untouched during the tests on other nodes, although that might be a result of the same bug.

Reproducer:

1) Have 4 nodes; qnetd is running on a separate 5th node, using the lms algorithm.

2) Simulate a network failure on two nodes at once using iptables (IPv6 was disabled on all nodes in advance):

NODE3 and NODE4:
# iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP all -- <IP1_NODE1> 0.0.0.0/0
DROP all -- <IP2_NODE1> 0.0.0.0/0
DROP all -- <IP1_NODE2> 0.0.0.0/0
DROP all -- <IP2_NODE2> 0.0.0.0/0
DROP all -- <IP1_QNETD> 0.0.0.0/0
DROP all -- <IP2_QNETD> 0.0.0.0/0
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
DROP all -- <IP1_NODE1> 0.0.0.0/0
DROP all -- <IP2_NODE1> 0.0.0.0/0
DROP all -- <IP1_NODE2> 0.0.0.0/0
DROP all -- <IP2_NODE2> 0.0.0.0/0
DROP all -- <IP1_QNETD> 0.0.0.0/0
DROP all -- <IP2_QNETD> 0.0.0.0/0

Doing this, we block all communication from nodes 3 and 4 to nodes 1 and 2 and to the qnetd server.

3) Observe the loop in the qdevice status:

# pcs qdevice status net
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 2
Connected clusters: 1
Cluster "STSRHTS16137":
Algorithm: LMS
Tie-breaker: Node with lowest node ID
Node ID 1:
Client address: ::ffff:10.34.70.161:33042
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: ACK (ACK)
Node ID 2:
Client address: ::ffff:10.34.70.162:54548
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: ACK (ACK)

# pcs qdevice status net
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 2
Connected clusters: 1
Cluster "STSRHTS16137":
Algorithm: LMS
Tie-breaker: Node with lowest node ID
Node ID 1:
Client address: ::ffff:10.34.70.161:33042
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: Wait for reply (ACK)
Node ID 2:
Client address: ::ffff:10.34.70.162:54548
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: ACK (ACK)

# pcs qdevice status net
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 2
Connected clusters: 1
Cluster "STSRHTS16137":
Algorithm: LMS
Tie-breaker: Node with lowest node ID
Node ID 1:
Client address: ::ffff:10.34.70.161:33042
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: ACK (ACK)
Node ID 2:
Client address: ::ffff:10.34.70.162:54548
Configured node list: 1, 2, 3, 4
Membership node list: 1, 2
Vote: ACK (ACK)

==============================================

Logs and corosync.conf are in the attachments. Debugging was enabled in /etc/sysconfig/corosync-qnetd:

COROSYNC_QNETD_OPTIONS="-dd"

and in corosync.conf:

logging {
...
logger_subsys {
subsys: QDEVICE
debug: on
}
...
}

NOTES: While doing the above, it seems that on the 'surviving' partition corosync commands (or pcs commands that use them) are blocked. E.g.:

# corosync-quorumtool
(does nothing at all)

# pcs status
Cluster name: STSRHTS16137
(stops here)

Also tested the following (might have a common cause):

1) Reopen communication on node 4, reboot. The qnetd vote is still NACK.

2) Perform the failure on only one node out of 4 - gives similar results; it usually ends up with 'Wait for reply' from qnetd for one of the nodes that DID NOT get disconnected.

3) Tried the same as above but without blocking the qnetd server connection; it again ends up with 'Wait for reply' on two nodes instead of one.

=================================

corosynclib-2.4.0-4.el7.x86_64
corosync-2.4.0-4.el7.x86_64
corosync-qdevice-2.4.0-4.el7.x86_64
corosync-qnetd-2.4.0-4.el7.x86_64
pcs-0.9.152-8.el7.x86_64
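(For anyone re-running this reproducer, a minimal sketch for watching the qnetd and quorum view while the failure is injected; it uses only commands already shown in this report, run on the qnetd host and on a surviving cluster node respectively:

# On the qnetd host: refresh the per-node vote state every 2 seconds.
watch -n 2 'pcs qdevice status net'

# On a surviving cluster node: watch the votequorum view.
# Note this may appear to hang while votequorum is waiting for the
# qdevice poll, as described in the NOTES above.
watch -n 2 'corosync-quorumtool -s'
)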