Description of problem: SSIA (summary says it all)
More info on how qdevice heuristics are expected to work. It is possible to configure multiple "shell" commands to be executed. Commands are executed:

1. when a new membership is formed
2. on a regular time basis

Qdevice heuristics can be:

- Disabled - behavior is the same as qdevice without heuristics (qdevice in RHEL 7.3)
- Executed only on 1.
- Executed on 1. and 2.

When heuristics are required, all "shell" commands are executed. Qdevice checks the exit code of each command. If all shell commands succeed, the heuristics succeed; otherwise they fail. The heuristics success/failure is then used by qnetd as another (primary) level of tie-breaker.

Examples:

- 2 nodes, algorithm ffsplit. Nodes split, node A is able to execute all shell commands, node B isn't. Node A should stay quorate, node B should become unquorate.
- 2 nodes, algorithm ffsplit. Nodes split, node B is able to execute all shell commands, node A isn't. Node B should stay quorate, node A should become unquorate.
- 2 nodes, algorithm ffsplit. Nodes split, both nodes are able (or unable) to execute all shell commands. Because the result on both nodes is the same, the configured tie_breaker (quorum.device.net.tie_breaker) is used (= same behavior as without heuristics).

Example config file snippet:

---
quorum {
    provider: corosync_votequorum

    device {
        votes: 1
        model: net

        net {
            tls: on
            host: localhost
            algorithm: ffsplit
        }

        heuristics {
            # Mode - on/off/sync
            mode: on

            # Default 1/2 instance->heartbeat_interval
            # timeout: 5

            # Default 1/2 instance->sync_heartbeat_interval
            # sync_timeout: 15

            # Default 3 * instance->heartbeat_interval
            # interval: 30

            # Executables
            exec_ping: ping -q -c 1 "127.0.0.1"
            exec_ls: test -f /tmp/test
        }
    }
}
---
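As an illustration of the kind of heuristic command this enables, here is a minimal sketch of a gateway-reachability check. The script path, the gateway address, and the exec_gw key name are hypothetical (only the exec_ping/exec_ls keys above come from the actual config); qdevice only evaluates the exit code, so any such script just needs to exit 0 on success:

#!/bin/sh
# /usr/local/bin/qdevice-heuristic-gateway.sh  (hypothetical path)
# Exit 0 if the gateway answers a single ping, non-zero otherwise.
GATEWAY="192.0.2.1"   # assumed third-site gateway address; replace as needed
exec ping -q -c 1 -W 2 "$GATEWAY" >/dev/null 2>&1

It would then be referenced from the heuristics section in the same way as the examples above, e.g. "exec_gw: /usr/local/bin/qdevice-heuristic-gateway.sh".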
tl;dr: I would like to request that we also consider and pursue corosync-qdevice allowing the use of heuristics as a quorum-determining factor _without_ requiring the use of a qnetd server.

Long version:

As we've been working with a specific customer, we have identified an additional aspect that would be useful to have incorporated into corosync-qdevice's heuristic-based functionality. This customer has a storage-based tie-breaker method that they are happy with and pay good money for. They had a requirement that their RHEL HA cluster be able to use this mechanism to influence membership/fencing decisions, but we do not have any simple way to achieve this in RHEL 7. They're able to use connectivity to a third/neutral site as a tie-breaker for membership decisions (as this is what their storage solution does), but they aren't able to deploy additional servers in that location. They would prefer to just be able to ping a gateway and have that serve as a determining factor.

We're pursuing a few changes in pacemaker to try to allow fencing decisions to be made in a way that aligns with these requirements, and we've gotten close by using sbd (which aligns with the storage decision), ping scripts or resources (aligning with the network-based tie-breaker to the third site), and a proposed heuristic-based fence agent. We will probably end up delivering some combination of these to them as a short-term solution, but the challenges around this have given us reason to consider what the optimal solution would be for widespread usage across our customer base, since these seem like reasonable requirements that may continue to come up.

With corosync's QDevice being the solution we're positioning as the optimal way to achieve arbitration in single-membership clusters, it seems like this is the best place to develop any features that would enable these use cases. With heuristics already being a planned feature that's in progress, the only additional piece it seems we would need is the ability to arbitrate quorum _only_ through those heuristics, and not require any connection to a qnetd server.

So, I would like to tack this request onto the work that is already underway / soon-to-happen for the heuristics feature. If you'd like another bug tracking that additional request, let me know and I can open one. If there are any concerns or thoughts, let me know.
@John, a heuristics-only solution is for sure interesting. I really have to think about it much more, but in theory this could remove the need to have a qdevice disk model and just interface with sbd. What I'm not so sure about is how to really ensure that only one partition gets the qdevice vote, because the tie-breaker is then not in our hands and we must trust the 3rd-party provider. So at least official support may be kind of problematic.
Known issues:
- Regular heuristics is supported only by ffsplit. This is not a problem for clusters with power fencing, but deployments where a non-quorate partition continues to operate may see this as a problem.
- Qdevice-tool status doesn't contain detailed information about heuristics.
- Qdevice-tool doesn't have a way to trigger heuristics re-execution.

For QA: Please see corosync-qdevice.8 for a short example of how to configure heuristics.

Quick test (a rough command sketch is given below, after the backwards compatibility notes):
- Two nodes, both with the heuristics from the example
- On the first node create the file, on the second don't
- Use iptables to split the two nodes
- The first node should get the vote
- Repeat the test, but now create the file on the second node and not on the first. If you choose ffsplit, it should work even without a restart of the daemons or join/leave of new nodes.

Backwards compatibility test:
- When heuristics mode is off, or no exec_ variables are defined, qdevice should be able to connect to an old qnetd and everything should work.
- When heuristics mode is sync or on (with defined exec_ variables), qdevice should fail to connect to an old qnetd.
- An old qdevice version should work with a new qnetd without any problems (heuristics are then "Undefined").
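A rough sketch of the quick test above as shell commands. Node names are placeholders; the pcs heuristics syntax and the firewall approach are the ones used in the verification comments later in this bug:

# Configure the heuristic (same commands as in the test comments below):
pcs quorum device update heuristics "exec_ls= /usr/bin/test -f /tmp/test"
pcs quorum device update heuristics mode=on

# On the node that should win the tie-break, create the checked file:
touch /tmp/test

# Split the nodes, e.g. with an iptables script like /root/iptables shown
# later in this bug, then check which partition kept the qdevice vote:
pcs quorum status
pcs quorum device status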
I have used the HA Add-On Reference [1] and pcs [3] to set up the cluster with two nodes and a quorum device [3]. The quorum device is configured to check for the existence of /tmp/test. Both nodes have an /root/iptables script [4] for blocking network traffic from the other host. The connection is blocked on the node that is supposed to be rebooted, so that it can rejoin after the reboot.

The following tests have been successfully performed:
> Block network connections between virt-429 and virt-430 with the script [4]:
 * /tmp/test not on the lowest-id node (virt-429): heuristics fail on the lowest-id node (virt-429) and it is rebooted
 * /tmp/test on both nodes: heuristics pass on both nodes, the common non-heuristics method is used: the second node is rebooted (virt-430)
 * /tmp/test on neither node: heuristics fail on both nodes, the common non-heuristics method is used: the node without the lowest id is rebooted (virt-430)

---
> [1] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/high_availability_add-on_reference/index#s1-quorumdev-HAAR
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1389209#c30
> [3] cluster and quorum device configuration
> [root@virt-428 ~]# pcs quorum device update heuristics "exec_ls= /usr/bin/test -f /tmp/test"
> [root@virt-428 ~]# pcs quorum device update heuristics mode=on
> [root@virt-428 ~]# pcs qdevice status net --full
QNetd address:                  *:5403
TLS:                            Supported (client certificate required)
Connected clients:              2
Connected clusters:             1
Maximum send/receive size:      32768/32768 bytes
Cluster "STSRHTS20495":
    Algorithm:          Fifty-Fifty split
    Tie-breaker:        Node with lowest node ID
    Node ID 1:
        Client address:         2620:52:0:25a4:1800:ff:fe00:1ad:37524
        HB interval:            8000ms
        Configured node list:   1, 2
        Ring ID:                1.aa8
        Membership node list:   1, 2
        Heuristics:             Fail (membership: Fail, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)
    Node ID 2:
        Client address:         2620:52:0:25a4:1800:ff:fe00:1ae:54402
        HB interval:            8000ms
        Configured node list:   1, 2
        Ring ID:                1.aa8
        Membership node list:   1, 2
        Heuristics:             Pass (membership: Pass, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   No change (ACK)

>> Note that for the Heuristics: lines, "membership" is the status at the time of join or
>> membership change, and "regular" is the regularly-run result on the node, which is
>> updated on change only. Also it starts as Undefined, which is kind of a bug.
>[root@virt-429 tests]# pcs quorum status
Quorum information
------------------
Date:             Thu Jan 18 13:00:25 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/2728
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1    A,V,NMW virt-429 (local)
         2          1    A,V,NMW virt-430
         0          1            Qdevice

>[root@virt-429 tests]# pcs quorum device status
Qdevice information
-------------------
Model:                  Net
Node ID:                1
Configured node list:
    0   Node ID = 1
    1   Node ID = 2
Membership node list:   1, 2

Qdevice-net information
----------------------
Cluster name:           STSRHTS20495
QNetd host:             virt-428:5403
Algorithm:              Fifty-Fifty split
Tie-breaker:            Node with lowest node ID
State:                  Connected
Heuristics result:      Fail

>[root@virt-429 tests]# pcs config
Cluster Name: STSRHTS20495
Corosync Nodes:
 virt-429 virt-430
Pacemaker Nodes:
 virt-429 virt-430

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90 (clvmd-start-interval-0s)
               stop interval=0s timeout=90 (clvmd-stop-interval-0s)

Stonith Devices:
 Resource: fence-virt-429 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-429 pcmk_host_map=virt-429:virt-429.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-429-monitor-interval-60s)
 Resource: fence-virt-430 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-430 pcmk_host_map=virt-430:virt-430.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-430-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS20495
 dc-version: 1.1.18-8.el7-2b07d5c5a9
 have-watchdog: false
 last-lrm-refresh: 1516276684
 no-quorum-policy: freeze

Quorum:
  Options:
  Device:
    votes: 1
    Model: net
      algorithm: ffsplit
      host: virt-428
    Heuristics:
      exec_ls: /usr/bin/test -f /tmp/test

[4] /root/iptables
#!/bin/bash -x
for bin in iptables ip6tables; do
    $bin -F other_nodes_drop || $bin -N other_nodes_drop
    $bin -A other_nodes_drop -s <THE_OTHER_NODE> -j DROP
    $bin -nvL INPUT | grep -q other_nodes_drop || $bin -I INPUT -j other_nodes_drop
done
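A quick way to confirm the heuristic result flips when /tmp/test appears or disappears, using the same "Heuristics result" field shown in the pcs quorum device status output above (a sketch; with mode=on and ffsplit the result should update after the next regular heuristics run or membership change):

# On virt-429, heuristics currently fail because /tmp/test is missing:
pcs quorum device status | grep 'Heuristics result'
#   Heuristics result:      Fail

touch /tmp/test
# After the next heuristics execution or membership change it should report:
pcs quorum device status | grep 'Heuristics result'
#   Heuristics result:      Pass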
The same setup as in the previous comment is created, with the following differences:
 * the quorum device is set to the 'lms' algorithm [1], [3]
   > pcs quorum device add model net host=virt-428 algorithm=lms
   > pcs quorum device update heuristics "exec_ls= /usr/bin/test -f /tmp/test"
   > pcs quorum device update heuristics mode=on
 * the cluster has three nodes (+virt-431) [2]
 * the iptables scripts are modified to create two groups (virt-429, and virt-430 with virt-431) [4]

---
The following tests have been successfully performed:
 * block connections between the nodes, but not to the quorum device:
   * /tmp/test present on all or none of the nodes: the lowest-id node stays quorate, the other nodes are rebooted
   * /tmp/test present on virt-431 only (not the lowest-id node): virt-430 stays quorate, the other nodes (including the lowest-id node) are rebooted

---
> [1] # pcs qdevice status net --full
QNetd address:                  *:5403
TLS:                            Supported (client certificate required)
Connected clients:              3
Connected clusters:             1
Maximum send/receive size:      32768/32768 bytes
Cluster "STSRHTS20495":
    Algorithm:          LMS
    Tie-breaker:        Node with lowest node ID
    Node ID 3:
        Client address:         2620:52:0:25a4:1800:ff:fe00:1af:56514
        HB interval:            8000ms
        Configured node list:   1, 2, 3
        Ring ID:                1.1b74
        Membership node list:   1, 2, 3
        Heuristics:             Pass (membership: Pass, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)
    Node ID 1:
        Client address:         2620:52:0:25a4:1800:ff:fe00:1ad:53452
        HB interval:            8000ms
        Configured node list:   1, 2, 3
        Ring ID:                1.1b74
        Membership node list:   1, 2, 3
        Heuristics:             Pass (membership: Pass, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)
    Node ID 2:
        Client address:         2620:52:0:25a4:1800:ff:fe00:1ae:49566
        HB interval:            8000ms
        Configured node list:   1, 2, 3
        Ring ID:                1.1b74
        Membership node list:   1, 2, 3
        Heuristics:             Pass (membership: Pass, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)

> [2] pcs cluster config
Cluster Name: STSRHTS20495
Corosync Nodes:
 virt-429 virt-430 virt-431
Pacemaker Nodes:
 virt-429 virt-430 virt-431

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90 (clvmd-start-interval-0s)
               stop interval=0s timeout=90 (clvmd-stop-interval-0s)
 Clone: container-logs-clone
  Resource: container-logs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/sdb directory=/var/log/containers fstype=gfs2
   Operations: monitor interval=20 timeout=40 (container-logs-monitor-interval-20)
               notify interval=0s timeout=60 (container-logs-notify-interval-0s)
               start interval=0s timeout=60 (container-logs-start-interval-0s)
               stop interval=0s timeout=60 (container-logs-stop-interval-0s)
 Group: mysql-g
  Resource: db-vip (class=ocf provider=heartbeat type=IPaddr)
   Attributes: cidr_netmask=22 ip=10.37.165.126
   Operations: monitor interval=10s timeout=20s (db-vip-monitor-interval-10s)
               start interval=0s timeout=20s (db-vip-start-interval-0s)
               stop interval=0s timeout=20s (db-vip-stop-interval-0s)
  Resource: db-lvm (class=ocf provider=heartbeat type=LVM)
   Attributes: volgrpname=dbvg
   Operations: methods interval=0s timeout=5 (db-lvm-methods-interval-0s)
               monitor interval=10 timeout=30 (db-lvm-monitor-interval-10)
               start interval=0s timeout=30 (db-lvm-start-interval-0s)
               stop interval=0s timeout=30 (db-lvm-stop-interval-0s)
  Resource: db-fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/dbvg/dblv directory=/var/lib/mysql fstype=ext4
   Operations: monitor interval=20 timeout=40 (db-fs-monitor-interval-20)
               notify interval=0s timeout=60 (db-fs-notify-interval-0s)
               start interval=0s timeout=60 (db-fs-start-interval-0s)
               stop interval=0s timeout=60 (db-fs-stop-interval-0s)
  Resource: mysql (class=ocf provider=heartbeat type=mysql)
   Attributes: datadir=/var/lib/mysql log=/var/log/mariadb/mariadb.log pid=/run/mariadb/mariadb.pid
   Operations: demote interval=0s timeout=120 (mysql-demote-interval-0s)
               monitor interval=20 timeout=30 (mysql-monitor-interval-20)
               monitor interval=10 role=Master timeout=30 (mysql-monitor-interval-10)
               monitor interval=30 role=Slave timeout=30 (mysql-monitor-interval-30)
               notify interval=0s timeout=90 (mysql-notify-interval-0s)
               promote interval=0s timeout=120 (mysql-promote-interval-0s)
               start interval=0s timeout=120 (mysql-start-interval-0s)
               stop interval=0s timeout=120 (mysql-stop-interval-0s)

Stonith Devices:
 Resource: fence-virt-429 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-429 pcmk_host_map=virt-429:virt-429.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-429-monitor-interval-60s)
 Resource: fence-virt-430 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-430 pcmk_host_map=virt-430:virt-430.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-430-monitor-interval-60s)
 Resource: fence-virt-431 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-431 pcmk_host_map=virt-431:virt-431.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-431-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: mysql
    Enabled on: virt-429 (score:INFINITY) (role: Started) (id:cli-prefer-mysql)
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
  container-logs-clone with clvmd-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS20495
 dc-version: 1.1.18-8.el7-2b07d5c5a9
 have-watchdog: false
 last-lrm-refresh: 1516298753
 no-quorum-policy: freeze

Quorum:
  Options:
  Device:
    Model: net
      algorithm: lms
      host: virt-428
    Heuristics:
      exec_ls: /usr/bin/test -f /tmp/test
      mode: on

> [3] pcs cluster quorum
[root@virt-430 ~]# pcs quorum status
Quorum information
------------------
Date:             Fri Jan 19 11:47:23 2018
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          2
Ring ID:          1/7028
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1    A,V,NMW virt-429
         2          1    A,V,NMW virt-430 (local)
         3          1    A,V,NMW virt-431
         0          2            Qdevice

> [4] /root/connection-to-other-nodes-lost on virt-430:
#!/bin/bash -x
for bin in iptables ip6tables; do
    $bin -F other_nodes_drop || $bin -N other_nodes_drop
    for node in virt-429 virt-431; do
        $bin -A other_nodes_drop ! -i lo -s $node -p udp -j REJECT
        $bin -A other_nodes_drop ! -i lo -s $node -p tcp -j REJECT
        $bin -A other_nodes_drop ! -i lo -d $node -p udp -j REJECT
        $bin -A other_nodes_drop ! -i lo -d $node -p tcp -j REJECT
    done
    $bin -nvL INPUT | grep -q other_nodes_drop || $bin -I INPUT -j other_nodes_drop
    $bin -nvL OUTPUT | grep -q other_nodes_drop || $bin -I OUTPUT -j other_nodes_drop
done
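The split scripts above leave an other_nodes_drop chain attached to INPUT (and OUTPUT in the lms variant). A fenced node loses the rules on reboot, but if a node is not rebooted between test iterations, something like the following cleanup could be used. This is a sketch using standard iptables options and is not part of the original test setup:

#!/bin/bash -x
for bin in iptables ip6tables; do
    # Detach the custom chain from INPUT/OUTPUT if present, then flush and delete it.
    $bin -D INPUT -j other_nodes_drop 2>/dev/null
    $bin -D OUTPUT -j other_nodes_drop 2>/dev/null
    $bin -F other_nodes_drop 2>/dev/null
    $bin -X other_nodes_drop 2>/dev/null
done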
I have verified that quorum device heuristics functionality for ffsplit mode [comment #10] and lms mode [comment #11] works as expected in corosync-2.4.3-1.el7.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0920