Bug 1413573 - [RFE][TechPreview] qdevice: Include support for heuristics
Summary: [RFE][TechPreview] qdevice: Include support for heuristics
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: corosync
Version: 7.3
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks: 1280348 1507087 1389209
 
Reported: 2017-01-16 12:21 UTC by Jan Friesse
Modified: 2019-04-21 09:56 UTC
CC List: 7 users

Fixed In Version: corosync-2.4.3-1.el7
Doc Type: Technology Preview
Doc Text:
.Heuristics in `corosync-qdevice` available as a Technology Preview
Heuristics are a set of commands executed locally on startup, cluster membership change, successful connect to `corosync-qnetd`, and, optionally, on a periodic basis. When all commands finish successfully on time (their return error code is zero), heuristics have passed; otherwise, they have failed. The heuristics result is sent to `corosync-qnetd` where it is used in calculations to determine which partition should be quorate.
Clone Of:
Clones: 1507087
Environment:
Last Closed: 2018-04-10 16:52:19 UTC


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:0920 None None None 2018-04-10 16:53:42 UTC
Red Hat Bugzilla 1389209 None CLOSED [TechPreview] add support for managing qdevice heuristics 2019-09-20 18:06:15 UTC
Red Hat Bugzilla 1476401 None None None 2019-09-20 18:06:15 UTC
Red Hat Bugzilla 1535979 None CLOSED add requirement to install and start 'pcsd' on quorum device node as well 2019-09-20 18:06:15 UTC

Internal Links: 1389209 1476401 1535979

Description Jan Friesse 2017-01-16 12:21:22 UTC
Description of problem:
SSIA

Comment 1 Jan Friesse 2017-02-01 08:35:02 UTC
More info on how qdevice heuristics are expected to work.

It is possible to configure multiple "shell" commands to be executed.
The commands are executed:
1. when a new membership is formed
2. on a regular time basis

Qdevice heuristics can be:
- Disabled - behavior should be the same as qdevice without heuristics (qdevice in RHEL 7.3)
- Executed only on 1.
- Executed on 1. and 2.

When heuristics are enabled, all "shell" commands are executed. Qdevice checks the exit code of each command. If all shell commands succeed, the heuristics pass; otherwise they fail.

The heuristics pass/fail result is then used by qnetd as another (primary) level of tie-breaker.

Examples:
- 2 nodes, algorithm ffsplit. Nodes split, node A is able to execute all shell commands, node B isn't. Node A should stay quorate, node B should become unquorate.

- 2 nodes, algorithm ffsplit. Nodes split, node B is able to execute all shell commands, node A isn't. Node B should stay quorate, node A should become unquorate.

- 2 nodes, algorithm ffsplit. Nodes split, both nodes are able (or unable) to execute all shell commands. Because the result on both nodes is the same, the configured tie_breaker (quorum.device.net.tie_breaker) is used (= same behavior as without heuristics); a minimal tie_breaker snippet follows.
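For reference, the tie_breaker mentioned above is set under quorum.device.net; going by what corosync-qdevice(8) describes, a minimal snippet would be (lowest is the default; highest or a specific node ID are also possible):
---
            net {
                tie_breaker: lowest
            }
---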

Example config file snip:
---
quorum {
        provider: corosync_votequorum
        device {
            votes: 1
            model: net
            net {
                tls: on
                host: localhost
                algorithm: ffsplit
            }
            heuristics {
# Mode - on/off/sync
                mode: on
# Default 1/2 instance->heartbeat_interval
#               timeout: 5
# Default 1/2 instance->sync_heartbeat_interval
#               sync_timeout: 15
# Default 3 * instance->heartbeat_interval
#               interval: 30
# Executables
                exec_ping: ping -q -c 1 "127.0.0.1"
                exec_ls: test -f /tmp/test
            }
        }
}
---
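To illustrate what an exec_ command can look like in practice (only a sketch; the gateway address and script path below are made-up examples, not part of this bug), the only contract is that the command exits with code 0 within the configured timeout for the heuristic to pass:

#!/bin/bash
# Hypothetical /usr/local/bin/qdevice-heur-gw.sh
# Exit 0 if the (assumed) gateway answers a single ping, non-zero otherwise.
# corosync-qdevice treats exit code 0 as "heuristics passed".
GW=192.168.1.1
exec ping -q -c 1 -W 1 "$GW" > /dev/null

Such a script would then be referenced from the heuristics block with a line like "exec_gw: /usr/local/bin/qdevice-heur-gw.sh".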

Comment 5 John Ruemker 2017-08-10 16:05:16 UTC
tl;dr: I would like to request that we also consider and pursue corosync-qdevice allowing the use of heuristics as a quorum-determining factor _without_ requiring the use of a qnetd server. 

Long version:

As we've been working with a specific customer, we have identified an additional aspect that would be useful to have incorporated into corosync-qdevice's heuristic-based functionality.

This customer has a storage-based tie-breaker method that they are happy with and pay good money for.  They had a requirement that their RHEL HA cluster be able to use this mechanism to influence membership/fencing decisions, but we do not have any simple way to achieve this in RHEL 7.

They're able to use connectivity to a third/neutral site as a tie-breaker for membership decisions (as this is what their storage solution does), but they aren't able to deploy additional servers in that location.  They would prefer to just be able to ping a gateway and have that serve as a determining factor.

We're pursuing a few changes in pacemaker to try to allow fencing decisions to be made in a way that aligns with these requirements, and we've gotten close by using sbd (which aligns with the storage decision) and ping scripts or resources (aligning with the network-based tie breaker to the third site), and a proposed heuristic-based fence-agent.  We will probably end up delivering some combination of these to them as a short term solution, but the challenges around this have given us reason to consider what the optimal solution to this would be for widespread usage across our customer base, since these seem like reasonable requirements that may continue to come up.

With corosync's QDevice being the solution we're positioning as the optimal way to achieve arbitration in single-membership clusters, it seems like this is the best place to develop any features that would enable these use cases.  

With heuristics already being a planned feature that's in progress, the only additional piece it seems we would need is the ability to arbitrate quorum _only_ through those heuristics, and not require any connection to a qnetd server.  

So, I would like to tack this request onto the work that is already underway / soon-to-happen for the heuristic feature.  If you'd like another bug tracking that additional request, let me know and I can open one.

If there are any concerns or thoughts, let me know.

Comment 6 Jan Friesse 2017-08-11 06:02:30 UTC
@John,
A heuristics-only solution is for sure interesting. I really have to think about it much more, but in theory this could remove the need to have a qdevice disk model and just interface with sbd.

What I'm not so sure about is how to really achieve that only one partition gets the qdevice vote, because the tie-breaker is then not in our hands and we must trust the 3rd-party provider. So at least official support may be kind of problematic.

Comment 8 Jan Friesse 2017-10-20 15:01:32 UTC
Known issues:
 - Regular heuristics are supported only by ffsplit. This is not a
   problem for clusters with power fencing, but deployments where a
   non-quorate partition continues to operate may see this as a problem.
 - Qdevice-tool status doesn't contain detailed information about
   heuristics.
 - Qdevice-tool doesn't provide a way to trigger heuristics
   re-execution.

For QA:
Please see corosync-qdevice.8 for a short example of how to configure heuristics.

Quick test (a rough command sequence follows below):
- Two nodes, both with heuristics from the example
- On the first node create the file, on the second don't
- Use iptables to split the two nodes
- The first node should get the vote
- Repeat the test, but now create the file on the second node and not on the first. If you choose ffsplit, it should work even without restarting the daemons or a join/leave of new nodes.
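A rough command sequence for the quick test (node names here are placeholders):

# on node1: make the heuristic pass there
[root@node1 ~]# touch /tmp/test
# on node2: make sure the heuristic fails there
[root@node2 ~]# rm -f /tmp/test
# split the nodes with iptables (drop traffic from the other node on each side),
# then check which partition kept the qdevice vote
[root@node1 ~]# corosync-quorumtool -s
[root@node2 ~]# corosync-quorumtool -s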

Backwards compatibility test:
- When heuristics mode is off, or no exec_ variables are defined, qdevice should be able to connect to an old qnetd and everything should work (see the snippet below for disabling heuristics).
- When heuristics mode is sync or on (with exec_ variables defined), qdevice should fail to connect to an old qnetd.
- An old qdevice version should work with a new qnetd without any problems (heuristics are then "Undefined").
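For the first case, a minimal way to put qdevice back into the no-heuristics state (mirroring the example in comment 1) is to set the mode to off, or simply leave the heuristics block out:

---
            heuristics {
                mode: off
            }
---

With pcs this corresponds to "pcs quorum device update heuristics mode=off", by analogy with the mode=on command used during verification below.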

Comment 10 michal novacek 2018-01-18 17:30:52 UTC
I have used the HA Add-On reference [1] and the pcs test [2] to set up a cluster with two nodes and a quorum device [3].

The quorum device heuristic is configured to check for the existence of /tmp/test.

Both nodes have the /root/iptables script [4] for blocking network traffic
from the other host. The connection is blocked on the node that is expected to
be rebooted, so that it can rejoin after the reboot.


The following tests have been successfully performed:
> block network connections between virt-429 and virt-430 with the script [4], with:
    * /tmp/test not on the lowest-id node (virt-429):
        heuristics fail on the lowest-id node (virt-429) and it is rebooted
    * /tmp/test on both nodes:
        heuristics pass on both nodes, the common non-heuristics method is used:
        the second node is rebooted (virt-430)
    * /tmp/test on neither node:
        heuristics fail on both nodes, the common non-heuristics method is used:
        the node that is not the lowest-id node is rebooted (virt-430)

---

> [1] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/high_availability_add-on_reference/index#s1-quorumdev-HAAR

> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1389209#c30

> [3] cluster and quorum device configuration
> [root@virt-428 ~]# pcs quorum device update heuristics "exec_ls= /usr/bin/test -f /tmp/test"
> [root@virt-428 ~]# pcs quorum device update heuristics mode=on
> [root@virt-428 ~]# pcs qdevice status net --full
QNetd address:                  *:5403
TLS:                            Supported (client certificate required)
Connected clients:              2
Connected clusters:             1
Maximum send/receive size:      32768/32768 bytes
Cluster "STSRHTS20495":
    Algorithm:          Fifty-Fifty split
    Tie-breaker:        Node with lowest node ID
    Node ID 1:
        Client address:         2620:52:0:25a4:1800:ff:fe00:1ad:37524
        HB interval:            8000ms
        Configured node list:   1, 2
        Ring ID:                1.aa8
        Membership node list:   1, 2
        Heuristics:             Fail (membership: Fail, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)
    Node ID 2:
        Client address:         2620:52:0:25a4:1800:ff:fe00:1ae:54402
        HB interval:            8000ms
        Configured node list:   1, 2
        Ring ID:                1.aa8
        Membership node list:   1, 2
        Heuristics:             Pass (membership: Pass, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   No change (ACK)

>> Note that in the "Heuristics:" lines, "membership" is the status at the time
>> of join or membership change, and "regular" is the result of the regularly
>> run heuristics on the node, which is updated on change only. Also, it starts
>> as Undefined, which is kind of a bug.

>[root@virt-429 tests]# pcs quorum status
Quorum information
------------------
Date:             Thu Jan 18 13:00:25 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/2728
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1    A,V,NMW virt-429 (local)
         2          1    A,V,NMW virt-430
         0          1            Qdevice

>[root@virt-429 tests]# pcs quorum device status
Qdevice information
-------------------
Model:                  Net
Node ID:                1
Configured node list:
    0   Node ID = 1
    1   Node ID = 2
Membership node list:   1, 2

Qdevice-net information
----------------------
Cluster name:           STSRHTS20495
QNetd host:             virt-428:5403
Algorithm:              Fifty-Fifty split
Tie-breaker:            Node with lowest node ID
State:                  Connected
Heuristics result:      Fail

>[root@virt-429 tests]# pcs config
Cluster Name: STSRHTS20495
Corosync Nodes:
 virt-429 virt-430
Pacemaker Nodes:
 virt-429 virt-430

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true 
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true 
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90 (clvmd-start-interval-0s)
               stop interval=0s timeout=90 (clvmd-stop-interval-0s)

Stonith Devices:
 Resource: fence-virt-429 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-429 pcmk_host_map=virt-429:virt-429.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-429-monitor-interval-60s)
 Resource: fence-virt-430 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-430 pcmk_host_map=virt-430:virt-430.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-430-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS20495
 dc-version: 1.1.18-8.el7-2b07d5c5a9
 have-watchdog: false
 last-lrm-refresh: 1516276684
 no-quorum-policy: freeze

Quorum:
  Options:
  Device:
    votes: 1
    Model: net
      algorithm: ffsplit
      host: virt-428
    Heuristics:
      exec_ls: /usr/bin/test -f /tmp/test

[4] /root/iptables
#!/bin/bash -x
# Block all traffic coming from the other cluster node (IPv4 and IPv6).
for bin in iptables ip6tables;
do
    # Flush the chain if it already exists, otherwise create it
    $bin -F other_nodes_drop || $bin -N other_nodes_drop
    $bin -A other_nodes_drop -s <THE_OTHER_NODE> -j DROP
    # Hook the chain into INPUT only once
    $bin -nvL INPUT | grep -q other_nodes_drop || $bin -I INPUT -j other_nodes_drop
done

Comment 11 michal novacek 2018-01-22 10:44:45 UTC
The same setup as in the previous comment is created with the following
differences:

* quorum device set to the 'lms' algorithm [1], [3]
> pcs quorum device add model net host=virt-428 algorithm=lms
> pcs quorum device update heuristics "exec_ls= /usr/bin/test -f /tmp/test"
> pcs quorum device update heuristics mode=on

* the cluster has three nodes (+virt-431) [2]

* iptables scripts modified to create two groups (virt-429 and virt-430 with virt-431) [4]

---

The following tests have been successfully performed:

* block connections between the nodes but not to the quorum device:
    * /tmp/test present on all nodes or on none:
        the lowest-id node stays quorate, the other nodes are rebooted
    * /tmp/test present on virt-431 only (not the lowest-id node):
        virt-430 stays quorate, the other nodes (including the lowest-id node) are rebooted

---

> [1] # pcs qdevice status net --full
QNetd address:                  *:5403
TLS:                            Supported (client certificate required)
Connected clients:              3
Connected clusters:             1
Maximum send/receive size:      32768/32768 bytes
Cluster "STSRHTS20495":
    Algorithm:          LMS
    Tie-breaker:        Node with lowest node ID
    Node ID 3:
        Client address:         2620:52:0:25a4:1800:ff:fe00:1af:56514
        HB interval:            8000ms
        Configured node list:   1, 2, 3
        Ring ID:                1.1b74
        Membership node list:   1, 2, 3
        Heuristics:             Pass (membership: Pass, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)
    Node ID 1:
        Client address:         2620:52:0:25a4:1800:ff:fe00:1ad:53452
        HB interval:            8000ms
        Configured node list:   1, 2, 3
        Ring ID:                1.1b74
        Membership node list:   1, 2, 3
        Heuristics:             Pass (membership: Pass, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)
    Node ID 2:
        Client address:         2620:52:0:25a4:1800:ff:fe00:1ae:49566
        HB interval:            8000ms
        Configured node list:   1, 2, 3
        Ring ID:                1.1b74
        Membership node list:   1, 2, 3
        Heuristics:             Pass (membership: Pass, regular: Undefined)
        TLS active:             Yes (client certificate verified)
        Vote:                   ACK (ACK)

> [2] pcs cluster config
Cluster Name: STSRHTS20495
Corosync Nodes:
 virt-429 virt-430 virt-431
Pacemaker Nodes:
 virt-429 virt-430 virt-431

Resources:
 Clone: dlm-clone
  Meta Attrs: interleave=true ordered=true 
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: monitor interval=30s on-fail=fence (dlm-monitor-interval-30s)
               start interval=0s timeout=90 (dlm-start-interval-0s)
               stop interval=0s timeout=100 (dlm-stop-interval-0s)
 Clone: clvmd-clone
  Meta Attrs: interleave=true ordered=true 
  Resource: clvmd (class=ocf provider=heartbeat type=clvm)
   Attributes: with_cmirrord=1
   Operations: monitor interval=30s on-fail=fence (clvmd-monitor-interval-30s)
               start interval=0s timeout=90 (clvmd-start-interval-0s)
               stop interval=0s timeout=90 (clvmd-stop-interval-0s)
 Clone: container-logs-clone
  Resource: container-logs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/sdb directory=/var/log/containers fstype=gfs2
   Operations: monitor interval=20 timeout=40 (container-logs-monitor-interval-20)
               notify interval=0s timeout=60 (container-logs-notify-interval-0s)
               start interval=0s timeout=60 (container-logs-start-interval-0s)
               stop interval=0s timeout=60 (container-logs-stop-interval-0s)
 Group: mysql-g
  Resource: db-vip (class=ocf provider=heartbeat type=IPaddr)
   Attributes: cidr_netmask=22 ip=10.37.165.126
   Operations: monitor interval=10s timeout=20s (db-vip-monitor-interval-10s)
               start interval=0s timeout=20s (db-vip-start-interval-0s)
               stop interval=0s timeout=20s (db-vip-stop-interval-0s)
  Resource: db-lvm (class=ocf provider=heartbeat type=LVM)
   Attributes: volgrpname=dbvg
   Operations: methods interval=0s timeout=5 (db-lvm-methods-interval-0s)
               monitor interval=10 timeout=30 (db-lvm-monitor-interval-10)
               start interval=0s timeout=30 (db-lvm-start-interval-0s)
               stop interval=0s timeout=30 (db-lvm-stop-interval-0s)
  Resource: db-fs (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/dbvg/dblv directory=/var/lib/mysql fstype=ext4
   Operations: monitor interval=20 timeout=40 (db-fs-monitor-interval-20)
               notify interval=0s timeout=60 (db-fs-notify-interval-0s)
               start interval=0s timeout=60 (db-fs-start-interval-0s)
               stop interval=0s timeout=60 (db-fs-stop-interval-0s)
  Resource: mysql (class=ocf provider=heartbeat type=mysql)
   Attributes: datadir=/var/lib/mysql log=/var/log/mariadb/mariadb.log pid=/run/mariadb/mariadb.pid
   Operations: demote interval=0s timeout=120 (mysql-demote-interval-0s)
               monitor interval=20 timeout=30 (mysql-monitor-interval-20)
               monitor interval=10 role=Master timeout=30 (mysql-monitor-interval-10)
               monitor interval=30 role=Slave timeout=30 (mysql-monitor-interval-30)
               notify interval=0s timeout=90 (mysql-notify-interval-0s)
               promote interval=0s timeout=120 (mysql-promote-interval-0s)
               start interval=0s timeout=120 (mysql-start-interval-0s)
               stop interval=0s timeout=120 (mysql-stop-interval-0s)

Stonith Devices:
 Resource: fence-virt-429 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-429 pcmk_host_map=virt-429:virt-429.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-429-monitor-interval-60s)
 Resource: fence-virt-430 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-430 pcmk_host_map=virt-430:virt-430.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-430-monitor-interval-60s)
 Resource: fence-virt-431 (class=stonith type=fence_xvm)
  Attributes: pcmk_host_check=static-list pcmk_host_list=virt-431 pcmk_host_map=virt-431:virt-431.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-virt-431-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: mysql
    Enabled on: virt-429 (score:INFINITY) (role: Started) (id:cli-prefer-mysql)
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
  container-logs-clone with clvmd-clone (score:INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS20495
 dc-version: 1.1.18-8.el7-2b07d5c5a9
 have-watchdog: false
 last-lrm-refresh: 1516298753
 no-quorum-policy: freeze

Quorum:
  Options:
  Device:
    Model: net
      algorithm: lms
      host: virt-428
    Heuristics:
      exec_ls: /usr/bin/test -f /tmp/test
      mode: on

> [3] pcs cluster quorum
[root@virt-430 ~]# pcs quorum status
Quorum information
------------------
Date:             Fri Jan 19 11:47:23 2018
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          2
Ring ID:          1/7028
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3  
Flags:            Quorate Qdevice 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1    A,V,NMW virt-429
         2          1    A,V,NMW virt-430 (local)
         3          1    A,V,NMW virt-431
         0          2            Qdevice

> [4] /root/connection-to-other-nodes-lost

virt-430:
#!/bin/bash -x
for bin in iptables ip6tables;
do
    $bin -F other_nodes_drop || $bin -N other_nodes_drop

    for node in virt-429 virt-431;
    do
        $bin -A other_nodes_drop ! -i lo -s $node -p udp -j REJECT 
        $bin -A other_nodes_drop ! -i lo -s $node -p tcp -j REJECT 

        $bin -A other_nodes_drop ! -i lo -d $node -p udp -j REJECT 
        $bin -A other_nodes_drop ! -i lo -d $node -p tcp -j REJECT 
    done

    $bin -nvL INPUT | grep -q other_nodes_drop || $bin -I INPUT -j other_nodes_drop
    $bin -nvL OUTPUT | grep -q other_nodes_drop || $bin -I OUTPUT -j other_nodes_drop
done

Comment 12 michal novacek 2018-01-22 10:50:56 UTC
I have verified that the quorum device heuristics functionality for ffsplit mode [comment #10] and lms mode [comment #11] works as expected in corosync-2.4.3-1.el7.x86_64.

Comment 18 errata-xmlrpc 2018-04-10 16:52:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0920

