Bug 1514492 - regression in pacemaker-1.1.16-12.el7_4.4.x86_64 / setup with remote-nodes not working anymore
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.4
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.6
Assignee: Ken Gaillot
QA Contact: Patrik Hagara
URL:
Whiteboard:
Depends On:
Blocks: 1609081
Reported: 2017-11-17 14:57 UTC by Renaud Marigny
Modified: 2018-10-30 07:59 UTC (History)

Fixed In Version: pacemaker-1.1.18-13.el7
Doc Type: No Doc Update
Doc Text:
The release note for Bug 1489728 should be sufficient to cover the issue here.
Clone Of:
Environment:
Last Closed: 2018-10-30 07:57:39 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1506372 None None None 2019-02-25 21:32:38 UTC
Red Hat Knowledge Base (Solution) 3255551 None None None 2018-07-26 19:33:12 UTC
Red Hat Product Errata RHBA-2018:3055 None None None 2018-10-30 07:59:13 UTC

Internal Links: 1506372

Comment 3 Bruno Travouillon 2017-11-18 10:30:05 UTC
I have been able to reproduce this issue with the following configuration:

----8<----
[root@support0 ~]# pcs config show
Cluster Name: supportHA
Corosync Nodes:
 support0.lab.local
Pacemaker Nodes:
 support0.lab.local

Resources:
 Resource: vm-cli1 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/cli1.xml
  Meta Attrs: remote-node=cli1
  Utilization: cpu=1 hv_memory=2048
  Operations: stop interval=0s timeout=90 (vm-cli1-stop-interval-0s)
              monitor interval=30s (vm-cli1-monitor-interval-30s)
              start interval=0s timeout=120 (vm-cli1-start-interval-0s)
 Resource: vg1A (class=ocf provider=heartbeat type=LVM)
  Attributes: volgrpname=vg1A
  Operations: start interval=0s timeout=30 (vg1A-start-interval-0s)
              stop interval=0s timeout=30 (vg1A-stop-interval-0s)
              monitor interval=10 timeout=30 (vg1A-monitor-interval-10)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: vg1A
    Constraint: location-vg1A
      Rule: score=-INFINITY (id:location-vg1A-rule)
        Expression: #kind eq container (id:location-vg1A-rule-expr)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: supportHA
 dc-version: 1.1.16-12.el7_4.4.debug-94ff4df
 have-watchdog: false
 stonith-enabled: false

Quorum:
  Options:
----8<----

The issue is that a monitor of the vg1A resource is run on the container (guest node) even though the location constraint bans vg1A there with a score of -INFINITY.

In the corosync.log, we can see the following message:
Nov 17 13:40:03 [28056] support0.lab.local crmd: notice: te_rsc_command: Initiating monitor operation vg1A_monitor_0 locally on cli1 | action 4

This regression was introduced in commit 12d453cc, which removed the skipping of active resource detection on containers.

----8<----
diff --git a/pengine/native.c b/pengine/native.c
index 37cf541..2e40a4c 100644
--- a/pengine/native.c
+++ b/pengine/native.c
@@ -2784,10 +2784,6 @@ native_create_probe(resource_t * rsc, node_t * node, action_t * complete,
     if (force == FALSE && is_not_set(data_set->flags, pe_flag_startup_probes)) {
         pe_rsc_trace(rsc, "Skipping active resource detection for %s", rsc->id);
         return FALSE;
-    } else if (force == FALSE && is_container_remote_node(node)) {
-        pe_rsc_trace(rsc, "Skipping active resource detection for %s on container %s",
-                     rsc->id, node->details->id);
-        return FALSE;
     }
 
     if (is_remote_node(node)) {
----8<----

With this change reverted on top of pacemaker-1.1.16-12.el7_4.4, the crmd no longer tries to initiate the monitor of the vg1A resource.
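
For readers who do not want to trace the C, here is a minimal Python sketch of the two early-return branches visible in the diff above. It models only the lines shown in the hunk; the rest of native_create_probe is not reproduced, and the function and parameter names are hypothetical.

```python
def skips_probe_before_12d453cc(force: bool, startup_probes: bool,
                                on_container_node: bool) -> bool:
    """Model of the early returns as they were before the commit."""
    if not force and not startup_probes:
        return True   # "Skipping active resource detection for %s"
    elif not force and on_container_node:
        return True   # the branch that the commit removed
    return False


def skips_probe_after_12d453cc(force: bool, startup_probes: bool,
                               on_container_node: bool) -> bool:
    """Model of the early returns after the commit: no container special case."""
    if not force and not startup_probes:
        return True
    return False


# Unforced probe, startup probes enabled, target is a container (guest) node:
print(skips_probe_before_12d453cc(False, True, True))  # True: probe was skipped
print(skips_probe_after_12d453cc(False, True, True))   # False: probe is now created
```

The only input combination where the two versions differ is an unforced probe on a container node with startup probes enabled, which is the case hit here.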

After some research in the pacemaker.git history, it looks like the patch is legit. The monitor action in the log is not a recurring resource monitor but a probe (a one-time monitor operation). Location constraints have an option to disable resource discovery: resource-discovery=never.
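
A probe can be told apart from a recurring monitor by the operation key in the log, such as vg1A_monitor_0 above. The sketch below assumes the key has the form resource_action_interval with the interval in milliseconds; the helper names are hypothetical, and actions whose names contain an underscore (for example migrate_to) are not handled.

```python
def parse_op_key(key: str):
    """Split an operation key of the assumed form <resource>_<action>_<interval_ms>.

    Splitting from the right keeps resource IDs that contain underscores intact.
    Simplification: actions containing '_' are not supported.
    """
    rsc, action, interval_ms = key.rsplit("_", 2)
    return rsc, action, int(interval_ms)


def is_probe(key: str) -> bool:
    """A probe is a monitor operation with an interval of 0."""
    _, action, interval_ms = parse_op_key(key)
    return action == "monitor" and interval_ms == 0


print(is_probe("vg1A_monitor_0"))      # True: one-time probe
print(is_probe("vg1A_monitor_10000"))  # False: recurring monitor (10s assumed as 10000 ms)
```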

With the following change in the configuration, I can't reproduce the issue with pacemaker-1.1.16-12.el7_4.4.

----8<----
# pcs constraint location remove location-vg1A
# pcs constraint location vg1A rule resource-discovery=never score=-INFINITY '#kind' eq container
# pcs constraint location show --full
Location Constraints:
  Resource: vg1A
    Constraint: location-vg1A (resource-discovery=never)
      Rule: score=-INFINITY (id:location-vg1A-rule)
        Expression: #kind eq container (id:location-vg1A-rule-expr)
----8<----

Comment 4 Ken Gaillot 2017-11-20 16:48:58 UTC
(In reply to Bruno Travouillon from comment #3)
> After some research in the pacemaker.git history, it looks like the patch is
> legit. The monitor action in the log does not deal with standard resource
> monitoring but with probes (one-time monitor operation). There is a location
> property to disable the resource discovery: resource-discovery=never.
> 
> With the following change in the configuration, I can't reproduce the issue
> with pacemaker-1.1.16-12.el7_4.4.

Yes, you have it exactly right here.

This was a long-planned fix for a limitation of guest nodes -- they didn't get probed at resource start-up like other nodes. As you saw, the probes are not attempts to start the resource, but attempts to determine the current status. They allow Pacemaker to ensure that resources aren't running where they're not supposed to be, and to properly re-detect resources that have been cleaned up.

Unfortunately, we did not consider cases where users would be relying on the absence of probes.

The good news is that the solution you found, setting resource-discovery=never, is exactly the right answer. That tells Pacemaker not to probe the resource in the constraint on that node, and is intended to be used in situations like this, where the software is not installed on the node, so probing isn't necessary.
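
As a rough illustration of the effect of resource-discovery=never, here is a minimal Python model of the decision "should this resource be probed on this node?". It is a sketch under stated assumptions, not Pacemaker's implementation: the class and function names are hypothetical, a constraint is reduced to a single #kind match, and only the values "always" (assumed default) and "never" are modeled.

```python
from dataclasses import dataclass


@dataclass
class LocationConstraint:
    resource: str
    node_kind: str                       # value compared with the node's #kind
    score: str
    resource_discovery: str = "always"   # assumed default when the option is omitted


def should_probe(resource: str, node_kind: str, constraints) -> bool:
    """Return False only if a matching constraint says resource-discovery=never."""
    for c in constraints:
        if (c.resource == resource and c.node_kind == node_kind
                and c.resource_discovery == "never"):
            return False
    return True


# Original constraint: bans vg1A from containers, but the probe still happens.
ban_only = [LocationConstraint("vg1A", "container", "-INFINITY")]
# Fixed constraint: same ban, plus resource-discovery=never.
ban_no_probe = [LocationConstraint("vg1A", "container", "-INFINITY", "never")]

print(should_probe("vg1A", "container", ban_only))      # True
print(should_probe("vg1A", "container", ban_no_probe))  # False
print(should_probe("vg1A", "cluster", ban_no_probe))    # True
```

The point of the model is that the score and the discovery option are independent: a -INFINITY score alone keeps the resource from being started on the node, but does not stop the probe.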

I will look into what we can do to prevent people from getting bit by this, but the new behavior fixes other important scenarios, so it's unlikely we'll revert it.

Thanks for discovering, reporting, and investigating this issue.

Comment 5 Ken Gaillot 2017-11-20 22:07:42 UTC
We will put a release note in 7.5 about the issue, and also for 7.5 (if approved), we will make Pacemaker log a warning the first time any probe fails, like:

warning: Processing failed op monitor for rsc1 on node1: unknown error (1)
warning: If it is not possible for rsc1 to run on node1, see the resource-discovery option for location constraints

This will probably not be backported to 7.4.

Comment 6 Ken Gaillot 2017-11-20 22:15:24 UTC
QA: Test procedure:

1. Configure a cluster of at least one cluster node and one guest node.
2. Configure a resource that can run on the cluster node, but requires software that isn't installed on the guest node.
3. Configure a location constraint banning the resource from the guest node (omitting the resource-discovery option).
4. Start the cluster.

Using the 1.1.18-6 or earlier packages for 7.5, or the 1.1.16-12.4 or 1.1.16-12.5 packages for 7.4, cluster status will show a failed monitor for the resource on the guest node, and the logs will show a warning about the failed monitor. After the fix here, the behavior will be the same, but there will be an additional log message referring users to the resource-discovery option, the first (and only the first) time the monitor fails.

Comment 7 Ken Gaillot 2017-11-22 15:20:12 UTC
The log message (covered by this bz) will be done for 7.6 due to a tight schedule for 7.5. However, the release note about the issue (covered by Bug 1489728) will be for 7.5.

Comment 12 Ken Gaillot 2018-05-02 23:22:31 UTC
(In reply to Ken Gaillot from comment #6)
> QA: Test procedure:

Updated ...

> 1. Configure a cluster of at least one cluster node and one guest node.
> 2. Configure a resource that can run on the cluster node, but requires
> software that isn't installed on the guest node.
> 3. Configure a location constraint banning the resource from the guest node
> (omitting the resource-discovery option).
> 4. Start the cluster.
> 
> Using the 1.1.18-6 or earlier packages for 7.5, or the 1.1.16-12.4 or
> 1.1.16-12.5 packages for 7.4, cluster status will show a failed monitor for
> the resource on the guest node, and the logs will show a warning about the
> failed monitor. After the fix here, the behavior will be the same, but there
> will be an additional log message referring users to the resource-discovery

The new message will only appear in the DC's logs. It will appear every time the failure is processed; however, it will only be logged for actual failed probes (as opposed to unexpected running/stopped status). The message will be something like:

warning: Processing failed probe of rsc1 on node1: some error here
notice: If it is not possible for rsc1 to run on node1, see the resource-discovery option for location constraints
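
For anyone who wants to check DC logs for these messages automatically, a small Python sketch follows. It assumes the message wording shown above; the regular expression and function names are hypothetical, and the sample lines are the illustrative ones from this comment, not real log output.

```python
import re

# Matches "Processing failed probe of <rsc> on <node>: <reason>"
FAILED_PROBE = re.compile(
    r"Processing failed probe of (?P<rsc>\S+) on (?P<node>\S+): (?P<reason>.+)$"
)
HINT = "see the resource-discovery option for location constraints"


def failed_probes(lines):
    """Return (resource, node, reason) for every failed-probe warning found."""
    found = []
    for line in lines:
        m = FAILED_PROBE.search(line)
        if m:
            found.append((m["rsc"], m["node"], m["reason"].strip()))
    return found


def has_discovery_hint(lines) -> bool:
    """True if any line carries the resource-discovery hint."""
    return any(HINT in line for line in lines)


sample = [
    "warning: Processing failed probe of rsc1 on node1: some error here",
    "notice: If it is not possible for rsc1 to run on node1, "
    "see the resource-discovery option for location constraints",
]
print(failed_probes(sample))       # [('rsc1', 'node1', 'some error here')]
print(has_discovery_hint(sample))  # True
```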

Comment 13 Ken Gaillot 2018-05-07 20:16:11 UTC
After investigating further, I realized that most resource agents return "not running" rather than a failure when their respective software is not installed, and thus do not have this problem.

I also found the LVM resource agent supplied with 7.5 no longer returns a failure in this case, either.

Thus, to test, it is necessary to use the LVM resource agent from 7.4 (or any agent that can return a failure for probes).

The good news for LVM agent users is that an upgrade to 7.5 should fix the issue. I am not aware of any other agents that would return a failure in this situation, but there probably are some.
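
To make the distinction concrete, here is a simplified Python sketch of how a probe's exit code separates "not running" from a failure. The constants are the standard OCF exit codes 0, 1 and 7; the function name is hypothetical, and real Pacemaker distinguishes more codes than the three outcomes modeled here.

```python
OCF_SUCCESS = 0       # resource is running
OCF_ERR_GENERIC = 1   # reported as "unknown error"
OCF_NOT_RUNNING = 7   # resource is cleanly stopped


def classify_probe(rc: int) -> str:
    """Simplified three-way classification of a probe result."""
    if rc == OCF_NOT_RUNNING:
        return "not running"   # expected answer on a node that cannot host the resource
    if rc == OCF_SUCCESS:
        return "running"
    return "failed"            # shows up under Failed Actions


print(classify_probe(OCF_NOT_RUNNING))  # not running
print(classify_probe(OCF_ERR_GENERIC))  # failed
```

An agent that answers a probe with "not running" when its software is missing never produces a failed action, which is why only agents like the 7.4 LVM agent trigger the problem.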

Comment 14 Ken Gaillot 2018-06-01 15:35:23 UTC
The log message is upstream as of commit 57800a92

Comment 16 Patrik Hagara 2018-08-16 15:59:33 UTC
environment: a single-node cluster + one remote node

before:
=======

Installed package versions:

> [root@virt-161 ~]# rpm -q pacemaker
> pacemaker-1.1.18-12.el7.x86_64
> [root@virt-161 ~]# ssh virt-162 rpm -q pacemaker-remote
> pacemaker-remote-1.1.18-12.el7.x86_64

Copy LVM resource agent from RHEL-7.4 (as per comment #13):

> [root@virt-161 ~]# cp LVM-agent-7.4 /usr/lib/ocf/resource.d/heartbeat/LVM
> cp: overwrite ‘/usr/lib/ocf/resource.d/heartbeat/LVM’? y
> [root@virt-161 ~]# scp LVM-agent-7.4 virt-162:/usr/lib/ocf/resource.d/heartbeat/LVM
> LVM-agent-7.4                                         100%   20KB  12.4MB/s   00:00

Create a (local) PV/VG/LV accessible only to the virt-161 cluster node:

> [root@virt-161 ~]# truncate --size 1G loop
> [root@virt-161 ~]# losetup -f loop
> [root@virt-161 ~]# losetup -l
> NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE
> /dev/loop0         0      0         0  0 /root/loop
> [root@virt-161 ~]# pvcreate /dev/loop0
>   WARNING: Failed to connect to lvmetad. Falling back to device scanning.
>   Physical volume "/dev/loop0" successfully created.
> [root@virt-161 ~]# vgcreate vg_test /dev/loop0
>   WARNING: Failed to connect to lvmetad. Falling back to device scanning.
>   Volume group "vg_test" successfully created
> [root@virt-161 ~]# lvcreate -n lv_test -l +100%free vg_test
>   WARNING: Failed to connect to lvmetad. Falling back to device scanning.
>   Logical volume "lv_test" created.

Create LVM cluster resource for the VG:

> [root@virt-161 ~]# pcs resource create vg ocf:heartbeat:LVM volgrpname=vg_test

Create a -INFINITY remote node location constraint for the LVM resource:

> [root@virt-161 ~]# pcs resource ban vg virt-162.cluster-qe.lab.eng.brq.redhat.com
> Warning: Creating location constraint cli-ban-vg-on-virt-162.cluster-qe.lab.eng.brq.redhat.com with a score of -INFINITY for resource vg on node virt-161.cluster-qe.lab.eng.brq.redhat.com.
> This will prevent vg from running on virt-162.cluster-qe.lab.eng.brq.redhat.com until the constraint is removed. This will be the case even if virt-162.cluster-qe.lab.eng.brq.redhat.com is the last node in the cluster.

Restart the cluster and examine cluster status:

> [root@virt-161 ~]# pcs cluster stop --all
> virt-161.cluster-qe.lab.eng.brq.redhat.com: Stopping Cluster (pacemaker)...
> virt-161.cluster-qe.lab.eng.brq.redhat.com: Stopping Cluster (corosync)...
> [root@virt-161 ~]# date
> Thu Aug 16 14:38:29 CEST 2018
> [root@virt-161 ~]# pcs cluster start --all --wait                              
> virt-161.cluster-qe.lab.eng.brq.redhat.com: Starting Cluster (corosync)...
> virt-161.cluster-qe.lab.eng.brq.redhat.com: Starting Cluster (pacemaker)...
> Waiting for node(s) to start...
> virt-161.cluster-qe.lab.eng.brq.redhat.com: Started
> [root@virt-161 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-161.cluster-qe.lab.eng.brq.redhat.com (version 1.1.18-12.el7-2b07d5c5a9) - partition with quorum
> Last updated: Thu Aug 16 14:39:14 2018
> Last change: Tue Aug 14 13:12:48 2018 by root via crm_resource on virt-161.cluster-qe.lab.eng.brq.redhat.com
> 
> 2 nodes configured
> 2 resources configured
> 
> Online: [ virt-161.cluster-qe.lab.eng.brq.redhat.com ]
> RemoteOnline: [ virt-162.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Full list of resources:
> 
>  virt-162.cluster-qe.lab.eng.brq.redhat.com	(ocf::pacemaker:remote):	Started virt-161.cluster-qe.lab.eng.brq.redhat.com
>  vg	(ocf::heartbeat:LVM):	Started virt-161.cluster-qe.lab.eng.brq.redhat.com
> 
> Failed Actions:
> * vg_monitor_0 on virt-162.cluster-qe.lab.eng.brq.redhat.com 'unknown error' (1): call=2984, status=complete, exitreason='LVM Volume vg_test is not available',
>     last-rc-change='Thu Aug 16 14:39:09 2018', queued=0ms, exec=91ms
> 
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled

Wait for cluster-recheck-interval (default 15 min) to elapse so that the policy engine runs again and re-processes the failed probe on the remote node, then check the DC logs:

> [root@virt-161 ~]# grep pengine: /var/log/cluster/corosync.log | cut -d' ' -f 3,6-
> 14:38:45    pengine:     info: crm_log_init:	Changed active directory to /var/lib/pacemaker/cores
> 14:38:45    pengine:     info: qb_ipcs_us_publish:	server name: pengine
> 14:38:45    pengine:     info: main:	Starting pengine
> 14:39:08    pengine:  warning: unpack_config:	Blind faith: not fencing unseen nodes
> 14:39:08    pengine:     info: determine_online_status:	Node virt-161.cluster-qe.lab.eng.brq.redhat.com is online
> 14:39:08    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 14:39:08    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 14:39:08    pengine:     info: common_print:	virt-162.cluster-qe.lab.eng.brq.redhat.com	(ocf::pacemaker:remote):	Stopped
> 14:39:08    pengine:     info: common_print:	vg	(ocf::heartbeat:LVM):	Stopped
> 14:39:08    pengine:     info: RecurringOp:	 Start recurring monitor (60s) for virt-162.cluster-qe.lab.eng.brq.redhat.com on virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:08    pengine:     info: RecurringOp:	 Start recurring monitor (10s) for vg on virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:08    pengine:   notice: LogAction:	 * Start      virt-162.cluster-qe.lab.eng.brq.redhat.com     ( virt-161.cluster-qe.lab.eng.brq.redhat.com )  
> 14:39:08    pengine:   notice: LogAction:	 * Start      vg                                             ( virt-161.cluster-qe.lab.eng.brq.redhat.com )  
> 14:39:08    pengine:   notice: process_pe_message:	Calculated transition 0, saving inputs in /var/lib/pacemaker/pengine/pe-input-24.bz2
> 14:39:09    pengine:     info: determine_online_status:	Node virt-161.cluster-qe.lab.eng.brq.redhat.com is online
> 14:39:09    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 14:39:09    pengine:     info: unpack_node_loop:	Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:39:09    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 14:39:09    pengine:     info: unpack_node_loop:	Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:39:09    pengine:     info: common_print:	virt-162.cluster-qe.lab.eng.brq.redhat.com	(ocf::pacemaker:remote):	Started virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09    pengine:     info: common_print:	vg	(ocf::heartbeat:LVM):	Stopped
> 14:39:09    pengine:     info: RecurringOp:	 Start recurring monitor (60s) for virt-162.cluster-qe.lab.eng.brq.redhat.com on virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09    pengine:     info: RecurringOp:	 Start recurring monitor (10s) for vg on virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09    pengine:     info: LogActions:	Leave   virt-162.cluster-qe.lab.eng.brq.redhat.com	(Started virt-161.cluster-qe.lab.eng.brq.redhat.com)
> 14:39:09    pengine:   notice: LogAction:	 * Start      vg                                             ( virt-161.cluster-qe.lab.eng.brq.redhat.com )  
> 14:39:09    pengine:   notice: process_pe_message:	Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-25.bz2
> 14:39:09    pengine:     info: determine_online_status:	Node virt-161.cluster-qe.lab.eng.brq.redhat.com is online
> 14:39:09    pengine:  warning: unpack_rsc_op_failure:	Processing failed op monitor for vg on virt-162.cluster-qe.lab.eng.brq.redhat.com: unknown error (1)
> 14:39:09    pengine:  warning: unpack_rsc_op_failure:	Processing failed op monitor for vg on virt-162.cluster-qe.lab.eng.brq.redhat.com: unknown error (1)
> 14:39:09    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 14:39:09    pengine:     info: unpack_node_loop:	Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:39:09    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 14:39:09    pengine:     info: unpack_node_loop:	Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:39:09    pengine:     info: common_print:	virt-162.cluster-qe.lab.eng.brq.redhat.com	(ocf::pacemaker:remote):	Started virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09    pengine:     info: common_print:	vg	(ocf::heartbeat:LVM):	FAILED virt-162.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09    pengine:     info: RecurringOp:	 Start recurring monitor (10s) for vg on virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09    pengine:     info: LogActions:	Leave   virt-162.cluster-qe.lab.eng.brq.redhat.com	(Started virt-161.cluster-qe.lab.eng.brq.redhat.com)
> 14:39:09    pengine:   notice: LogAction:	 * Recover    vg                                             ( virt-162.cluster-qe.lab.eng.brq.redhat.com -> virt-161.cluster-qe.lab.eng.brq.redhat.com )  
> 14:39:09    pengine:   notice: process_pe_message:	Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-26.bz2
> 14:54:10    pengine:     info: determine_online_status:	Node virt-161.cluster-qe.lab.eng.brq.redhat.com is online
> 14:54:10    pengine:  warning: unpack_rsc_op_failure:	Processing failed op monitor for vg on virt-162.cluster-qe.lab.eng.brq.redhat.com: unknown error (1)
> 14:54:10    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 14:54:10    pengine:     info: unpack_node_loop:	Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:54:10    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 14:54:10    pengine:     info: unpack_node_loop:	Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:54:10    pengine:     info: common_print:	virt-162.cluster-qe.lab.eng.brq.redhat.com	(ocf::pacemaker:remote):	Started virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:54:10    pengine:     info: common_print:	vg	(ocf::heartbeat:LVM):	Started virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:54:10    pengine:     info: LogActions:	Leave   virt-162.cluster-qe.lab.eng.brq.redhat.com	(Started virt-161.cluster-qe.lab.eng.brq.redhat.com)
> 14:54:10    pengine:     info: LogActions:	Leave   vg	(Started virt-161.cluster-qe.lab.eng.brq.redhat.com)
> 14:54:10    pengine:   notice: process_pe_message:	Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-27.bz2
> [root@virt-161 ~]# grep resource-discovery /var/log/cluster/corosync.log
> [root@virt-161 ~]# echo $?
> 1

The probe failure on the remote node is logged at two points in time: first at 14:39, when the cluster was starting, and again at 14:54, when cluster-recheck-interval was reached (and subsequently every 15 min after that). There is no resource-discovery hint in the logs.


after:
======

> [root@virt-149 ~]# rpm -q pacemaker
> pacemaker-1.1.19-6.el7.x86_64
> [root@virt-149 ~]# ssh virt-150 rpm -q pacemaker-remote
> pacemaker-remote-1.1.19-6.el7.x86_64
> [root@virt-149 ~]# cp LVM-agent-7.4 /usr/lib/ocf/resource.d/heartbeat/LVM
> cp: overwrite ‘/usr/lib/ocf/resource.d/heartbeat/LVM’? y
> [root@virt-149 ~]# scp LVM-agent-7.4 virt-150:/usr/lib/ocf/resource.d/heartbeat/LVM
> LVM-agent-7.4                                                    100%   20KB  11.3MB/s   00:00
> [root@virt-149 ~]# truncate --size 1G loop
> [root@virt-149 ~]# losetup -f loop
> [root@virt-149 ~]# losetup -l
> NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE
> /dev/loop0         0      0         0  0 /root/loop
> [root@virt-149 ~]# pvcreate /dev/loop0
>   WARNING: Failed to connect to lvmetad. Falling back to device scanning.
>   Physical volume "/dev/loop0" successfully created.
> [root@virt-149 ~]# vgcreate vg_test /dev/loop0
>   WARNING: Failed to connect to lvmetad. Falling back to device scanning.
>   Volume group "vg_test" successfully created
> [root@virt-149 ~]# lvcreate -n lv_test -l +100%free vg_test
>   WARNING: Failed to connect to lvmetad. Falling back to device scanning.
>   Logical volume "lv_test" created.
> [root@virt-149 ~]# pcs resource create vg ocf:heartbeat:LVM volgrpname=vg_test
> [root@virt-149 ~]# pcs resource ban vg virt-150.cluster-qe.lab.eng.brq.redhat.com
> Warning: Creating location constraint cli-ban-vg-on-virt-150.cluster-qe.lab.eng.brq.redhat.com with a score of -INFINITY for resource vg on node virt-150.cluster-qe.lab.eng.brq.redhat.com.
> This will prevent vg from running on virt-150.cluster-qe.lab.eng.brq.redhat.com until the constraint is removed. This will be the case even if virt-150.cluster-qe.lab.eng.brq.redhat.com is the last node in the cluster.
> [root@virt-149 ~]# pcs cluster stop --all
> virt-149.cluster-qe.lab.eng.brq.redhat.com: Stopping Cluster (pacemaker)...
> virt-149.cluster-qe.lab.eng.brq.redhat.com: Stopping Cluster (corosync)...
> [root@virt-149 ~]# date
> Thu Aug 16 16:48:14 CEST 2018
> [root@virt-149 ~]# pcs cluster start --all --wait
> virt-149.cluster-qe.lab.eng.brq.redhat.com: Starting Cluster (corosync)...
> virt-149.cluster-qe.lab.eng.brq.redhat.com: Starting Cluster (pacemaker)...
> Waiting for node(s) to start...
> virt-149.cluster-qe.lab.eng.brq.redhat.com: Started
> [root@virt-149 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-149.cluster-qe.lab.eng.brq.redhat.com (version 1.1.19-6.el7-c3c624ea3d) - partition with quorum
> Last updated: Thu Aug 16 16:50:41 2018
> Last change: Thu Aug 16 16:46:47 2018 by root via crm_resource on virt-149.cluster-qe.lab.eng.brq.redhat.com
> 
> 2 nodes configured
> 2 resources configured
> 
> Online: [ virt-149.cluster-qe.lab.eng.brq.redhat.com ]
> RemoteOnline: [ virt-150.cluster-qe.lab.eng.brq.redhat.com ]
> 
> Full list of resources:
> 
>  virt-150.cluster-qe.lab.eng.brq.redhat.com	(ocf::pacemaker:remote):	Started virt-149.cluster-qe.lab.eng.brq.redhat.com
>  vg	(ocf::heartbeat:LVM):	Started virt-149.cluster-qe.lab.eng.brq.redhat.com
> 
> Failed Actions:
> * vg_monitor_0 on virt-150.cluster-qe.lab.eng.brq.redhat.com 'unknown error' (1): call=19, status=complete, exitreason='LVM Volume vg_test is not available',
>     last-rc-change='Thu Aug 16 16:48:48 2018', queued=0ms, exec=65ms
> 
> 
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
> [root@virt-149 ~]# grep pengine: /var/log/cluster/corosync.log | cut -d' ' -f 3,6-
> 16:48:24    pengine:     info: crm_log_init:	Changed active directory to /var/lib/pacemaker/cores
> 16:48:24    pengine:     info: qb_ipcs_us_publish:	server name: pengine
> 16:48:24    pengine:     info: main:	Starting pengine
> 16:48:47    pengine:  warning: unpack_config:	Blind faith: not fencing unseen nodes
> 16:48:47    pengine:     info: determine_online_status:	Node virt-149.cluster-qe.lab.eng.brq.redhat.com is online
> 16:48:47    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 16:48:47    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 16:48:47    pengine:     info: common_print:	virt-150.cluster-qe.lab.eng.brq.redhat.com(ocf::pacemaker:remote):	Stopped
> 16:48:47    pengine:     info: common_print:	vg	(ocf::heartbeat:LVM):	Stopped
> 16:48:47    pengine:     info: RecurringOp:	 Start recurring monitor (60s) for virt-150.cluster-qe.lab.eng.brq.redhat.com on virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:47    pengine:     info: RecurringOp:	 Start recurring monitor (10s) for vg on virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:47    pengine:   notice: LogAction:	 * Start      virt-150.cluster-qe.lab.eng.brq.redhat.com     ( virt-149.cluster-qe.lab.eng.brq.redhat.com )  
> 16:48:47    pengine:   notice: LogAction:	 * Start      vg                                             ( virt-149.cluster-qe.lab.eng.brq.redhat.com )  
> 16:48:47    pengine:   notice: process_pe_message:	Calculated transition 0, saving inputs in /var/lib/pacemaker/pengine/pe-input-20.bz2
> 16:48:48    pengine:     info: determine_online_status:	Node virt-149.cluster-qe.lab.eng.brq.redhat.com is online
> 16:48:48    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 16:48:48    pengine:     info: unpack_node_loop:	Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 16:48:48    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 16:48:48    pengine:     info: unpack_node_loop:	Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 16:48:48    pengine:     info: common_print:	virt-150.cluster-qe.lab.eng.brq.redhat.com(ocf::pacemaker:remote):	Started virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:48    pengine:     info: common_print:	vg	(ocf::heartbeat:LVM):	Stopped
> 16:48:48    pengine:     info: RecurringOp:	 Start recurring monitor (60s) for virt-150.cluster-qe.lab.eng.brq.redhat.com on virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:48    pengine:     info: RecurringOp:	 Start recurring monitor (10s) for vg on virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:48    pengine:     info: LogActions:	Leave   virt-150.cluster-qe.lab.eng.brq.redhat.com	(Started virt-149.cluster-qe.lab.eng.brq.redhat.com)
> 16:48:48    pengine:   notice: LogAction:	 * Start      vg                                             ( virt-149.cluster-qe.lab.eng.brq.redhat.com )  
> 16:48:48    pengine:   notice: process_pe_message:	Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-21.bz2
> 16:48:49    pengine:     info: determine_online_status:	Node virt-149.cluster-qe.lab.eng.brq.redhat.com is online
> 16:48:49    pengine:  warning: unpack_rsc_op_failure:	Processing failed probe of vg on virt-150.cluster-qe.lab.eng.brq.redhat.com: unknown error | rc=1
> 16:48:49    pengine:   notice: unpack_rsc_op_failure:	If it is not possible for vg to run on virt-150.cluster-qe.lab.eng.brq.redhat.com, see the resource-discovery option for location constraints
> 16:48:49    pengine:  warning: unpack_rsc_op_failure:	Processing failed probe of vg on virt-150.cluster-qe.lab.eng.brq.redhat.com: unknown error | rc=1
> 16:48:49    pengine:   notice: unpack_rsc_op_failure:	If it is not possible for vg to run on virt-150.cluster-qe.lab.eng.brq.redhat.com, see the resource-discovery option for location constraints
> 16:48:49    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 16:48:49    pengine:     info: unpack_node_loop:	Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 16:48:49    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 16:48:49    pengine:     info: unpack_node_loop:	Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 16:48:49    pengine:     info: common_print:	virt-150.cluster-qe.lab.eng.brq.redhat.com(ocf::pacemaker:remote):	Started virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:49    pengine:     info: common_print:	vg	(ocf::heartbeat:LVM):	FAILED virt-150.cluster-qe.lab.eng.brq.redhat.com
> 16:48:49    pengine:     info: RecurringOp:	 Start recurring monitor (10s) for vg on virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:49    pengine:     info: LogActions:	Leave   virt-150.cluster-qe.lab.eng.brq.redhat.com	(Started virt-149.cluster-qe.lab.eng.brq.redhat.com)
> 16:48:49    pengine:   notice: LogAction:	 * Recover    vg                                             ( virt-150.cluster-qe.lab.eng.brq.redhat.com -> virt-149.cluster-qe.lab.eng.brq.redhat.com )  
> 16:48:49    pengine:   notice: process_pe_message:	Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-22.bz2
> 17:03:50    pengine:     info: determine_online_status:	Node virt-149.cluster-qe.lab.eng.brq.redhat.com is online
> 17:03:50    pengine:  warning: unpack_rsc_op_failure:	Processing failed probe of vg on virt-150.cluster-qe.lab.eng.brq.redhat.com: unknown error | rc=1
> 17:03:50    pengine:   notice: unpack_rsc_op_failure:	If it is not possible for vg to run on virt-150.cluster-qe.lab.eng.brq.redhat.com, see the resource-discovery option for location constraints
> 17:03:50    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 17:03:50    pengine:     info: unpack_node_loop:	Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 17:03:50    pengine:     info: unpack_node_loop:	Node 1 is already processed
> 17:03:50    pengine:     info: unpack_node_loop:	Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 17:03:50    pengine:     info: common_print:	virt-150.cluster-qe.lab.eng.brq.redhat.com(ocf::pacemaker:remote):	Started virt-149.cluster-qe.lab.eng.brq.redhat.com
> 17:03:50    pengine:     info: common_print:	vg	(ocf::heartbeat:LVM):	Started virt-149.cluster-qe.lab.eng.brq.redhat.com
> 17:03:50    pengine:     info: LogActions:	Leave   virt-150.cluster-qe.lab.eng.brq.redhat.com	(Started virt-149.cluster-qe.lab.eng.brq.redhat.com)
> 17:03:50    pengine:     info: LogActions:	Leave   vg	(Started virt-149.cluster-qe.lab.eng.brq.redhat.com)
> 17:03:50    pengine:   notice: process_pe_message:	Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-23.bz2

Cluster behavior remained the same, except that an additional message is now logged on the DC, pointing the administrator toward the configuration fix:

> If it is not possible for vg to run on virt-150.cluster-qe.lab.eng.brq.redhat.com, see the resource-discovery option for location constraints

Marking verified.

Comment 18 errata-xmlrpc 2018-10-30 07:57:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055

