Bug 1514492
| Summary: | regression in pacemaker-1.1.16-12.el7_4.4.x86_64 / setup with remote-nodes not working anymore | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Renaud Marigny <rmarigny> |
| Component: | pacemaker | Assignee: | Ken Gaillot <kgaillot> |
| Status: | CLOSED ERRATA | QA Contact: | Patrik Hagara <phagara> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 7.4 | CC: | abeekhof, bruno.travouillon, cluster-maint, jruemker, kgaillot, mnovacek, phagara, sbradley |
| Target Milestone: | rc | ||
| Target Release: | 7.6 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | pacemaker-1.1.18-13.el7 | Doc Type: | No Doc Update |
| Doc Text: | The release note for Bug 1489728 should be sufficient to cover the issue here. | Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-10-30 07:57:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1609081 | ||
(In reply to Bruno Travouillon from comment #3)
> After some research in the pacemaker.git history, it looks like the patch is
> legit. The monitor action in the log does not deal with standard resource
> monitoring but with probes (one-time monitor operation). There is a location
> property to disable the resource discovery: resource-discovery=never.
>
> With the following change in the configuration, I can't reproduce the issue
> with pacemaker-1.1.16-12.el7_4.4.

Yes, you have it exactly right. This was a long-planned fix for a limitation of guest nodes: they didn't get probed at resource start-up like other nodes. As you saw, the probes are not attempts to start the resource, but attempts to determine its current status. They allow Pacemaker to ensure that resources aren't running where they're not supposed to be, and to properly re-detect resources that have been cleaned up.

Unfortunately, we did not consider cases where users would be relying on the absence of probes. The good news is that the solution you found, setting resource-discovery=never, is exactly the right answer. That tells Pacemaker not to probe the resource in the constraint on that node, and it is intended for situations like this, where the software is not installed on the node, so probing isn't necessary.

I will look into what we can do to prevent people from getting bitten by this, but the new behavior fixes other important scenarios, so it's unlikely we'll revert it. Thanks for discovering, reporting, and investigating this issue.

We will put a release note in 7.5 about the issue, and also for 7.5 (if approved), we will make Pacemaker log a warning the first time any probe fails, like:

warning: Processing failed op monitor for rsc1 on node1: unknown error (1)
warning: If it is not possible for rsc1 to run on node1, see the resource-discovery option for location constraints

This will probably not be backported to 7.4.

QA: Test procedure:

1. Configure a cluster of at least one cluster node and one guest node.
2. Configure a resource that can run on the cluster node, but requires software that isn't installed on the guest node.
3. Configure a location constraint banning the resource from the guest node (omitting the resource-discovery option).
4. Start the cluster.

Using the 1.1.18-6 or earlier packages for 7.5, or the 1.1.16-12.4 or 1.1.16-12.5 packages for 7.4, cluster status will show a failed monitor for the resource on the guest node, and the logs will show a warning about the failed monitor. After the fix here, the behavior will be the same, but there will be an additional log message referring users to the resource-discovery option the first (and only the first) time the monitor fails.

The log message (covered by this bz) will be done for 7.6 due to a tight schedule for 7.5. However, the release note about the issue (covered by Bug 1489728) will be for 7.5.

(In reply to Ken Gaillot from comment #6)
> QA: Test procedure:

Updated ...

> 1. Configure a cluster of at least one cluster node and one guest node.
> 2. Configure a resource that can run on the cluster node, but requires
> software that isn't installed on the guest node.
> 3. Configure a location constraint banning the resource from the guest node
> (omitting the resource-discovery option).
> 4. Start the cluster.
>
> Using the 1.1.18-6 or earlier packages for 7.5, or the 1.1.16-12.4 or
> 1.1.16-12.5 packages for 7.4, cluster status will show a failed monitor for
> the resource on the guest node, and the logs will show a warning about the
> failed monitor. After the fix here, the behavior will be the same, but there
> will be an additional log message referring users to the resource-discovery

The new message will only appear in the DC's logs. It will appear every time the failure is processed; however, it will only be logged for actual failed probes (as opposed to unexpected running/stopped status).
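Whether a probe registers as a failure comes down to the agent's exit status: OCF_NOT_RUNNING (7) is a clean "not running" result, while a generic error such as OCF_ERR_GENERIC (1) is treated as a failed probe. The following is a toy sketch of that convention; the agent and the `toy_monitor` helper are hypothetical illustrations, not any real resource agent:

```shell
#!/bin/sh
# Standard OCF exit codes (toy sketch, not real agent code)
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# A probe is just a one-time monitor. A well-behaved agent reports
# "not running" when its software is absent, rather than a hard error.
toy_monitor() {
    required_bin=$1   # binary the resource depends on (hypothetical)
    if ! command -v "$required_bin" >/dev/null 2>&1; then
        # Software not installed: report "not running", not a failure,
        # so the probe does not show up under "Failed Actions".
        return $OCF_NOT_RUNNING
    fi
    # A real agent would check actual service state here.
    return $OCF_SUCCESS
}

toy_monitor /no/such/binary
echo "rc=$?"   # prints rc=7
```

An agent that instead returned $OCF_ERR_GENERIC in the software-absent case is the kind that triggers the failed-probe warning discussed above.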
The message will be something like:

warning: Processing failed probe of rsc1 on node1: some error here
notice: If it is not possible for rsc1 to run on node1, see the resource-discovery option for location constraints

After investigating further, I realized that most resource agents return "not running" rather than a failure when their respective software is not installed, and thus do not have this problem. I also found that the LVM resource agent supplied with 7.5 no longer returns a failure in this case, either. Thus, to test, it is necessary to use the LVM resource agent from 7.4 (or any agent that can return a failure for probes). The good news for LVM agent users is that an upgrade to 7.5 should fix the issue. I am not aware of any other agents that would return a failure in this situation, but there probably are some.

The log message is upstream as of commit 57800a92

environment: a single-node cluster + one remote node

before:
=======

Installed package versions:
> [root@virt-161 ~]# rpm -q pacemaker
> pacemaker-1.1.18-12.el7.x86_64
> [root@virt-161 ~]# ssh virt-162 rpm -q pacemaker-remote
> pacemaker-remote-1.1.18-12.el7.x86_64

Copy LVM resource agent from RHEL-7.4 (as per comment #13):
> [root@virt-161 ~]# cp LVM-agent-7.4 /usr/lib/ocf/resource.d/heartbeat/LVM
> cp: overwrite ‘/usr/lib/ocf/resource.d/heartbeat/LVM’? y
> [root@virt-161 ~]# scp LVM-agent-7.4 virt-162:/usr/lib/ocf/resource.d/heartbeat/LVM
> LVM-agent-7.4                                 100%   20KB  12.4MB/s   00:00

Create a (local) PV/VG/LV accessible only to the virt-161 cluster node:
> [root@virt-161 ~]# truncate --size 1G loop
> [root@virt-161 ~]# losetup -f loop
> [root@virt-161 ~]# losetup -l
> NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE
> /dev/loop0         0      0         0  0 /root/loop
> [root@virt-161 ~]# pvcreate /dev/loop0
> WARNING: Failed to connect to lvmetad. Falling back to device scanning.
> Physical volume "/dev/loop0" successfully created.
> [root@virt-161 ~]# vgcreate vg_test /dev/loop0
> WARNING: Failed to connect to lvmetad.
> Falling back to device scanning.
> Volume group "vg_test" successfully created
> [root@virt-161 ~]# lvcreate -n lv_test -l +100%free vg_test
> WARNING: Failed to connect to lvmetad. Falling back to device scanning.
> Logical volume "lv_test" created.

Create LVM cluster resource for the VG:
> [root@virt-161 ~]# pcs resource create vg ocf:heartbeat:LVM volgrpname=vg_test

Create a -INFINITY remote node location constraint for the LVM resource:
> [root@virt-161 ~]# pcs resource ban vg virt-162.cluster-qe.lab.eng.brq.redhat.com
> Warning: Creating location constraint cli-ban-vg-on-virt-162.cluster-qe.lab.eng.brq.redhat.com with a score of -INFINITY for resource vg on node virt-161.cluster-qe.lab.eng.brq.redhat.com.
> This will prevent vg from running on virt-162.cluster-qe.lab.eng.brq.redhat.com until the constraint is removed. This will be the case even if virt-162.cluster-qe.lab.eng.brq.redhat.com is the last node in the cluster.

Restart the cluster and examine cluster status:
> [root@virt-161 ~]# pcs cluster stop --all
> virt-161.cluster-qe.lab.eng.brq.redhat.com: Stopping Cluster (pacemaker)...
> virt-161.cluster-qe.lab.eng.brq.redhat.com: Stopping Cluster (corosync)...
> [root@virt-161 ~]# date
> Thu Aug 16 14:38:29 CEST 2018
> [root@virt-161 ~]# pcs cluster start --all --wait
> virt-161.cluster-qe.lab.eng.brq.redhat.com: Starting Cluster (corosync)...
> virt-161.cluster-qe.lab.eng.brq.redhat.com: Starting Cluster (pacemaker)...
> Waiting for node(s) to start...
> virt-161.cluster-qe.lab.eng.brq.redhat.com: Started
> [root@virt-161 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-161.cluster-qe.lab.eng.brq.redhat.com (version 1.1.18-12.el7-2b07d5c5a9) - partition with quorum
> Last updated: Thu Aug 16 14:39:14 2018
> Last change: Tue Aug 14 13:12:48 2018 by root via crm_resource on virt-161.cluster-qe.lab.eng.brq.redhat.com
>
> 2 nodes configured
> 2 resources configured
>
> Online: [ virt-161.cluster-qe.lab.eng.brq.redhat.com ]
> RemoteOnline: [ virt-162.cluster-qe.lab.eng.brq.redhat.com ]
>
> Full list of resources:
>
> virt-162.cluster-qe.lab.eng.brq.redhat.com (ocf::pacemaker:remote): Started virt-161.cluster-qe.lab.eng.brq.redhat.com
> vg (ocf::heartbeat:LVM): Started virt-161.cluster-qe.lab.eng.brq.redhat.com
>
> Failed Actions:
> * vg_monitor_0 on virt-162.cluster-qe.lab.eng.brq.redhat.com 'unknown error' (1): call=2984, status=complete, exitreason='LVM Volume vg_test is not available',
>     last-rc-change='Thu Aug 16 14:39:09 2018', queued=0ms, exec=91ms
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled

Wait a bit (cluster-recheck-interval, default 15 min) so that the probe on the remote node triggers again, and check DC logs:
> [root@virt-161 ~]# grep pengine: /var/log/cluster/corosync.log | cut -d' ' -f 3,6-
> 14:38:45 pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
> 14:38:45 pengine: info: qb_ipcs_us_publish: server name: pengine
> 14:38:45 pengine: info: main: Starting pengine
> 14:39:08 pengine: warning: unpack_config: Blind faith: not fencing unseen nodes
> 14:39:08 pengine: info: determine_online_status: Node virt-161.cluster-qe.lab.eng.brq.redhat.com is online
> 14:39:08 pengine: info: unpack_node_loop: Node 1 is already processed
> 14:39:08 pengine: info: unpack_node_loop: Node 1 is already processed
> 14:39:08 pengine: info: common_print: virt-162.cluster-qe.lab.eng.brq.redhat.com (ocf::pacemaker:remote):
> Stopped
> 14:39:08 pengine: info: common_print: vg (ocf::heartbeat:LVM): Stopped
> 14:39:08 pengine: info: RecurringOp: Start recurring monitor (60s) for virt-162.cluster-qe.lab.eng.brq.redhat.com on virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:08 pengine: info: RecurringOp: Start recurring monitor (10s) for vg on virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:08 pengine: notice: LogAction: * Start virt-162.cluster-qe.lab.eng.brq.redhat.com ( virt-161.cluster-qe.lab.eng.brq.redhat.com )
> 14:39:08 pengine: notice: LogAction: * Start vg ( virt-161.cluster-qe.lab.eng.brq.redhat.com )
> 14:39:08 pengine: notice: process_pe_message: Calculated transition 0, saving inputs in /var/lib/pacemaker/pengine/pe-input-24.bz2
> 14:39:09 pengine: info: determine_online_status: Node virt-161.cluster-qe.lab.eng.brq.redhat.com is online
> 14:39:09 pengine: info: unpack_node_loop: Node 1 is already processed
> 14:39:09 pengine: info: unpack_node_loop: Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:39:09 pengine: info: unpack_node_loop: Node 1 is already processed
> 14:39:09 pengine: info: unpack_node_loop: Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:39:09 pengine: info: common_print: virt-162.cluster-qe.lab.eng.brq.redhat.com (ocf::pacemaker:remote): Started virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09 pengine: info: common_print: vg (ocf::heartbeat:LVM): Stopped
> 14:39:09 pengine: info: RecurringOp: Start recurring monitor (60s) for virt-162.cluster-qe.lab.eng.brq.redhat.com on virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09 pengine: info: RecurringOp: Start recurring monitor (10s) for vg on virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09 pengine: info: LogActions: Leave virt-162.cluster-qe.lab.eng.brq.redhat.com (Started virt-161.cluster-qe.lab.eng.brq.redhat.com)
> 14:39:09 pengine: notice: LogAction: * Start vg ( virt-161.cluster-qe.lab.eng.brq.redhat.com )
> 14:39:09 pengine: notice:
> process_pe_message: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-25.bz2
> 14:39:09 pengine: info: determine_online_status: Node virt-161.cluster-qe.lab.eng.brq.redhat.com is online
> 14:39:09 pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for vg on virt-162.cluster-qe.lab.eng.brq.redhat.com: unknown error (1)
> 14:39:09 pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for vg on virt-162.cluster-qe.lab.eng.brq.redhat.com: unknown error (1)
> 14:39:09 pengine: info: unpack_node_loop: Node 1 is already processed
> 14:39:09 pengine: info: unpack_node_loop: Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:39:09 pengine: info: unpack_node_loop: Node 1 is already processed
> 14:39:09 pengine: info: unpack_node_loop: Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:39:09 pengine: info: common_print: virt-162.cluster-qe.lab.eng.brq.redhat.com (ocf::pacemaker:remote): Started virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09 pengine: info: common_print: vg (ocf::heartbeat:LVM): FAILED virt-162.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09 pengine: info: RecurringOp: Start recurring monitor (10s) for vg on virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:39:09 pengine: info: LogActions: Leave virt-162.cluster-qe.lab.eng.brq.redhat.com (Started virt-161.cluster-qe.lab.eng.brq.redhat.com)
> 14:39:09 pengine: notice: LogAction: * Recover vg ( virt-162.cluster-qe.lab.eng.brq.redhat.com -> virt-161.cluster-qe.lab.eng.brq.redhat.com )
> 14:39:09 pengine: notice: process_pe_message: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-26.bz2
> 14:54:10 pengine: info: determine_online_status: Node virt-161.cluster-qe.lab.eng.brq.redhat.com is online
> 14:54:10 pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for vg on virt-162.cluster-qe.lab.eng.brq.redhat.com: unknown error (1)
> 14:54:10 pengine: info:
> unpack_node_loop: Node 1 is already processed
> 14:54:10 pengine: info: unpack_node_loop: Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:54:10 pengine: info: unpack_node_loop: Node 1 is already processed
> 14:54:10 pengine: info: unpack_node_loop: Node virt-162.cluster-qe.lab.eng.brq.redhat.com is already processed
> 14:54:10 pengine: info: common_print: virt-162.cluster-qe.lab.eng.brq.redhat.com (ocf::pacemaker:remote): Started virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:54:10 pengine: info: common_print: vg (ocf::heartbeat:LVM): Started virt-161.cluster-qe.lab.eng.brq.redhat.com
> 14:54:10 pengine: info: LogActions: Leave virt-162.cluster-qe.lab.eng.brq.redhat.com (Started virt-161.cluster-qe.lab.eng.brq.redhat.com)
> 14:54:10 pengine: info: LogActions: Leave vg (Started virt-161.cluster-qe.lab.eng.brq.redhat.com)
> 14:54:10 pengine: notice: process_pe_message: Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-27.bz2
> [root@virt-161 ~]# grep resource-discovery /var/log/cluster/corosync.log
> [root@virt-161 ~]# echo $?
> 1

Two probe failures logged for the remote node -- first at 14:39 when the cluster was starting, and then at 14:54 when cluster-recheck-interval was reached (and subsequently every 15 min after that). No resource-discovery hint in the logs.

after:
======

> [root@virt-149 ~]# rpm -q pacemaker
> pacemaker-1.1.19-6.el7.x86_64
> [root@virt-149 ~]# ssh virt-150 rpm -q pacemaker-remote
> pacemaker-remote-1.1.19-6.el7.x86_64
> [root@virt-149 ~]# cp LVM-agent-7.4 /usr/lib/ocf/resource.d/heartbeat/LVM
> cp: overwrite ‘/usr/lib/ocf/resource.d/heartbeat/LVM’?
> y
> [root@virt-149 ~]# scp LVM-agent-7.4 virt-150:/usr/lib/ocf/resource.d/heartbeat/LVM
> LVM-agent-7.4                                 100%   20KB  11.3MB/s   00:00
> [root@virt-149 ~]# truncate --size 1G loop
> [root@virt-149 ~]# losetup -f loop
> [root@virt-149 ~]# losetup -l
> NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE
> /dev/loop0         0      0         0  0 /root/loop
> [root@virt-149 ~]# pvcreate /dev/loop0
> WARNING: Failed to connect to lvmetad. Falling back to device scanning.
> Physical volume "/dev/loop0" successfully created.
> [root@virt-149 ~]# vgcreate vg_test /dev/loop0
> WARNING: Failed to connect to lvmetad. Falling back to device scanning.
> Volume group "vg_test" successfully created
> [root@virt-149 ~]# lvcreate -n lv_test -l +100%free vg_test
> WARNING: Failed to connect to lvmetad. Falling back to device scanning.
> Logical volume "lv_test" created.
> [root@virt-149 ~]# pcs resource create vg ocf:heartbeat:LVM volgrpname=vg_test
> [root@virt-149 ~]# pcs resource ban vg virt-150.cluster-qe.lab.eng.brq.redhat.com
> Warning: Creating location constraint cli-ban-vg-on-virt-150.cluster-qe.lab.eng.brq.redhat.com with a score of -INFINITY for resource vg on node virt-150.cluster-qe.lab.eng.brq.redhat.com.
> This will prevent vg from running on virt-150.cluster-qe.lab.eng.brq.redhat.com until the constraint is removed. This will be the case even if virt-150.cluster-qe.lab.eng.brq.redhat.com is the last node in the cluster.
> [root@virt-149 ~]# pcs cluster stop --all
> virt-149.cluster-qe.lab.eng.brq.redhat.com: Stopping Cluster (pacemaker)...
> virt-149.cluster-qe.lab.eng.brq.redhat.com: Stopping Cluster (corosync)...
> [root@virt-149 ~]# date
> Thu Aug 16 16:48:14 CEST 2018
> [root@virt-149 ~]# pcs cluster start --all --wait
> virt-149.cluster-qe.lab.eng.brq.redhat.com: Starting Cluster (corosync)...
> virt-149.cluster-qe.lab.eng.brq.redhat.com: Starting Cluster (pacemaker)...
> Waiting for node(s) to start...
> virt-149.cluster-qe.lab.eng.brq.redhat.com: Started
> [root@virt-149 ~]# pcs status
> Cluster name: bzzt
> Stack: corosync
> Current DC: virt-149.cluster-qe.lab.eng.brq.redhat.com (version 1.1.19-6.el7-c3c624ea3d) - partition with quorum
> Last updated: Thu Aug 16 16:50:41 2018
> Last change: Thu Aug 16 16:46:47 2018 by root via crm_resource on virt-149.cluster-qe.lab.eng.brq.redhat.com
>
> 2 nodes configured
> 2 resources configured
>
> Online: [ virt-149.cluster-qe.lab.eng.brq.redhat.com ]
> RemoteOnline: [ virt-150.cluster-qe.lab.eng.brq.redhat.com ]
>
> Full list of resources:
>
> virt-150.cluster-qe.lab.eng.brq.redhat.com (ocf::pacemaker:remote): Started virt-149.cluster-qe.lab.eng.brq.redhat.com
> vg (ocf::heartbeat:LVM): Started virt-149.cluster-qe.lab.eng.brq.redhat.com
>
> Failed Actions:
> * vg_monitor_0 on virt-150.cluster-qe.lab.eng.brq.redhat.com 'unknown error' (1): call=19, status=complete, exitreason='LVM Volume vg_test is not available',
>     last-rc-change='Thu Aug 16 16:48:48 2018', queued=0ms, exec=65ms
>
> Daemon Status:
> corosync: active/disabled
> pacemaker: active/disabled
> pcsd: active/enabled
> [root@virt-149 ~]# grep pengine: /var/log/cluster/corosync.log | cut -d' ' -f 3,6-
> 16:48:24 pengine: info: crm_log_init: Changed active directory to /var/lib/pacemaker/cores
> 16:48:24 pengine: info: qb_ipcs_us_publish: server name: pengine
> 16:48:24 pengine: info: main: Starting pengine
> 16:48:47 pengine: warning: unpack_config: Blind faith: not fencing unseen nodes
> 16:48:47 pengine: info: determine_online_status: Node virt-149.cluster-qe.lab.eng.brq.redhat.com is online
> 16:48:47 pengine: info: unpack_node_loop: Node 1 is already processed
> 16:48:47 pengine: info: unpack_node_loop: Node 1 is already processed
> 16:48:47 pengine: info: common_print: virt-150.cluster-qe.lab.eng.brq.redhat.com (ocf::pacemaker:remote): Stopped
> 16:48:47 pengine: info: common_print: vg (ocf::heartbeat:LVM): Stopped
> 16:48:47 pengine: info: RecurringOp:
> Start recurring monitor (60s) for virt-150.cluster-qe.lab.eng.brq.redhat.com on virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:47 pengine: info: RecurringOp: Start recurring monitor (10s) for vg on virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:47 pengine: notice: LogAction: * Start virt-150.cluster-qe.lab.eng.brq.redhat.com ( virt-149.cluster-qe.lab.eng.brq.redhat.com )
> 16:48:47 pengine: notice: LogAction: * Start vg ( virt-149.cluster-qe.lab.eng.brq.redhat.com )
> 16:48:47 pengine: notice: process_pe_message: Calculated transition 0, saving inputs in /var/lib/pacemaker/pengine/pe-input-20.bz2
> 16:48:48 pengine: info: determine_online_status: Node virt-149.cluster-qe.lab.eng.brq.redhat.com is online
> 16:48:48 pengine: info: unpack_node_loop: Node 1 is already processed
> 16:48:48 pengine: info: unpack_node_loop: Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 16:48:48 pengine: info: unpack_node_loop: Node 1 is already processed
> 16:48:48 pengine: info: unpack_node_loop: Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 16:48:48 pengine: info: common_print: virt-150.cluster-qe.lab.eng.brq.redhat.com (ocf::pacemaker:remote): Started virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:48 pengine: info: common_print: vg (ocf::heartbeat:LVM): Stopped
> 16:48:48 pengine: info: RecurringOp: Start recurring monitor (60s) for virt-150.cluster-qe.lab.eng.brq.redhat.com on virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:48 pengine: info: RecurringOp: Start recurring monitor (10s) for vg on virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:48 pengine: info: LogActions: Leave virt-150.cluster-qe.lab.eng.brq.redhat.com (Started virt-149.cluster-qe.lab.eng.brq.redhat.com)
> 16:48:48 pengine: notice: LogAction: * Start vg ( virt-149.cluster-qe.lab.eng.brq.redhat.com )
> 16:48:48 pengine: notice: process_pe_message: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-21.bz2
> 16:48:49 pengine:
> info: determine_online_status: Node virt-149.cluster-qe.lab.eng.brq.redhat.com is online
> 16:48:49 pengine: warning: unpack_rsc_op_failure: Processing failed probe of vg on virt-150.cluster-qe.lab.eng.brq.redhat.com: unknown error | rc=1
> 16:48:49 pengine: notice: unpack_rsc_op_failure: If it is not possible for vg to run on virt-150.cluster-qe.lab.eng.brq.redhat.com, see the resource-discovery option for location constraints
> 16:48:49 pengine: warning: unpack_rsc_op_failure: Processing failed probe of vg on virt-150.cluster-qe.lab.eng.brq.redhat.com: unknown error | rc=1
> 16:48:49 pengine: notice: unpack_rsc_op_failure: If it is not possible for vg to run on virt-150.cluster-qe.lab.eng.brq.redhat.com, see the resource-discovery option for location constraints
> 16:48:49 pengine: info: unpack_node_loop: Node 1 is already processed
> 16:48:49 pengine: info: unpack_node_loop: Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 16:48:49 pengine: info: unpack_node_loop: Node 1 is already processed
> 16:48:49 pengine: info: unpack_node_loop: Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 16:48:49 pengine: info: common_print: virt-150.cluster-qe.lab.eng.brq.redhat.com (ocf::pacemaker:remote): Started virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:49 pengine: info: common_print: vg (ocf::heartbeat:LVM): FAILED virt-150.cluster-qe.lab.eng.brq.redhat.com
> 16:48:49 pengine: info: RecurringOp: Start recurring monitor (10s) for vg on virt-149.cluster-qe.lab.eng.brq.redhat.com
> 16:48:49 pengine: info: LogActions: Leave virt-150.cluster-qe.lab.eng.brq.redhat.com (Started virt-149.cluster-qe.lab.eng.brq.redhat.com)
> 16:48:49 pengine: notice: LogAction: * Recover vg ( virt-150.cluster-qe.lab.eng.brq.redhat.com -> virt-149.cluster-qe.lab.eng.brq.redhat.com )
> 16:48:49 pengine: notice: process_pe_message: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-22.bz2
> 17:03:50 pengine: info:
> determine_online_status: Node virt-149.cluster-qe.lab.eng.brq.redhat.com is online
> 17:03:50 pengine: warning: unpack_rsc_op_failure: Processing failed probe of vg on virt-150.cluster-qe.lab.eng.brq.redhat.com: unknown error | rc=1
> 17:03:50 pengine: notice: unpack_rsc_op_failure: If it is not possible for vg to run on virt-150.cluster-qe.lab.eng.brq.redhat.com, see the resource-discovery option for location constraints
> 17:03:50 pengine: info: unpack_node_loop: Node 1 is already processed
> 17:03:50 pengine: info: unpack_node_loop: Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 17:03:50 pengine: info: unpack_node_loop: Node 1 is already processed
> 17:03:50 pengine: info: unpack_node_loop: Node virt-150.cluster-qe.lab.eng.brq.redhat.com is already processed
> 17:03:50 pengine: info: common_print: virt-150.cluster-qe.lab.eng.brq.redhat.com (ocf::pacemaker:remote): Started virt-149.cluster-qe.lab.eng.brq.redhat.com
> 17:03:50 pengine: info: common_print: vg (ocf::heartbeat:LVM): Started virt-149.cluster-qe.lab.eng.brq.redhat.com
> 17:03:50 pengine: info: LogActions: Leave virt-150.cluster-qe.lab.eng.brq.redhat.com (Started virt-149.cluster-qe.lab.eng.brq.redhat.com)
> 17:03:50 pengine: info: LogActions: Leave vg (Started virt-149.cluster-qe.lab.eng.brq.redhat.com)
> 17:03:50 pengine: notice: process_pe_message: Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-23.bz2

Cluster behavior remained the same, except that an additional message is now logged on the DC, pointing the administrator toward the configuration fix:

> If it is not possible for vg to run on virt-150.cluster-qe.lab.eng.brq.redhat.com, see the resource-discovery option for location constraints

Marking verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3055
I have been able to reproduce this issue with the following configuration:

----8<----
[root@support0 ~]# pcs config show
Cluster Name: supportHA
Corosync Nodes:
 support0.lab.local
Pacemaker Nodes:
 support0.lab.local

Resources:
 Resource: vm-cli1 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/cli1.xml
  Meta Attrs: remote-node=cli1
  Utilization: cpu=1 hv_memory=2048
  Operations: stop interval=0s timeout=90 (vm-cli1-stop-interval-0s)
              monitor interval=30s (vm-cli1-monitor-interval-30s)
              start interval=0s timeout=120 (vm-cli1-start-interval-0s)
 Resource: vg1A (class=ocf provider=heartbeat type=LVM)
  Attributes: volgrpname=vg1A
  Operations: start interval=0s timeout=30 (vg1A-start-interval-0s)
              stop interval=0s timeout=30 (vg1A-stop-interval-0s)
              monitor interval=10 timeout=30 (vg1A-monitor-interval-10)

Stonith Devices:
Fencing Levels:

Location Constraints:
  Resource: vg1A
    Constraint: location-vg1A
      Rule: score=-INFINITY  (id:location-vg1A-rule)
        Expression: #kind eq container  (id:location-vg1A-rule-expr)
Ordering Constraints:
Colocation Constraints:
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: supportHA
 dc-version: 1.1.16-12.el7_4.4.debug-94ff4df
 have-watchdog: false
 stonith-enabled: false

Quorum:
  Options:
----8<----

The issue is related to the container trying to monitor the vg1A resource while the location constraint score is -INFINITY. In the corosync.log, we can see the following message:

Nov 17 13:40:03 [28056] support0.lab.local crmd: notice: te_rsc_command: Initiating monitor operation vg1A_monitor_0 locally on cli1 | action 4

This regression was introduced in commit 12d453cc, where the skip of active resource detection on containers was removed.
----8<----
diff --git a/pengine/native.c b/pengine/native.c
index 37cf541..2e40a4c 100644
--- a/pengine/native.c
+++ b/pengine/native.c
@@ -2784,10 +2784,6 @@ native_create_probe(resource_t * rsc, node_t * node, action_t * complete,
 
     if (force == FALSE && is_not_set(data_set->flags, pe_flag_startup_probes)) {
         pe_rsc_trace(rsc, "Skipping active resource detection for %s", rsc->id);
         return FALSE;
-    } else if (force == FALSE && is_container_remote_node(node)) {
-        pe_rsc_trace(rsc, "Skipping active resource detection for %s on container %s",
-                     rsc->id, node->details->id);
-        return FALSE;
     }
 
     if (is_remote_node(node)) {
----8<----

When reverting this change on top of pacemaker-1.1.16-12.el7_4.4, the crmd no longer tries to initiate the monitoring of the vg1A resource.

After some research in the pacemaker.git history, it looks like the patch is legit. The monitor action in the log does not deal with standard resource monitoring but with probes (one-time monitor operation). There is a location property to disable the resource discovery: resource-discovery=never.

With the following change in the configuration, I can't reproduce the issue with pacemaker-1.1.16-12.el7_4.4.

----8<----
# pcs constraint location remove location-vg1A
# pcs constraint location vg1A rule resource-discovery=never score=-INFINITY '#kind' eq container
# pcs constraint location show --full
Location Constraints:
  Resource: vg1A
    Constraint: location-vg1A (resource-discovery=never)
      Rule: score=-INFINITY  (id:location-vg1A-rule)
        Expression: #kind eq container  (id:location-vg1A-rule-expr)
----8<----
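For reference, the resulting constraint in the CIB should look roughly like the following XML sketch (reconstructed from the pcs output above; the element and attribute names follow the Pacemaker CIB schema as I understand it, with resource-discovery set on the rsc_location element, but exact layout may vary by Pacemaker version):

```xml
<rsc_location id="location-vg1A" rsc="vg1A" resource-discovery="never">
  <rule id="location-vg1A-rule" score="-INFINITY">
    <expression id="location-vg1A-rule-expr" attribute="#kind" operation="eq" value="container"/>
  </rule>
</rsc_location>
```

Setting resource-discovery on the constraint itself (rather than per rule) is what suppresses the one-time probe on nodes matched by the rule.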