Created attachment 1543573 [details]
'pcs cluster report' output

Description of problem:
In a two-node cluster with VirtualDomain resources and gfs2 filesystems [1] that seems to be running happily, I found that 'crm_resource --wait' never finishes. It shows a lot of pending actions [3] that never seem to complete. Once a monitor action times out, the virtual machine is shut down and started again (very undesirable). This also means that the cluster never settles.

Version-Release number of selected component (if applicable):
corosync-3.0.0-2.el8.x86_64
pacemaker-2.0.1-4.el8.x86_64
resource-agents-4.1.1-17.el8.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create a two-node cluster [1], [2] and observe the pending actions

Actual results:
cluster never settled, lots of pending actions

Expected results:
cluster settled

Additional info:

> [1]: pcs config

Cluster Name: STSRHTS3983
Corosync Nodes:
 light-01.cluster-qe.lab.eng.brq.redhat.com light-03.cluster-qe.lab.eng.brq.redhat.com
Pacemaker Nodes:
 light-01.cluster-qe.lab.eng.brq.redhat.com light-03.cluster-qe.lab.eng.brq.redhat.com

Resources:
 Clone: locking-clone
  Meta Attrs: interleave=true
  Group: locking
   Resource: dlm (class=ocf provider=pacemaker type=controld)
    Operations: monitor interval=30s (dlm-monitor-interval-30s)
                start interval=0s timeout=90s (dlm-start-interval-0s)
                stop interval=0s timeout=100s (dlm-stop-interval-0s)
   Resource: lvmlockd (class=ocf provider=heartbeat type=lvmlockd)
    Attributes: with_cmirrord=1
    Operations: monitor interval=30s (lvmlockd-monitor-interval-30s)
                start interval=0s timeout=90s (lvmlockd-start-interval-0s)
                stop interval=0s timeout=90s (lvmlockd-stop-interval-0s)
 Clone: group-var-lib-libvirt-images-clone
  Meta Attrs: clone-max=2 interleave=true ordered=true
  Group: group-var-lib-libvirt-images
   Resource: lv-var-lib-libvirt-images (class=ocf provider=heartbeat type=LVM-activate)
    Attributes: activation_mode=shared lvname=images0 vg_access_mode=lvmlockd vgname=shared
    Operations: monitor interval=30s timeout=90s (lv-var-lib-libvirt-images-monitor-interval-30s)
                start interval=0s timeout=90s (lv-var-lib-libvirt-images-start-interval-0s)
                stop interval=0s timeout=90s (lv-var-lib-libvirt-images-stop-interval-0s)
   Resource: fs-var-lib-libvirt-images (class=ocf provider=heartbeat type=Filesystem)
    Attributes: device=/dev/shared/images0 directory=/var/lib/libvirt/images fstype=gfs2 options=
    Operations: monitor interval=30s (fs-var-lib-libvirt-images-monitor-interval-30s)
                notify interval=0s timeout=60s (fs-var-lib-libvirt-images-notify-interval-0s)
                start interval=0s timeout=60s (fs-var-lib-libvirt-images-start-interval-0s)
                stop interval=0s timeout=60s (fs-var-lib-libvirt-images-stop-interval-0s)
 Clone: group-etc-libvirt-qemu-clone
  Meta Attrs: clone-max=2 interleave=true ordered=true
  Group: group-etc-libvirt-qemu
   Resource: vg-etc-libvirt-qemu (class=ocf provider=heartbeat type=LVM-activate)
    Attributes: activation_mode=shared lvname=etc0 vg_access_mode=lvmlockd vgname=shared
    Operations: monitor interval=30s timeout=90s (vg-etc-libvirt-qemu-monitor-interval-30s)
                start interval=0s timeout=90s (vg-etc-libvirt-qemu-start-interval-0s)
                stop interval=0s timeout=90s (vg-etc-libvirt-qemu-stop-interval-0s)
   Resource: fs-etc-libvirt-qemu (class=ocf provider=heartbeat type=Filesystem)
    Attributes: device=/dev/shared/etc0 directory=/etc/libvirt/qemu fstype=gfs2 options=
    Operations: monitor interval=30s (fs-etc-libvirt-qemu-monitor-interval-30s)
                notify interval=0s timeout=60s (fs-etc-libvirt-qemu-notify-interval-0s)
                start interval=0s timeout=60s (fs-etc-libvirt-qemu-start-interval-0s)
                stop interval=0s timeout=60s (fs-etc-libvirt-qemu-stop-interval-0s)
 Resource: pool-10-37-165-129 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/pool-10-37-165-129.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true
  Utilization: cpu=2 hv_memory=1024
  Operations: migrate_from interval=0 timeout=120s (pool-10-37-165-129-migrate_from-interval-0)
              migrate_to interval=0 timeout=120s (pool-10-37-165-129-migrate_to-interval-0)
              monitor interval=10s timeout=30s (pool-10-37-165-129-monitor-interval-10s)
              start interval=0s timeout=90s (pool-10-37-165-129-start-interval-0s)
              stop interval=0s timeout=90s (pool-10-37-165-129-stop-interval-0s)
 Resource: pool-10-37-165-65 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/pool-10-37-165-65.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true
  Utilization: cpu=2 hv_memory=1024
  Operations: migrate_from interval=0 timeout=120s (pool-10-37-165-65-migrate_from-interval-0)
              migrate_to interval=0 timeout=120s (pool-10-37-165-65-migrate_to-interval-0)
              monitor interval=10s timeout=30s (pool-10-37-165-65-monitor-interval-10s)
              start interval=0s timeout=90s (pool-10-37-165-65-start-interval-0s)
              stop interval=0s timeout=90s (pool-10-37-165-65-stop-interval-0s)

Stonith Devices:
 Resource: fence-light-01 (class=stonith type=fence_ipmilan)
  Attributes: delay=5 ipaddr=light-01-ilo lanplus=0 login=admin passwd=admin pcmk_host_check=static-list pcmk_host_list=light-01.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-light-01-monitor-interval-60s)
 Resource: fence-light-03 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=light-03-ilo lanplus=0 login=admin passwd=admin pcmk_host_check=static-list pcmk_host_list=light-03.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-light-03-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: group-etc-libvirt-qemu-clone
    Enabled on: light-01.cluster-qe.lab.eng.brq.redhat.com (score:INFINITY) (id:location-group-etc-libvirt-qemu-clone-light-01.cluster-qe.lab.eng.brq.redhat.com-INFINITY)
    Enabled on: light-03.cluster-qe.lab.eng.brq.redhat.com (score:INFINITY) (id:location-group-etc-libvirt-qemu-clone-light-03.cluster-qe.lab.eng.brq.redhat.com-INFINITY)
    Disabled on: pool-10-37-165-129 (score:-INFINITY) (id:location-group-etc-libvirt-qemu-clone-pool-10-37-165-129--INFINITY)
    Disabled on: pool-10-37-165-65 (score:-INFINITY) (id:location-group-etc-libvirt-qemu-clone-pool-10-37-165-65--INFINITY)
  Resource: group-var-lib-libvirt-images-clone
    Enabled on: light-01.cluster-qe.lab.eng.brq.redhat.com (score:INFINITY) (id:location-group-var-lib-libvirt-images-clone-light-01.cluster-qe.lab.eng.brq.redhat.com-INFINITY)
    Enabled on: light-03.cluster-qe.lab.eng.brq.redhat.com (score:INFINITY) (id:location-group-var-lib-libvirt-images-clone-light-03.cluster-qe.lab.eng.brq.redhat.com-INFINITY)
    Disabled on: pool-10-37-165-129 (score:-INFINITY) (id:location-group-var-lib-libvirt-images-clone-pool-10-37-165-129--INFINITY)
    Disabled on: pool-10-37-165-65 (score:-INFINITY) (id:location-group-var-lib-libvirt-images-clone-pool-10-37-165-65--INFINITY)
  Resource: locking-clone
    Disabled on: pool-10-37-165-129 (score:-INFINITY) (id:location-locking-clone-pool-10-37-165-129--INFINITY)
    Disabled on: pool-10-37-165-65 (score:-INFINITY) (id:location-locking-clone-pool-10-37-165-65--INFINITY)
Ordering Constraints:
  start locking-clone then start group-var-lib-libvirt-images-clone (kind:Mandatory) (id:order-locking-clone-group-var-lib-libvirt-images-clone-mandatory)
  start locking-clone then start group-etc-libvirt-qemu-clone (kind:Mandatory) (id:order-locking-clone-group-etc-libvirt-qemu-clone-mandatory)
  start group-var-lib-libvirt-images-clone then start pool-10-37-165-129 (kind:Mandatory) (id:order-group-var-lib-libvirt-images-clone-pool-10-37-165-129-mandatory)
  start group-etc-libvirt-qemu-clone then start pool-10-37-165-129 (kind:Mandatory) (id:order-group-etc-libvirt-qemu-clone-pool-10-37-165-129-mandatory)
  start group-var-lib-libvirt-images-clone then start pool-10-37-165-65 (kind:Mandatory) (id:order-group-var-lib-libvirt-images-clone-pool-10-37-165-65-mandatory)
  start group-etc-libvirt-qemu-clone then start pool-10-37-165-65 (kind:Mandatory) (id:order-group-etc-libvirt-qemu-clone-pool-10-37-165-65-mandatory)
Colocation Constraints:
  group-var-lib-libvirt-images-clone with locking-clone (score:INFINITY) (id:colocation-group-var-lib-libvirt-images-clone-locking-clone-INFINITY)
  group-etc-libvirt-qemu-clone with locking-clone (score:INFINITY) (id:colocation-group-etc-libvirt-qemu-clone-locking-clone-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: 100
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS3983
 dc-version: 2.0.1-4.el8-0eb7991564
 have-watchdog: false
 no-quorum-policy: freeze

Quorum:
  Options:

> [2]: pcs resource config

 Clone: locking-clone
  Meta Attrs: interleave=true
  Group: locking
   Resource: dlm (class=ocf provider=pacemaker type=controld)
    Operations: monitor interval=30s (dlm-monitor-interval-30s)
                start interval=0s timeout=90s (dlm-start-interval-0s)
                stop interval=0s timeout=100s (dlm-stop-interval-0s)
   Resource: lvmlockd (class=ocf provider=heartbeat type=lvmlockd)
    Attributes: with_cmirrord=1
    Operations: monitor interval=30s (lvmlockd-monitor-interval-30s)
                start interval=0s timeout=90s (lvmlockd-start-interval-0s)
                stop interval=0s timeout=90s (lvmlockd-stop-interval-0s)
 Clone: group-var-lib-libvirt-images-clone
  Meta Attrs: clone-max=2 interleave=true ordered=true
  Group: group-var-lib-libvirt-images
   Resource: lv-var-lib-libvirt-images (class=ocf provider=heartbeat type=LVM-activate)
    Attributes: activation_mode=shared lvname=images0 vg_access_mode=lvmlockd vgname=shared
    Operations: monitor interval=30s timeout=90s (lv-var-lib-libvirt-images-monitor-interval-30s)
                start interval=0s timeout=90s (lv-var-lib-libvirt-images-start-interval-0s)
                stop interval=0s timeout=90s (lv-var-lib-libvirt-images-stop-interval-0s)
   Resource: fs-var-lib-libvirt-images (class=ocf provider=heartbeat type=Filesystem)
    Attributes: device=/dev/shared/images0 directory=/var/lib/libvirt/images fstype=gfs2 options=
    Operations: monitor interval=30s (fs-var-lib-libvirt-images-monitor-interval-30s)
                notify interval=0s timeout=60s (fs-var-lib-libvirt-images-notify-interval-0s)
                start interval=0s timeout=60s (fs-var-lib-libvirt-images-start-interval-0s)
                stop interval=0s timeout=60s (fs-var-lib-libvirt-images-stop-interval-0s)
 Clone: group-etc-libvirt-qemu-clone
  Meta Attrs: clone-max=2 interleave=true ordered=true
  Group: group-etc-libvirt-qemu
   Resource: vg-etc-libvirt-qemu (class=ocf provider=heartbeat type=LVM-activate)
    Attributes: activation_mode=shared lvname=etc0 vg_access_mode=lvmlockd vgname=shared
    Operations: monitor interval=30s timeout=90s (vg-etc-libvirt-qemu-monitor-interval-30s)
                start interval=0s timeout=90s (vg-etc-libvirt-qemu-start-interval-0s)
                stop interval=0s timeout=90s (vg-etc-libvirt-qemu-stop-interval-0s)
   Resource: fs-etc-libvirt-qemu (class=ocf provider=heartbeat type=Filesystem)
    Attributes: device=/dev/shared/etc0 directory=/etc/libvirt/qemu fstype=gfs2 options=
    Operations: monitor interval=30s (fs-etc-libvirt-qemu-monitor-interval-30s)
                notify interval=0s timeout=60s (fs-etc-libvirt-qemu-notify-interval-0s)
                start interval=0s timeout=60s (fs-etc-libvirt-qemu-start-interval-0s)
                stop interval=0s timeout=60s (fs-etc-libvirt-qemu-stop-interval-0s)
 Resource: pool-10-37-165-129 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/pool-10-37-165-129.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true
  Utilization: cpu=2 hv_memory=1024
  Operations: migrate_from interval=0 timeout=120s (pool-10-37-165-129-migrate_from-interval-0)
              migrate_to interval=0 timeout=120s (pool-10-37-165-129-migrate_to-interval-0)
              monitor interval=10s timeout=30s (pool-10-37-165-129-monitor-interval-10s)
              start interval=0s timeout=90s (pool-10-37-165-129-start-interval-0s)
              stop interval=0s timeout=90s (pool-10-37-165-129-stop-interval-0s)
 Resource: pool-10-37-165-65 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/pool-10-37-165-65.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true
  Utilization: cpu=2 hv_memory=1024
  Operations: migrate_from interval=0 timeout=120s (pool-10-37-165-65-migrate_from-interval-0)
              migrate_to interval=0 timeout=120s (pool-10-37-165-65-migrate_to-interval-0)
              monitor interval=10s timeout=30s (pool-10-37-165-65-monitor-interval-10s)
              start interval=0s timeout=90s (pool-10-37-165-65-start-interval-0s)
              stop interval=0s timeout=90s (pool-10-37-165-65-stop-interval-0s)

> [3]: crm_resource --wait --timeout=20

Pending actions:
        Action 92: pool-10-37-165-65_start_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 91: pool-10-37-165-65_stop_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 90: pool-10-37-165-129_start_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 89: pool-10-37-165-129_stop_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 80: fs-etc-libvirt-qemu:1_monitor_30000 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 79: fs-etc-libvirt-qemu:1_start_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 78: fs-etc-libvirt-qemu:1_stop_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 77: vg-etc-libvirt-qemu:1_monitor_30000 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 76: vg-etc-libvirt-qemu:1_start_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 75: vg-etc-libvirt-qemu:1_stop_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 70: fs-etc-libvirt-qemu:0_monitor_30000 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 69: fs-etc-libvirt-qemu:0_start_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 68: fs-etc-libvirt-qemu:0_stop_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 67: vg-etc-libvirt-qemu:0_monitor_30000 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 66: vg-etc-libvirt-qemu:0_start_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 65: vg-etc-libvirt-qemu:0_stop_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 56: fs-var-lib-libvirt-images:1_monitor_30000 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 55: fs-var-lib-libvirt-images:1_start_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 54: fs-var-lib-libvirt-images:1_stop_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 53: lv-var-lib-libvirt-images:1_monitor_30000 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 52: lv-var-lib-libvirt-images:1_start_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 51: lv-var-lib-libvirt-images:1_stop_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 46: fs-var-lib-libvirt-images:0_monitor_30000 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 45: fs-var-lib-libvirt-images:0_start_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 44: fs-var-lib-libvirt-images:0_stop_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 43: lv-var-lib-libvirt-images:0_monitor_30000 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 42: lv-var-lib-libvirt-images:0_start_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 41: lv-var-lib-libvirt-images:0_stop_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 16: pool-10-37-165-129_monitor_10000 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 8: pool-10-37-165-65_monitor_10000 on light-03.cluster-qe.lab.eng.brq.redhat.com
Error performing operation: Timer expired
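
The transition that --wait keeps blocking on can also be inspected without waiting for the timeout. A minimal sketch, assuming it is run on one of the cluster nodes against the live CIB:

  # Run the scheduler on the current cluster state and show the actions it
  # would schedule; on an affected cluster this keeps listing the same
  # stop/start/monitor actions as in [3] even though nothing has changed.
  crm_simulate --simulate --live-check

  # Equivalent short form:
  crm_simulate -SL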
It's actually not running happily; starting at Mar 12 16:03:37 in the logs (when the LVM-activate/Filesystem/VirtualDomain resources and their constraints are added), the resources are continuously restarting. :(

Removing the location constraints for the Filesystem resources seems to work around the problem. (They are equally scored location constraints for both nodes in a symmetric cluster, so they have no effect.)

I also see location constraints keeping various resources off the VirtualDomain resources. Those are not Pacemaker Remote nodes, so the constraints do not mean anything. However, those constraints aren't causing any problems.

There is a separate issue with the simulation (but not the cluster) thinking the fence devices need to be restarted. That might interfere with the --wait as well. This is a known issue that has not been investigated.

Can you try the workaround and see if it helps? We need to fix the underlying issues, but given how difficult it is to get anything into GA at this point, a workaround would be good to have.
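
A rough sketch of how the workaround could be applied with pcs, using the constraint IDs from the pcs config in [1] (verify the IDs with 'pcs constraint --full' before removing anything):

  # List all constraints with their IDs
  pcs constraint --full

  # Remove the INFINITY-scored "Enabled on" constraints for the two
  # clone groups that contain the Filesystem resources
  pcs constraint remove location-group-etc-libvirt-qemu-clone-light-01.cluster-qe.lab.eng.brq.redhat.com-INFINITY
  pcs constraint remove location-group-etc-libvirt-qemu-clone-light-03.cluster-qe.lab.eng.brq.redhat.com-INFINITY
  pcs constraint remove location-group-var-lib-libvirt-images-clone-light-01.cluster-qe.lab.eng.brq.redhat.com-INFINITY
  pcs constraint remove location-group-var-lib-libvirt-images-clone-light-03.cluster-qe.lab.eng.brq.redhat.com-INFINITY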
I can confirm that removing the positive location constraints for the filesystem resources works around the problem.
An update:

(In reply to Ken Gaillot from comment #1)
> It's actually not running happily; starting at Mar 12 16:03:37 in the logs
> (when the LVM-activate/Filesystem/VirtualDomain resources and their
> constraints are added), the resources are continuously restarting. :(

Looking at the logs more closely, I was off a bit: the configuration was being repeatedly changed during this time, so resources were starting and stopping appropriately. Problems actually start at Mar 12 16:32:27.

> Removing the location constraints for the Filesystem resources seems to work
> around the problem. (They are equally scored location constraints for both
> nodes in a symmetric cluster, so they have no effect.)

Changing the location constraints to have a score less than INFINITY also works around the problem.

Pacemaker assigns an instance number to clone instances on each node. What is going wrong here is that every time Pacemaker runs its scheduler, it assigns different instance numbers to the existing active instances compared to what it wants the final result to be, so it thinks the instances need to be moved. The cause for that still needs to be found and fixed.

> There is a separate issue with the simulation (but not the cluster) thinking
> the fence devices need to be restarted. That might interfere with the --wait
> as well. This is a known issue that has not been investigated.

As an aside, the simulation issue has been fixed, though the fix will not make it into RHEL 8.4. However, that issue does not affect --wait when used with a live cluster.
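
For illustration, a sketch of the finite-score variant of the workaround, assuming the INFINITY-scored constraints have already been removed as shown above (the score 500 is an arbitrary example, not a tested value):

  # Re-create the location preferences with a score lower than INFINITY
  pcs constraint location group-etc-libvirt-qemu-clone prefers light-01.cluster-qe.lab.eng.brq.redhat.com=500
  pcs constraint location group-etc-libvirt-qemu-clone prefers light-03.cluster-qe.lab.eng.brq.redhat.com=500
  pcs constraint location group-var-lib-libvirt-images-clone prefers light-01.cluster-qe.lab.eng.brq.redhat.com=500
  pcs constraint location group-var-lib-libvirt-images-clone prefers light-03.cluster-qe.lab.eng.brq.redhat.com=500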
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
(In reply to RHEL Program Management from comment #13)
> After evaluating this issue, there are no plans to address it further or fix
> it in an upcoming release. Therefore, it is being closed. If plans change
> such that this issue will be fixed in an upcoming release, then the bug can
> be reopened.

This is still a high priority and I am hopeful the fix will be in RHEL 8.5. Once we are further along in 8.5 release planning, we will likely reopen this.
This is fixed by upstream commit 018ad6d5.