Created attachment 1543573 [details]
'pcs cluster report' output

Description of problem:
In a two-node cluster with VirtualDomain resources and gfs2 filesystems [1] that seems to be running happily, I found that 'crm_resource --wait' never finishes. It shows a lot of pending actions [3] that never seem to complete. Once a monitor action times out, the virtual machine is shut down and started again (very undesirable). This also means that the cluster never settles.

Version-Release number of selected component (if applicable):
corosync-3.0.0-2.el8.x86_64
pacemaker-2.0.1-4.el8.x86_64
resource-agents-4.1.1-17.el8.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create a two-node cluster [1], [2] and observe the pending actions

Actual results:
cluster never settled, lots of pending actions

Expected results:
cluster settled

Additional info:

> [1]: pcs config

Cluster Name: STSRHTS3983
Corosync Nodes:
 light-01.cluster-qe.lab.eng.brq.redhat.com light-03.cluster-qe.lab.eng.brq.redhat.com
Pacemaker Nodes:
 light-01.cluster-qe.lab.eng.brq.redhat.com light-03.cluster-qe.lab.eng.brq.redhat.com

Resources:
 Clone: locking-clone
  Meta Attrs: interleave=true
  Group: locking
   Resource: dlm (class=ocf provider=pacemaker type=controld)
    Operations: monitor interval=30s (dlm-monitor-interval-30s)
                start interval=0s timeout=90s (dlm-start-interval-0s)
                stop interval=0s timeout=100s (dlm-stop-interval-0s)
   Resource: lvmlockd (class=ocf provider=heartbeat type=lvmlockd)
    Attributes: with_cmirrord=1
    Operations: monitor interval=30s (lvmlockd-monitor-interval-30s)
                start interval=0s timeout=90s (lvmlockd-start-interval-0s)
                stop interval=0s timeout=90s (lvmlockd-stop-interval-0s)
 Clone: group-var-lib-libvirt-images-clone
  Meta Attrs: clone-max=2 interleave=true ordered=true
  Group: group-var-lib-libvirt-images
   Resource: lv-var-lib-libvirt-images (class=ocf provider=heartbeat type=LVM-activate)
    Attributes: activation_mode=shared lvname=images0 vg_access_mode=lvmlockd vgname=shared
    Operations: monitor interval=30s timeout=90s (lv-var-lib-libvirt-images-monitor-interval-30s)
                start interval=0s timeout=90s (lv-var-lib-libvirt-images-start-interval-0s)
                stop interval=0s timeout=90s (lv-var-lib-libvirt-images-stop-interval-0s)
   Resource: fs-var-lib-libvirt-images (class=ocf provider=heartbeat type=Filesystem)
    Attributes: device=/dev/shared/images0 directory=/var/lib/libvirt/images fstype=gfs2 options=
    Operations: monitor interval=30s (fs-var-lib-libvirt-images-monitor-interval-30s)
                notify interval=0s timeout=60s (fs-var-lib-libvirt-images-notify-interval-0s)
                start interval=0s timeout=60s (fs-var-lib-libvirt-images-start-interval-0s)
                stop interval=0s timeout=60s (fs-var-lib-libvirt-images-stop-interval-0s)
 Clone: group-etc-libvirt-qemu-clone
  Meta Attrs: clone-max=2 interleave=true ordered=true
  Group: group-etc-libvirt-qemu
   Resource: vg-etc-libvirt-qemu (class=ocf provider=heartbeat type=LVM-activate)
    Attributes: activation_mode=shared lvname=etc0 vg_access_mode=lvmlockd vgname=shared
    Operations: monitor interval=30s timeout=90s (vg-etc-libvirt-qemu-monitor-interval-30s)
                start interval=0s timeout=90s (vg-etc-libvirt-qemu-start-interval-0s)
                stop interval=0s timeout=90s (vg-etc-libvirt-qemu-stop-interval-0s)
   Resource: fs-etc-libvirt-qemu (class=ocf provider=heartbeat type=Filesystem)
    Attributes: device=/dev/shared/etc0 directory=/etc/libvirt/qemu fstype=gfs2 options=
    Operations: monitor interval=30s (fs-etc-libvirt-qemu-monitor-interval-30s)
                notify interval=0s timeout=60s (fs-etc-libvirt-qemu-notify-interval-0s)
                start interval=0s timeout=60s (fs-etc-libvirt-qemu-start-interval-0s)
                stop interval=0s timeout=60s (fs-etc-libvirt-qemu-stop-interval-0s)
 Resource: pool-10-37-165-129 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/pool-10-37-165-129.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true
  Utilization: cpu=2 hv_memory=1024
  Operations: migrate_from interval=0 timeout=120s (pool-10-37-165-129-migrate_from-interval-0)
              migrate_to interval=0 timeout=120s (pool-10-37-165-129-migrate_to-interval-0)
              monitor interval=10s timeout=30s (pool-10-37-165-129-monitor-interval-10s)
              start interval=0s timeout=90s (pool-10-37-165-129-start-interval-0s)
              stop interval=0s timeout=90s (pool-10-37-165-129-stop-interval-0s)
 Resource: pool-10-37-165-65 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/pool-10-37-165-65.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true
  Utilization: cpu=2 hv_memory=1024
  Operations: migrate_from interval=0 timeout=120s (pool-10-37-165-65-migrate_from-interval-0)
              migrate_to interval=0 timeout=120s (pool-10-37-165-65-migrate_to-interval-0)
              monitor interval=10s timeout=30s (pool-10-37-165-65-monitor-interval-10s)
              start interval=0s timeout=90s (pool-10-37-165-65-start-interval-0s)
              stop interval=0s timeout=90s (pool-10-37-165-65-stop-interval-0s)

Stonith Devices:
 Resource: fence-light-01 (class=stonith type=fence_ipmilan)
  Attributes: delay=5 ipaddr=light-01-ilo lanplus=0 login=admin passwd=admin pcmk_host_check=static-list pcmk_host_list=light-01.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-light-01-monitor-interval-60s)
 Resource: fence-light-03 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=light-03-ilo lanplus=0 login=admin passwd=admin pcmk_host_check=static-list pcmk_host_list=light-03.cluster-qe.lab.eng.brq.redhat.com
  Operations: monitor interval=60s (fence-light-03-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: group-etc-libvirt-qemu-clone
    Enabled on: light-01.cluster-qe.lab.eng.brq.redhat.com (score:INFINITY) (id:location-group-etc-libvirt-qemu-clone-light-01.cluster-qe.lab.eng.brq.redhat.com-INFINITY)
    Enabled on: light-03.cluster-qe.lab.eng.brq.redhat.com (score:INFINITY) (id:location-group-etc-libvirt-qemu-clone-light-03.cluster-qe.lab.eng.brq.redhat.com-INFINITY)
    Disabled on: pool-10-37-165-129 (score:-INFINITY) (id:location-group-etc-libvirt-qemu-clone-pool-10-37-165-129--INFINITY)
    Disabled on: pool-10-37-165-65 (score:-INFINITY) (id:location-group-etc-libvirt-qemu-clone-pool-10-37-165-65--INFINITY)
  Resource: group-var-lib-libvirt-images-clone
    Enabled on: light-01.cluster-qe.lab.eng.brq.redhat.com (score:INFINITY) (id:location-group-var-lib-libvirt-images-clone-light-01.cluster-qe.lab.eng.brq.redhat.com-INFINITY)
    Enabled on: light-03.cluster-qe.lab.eng.brq.redhat.com (score:INFINITY) (id:location-group-var-lib-libvirt-images-clone-light-03.cluster-qe.lab.eng.brq.redhat.com-INFINITY)
    Disabled on: pool-10-37-165-129 (score:-INFINITY) (id:location-group-var-lib-libvirt-images-clone-pool-10-37-165-129--INFINITY)
    Disabled on: pool-10-37-165-65 (score:-INFINITY) (id:location-group-var-lib-libvirt-images-clone-pool-10-37-165-65--INFINITY)
  Resource: locking-clone
    Disabled on: pool-10-37-165-129 (score:-INFINITY) (id:location-locking-clone-pool-10-37-165-129--INFINITY)
    Disabled on: pool-10-37-165-65 (score:-INFINITY) (id:location-locking-clone-pool-10-37-165-65--INFINITY)
Ordering Constraints:
  start locking-clone then start group-var-lib-libvirt-images-clone (kind:Mandatory) (id:order-locking-clone-group-var-lib-libvirt-images-clone-mandatory)
  start locking-clone then start group-etc-libvirt-qemu-clone (kind:Mandatory) (id:order-locking-clone-group-etc-libvirt-qemu-clone-mandatory)
  start group-var-lib-libvirt-images-clone then start pool-10-37-165-129 (kind:Mandatory) (id:order-group-var-lib-libvirt-images-clone-pool-10-37-165-129-mandatory)
  start group-etc-libvirt-qemu-clone then start pool-10-37-165-129 (kind:Mandatory) (id:order-group-etc-libvirt-qemu-clone-pool-10-37-165-129-mandatory)
  start group-var-lib-libvirt-images-clone then start pool-10-37-165-65 (kind:Mandatory) (id:order-group-var-lib-libvirt-images-clone-pool-10-37-165-65-mandatory)
  start group-etc-libvirt-qemu-clone then start pool-10-37-165-65 (kind:Mandatory) (id:order-group-etc-libvirt-qemu-clone-pool-10-37-165-65-mandatory)
Colocation Constraints:
  group-var-lib-libvirt-images-clone with locking-clone (score:INFINITY) (id:colocation-group-var-lib-libvirt-images-clone-locking-clone-INFINITY)
  group-etc-libvirt-qemu-clone with locking-clone (score:INFINITY) (id:colocation-group-etc-libvirt-qemu-clone-locking-clone-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: 100
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: STSRHTS3983
 dc-version: 2.0.1-4.el8-0eb7991564
 have-watchdog: false
 no-quorum-policy: freeze

Quorum:
  Options:

> [2]: pcs resource config

 Clone: locking-clone
  Meta Attrs: interleave=true
  Group: locking
   Resource: dlm (class=ocf provider=pacemaker type=controld)
    Operations: monitor interval=30s (dlm-monitor-interval-30s)
                start interval=0s timeout=90s (dlm-start-interval-0s)
                stop interval=0s timeout=100s (dlm-stop-interval-0s)
   Resource: lvmlockd (class=ocf provider=heartbeat type=lvmlockd)
    Attributes: with_cmirrord=1
    Operations: monitor interval=30s (lvmlockd-monitor-interval-30s)
                start interval=0s timeout=90s (lvmlockd-start-interval-0s)
                stop interval=0s timeout=90s (lvmlockd-stop-interval-0s)
 Clone: group-var-lib-libvirt-images-clone
  Meta Attrs: clone-max=2 interleave=true ordered=true
  Group: group-var-lib-libvirt-images
   Resource: lv-var-lib-libvirt-images (class=ocf provider=heartbeat type=LVM-activate)
    Attributes: activation_mode=shared lvname=images0 vg_access_mode=lvmlockd vgname=shared
    Operations: monitor interval=30s timeout=90s (lv-var-lib-libvirt-images-monitor-interval-30s)
                start interval=0s timeout=90s (lv-var-lib-libvirt-images-start-interval-0s)
                stop interval=0s timeout=90s (lv-var-lib-libvirt-images-stop-interval-0s)
   Resource: fs-var-lib-libvirt-images (class=ocf provider=heartbeat type=Filesystem)
    Attributes: device=/dev/shared/images0 directory=/var/lib/libvirt/images fstype=gfs2 options=
    Operations: monitor interval=30s (fs-var-lib-libvirt-images-monitor-interval-30s)
                notify interval=0s timeout=60s (fs-var-lib-libvirt-images-notify-interval-0s)
                start interval=0s timeout=60s (fs-var-lib-libvirt-images-start-interval-0s)
                stop interval=0s timeout=60s (fs-var-lib-libvirt-images-stop-interval-0s)
 Clone: group-etc-libvirt-qemu-clone
  Meta Attrs: clone-max=2 interleave=true ordered=true
  Group: group-etc-libvirt-qemu
   Resource: vg-etc-libvirt-qemu (class=ocf provider=heartbeat type=LVM-activate)
    Attributes: activation_mode=shared lvname=etc0 vg_access_mode=lvmlockd vgname=shared
    Operations: monitor interval=30s timeout=90s (vg-etc-libvirt-qemu-monitor-interval-30s)
                start interval=0s timeout=90s (vg-etc-libvirt-qemu-start-interval-0s)
                stop interval=0s timeout=90s (vg-etc-libvirt-qemu-stop-interval-0s)
   Resource: fs-etc-libvirt-qemu (class=ocf provider=heartbeat type=Filesystem)
    Attributes: device=/dev/shared/etc0 directory=/etc/libvirt/qemu fstype=gfs2 options=
    Operations: monitor interval=30s (fs-etc-libvirt-qemu-monitor-interval-30s)
                notify interval=0s timeout=60s (fs-etc-libvirt-qemu-notify-interval-0s)
                start interval=0s timeout=60s (fs-etc-libvirt-qemu-start-interval-0s)
                stop interval=0s timeout=60s (fs-etc-libvirt-qemu-stop-interval-0s)
 Resource: pool-10-37-165-129 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/pool-10-37-165-129.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true
  Utilization: cpu=2 hv_memory=1024
  Operations: migrate_from interval=0 timeout=120s (pool-10-37-165-129-migrate_from-interval-0)
              migrate_to interval=0 timeout=120s (pool-10-37-165-129-migrate_to-interval-0)
              monitor interval=10s timeout=30s (pool-10-37-165-129-monitor-interval-10s)
              start interval=0s timeout=90s (pool-10-37-165-129-start-interval-0s)
              stop interval=0s timeout=90s (pool-10-37-165-129-stop-interval-0s)
 Resource: pool-10-37-165-65 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/pool-10-37-165-65.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true
  Utilization: cpu=2 hv_memory=1024
  Operations: migrate_from interval=0 timeout=120s (pool-10-37-165-65-migrate_from-interval-0)
              migrate_to interval=0 timeout=120s (pool-10-37-165-65-migrate_to-interval-0)
              monitor interval=10s timeout=30s (pool-10-37-165-65-monitor-interval-10s)
              start interval=0s timeout=90s (pool-10-37-165-65-start-interval-0s)
              stop interval=0s timeout=90s (pool-10-37-165-65-stop-interval-0s)

> [3]: crm_resource --wait --timeout=20

Pending actions:
        Action 92: pool-10-37-165-65_start_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 91: pool-10-37-165-65_stop_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 90: pool-10-37-165-129_start_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 89: pool-10-37-165-129_stop_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 80: fs-etc-libvirt-qemu:1_monitor_30000 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 79: fs-etc-libvirt-qemu:1_start_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 78: fs-etc-libvirt-qemu:1_stop_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 77: vg-etc-libvirt-qemu:1_monitor_30000 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 76: vg-etc-libvirt-qemu:1_start_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 75: vg-etc-libvirt-qemu:1_stop_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 70: fs-etc-libvirt-qemu:0_monitor_30000 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 69: fs-etc-libvirt-qemu:0_start_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 68: fs-etc-libvirt-qemu:0_stop_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 67: vg-etc-libvirt-qemu:0_monitor_30000 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 66: vg-etc-libvirt-qemu:0_start_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 65: vg-etc-libvirt-qemu:0_stop_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 56: fs-var-lib-libvirt-images:1_monitor_30000 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 55: fs-var-lib-libvirt-images:1_start_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 54: fs-var-lib-libvirt-images:1_stop_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 53: lv-var-lib-libvirt-images:1_monitor_30000 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 52: lv-var-lib-libvirt-images:1_start_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 51: lv-var-lib-libvirt-images:1_stop_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 46: fs-var-lib-libvirt-images:0_monitor_30000 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 45: fs-var-lib-libvirt-images:0_start_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 44: fs-var-lib-libvirt-images:0_stop_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 43: lv-var-lib-libvirt-images:0_monitor_30000 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 42: lv-var-lib-libvirt-images:0_start_0 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 41: lv-var-lib-libvirt-images:0_stop_0 on light-03.cluster-qe.lab.eng.brq.redhat.com
        Action 16: pool-10-37-165-129_monitor_10000 on light-01.cluster-qe.lab.eng.brq.redhat.com
        Action 8: pool-10-37-165-65_monitor_10000 on light-03.cluster-qe.lab.eng.brq.redhat.com
Error performing operation: Timer expired
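
The transition that --wait keeps blocking on can also be inspected without waiting for the timeout. A minimal sketch, assuming it is run on one of the cluster nodes against the live CIB:

  # Run the scheduler on the current cluster state and show the actions it
  # would schedule; on an affected cluster this keeps listing the same
  # stop/start/monitor actions as in [3] even though nothing has changed.
  crm_simulate --simulate --live-check

  # Equivalent short form:
  crm_simulate -SL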
It's actually not running happily; starting at Mar 12 16:03:37 in the logs (when the LVM-activate/Filesystem/VirtualDomain resources and their constraints are added), the resources are continuously restarting. :(

Removing the location constraints for the Filesystem resources seems to work around the problem. (They are equally scored location constraints for both nodes in a symmetric cluster, so they have no effect.)

I also see location constraints keeping various resources off the VirtualDomain resources. Those are not Pacemaker Remote nodes, so the constraints do not mean anything. However, those constraints aren't causing any problems.

There is a separate issue with the simulation (but not the cluster) thinking the fence devices need to be restarted. That might interfere with the --wait as well. This is a known issue that has not been investigated.

Can you try the workaround and see if it helps? We need to fix the underlying issues, but given how difficult it is to get anything into GA at this point, a workaround would be good to have.
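
A rough sketch of how the workaround could be applied with pcs, using the constraint IDs from the pcs config in [1] (verify the IDs with 'pcs constraint --full' before removing anything):

  # List all constraints with their IDs
  pcs constraint --full

  # Remove the INFINITY-scored "Enabled on" constraints for the two
  # clone groups that contain the Filesystem resources
  pcs constraint remove location-group-etc-libvirt-qemu-clone-light-01.cluster-qe.lab.eng.brq.redhat.com-INFINITY
  pcs constraint remove location-group-etc-libvirt-qemu-clone-light-03.cluster-qe.lab.eng.brq.redhat.com-INFINITY
  pcs constraint remove location-group-var-lib-libvirt-images-clone-light-01.cluster-qe.lab.eng.brq.redhat.com-INFINITY
  pcs constraint remove location-group-var-lib-libvirt-images-clone-light-03.cluster-qe.lab.eng.brq.redhat.com-INFINITY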
I can confirm that removing the positive location constraints for the filesystem resources works around the problem.
An update:

(In reply to Ken Gaillot from comment #1)
> It's actually not running happily; starting at Mar 12 16:03:37 in the logs
> (when the LVM-activate/Filesystem/VirtualDomain resources and their
> constraints are added), the resources are continuously restarting. :(

Looking at the logs more closely, I was off a bit: the configuration was being repeatedly changed during this time, so resources were starting and stopping appropriately. Problems actually start at Mar 12 16:32:27.

> Removing the location constraints for the Filesystem resources seems to work
> around the problem. (They are equally scored location constraints for both
> nodes in a symmetric cluster, so they have no effect.)

Changing the location constraints to have a score less than INFINITY also works around the problem.

Pacemaker assigns an instance number to clone instances on each node. What is going wrong here is that every time Pacemaker runs its scheduler, it assigns different instance numbers to the existing active instances compared to what it wants the final result to be, so it thinks the instances need to be moved. The cause for that still needs to be found and fixed.

> There is a separate issue with the simulation (but not the cluster) thinking
> the fence devices need to be restarted. That might interfere with the --wait
> as well. This is a known issue that has not been investigated.

As an aside, the simulation issue has been fixed, though the fix will not make it into RHEL 8.4. However, that issue does not affect --wait when used with a live cluster.
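
For illustration, a sketch of the finite-score variant of the workaround, assuming the INFINITY-scored constraints have already been removed as shown above (the score 500 is an arbitrary example, not a tested value):

  # Re-create the location preferences with a score lower than INFINITY
  pcs constraint location group-etc-libvirt-qemu-clone prefers light-01.cluster-qe.lab.eng.brq.redhat.com=500
  pcs constraint location group-etc-libvirt-qemu-clone prefers light-03.cluster-qe.lab.eng.brq.redhat.com=500
  pcs constraint location group-var-lib-libvirt-images-clone prefers light-01.cluster-qe.lab.eng.brq.redhat.com=500
  pcs constraint location group-var-lib-libvirt-images-clone prefers light-03.cluster-qe.lab.eng.brq.redhat.com=500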
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
(In reply to RHEL Program Management from comment #13)
> After evaluating this issue, there are no plans to address it further or fix
> it in an upcoming release. Therefore, it is being closed. If plans change
> such that this issue will be fixed in an upcoming release, then the bug can
> be reopened.

This is still a high priority and I am hopeful the fix will be in RHEL 8.5. Once we are further along in 8.5 release planning, we will likely reopen this.
This is fixed by upstream commit 018ad6d5.