Description of problem:

Hi. With VMs stored on a GlusterFS volume via a mount point, live migration works okay, but it turns out to be a problem when the system reboots or shuts down - then the cluster seems to fail to live-migrate the VMs and it takes a long time to stop, thus delaying the system's reboot/shutdown significantly.

many thanks, L.

Version-Release number of selected component (if applicable):
pacemaker-2.1.2-4.el9.x86_64
pacemaker-cli-2.1.2-4.el9.x86_64
pacemaker-cluster-libs-2.1.2-4.el9.x86_64
pacemaker-libs-2.1.2-4.el9.x86_64
pacemaker-schemas-2.1.2-4.el9.noarch
pcs-0.11.1-10.el9.x86_64
corosync-3.1.5-3.el9.x86_64
corosynclib-3.1.5-3.el9.x86_64

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Hi, Can you provide more information, such as steps to reproduce, actual and expected results, your configuration and logs? Thanks.
What information do you need? I believe this should be easily reproducible from the info I've put in. Grab yourself a CentOS 9 and the binaries/packages needed - GlusterFS ver. 10 from here: https://www.ovirt.org/develop/dev-process/install-nightly-snapshot.html - (though the gluster guys say ver. 10 should land in EPEL soon).

Two boxes (VMs) should do:
-> set up a Gluster volume, then mount it (via 'fstab' should do)
-> store your VM's image there (have your VM resource however you have it in the HA cluster, and point the dom XML definition's 'source file=' at the GlusterFS mountpoint, e.g. /VMs)
-> test that live migration works (which it should)
-> reboot (to see the issue)

All this - GlusterFS and HA/pacemaker - can be on the same two nodes (the minimum to set up a GF volume).

A resource:

-> $ pcs resource config c8kubermaster1
Resource: c8kubermaster1 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml hypervisor=qemu:///system migrate_options=--unsafe migration_transport=ssh
  Meta Attrs: allow-migrate=true failure-timeout=30s
  Operations: migrate_from interval=0s timeout=120s (c8kubermaster1-migrate_from-interval-0s)
              migrate_to interval=0s timeout=120s (c8kubermaster1-migrate_to-interval-0s)
              monitor interval=30s (c8kubermaster1-monitor-interval-30s)
              start interval=0s timeout=60s (c8kubermaster1-start-interval-0s)
              stop interval=0s timeout=60s (c8kubermaster1-stop-interval-0

If _no_ VirtualDomain resource is up & running (or in any other state), then the HA cluster stops at shutdown/reboot and such a reboot/shutdown performs as expected.
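A minimal sketch of the storage side of this setup (the hostname, volume name, and image file name below are placeholders, not my actual values):

# /etc/fstab - mount the GlusterFS volume on every cluster node
node1:/VMs  /VMs  glusterfs  defaults,_netdev  0 0

The domain XML's disk definition then simply points at the mounted path, e.g. <source file='/VMs/c8kubermaster1.qcow2'/>.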
This seems to be either a pacemaker or a glusterfs issue. I'm moving this to pacemaker for further investigation, even though it may land on glusterfs.
Hi, can you attach the result of pcs cluster report --from "YYYY-M-D H:M:S" --to "YYYY-M-D H:M:S" from each node, covering the time of interest (i.e. a few minutes before the reboot was initiated to a few minutes after the node comes back up)?
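For example, something along these lines on each node (the timestamps are placeholders; pcs writes the resulting tarball to the destination given as the last argument):

pcs cluster report --from "2022-2-19 10:45:00" --to "2022-2-19 11:15:00" /tmp/$(hostname -s)-report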
Created attachment 1862050 [details] pcs report
Perhaps pcs/pacemaker cannot handle any case where the VM image is stored on a network-mounted filesystem path at all(?) - i.e. not specific to GlusterFS. I have now tested a mount point to an NFS export - still via 'fstab' - and the problem remains the same.
There are a couple of issues combining here.

First, the VM migrations are timing out, for example:

Feb 19 11:00:53.631 whale.mine.private pacemaker-controld [4375] (process_graph_event) notice: Transition 111 action 21 (c8kubernode1_migrate_to_0 on dzien): expected 'ok' but got 'error' | target-rc=0 rc=1 call-id=185

Second, the VMs have failure-timeout set to 30 seconds. This means that the failures are expiring while the migrations are timing out, causing the migrations to be repeatedly scheduled, timing out each time, until the shutdown timer pops and the cluster simply exits.

I think the failure-timeout should be longer than the action timeouts, otherwise there's the risk of this situation where an action is attempted indefinitely.

There are a lot of messages like these from the VirtualDomain resource agent that appear to be related to why the migration is timing out:

Feb 19 11:00:51 dzien VirtualDomain(c8kubernode1)[4090944]: INFO: Virtual domain c8kubernode1 currently has no state, retrying.

but I'm not familiar enough with the agent to know what problems that might indicate. If you can't figure it out from here, we can reassign to resource-agents for further investigation.
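To make the failure-timeout point concrete, a hedged example (the 300s value is arbitrary, chosen only to exceed the 120s migrate_to/migrate_from timeouts shown in the resource config earlier):

pcs resource update c8kubermaster1 meta failure-timeout=300s

With the failure no longer expiring while the migration is still in flight, the cluster can recover the VM with a full stop/start instead of re-scheduling the live migration over and over.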
That would be weird, as those 'timeouts' worked perfectly fine prior to the 'libgfapi' removal. But I also do this: I reboot/shutdown with the help of a simple script:

...
pcs node standby ${HOSTNAME%%.*}
while [ $(virsh list --id | wc -l) -ne 1 ]; do
  echo -e \\twaiting for VMs to migrate over
  sleep 5s
done
/sbin/poweroff ${@}
exit 0

It is meant to evacuate all the resources from the rebooting node and, only when no VM is running on the node (so VirtualDomain must play NO role), actually proceed with the 'reboot' - and... still! the reboot takes ages...

During the 'reboot' the other nodes see:

-> $ pcs status --full
...
Node List:
  * Node whale (3): standby (with active resources)
  * Online: [ dzien (1) swir (2) ]
...
PCSD Status:
  dzien: Online
  swir: Online
  whale: Offline

and! yet (whale is the 'rebooting' node, where shutdown is actually happening after 'evacuation' took place):

-> $ pcs resource status | grep -v disable
  * c8kubermaster1 (ocf:heartbeat:VirtualDomain): Started swir
  * vpn (ocf:heartbeat:VirtualDomain): Started whale
...
  * c8kubernode2 (ocf:heartbeat:VirtualDomain): Started whale
  * c8kubernode1 (ocf:heartbeat:VirtualDomain): Started swir
...
  * Clone Set: GWlink-clone [GWlink]:
    * Started: [ dzien swir ]
    * Stopped: [ whale ]
  * vpn1 (ocf:heartbeat:VirtualDomain): Started swir
  * ubusrv1 (ocf:heartbeat:VirtualDomain): Started dzien
  * ubusrv3 (ocf:heartbeat:VirtualDomain): Started dzien
  * ubusrv2 (ocf:heartbeat:VirtualDomain): Started swir
  * c8kubermaster2 (ocf:heartbeat:VirtualDomain): Started whale
  * c8kubernode3 (ocf:heartbeat:VirtualDomain): Started dzien
  * ovpn-to-ionos (systemd:openvpn-client): Started swir

and.. (still, the node is rebooting):

-> $ pcs resource status | grep -v disable
  * c8kubermaster1 (ocf:heartbeat:VirtualDomain): Started swir
  * vpn (ocf:heartbeat:VirtualDomain): Migrating whale
  * gorn (ocf:heartbeat:VirtualDomain): Migrating whale
  * ayan (ocf:heartbeat:VirtualDomain): Migrating whale
  * c8kubernode2 (ocf:heartbeat:VirtualDomain): Migrating whale
  * c8kubernode1 (ocf:heartbeat:VirtualDomain): Started swir

It seems that the cluster wants to migrate those back to the node which is rebooting, but which is also in 'standby'?

I have some negative colocation constraints, but only '-100' - and regardless of any constraints, the node is offline and in standby, so why?

many thanks, L.
(In reply to lejeczek from comment #8)
> That would be weird as those 'timeouts' worked perfectly fine prior to
> 'libgfapi' removal.

Right -- the initial timeout is a symptom of whatever's going wrong, not the cause, but the failure-timeout value is making the problem worse by causing the cluster to repeatedly retry live migration rather than a full stop and start.

> But I also do this, I reboot/shutdown with help from a simple script
> ...
> pcs node standby ${HOSTNAME%%.*}
> while [ $(virsh list --id | wc -l) -ne 1 ]; do
>   echo -e \\twaiting for VMs to migrate over
>   sleep 5s
> done
> /sbin/poweroff ${@}
> exit 0

That loop waits until the VMs are gone from libvirt's point of view, but from the cluster's point of view the migrations are timing out, so the cluster cannot assume the VMs are really gone.

FYI pcs node standby has a --wait option that will wait until the cluster "settles", meaning no further actions are required. That should do what you want. You can also put a time limit on the waiting. You might also want to do "pcs cluster stop" before poweroff just to be safe.

> It is meant to evacuate all the resources from the rebooting node and only
> when no VM is running(so VirtualDomain must play NO role) on the node,

The VMs themselves are no longer playing a role, but the live migration (of nothing, but the cluster doesn't know that) repeatedly timing out is playing a role.

> actually proceed to the 'reboot' and... still! reboot takes ages...

Yep, that's the stop of the pacemaker service, which will repeatedly try to finish the live migration first but can't. Raising the failure-timeout will allow the cluster to try to stop it instead -- either that will immediately succeed (which I expect it would), and the cluster can stop on the node, or it will fail, and the rest of the cluster will fence the node. Either way, the cluster will be able to recover from whatever is causing the live migration timeout.

> During the 'reboot' other nodes see:
> -> $ pcs status --full
> ...
> Node List:
>   * Node whale (3): standby (with active resources)
>   * Online: [ dzien (1) swir (2) ]
> ...
> PCSD Status:
>   dzien: Online
>   swir: Online
>   whale: Offline
>
> and! yet: (whale is the 'rebooting' node, where shutdown is actually
> happening after 'evacuation' took place)
>
> -> $ pcs resource status | grep -v disable
>   * c8kubermaster1 (ocf:heartbeat:VirtualDomain): Started swir
>   * vpn (ocf:heartbeat:VirtualDomain): Started whale
> ...
>   * c8kubernode2 (ocf:heartbeat:VirtualDomain): Started whale
>   * c8kubernode1 (ocf:heartbeat:VirtualDomain): Started swir
> ...
>   * Clone Set: GWlink-clone [GWlink]:
>     * Started: [ dzien swir ]
>     * Stopped: [ whale ]
>   * vpn1 (ocf:heartbeat:VirtualDomain): Started swir
>   * ubusrv1 (ocf:heartbeat:VirtualDomain): Started dzien
>   * ubusrv3 (ocf:heartbeat:VirtualDomain): Started dzien
>   * ubusrv2 (ocf:heartbeat:VirtualDomain): Started swir
>   * c8kubermaster2 (ocf:heartbeat:VirtualDomain): Started whale
>   * c8kubernode3 (ocf:heartbeat:VirtualDomain): Started dzien
>   * ovpn-to-ionos (systemd:openvpn-client): Started swir
>
> and..
> (still node is rebooting)
>
> -> $ pcs resource status | grep -v disable
>   * c8kubermaster1 (ocf:heartbeat:VirtualDomain): Started swir
>   * vpn (ocf:heartbeat:VirtualDomain): Migrating whale
>   * gorn (ocf:heartbeat:VirtualDomain): Migrating whale
>   * ayan (ocf:heartbeat:VirtualDomain): Migrating whale
>   * c8kubernode2 (ocf:heartbeat:VirtualDomain): Migrating whale
>   * c8kubernode1 (ocf:heartbeat:VirtualDomain): Started swir
>
> seems that cluster wants to migrate those back to the node which is
> rebooting but also is 'stoodby' ?

It's still migrating them away (or at least thinks it is).

> I have some negative colocation constraints but only '-100' but regardless
> of any constraints - node is offline and stoodby, then why?
>
> many thanks, L.

To recap, there are two issues: first, why the live migration is timing out (which I see no clues for), and second, why the cluster gets stuck trying to repeat the live migration (which raising the failure-timeout should handle). I'm reassigning this bz to resource-agents to try to help debug the first issue. The next step would be to figure out if the issue is in the resource agent or at a lower level like libvirt.
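Putting the standby/stop suggestions above together, the shutdown helper might be reduced to something like this (a sketch; the 600-second wait limit is an arbitrary example):

pcs node standby ${HOSTNAME%%.*} --wait=600   # wait until the cluster settles (or 10 minutes pass)
pcs cluster stop                              # stop pacemaker/corosync cleanly before powering off
/sbin/poweroff ${@}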
If you update it with `pcs resource update c8kubernode2 trace_ra=1`, you should be able to see the full command it runs in /var/lib/heartbeat/trace_ra/. After that, you can run it manually to see what actually happens. You might want to disable the resource and start it manually by doing "pcs resource debug-start c8kubernode2" before running the migrate command from the CLI.
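Spelled out, that debugging workflow would look roughly like this (the exact layout under /var/lib/heartbeat/trace_ra/ can vary by resource-agents version):

pcs resource update c8kubernode2 trace_ra=1
# reproduce the failing migration, then look at the newest trace file:
ls -lt /var/lib/heartbeat/trace_ra/
# optionally take the VM out of cluster control and drive it by hand:
pcs resource disable c8kubernode2
pcs resource debug-start c8kubernode2
# ...then run the virsh migrate command found in the trace file manually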
I want to revisit the issue if possible. I have been fiddling with it for the last few days, whereas before that I had a workaround in place the whole time.

So, now: the VMs involved have no constraints whatsoever and the cluster still fails to migrate them live during reboot/shutdown. Also, I moved away from ssh and now use tls - to make sure sshd is not the culprit here. Also! putting the node into 'standby' migrates the VMs away live, as expected - so! - this is only a reboot/shutdown issue.

-> $ pcs resource config c8kubernode3
Resource: c8kubernode3 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: c8kubernode3-instance_attributes
    config=/etc/libvirt/qemu/pacemaker.d/c8kubernode3.xml
    hypervisor=qemu:///system
    migrate_options=--unsafe
    remoteuri=qemu+tls://%n.services.internal/system
    trace_ra=1
  Meta Attributes: c8kubernode3-meta_attributes
    allow-migrate=true
  Utilization: c8kubernode3-utilization
    cpu=2
    host_memory=8192
    hv_memory=8192
  Operations:
    migrate_from: c8kubernode3-migrate_from-interval-0s
      interval=0s timeout=1h
    migrate_to: c8kubernode3-migrate_to-interval-0s
      interval=0s timeout=1h
    monitor: c8kubernode3-monitor-interval-10s
      interval=10s timeout=30s
    start: c8kubernode3-start-interval-0s
      interval=0s timeout=90s
    stop: c8kubernode3-stop-interval-0s
      interval=0s timeout=90s

and from the RA's trace log, with standby, with success:

...
+ 10:18:00: __ha_log:250: echo 'VirtualDomain(c8kubernode3)[47952]: Apr' 18 10:18:00 'INFO: c8kubernode3: Starting live migration to dzien (using: virsh --connect=qemu:///system --quiet migrate --live --unsafe c8kubernode3 qemu+tls://dzien.services.internal/system ).'
+ 10:18:00: VirtualDomain_migrate_to:1015: migrate_pid=48601
+ 10:18:00: VirtualDomain_migrate_to:1013: virsh --connect=qemu:///system --quiet migrate --live --unsafe c8kubernode3 qemu+tls://dzien.services.internal/system
+ 10:18:00: VirtualDomain_migrate_to:1019: '[' 0 -ne 0 ']'
+ 10:18:00: VirtualDomain_migrate_to:1025: wait 48601
+ 10:19:16: VirtualDomain_migrate_to:1027: rc=0
+ 10:19:16: VirtualDomain_migrate_to:1028: '[' 0 -ne 0 ']'
+ 10:19:16: VirtualDomain_migrate_to:1032: ocf_log info 'c8kubernode3: live migration to dzien succeeded.'

with reboot, with failure:

...
+ 09:41:51: __ha_log:250: echo 'VirtualDomain(c8kubernode3)[35495]: Apr' 18 09:41:51 'INFO: c8kubernode3: Starting live migration to dzien (using: virsh --connect=qemu:///system --quiet migrate --live --unsafe c8kubernode3 qemu+tls://dzien.services.internal/system ).'
+ 09:41:51: VirtualDomain_migrate_to:1015: migrate_pid=36144
+ 09:41:51: VirtualDomain_migrate_to:1013: virsh --connect=qemu:///system --quiet migrate --live --unsafe c8kubernode3 qemu+tls://dzien.services.internal/system
+ 09:41:51: VirtualDomain_migrate_to:1019: '[' 0 -ne 0 ']'
+ 09:41:51: VirtualDomain_migrate_to:1025: wait 36144
+ 09:44:41: VirtualDomain_migrate_to:1027: rc=1
+ 09:44:41: VirtualDomain_migrate_to:1028: '[' 1 -ne 0 ']'
+ 09:44:41: VirtualDomain_migrate_to:1029: ocf_exit_reason 'c8kubernode3: live migration to dzien failed: 1'

So it appears there is not much to tell 'why'. On the receiving node, too - with default log levels - there is nothing to clearly explain what happens.

'virtqemud.service' logs:
...
migration successfully aborted

and from 'pacemaker.service' on the receiving node:
...
notice: Setting last-failure-mariadb#monitor_10000[swir]: (unset) -> 1681808926
notice: Transition 60 action 27 (mariadb_monitor_10000 on swir): expected 'promoted' but got 'not running'
notice: Setting fail-count-mariadb#monitor_10000[swir]: (unset) -> 1
notice: State transition S_IDLE -> S_POLICY_ENGINE
warning: Unexpected result (not running) was recorded for monitor of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: Actions: Recover mariadb:0 ( Promoted swir )
notice: Calculated transition 71, saving inputs in /var/lib/pacemaker/pengine/pe-input-2933.bz2
notice: Initiating demote operation mariadb_demote_0 on swir
notice: Transition 71 aborted by status-1-last-failure-mariadb.monitor_10000 doing create last-failure-mariadb#monitor_10000=1681808926: Transient attribute change
notice: Setting master-mariadb[swir]: 100 -> (unset)
notice: Setting mariadb-safe-to-bootstrap[swir]: (unset) -> 0
notice: Setting mariadb-last-committed[swir]: (unset) -> 2152
notice: Transition 71 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=9, Source=/var/lib/pacemaker/pengine/pe-input-2933.bz2): Stopped
warning: Unexpected result (not running) was recorded for monitor of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: Actions: Recover mariadb:0 ( Unpromoted swir )
notice: Calculated transition 72, saving inputs in /var/lib/pacemaker/pengine/pe-input-2934.bz2
notice: Initiating stop operation mariadb_stop_0 on swir
notice: Setting mariadb-safe-to-bootstrap[swir]: 0 -> (unset)
notice: Transition 72 aborted by deletion of nvpair[@id='status-1-mariadb-safe-to-bootstrap']: Transient attribute change
notice: Setting mariadb-last-committed[swir]: 2152 -> (unset)
notice: Transition 72 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=4, Source=/var/lib/pacemaker/pengine/pe-input-2934.bz2): Stopped
warning: Unexpected result (not running) was recorded for monitor of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: Actions: Start mariadb:2 ( swir )
notice: Calculated transition 73, saving inputs in /var/lib/pacemaker/pengine/pe-input-2935.bz2
notice: Initiating start operation mariadb:2_start_0 on swir
notice: Setting mariadb-safe-to-bootstrap[swir]: (unset) -> 0
notice: Transition 73 aborted by status-1-mariadb-safe-to-bootstrap doing create mariadb-safe-to-bootstrap=0: Transient attribute change
notice: Setting mariadb-last-committed[swir]: (unset) -> 2152
notice: Setting master-mariadb[swir]: (unset) -> 100
notice: Transition 73 (Complete=3, Pending=0, Fired=0, Skipped=2, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-2935.bz2): Stopped
warning: Unexpected result (not running) was recorded for monitor of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: Actions: Promote mariadb:0 ( Unpromoted -> Promoted swir )
notice: Calculated transition 74, saving inputs in /var/lib/pacemaker/pengine/pe-input-2936.bz2
notice: Initiating promote operation mariadb_promote_0 on swir
notice: Setting mariadb-last-committed[swir]: 2152 -> (unset)
notice: Transition 74 aborted by deletion of nvpair[@id='status-1-mariadb-last-committed']: Transient attribute change
notice: Setting mariadb-safe-to-bootstrap[swir]: 0 -> (unset)
notice: Setting shutdown[swir]: (unset) -> 1681808927
notice: Transition 74 action 24 (mariadb_promote_0 on swir): expected 'ok' but got 'error'
notice: Setting last-failure-mariadb#promote_0[swir]: (unset) -> 1681808961
notice: Setting fail-count-mariadb#promote_0[swir]: (unset) -> 1
notice: Transition 74 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-2936.bz2): Complete
notice: Clearing failure of mariadb:0 on swir because it expired
notice: Clearing failure of mariadb:0 on swir because it expired
notice: Ignoring expired mariadb_promote_0 failure on swir
notice: Clearing failure of mariadb:0 on swir because it expired
notice: Ignoring expired mariadb_promote_0 failure on swir
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: * Shutdown swir
notice: Actions: Stop mariadb:0 ( Promoted swir ) due to node availability
notice: Actions: Migrate c8kubernode3 ( swir -> dzien )
notice: Calculated transition 75, saving inputs in /var/lib/pacemaker/pengine/pe-input-2937.bz2
warning: Unexpected result (error: MySQL server failed to start (pid=35483) (rc=0), please check your installation) was recorded for promote of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error: MySQL server failed to start (pid=35483) (rc=0), please check your installation) was recorded for promote of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: * Shutdown swir
notice: Actions: Stop mariadb:0 ( Promoted swir ) due to node availability
notice: Actions: Migrate c8kubernode3 ( swir -> dzien )
notice: Calculated transition 76, saving inputs in /var/lib/pacemaker/pengine/pe-input-2938.bz2
notice: Initiating migrate_to operation c8kubernode3_migrate_to_0 on swir
notice: Initiating demote operation mariadb_demote_0 on swir
notice: Setting master-mariadb[swir]: 100 -> (unset)
notice: Transition 76 aborted by deletion of nvpair[@id='status-1-master-mariadb']: Transient attribute change
notice: Setting mariadb-safe-to-bootstrap[swir]: (unset) -> 0
notice: Setting mariadb-last-committed[swir]: (unset) -> 2152
notice: High CPU load detected: 35.560001 gateway-link-clone
notice: Transition 76 action 41 (c8kubernode3_migrate_to_0 on swir): expected 'ok' but got 'error'
notice: Transition 76 (Complete=5, Pending=0, Fired=0, Skipped=1, Incomplete=7, Source=/var/lib/pacemaker/pengine/pe-input-2938.bz2): Stopped
notice: Clearing failure of mariadb:0 on swir because it expired
notice: Clearing failure of mariadb:0 on swir because it expired
notice: Ignoring expired mariadb_promote_0 failure on swir
notice: Clearing failure of mariadb:0 on swir because it expired
warning: Unexpected result (error) was recorded for migrate_to of c8kubernode3 on swir at Apr 18 11:09:21 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
error: ocf resource c8kubernode3 might be active on 2 nodes (attempting recovery)
notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
notice: * Shutdown swir
notice: Actions: Stop mariadb:0 ( Unpromoted swir ) due to node availability
notice: Actions: Recover c8kubernode3 ( dzien )
error: Calculated transition 77 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-493.bz2
notice: Initiating stop operation c8kubernode3_stop_0 locally on dzien
notice: Requesting local execution of stop operation for c8kubernode3 on dzien
notice: Setting fail-count-mariadb#promote_0[swir]: 1 -> (unset)
notice: Setting last-failure-mariadb#promote_0[swir]: 1681808961 -> (unset)
notice: Setting fail-count-mariadb#monitor_10000[swir]: 1 -> (unset)
notice: Setting last-failure-mariadb#monitor_10000[swir]: 1681808926 -> (unset)
notice: Transition 77 aborted by deletion of lrm_rsc_op[@id='mariadb_last_failure_0']: Resource operation removal
notice: Result of stop operation for c8kubernode3 on dzien: ok
notice: Transition 77 (Complete=3, Pending=0, Fired=0, Skipped=2, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-error-493.bz2): Stopped
warning: Unexpected result (error) was recorded for migrate_to of c8kubernode3 on swir at Apr 18 11:09:21 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: * Shutdown swir
notice: Actions: Stop mariadb:0 ( Unpromoted swir ) due to node availability
notice: Actions: Recover c8kubernode3 ( swir -> dzien )
notice: Calculated transition 78, saving inputs in /var/lib/pacemaker/pengine/pe-input-2939.bz2
notice: Initiating stop operation c8kubernode3_stop_0 on swir
notice: Initiating stop operation mariadb_stop_0 on swir
notice: Setting mariadb-safe-to-bootstrap[swir]: 0 -> (unset)
notice: Transition 78 aborted by deletion of nvpair[@id='status-1-mariadb-safe-to-bootstrap']: Transient attribute change
notice: Setting mariadb-last-committed[swir]: 2152 -> (unset)
notice: Transition 78 action 2 (c8kubernode3_stop_0 on swir): expected 'ok' but got 'error'
notice: Setting last-failure-c8kubernode3#stop_0[swir]: (unset) -> 1681809103
notice: Setting fail-count-c8kubernode3#stop_0[swir]: (unset) -> INFINITY
notice: Transition 78 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-2939.bz2): Complete
warning: Unexpected result (error) was recorded for migrate_to of c8kubernode3 on swir at Apr 18 11:09:21 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: * Shutdown swir
crit: Cannot shut down swir because of c8kubernode3: unmanaged failed (c8kubernode3_stop_0)
notice: Calculated transition 79, saving inputs in /var/lib/pacemaker/pengine/pe-input-2940.bz2
warning: Unexpected result (error) was recorded for migrate_to of c8kubernode3 on swir at Apr 18 11:09:21 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
warning: c8kubernode3 cannot run on swir due to reaching migration threshold (clean up resource to allow again)
notice: * Shutdown swir
crit: Cannot shut down swir because of c8kubernode3: unmanaged failed (c8kubernode3_stop_0)
notice: Calculated transition 80, saving inputs in /var/lib/pacemaker/pengine/pe-input-2941.bz2
notice: Transition 80 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2941.bz2): Complete
notice: State transition S_TRANSITION_ENGINE -> S_IDLE
notice: State transition S_IDLE -> S_POLICY_ENGINE
warning: Unexpected result (error) was recorded for migrate_to of c8kubernode3 on swir at Apr 18 11:09:21 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
warning: c8kubernode3 cannot run on swir due to reaching migration threshold (clean up resource to allow again)
notice: * Shutdown swir
crit: Cannot shut down swir because of c8kubernode3: unmanaged failed (c8kubernode3_stop_0)
notice: Calculated transition 81, saving inputs in /var/lib/pacemaker/pengine/pe-input-2941.bz2
notice: Transition 81 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2941.bz2): Complete
notice: State transition S_TRANSITION_ENGINE -> S_IDLE
warning: Stonith/shutdown of node swir was not expected
notice: State transition S_IDLE -> S_POLICY_ENGINE
notice: Node swir state is now lost
notice: Removing all swir attributes for peer loss
notice: Purged 1 peer with id=1 and/or uname=swir from the membership cache
notice: Node swir state is now lost
notice: Purged 1 peer with id=1 and/or uname=swir from the membership cache
notice: Node swir state is now lost
notice: Purged 1 peer with id=1 and/or uname=swir from the membership cache
warning: Cluster node swir is unclean: peer is unexpectedly down
warning: swir is unclean
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
warning: Node swir is unclean but cannot be fenced
warning: Resource functionality and data integrity cannot be guaranteed (configure, enable, and test fencing to correct this)
notice: Actions: Start c8kubernode3 ( dzien )
warning: Calculated transition 82 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-2.bz2
notice: Node swir state is now lost
warning: Stonith/shutdown of node swir was not expected
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: Actions: Start c8kubernode3 ( dzien )
notice: Calculated transition 83, saving inputs in /var/lib/pacemaker/pengine/pe-input-2942.bz2
notice: Initiating start operation c8kubernode3_start_0 locally on dzien
notice: Requesting local execution of start operation for c8kubernode3 on dzien
notice: Result of start operation for c8kubernode3 on dzien: ok
notice: Initiating monitor operation c8kubernode3_monitor_10000 locally on dzien
notice: Requesting local execution of monitor operation for c8kubernode3 on dzien
notice: Result of monitor operation for c8kubernode3 on dzien: ok

I believe it should be relatively easy to reproduce: CentOS 9 for hosts & VMs, GlusterFS for shared storage of the qcow2 images, and that would be it. A single VM should suffice to "demonstrate" the issue.

thanks, L
You might be able to get more info by checking the libvirtd logs (journalctl -u libvirtd).
There will be nothing there - newer libvirt, certainly the one on CentOS 9, replaced the "monolithic" daemon with a "modular" approach. The logs I mentioned are the only relevant ones, and since my last message I have upped the log level - still nothing.

But what I noticed now, which I missed earlier - and which might be quite telling - is that the receiving nodes, when migration fails (so on reboot/shutdown), do not! create "migrate_from" logs for the RA/resource, but! those logs are created when migration is successful - like with 'node standby'. Would that not suggest a problem internally with the cluster/agent?
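For reference, with the modular daemons the QEMU-side logs here come from virtqemud rather than libvirtd; raising its verbosity looks roughly like this (the filter string below is just an example):

# in /etc/libvirt/virtqemud.conf
log_filters="3:remote 4:event 3:util.json 3:rpc 1:*"
log_outputs="1:file:/var/log/libvirt/virtqemud.log"

# then restart and watch the daemon
systemctl restart virtqemud
journalctl -u virtqemud -f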
I think this patch for the Filesystem agent should solve this issue: https://github.com/ClusterLabs/resource-agents/pull/1869
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.