Bug 2055035
| Field | Value |
|---|---|
| Summary | With VirtualDomain off glusterFS mount point - cluster takes ages @system reboot/shutdown |
| Product | Red Hat Enterprise Linux 9 |
| Reporter | lejeczek <peljasz> |
| Component | resource-agents |
| Assignee | Oyvind Albrigtsen <oalbrigt> |
| Status | CLOSED WONTFIX |
| QA Contact | cluster-qe <cluster-qe> |
| Severity | high |
| Docs Contact | |
| Priority | unspecified |
| Version | CentOS Stream |
| CC | agk, bstinson, cluster-maint, fdinitto, idevat, jwboyer, mlisik, mpospisi, omular, tojeline |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2023-08-16 07:28:34 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Attachments | pcs report (attachment 1862050) |
Description
lejeczek
2022-02-16 09:11:06 UTC
Hi,

Can you provide more information, such as steps to reproduce, actual and expected results, your configuration and logs? Thanks.

What information do you need? I believe it should be easily reproducible off the info I've put in.

Grab yourself a CentOS 9 and the binaries/packages needed - GlusterFS ver. 10 from here: https://www.ovirt.org/develop/dev-process/install-nightly-snapshot.html (though the gluster guys say ver. 10 should land in EPEL soon). Two boxes (VMs) should do -> set up a Gluster volume, then mount it (via 'fstab' should do) -> store your VM's image there (have your VM resource however you have it in the HA cluster, with the dom XML definition's 'source file=' pointing to the glusterFS mountpoint, e.g. /VMs) -> test that live migration works (which it should) -> reboot (to see the issue). All of this - GlusterFS, HA/pacemaker - can run on the same two nodes (the minimum to set up a GF volume).

A resource:

-> $ pcs resource config c8kubermaster1
Resource: c8kubermaster1 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml hypervisor=qemu:///system migrate_options=--unsafe migration_transport=ssh
  Meta Attrs: allow-migrate=true failure-timeout=30s
  Operations: migrate_from interval=0s timeout=120s (c8kubermaster1-migrate_from-interval-0s)
              migrate_to interval=0s timeout=120s (c8kubermaster1-migrate_to-interval-0s)
              monitor interval=30s (c8kubermaster1-monitor-interval-30s)
              start interval=0s timeout=60s (c8kubermaster1-start-interval-0s)
              stop interval=0s timeout=60s (c8kubermaster1-stop-interval-0

If _no_ VirtualDomain resource is up & running (or in any other state), then the HA cluster stops @shutdown/reboot and such a reboot/shutdown performs as expected.

This seems to be either a pacemaker or a glusterfs issue. I'm moving this to pacemaker for further investigation, even though it may land on glusterfs.

Hi,
Can you attach the result of
pcs cluster report --from "YYYY-M-D H:M:S" --to "YYYY-M-D H:M:S"
from each node covering the time of interest (i.e. a few minutes before the reboot was initiated to a few minutes after the node comes back up)?
Created attachment 1862050 [details]
pcs report
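As an aside, a minimal sketch of the reproduction steps from the description above - the node names (node1/node2), brick paths, volume name, VM name and domain XML path are placeholders, not taken from this report:

# create and start a two-brick replica volume, then mount it on both nodes via fstab
gluster volume create VMs replica 2 node1:/gluster/brick/VMs node2:/gluster/brick/VMs force
gluster volume start VMs
echo 'node1:/VMs  /VMs  glusterfs  defaults,_netdev  0 0' >> /etc/fstab
mount /VMs

# keep the qcow2 image on /VMs and point the domain XML's <source file=...> at it,
# then define the VM as a VirtualDomain resource with live migration allowed
pcs resource create testvm ocf:heartbeat:VirtualDomain \
    config=/var/lib/pacemaker/conf.d/testvm.xml \
    hypervisor=qemu:///system migration_transport=ssh \
    meta allow-migrate=true

# verify live migration works, then reboot one node to observe the slow shutdown
pcs resource move testvm node2
systemctl reboot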
Perhaps PCS/pacemaker cannot handle any case where the VM image is stored on a network-mounted fs path at all(?) - not specific to GlusterFS. I've now tested a mount point on an NFS export - still via 'fstab' - and the problem remains the same.

There are a couple of issues combining here.
First, the VM migrations are timing out, for example:
Feb 19 11:00:53.631 whale.mine.private pacemaker-controld [4375] (process_graph_event)
notice: Transition 111 action 21 (c8kubernode1_migrate_to_0 on dzien): expected 'ok' but got 'error' | target-rc=0 rc=1 call-id=185
Second, the VMs have failure-timeout set to 30 seconds.
This means that the failures are expiring while the migrations are timing out, causing the migrations to be repeatedly scheduled, timing out each time, until the shutdown timer pops and the cluster simply exits.
I think the failure-timeout should be longer than the action timeouts, otherwise there's the risk of this situation where an action is attempted indefinitely.
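For illustration only (not from this report): with the 120-second migrate_to/migrate_from timeouts shown in the resource configuration above, the failure-timeout could be raised to something comfortably larger; the 300s value below is just an example.

# make the failure expiry longer than the longest action timeout (120s here)
pcs resource meta c8kubermaster1 failure-timeout=300s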
There are a lot of messages like these from the VirtualDomain resource agent that appear to be related to why the migration is timing out:
Feb 19 11:00:51 dzien VirtualDomain(c8kubernode1)[4090944]: INFO: Virtual domain c8kubernode1 currently has no state, retrying.
but I'm not familiar enough with the agent to know what problems that might indicate. If you can't figure it out from here, we can reassign to resource-agents for further investigation.
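One hedged way to see what the agent is reacting to (the "no state" message presumably reflects what virsh reports for the domain) is to query libvirt directly on the node while the shutdown is in progress; the domain name is taken from the log line above:

# what libvirt currently reports for the domain the agent is monitoring
virsh --connect=qemu:///system domstate c8kubernode1
virsh --connect=qemu:///system list --all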
That would be weird as those 'timeouts' worked perfectly fine prior to 'libgfapi' removal.
But I also do this: I reboot/shutdown with the help of a simple script
...
pcs node standby ${HOSTNAME%%.*}
while [ $(virsh list --id | wc -l) -ne 1 ]; do
    echo -e \\twaiting for VMs to migrate over
    sleep 5s
done
/sbin/poweroff ${@}
exit 0
It is meant to evacuate all the resources from the rebooting node and, only when no VM is running on the node (so VirtualDomain must play NO role), actually proceed to the 'reboot' - and... still! the reboot takes ages...
During the 'reboot' other nodes see:
-> $ pcs status --full
...
Node List:
* Node whale (3): standby (with active resources)
* Online: [ dzien (1) swir (2) ]
...
PCSD Status:
dzien: Online
swir: Online
whale: Offline
and! yet: (whale is the 'rebooting' node, where shutdown is actually happening after 'evacuation' took place)
-> $ pcs resource status | grep -v disable
* c8kubermaster1 (ocf:heartbeat:VirtualDomain): Started swir
* vpn (ocf:heartbeat:VirtualDomain): Started whale
...
* c8kubernode2 (ocf:heartbeat:VirtualDomain): Started whale
* c8kubernode1 (ocf:heartbeat:VirtualDomain): Started swir
...
* Clone Set: GWlink-clone [GWlink]:
* Started: [ dzien swir ]
* Stopped: [ whale ]
* vpn1 (ocf:heartbeat:VirtualDomain): Started swir
* ubusrv1 (ocf:heartbeat:VirtualDomain): Started dzien
* ubusrv3 (ocf:heartbeat:VirtualDomain): Started dzien
* ubusrv2 (ocf:heartbeat:VirtualDomain): Started swir
* c8kubermaster2 (ocf:heartbeat:VirtualDomain): Started whale
* c8kubernode3 (ocf:heartbeat:VirtualDomain): Started dzien
* ovpn-to-ionos (systemd:openvpn-client): Started swir
and.. (still node is rebooting)
-> $ pcs resource status | grep -v disable
* c8kubermaster1 (ocf:heartbeat:VirtualDomain): Started swir
* vpn (ocf:heartbeat:VirtualDomain): Migrating whale
* gorn (ocf:heartbeat:VirtualDomain): Migrating whale
* ayan (ocf:heartbeat:VirtualDomain): Migrating whale
* c8kubernode2 (ocf:heartbeat:VirtualDomain): Migrating whale
* c8kubernode1 (ocf:heartbeat:VirtualDomain): Started swir
It seems that the cluster wants to migrate those back to the node which is rebooting but is also in 'standby'?
I have some negative colocation constraints, but only '-100' - and regardless of any constraints, the node is offline and in standby, so why?
many thanks, L.
(In reply to lejeczek from comment #8)

> That would be weird as those 'timeouts' worked perfectly fine prior to
> 'libgfapi' removal.

Right -- the initial timeout is a symptom of whatever's going wrong, not the cause, but the failure-timeout value is making the problem worse by causing the cluster to repeatedly retry live migration rather than a full stop and start.

> But I also do this, I reboot/shutdown with help from a simple script
> ...
> pcs node standby ${HOSTNAME%%.*}
> while [ $(virsh list --id | wc -l) -ne 1 ]; do
>     echo -e \\twaiting for VMs to migrate over
>     sleep 5s
> done
> /sbin/poweroff ${@}
> exit 0

That loop waits until the VMs are gone from libvirt's point of view, but from the cluster's point of view the migrations are timing out, so the cluster cannot assume the VMs are really gone.

FYI, pcs node standby has a --wait option that will wait until the cluster "settles", meaning no further actions are required. That should do what you want. You can also put a time limit on the waiting. You might also want to do "pcs cluster stop" before poweroff just to be safe.

> It is meant to evacuate all the resources from the rebooting node and only
> when no VM is running (so VirtualDomain must play NO role) on the node,

The VMs themselves are no longer playing a role, but the live migration (of nothing, but the cluster doesn't know that) repeatedly timing out is playing a role.

> actually proceed to the 'reboot' and... still! reboot takes ages...

Yep, that's the stop of the pacemaker service, which will repeatedly try to finish the live migration first but can't. Raising the failure-timeout will allow the cluster to try to stop it instead -- either that will immediately succeed (which I expect it would), and the cluster can stop on the node, or it will fail, and the rest of the cluster will fence the node. Either way, the cluster will be able to recover from whatever is causing the live migration timeout.

> During the 'reboot' other nodes see:
> -> $ pcs status --full
> ...
> Node List:
>   * Node whale (3): standby (with active resources)
>   * Online: [ dzien (1) swir (2) ]
> ...
> PCSD Status:
>   dzien: Online
>   swir: Online
>   whale: Offline
>
> and! yet: (whale is the 'rebooting' node, where shutdown is actually
> happening after 'evacuation' took place)
>
> -> $ pcs resource status | grep -v disable
>   * c8kubermaster1 (ocf:heartbeat:VirtualDomain): Started swir
>   * vpn (ocf:heartbeat:VirtualDomain): Started whale
> ...
>   * c8kubernode2 (ocf:heartbeat:VirtualDomain): Started whale
>   * c8kubernode1 (ocf:heartbeat:VirtualDomain): Started swir
> ...
>   * Clone Set: GWlink-clone [GWlink]:
>     * Started: [ dzien swir ]
>     * Stopped: [ whale ]
>   * vpn1 (ocf:heartbeat:VirtualDomain): Started swir
>   * ubusrv1 (ocf:heartbeat:VirtualDomain): Started dzien
>   * ubusrv3 (ocf:heartbeat:VirtualDomain): Started dzien
>   * ubusrv2 (ocf:heartbeat:VirtualDomain): Started swir
>   * c8kubermaster2 (ocf:heartbeat:VirtualDomain): Started whale
>   * c8kubernode3 (ocf:heartbeat:VirtualDomain): Started dzien
>   * ovpn-to-ionos (systemd:openvpn-client): Started swir
>
> and.. (still node is rebooting)
>
> -> $ pcs resource status | grep -v disable
>   * c8kubermaster1 (ocf:heartbeat:VirtualDomain): Started swir
>   * vpn (ocf:heartbeat:VirtualDomain): Migrating whale
>   * gorn (ocf:heartbeat:VirtualDomain): Migrating whale
>   * ayan (ocf:heartbeat:VirtualDomain): Migrating whale
>   * c8kubernode2 (ocf:heartbeat:VirtualDomain): Migrating whale
>   * c8kubernode1 (ocf:heartbeat:VirtualDomain): Started swir
>
> seems that cluster wants to migrate those back to the node which is
> rebooting but also is 'stoodby' ?

It's still migrating them away (or at least thinks it is).

> I have some negative colocation constraints but only '-100' but regardless
> of any constraints - node is offline and stoodby, then why?
>
> many thanks, L.

To recap, there are two issues: first, why the live migration is timing out (which I see no clues for), and second, why the cluster gets stuck trying to repeat the live migration (which raising the failure-timeout should handle). I'm reassigning this bz to resource-agents to try to help debug the first issue.

The next step would be to figure out if the issue is in the resource agent or at a lower level like libvirt.

If you update it with `pcs resource update c8kubernode2 trace_ra=1` you should be able to see the full command it runs in /var/lib/heartbeat/trace_ra/. After that you can run it manually to see what actually happens.

You might want to disable the resource and start it manually by doing "pcs resource debug-start c8kubernode2" before running the migrate command from CLI.

I want to revisit the issue if possible.
I've been fiddling with the issue for the last few days, whereas all the time before I had a workaround in place.
So, now... the VMs involved have no constraints whatsoever and still the cluster fails to migrate them live during reboot/shutdown.
Also, I moved away from ssh & now use tls - to make sure sshd is not the culprit here.
Also! Putting the node into 'standby' migrates the VMs away live, as expected - so! - this is a reboot/shutdown-only issue.
-> $ pcs resource config c8kubernode3
Resource: c8kubernode3 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: c8kubernode3-instance_attributes
    config=/etc/libvirt/qemu/pacemaker.d/c8kubernode3.xml
    hypervisor=qemu:///system
    migrate_options=--unsafe
    remoteuri=qemu+tls://%n.services.internal/system
    trace_ra=1
  Meta Attributes: c8kubernode3-meta_attributes
    allow-migrate=true
  Utilization: c8kubernode3-utilization
    cpu=2
    host_memory=8192
    hv_memory=8192
  Operations:
    migrate_from: c8kubernode3-migrate_from-interval-0s
      interval=0s
      timeout=1h
    migrate_to: c8kubernode3-migrate_to-interval-0s
      interval=0s
      timeout=1h
    monitor: c8kubernode3-monitor-interval-10s
      interval=10s
      timeout=30s
    start: c8kubernode3-start-interval-0s
      interval=0s
      timeout=90s
    stop: c8kubernode3-stop-interval-0s
      interval=0s
      timeout=90s
and from ra's log
with standby, with success:
...
+ 10:18:00: __ha_log:250: echo 'VirtualDomain(c8kubernode3)[47952]: Apr' 18 10:18:00 'INFO: c8kubernode3: Starting live migration to dzien (using: virsh --connect=qemu:///system --quiet migrate --live --unsafe c8kubernode3 qemu+tls://dzien.services.internal/system ).'
+ 10:18:00: VirtualDomain_migrate_to:1015: migrate_pid=48601
+ 10:18:00: VirtualDomain_migrate_to:1013: virsh --connect=qemu:///system --quiet migrate --live --unsafe c8kubernode3 qemu+tls://dzien.services.internal/system
+ 10:18:00: VirtualDomain_migrate_to:1019: '[' 0 -ne 0 ']'
+ 10:18:00: VirtualDomain_migrate_to:1025: wait 48601
+ 10:19:16: VirtualDomain_migrate_to:1027: rc=0
+ 10:19:16: VirtualDomain_migrate_to:1028: '[' 0 -ne 0 ']'
+ 10:19:16: VirtualDomain_migrate_to:1032: ocf_log info 'c8kubernode3: live migration to dzien succeeded.'
with reboot, with failure:
...
+ 09:41:51: __ha_log:250: echo 'VirtualDomain(c8kubernode3)[35495]: Apr' 18 09:41:51 'INFO: c8kubernode3: Starting live migration to dzien (using: virsh --connect=qemu:///system --quiet migrate --live --unsafe c8kubernode3 qemu+tls://dzien.services.internal/system ).'
+ 09:41:51: VirtualDomain_migrate_to:1015: migrate_pid=36144
+ 09:41:51: VirtualDomain_migrate_to:1013: virsh --connect=qemu:///system --quiet migrate --live --unsafe c8kubernode3 qemu+tls://dzien.services.internal/system
+ 09:41:51: VirtualDomain_migrate_to:1019: '[' 0 -ne 0 ']'
+ 09:41:51: VirtualDomain_migrate_to:1025: wait 36144
+ 09:44:41: VirtualDomain_migrate_to:1027: rc=1
+ 09:44:41: VirtualDomain_migrate_to:1028: '[' 1 -ne 0 ']'
+ 09:44:41: VirtualDomain_migrate_to:1029: ocf_exit_reason 'c8kubernode3: live migration to dzien failed: 1'
so it appears there is not much to tell 'why'.
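A hedged sketch (not something actually done in this report) of rerunning the same migration by hand with client-side libvirt debug logging, reusing the exact virsh command from the trace above; the log path is a placeholder:

# same command the agent runs, minus --quiet, plus progress output and client debug logs
LIBVIRT_DEBUG=1 LIBVIRT_LOG_OUTPUTS="1:file:/tmp/virsh-migrate.log" \
    virsh --connect=qemu:///system migrate --live --unsafe --verbose \
    c8kubernode3 qemu+tls://dzien.services.internal/system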
On the receiving node also - with default log levels - nothing to clearly explain what happens:
'virtqemud.service' logs:
...
migration successfully aborted
and from 'pacemaker.service' on receiving node:
...
notice: Setting last-failure-mariadb#monitor_10000[swir]: (unset) -> 1681808926
notice: Transition 60 action 27 (mariadb_monitor_10000 on swir): expected 'promoted' but got 'not running'
notice: Setting fail-count-mariadb#monitor_10000[swir]: (unset) -> 1
notice: State transition S_IDLE -> S_POLICY_ENGINE
warning: Unexpected result (not running) was recorded for monitor of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: Actions: Recover mariadb:0 ( Promoted swir )
notice: Calculated transition 71, saving inputs in /var/lib/pacemaker/pengine/pe-input-2933.bz2
notice: Initiating demote operation mariadb_demote_0 on swir
notice: Transition 71 aborted by status-1-last-failure-mariadb.monitor_10000 doing create last-failure-mariadb#monitor_10000=1681808926: Transient attribute change
notice: Setting master-mariadb[swir]: 100 -> (unset)
notice: Setting mariadb-safe-to-bootstrap[swir]: (unset) -> 0
notice: Setting mariadb-last-committed[swir]: (unset) -> 2152
notice: Transition 71 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=9, Source=/var/lib/pacemaker/pengine/pe-input-2933.bz2): Stopped
warning: Unexpected result (not running) was recorded for monitor of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: Actions: Recover mariadb:0 ( Unpromoted swir )
notice: Calculated transition 72, saving inputs in /var/lib/pacemaker/pengine/pe-input-2934.bz2
notice: Initiating stop operation mariadb_stop_0 on swir
notice: Setting mariadb-safe-to-bootstrap[swir]: 0 -> (unset)
notice: Transition 72 aborted by deletion of nvpair[@id='status-1-mariadb-safe-to-bootstrap']: Transient attribute change
notice: Setting mariadb-last-committed[swir]: 2152 -> (unset)
notice: Transition 72 (Complete=4, Pending=0, Fired=0, Skipped=1, Incomplete=4, Source=/var/lib/pacemaker/pengine/pe-input-2934.bz2): Stopped
warning: Unexpected result (not running) was recorded for monitor of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: Actions: Start mariadb:2 ( swir )
notice: Calculated transition 73, saving inputs in /var/lib/pacemaker/pengine/pe-input-2935.bz2
notice: Initiating start operation mariadb:2_start_0 on swir
notice: Setting mariadb-safe-to-bootstrap[swir]: (unset) -> 0
notice: Transition 73 aborted by status-1-mariadb-safe-to-bootstrap doing create mariadb-safe-to-bootstrap=0: Transient attribute change
notice: Setting mariadb-last-committed[swir]: (unset) -> 2152
notice: Setting master-mariadb[swir]: (unset) -> 100
notice: Transition 73 (Complete=3, Pending=0, Fired=0, Skipped=2, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-2935.bz2): Stopped
warning: Unexpected result (not running) was recorded for monitor of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: Actions: Promote mariadb:0 ( Unpromoted -> Promoted swir )
notice: Calculated transition 74, saving inputs in /var/lib/pacemaker/pengine/pe-input-2936.bz2
notice: Initiating promote operation mariadb_promote_0 on swir
notice: Setting mariadb-last-committed[swir]: 2152 -> (unset)
notice: Transition 74 aborted by deletion of nvpair[@id='status-1-mariadb-last-committed']: Transient attribute change
notice: Setting mariadb-safe-to-bootstrap[swir]: 0 -> (unset)
notice: Setting shutdown[swir]: (unset) -> 1681808927
notice: Transition 74 action 24 (mariadb_promote_0 on swir): expected 'ok' but got 'error'
notice: Setting last-failure-mariadb#promote_0[swir]: (unset) -> 1681808961
notice: Setting fail-count-mariadb#promote_0[swir]: (unset) -> 1
notice: Transition 74 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=1, Source=/var/lib/pacemaker/pengine/pe-input-2936.bz2): Complete
notice: Clearing failure of mariadb:0 on swir because it expired
notice: Clearing failure of mariadb:0 on swir because it expired
notice: Ignoring expired mariadb_promote_0 failure on swir
notice: Clearing failure of mariadb:0 on swir because it expired
notice: Ignoring expired mariadb_promote_0 failure on swir
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: * Shutdown swir
notice: Actions: Stop mariadb:0 ( Promoted swir ) due to node availability
notice: Actions: Migrate c8kubernode3 ( swir -> dzien )
notice: Calculated transition 75, saving inputs in /var/lib/pacemaker/pengine/pe-input-2937.bz2
warning: Unexpected result (error: MySQL server failed to start (pid=35483) (rc=0), please check your installation) was recorded for promote of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error: MySQL server failed to start (pid=35483) (rc=0), please check your installation) was recorded for promote of mariadb:0 on swir at Apr 18 11:08:46 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: * Shutdown swir
notice: Actions: Stop mariadb:0 ( Promoted swir ) due to node availability
notice: Actions: Migrate c8kubernode3 ( swir -> dzien )
notice: Calculated transition 76, saving inputs in /var/lib/pacemaker/pengine/pe-input-2938.bz2
notice: Initiating migrate_to operation c8kubernode3_migrate_to_0 on swir
notice: Initiating demote operation mariadb_demote_0 on swir
notice: Setting master-mariadb[swir]: 100 -> (unset)
notice: Transition 76 aborted by deletion of nvpair[@id='status-1-master-mariadb']: Transient attribute change
notice: Setting mariadb-safe-to-bootstrap[swir]: (unset) -> 0
notice: Setting mariadb-last-committed[swir]: (unset) -> 2152
notice: High CPU load detected: 35.560001
gateway-link-clone notice: Transition 76 action 41 (c8kubernode3_migrate_to_0 on swir): expected 'ok' but got 'error'
notice: Transition 76 (Complete=5, Pending=0, Fired=0, Skipped=1, Incomplete=7, Source=/var/lib/pacemaker/pengine/pe-input-2938.bz2): Stopped
notice: Clearing failure of mariadb:0 on swir because it expired
notice: Clearing failure of mariadb:0 on swir because it expired
notice: Ignoring expired mariadb_promote_0 failure on swir
notice: Clearing failure of mariadb:0 on swir because it expired
warning: Unexpected result (error) was recorded for migrate_to of c8kubernode3 on swir at Apr 18 11:09:21 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
error: ocf resource c8kubernode3 might be active on 2 nodes (attempting recovery)
notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
notice: * Shutdown swir
notice: Actions: Stop mariadb:0 ( Unpromoted swir ) due to node availability
notice: Actions: Recover c8kubernode3 ( dzien )
error: Calculated transition 77 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-493.bz2
notice: Initiating stop operation c8kubernode3_stop_0 locally on dzien
notice: Requesting local execution of stop operation for c8kubernode3 on dzien
notice: Setting fail-count-mariadb#promote_0[swir]: 1 -> (unset)
notice: Setting last-failure-mariadb#promote_0[swir]: 1681808961 -> (unset)
notice: Setting fail-count-mariadb#monitor_10000[swir]: 1 -> (unset)
notice: Setting last-failure-mariadb#monitor_10000[swir]: 1681808926 -> (unset)
notice: Transition 77 aborted by deletion of lrm_rsc_op[@id='mariadb_last_failure_0']: Resource operation removal
notice: Result of stop operation for c8kubernode3 on dzien: ok
notice: Transition 77 (Complete=3, Pending=0, Fired=0, Skipped=2, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-error-493.bz2): Stopped
warning: Unexpected result (error) was recorded for migrate_to of c8kubernode3 on swir at Apr 18 11:09:21 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: * Shutdown swir
notice: Actions: Stop mariadb:0 ( Unpromoted swir ) due to node availability
notice: Actions: Recover c8kubernode3 ( swir -> dzien )
notice: Calculated transition 78, saving inputs in /var/lib/pacemaker/pengine/pe-input-2939.bz2
notice: Initiating stop operation c8kubernode3_stop_0 on swir
notice: Initiating stop operation mariadb_stop_0 on swir
notice: Setting mariadb-safe-to-bootstrap[swir]: 0 -> (unset)
notice: Transition 78 aborted by deletion of nvpair[@id='status-1-mariadb-safe-to-bootstrap']: Transient attribute change
notice: Setting mariadb-last-committed[swir]: 2152 -> (unset)
notice: Transition 78 action 2 (c8kubernode3_stop_0 on swir): expected 'ok' but got 'error'
notice: Setting last-failure-c8kubernode3#stop_0[swir]: (unset) -> 1681809103
notice: Setting fail-count-c8kubernode3#stop_0[swir]: (unset) -> INFINITY
notice: Transition 78 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-2939.bz2): Complete
warning: Unexpected result (error) was recorded for migrate_to of c8kubernode3 on swir at Apr 18 11:09:21 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: * Shutdown swir
crit: Cannot shut down swir because of c8kubernode3: unmanaged failed (c8kubernode3_stop_0)
notice: Calculated transition 79, saving inputs in /var/lib/pacemaker/pengine/pe-input-2940.bz2
warning: Unexpected result (error) was recorded for migrate_to of c8kubernode3 on swir at Apr 18 11:09:21 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
warning: c8kubernode3 cannot run on swir due to reaching migration threshold (clean up resource to allow again)
notice: * Shutdown swir
crit: Cannot shut down swir because of c8kubernode3: unmanaged failed (c8kubernode3_stop_0)
notice: Calculated transition 80, saving inputs in /var/lib/pacemaker/pengine/pe-input-2941.bz2
notice: Transition 80 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2941.bz2): Complete
notice: State transition S_TRANSITION_ENGINE -> S_IDLE
notice: State transition S_IDLE -> S_POLICY_ENGINE
warning: Unexpected result (error) was recorded for migrate_to of c8kubernode3 on swir at Apr 18 11:09:21 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error: forced stop failed) was recorded for stop of c8kubernode3 on swir at Apr 18 11:10:18 2023
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
warning: c8kubernode3 cannot run on swir due to reaching migration threshold (clean up resource to allow again)
notice: * Shutdown swir
crit: Cannot shut down swir because of c8kubernode3: unmanaged failed (c8kubernode3_stop_0)
notice: Calculated transition 81, saving inputs in /var/lib/pacemaker/pengine/pe-input-2941.bz2
notice: Transition 81 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2941.bz2): Complete
notice: State transition S_TRANSITION_ENGINE -> S_IDLE
warning: Stonith/shutdown of node swir was not expected
notice: State transition S_IDLE -> S_POLICY_ENGINE
notice: Node swir state is now lost
notice: Removing all swir attributes for peer loss
notice: Purged 1 peer with id=1 and/or uname=swir from the membership cache
notice: Node swir state is now lost
notice: Purged 1 peer with id=1 and/or uname=swir from the membership cache
notice: Node swir state is now lost
notice: Purged 1 peer with id=1 and/or uname=swir from the membership cache
warning: Cluster node swir is unclean: peer is unexpectedly down
warning: swir is unclean
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
warning: Node swir is unclean but cannot be fenced
warning: Resource functionality and data integrity cannot be guaranteed (configure, enable, and test fencing to correct this)
notice: Actions: Start c8kubernode3 ( dzien )
warning: Calculated transition 82 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-2.bz2
notice: Node swir state is now lost
warning: Stonith/shutdown of node swir was not expected
warning: Unexpected result (error) was recorded for migrate_to of ubusrv3 on dzien at Apr 18 08:38:41 2023
notice: Actions: Start c8kubernode3 ( dzien )
notice: Calculated transition 83, saving inputs in /var/lib/pacemaker/pengine/pe-input-2942.bz2
notice: Initiating start operation c8kubernode3_start_0 locally on dzien
notice: Requesting local execution of start operation for c8kubernode3 on dzien
notice: Result of start operation for c8kubernode3 on dzien: ok
notice: Initiating monitor operation c8kubernode3_monitor_10000 locally on dzien
notice: Requesting local execution of monitor operation for c8kubernode3 on dzien
notice: Result of monitor operation for c8kubernode3 on dzien: ok
I believe it should be relatively easy to reproduce.
CentOS 9 for hosts & VMs, GlusterFS for shared storage of the qcow2 images, and that would be it. A single VM should suffice to "demonstrate" the issue.
thanks, L
You might be able to get more info by checking the libvirtd logs (journalctl -u libvirtd).

There will be nothing there - newer libvirt, certainly the one on CentOS 9, replaced the "monolithic" daemon with a "modular" approach. The logs I mentioned are the only relevant ones, and since the last message I've upped the log level - still nothing. But what I noticed now, which I missed earlier - and which might be quite telling - is that the receiving nodes, when migration fails (so on reboot/shutdown), do not! create "migrate_from" logs for the RA/resource, but! those logs are created when migration is successful - like with 'node standby'. Would that not suggest a problem internal to the cluster/agent?

I think this patch for the Filesystem agent should solve this issue: https://github.com/ClusterLabs/resource-agents/pull/1869

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
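For reference, a minimal sketch of checking the modular daemon mentioned above (virtqemud) instead of the monolithic libvirtd; the timestamps are illustrative and the log_filters values are only an example:

# journal for the modular QEMU driver daemon around the failed migration
journalctl -u virtqemud.service --since "2023-04-18 09:40" --until "2023-04-18 09:50"

# to raise verbosity, set e.g. log_filters="1:qemu 1:migration" in
# /etc/libvirt/virtqemud.conf and restart the daemon
systemctl restart virtqemud.service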