Bug 2037218

Summary: VirtualDomain move fails
Product: Red Hat Enterprise Linux 9
Component: pcs
Version: 9.0
Reporter: lejeczek <peljasz>
Assignee: Ondrej Mular <omular>
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Hardware: x86_64
OS: Linux
Keywords: Triaged
Target Milestone: rc
Target Release: 9.1
Doc Type: Bug Fix
Type: Bug
Last Closed: 2022-06-08 11:45:25 UTC
CC: agk, bstinson, cluster-maint, fdinitto, idevat, jwboyer, mlisik, mmazoure, mpospisi, mprivozn, omular, tojeline

Description lejeczek 2022-01-05 09:39:33 UTC
Description of problem:

Trying to move resource:

-> $ pcs resource move c8kubermaster2 swir 
Location constraint to move resource 'c8kubermaster2' has been created
Waiting for the cluster to apply configuration changes...
Location constraint created to move resource 'c8kubermaster2' has been removed
Waiting for the cluster to apply configuration changes...
Error: resource 'c8kubermaster2' is running on node 'whale'
Error: Errors have occurred, therefore pcs is unable to continue

The VM store is on a GlusterFS volume mounted via FUSE (now that libgfapi is removed/deprecated).
'virsh' migrates a VM with '--unsafe' just fine, but adding this option to the resource:

-> $ pcs resource update c8kubermaster2 attr migrate_options="--unsafe"

makes _no_ difference.
This should be very easy to reproduce.
It seems that moving a VirtualDomain resource between nodes is completely broken.
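For comparison, a direct libvirt live migration (bypassing Pacemaker) that works as described above would look something like this; the destination URI is an assumption based on the node names in this report:

```shell
# Live-migrate the VM directly via libvirt, bypassing Pacemaker.
# '--unsafe' skips the cache-coherency safety check that otherwise
# blocks migration on shared FUSE-mounted storage.
# (destination URI is hypothetical - adjust user/host for your cluster)
virsh migrate --live --unsafe c8kubermaster2 qemu+ssh://swir/system
```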

many thanks, L.

Version-Release number of selected component (if applicable):

resource-agents-4.10.0-4.el9.x86_64


Comment 1 lejeczek 2022-01-05 09:45:22 UTC
Just in case I left it a bit vague: this is about live move/migration, which is broken here but still works in previous versions on CentOS 8 Stream.

Comment 2 Michal Privoznik 2022-01-05 15:35:45 UTC
Can you find the exact error reported in the libvirtd log? That might shed more light on why libvirt is denying migration.

Comment 3 lejeczek 2022-01-05 17:04:36 UTC
I'm looking at something else very strange; I see:
-> $ pcs constraint config | less
...
  Resource: c8kubermaster2
    Enabled on:
      Node: whale (score:INFINITY)


Even though I run 'clear' & 'cleanup', that constraint remains there until I delete the resource & re-create it; then I can 'move' the resource again, albeit not as a 'live' migration.
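A stale location constraint like the one shown above can usually be removed by its id without recreating the resource; a sketch (the constraint id below is an example - take the real one from the `--full` listing):

```shell
# List all constraints together with their ids
pcs constraint config --full
# Remove the stale location constraint by id (id here is hypothetical)
pcs constraint delete cli-prefer-c8kubermaster2
# ...or clear any move/ban constraints attached to the resource
pcs resource clear c8kubermaster2
```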

Also 'setenforce' seems to make no difference (unless there are some silent denials).
In the new C9 there are a number of vir* services which replace libvirtd.service - looking at virtqemud.service I see:
..
2022-01-05 16:58:16.399+0000: 644190: warning : virSecurityValidateTimestamp:206 : Invalid XATTR timestamp detected on /VMs3/c8kubermaster2.qcow2 secdriver=dac
internal error: unable to execute QEMU command 'cont': Failed to get "write" lock

The 'locking' problem affects other bits outside of PCS too - backups and snapshots of VMs - now that FUSE is the only mount method (unless there is some way to fuse-mount GlusterFS that does the trick).

thanks, L.

Comment 4 lejeczek 2022-01-05 17:08:06 UTC
-> $ pcs resource config c8kubermaster2
 Resource: c8kubermaster2 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster2.xml hypervisor=qemu:///system migrate_options=--unsafe migration_transport=ssh
  Meta Attrs: allow-migrate=true failure-timeout=30s
  Operations: migrate_from interval=0s timeout=90s (c8kubermaster2-migrate_from-interval-0s)
              migrate_to interval=0s timeout=90s (c8kubermaster2-migrate_to-interval-0s)
              monitor interval=10s timeout=30s (c8kubermaster2-monitor-interval-10s)
              start interval=0s timeout=60s (c8kubermaster2-start-interval-0s)
              stop interval=0s timeout=60s (c8kubermaster2-stop-interval-0s)


swir.direct:/VMs3 on /VMs3 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,allow_other,max_read=131072)

Comment 5 Ondrej Mular 2022-01-06 07:18:25 UTC
Thank you for reporting this issue. After looking into it in more detail, I'm pretty sure that I know what is causing this. There is a bug in the new implementation of `pcs resource move` introduced in pcs-0.11 (see the pcs man page, section "Changes in pcs-0.11", for more details) which in some cases will not move the resource. However, the old implementation of the move command is still available as `pcs resource move-with-constraint`, which can be used as a workaround for now. Another option is to run `pcs resource clear <resource> <node>` just before `pcs resource move`.
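Spelled out with the resource and node names from this report, the two suggested workarounds would be (a sketch, assuming a running cluster):

```shell
# Option 1: use the old move implementation kept from pcs-0.10
pcs resource move-with-constraint c8kubermaster2 swir

# Option 2: clear any leftover move/ban constraints first, then move
pcs resource clear c8kubermaster2 whale
pcs resource move c8kubermaster2 swir
```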

Comment 6 lejeczek 2022-01-06 10:16:41 UTC
Yes, but the issue I really care about in reporting this BZ is LIVE migration/move of a VirtualDomain - even if it's not really a BUG on PCS's part - and possible ways for PCS to fix/improve that.
With Qemu/Libvirt versions that still have 'libgfapi' support, LIVE migration works smoothly, but with the new versions where 'libgfapi' is removed, the only option is to fuse-mount GlusterFS volumes, and that is broken: a LIVE move falls back to shutdown/start - which is, well, what it is.

from log:
...
internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
...

thanks, L.

Comment 9 Tomas Jelinek 2022-06-08 11:45:25 UTC
This pcs issue has been resolved in bz1990787.

If you believe that the issue has not been resolved, feel free to reopen this bz.