Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there.

Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED".

If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry; the e-mail creates a ServiceNow ticket with Red Hat.

Migrated Bugzilla bugs will be moved to status "CLOSED", resolution "MIGRATED", and have "MigratedToJIRA" set in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). The same link will be available in a blue banner at the top of the page informing you that the bug has been migrated.

Bug 2037218

Summary: VirtualDomain move fails
Product: Red Hat Enterprise Linux 9
Reporter: lejeczek <peljasz>
Component: pcs
Assignee: Ondrej Mular <omular>
Status: CLOSED CURRENTRELEASE
QA Contact: cluster-qe <cluster-qe>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 9.0
CC: agk, bstinson, cluster-maint, fdinitto, idevat, jwboyer, mlisik, mmazoure, mpospisi, mprivozn, omular, tojeline
Target Milestone: rc
Keywords: Triaged
Target Release: 9.1
Flags: pm-rhel: mirror+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-06-08 11:45:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description lejeczek 2022-01-05 09:39:33 UTC
Description of problem:

Trying to move resource:

-> $ pcs resource move c8kubermaster2 swir 
Location constraint to move resource 'c8kubermaster2' has been created
Waiting for the cluster to apply configuration changes...
Location constraint created to move resource 'c8kubermaster2' has been removed
Waiting for the cluster to apply configuration changes...
Error: resource 'c8kubermaster2' is running on node 'whale'
Error: Errors have occurred, therefore pcs is unable to continue

The VM store is on a GlusterFS volume mounted via FUSE (now that libgfapi is removed/deprecated).
'virsh' migrates a VM with '--unsafe' just fine, but adding this to the resource:

-> $ pcs resource update c8kubermaster2 attr migrate_options="--unsafe"

makes _no_ difference.
Should be very easy to reproduce.
Seems that moving a VirtualDomain resource between nodes is completely broken.
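(A hedged sketch, not the reporter's exact setup: the VirtualDomain resource agent ultimately invokes virsh for live migration, so with migrate_options="--unsafe" the call is roughly equivalent to the function below. The function name, the qemu+ssh URI form, and the argument layout are illustrative assumptions; the agent's real command line depends on migration_transport and other attributes.)

```shell
# Illustrative approximation of what the VirtualDomain agent runs for a
# live migration when migrate_options="--unsafe" is set. Names and URI
# form are assumptions, not taken from the agent source.
migrate_vm() {
    domain="$1"   # libvirt domain name, e.g. c8kubermaster2
    target="$2"   # destination node, e.g. swir
    opts="$3"     # extra options passed through, e.g. --unsafe
    # $opts is left unquoted on purpose so multiple options word-split
    virsh migrate --live $opts "$domain" "qemu+ssh://$target/system"
}
```

Running the equivalent command by hand (as the reporter did with virsh directly) is a quick way to separate a libvirt/QEMU failure from a pcs/pacemaker one.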

many thanks, L.

Version-Release number of selected component (if applicable):

resource-agents-4.10.0-4.el9.x86_64

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 lejeczek 2022-01-05 09:45:22 UTC
Just in case I left it a bit vague - it's about live move/migration, which is broken here but still works in previous versions in CentOS 8 Stream.

Comment 2 Michal Privoznik 2022-01-05 15:35:45 UTC
Can you find the exact error reported in the libvirtd log? That might shed more light on why libvirt is denying migration.

Comment 3 lejeczek 2022-01-05 17:04:36 UTC
I'm looking at something else very strange, I see:
-> $ pcs constraint config | less
...
  Resource: c8kubermaster2
    Enabled on:
      Node: whale (score:INFINITY)


and even though I do 'clear' & 'cleanup', that constraint remains there until I delete the resource & re-create it; now I can 'move' the resource again, albeit not! as 'live' migration.

Also 'setenforce' seems to make no difference (unless some silent denials do occur).
In new C9 there are a number of vir* services which replace libvirtd.service - looking at virtqemud.service I see:
..
2022-01-05 16:58:16.399+0000: 644190: warning : virSecurityValidateTimestamp:206 : Invalid XATTR timestamp detected on /VMs3/c8kubermaster2.qcow2 secdriver=dac
internal error: unable to execute QEMU command 'cont': Failed to get "write" lock

The 'locking' problem affects other bits outside of PCS too, such as backups and snapshots of VMs, now with the only-via-fuse method (unless there is some way to fuse-mount GlusterFS that does the trick).
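(A hedged sketch: QEMU's "Failed to get 'write' lock" error typically means image-file locking did not behave as expected, and FUSE mounts are a common culprit. The probe below checks whether flock()-style locks work at all on a given mount. VM_STORE is an assumed variable; point it at the GlusterFS mount, e.g. /VMs3, before running. /tmp is used here only as a safe default.)

```shell
# Probe whether flock-style locking works on the VM store mount.
# VM_STORE is a hypothetical variable for this sketch, not part of any tool.
VM_STORE="${VM_STORE:-/tmp}"
lockfile="$VM_STORE/.locktest.$$"
touch "$lockfile"
if flock -n "$lockfile" true; then   # try to take a non-blocking lock
    lock_status=ok
else
    lock_status=broken
fi
rm -f "$lockfile"
echo "lock test: $lock_status"
```

If the probe reports "broken" on the GlusterFS mount, QEMU's own image locking would be expected to misbehave there as well.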

thanks, L.

Comment 4 lejeczek 2022-01-05 17:08:06 UTC
-> $ pcs resource config c8kubermaster2
 Resource: c8kubermaster2 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster2.xml hypervisor=qemu:///system migrate_options=--unsafe migration_transport=ssh
  Meta Attrs: allow-migrate=true failure-timeout=30s
  Operations: migrate_from interval=0s timeout=90s (c8kubermaster2-migrate_from-interval-0s)
              migrate_to interval=0s timeout=90s (c8kubermaster2-migrate_to-interval-0s)
              monitor interval=10s timeout=30s (c8kubermaster2-monitor-interval-10s)
              start interval=0s timeout=60s (c8kubermaster2-start-interval-0s)
              stop interval=0s timeout=60s (c8kubermaster2-stop-interval-0s)


swir.direct:/VMs3 on /VMs3 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,allow_other,max_read=131072)

Comment 5 Ondrej Mular 2022-01-06 07:18:25 UTC
Thank you for reporting this issue. After looking into it in more detail, I'm pretty sure that I know what is causing this. There is a bug in the new implementation of `pcs resource move` introduced in pcs-0.11 (see the pcs man page, section "Changes in pcs-0.11", for more details) which in some cases will not move the resource. However, the old implementation of the move command is still available as `pcs resource move-with-constraint`, which can be used as a workaround for now. Another option is to run `pcs resource clear <resource> <node>` just before `pcs resource move`.
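(The two workarounds above can be sketched as a small wrapper. This is a non-authoritative convenience function, assuming the `pcs resource clear`-then-`move` sequence described in the comment; the function name is illustrative, not part of pcs.)

```shell
# Workaround sketch for the pcs-0.11 move bug: clear any stale location
# constraint first, then move. move_resource is a hypothetical helper name.
move_resource() {
    res="$1"    # resource id, e.g. c8kubermaster2
    node="$2"   # destination node, e.g. swir
    # Drop any leftover location constraint from a previous move attempt
    pcs resource clear "$res" "$node"
    # Then move; alternatively, the old implementation still works:
    #   pcs resource move-with-constraint "$res" "$node"
    pcs resource move "$res" "$node"
}
```

Usage on the reporter's cluster would look like `move_resource c8kubermaster2 swir`.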

Comment 6 lejeczek 2022-01-06 10:16:41 UTC
Yes, but really the issue I care about in reporting this BZ is LIVE migration/move of VirtualDomain - even if it's not really a BUG on PCS's part - and possible ways for PCS to fix/improve that.
With Qemu/Libvirt versions that still have 'libgfapi' support, LIVE migration works smoothly, but with the new version where 'libgfapi' is removed and the only way is to fuse-mount GlusterFS volumes, it's broken: LIVE move falls back to shutdown/start - which is, well, what it is.

from log:
...
internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
...

thanks, L.

Comment 9 Tomas Jelinek 2022-06-08 11:45:25 UTC
This pcs issue has been resolved in bz1990787.

If you believe that the issue has not been resolved, feel free to reopen this bz.