Bug 2037218 - VirtualDomain move fails
Summary: VirtualDomain move fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: pcs
Version: 9.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 9.1
Assignee: Ondrej Mular
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-05 09:39 UTC by lejeczek
Modified: 2022-06-08 11:45 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-08 11:45:25 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1847102 1 None None None 2023-06-27 12:59:41 UTC
Red Hat Bugzilla 1990784 1 None None None 2023-06-03 07:45:48 UTC
Red Hat Bugzilla 1990787 1 None None None 2023-06-03 07:46:39 UTC
Red Hat Issue Tracker CLUSTERQE-5762 0 None None None 2022-05-23 16:02:49 UTC
Red Hat Issue Tracker RHELPLAN-106856 0 None None None 2022-01-05 09:41:31 UTC

Description lejeczek 2022-01-05 09:39:33 UTC
Description of problem:

Trying to move resource:

-> $ pcs resource move c8kubermaster2 swir 
Location constraint to move resource 'c8kubermaster2' has been created
Waiting for the cluster to apply configuration changes...
Location constraint created to move resource 'c8kubermaster2' has been removed
Waiting for the cluster to apply configuration changes...
Error: resource 'c8kubermaster2' is running on node 'whale'
Error: Errors have occurred, therefore pcs is unable to continue

The VM store is on a GlusterFS volume mounted via FUSE (now that libgfapi is removed/deprecated).
'virsh' migrates the VM with '--unsafe' just fine, but adding this to the resource:

-> $ pcs resource update c8kubermaster2 attr migrate_options="--unsafe"

makes _no_ difference.
This should be very easy to reproduce.
It seems that moving a VirtualDomain resource between nodes is completely broken.
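
For reference, a minimal sketch of the manual live migration described above (the domain name and the node names come from this report; the qemu+ssh destination URI is an assumption - adjust it to your transport):

-> $ # live-migrate by hand, skipping libvirt's shared-storage safety check
-> $ virsh migrate --live --unsafe c8kubermaster2 qemu+ssh://swir/system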

many thanks, L.

Version-Release number of selected component (if applicable):

resource-agents-4.10.0-4.el9.x86_64


Comment 1 lejeczek 2022-01-05 09:45:22 UTC
Just in case I left it a bit vague - this is about live move/migration, which is broken - it still works in previous versions in CentOS 8 Stream.

Comment 2 Michal Privoznik 2022-01-05 15:35:45 UTC
Can you find the exact error reported in the libvirtd log? That might shed more light on why libvirt is denying the migration.
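
For anyone following along, a sketch of where these errors typically surface on RHEL 9 with the modular libvirt daemons (standard default unit name and per-domain log path, not confirmed by this report):

-> $ journalctl -u virtqemud.service
-> $ cat /var/log/libvirt/qemu/c8kubermaster2.log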

Comment 3 lejeczek 2022-01-05 17:04:36 UTC
I'm looking at something else very strange, I see:
-> $ pcs constraint config | less
...
  Resource: c8kubermaster2
    Enabled on:
      Node: whale (score:INFINITY)


and even though I run 'clear' & 'cleanup', that constraint remains there until I delete the resource & re-create it; then I can 'move' the resource again, albeit not as a 'live' migration.
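
A sketch of one way to inspect and remove such a leftover location constraint by id (the 'cli-prefer-...' id below is hypothetical - list the actual ids first):

-> $ pcs constraint config --full
-> $ pcs constraint delete cli-prefer-c8kubermaster2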

Also 'setenforce' seems to make no difference (unless some silent denials do).
In the new C9 there are a number of vir* services which replace libvirtd.service - looking at virtqemud.service I see:
..
2022-01-05 16:58:16.399+0000: 644190: warning : virSecurityValidateTimestamp:206 : Invalid XATTR timestamp detected on /VMs3/c8kubermaster2.qcow2 secdriver=dac
internal error: unable to execute QEMU command 'cont': Failed to get "write" lock

The 'locking' problem affects other bits outside of PCS - backups, snapshots of VMs - now with the only-via-FUSE method (unless there is some way to FUSE-mount GlusterFS that does the trick).

thanks, L.

Comment 4 lejeczek 2022-01-05 17:08:06 UTC
-> $ pcs resource config c8kubermaster2
 Resource: c8kubermaster2 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster2.xml hypervisor=qemu:///system migrate_options=--unsafe migration_transport=ssh
  Meta Attrs: allow-migrate=true failure-timeout=30s
  Operations: migrate_from interval=0s timeout=90s (c8kubermaster2-migrate_from-interval-0s)
              migrate_to interval=0s timeout=90s (c8kubermaster2-migrate_to-interval-0s)
              monitor interval=10s timeout=30s (c8kubermaster2-monitor-interval-10s)
              start interval=0s timeout=60s (c8kubermaster2-start-interval-0s)
              stop interval=0s timeout=60s (c8kubermaster2-stop-interval-0s)


swir.direct:/VMs3 on /VMs3 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,allow_other,max_read=131072)

Comment 5 Ondrej Mular 2022-01-06 07:18:25 UTC
Thank you for reporting this issue. After looking into it in more detail, I'm pretty sure I know what is causing this. There is a bug in the new implementation of `pcs resource move` introduced in pcs-0.11 (see the pcs man page, section "Changes in pcs-0.11", for more details) which in some cases will not move the resource. However, the old implementation of the move command is still available as `pcs resource move-with-constraint`, which can be used as a workaround for now. Another option is to run `pcs resource clear <resource> <node>` just before `pcs resource move`.
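
To make the workaround concrete, a sketch using the resource and node names from this report:

-> $ pcs resource move-with-constraint c8kubermaster2 swir

or, clearing any leftover constraint first:

-> $ pcs resource clear c8kubermaster2 whale
-> $ pcs resource move c8kubermaster2 swir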

Comment 6 lejeczek 2022-01-06 10:16:41 UTC
Yes, but the issue I really care about in reporting this BZ is LIVE migration/move of a VirtualDomain - even if it's not really a BUG on the PCS side - and possible ways for PCS to fix/improve that.
With Qemu/Libvirt versions that still have 'libgfapi' support, LIVE migration works smoothly, but with the new version where 'libgfapi' is removed, the only way is to fuse-mount GlusterFS volumes, and that is broken: a LIVE move fails over to shutdown/start - which is, well, what it is.

from log:
...
internal error: unable to execute QEMU command 'cont': Failed to get "write" lock
...

thanks, L.

Comment 9 Tomas Jelinek 2022-06-08 11:45:25 UTC
This pcs issue has been resolved in bz1990787.

If you believe that the issue has not been resolved, feel free to reopen this bz.

