Bug 1560854 - Guest is left paused on source host sometimes if kill source libvirtd during live migration due to QEMU image locking
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Dr. David Alan Gilbert
QA Contact: Yumei Huang
URL:
Whiteboard:
Depends On:
Blocks: 1568407
 
Reported: 2018-03-27 06:43 UTC by Fangge Jin
Modified: 2018-11-01 11:07 UTC
CC List: 23 users

Fixed In Version: qemu-kvm-rhev-2.12.0-7.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1568407 (view as bug list)
Environment:
Last Closed: 2018-11-01 11:07:15 UTC
Target Upstream Version:


Attachments
libvirtd log on both hosts (212.17 KB, application/x-bzip)
2018-03-27 06:43 UTC, Fangge Jin
domain XML extracted from the logs (4.36 KB, text/plain)
2018-03-27 08:16 UTC, Jiri Denemark

Description Fangge Jin 2018-03-27 06:43:10 UTC
Created attachment 1413567 [details]
libvirtd log on both hosts

Description of problem:
Guest is left paused on source host sometimes if kill source libvirtd during live migration

Version-Release number of selected component (if applicable):
libvirt-3.9.0-14.el7_5.2.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64

How reproducible:
>50%

Steps to Reproduce:
1.Start a guest on source host

2.Monitor guest event:
# virsh event foo --all --loop --timestamp

3.Do live migration
# virsh -c qemu+ssh://intel-i72600-03.qe.lab.eng.nay.redhat.com/system "migrate-setmaxdowntime foo 20000; migrate foo qemu+ssh://intel-e52650-16-2.englab.nay.redhat.com/system --live --verbose"

4.Check guest event, when guest is "suspended migrated", kill source libvirtd

5.Check migration status, migration fails.

error: Disconnected from qemu+ssh://intel-i72600-03.qe.lab.eng.nay.redhat.com/system due to end of file
error: End of file while reading data: Ncat: Broken pipe.: Input/output error

6. Check guest status on source, guest is paused 

# virsh list

# virsh qemu-monitor-command foo '{"execute":"query-status"}'; date
{"return":{"status":"postmigrate","singlestep":false,"running":false},"id":"libvirt-23"}

7. Check guest status on target, guest doesn't exist

Actual results:
As shown in step 6, the guest is left paused on the source host after killing the source libvirtd during migration

Expected results:
The guest is running on the source host after killing the source libvirtd during migration

Additional info:
Error info from libvirtd log:
2018-03-26 14:45:29.794+0000: 17083: error : qemuMonitorJSONCheckError:389 : internal error: unable to execute QEMU command 'cont': Failed to get "write" lock

Comment 2 Jiri Denemark 2018-03-27 08:16:09 UTC
Created attachment 1413602 [details]
domain XML extracted from the logs

Comment 3 Jiri Denemark 2018-03-27 11:55:48 UTC
The source libvirtd was killed just after migration entered "device" state:

2018-03-26 14:45:26.335+0000: 12956: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f050431a410 event={"timestamp": {"seconds": 1522075526, "microseconds": 334814}, "event": "MIGRATION", "data": {"status": "pre-switchover"}}
2018-03-26 14:45:26.342+0000: 12960: info : qemuMonitorSend:1061 : QEMU_MONITOR_SEND_MSG: mon=0x7f050431a410 msg={"execute":"migrate-continue","arguments":{"state":"pre-switchover"},"id":"libvirt-52"}
2018-03-26 14:45:26.345+0000: 12956: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f050431a410 event={"timestamp": {"seconds": 1522075526, "microseconds": 343420}, "event": "MIGRATION", "data": {"status": "device"}}
2018-03-26 14:45:26.347+0000: 12956: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f050431a410 event={"timestamp": {"seconds": 1522075526, "microseconds": 344015}, "event": "MIGRATION_PASS", "data": {"pass": 3}}

that is before libvirtd finished all its work on the source and before it
could notify the destination that the domain can be started there. In other
words, the domain is not able to run on the destination because, e.g., the
associated sanlock/virtlockd locks were not transferred yet.

Once libvirtd started again on the source a bit later

2018-03-26 14:45:28.685+0000: 17002: info : libvirt version: 3.9.0, package: 14.virtcov.el7_5.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2018-03-20-15:13:45, x86-034.build.eng.bos.redhat.com)

it sees an unfinished migration and knows it can just cancel the migration and
resume the domain on the source because the destination domain was not started
yet:

2018-03-26 14:45:29.669+0000: 17083: debug : qemuProcessReconnect:7224 : Reconnect monitor to 0x7f8268225870 'foo'
2018-03-26 14:45:29.765+0000: 17083: info : qemuMonitorSend:1061 : QEMU_MONITOR_SEND_MSG: mon=0x55e32471cc50 msg={"execute":"query-cpus","id":"libvirt-6"}
2018-03-26 14:45:29.761+0000: 17002: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x55e32471cc50 reply={"return": {"status": "postmigrate", "singlestep": false, "running": false}, "id": "libvirt-4"}
2018-03-26 14:45:29.782+0000: 17083: debug : qemuProcessRecoverMigrationOut:3110 : Cancelling unfinished migration of domain foo
2018-03-26 14:45:29.782+0000: 17083: info : qemuMonitorSend:1061 : QEMU_MONITOR_SEND_MSG: mon=0x55e32471cc50 msg={"execute":"migrate_cancel","id":"libvirt-13"}
2018-03-26 14:45:29.783+0000: 17002: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x55e32471cc50 reply={"return": {}, "id": "libvirt-13"}
2018-03-26 14:45:29.785+0000: 17083: info : qemuMonitorSend:1061 : QEMU_MONITOR_SEND_MSG: mon=0x55e32471cc50 msg={"execute":"cont","id":"libvirt-14"}
2018-03-26 14:45:29.793+0000: 17002: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x55e32471cc50 reply={"id": "libvirt-14", "error": {"class": "GenericError", "desc": "Failed to get \"write\" lock"}}
2018-03-26 14:45:29.794+0000: 17083: warning : qemuProcessRecoverMigrationOut:3158 : Could not resume domain foo

QEMU finished the migration in the meantime, and the QEMU process on the
destination apparently acquired the write locks for the disk images; because
it is still running, the QEMU process on the source cannot reacquire them.
The logs on the destination confirm this:

2018-03-26 14:45:25.961+0000: 13131: info : qemuMonitorSend:1061 : QEMU_MONITOR_SEND_MSG: mon=0x7f8ba40116d0 msg={"execute":"migrate-incoming","arguments":{"uri":"tcp:[::]:49152"},"id":"libvirt-18"}
2018-03-26 14:45:25.962+0000: 13129: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f8ba40116d0 event={"timestamp": {"seconds": 1522075525, "microseconds": 962068}, "event": "MIGRATION", "data": {"status": "setup"}}
2018-03-26 14:45:26.018+0000: 13129: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f8ba40116d0 event={"timestamp": {"seconds": 1522075526, "microseconds": 18660}, "event": "MIGRATION", "data": {"status": "active"}}
2018-03-26 14:45:28.558+0000: 13129: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f8ba40116d0 event={"timestamp": {"seconds": 1522075528, "microseconds": 558129}, "event": "MIGRATION", "data": {"status": "completed"}}

After a while, the client controlling this non-p2p migration notices that the
migration was interrupted and tells the destination to kill the QEMU process:

2018-03-26 14:45:31.726+0000: 13132: debug : virDomainMigrateFinish3Params:5010 : dconn=0x7f8b9c016bf0, params=0x7f8b9c005b00, nparams=3, cookiein=(nil), cookieinlen=0, cookieout=0x7f8bb26bdb00, cookieoutlen=0x7f8bb26bdaf4, flags=0x1, cancelled=1
2018-03-26 14:45:31.727+0000: 13132: info : qemuMonitorSend:1061 : QEMU_MONITOR_SEND_MSG: mon=0x7f8ba40116d0 msg={"execute":"query-status","id":"libvirt-19"}
2018-03-26 14:45:31.728+0000: 13129: info : qemuMonitorJSONIOProcessLine:213 : QEMU_MONITOR_RECV_REPLY: mon=0x7f8ba40116d0 reply={"return": {"status": "paused", "singlestep": false, "running": false}, "id": "libvirt-19"}
2018-03-26 14:45:31.728+0000: 13132: debug : qemuProcessStop:6422 : Shutting down vm=0x7f8ba400a8b0 name=foo id=92 pid=6051, reason=failed, asyncJob=migration in, flags=0x1
2018-03-26 14:45:31.730+0000: 13132: debug : qemuProcessKill:6338 : vm=0x7f8ba400a8b0 name=foo pid=6051 flags=0x5
2018-03-26 14:45:31.954+0000: 13132: error : qemuMigrationFinish:5511 : migration successfully aborted

The end result is that the domain stays paused on the source even though it is
killed on the destination and libvirt knows it is safe to resume the domain on
the source. The domain on the source should stay paused only if libvirt already
told the destination that the migration finished, but didn't receive any
reply. In this split-brain case the domain can be left paused on both sides or
it may be paused on the source and running on the destination. It's up to the
higher management layer to decide what to do with both domains.

In other words, QEMU should never acquire write locks by itself before we tell
it migration completed (either by the "cont" command or by a special command
introduced for this case if needed), since it is not allowed to start or
write into the disks before then. As I said, until "cont" is sent to QEMU,
sanlock locks may not be transferred yet and the destination process can be
killed at any time while the source still holds the current state and can be
safely resumed.

And before anyone comes up with the idea: we can't do some kind of coordinated
killing of the domains to make sure the domain on the destination is killed
before the one on the source is resumed. This whole scenario is about
recovering from network (or other) failures in which case there is no
connection (either direct or indirect) between the source and the destination
libvirt daemons which would be needed for such coordination.

Comment 4 Dr. David Alan Gilbert 2018-03-27 12:24:59 UTC
Adding in kwolf, since I think he has the best grip on the locking.

What I'm not sure of is whether it's actually safe to 'cont' the source until we know the destination really is dead; I'm thinking it still has the qcow2 files open at that point - so I'm not sure.
I'd thought 'device' state would have simplified this problem since the destination now knows it's in that last phase.

I don't understand how, in this situation, you made the decision to kill the destination and restart the source, vs the opposite case of allowing the destination to finish.
Avoiding the two of them both running is the more critical case.

Comment 5 Dr. David Alan Gilbert 2018-03-27 12:30:20 UTC
Also adding in Fam

Comment 6 Jiri Denemark 2018-03-27 18:14:39 UTC
> What I'm not sure of is whether it's actually safe to 'cont' the source
> until we know the destination really is dead

Well, that's the whole point of starting the domain on the destination in the
way it does not automatically run once migration is finished, isn't it? The
destination process will have files open, but it's not supposed to touch them
until it's instructed to do so.

> I'd thought 'device' state would have simplified this problem since the
> destination now knows it's in that last phase.

This state does not really help here. Sure, we could start using it to note
that it is no longer possible to resume the source. But this is not a solution
to anything, it would just turn this bug into "can't fix" and libvirt would
just intentionally leave the domain paused instead of trying to resume it.

> I don't understand how, in this situation, you made the decision to kill the
> destination and restart the source, vs the opposite case of allowing the
> destination to finish.

It's quite simple, each side can decide this on its own depending on the
current phase of the migration.

- the source resumes the domain iff it didn't send the confirmation that the
  migration finished
- the destination kills the domain as long as it didn't get the confirmation
  from the source
- if the destination gets the confirmation, it will resume the domain (at
  which point a confirmation about it is sent back to the source)
- the source kills the domain if it got the confirmation from the
  destination

- the domain remains paused on the source if the source sent a confirmation,
  but didn't get a reply from the destination
- the domain never remains paused on the destination, it's either resumed or
  killed

Of course, a lot of these actions change into "leave the domain paused" when
the migration was in post-copy phase. And just for clarification, this is all
about recovering from a migration started earlier by a no longer running
instance of libvirtd.
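The decision rules above can be sketched as a small model. This is purely illustrative (the function names and return strings are invented here, not libvirt code), assuming boolean flags for the confirmations each side has sent or received:

```python
# Illustrative model of the recovery rules listed above -- NOT libvirt's
# actual implementation, just the decision table expressed as code.

def recover_source(sent_confirm: bool, got_reply: bool,
                   post_copy: bool = False) -> str:
    """What the source does with its paused domain after libvirtd restarts."""
    if post_copy:
        return "leave paused"   # post-copy: resuming on the source is unsafe
    if not sent_confirm:
        return "resume"         # destination was never told migration finished
    if got_reply:
        return "kill"           # destination confirmed it resumed the domain
    return "leave paused"       # split-brain: defer to the management layer

def recover_destination(got_confirm: bool) -> str:
    """The destination either resumes or kills; it never stays paused."""
    return "resume" if got_confirm else "kill"
```

The bug in this report corresponds to `recover_source(False, False)`: no confirmation was ever sent, so the source is entitled to resume, yet the `cont` fails because the destination QEMU holds the image locks.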

Comment 7 Dr. David Alan Gilbert 2018-03-28 10:43:27 UTC
OK, after discussions with Fam I think I'm reasonably happy.
'cont' already includes a bdrv_invalidate_cache_all which will grab the lock - so that means we don't need to change 'cont', which makes me happier.
All I have to do is avoid calling it at the end of migration in the case where we don't autostart.

Comment 8 Dr. David Alan Gilbert 2018-03-28 17:03:25 UTC
Posted upstream:
migration: Don't activate block devices if using -S

Comment 10 Fangge Jin 2018-03-29 03:16:46 UTC
A reliable way to reproduce this bug:

1. Attach source libvirtd to gdb and set breakpoint in function qemuMigrationPerformPhase:

# gdb -p `pidof libvirtd`
(gdb) b qemu/qemu_migration.c:5014
Breakpoint 1 at 0x7f0b71e30c1d: file qemu/qemu_migration.c, line 5014.
(gdb) c
Continuing.

2. Start a guest on source host

3. Do migration and Ctrl+Z when it is in perform phase:
# virsh -c qemu+ssh://intel-i72600-03.qe.lab.eng.nay.redhat.com/system  "migrate foo qemu+ssh://intel-e52650-16-2.englab.nay.redhat.com/system --live --verbose"
Migration: [  3 %]^Z
[1]+  Stopped

4. When source libvirtd reaches breakpoint, kill it and quit gdb
1)
Breakpoint 1, qemuMigrationPerformPhase (resource=0, flags=257, cookieoutlen=0x7f0b8364da7c, cookieout=0x7f0b8364da88, cookieinlen=299, 
    cookiein=0x7f0b5c0008c0 "<qemu-migration>\n  <name>foo</name>\n  <uuid>f16dd5c5-f0ca-40b2-82a2-38089cc0e12b</uuid>\n  <hostname>intel-e52650-16-2.englab.nay.redhat.com</hostname>\n  <hostuuid>cf1a4a00-4f88-11e3-b7a6-2c44fd2e29dc<"..., migParams=0x7f0b5c000b50, compression=0x7f0b5c000ba0, migrate_disks=0x0, nmigrate_disks=0, graphicsuri=0x0, 
    uri=0x7f0b5c000b10 "tcp:intel-e52650-16-2.englab.nay.redhat.com:49152", persist_xml=0x0, vm=0x7f0b6433f700, conn=0x7f0b60000a00, driver=0x7f0b6410b2f0) at qemu/qemu_migration.c:5014
5014	    if (ret < 0) {
(gdb) 

2) # pkill -9 libvirtd

3) (gdb) q

5. Resume the stopped "virsh" job:
# fg
virsh -c qemu+ssh://intel-i72600-03.qe.lab.eng.nay.redhat.com/system "migrate foo qemu+ssh://intel-e52650-16-2.englab.nay.redhat.com/system --live --verbose "
Migration: [100 %]error: Disconnected from qemu+ssh://intel-i72600-03.qe.lab.eng.nay.redhat.com/system due to end of file
error: End of file while reading data: Ncat: Broken pipe.: Input/output error

6. Check guest on source host:
# virsh list
 Id    Name                           State
----------------------------------------------------
 7     foo                            paused

Comment 12 Dr. David Alan Gilbert 2018-04-10 15:07:14 UTC
Just reverted the upstream fix; so we need to think about this a bit more for a larger fix after discussion with kwolf and jdenemar.

Comment 13 Dr. David Alan Gilbert 2018-04-16 17:18:31 UTC
v2 posted, now tied to new migration capability 'late-block-activate'
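Putting comments 7 and 13 together, the behaviour of the v2 fix can be modelled as a single condition (an illustrative sketch, not QEMU source): with the capability set, completing incoming migration no longer activates the block devices, and therefore does not take the image write locks; activation happens later in 'cont', which already invalidates caches and grabs the locks.

```python
def activate_block_devices_at_migration_end(late_block_activate: bool) -> bool:
    # Sketch of the v2 fix: when the 'late-block-activate' migration
    # capability is set, the destination defers block-device activation
    # (and thus the "write" lock acquisition) until 'cont' is issued.
    return not late_block_activate
```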

Comment 15 Miroslav Rezanina 2018-07-04 08:24:02 UTC
Fix included in qemu-kvm-rhev-2.12.0-7.el7

Comment 20 Yumei Huang 2018-07-05 10:11:40 UTC
Verify:
qemu-kvm-rhev-2.12.0-7.el7

Scenario 1: Don't set late-block-activate on

Boot the guest with "-S" and migrate to the destination. After migration, the guests on both src and dst are in paused status.

Src: 
(qemu) info status 
VM status: paused (postmigrate)

Dst:
(qemu) info status 
VM status: paused (prelaunch)


Then resume src guest, get error message:

(qemu) c
Failed to get "write" lock
Is another process using the image?


Scenario 2: Set late-block-activate on before migrate

Boot the guest with "-S" and set late-block-activate on:

(qemu) migrate_set_capability late-block-activate on

Then do the migration. After migration, resume the guest on src; the guest boots up successfully.
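Management software driving QEMU over QMP rather than the HMP monitor sets the same capability with the QMP command "migrate-set-capabilities" (a real command; the newline-joined wire framing below is just for illustration):

```python
import json

# QMP payloads for the verification flow above. "migrate-set-capabilities"
# and "cont" are real QMP commands; how they are framed and delivered to
# the monitor socket is left out of this sketch.
set_cap = {
    "execute": "migrate-set-capabilities",
    "arguments": {"capabilities": [
        {"capability": "late-block-activate", "state": True}
    ]},
}
cont = {"execute": "cont"}  # resumes the guest, activating block devices

wire = "\n".join(json.dumps(cmd) for cmd in (set_cap, cont))
```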

Comment 24 errata-xmlrpc 2018-11-01 11:07:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443

