Bug 1560854

Summary: Guest is sometimes left paused on the source host if the source libvirtd is killed during live migration, due to QEMU image locking
Product: Red Hat Enterprise Linux 7
Reporter: Fangge Jin <fjin>
Component: qemu-kvm-rhev
Assignee: Dr. David Alan Gilbert <dgilbert>
Status: CLOSED ERRATA
QA Contact: Yumei Huang <yuhuang>
Severity: high
Priority: medium
Version: 7.5
CC: chayang, coli, ddepaula, dgilbert, dyuan, famz, fjin, hhuang, jdenemar, jinzhao, juzhang, knoel, kwolf, michen, mrezanin, ngu, pingl, qzhang, virt-maint, xianwang, xuzhang, yafu, yuhuang
Target Milestone: rc
Hardware: x86_64
OS: Linux
Fixed In Version: qemu-kvm-rhev-2.12.0-7.el7
Clones: 1568407
Last Closed: 2018-11-01 11:07:15 UTC
Type: Bug
Bug Blocks: 1568407    
Attachments: domain XML extracted from the logs

Comment 2 Jiri Denemark 2018-03-27 08:16:09 UTC
Created attachment 1413602
domain XML extracted from the logs

Comment 4 Dr. David Alan Gilbert 2018-03-27 12:24:59 UTC
Adding in kwolf, since I think he has the best grip on the locking.

What I'm not sure of is whether it's actually safe to 'cont' the source until we know the destination really is dead; the destination has the qcow2 files open at that point.
I'd thought 'device' state would have simplified this problem since the destination now knows it's in that last phase.

I don't understand how, in this situation, you made the decision to kill the destination and restart the source, rather than the opposite: letting the destination finish.
Avoiding having both of them running at once is the more critical case.
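
For reference, the "image locking" here is the file locking QEMU has taken on
its disk images since 2.10: while one QEMU process holds an image read-write,
another process trying to open it fails. A quick way to see it (the image
path is just an example) is:

$ qemu-img info /var/lib/libvirt/images/guest.qcow2
qemu-img: Could not open '/var/lib/libvirt/images/guest.qcow2': Failed to get shared "write" lock
Is another process using the image?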

Comment 5 Dr. David Alan Gilbert 2018-03-27 12:30:20 UTC
Also adding in Fam

Comment 6 Jiri Denemark 2018-03-27 18:14:39 UTC
> What I'm not sure of is whether it's actually safe to 'cont' the source
> until we know the destination really is dead

Well, that's the whole point of starting the domain on the destination in
such a way that it does not automatically run once migration finishes, isn't
it? The destination process will have the files open, but it's not supposed
to touch them until it's instructed to do so.
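
(For a concrete picture: the destination QEMU is started paused, waiting for
the incoming stream, roughly like the hand-run sketch below; the path and
port are made up for illustration. It only starts running when something
later issues 'cont'.)

$ qemu-kvm -S \
    -incoming tcp:0:4444 \
    -drive file=/var/lib/libvirt/images/guest.qcow2,format=qcow2,if=virtio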

> I'd thought 'device' state would have simplified this problem since the
> destination now knows it's in that last phase.

This state does not really help here. Sure, we could start using it to note
that it is no longer possible to resume the source. But that is not a
solution to anything; it would just turn this bug into "can't fix", and
libvirt would intentionally leave the domain paused instead of trying to
resume it.

> I don't understand how, in this situation, you made the decision to kill
> the destination and restart the source, rather than the opposite: letting
> the destination finish.

It's quite simple: each side can decide this on its own, depending on the
current phase of the migration.

- the source resumes the domain iff it didn't send the confirmation that the
  migration finished
- the destination kills the domain as long as it didn't get the confirmation
  from the source
- if the destination gets the confirmation, it will resume the domain (at
  which point a confirmation about it is sent back to the source)
- the source kills the domain if it got the confirmation from the
  destination

- the domain remains paused on the source if the source sent a confirmation,
  but didn't get a reply from the destination
- the domain never remains paused on the destination, it's either resumed or
  killed

Of course, a lot of these actions change into "leave the domain paused" when
the migration was in the post-copy phase. And just for clarification, this is
all about recovering from a migration started earlier by a no-longer-running
instance of libvirtd.
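
Concretely, the "remains paused on the source" case looks like this on the
source monitor while the destination still holds the image locks (transcript
for illustration):

(qemu) info status
VM status: paused (postmigrate)
(qemu) c
Failed to get "write" lock
Is another process using the image?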

Comment 7 Dr. David Alan Gilbert 2018-03-28 10:43:27 UTC
OK, after discussions with Fam I think I'm reasonably happy.
'cont' already includes a bdrv_invalidate_cache_all(), which will grab the lock; so we don't need to change 'cont' itself, which makes me happier.
All I have to do is avoid calling it at the end of migration in the case where we don't autostart.

Comment 8 Dr. David Alan Gilbert 2018-03-28 17:03:25 UTC
Posted upstream:
migration: Don't activate block devices if using -S

Comment 12 Dr. David Alan Gilbert 2018-04-10 15:07:14 UTC
Just reverted the upstream fix, so we need to think about this a bit more for a larger fix, after discussion with kwolf and jdenemar.

Comment 13 Dr. David Alan Gilbert 2018-04-16 17:18:31 UTC
v2 posted, now tied to the new migration capability 'late-block-activate'.
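
For management software driving QEMU over QMP rather than HMP, enabling the
capability looks roughly like this (a sketch using the standard
migrate-set-capabilities command):

{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
      { "capability": "late-block-activate", "state": true } ] } }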

Comment 15 Miroslav Rezanina 2018-07-04 08:24:02 UTC
Fix included in qemu-kvm-rhev-2.12.0-7.el7

Comment 20 Yumei Huang 2018-07-05 10:11:40 UTC
Verify:
qemu-kvm-rhev-2.12.0-7.el7

Scenario 1: Don't set late-block-activate on

Boot the guest with "-S" and migrate to the destination. After migration, the guests on both src and dst are in paused status.

Src: 
(qemu) info status 
VM status: paused (postmigrate)

Dst:
(qemu) info status 
VM status: paused (prelaunch)


Then try to resume the src guest; it fails with this error message:

(qemu) c
Failed to get "write" lock
Is another process using the image?


Scenario 2: Set late-block-activate on before migrating

Boot the guest with "-S" and set late-block-activate on:

(qemu) migrate_set_capability late-block-activate on

Then migrate. After migration, resume the guest on src; the guest boots up successfully.

Comment 24 errata-xmlrpc 2018-11-01 11:07:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443