Bug 1560854
Summary: | Guest is sometimes left paused on the source host if the source libvirtd is killed during live migration, due to QEMU image locking
---|---
Product: | Red Hat Enterprise Linux 7
Component: | qemu-kvm-rhev
Version: | 7.5
Hardware: | x86_64
OS: | Linux
Status: | CLOSED ERRATA
Severity: | high
Priority: | medium
Target Milestone: | rc
Reporter: | Fangge Jin <fjin>
Assignee: | Dr. David Alan Gilbert <dgilbert>
QA Contact: | Yumei Huang <yuhuang>
CC: | chayang, coli, ddepaula, dgilbert, dyuan, famz, fjin, hhuang, jdenemar, jinzhao, juzhang, knoel, kwolf, michen, mrezanin, ngu, pingl, qzhang, virt-maint, xianwang, xuzhang, yafu, yuhuang
Fixed In Version: | qemu-kvm-rhev-2.12.0-7.el7
Doc Type: | If docs needed, set a value
Bug Blocks: | 1568407 (view as bug list)
Last Closed: | 2018-11-01 11:07:15 UTC
Type: | Bug
Adding in kwolf, since I think he has the best grip on the locking. What I'm not sure of is whether it's actually safe to 'cont' the source until we know the destination really is dead; I'm thinking it has got the qcow2 files open at that point - so I'm not sure. I'd thought 'device' state would have simplified this problem since the destination now knows it's in that last phase. I don't understand how, in this situation, you made the decision to kill the destination and restart the source, vs the opposite case of allowing the destination to finish. Avoiding the two of them both running is the more critical case. Also adding in Fam.

> What I'm not sure of is whether it's actually safe to 'cont' the source
> until we know the destination really is dead

Well, that's the whole point of starting the domain on the destination in a way that does not automatically run it once migration is finished, isn't it? The destination process will have the files open, but it's not supposed to touch them until it's instructed to do so.

> I'd thought 'device' state would have simplified this problem since the
> destination now knows it's in that last phase.

This state does not really help here. Sure, we could start using it to note that it is no longer possible to resume the source. But this is not a solution to anything; it would just turn this bug into "can't fix", and libvirt would just intentionally leave the domain paused instead of trying to resume it.

> I don't understand how, in this situation, you made the decision to kill the
> destination and restart the source, vs the opposite case of allowing the
> destination to finish.

It's quite simple; each side can decide this on its own depending on the current phase of the migration.
- The source resumes the domain iff it didn't send the confirmation that the migration finished.
- The destination kills the domain as long as it didn't get the confirmation from the source.
- If the destination gets the confirmation, it resumes the domain (at which point a confirmation about it is sent back to the source).
- The source kills the domain if it got the confirmation from the destination.
- The domain remains paused on the source if the source sent a confirmation but didn't get a reply from the destination.
- The domain never remains paused on the destination; it's either resumed or killed.

Of course, a lot of these actions change into "leave the domain paused" when the migration was in the post-copy phase. And just for clarification, this is all about recovering from a migration started earlier by a no-longer-running instance of libvirtd.

OK, after discussions with Fam I think I'm reasonably happy. 'cont' already includes a bdrv_invalidate_cache_all, which will grab the lock - so that means we don't need to change 'cont', which makes me happier. All I have to do is avoid calling it at the end of migration in the case where we don't autostart.

Posted upstream: migration: Don't activate block devices if using -S

Just reverted the upstream fix; we need to think about this a bit more for a larger fix, after discussion with kwolf and jdenemar.

v2 posted, now tied to the new migration capability 'late-block-activate'.

Fix included in qemu-kvm-rhev-2.12.0-7.el7

Verify: qemu-kvm-rhev-2.12.0-7.el7

Scenario 1: Don't set late-block-activate on

Boot the guest with "-S" and migrate to the destination. After migration, the guests on both src and dst are in paused status.

Src:

    (qemu) info status
    VM status: paused (postmigrate)

Dst:

    (qemu) info status
    VM status: paused (prelaunch)

Then resume the src guest; we get an error message:

    (qemu) c
    Failed to get "write" lock
    Is another process using the image?
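The recovery rules listed above can be summarized as a small decision table. The sketch below is an illustrative model only, not libvirt's actual code; the function and enum names are hypothetical:

```python
from enum import Enum, auto

class Action(Enum):
    RESUME = auto()
    KILL = auto()
    LEAVE_PAUSED = auto()

def source_recovery(sent_confirmation: bool, got_reply: bool) -> Action:
    """What the source side does when recovering a migration started by a
    no-longer-running libvirtd (illustrative model of the rules above)."""
    if not sent_confirmation:
        # The source never confirmed completion, so it may still resume.
        return Action.RESUME
    if got_reply:
        # The destination confirmed it resumed; kill the source copy.
        return Action.KILL
    # Confirmation sent but no reply: ownership is unknown, so leave the
    # domain paused rather than risk two running copies.
    return Action.LEAVE_PAUSED

def destination_recovery(got_confirmation: bool) -> Action:
    """Destination side: resume only after the source's confirmation."""
    return Action.RESUME if got_confirmation else Action.KILL
```

Note that both sides only ever kill or pause when in doubt; neither resumes without positive confirmation, which is what keeps the "both running" case off the table.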
Scenario 2: Set late-block-activate on before migrating

Boot the guest with "-S" and set late-block-activate on:

    (qemu) migrate_set_capability late-block-activate on

Then migrate. After migration, resume the guest on src; the guest boots up successfully.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443
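The verification above uses the HMP command `migrate_set_capability`. For anyone driving QEMU over QMP instead, the equivalent is the `migrate-set-capabilities` command; the sketch below only builds and prints the JSON message (actually sending it over a QMP socket is omitted):

```python
import json

# QMP equivalent of HMP "migrate_set_capability late-block-activate on".
# The command and capability names are the upstream QMP ones; this snippet
# only constructs the message, it does not talk to a running QEMU.
cmd = {
    "execute": "migrate-set-capabilities",
    "arguments": {
        "capabilities": [
            {"capability": "late-block-activate", "state": True}
        ]
    },
}
print(json.dumps(cmd))
```

This must be issued on the source before starting the migration, like any other migration capability.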
Created attachment 1413602 [details] domain XML extracted from the logs