Bug 1530130
| Summary: | Target host in nova DB got updated to new compute while migration failed and qemu-kvm process was still running on source host. [rhel-7.4.z] | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Oneata Mircea Teodor <toneata> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA | QA Contact: | zhe peng <zpeng> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.4 | CC: | berrange, dasmith, dgilbert, dyuan, eglynn, fjin, jdenemar, jsuchane, kchamart, libvirt-maint, mfuruta, mkalinin, molasaga, mschuppe, mtessun, pbarta, rbalakri, rbryant, sbauza, sferdjao, sgordon, smykhail, srevivo, vromanso, xuzhang, yafu |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-3.2.0-14.el7_4.9 | Doc Type: | Bug Fix |
| Doc Text: | Cause: Libvirt advertised the migration as completed in the migration statistics report immediately after QEMU finished sending data to the destination. Consequence: Management software monitoring the migration could see it reported as finished even though the domain might still fail to start on the destination. Fix: Libvirt was patched to report the migration as completed only after the domain is already running on the destination. Result: Management software no longer reacts incorrectly to a failed migration. | | |
| Story Points: | --- | | |
| Clone Of: | 1401173 | Environment: | |
| Last Closed: | 2018-03-06 21:41:17 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1401173 | | |
| Bug Blocks: | | | |
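The Doc Text above describes the management-side symptom: a tool watching the source's migration statistics could see the job finish even though the domain never came up on the destination. As an illustration only (none of this code is from the bug report), here is a minimal libvirt-python sketch of a monitor that does not trust the source-side statistics alone; the URIs, domain name, and helper names are placeholders.

```python
# Hypothetical monitoring sketch (not part of this bug report). Assumes
# libvirt-python is installed; URIs and the domain name "rhel" are placeholders.
import time
import libvirt

SRC_URI = "qemu+ssh://source.example.com/system"   # placeholder source host
DST_URI = "qemu+ssh://dest.example.com/system"     # placeholder destination host
DOMAIN = "rhel"

def wait_for_source_job(dom):
    """Poll the source-side job until libvirt no longer reports an active job."""
    while True:
        info = dom.jobInfo()                        # first field is the job type
        if info[0] == libvirt.VIR_DOMAIN_JOB_NONE:
            return
        time.sleep(1)

def domain_running_on_destination():
    """Confirm success on the destination instead of trusting source statistics.

    Before the fix, the source could already report the job as completed while
    the domain still failed to start on the destination."""
    dst = libvirt.open(DST_URI)
    try:
        return dst.lookupByName(DOMAIN).isActive() == 1
    except libvirt.libvirtError:
        return False
    finally:
        dst.close()

if __name__ == "__main__":
    src = libvirt.open(SRC_URI)
    dom = src.lookupByName(DOMAIN)
    wait_for_source_job(dom)          # migration assumed to be started elsewhere
    print("migration really succeeded:", domain_running_on_destination())
    src.close()
```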
Description
Oneata Mircea Teodor
2018-01-02 06:58:57 UTC
The patch mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1401173#c34 caused a regression in reporting statistics of a completed job. See bug 1523036 for more details and an additional patch that will need to be backported to avoid the regression in 7.4.z.

I can reproduce this with build libvirt-3.2.0-14.el7.x86_64 and verified it with build libvirt-3.2.0-14.el7_4.8.x86_64.

Steps:

1. Prepare a migration environment (two hosts).
2. On the destination host, attach gdb to libvirtd, set a breakpoint on qemuMigrationFinish, and let the daemon continue:

       # gdb -p $(pidof libvirtd)
       (gdb) br qemuMigrationFinish
       (gdb) c

3. Migrate a domain to the destination host:

       # virsh migrate rhel --live qemu+ssh://$target_host/system --verbose

4. Once gdb stops at the breakpoint, check 'virsh domjobinfo DOM' on the source host:

       # virsh domjobinfo rhel
       Job type:         Unbounded
       Operation:        Outgoing migration
       Time elapsed:     5773 ms
       Data processed:   169.265 MiB
       Data remaining:   0.000 B
       Data total:       1.102 GiB
       Memory processed: 169.265 MiB
       Memory remaining: 0.000 B
       Memory total:     1.102 GiB
       Memory bandwidth: 109.149 MiB/s
       Dirty rate:       0 pages/s
       Iteration:        3
       Constant pages:   742625
       Normal pages:     127638
       Normal data:      498.586 MiB
       Expected downtime: 20 ms
       Setup time:       9 ms

5. Kill the qemu-kvm process on the destination host.
6. Let gdb continue executing libvirtd (this will likely need to be done twice, since gdb may stop at SIGPIPE after the first continue):

       (gdb) c

7. Check that the migration failed and the domain is still running on the source:

       Migration: [100 %]error: internal error: qemu unexpectedly closed the monitor: 2018-01-17T09:21:38.632270Z qemu-kvm: -chardev pty,id=charserial0: char device redirected to /dev/pts/1 (label charserial0)

8. Check the guest on the source:

       # virsh list --all
        Id    Name    State
       ----------------------------------------------------
        1     rhel    running

Hi jirka, I found an issue while doing some free testing of this patch; please help check whether it is a regression. Below is the output of domjobinfo with --completed:

    # virsh domjobinfo rhel --completed
    Job type:         Completed
    Operation:        Outgoing migration
    Time elapsed:     2053 ms
    Time elapsed w/o network: 2041 ms
    Total downtime:   80 ms
    Downtime w/o network: 68 ms

But with libvirt-3.2.0-14.el7.x86_64 it is:

    # virsh domjobinfo rhel --completed
    Job type:         Completed
    Operation:        Outgoing migration
    Time elapsed:     5822 ms
    Time elapsed w/o network: 5817 ms
    Data processed:   595.598 MiB
    Data remaining:   0.000 B
    Data total:       1.102 GiB
    Memory processed: 595.598 MiB
    Memory remaining: 0.000 B
    Memory total:     1.102 GiB
    Memory bandwidth: 111.518 MiB/s
    Dirty rate:       0 pages/s
    Iteration:        16
    Constant pages:   193151
    Normal pages:     151752
    Normal data:      592.781 MiB
    Total downtime:   383 ms
    Downtime w/o network: 378 ms
    Setup time:       12 ms

Some of the output did not show up.

Yeah, it's a regression. When backporting the patches I intentionally skipped some refactoring patches and didn't properly adjust the rest.

The patch mentioned in comment 3, which was supposed to fix a regression, may crash libvirtd in some cases. See bug 1536351 for more details. In other words, one more patch is needed here.

Verified the issue in comment 9 with libvirt-3.2.0-14.el7_4.9.
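For reference, the data behind 'virsh domjobinfo --completed' (the output that lost most of its fields in the regression above) comes from virDomainGetJobStats with the VIR_DOMAIN_JOB_STATS_COMPLETED flag. Below is a minimal sketch, not taken from this bug report, using libvirt-python with a local connection and the domain name from the test; the key names mentioned in the comments are an assumption about the binding's output.

```python
# Sketch: query statistics of the most recently completed job, the same data
# that 'virsh domjobinfo rhel --completed' prints above. Assumes libvirt-python.
import libvirt

conn = libvirt.open("qemu:///system")      # local hypervisor; adjust as needed
dom = conn.lookupByName("rhel")            # domain name from the test above

# VIR_DOMAIN_JOB_STATS_COMPLETED asks for the last finished job instead of a
# currently running one; the regression caused most keys to be missing here.
stats = dom.jobStats(libvirt.VIR_DOMAIN_JOB_STATS_COMPLETED)

# Typical keys include 'data_processed', 'memory_total', 'downtime', and
# 'iteration' (key names are an assumption about the Python binding's output).
for key in sorted(stats):
    print(key, stats[key])

conn.close()
```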
Test steps:

1. Do migration with the '--persistent' and '--offline' options:

       # virsh migrate rhel qemu+ssh://10.66.4.116/system --offline --verbose --persistent
       root@10.66.4.116's password:
       Migration: [100 %]

Verified comment 7 with build libvirt-3.2.0-14.el7_4.9:

    # virsh migrate rhel --live qemu+ssh://$target_host/system --verbose
    Migration: [100 %]

    # virsh domjobinfo rhel --completed
    Job type:         Completed
    Operation:        Outgoing migration
    Time elapsed:     1124 ms
    Time elapsed w/o network: 1122 ms
    Data processed:   3.305 MiB
    Data remaining:   0.000 B
    Data total:       1.102 GiB
    Memory processed: 3.305 MiB
    Memory remaining: 0.000 B
    Memory total:     1.102 GiB
    Memory bandwidth: 38.463 MiB/s
    Dirty rate:       0 pages/s
    Iteration:        2
    Constant pages:   288783
    Normal pages:     211
    Normal data:      844.000 KiB
    Total downtime:   59 ms
    Downtime w/o network: 57 ms
    Setup time:       6 ms

Per comment 11 & comment 12, move to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0403