Bug 1530130

Summary: Target host in nova DB got updated to new compute while migration failed and qemu-kvm process was still running on source host. [rhel-7.4.z]
Product: Red Hat Enterprise Linux 7
Reporter: Oneata Mircea Teodor <toneata>
Component: libvirt
Assignee: Jiri Denemark <jdenemar>
Status: CLOSED ERRATA
QA Contact: zhe peng <zpeng>
Severity: high
Docs Contact:
Priority: high
Version: 7.4
CC: berrange, dasmith, dgilbert, dyuan, eglynn, fjin, jdenemar, jsuchane, kchamart, libvirt-maint, mfuruta, mkalinin, molasaga, mschuppe, mtessun, pbarta, rbalakri, rbryant, sbauza, sferdjao, sgordon, smykhail, srevivo, vromanso, xuzhang, yafu
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: libvirt-3.2.0-14.el7_4.9
Doc Type: Bug Fix
Doc Text:
Cause: Libvirt reported a migration as completed in the migration statistics immediately after QEMU finished sending data to the destination.
Consequence: Management software monitoring the migration could see it as finished even though the domain might still fail to start on the destination host.
Fix: Libvirt was patched to report the migration as completed only once the domain is already running on the destination.
Result: Management software no longer misinterprets a failed migration as a successful one.
Story Points: ---
Clone Of: 1401173
Environment:
Last Closed: 2018-03-06 21:41:17 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1401173
Bug Blocks:
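
As an illustration of the fixed semantics described in the Doc Text above, the following is a minimal monitoring sketch assuming the libvirt-python bindings (the URI is illustrative; 'rhel' is the domain used in the tests below). With the fix, the job only leaves the "unbounded" state once the domain is running on the destination, so the end of the job can safely be treated as the end of the migration.

import time
import libvirt

conn = libvirt.open('qemu:///system')    # source host
dom = conn.lookupByName('rhel')

# Poll the active job until it ends; with the fixed libvirt this only
# happens after the domain is already running on the destination.
while True:
    try:
        info = dom.jobInfo()             # info[0] is the job type
    except libvirt.libvirtError:
        break                            # domain may already be gone
    if info[0] != libvirt.VIR_DOMAIN_JOB_UNBOUNDED:
        break
    time.sleep(1)

# Statistics of the last finished job (this too may fail once the
# domain has been cleaned up on the source; a sketch, not robust code).
print(dom.jobStats(libvirt.VIR_DOMAIN_JOB_STATS_COMPLETED))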

Description Oneata Mircea Teodor 2018-01-02 06:58:57 UTC
This bug has been copied from bug #1401173 and proposed for backporting to the 7.4 z-stream (EUS).

Comment 3 Jiri Denemark 2018-01-11 21:54:26 UTC
The patch mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1401173#c34 caused a regression in reporting statistics of a completed job. See bug 1523036 for more details and an additional patch which will need to be backported to avoid the regression in 7.4.z.

Comment 6 zhe peng 2018-01-17 09:27:47 UTC
I can reproduce this with build:
libvirt-3.2.0-14.el7.x86_64

Verified with build:
libvirt-3.2.0-14.el7_4.8.x86_64

Steps:
1. Prepare the migration environment (two hosts).
2. On the destination host, attach gdb to libvirtd, set a breakpoint on
    qemuMigrationFinish, and let the daemon continue:
    # gdb -p $(pidof libvirtd)
    (gdb) br qemuMigrationFinish
    (gdb) c

3. Migrate a domain to the destination host:
# virsh migrate rhel --live qemu+ssh://$target_host/system --verbose

4. Once gdb stops at the breakpoint, check 'virsh domjobinfo DOM' on the source
   host (a scripted version of this check appears after these steps):
# virsh domjobinfo rhel
Job type:         Unbounded   
Operation:        Outgoing migration
Time elapsed:     5773         ms
Data processed:   169.265 MiB
Data remaining:   0.000 B
Data total:       1.102 GiB
Memory processed: 169.265 MiB
Memory remaining: 0.000 B
Memory total:     1.102 GiB
Memory bandwidth: 109.149 MiB/s
Dirty rate:       0            pages/s
Iteration:        3           
Constant pages:   742625      
Normal pages:     127638      
Normal data:      498.586 MiB
Expected downtime: 20           ms
Setup time:       9            ms


5. Kill the qemu-kvm process on the destination host.

6. Let gdb continue executing libvirtd (this will likely need to be done
   twice, since gdb may stop at SIGPIPE after the first continue):
    (gdb) c

7. Check that the migration failed and the domain is still running on the source:

Migration: [100 %]error: internal error: qemu unexpectedly closed the monitor: 2018-01-17T09:21:38.632270Z qemu-kvm: -chardev pty,id=charserial0: char device redirected to /dev/pts/1 (label charserial0)

8. Check the guest on the source host:
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     rhel                           running
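
A scripted version of the step 4 check, as referenced above: a minimal sketch assuming the libvirt-python bindings, run on the source host while gdb still holds the destination in qemuMigrationFinish.

import libvirt

# While the destination is blocked in qemuMigrationFinish, the source
# must still report an active (unbounded) job rather than a completed
# one, even though "Data remaining" is already 0.
conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('rhel')

stats = dom.jobStats()
assert stats['type'] == libvirt.VIR_DOMAIN_JOB_UNBOUNDED, \
    'migration must not be reported as completed yet'
print('data remaining:', stats.get('data_remaining'))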

Comment 7 zhe peng 2018-01-18 03:27:54 UTC
Hi jirka,
  I found an issue while doing some exploratory testing of this patch; could you help check whether it is a regression?
  Below is the output of domjobinfo with --completed:
# virsh domjobinfo rhel --completed
Job type:         Completed   
Operation:        Outgoing migration
Time elapsed:     2053         ms
Time elapsed w/o network: 2041         ms
Total downtime:   80           ms
Downtime w/o network: 68           ms

But with libvirt-3.2.0-14.el7.x86_64, it is:
# virsh domjobinfo rhel --completed
Job type:         Completed   
Operation:        Outgoing migration
Time elapsed:     5822         ms
Time elapsed w/o network: 5817         ms
Data processed:   595.598 MiB
Data remaining:   0.000 B
Data total:       1.102 GiB
Memory processed: 595.598 MiB
Memory remaining: 0.000 B
Memory total:     1.102 GiB
Memory bandwidth: 111.518 MiB/s
Dirty rate:       0            pages/s
Iteration:        16          
Constant pages:   193151      
Normal pages:     151752      
Normal data:      592.781 MiB
Total downtime:   383          ms
Downtime w/o network: 378          ms
Setup time:       12           ms

Several fields (the data, memory, and page statistics) are missing from the output.
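
One way to compare the two builds programmatically is to dump the raw field names of the completed job statistics; a small sketch assuming the libvirt-python bindings:

import libvirt

# On the regressed build the data/memory/page counters are absent from
# this dict; on a good build they are present.
conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('rhel')

stats = dom.jobStats(libvirt.VIR_DOMAIN_JOB_STATS_COMPLETED)
print(sorted(stats.keys()))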

Comment 8 Jiri Denemark 2018-01-19 13:13:04 UTC
Yeah, it's a regression. When backporting the patches I intentionally skipped some refactoring patches and didn't properly adjust the rest.

Comment 9 Jiri Denemark 2018-01-19 13:15:55 UTC
The patch mentioned in comment 3, which was supposed to fix a regression, may crash libvirtd in some cases. See bug 1536351 for more details. In other words, one more patch is needed here.

Comment 11 yafu 2018-01-29 06:43:03 UTC
Verified the issue from comment 9 with libvirt-3.2.0-14.el7_4.9.

Test steps:
1. Do migration with the '--persistent' and '--offline' options:
# virsh migrate rhel qemu+ssh://10.66.4.116/system --offline --verbose --persistent
root@10.66.4.116's password: 
Migration: [100 %]
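
The same offline check can also be driven through the API; a sketch assuming the libvirt-python bindings (destination IP as in the virsh command above):

import libvirt

# API equivalent of the virsh command above: VIR_MIGRATE_OFFLINE only
# copies the persistent definition and requires VIR_MIGRATE_PERSIST_DEST.
src = libvirt.open('qemu:///system')
dst = libvirt.open('qemu+ssh://10.66.4.116/system')
dom = src.lookupByName('rhel')

dom.migrate(dst, libvirt.VIR_MIGRATE_OFFLINE | libvirt.VIR_MIGRATE_PERSIST_DEST,
            None, None, 0)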

Comment 12 zhe peng 2018-01-29 07:55:32 UTC
Verified comment 7 with build libvirt-3.2.0-14.el7_4.9:

# virsh migrate rhel --live qemu+ssh://$target_host/system --verbose 
Migration: [100 %]
# virsh domjobinfo rhel --completed
Job type:         Completed   
Operation:        Outgoing migration
Time elapsed:     1124         ms
Time elapsed w/o network: 1122         ms
Data processed:   3.305 MiB
Data remaining:   0.000 B
Data total:       1.102 GiB
Memory processed: 3.305 MiB
Memory remaining: 0.000 B
Memory total:     1.102 GiB
Memory bandwidth: 38.463 MiB/s
Dirty rate:       0            pages/s
Iteration:        2           
Constant pages:   288783      
Normal pages:     211         
Normal data:      844.000 KiB
Total downtime:   59           ms
Downtime w/o network: 57           ms
Setup time:       6            ms

Comment 13 zhe peng 2018-01-29 07:56:11 UTC
Per comment 11 and comment 12, moving to VERIFIED.

Comment 16 errata-xmlrpc 2018-03-06 21:41:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0403