Bug 669581

Summary: Migration never ends when a firewall rejects the migration TCP port
Product: Red Hat Enterprise Linux 6 Reporter: Mike Cao <bcao>
Component: qemu-kvm Assignee: Juan Quintela <quintela>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 6.1CC: akong, bcao, berrange, ehabkost, gcosta, gren, jdenemar, juzhang, lcapitulino, michen, mkenneth, owasserm, pbonzini, quintela, qzhang, shu, syeghiay, tburke, virt-maint, ykaul
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: qemu-kvm-0.12.1.2-2.207.el6 Doc Type: Bug Fix
Doc Text:
Cause: functions in the migration code didn't handle and report errors correctly. Consequence: migration never ended if the connection to the destination migration port was rejected (e.g. by a firewall). Fix: multiple fixes to error detection, reporting, and handling in the migration code. Result: more reliable handling of errors during migration; e.g. errors when the connection to the destination migration port is rejected are properly detected and migration is aborted.
Story Points: ---
Clone Of: 654937
: 749806 Environment:
Last Closed: 2011-12-06 15:44:02 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 580953, 725373, 750914, 799478    
Attachments:
Description Flags
Avoiding hang flushing buffers when migration is failed/cancelled none

Comment 1 Mike Cao 2011-01-14 02:46:43 UTC
Reproduced on 
# uname -r
2.6.32-94.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.129.el6.x86_64

Comment 2 Juan Quintela 2011-01-17 12:47:26 UTC
The only sane way of fixing this bug is to thread the migration code.  We will try to do that for 6.2.  There is no way to get it working for 6.1 or sooner.

Comment 3 Luiz Capitulino 2011-01-17 12:52:51 UTC
Proposing for 6.2 due to the previous comment; also lowering to tier 3, as this is a minor issue.

Comment 4 Mike Cao 2011-03-01 05:53:18 UTC
Retried on 
# uname -r
2.6.32-118.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.148.el6.x86_64

After the steps in comment #0,
then run (qemu) migrate_cancel

Actual Results:
(qemu) migrate_cancel 

The migrate_cancel command hangs.
The qemu-kvm process froze and cannot be connected to by vncviewer.

Comment 5 Luiz Capitulino 2011-03-01 13:29:18 UTC
This seems a bit more serious; I'd expect migrate_cancel to work in cases like this. But I'm not sure what the priority of this should be, i.e. should we still fix it in 6.1 or keep deferring it to 6.2? Any comments, Juan?

Comment 9 Daniel Berrangé 2011-08-26 10:19:57 UTC
> This seems a bit more serious, I'd expect migration_cancel to work in cases
> like this.

It seems pretty easy to make migrate_cancel work.

When migrate_cancel gets stuck on a blocked TCP connection (eg firewalled, or just SIGSTOP the dest QEMU), I see a stack trace of:

(gdb) bt
#0  0x0000003a8320e4c2 in __libc_send (fd=10, buf=0x1bc7c70, n=19777, flags=0)
    at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
#1  0x000000000048fb1e in socket_write (s=<optimized out>, buf=<optimized out>, size=<optimized out>)
    at migration-tcp.c:39
#2  0x000000000048eba4 in migrate_fd_put_buffer (opaque=0x1b76ad0, data=0x1bc7c70, size=19777) at migration.c:324
#3  0x000000000048e442 in buffered_flush (s=0x1b76b90) at buffered_file.c:87
#4  0x000000000048e4cf in buffered_close (opaque=0x1b76b90) at buffered_file.c:177
#5  0x0000000000496d57 in qemu_fclose (f=0x1bbfc10) at savevm.c:479
#6  0x000000000048f4ca in migrate_fd_cleanup (s=0x1b76ad0) at migration.c:291
#7  0x000000000048f035 in do_migrate_cancel (mon=<optimized out>, qdict=<optimized out>, 
    ret_data=<optimized out>) at migration.c:136
#8  0x0000000000418bc5 in handle_user_command (mon=0x18817b0, cmdline=<optimized out>)
    at /home/berrange/src/virt/qemu/monitor.c:4401
#9  0x0000000000418fce in monitor_command_cb (mon=0x18817b0, cmdline=<optimized out>, opaque=<optimized out>)
    at /home/berrange/src/virt/qemu/monitor.c:5047
#10 0x000000000046d899 in readline_handle_byte (rs=0x1881c20, ch=<optimized out>) at readline.c:370
#11 0x0000000000418dbc in monitor_read (opaque=<optimized out>, buf=0x7fff0e635a60 "\r", size=1)
    at /home/berrange/src/virt/qemu/monitor.c:5033
#12 0x000000000049250b in qemu_chr_read (len=<optimized out>, buf=0x7fff0e635a60 "\r", s=0x187d2b0)
    at qemu-char.c:164
#13 fd_chr_read (opaque=0x187d2b0) at qemu-char.c:579
#14 0x000000000056954e in main_loop_wait (nonblocking=<optimized out>) at /home/berrange/src/virt/qemu/vl.c:1367
#15 0x000000000040b465 in main_loop () at /home/berrange/src/virt/qemu/vl.c:1424
#16 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
    at /home/berrange/src/virt/qemu/vl.c:3138


The migrate_fd_cleanup method is where the problem starts. Specifically, it does:

    if (s->file) {
        DPRINTF("closing file\n");
        if (qemu_fclose(s->file) != 0) {
            ret = -1;
        }
        s->file = NULL;
    }

    if (s->fd != -1)
        close(s->fd);


It gets stuck in the qemu_fclose() call because that call tries to flush buffers.

It is hard to tell qemu_fclose() directly that it shouldn't flush buffers, so the alternative is to ensure that the flush fails quickly. This is easily achieved, simply by closing 's->fd' *before* calling qemu_fclose().

Comment 10 Daniel Berrangé 2011-08-26 10:20:44 UTC
Created attachment 520055 [details]
Avoiding hang flushing buffers when migration is failed/cancelled

Comment 12 Daniel Berrangé 2011-08-26 11:27:33 UTC
Posted this upstream for comments, along with an alternative version

http://lists.nongnu.org/archive/html/qemu-devel/2011-08/msg03245.html

Comment 13 Miya Chen 2011-10-14 05:40:39 UTC
Hi Juan,
What's the progress here? This bug is currently targeted for RHEL 6.2.

Comment 19 Qunfang Zhang 2011-10-27 09:43:12 UTC
Reproduced this issue on qemu-kvm-0.12.1.2-2.200.el6.
Steps:
1.Boot a guest on host A:
/usr/libexec/qemu-kvm -M rhel6.2.0 -cpu cpu64-rhel6,+x2apic  -enable-kvm -m 2048 -smp 2,sockets=2,cores=1,threads=1 -name RHEL6 -uuid 821af33f-9b98-4580-bd96-1f82f96280a4 -monitor stdio -rtc base=localtime -boot c -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x4 -drive file=/media/rhel6u2.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop -device ide-drive,bus=ide.0,unit=0,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:10:20:3a,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/tmp/foo,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -usb -device usb-tablet -vnc :10

2. Boot the guest in listening mode on host B "-incoming tcp:0:5800"

3. On host B: #iptables -F

4. On host A:
(qemu) migrate -d tcp:$host_B_IP:5800
On host B (after migration starts and before it completes):
#iptables -A INPUT -p tcp --dport 5800 -j REJECT

5. On host A:
(qemu) info migrate
(qemu) migrate_cancel
(qemu)

Results: qemu-kvm freezes after issuing the "migrate_cancel" command in the qemu monitor.

Verified on qemu-kvm-0.12.1.2-2.204.el6 with the same steps as above. After step 5, the "migrate_cancel" command returns and "info migrate" shows the migration status is cancelled.

Comment 20 Qunfang Zhang 2011-10-27 09:54:27 UTC
But there is a segmentation fault under some conditions in the source-host qemu-kvm.
(For qemu-kvm-0.12.1.2-2.204.el6)

Continue with the steps in Comment 19.

5.On host A:
(qemu) info migrate
(qemu) migrate_cancel
(qemu)info migrate
Migration status: cancelled

6.On host B:
#iptables -F

7.On host A:
(qemu)migrate -d tcp:$host_B_IP:5800

Result: 

On host A (source host) side:
Get segmentation fault.

(qemu) info migrate 
Migration status: active
transferred ram: 148 kbytes
remaining ram: 2113784 kbytes
total ram: 2113920 kbytes
(qemu) 
Program received signal SIGSEGV, Segmentation fault.
qemu_file_get_error (f=0x0) at savevm.c:429

(gdb) bt
#0  qemu_file_get_error (f=0x0) at savevm.c:429
#1  0x00000000004bad97 in migrate_fd_put_notify (opaque=0xeed870) at migration.c:333
#2  0x000000000040c57e in main_loop_wait (timeout=1000)
    at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4029
#3  0x000000000042aeaa in kvm_main_loop () at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2225
#4  0x000000000040de35 in main_loop (argc=<value optimized out>, argv=<value optimized out>, 
    envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4234
#5  main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:6470

On host B:
(qemu) qemu: warning: error while loading state section id 3
load of migration failed

In any case, we should not get a segmentation fault.

Hi Juan,
Could you take a look?

Comment 21 Luiz Capitulino 2011-10-27 12:58:12 UTC
The first step here is to find out whether this issue is being caused by the fix for this bug. If it is, then the fix should be reverted.

If it's not, then a new bz should be opened and we should try to see if it's a regression. If it is a regression, then this is a high priority blocker.

If the segfault is not a regression (i.e. it always existed), it is probably a blocker too and also has to be fixed in the z-stream.

Comment 22 Qunfang Zhang 2011-10-28 02:41:59 UTC
It seems I cannot reproduce this with the pre-patch version qemu-kvm-0.12.1.2-2.200.el6.
I boot the guest on the source host and boot in listening mode on the destination host.
Then I do migration and migrate_cancel. At that point the qemu-kvm on the destination side quits with "load of migration failed" because I cancelled the migration on the source, so I have to boot the destination side again and re-migrate. Under that condition I do not hit the segmentation fault.

For the issue described in comment 20 on the latest qemu-kvm-204: after I cancelled the migration on the source host, the qemu-kvm process on the destination side was sometimes still there. So when I re-migrated, the segmentation fault happened.

Comment 23 juzhang 2011-10-28 09:15:00 UTC
FYI,
when we verified bz744518 with kvm-205, we hit this issue as well.

Could you please tell us the preferred solution?
1. Revert the patches.
If so, I will set this issue back to ASSIGNED.
2. Open a new bug, close this issue and bz744518, and track the new bz.
If so, should the new bug be against rhel6.2 or rhel6.3?
3. Or another option?

Thanks

Comment 24 Eduardo Habkost 2011-10-28 13:52:55 UTC
Yes, the segfault is caused by one of the patches included to fix this Bug. More precisely:

    migration: add error handling to migrate_fd_put_notify().
    
    RH-Author: Juan Quintela <quintela>
    Message-id: <bf976e6ee205a6ef3511cf6d3f371fe7944e64d7.1319066770.git.quintela>
    Patchwork-id: 34431
    O-Subject: [PATCH qemu-kvm RHEL-6.2 05/16] migration: add error handling to migrate_fd_put_notify().
    Bugzilla: 669581


It's a really large series to revert; I will check whether reverting it is feasible. I will open a new Bug so everybody is aware of the issue.

Comment 25 Eduardo Habkost 2011-10-28 17:05:23 UTC
Segfault bug reported as bug 749806.

Comment 28 Eduardo Habkost 2011-10-28 17:45:10 UTC
Patches were _reverted_ on qemu-kvm-0.12.1.2-2.204.el6.

Comment 32 Qunfang Zhang 2011-11-01 03:18:22 UTC
Re-tested with qemu-kvm-0.12.1.2-2.207.el6; did not hit the problem any more.
We will also run some functional tests for migration. I will change the bug status to VERIFIED after all the migration tests finish, if they pass.

Thanks.

Comment 36 Eduardo Habkost 2011-11-22 13:19:44 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: functions in the migration code didn't handle and report errors correctly.

Consequence: migration never ended if the connection to the destination migration port was rejected (e.g. by a firewall).


Fix: multiple fixes to error detection, reporting, and handling in the migration code.

Result: more reliable handling of errors during migration. E.g. errors when the connection to the destination migration port is rejected are properly detected and migration is aborted.

Comment 37 errata-xmlrpc 2011-12-06 15:44:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1531.html