Bug 669581
Summary: | Migration never ends when a firewall rejects the migration TCP port | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Mike Cao <bcao> | ||||
Component: | qemu-kvm | Assignee: | Juan Quintela <quintela> | ||||
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 6.1 | CC: | akong, bcao, berrange, ehabkost, gcosta, gren, jdenemar, juzhang, lcapitulino, michen, mkenneth, owasserm, pbonzini, quintela, qzhang, shu, syeghiay, tburke, virt-maint, ykaul | ||||
Target Milestone: | rc | ||||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | qemu-kvm-0.12.1.2-2.207.el6 | Doc Type: | Bug Fix | ||||
Doc Text: |
Cause: functions in the migration code did not handle and report errors correctly.
Consequence: migration never ended if the connection to the destination migration port was rejected (e.g. by a firewall).
Fix: multiple fixes to error detection, reporting, and handling in the migration code.
Result: more reliable handling of errors during migration; e.g. errors when the connection to the destination migration port is rejected are properly detected and the migration is aborted.
|
Story Points: | --- | ||||
Clone Of: | 654937 | ||||||
: | 749806 (view as bug list) | Environment: | |||||
Last Closed: | 2011-12-06 15:44:02 UTC | Type: | --- | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 580953, 725373, 750914, 799478 | ||||||
Attachments: |
|
Comment 1
Mike Cao
2011-01-14 02:46:43 UTC
The only sane way of fixing this bug is to thread the migration code. We will try to do that for 6.2; there is no way to get it working for 6.1 or sooner.

Proposing for 6.2 due to the previous comment; also lowering to tier3, as this is a minor issue.

Retried on:
# uname -r
2.6.32-118.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.148.el6.x86_64

After the steps in comment #0, exec:
(qemu) migrate_cancel

Actual results:
The migrate_cancel command hangs; the qemu-kvm process freezes and can no longer be connected to with vncviewer.

This seems a bit more serious; I'd expect migrate_cancel to work in cases like this. But I'm not sure what the priority of this should be, i.e. should we still fix it in 6.1 or keep deferring it to 6.2? Any comments, Juan?

> This seems a bit more serious, I'd expect migrate_cancel to work in cases
> like this.
It seems pretty easy to make migrate_cancel work.
When migrate_cancel gets stuck on a blocked TCP connection (e.g. firewalled, or simply after SIGSTOP of the destination QEMU), I see the following stack trace:
(gdb) bt
#0 0x0000003a8320e4c2 in __libc_send (fd=10, buf=0x1bc7c70, n=19777, flags=0)
at ../sysdeps/unix/sysv/linux/x86_64/send.c:28
#1 0x000000000048fb1e in socket_write (s=<optimized out>, buf=<optimized out>, size=<optimized out>)
at migration-tcp.c:39
#2 0x000000000048eba4 in migrate_fd_put_buffer (opaque=0x1b76ad0, data=0x1bc7c70, size=19777) at migration.c:324
#3 0x000000000048e442 in buffered_flush (s=0x1b76b90) at buffered_file.c:87
#4 0x000000000048e4cf in buffered_close (opaque=0x1b76b90) at buffered_file.c:177
#5 0x0000000000496d57 in qemu_fclose (f=0x1bbfc10) at savevm.c:479
#6 0x000000000048f4ca in migrate_fd_cleanup (s=0x1b76ad0) at migration.c:291
#7 0x000000000048f035 in do_migrate_cancel (mon=<optimized out>, qdict=<optimized out>,
ret_data=<optimized out>) at migration.c:136
#8 0x0000000000418bc5 in handle_user_command (mon=0x18817b0, cmdline=<optimized out>)
at /home/berrange/src/virt/qemu/monitor.c:4401
#9 0x0000000000418fce in monitor_command_cb (mon=0x18817b0, cmdline=<optimized out>, opaque=<optimized out>)
at /home/berrange/src/virt/qemu/monitor.c:5047
#10 0x000000000046d899 in readline_handle_byte (rs=0x1881c20, ch=<optimized out>) at readline.c:370
#11 0x0000000000418dbc in monitor_read (opaque=<optimized out>, buf=0x7fff0e635a60 "\r", size=1)
at /home/berrange/src/virt/qemu/monitor.c:5033
#12 0x000000000049250b in qemu_chr_read (len=<optimized out>, buf=0x7fff0e635a60 "\r", s=0x187d2b0)
at qemu-char.c:164
#13 fd_chr_read (opaque=0x187d2b0) at qemu-char.c:579
#14 0x000000000056954e in main_loop_wait (nonblocking=<optimized out>) at /home/berrange/src/virt/qemu/vl.c:1367
#15 0x000000000040b465 in main_loop () at /home/berrange/src/virt/qemu/vl.c:1424
#16 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>)
at /home/berrange/src/virt/qemu/vl.c:3138
The migrate_fd_cleanup method is where the problem starts. Specifically, it does:
if (s->file) {
DPRINTF("closing file\n");
if (qemu_fclose(s->file) != 0) {
ret = -1;
}
s->file = NULL;
}
if (s->fd != -1)
close(s->fd);
and gets stuck in the qemu_fclose() call because it is trying to flush buffers.
It is hard to tell qemu_fclose() that it shouldn't flush buffers directly, so the alternative is to ensure that this method fails quickly. This is easily achieved, simply by closing 's->fd' *before* calling qemu_fclose().
Created attachment 520055 [details]
Avoiding hang flushing buffers when migration is failed/cancelled
Posted this upstream for comments, along with an alternative version: http://lists.nongnu.org/archive/html/qemu-devel/2011-08/msg03245.html

Hi Juan, what's the progress here? This bug is currently targeted for rhel6.2.

Reproduced this issue on qemu-kvm-0.12.1.2-2.200.el6. Steps:

1. Boot a guest on host A:
/usr/libexec/qemu-kvm -M rhel6.2.0 -cpu cpu64-rhel6,+x2apic -enable-kvm -m 2048 -smp 2,sockets=2,cores=1,threads=1 -name RHEL6 -uuid 821af33f-9b98-4580-bd96-1f82f96280a4 -monitor stdio -rtc base=localtime -boot c -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x4 -drive file=/media/rhel6u2.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop -device ide-drive,bus=ide.0,unit=0,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:10:20:3a,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/tmp/foo,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -usb -device usb-tablet -vnc :10
2. Boot the guest in listening mode on host B with "-incoming tcp:0:5800"
3. On host B:
# iptables -F
4. On host A:
(qemu) migrate -d tcp:$host_B_IP:5800
On host B (after migration starts and before it completes):
# iptables -A INPUT -p tcp --dport 5800 -j REJECT
5. On host A:
(qemu) info migrate
(qemu) migrate_cancel

Results: qemu-kvm freezes after issuing the "migrate_cancel" command in the qemu monitor.

Verified on qemu-kvm-0.12.1.2-2.204.el6 with the same steps as above. After step 5, the "migrate_cancel" command returns and "info migrate" shows the migration status as cancelled. But under some conditions there is a segmentation fault in the source host qemu-kvm.

(For qemu-kvm-0.12.1.2-2.204.el6) Continue with the steps in Comment 19:
5. On host A:
(qemu) info migrate
(qemu) migrate_cancel
(qemu) info migrate
Migration status: cancelled
6. On host B:
# iptables -F
7. On host A:
(qemu) migrate -d tcp:$host_B_IP:5800

Result: on host A (source host) side, a segmentation fault:

(qemu) info migrate
Migration status: active
transferred ram: 148 kbytes
remaining ram: 2113784 kbytes
total ram: 2113920 kbytes
(qemu)
Program received signal SIGSEGV, Segmentation fault.
qemu_file_get_error (f=0x0) at savevm.c:429
(gdb) bt
#0 qemu_file_get_error (f=0x0) at savevm.c:429
#1 0x00000000004bad97 in migrate_fd_put_notify (opaque=0xeed870) at migration.c:333
#2 0x000000000040c57e in main_loop_wait (timeout=1000) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4029
#3 0x000000000042aeaa in kvm_main_loop () at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2225
#4 0x000000000040de35 in main_loop (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4234
#5 main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:6470

On host B:
(qemu) qemu: warning: error while loading state section id 3
load of migration failed

In any case, we should not get a segmentation fault. Hi Juan, could you help have a look?

The first step here is to find out whether this issue is caused by the fix for this bug. If it is, the fix should be reverted. If it is not, a new bz should be opened and we should try to see if it is a regression. If it is a regression, then this is a high-priority blocker. If the segfault is not a regression (i.e. it always existed), it is probably a blocker too and also has to be fixed in the z-stream.

It seems I cannot reproduce this in the pre-patched version qemu-kvm-0.12.1.2-2.200.el6. I boot the guest on the source host and boot in listening mode on the dst host, then do migration and migrate_cancel.
But at this point the qemu-kvm on the dst side quits with "load of migration failed" because I cancelled the migration on the source, so I have to boot the dst side again and re-migrate. In this condition I do not hit the segmentation fault.

For the issue described in Comment 20 on the latest qemu-kvm-204: after I cancelled the migration on the source host, the qemu-kvm process on the dst side was sometimes still there, so when I re-migrated, the segmentation fault happened. FYI, when we verified bz744518 with kvm-205, we hit this issue as well.

Would you please tell us the solution?
1. Revert; if so, I will set this issue back to ASSIGNED.
2. Open a new bug, close this issue and bz744518, and track the new bz. If so, should the new bug be against rhel6.2 or rhel6.3?
3. Or another option?
Thanks

Yes, the segfault is caused by one of the patches included to fix this bug. More precisely:

migration: add error handling to migrate_fd_put_notify().
RH-Author: Juan Quintela <quintela>
Message-id: <bf976e6ee205a6ef3511cf6d3f371fe7944e64d7.1319066770.git.quintela>
Patchwork-id: 34431
O-Subject: [PATCH qemu-kvm RHEL-6.2 05/16] migration: add error handling to migrate_fd_put_notify().
Bugzilla: 669581

It's a really large series to revert; I will check if reverting it is feasible. I will open a new bug so everybody becomes aware of the issue.

Segfault bug reported as bug 749806. Patches were _reverted_ in qemu-kvm-0.12.1.2-2.204.el6.

Re-tested with qemu-kvm-0.12.1.2-2.207.el6 and did not hit the problem any more. We will run some functional tests for migration; I will change the bug status to VERIFIED after all the migration tests finish, if they pass. Thanks.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: functions in the migration code did not handle and report errors correctly.
Consequence: migration never ended if the connection to the destination migration port was rejected (e.g. by a firewall).
Fix: multiple fixes to error detection, reporting, and handling in the migration code.
Result: more reliable handling of errors during migration; e.g. errors when the connection to the destination migration port is rejected are properly detected and the migration is aborted.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1531.html