Bug 684076

Summary: Segfault occurred during migration
Product: Red Hat Enterprise Linux 6
Reporter: Amos Kong <akong>
Component: qemu-kvm
Assignee: Michael S. Tsirkin <mst>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 6.1
CC: ailan, ehabkost, jasowang, lcapitulino, mkenneth, tburke, virt-maint
Target Milestone: rc
Keywords: TestBlocker
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: qemu-kvm-0.12.1.2-2.151.el6
Doc Type: Bug Fix
Doc Text:
Cause: a bug in address conversion in the vhost migration dirty page handling code. Consequence: qemu-kvm segmentation faults during live migration of KVM VMs. Fix: corrected the address conversion in the vhost migration dirty page handling code. Result: qemu-kvm no longer crashes during live migration.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-05-19 11:27:18 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 580951
Attachments (description / flags):
  debugging patch (none)
  Debug msg after loading mst's patch (none)
  extra debugging patch: apply on top (none)
  debug msg after loading second patch (none)
  another debug, pls apply on top of the previous ones (none)
  fix build (none)
  debug-msg-after-loading-3rd-patch.txt (none)
  first proposed patch (none)

Description Amos Kong 2011-03-11 02:42:55 UTC
Description of problem:
When I try to do internal migration, the migration does not complete within 600 seconds. Sometimes, a segfault occurs.
vhost is enabled.


Program terminated with signal 11, Segmentation fault.
#0  cpu_physical_memory_set_dirty (dev=<value optimized out>, mfirst=<value optimized out>, mlast=<value optimized out>, rfirst=<value optimized out>, rlast=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/cpu-all.h:909
909         if (!cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG))
...
(gdb) bt
#0  cpu_physical_memory_set_dirty (dev=<value optimized out>, mfirst=<value optimized out>, mlast=<value optimized out>, rfirst=<value optimized out>, rlast=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/cpu-all.h:909
#1  vhost_dev_sync_region (dev=<value optimized out>, mfirst=<value optimized out>, mlast=<value optimized out>, rfirst=<value optimized out>, rlast=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/hw/vhost.c:53
#2  0x0000000000425681 in vhost_client_sync_dirty_bitmap (client=0x10e0df0, start_addr=0, end_addr=18446744073709551615) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/vhost.c:71
#3  0x00000000004eb62d in cpu_notify_sync_dirty_bitmap (start_addr=0, end_addr=18446744073709551615) at /usr/src/debug/qemu-kvm-0.12.1.2/exec.c:1651
#4  cpu_physical_sync_dirty_bitmap (start_addr=0, end_addr=18446744073709551615) at /usr/src/debug/qemu-kvm-0.12.1.2/exec.c:2020
#5  0x000000000040ac1a in ram_save_live (mon=<value optimized out>, f=0x3ac6070, stage=2, opaque=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:3387
#6  0x00000000004c0949 in qemu_savevm_state_iterate (mon=0x0, f=0x3ac6070) at savevm.c:1513
#7  0x00000000004b9afd in migrate_fd_put_ready (opaque=0x118af60) at migration.c:392
#8  0x00000000004b9722 in buffered_put_buffer (opaque=0x138df50, buf=0x0, pos=0, size=0) at buffered_file.c:163
#9  0x000000000040bb56 in main_loop_wait (timeout=1000) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4433
#10 0x000000000042b44a in kvm_main_loop () at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2164
#11 0x000000000040eeb5 in main_loop (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4638
#12 main (argc=<value optimized out>, argv=<value optimized out>, envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:6852


Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.150.el6.x86_64
host kernel: 2.6.32-120.el6.x86_64

How reproducible:
not always

Steps to Reproduce:
1. launch src vm
2. launch dst vm
3. execute the migrate command in the monitor of the src VM
   (tcp or unix transport)
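
For step 3, migration is started from the HMP monitor of the source VM; a sketch of the two transports, with the host name, port, and socket path purely illustrative:

```
(qemu) migrate -d tcp:dst-host:5200
(qemu) info migrate

(qemu) migrate -d unix:/tmp/migrate.sock
```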
  
Actual results:
1. Migration could not complete within 600 seconds.
2. Sometimes, a segfault occurred.

Expected results:
Migration completes successfully.

Additional info:

1. src VM cmdline:
# qemu-kvm -name 'vm1' -chardev socket,id=human_monitor_A0Ll,path=/tmp/monitor-humanmonitor1-20110310-212223-498e,server,nowait -mon chardev=human_monitor_A0Ll,mode=readline -chardev socket,id=serial_2GGr,path=/tmp/serial-20110310-212223-498e,server,nowait -device isa-serial,chardev=serial_2GGr -drive file='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/images/RHEL-Server-6.0-64-virtio.qcow2',index=0,if=none,id=drive-virtio-disk1,media=disk,cache=none,format=qcow2,aio=native -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk1,id=virtio-disk1 -device virtio-net-pci,netdev=id4hU3V4,mac=9a:24:36:f4:7f:d4,netdev=id4hU3V4,id=ndev00id4hU3V4,bus=pci.0,addr=0x3 -netdev tap,id=id4hU3V4,vhost=on,ifname='t0-212223-498e',script='/usr/local/staf/test/RHEV/kvm-new/autotest/client/tests/kvm/scripts/qemu-ifup-switch',downscript='no' -m 8192 -smp 4,cores=1,threads=1,sockets=4 -cpu cpu64-rhel6,+sse2,+x2apic -vnc :0 -rtc base=utc,clock=host,driftfix=none  -boot order=cdn,once=c,menu=off   -usbdevice tablet -no-kvm-pit-reinjection -enable-kvm 


2. host info:
cpu-core: 12
memory: 31G 

processor	: 11
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 8
model name	: Six-Core AMD Opteron(tm) Processor 2427
stepping	: 0
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 1
siblings	: 6
core id		: 5
cpu cores	: 6
apicid		: 13
initial apicid	: 13
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv svm_lock nrip_save
bogomips	: 4399.62
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Comment 5 Michael S. Tsirkin 2011-03-14 15:35:50 UTC
Created attachment 484225 [details]
debugging patch

Comment 10 Amos Kong 2011-03-15 05:05:17 UTC
Created attachment 484351 [details]
Debug msg after loading mst's patch

Comment 12 Michael S. Tsirkin 2011-03-15 07:38:50 UTC
Created attachment 484385 [details]
extra debugging patch: apply on top

Comment 13 Amos Kong 2011-03-15 07:53:29 UTC
Created attachment 484387 [details]
debug msg after loading second patch

Comment 14 Michael S. Tsirkin 2011-03-15 12:40:10 UTC
Created attachment 484453 [details]
another debug, pls apply on top of the previous ones

Comment 15 Michael S. Tsirkin 2011-03-15 12:45:46 UTC
Created attachment 484455 [details]
fix build

Comment 16 Amos Kong 2011-03-15 12:51:54 UTC
Created attachment 484457 [details]
debug-msg-after-loading-3rd-patch.txt

Comment 18 Michael S. Tsirkin 2011-03-16 10:35:44 UTC
Created attachment 485702 [details]
first proposed patch

Comment 19 Amos Kong 2011-03-16 10:56:53 UTC
(In reply to comment #18)
> Created attachment 485702 [details]
> first proposed patch

Tested 8 times with this patch; all passed!

Comment 23 Amos Kong 2011-03-21 09:02:22 UTC
All migration tests passed with qemu-kvm-0.12.1.2-2.151.el6.
https://virtlab.englab.nay.redhat.com/job/29738/details/

Moving to VERIFIED.

Comment 25 Eduardo Habkost 2011-05-05 12:31:52 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: a bug in address conversion in the vhost migration dirty page handling code.

Consequence: qemu-kvm segmentation faults during live migration of KVM VMs.

Fix: corrected the address conversion in the vhost migration dirty page handling code.

Result: qemu-kvm no longer crashes during live migration.

Comment 26 errata-xmlrpc 2011-05-19 11:27:18 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0534.html

Comment 27 errata-xmlrpc 2011-05-19 13:02:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0534.html