Bug 1066338

Summary: Reducing the migration cache size during migration causes a qemu segmentation fault
Product: Red Hat Enterprise Linux 7
Reporter: Qunfang Zhang <qzhang>
Component: qemu-kvm
Assignee: Dr. David Alan Gilbert <dgilbert>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 7.0
CC: hhuang, huding, jherrman, juzhang, lmiksik, michen, mrezanin, qzhang, rbalakri, shu, tdosek, virt-maint
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-1.5.3-63.el7
Doc Type: Bug Fix
Doc Text:
Prior to this update, the QEMU command interface did not properly handle resizing of cache memory during a guest migration, causing QEMU to terminate unexpectedly with a segmentation fault. This update fixes the related code, and QEMU no longer crashes in the described situation.
Story Points: ---
Clone Of:
: 1110191
Environment:
Last Closed: 2015-03-05 08:04:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1076185, 1110191

Description Qunfang Zhang 2014-02-18 09:21:53 UTC
Description of problem:
Migrate a guest that is running a workload (I am running Google's stressapptest tool), with xbzrle turned on and a migration cache size set. Reducing the cache size during migration causes qemu to crash with a segmentation fault.

Version-Release number of selected component (if applicable):
kernel-3.10.0-73.el7.x86_64
qemu-kvm-1.5.3-48.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Boot up a guest 

/usr/libexec/qemu-kvm -cpu SandyBridge -enable-kvm -m 30G -smp 8,sockets=1,cores=8,threads=1 -enable-kvm -name t2-rhel6.4-64 -uuid 61b6c504-5a8b-4fe1-8347-6c929b750dde -k en-us -rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=input0 -drive file=/mnt/RHEL-Server-7.0-64-virtio.qcow2,if=none,id=disk0,format=qcow2,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,bus=pci.0,addr=0x3,drive=disk0,id=disk0 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,drive=drive-ide0-1-0,bus=ide.1,unit=0,id=cdrom -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=44:37:E6:5E:91:85,bus=pci.0,addr=0x5 -monitor stdio -qmp tcp:0:6666,server,nowait -chardev socket,path=/tmp/isa-serial,server,nowait,id=isa1 -device isa-serial,chardev=isa1,id=isa-serial1 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x8 -chardev socket,id=charchannel0,path=/tmp/serial-socket,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,path=/tmp/foo,server,nowait,id=foo -device virtconsole,chardev=foo,id=console0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9 -vnc :10 -k en-us -boot c -chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 -device virtserialport,bus=virtio-serial0.0,chardev=qga0,name=org.qemu.guest_agent.0 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0

2. Boot the guest on the destination host with "-incoming tcp:0:5800"

3. Run Google stressapptest inside the guest (refer to Bug 1063417)

Docs:
https://code.google.com/p/stressapptest/wiki/Introduction

(1) Get the code from: http://code.google.com/p/stressapptest/downloads/list (I used 1.0.6)

(2)untar
./configure
make
This produces the binary src/stressapptest

** Don't run the test on your laptop - it'll run it out of memory without options ! **

(3)copy the binary onto the victim VM:
scp src/stressapptest   thevmname:

(4) Then on a text-console on the VM do:
 ./stressapptest -s 3600 -m 20 -i 20 -C 20
	

5. On source host qemu:
(qemu) migrate_set_capability auto-converge on
(qemu) migrate_set_capability xbzrle on
(qemu) migrate_set_cache_size 1G
(qemu) migrate_set_speed 1G

6. Start the migration
(qemu) migrate -d tcp:$dst_host_ip:5800

7. Wait for a while, then reduce the cache size before the migration finishes.
(qemu) migrate_set_cache_size 128M
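
For anyone who wants to script steps 5 to 7 instead of typing them into the HMP monitor, here is a minimal sketch that builds the equivalent QMP messages for the "-qmp tcp:0:6666,server,nowait" socket from the command line in step 1. The helper names (parse_size, build_commands, shrink_cache_command, send_all) are hypothetical and not part of QEMU. The sketch assumes the QMP commands migrate-set-capabilities, migrate-set-cache-size, migrate_set_speed and migrate behave like their HMP counterparts, and it has not been run against this qemu-kvm build.

```python
import json
import socket

# Binary multipliers for the size suffixes used in the steps above.
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a size with an explicit suffix, such as '128M' or '1G', to bytes."""
    text = text.strip().upper()
    if len(text) < 2 or text[-1] not in UNITS:
        # Assumption of this sketch: always require a suffix rather than
        # guessing a default unit.
        raise ValueError("size needs a K, M or G suffix: %r" % text)
    return int(text[:-1]) * UNITS[text[-1]]


def build_commands(dst_uri, cache="1G", speed="1G"):
    """QMP messages mirroring steps 5 and 6 (capabilities, cache, speed, migrate)."""
    return [
        {"execute": "qmp_capabilities"},  # leaves QMP negotiation mode
        {"execute": "migrate-set-capabilities",
         "arguments": {"capabilities": [
             {"capability": "auto-converge", "state": True},
             {"capability": "xbzrle", "state": True},
         ]}},
        {"execute": "migrate-set-cache-size",
         "arguments": {"value": parse_size(cache)}},
        {"execute": "migrate_set_speed",
         "arguments": {"value": parse_size(speed)}},
        {"execute": "migrate", "arguments": {"uri": dst_uri}},
    ]


def shrink_cache_command(size="128M"):
    """QMP message mirroring step 7: shrink the cache while migration runs."""
    return {"execute": "migrate-set-cache-size",
            "arguments": {"value": parse_size(size)}}


def send_all(host, port, commands):
    """Send each message and collect one reply line per message.

    Simplification: asynchronous QMP events may be interleaved with
    replies; a real client would have to skip them.
    """
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rw")
        stream.readline()  # discard the QMP greeting banner
        replies = []
        for cmd in commands:
            stream.write(json.dumps(cmd) + "\n")
            stream.flush()
            replies.append(json.loads(stream.readline()))
        return replies
```

A possible use: call send_all("127.0.0.1", 6666, build_commands("tcp:DST_HOST_IP:5800")) on the source host, wait a while, then open a second connection and send [{"execute": "qmp_capabilities"}, shrink_cache_command("128M")]. DST_HOST_IP is a placeholder for the destination address.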


Actual results:
Qemu crashes with a segmentation fault:

(qemu) migrate_set_cache_size 128

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff2d00faa in _int_free () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff2d00faa in _int_free () from /lib64/libc.so.6
#1  0x00007ffff74ef9af in g_free () from /lib64/libglib-2.0.so.0
#2  0x00005555556ec291 in cache_resize (cache=0x7ff7380008c0, new_num_pages=new_num_pages@entry=32768)
    at page_cache.c:216
#3  0x0000555555744ab5 in xbzrle_cache_resize (new_size=new_size@entry=134217728)
    at /usr/src/debug/qemu-1.5.3/arch_init.c:184
#4  0x00005555556e11a5 in qmp_migrate_set_cache_size (value=<optimized out>, errp=<optimized out>)
    at migration.c:494
#5  0x0000555555653a0a in hmp_migrate_set_cache_size (mon=0x5555564f34d0, qdict=<optimized out>)
    at hmp.c:917
#6  0x000055555579efc9 in handle_user_command (mon=mon@entry=0x5555564f34d0, cmdline=<optimized out>)
    at /usr/src/debug/qemu-1.5.3/monitor.c:4008
#7  0x000055555579f297 in monitor_command_cb (mon=0x5555564f34d0, cmdline=<optimized out>, 
    opaque=<optimized out>) at /usr/src/debug/qemu-1.5.3/monitor.c:4624
#8  0x00005555557171f4 in readline_handle_byte (rs=0x5555566593b0, ch=<optimized out>) at readline.c:374
#9  0x000055555579f224 in monitor_read (opaque=<optimized out>, buf=<optimized out>, size=<optimized out>)
    at /usr/src/debug/qemu-1.5.3/monitor.c:4610
#10 0x0000555555707f3b in qemu_chr_be_write (len=<optimized out>, buf=0x7fffffffc6c0 "\r", s=0x5555564db8f0)
    at qemu-char.c:167
#11 fd_chr_read (chan=<optimized out>, cond=<optimized out>, opaque=0x5555564db8f0) at qemu-char.c:850
#12 0x00007ffff74e9e06 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#13 0x00005555556dae9a in glib_pollfds_poll () at main-loop.c:187
#14 os_host_main_loop_wait (timeout=<optimized out>) at main-loop.c:232
#15 main_loop_wait (nonblocking=<optimized out>) at main-loop.c:464
#16 0x00005555556017c0 in main_loop () at vl.c:1988
#17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4357
(gdb) 
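
As a sanity check on the values in the backtrace, the short calculation below relates new_size in frame #3 to new_num_pages in frame #2. It only uses numbers shown in the trace and in the steps. That the cache is counted in fixed-size pages is an inference from those numbers, not something the trace states.

```python
# Values copied from the backtrace above.
new_size = 134217728      # xbzrle_cache_resize(new_size=...), frame #3
new_num_pages = 32768     # cache_resize(new_num_pages=...), frame #2

# 128 MiB expressed in bytes matches the requested size.
assert new_size == 128 * 1024 * 1024

# Implied bytes per cache page (an inference from the two values).
page_size = new_size // new_num_pages
print(page_size)          # 4096

# The cache was first set to 1G in step 5; at the same page size
# that would be this many pages before the shrink.
old_size = 1 * 1024 * 1024 * 1024
old_num_pages = old_size // page_size
print(old_num_pages)      # 262144
```

Under that assumption, the resize that crashed was shrinking the cache from 262144 pages to 32768 pages while migration was still in progress.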

Expected results:
No segmentation fault should occur.

Additional info:
Host info:
processor	: 31
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping	: 7
microcode	: 0x710
cpu MHz		: 1995.069
cache size	: 20480 KB
physical id	: 1
siblings	: 16
core id		: 7
cpu cores	: 8
apicid		: 47
initial apicid	: 47
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 3994.12
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:


# free -m 
             total       used       free     shared    buffers     cached
Mem:         31880        751      31129          0          8         60
-/+ buffers/cache:        682      31197
Swap:        16135         46      16089

Comment 2 Dr. David Alan Gilbert 2014-02-19 11:58:58 UTC
Looks like someone else hit this at the same time; discussion just started:

http://lists.gnu.org/archive/html/qemu-devel/2014-02/msg03332.html

Comment 11 Miroslav Rezanina 2014-06-17 12:40:26 UTC
Fix included in qemu-kvm-1.5.3-63.el7

Comment 12 huiqingding 2014-06-24 07:01:42 UTC
Verified this bug using the following versions:
kernel-3.10.0-128.el7.x86_64
qemu-kvm-1.5.3-64.el7.x86_64

Steps to Verify:
1. Boot up a guest 
# /usr/libexec/qemu-kvm -cpu SandyBridge -enable-kvm -m 30G -smp 8,sockets=1,cores=8,threads=1 -enable-kvm -name t2-rhel6.4-64 -uuid 61b6c504-5a8b-4fe1-8347-6c929b750dde -k en-us -rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=input0 -drive file=/mnt/RHEL-Server-7.0-64-virtio.qcow2,if=none,id=disk0,format=qcow2,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,bus=pci.0,addr=0x3,drive=disk0,id=disk0 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,drive=drive-ide0-1-0,bus=ide.1,unit=0,id=cdrom -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=44:37:E6:5E:91:85,bus=pci.0,addr=0x5 -monitor stdio -qmp tcp:0:6666,server,nowait -chardev socket,path=/tmp/isa-serial,server,nowait,id=isa1 -device isa-serial,chardev=isa1,id=isa-serial1 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x8 -chardev socket,id=charchannel0,path=/tmp/serial-socket,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,path=/tmp/foo,server,nowait,id=foo -device virtconsole,chardev=foo,id=console0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9 -vnc :10 -k en-us -boot c -chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 -device virtserialport,bus=virtio-serial0.0,chardev=qga0,name=org.qemu.guest_agent.0 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0

2. Boot the guest on the destination host with "-incoming tcp:0:5800"

3. Run Google stressapptest inside the guest (refer to Bug 1063417)

Docs:
https://code.google.com/p/stressapptest/wiki/Introduction

(1) Get the code from: http://code.google.com/p/stressapptest/downloads/list (I used 1.0.6)

(2)untar
./configure
make
This produces the binary src/stressapptest

** Don't run the test on your laptop - it'll run it out of memory without options ! **

(3)copy the binary onto the victim VM:
scp src/stressapptest   thevmname:

(4) Then on a text-console on the VM do:
 ./stressapptest -s 3600 -m 20 -i 20 -C 20
	

5. On source host qemu:
(qemu) migrate_set_capability auto-converge on
(qemu) migrate_set_capability xbzrle on
(qemu) migrate_set_cache_size 1G
(qemu) migrate_set_speed 1G

6. Start the migration
(qemu) migrate -d tcp:$dst_host_ip:5800

7. Wait for a while, then reduce the cache size before the migration finishes.
(qemu) migrate_set_cache_size 128M


Actual results:
After step 7, qemu-kvm does not hit a segmentation fault, and the migration finishes successfully once the downtime is enlarged. I ran ping-pong migration three times, and the migration finished each time.

Comment 13 Dr. David Alan Gilbert 2014-07-11 08:04:13 UTC
*** Bug 1045266 has been marked as a duplicate of this bug. ***

Comment 15 Shaolong Hu 2014-10-13 09:07:33 UTC
Verified on:

qemu-kvm-rhev-2.1.2-3.el7.x86_64 and qemu-kvm-1.5.3-75.el7.x86_64

with steps in comment 12, no core dump.

Comment 17 errata-xmlrpc 2015-03-05 08:04:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0349.html