Red Hat Bugzilla – Bug 1066338
Reducing the migration cache size during migration causes a qemu segmentation fault
Last modified: 2015-03-05 03:04:22 EST
Description of problem:
Migrate a guest that is running a workload (I'm running the Google stressapptest tool), with xbzrle turned on and a migration cache size set. Reducing the cache size during migration makes qemu crash with a segmentation fault.

Version-Release number of selected component (if applicable):
kernel-3.10.0-73.el7.x86_64
qemu-kvm-1.5.3-48.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Boot up a guest:
/usr/libexec/qemu-kvm -cpu SandyBridge -enable-kvm -m 30G -smp 8,sockets=1,cores=8,threads=1 -enable-kvm -name t2-rhel6.4-64 -uuid 61b6c504-5a8b-4fe1-8347-6c929b750dde -k en-us -rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=input0 -drive file=/mnt/RHEL-Server-7.0-64-virtio.qcow2,if=none,id=disk0,format=qcow2,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,bus=pci.0,addr=0x3,drive=disk0,id=disk0 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,drive=drive-ide0-1-0,bus=ide.1,unit=0,id=cdrom -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=44:37:E6:5E:91:85,bus=pci.0,addr=0x5 -monitor stdio -qmp tcp:0:6666,server,nowait -chardev socket,path=/tmp/isa-serial,server,nowait,id=isa1 -device isa-serial,chardev=isa1,id=isa-serial1 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x8 -chardev socket,id=charchannel0,path=/tmp/serial-socket,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,path=/tmp/foo,server,nowait,id=foo -device virtconsole,chardev=foo,id=console0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9 -vnc :10 -k en-us -boot c -chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 -device virtserialport,bus=virtio-serial0.0,chardev=qga0,name=org.qemu.guest_agent.0 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0
2. Boot the guest on the dst host with "-incoming tcp:0:5800".
3. Run Google stressapptest inside the guest (refer to Bug 1063417).
Docs: https://code.google.com/p/stressapptest/wiki/Introduction
(1) Get the code from: http://code.google.com/p/stressapptest/downloads/list (I used 1.0.6).
(2) Untar it, then run ./configure and make. This produces the binary src/stressapptest.
** Don't run the test on your laptop - it'll run it out of memory without options! **
(3) Copy the binary onto the victim VM:
scp src/stressapptest thevmname:
(4) Then, on a text console on the VM, run:
./stressapptest -s 3600 -m 20 -i 20 -C 20
5. On the source host qemu:
(qemu) migrate_set_capability auto-converge on
(qemu) migrate_set_capability xbzrle on
(qemu) migrate_set_cache_size 1G
(qemu) migrate_set_speed 1G
6. Start the migration:
(qemu) migrate -d tcp:$dst_host_ip:5800
7. Wait a while, and before the migration finishes, run:
(qemu) migrate_set_cache_size 128M

Actual results:
qemu crashes with a segmentation fault:
(qemu) migrate_set_cache_size 128

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff2d00faa in _int_free () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff2d00faa in _int_free () from /lib64/libc.so.6
#1 0x00007ffff74ef9af in g_free () from /lib64/libglib-2.0.so.0
#2 0x00005555556ec291 in cache_resize (cache=0x7ff7380008c0, new_num_pages=new_num_pages@entry=32768) at page_cache.c:216
#3 0x0000555555744ab5 in xbzrle_cache_resize (new_size=new_size@entry=134217728) at /usr/src/debug/qemu-1.5.3/arch_init.c:184
#4 0x00005555556e11a5 in qmp_migrate_set_cache_size (value=<optimized out>, errp=<optimized out>) at migration.c:494
#5 0x0000555555653a0a in hmp_migrate_set_cache_size (mon=0x5555564f34d0, qdict=<optimized out>) at hmp.c:917
#6 0x000055555579efc9 in handle_user_command (mon=mon@entry=0x5555564f34d0, cmdline=<optimized out>) at /usr/src/debug/qemu-1.5.3/monitor.c:4008
#7 0x000055555579f297 in monitor_command_cb (mon=0x5555564f34d0, cmdline=<optimized out>, opaque=<optimized out>) at /usr/src/debug/qemu-1.5.3/monitor.c:4624
#8 0x00005555557171f4 in readline_handle_byte (rs=0x5555566593b0, ch=<optimized out>) at readline.c:374
#9 0x000055555579f224 in monitor_read (opaque=<optimized out>, buf=<optimized out>, size=<optimized out>) at /usr/src/debug/qemu-1.5.3/monitor.c:4610
#10 0x0000555555707f3b in qemu_chr_be_write (len=<optimized out>, buf=0x7fffffffc6c0 "\r", s=0x5555564db8f0) at qemu-char.c:167
#11 fd_chr_read (chan=<optimized out>, cond=<optimized out>, opaque=0x5555564db8f0) at qemu-char.c:850
#12 0x00007ffff74e9e06 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#13 0x00005555556dae9a in glib_pollfds_poll () at main-loop.c:187
#14 os_host_main_loop_wait (timeout=<optimized out>) at main-loop.c:232
#15 main_loop_wait (nonblocking=<optimized out>) at main-loop.c:464
#16 0x00005555556017c0 in main_loop () at vl.c:1988
#17 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4357
(gdb)

Expected results:
No segmentation fault should occur.

Additional info:
Host info:
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping : 7
microcode : 0x710
cpu MHz : 1995.069
cache size : 20480 KB
physical id : 1
siblings : 16
core id : 7
cpu cores : 8
apicid : 47
initial apicid : 47
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 3994.12
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

# free -m
             total       used       free     shared    buffers     cached
Mem:         31880        751      31129          0          8         60
-/+ buffers/cache:        682      31197
Swap:        16135         46      16089
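Simple arithmetic confirms the sizes in the backtrace. The command typed was "migrate_set_cache_size 128", yet frame #3 shows new_size=134217728, so a bare number appears to be read as MiB: 128 * 1048576 = 134217728 bytes. Frame #2 shows new_num_pages=32768, which is 134217728 / 4096, i.e. 4 KiB pages. The Python sketch below reproduces that conversion. hmp_size_to_bytes and cache_pages are hypothetical helpers, not QEMU code. The 4 KiB page size and the "bare number means MiB" rule are assumptions inferred from the backtrace values above.

```python
# Assumed HMP size suffixes (binary multiples); a bare number is treated as MiB,
# which matches "migrate_set_cache_size 128" yielding new_size=134217728 above.
UNITS = {"B": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}
PAGE_SIZE = 4096  # assumption: 4 KiB target pages on this x86_64 guest


def hmp_size_to_bytes(text: str) -> int:
    """Convert an HMP-style size such as '1G', '128M' or '128' to bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        number, unit = text[:-1], text[-1]
    else:
        number, unit = text, "M"  # assumed default suffix
    return int(float(number) * UNITS[unit])


def cache_pages(text: str) -> int:
    """Number of whole pages that a cache of the given size can hold."""
    return hmp_size_to_bytes(text) // PAGE_SIZE
```

Under the same assumptions, the 1G cache set in step 5 holds 262144 pages. Step 7 therefore shrinks the cache to one eighth of its size while the migration is still in progress.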
Looks like someone else hit this at the same time; discussion just started: http://lists.gnu.org/archive/html/qemu-devel/2014-02/msg03332.html
Fix included in qemu-kvm-1.5.3-63.el7
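To check whether an installed build already contains the fix, compare the qemu-kvm version-release string (as printed by, e.g., rpm -q qemu-kvm) against 1.5.3-63.el7. The Python sketch below assumes the qemu-kvm-X.Y.Z-R.el7 naming used in this report. has_fix is a hypothetical helper and is deliberately simpler than RPM's own version comparison. It rejects other package streams such as qemu-kvm-rhev rather than comparing across them.

```python
import re

# First qemu-kvm build that includes the fix, according to this report.
FIXED = ((1, 5, 3), 63)

# Matches e.g. "qemu-kvm-1.5.3-64.el7.x86_64"; "qemu-kvm-rhev-..." does not match.
NVR_RE = re.compile(r"^qemu-kvm-(\d+)\.(\d+)\.(\d+)-(\d+)\.el7")


def has_fix(nvr: str) -> bool:
    """Return True if the qemu-kvm build string is at or above the fixed release."""
    m = NVR_RE.match(nvr)
    if m is None:
        raise ValueError(f"unrecognized package string: {nvr!r}")
    version = tuple(int(x) for x in m.group(1, 2, 3))
    release = int(m.group(4))
    # Tuple comparison: version first, then the numeric release.
    return (version, release) >= FIXED
```

For example, has_fix("qemu-kvm-1.5.3-64.el7.x86_64") returns True for the build verified below. The originally reported qemu-kvm-1.5.3-48.el7.x86_64 returns False.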
Verified this bug using the following versions:
kernel-3.10.0-128.el7.x86_64
qemu-kvm-1.5.3-64.el7.x86_64

Steps to Verify:
1. Boot up a guest:
# /usr/libexec/qemu-kvm -cpu SandyBridge -enable-kvm -m 30G -smp 8,sockets=1,cores=8,threads=1 -enable-kvm -name t2-rhel6.4-64 -uuid 61b6c504-5a8b-4fe1-8347-6c929b750dde -k en-us -rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=input0 -drive file=/mnt/RHEL-Server-7.0-64-virtio.qcow2,if=none,id=disk0,format=qcow2,werror=stop,rerror=stop,aio=native -device virtio-blk-pci,bus=pci.0,addr=0x3,drive=disk0,id=disk0 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,drive=drive-ide0-1-0,bus=ide.1,unit=0,id=cdrom -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=44:37:E6:5E:91:85,bus=pci.0,addr=0x5 -monitor stdio -qmp tcp:0:6666,server,nowait -chardev socket,path=/tmp/isa-serial,server,nowait,id=isa1 -device isa-serial,chardev=isa1,id=isa-serial1 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x8 -chardev socket,id=charchannel0,path=/tmp/serial-socket,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,path=/tmp/foo,server,nowait,id=foo -device virtconsole,chardev=foo,id=console0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9 -vnc :10 -k en-us -boot c -chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 -device virtserialport,bus=virtio-serial0.0,chardev=qga0,name=org.qemu.guest_agent.0 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0
2. Boot the guest on the dst host with "-incoming tcp:0:5800".
3. Run Google stressapptest inside the guest (refer to Bug 1063417).
Docs: https://code.google.com/p/stressapptest/wiki/Introduction
(1) Get the code from: http://code.google.com/p/stressapptest/downloads/list (I used 1.0.6).
(2) Untar it, then run ./configure and make. This produces the binary src/stressapptest.
** Don't run the test on your laptop - it'll run it out of memory without options! **
(3) Copy the binary onto the victim VM:
scp src/stressapptest thevmname:
(4) Then, on a text console on the VM, run:
./stressapptest -s 3600 -m 20 -i 20 -C 20
5. On the source host qemu:
(qemu) migrate_set_capability auto-converge on
(qemu) migrate_set_capability xbzrle on
(qemu) migrate_set_cache_size 1G
(qemu) migrate_set_speed 1G
6. Start the migration:
(qemu) migrate -d tcp:$dst_host_ip:5800
7. Wait a while, and before the migration finishes, run:
(qemu) migrate_set_cache_size 128M

Actual results:
After step 7, qemu-kvm does not crash with a segmentation fault, and the migration finishes successfully once the downtime is increased. I did ping-pong migration three times, and the migration finished each time.
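Steps 5 to 7 can also be driven over the QMP socket that step 1 opens with "-qmp tcp:0:6666,server,nowait". This is useful when repeating the ping-pong migrations. The Python sketch below is not the tester's method. It assumes the QMP command names migrate-set-capabilities, migrate-set-cache-size, migrate_set_speed, migrate and migrate_set_downtime (the QMP counterparts of the HMP commands above). build_repro_commands, run_qmp and the 30-second pause are hypothetical. The optional downtime value is a placeholder, because the report does not say how far the downtime was raised.

```python
import json
import socket
import time

GiB = 1 << 30
MiB = 1 << 20


def build_repro_commands(dst_host_ip, port=5800, downtime_s=None):
    """QMP equivalents of HMP steps 5-7, plus an optional downtime increase."""
    commands = [
        # QMP handshake: required before any other command is accepted.
        {"execute": "qmp_capabilities"},
        # Step 5: enable auto-converge and xbzrle, set a 1G cache and 1G/s speed.
        {"execute": "migrate-set-capabilities",
         "arguments": {"capabilities": [
             {"capability": "auto-converge", "state": True},
             {"capability": "xbzrle", "state": True},
         ]}},
        {"execute": "migrate-set-cache-size", "arguments": {"value": 1 * GiB}},
        {"execute": "migrate_set_speed", "arguments": {"value": 1 * GiB}},
        # Step 6: QMP "migrate" returns immediately, like HMP "migrate -d".
        {"execute": "migrate",
         "arguments": {"uri": f"tcp:{dst_host_ip}:{port}"}},
        # Step 7: shrink the xbzrle cache to 128M while migration is running.
        {"execute": "migrate-set-cache-size", "arguments": {"value": 128 * MiB}},
    ]
    if downtime_s is not None:
        # Optional: allow more downtime (in seconds) so the migration can finish.
        commands.append({"execute": "migrate_set_downtime",
                         "arguments": {"value": downtime_s}})
    return commands


def run_qmp(host, port, commands, pause_after_migrate_s=30.0):
    """Send each command over a QMP TCP socket and collect the replies."""
    replies = []
    with socket.create_connection((host, port)) as sock, \
            sock.makefile("rw", encoding="utf-8") as stream:
        json.loads(stream.readline())  # greeting banner: {"QMP": {...}}
        for command in commands:
            stream.write(json.dumps(command) + "\n")
            stream.flush()
            reply = json.loads(stream.readline())
            while "event" in reply:  # skip asynchronous event notifications
                reply = json.loads(stream.readline())
            replies.append(reply)
            if command["execute"] == "migrate":
                # "Wait a while" so that the resize lands mid-migration.
                time.sleep(pause_after_migrate_s)
    return replies
```

For example, run run_qmp("127.0.0.1", 6666, build_repro_commands("192.0.2.10")) on the source host, where 192.0.2.10 stands in for $dst_host_ip. On a fixed build, each reply should carry a "return" key rather than an "error" key. On an affected build, the connection would be expected to drop when the final resize crashes qemu.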
*** Bug 1045266 has been marked as a duplicate of this bug. ***
Verified on qemu-kvm-rhev-2.1.2-3.el7.x86_64 and qemu-kvm-1.5.3-75.el7.x86_64 using the steps in comment 12; no core dump.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0349.html