Description of problem:
Migrating a guest from the source host to the destination host while an application is running inside it sometimes fails on the destination host. I hit this problem about 4~5 times today, in the following scenarios:
(1) Migrating the guest during image installation (package installation stage).
(2) Migrating the guest while the cdrom is in use:
(in guest) # while true; do cp -r /media/RHEL7\ X86_64/ /home/test; sleep 1; rm -rf /home/test; done
(3) Migrating the guest while stress is running inside it:
(in guest) # stress -m 2

All the above scenarios are with xbzrle=on when I do the migration. I'm not sure whether xbzrle=off could also trigger it, because the bug is not always reproduced and confirming that would take many attempts.

Version-Release number of selected component (if applicable):
kernel-3.10.0-63.el7.x86_64
qemu-kvm-1.5.3-24.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Boot up a guest on the source host:
/usr/libexec/qemu-kvm -cpu SandyBridge -enable-kvm -m 2048 \
-smp 2,sockets=2,cores=1,threads=1 -enable-kvm -name t2-rhel6.4-32 \
-uuid 61b6c504-5a8b-4fe1-8347-6c929b750dde -k en-us \
-rtc base=localtime,clock=host,driftfix=slew -no-kvm-pit-reinjection \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device usb-tablet,id=input0 \
-drive file=/mnt/installation.qcow2,if=none,id=disk0,format=qcow2,werror=stop,rerror=stop,aio=native \
-device ide-drive,bus=ide.0,unit=1,drive=disk0,id=disk0 \
-drive file=/mnt/boot.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw \
-device ide-drive,drive=drive-ide0-1-0,bus=ide.1,unit=0,id=cdrom \
-netdev tap,id=hostnet0 \
-device rtl8139,netdev=hostnet0,id=net0,mac=44:37:E6:5E:91:85,bus=pci.0,addr=0x5 \
-monitor stdio -qmp tcp:0:6666,server,nowait \
-chardev socket,path=/tmp/isa-serial,server,nowait,id=isa1 \
-device isa-serial,chardev=isa1,id=isa-serial1 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x8 \
-chardev socket,id=charchannel0,path=/tmp/serial-socket,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm \
-chardev socket,path=/tmp/foo,server,nowait,id=foo \
-device virtconsole,chardev=foo,id=console0 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9 \
-vnc :10 -k en-us -boot dc \
-chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 \
-device virtserialport,bus=virtio-serial0.0,chardev=qga0,name=org.qemu.guest_agent.0 \
-global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0
2. Enable xbzrle and set the cache size:
(qemu) migrate_set_capability xbzrle on
(qemu) migrate_set_cache_size 2G
3. Run one of the following 3 scenarios:
(1) Install the guest (for an empty image).
(2) Read the cdrom inside the guest (for a pre-installed image):
(in guest) # while true; do cp -r /media/RHEL7\ X86_64/ /home/test; sleep 1; rm -rf /home/test; done
(3) Run stress inside the guest:
(in guest) # stress -m 2
4. Migrate the guest:
(qemu) migrate -d tcp:t2:5800

Actual results:
The guest fails to load on the destination host:
(qemu) info status
VM status: paused (inmigrate)
(qemu) qemu: warning: error while loading state section id 2
load of migration failed

Expected results:
The guest should be migrated successfully and work well.

Additional info:
Sometimes the issue happens before the migration has finished, so the guest is still running on the src host. But sometimes the problem happens just as the migration finishes; then the guest is dead.
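For reference, the destination side is not shown above. For a TCP migration as in step 4, the guest on the destination host (t2) would be started with the same command line plus an -incoming option so it waits for the migration stream; a minimal sketch, assuming the port from step 4:

# on destination host t2, same options as in step 1, plus:
/usr/libexec/qemu-kvm <options from step 1> -incoming tcp:0:5800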
Hi,

I suspect this is related to the high memory usage of the XBZRLE feature.
What is the amount of memory the hosts have?
Can you print the memory usage when the migration fails?
Does it still happen when you set the cache size to a smaller value
(migrate_set_cache_size)?

Thanks,
Orit
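One concrete way to collect the data asked for above (a sketch; these are standard HMP commands in this qemu-kvm version, but exact output formats may vary): info migrate_cache_size shows the current xbzrle cache size, info migrate includes xbzrle page and cache-miss counters while a migration is in flight, and migrate_set_cache_size lowers the cache for a retry.

(qemu) info migrate_cache_size
(qemu) info migrate
(qemu) migrate_set_cache_size 512M

And on the host, at the moment of failure:
# free -m; cat /proc/meminfo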
(In reply to Orit Wasserman from comment #2)
> Hi,
>
> I suspect this is related to the high memory usage of the XBZRLE feature.
> What is the amount of memory the hosts have?

The host has 8G mem.

> Can you print the memory usage when the migration fails?

Below is the host memory usage when the migration fails. I re-tested and reproduced it again, still with a 2G migration cache size.

# cat /proc/meminfo
MemTotal:        7911636 kB
MemFree:         4280912 kB
Buffers:              36 kB
Cached:          1333200 kB
SwapCached:          968 kB
Active:          2421036 kB
Inactive:        1050756 kB
Active(anon):    2069740 kB
Inactive(anon):    75688 kB
Active(file):     351296 kB
Inactive(file):   975068 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       8273916 kB
SwapFree:        8272240 kB
Dirty:                88 kB
Writeback:             0 kB
AnonPages:       2138048 kB
Mapped:            21272 kB
Shmem:              6852 kB
Slab:              54464 kB
SReclaimable:      19512 kB
SUnreclaim:        34952 kB
KernelStack:        1560 kB
PageTables:        11228 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    12229732 kB
Committed_AS:    2567384 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      150644 kB
VmallocChunk:   34359584748 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1992704 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       98072 kB
DirectMap2M:     4091904 kB
DirectMap1G:     4194304 kB

# free -m
             total       used       free     shared    buffers     cached
Mem:          7726       3545       4180          6          0       1301
-/+ buffers/cache:       2243       5482
Swap:         8079          1       8078

> Does it still happen when you set the cache size to a smaller value
> (migrate_set_cache_size)?

I have not reproduced the bug with a smaller value so far (I used a 512M cache size and tried 5 times already).

>
> Thanks,
> Orit
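A single snapshot like the above can miss a transient spike, so it may also help to sample the qemu-kvm process RSS continuously while the migration runs (a sketch using standard tools; the pgrep pattern is an assumption about how the process appears on this host):

# pid=$(pgrep -f qemu-kvm | head -n1)
# while kill -0 "$pid" 2>/dev/null; do grep VmRSS /proc/$pid/status; sleep 1; done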
(In reply to Qunfang Zhang from comment #0)
> How reproducible:
> Always

Sometimes (as in summary).