Description of problem:
While running acceptance testing in a loop (win2008-64 guest stop/continue, migrate to file, live migrate during reboot), QEMU crashed 4 times. Hit this bug on both Intel and AMD hosts with a win2008-64 guest.

(gdb) bt
#0 0x00000035a4232945 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00000035a4234125 in abort () at abort.c:92
#2 0x00000035af230ad9 in handle_dev_update (listener=0x7f2753e2ac00, events=<value optimized out>) at red_worker.c:9725
#3 handle_dev_input (listener=0x7f2753e2ac00, events=<value optimized out>) at red_worker.c:9982
#4 0x00000035af22fa75 in red_worker_main (arg=<value optimized out>) at red_worker.c:10304
#5 0x00000035a46077e1 in start_thread (arg=0x7f2753fff700) at pthread_create.c:301
#6 0x00000035a42e578d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Version-Release number of selected component (if applicable):
host info:
kernel 2.6.32-193.el6.x86_64
qemu-kvm-0.12.1.2-2.188.el6.x86_64
guest info:
win2008-64-virtio_nic-virtio_blk.qcow2

How reproducible:
Hit the qemu crash 4 times.

Steps to Reproduce:
1. While running acceptance testing in a loop, do stop/cont testing: run (qemu) stop and then (qemu) cont; the following is displayed:
(qemu) /bin/sh: line 1: 3680 Aborted
2. Do migration-exec testing, run:
{'execute': 'migrate', 'arguments': {'uri': 'exec:gzip -c > /tmp/exec-BwhstLIN.gz', 'blk': False, 'inc': False}, 'id': '0pyhCZmf'}
(qemu) /bin/sh: line 1: 13568 Aborted
3. Do live migrate during reboot testing:
(qemu) /bin/sh: line 1: 14346 Aborted

Actual results:
qemu crashes

Expected results:
qemu does not crash

Additional info:
[cmd to boot guest]:
qemu-kvm -drive file='win2008-64-virtio.qcow2',index=0,if=none,id=drive-virtio-disk1,media=disk,cache=none,format=qcow2,aio=native -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk1,id=virtio-disk1 -device virtio-net-pci,netdev=id69qZBY,mac=9a:e9:d6:d0:bb:c6,id=ndev00id69qZBY,bus=pci.0,addr=0x3 -netdev tap,id=id69qZBY,vhost=on,ifname='t0-171613-I0ct',script='/etc/qemu-ifup-switch',downscript='no' -m 2048 -smp 2,cores=1,threads=1,sockets=2 -drive file='winutils.iso',index=1,if=none,id=drive-ide0-0-0,media=cdrom,readonly=on,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -cpu cpu64-rhel6,+sse2,+x2apic -spice port=8001,disable-ticketing -vga qxl -rtc base=localtime,clock=host,driftfix=none -M rhel6.1.0 -boot order=cdn,once=c,menu=off -usbdevice tablet -enable-kvm -incoming tcp:0:5200

[host info]:
amd-9600b-8-1
processor : 3
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Phenom(tm) 9600B Quad-Core Processor
stepping : 3
cpu MHz : 2300.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips : 4609.46
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

[root@amd-9600b-8-1 ~]# cat /proc/meminfo
MemTotal: 7995668 kB
MemFree: 3254684 kB
Buffers: 8332 kB
Cached: 2290936 kB
SwapCached: 0 kB
Active: 2464324 kB
Inactive: 2075120 kB
Active(anon): 2240384 kB
Inactive(anon): 16 kB
Active(file): 223940 kB
Inactive(file): 2075104 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 8388600 kB
SwapFree: 8388600 kB
Dirty: 44 kB
Writeback: 0 kB
AnonPages: 2240172 kB
Mapped: 23360 kB
Shmem: 228 kB
Slab: 73620 kB
SReclaimable: 15880 kB
SUnreclaim: 57740 kB
KernelStack: 1600 kB
PageTables: 10276 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 12386432 kB
Committed_AS: 2710956 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 126936 kB
VmallocChunk: 34359485244 kB
HardwareCorrupted: 0 kB
AnonHugePages: 2131968 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 6976 kB
DirectMap2M: 3072000 kB
DirectMap1G: 5242880 kB
Hit this bug in the following cases respectively:
1. stop/continue
2. migrate to file
3. live migrate during reboot testing
host info: spice-server-0.8.2-3.el6.x86_64
Moving to spice owners, please test w/o spice/qxl too.
Already created four jobs with vnc to retest this bug. I will update the test results when the jobs finish.
Can you provide a backtrace of all threads, not just the crashing thread? I suspect this has something to do with an I/O write after vm stop.
This seems like a duplicate of rhbz#729621. Please provide qemu-kvm output. Does it contain a line with "ASSERT worker->running failed" ?
(In reply to comment #5)
> Already created four jobs with vnc to retest this bug.
> I will update the test results when the jobs finish.

Tested with vnc, didn't hit the qemu crash; passed.

The job links:
https://virtlab.englab.nay.redhat.com/job/38140/details/
https://virtlab.englab.nay.redhat.com/job/38139/details/
https://virtlab.englab.nay.redhat.com/job/38137/details/
https://virtlab.englab.nay.redhat.com/job/38136/details/
(In reply to comment #7)
> This seems like a duplicate of rhbz#729621.
> Please provide qemu-kvm output.
> Does it contain a line with "ASSERT worker->running failed" ?

Yes, hit the above line:

(qemu) handle_dev_update: ASSERT worker->running failed
(qemu) /bin/sh: line 1: 13568 Aborted
(qemu) (Process terminated with status 134)

If you think this is a duplicate of rhbz#729621, please close this bug, thanks.
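For reference, the abort in frame #2 of the backtrace comes from handle_dev_update in spice-server's red_worker.c, which asserts that the worker is still running. Below is a minimal sketch of that pattern, reconstructed only from the backtrace and the log line above; the ASSERT macro body, the struct layout, and main() are assumptions for illustration, not the actual spice-server 0.8 source.

/* Sketch only, reconstructed from the backtrace and the
 * "handle_dev_update: ASSERT worker->running failed" log line. */
#include <stdio.h>
#include <stdlib.h>

#define ASSERT(x)                                                \
    do {                                                         \
        if (!(x)) {                                              \
            printf("%s: ASSERT %s failed\n", __FUNCTION__, #x);  \
            abort(); /* the SIGABRT seen in frames #0/#1 */      \
        }                                                        \
    } while (0)

typedef struct RedWorker {
    int running; /* cleared while the VM is stopped / migrating */
} RedWorker;

/* Runs on the red worker thread when the I/O thread requests an update
 * (e.g. for a screendump). If the VM has been stopped, worker->running
 * is 0 and the assert aborts the whole qemu-kvm process. */
static void handle_dev_update(RedWorker *worker)
{
    ASSERT(worker->running);
    /* ... read the QXL command ring and render the requested area ... */
}

int main(void)
{
    RedWorker worker = { .running = 0 }; /* VM stopped */
    handle_dev_update(&worker);          /* prints the ASSERT line and aborts */
    return 0;
}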
Since this issue's component was changed from qemu-kvm to spice-server, resetting qa_ack to ?.
Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
proposed to 6.3 based on comment 11.
*** This bug has been marked as a duplicate of bug 729621 ***
Options for solution:

(1) If a screen_dump occurs when the VM is stopped, qxl will return an empty screen dump.

(2) A different API for screen_dump requests (or an additional flag to update_area). When we receive a screen dump request, we return the screen dump according to the latest spice-server state, which can be older than the state of the device.

(3) Holding a flag that indicates whether it is OK to access the guest memory. This flag will be false between pre_load and post_load. We will allow screen_dumps when the flag is true (I don't think screen dumps are even possible otherwise, since the migration target is blocked). Still, I think we need a different API for screen_dumps, since calls to update_area should be triggered from the vm when it is stopped.

(4) Upon stopping the VM, read and empty the command rings. If a client is connected, the time it takes to empty the command ring depends on sending all the commands to the client (or a timeout), and it will affect migration. Then, when we receive an update_area command, we don't need to read the command rings.

CC'ing Gerd for comments.
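A rough sketch of what option (3) could look like; every identifier here (guest_mem_accessible, qxl_pre_load, qxl_post_load, qxl_screen_dump) is made up for illustration and is not the real qemu-kvm/spice-server API.

/* Hypothetical sketch of option (3): a flag tracking whether the QXL
 * device is allowed to touch guest memory. */
#include <stdbool.h>

static bool guest_mem_accessible = true;

/* Cleared at the start of incoming migration (pre_load) ... */
static void qxl_pre_load(void)
{
    guest_mem_accessible = false;
}

/* ... and set again once the device state has been restored (post_load). */
static void qxl_post_load(void)
{
    guest_mem_accessible = true;
}

/* A screendump request only walks the guest command ring when the flag
 * allows it; otherwise it falls back to the last rendered surface
 * instead of asserting and aborting. */
static void qxl_screen_dump(const char *filename)
{
    (void)filename;
    if (guest_mem_accessible) {
        /* safe to read the guest command ring and render fresh content */
    } else {
        /* serve the most recent server-side surface (may be stale) */
    }
}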
(In reply to comment #19)
> Options for solution:
> (1) If a screen_dump occurs when the VM is stopped, qxl will return an
> empty screen dump.

Not an option, misleading to users.

> (2) A different API for screen_dump requests (or an additional flag to
> update_area). When we receive a screen dump request, we return the screen
> dump according to the latest spice-server state, which can be older than
> the state of the device.
> (3) Holding a flag that indicates whether it is OK to access the guest
> memory. This flag will be false between pre_load and post_load. We will
> allow screen_dumps when the flag is true (I don't think screen dumps are
> even possible otherwise, since the migration target is blocked).

I think so too. Not sure if this is something we can rely on to stay in the future.

> Still, I think we need a different API for screen_dumps, since calls to
> update_area should be triggered from the vm when it is stopped.

The right thing to do, it seems to me, is to read the command ring (if we can, with the flag) and not to force the to-client pipe size to be bounded. With the current implementation that would mean we may cross the 50 items in the outgoing pipe. I think it would still be ok, since the next update to the client (not a screen_dump or update_area, if we keep the old API) will flush it.

> (4) Upon stopping the VM, read and empty the command rings. If a client is
> connected, the time it takes to empty the command ring depends on sending
> all the commands to the client (or a timeout), and it will affect
> migration. Then, when we receive an update_area command, we don't need to
> read the command rings.

Unbounded time to complete, not a good idea.

> CC'ing Gerd for comments.
(In reply to comment #20)
> (In reply to comment #19)
> > Options for solution:
> > (1) If a screen_dump occurs when the VM is stopped, qxl will return an
> > empty screen dump.
>
> Not an option, misleading to users.
>
> > (2) A different API for screen_dump requests (or an additional flag to
> > update_area). When we receive a screen dump request, we return the
> > screen dump according to the latest spice-server state, which can be
> > older than the state of the device.
> > (3) Holding a flag that indicates whether it is OK to access the guest
> > memory. This flag will be false between pre_load and post_load. We will
> > allow screen_dumps when the flag is true (I don't think screen dumps are
> > even possible otherwise, since the migration target is blocked).
>
> I think so too. Not sure if this is something we can rely on to stay in
> the future.
>
> > Still, I think we need a different API for screen_dumps, since calls to
> > update_area should be triggered from the vm when it is stopped.
>
> The right thing to do, it seems to me, is to read the command ring (if we
> can, with the flag) and not to force the to-client pipe size to be
> bounded. With the current implementation that would mean we may cross the
> 50 items in the outgoing pipe. I think it would still be ok, since the
> next update to the client (not a screen_dump or update_area, if we keep
> the old API) will flush it.

IMHO, the issue of limiting the pipe size is not related to this bug. The pipe limit affects every update_area, not only the ones triggered by a screen dump. The limit exists in order not to overflow the device memory, and we know it should be changed to be smarter than a fixed 50 items. But I think that is not in this bug's scope.

> > (4) Upon stopping the VM, read and empty the command rings. If a client
> > is connected, the time it takes to empty the command ring depends on
> > sending all the commands to the client (or a timeout), and it will
> > affect migration. Then, when we receive an update_area command, we don't
> > need to read the command rings.
>
> Unbounded time to complete, not a good idea.

I don't like it either, but I presented it since it is simple.

> > CC'ing Gerd for comments.
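To make the pipe-size discussion above concrete, here is a purely hypothetical sketch of the drain loop being debated: the ~50-item cap on the to-client pipe is honored while the VM runs, but an update issued while the VM is stopped is allowed to overshoot it, and the next client update flushes the backlog. MAX_PIPE_SIZE, the Worker struct, and every function name are illustrative stand-ins, not the actual red_worker code.

/* Hypothetical sketch only; not real spice-server code. */
#include <stdbool.h>

#define MAX_PIPE_SIZE 50 /* assumed cap, per the discussion above */

typedef struct Worker {
    int  pipe_size;   /* items currently queued toward the client */
    bool vm_running;  /* mirrors worker->running */
    int  ring_items;  /* stand-in for the guest command ring depth */
} Worker;

static bool more_commands_in_ring(Worker *w) { return w->ring_items > 0; }

static void push_one_command(Worker *w)
{
    /* stand-in for reading one command and queuing it for the client */
    w->ring_items--;
    w->pipe_size++;
}

static void drain_command_ring(Worker *w)
{
    while (more_commands_in_ring(w)) {
        /* The cap protects device memory while the guest keeps producing
         * commands, but a stopped-VM update may exceed it. */
        if (w->vm_running && w->pipe_size >= MAX_PIPE_SIZE) {
            break;
        }
        push_one_command(w);
    }
}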
Hi Golita, can you also attach the qemu output log? Thanks, Yonit.
Hi Yonit, Please refer to comment #9 for the qemu output log. b.r. Golita
I'm closing this bug since its fix is in qemu-kvm and doesn't involve spice-server (bug 748810)