Bug 736631

Summary: qemu crashes if screen dump is called when the vm is stopped
Product: Red Hat Enterprise Linux 6
Component: spice-server
Version: 6.2
Reporter: Golita Yue <gyue>
Assignee: Yonit Halperin <yhalperi>
QA Contact: Desktop QE <desktop-qa-list>
Status: CLOSED NOTABUG
Severity: high
Priority: high
Keywords: Reopened
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
CC: acathrow, alevy, bcao, cfergeau, dblechte, juzhang, kraxel, michen, mkenneth, mkrcmari, pvine, shuang, tburke, virt-maint, xwei
Clones: 748810
Bug Blocks: 748810, 798195
Last Closed: 2012-02-15 15:08:27 UTC

Description Golita Yue 2011-09-08 10:22:26 UTC
Description of problem:
While running acceptance testing in a loop with a win2008-64 guest (stop/continue, migrate to file, and live migration during reboot), QEMU crashed 4 times.
The bug was hit on both Intel and AMD hosts with the win2008-64 guest.

(gdb) bt
#0  0x00000035a4232945 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00000035a4234125 in abort () at abort.c:92
#2  0x00000035af230ad9 in handle_dev_update (listener=0x7f2753e2ac00, events=<value optimized out>) at red_worker.c:9725
#3  handle_dev_input (listener=0x7f2753e2ac00, events=<value optimized out>) at red_worker.c:9982
#4  0x00000035af22fa75 in red_worker_main (arg=<value optimized out>) at red_worker.c:10304
#5  0x00000035a46077e1 in start_thread (arg=0x7f2753fff700) at pthread_create.c:301
#6  0x00000035a42e578d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Version-Release number of selected component (if applicable):
host info:
kernel 2.6.32-193.el6.x86_64
qemu-kvm-0.12.1.2-2.188.el6.x86_64

guest info:
win2008-64-virtio_nic-virtio_blk.qcow2

How reproducible:
QEMU crashed 4 times during the test runs.

Steps to Reproduce:
1. While running the acceptance test loop, do stop/cont testing:
   run (qemu)stop and then (qemu)cont; the following output is displayed:
   (qemu) /bin/sh: line 1:  3680 Aborted 
2. do migration-exec testing:
   run: {'execute': 'migrate', 'arguments': {'uri': 'exec:gzip -c > /tmp/exec-BwhstLIN.gz', 'blk': False, 'inc': False}, 'id': '0pyhCZmf'}
   (qemu) /bin/sh: line 1: 13568 Aborted
3. do live migrate during reboot testing:
   (qemu) /bin/sh: line 1: 14346 Aborted   
  
Actual results:
QEMU crashes.

Expected results:
QEMU does not crash.

Additional info:

[cmd to boot guest]:
qemu-kvm \
  -drive file='win2008-64-virtio.qcow2',index=0,if=none,id=drive-virtio-disk1,media=disk,cache=none,format=qcow2,aio=native \
  -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk1,id=virtio-disk1 \
  -device virtio-net-pci,netdev=id69qZBY,mac=9a:e9:d6:d0:bb:c6,id=ndev00id69qZBY,bus=pci.0,addr=0x3 \
  -netdev tap,id=id69qZBY,vhost=on,ifname='t0-171613-I0ct',script='/etc/qemu-ifup-switch',downscript='no' \
  -m 2048 -smp 2,cores=1,threads=1,sockets=2 \
  -drive file='winutils.iso',index=1,if=none,id=drive-ide0-0-0,media=cdrom,readonly=on,format=raw \
  -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 \
  -cpu cpu64-rhel6,+sse2,+x2apic \
  -spice port=8001,disable-ticketing -vga qxl \
  -rtc base=localtime,clock=host,driftfix=none \
  -M rhel6.1.0 -boot order=cdn,once=c,menu=off \
  -usbdevice tablet -enable-kvm -incoming tcp:0:5200

[host info]:  amd-9600b-8-1
processor	: 3
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 2
model name	: AMD Phenom(tm) 9600B Quad-Core Processor
stepping	: 3
cpu MHz		: 2300.000
cache size	: 512 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs npt lbrv svm_lock
bogomips	: 4609.46
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

[root@amd-9600b-8-1 ~]# cat /proc/meminfo 
MemTotal:        7995668 kB
MemFree:         3254684 kB
Buffers:            8332 kB
Cached:          2290936 kB
SwapCached:            0 kB
Active:          2464324 kB
Inactive:        2075120 kB
Active(anon):    2240384 kB
Inactive(anon):       16 kB
Active(file):     223940 kB
Inactive(file):  2075104 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       8388600 kB
SwapFree:        8388600 kB
Dirty:                44 kB
Writeback:             0 kB
AnonPages:       2240172 kB
Mapped:            23360 kB
Shmem:               228 kB
Slab:              73620 kB
SReclaimable:      15880 kB
SUnreclaim:        57740 kB
KernelStack:        1600 kB
PageTables:        10276 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    12386432 kB
Committed_AS:    2710956 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      126936 kB
VmallocChunk:   34359485244 kB
HardwareCorrupted:     0 kB
AnonHugePages:   2131968 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        6976 kB
DirectMap2M:     3072000 kB
DirectMap1G:     5242880 kB

Comment 2 Golita Yue 2011-09-09 06:03:30 UTC
Hit this bug in each of the following cases:
1. stop/continue
2. migrate to file
3. live migrate during reboot testing

Comment 3 Golita Yue 2011-09-09 08:54:57 UTC
host info:
spice-server-0.8.2-3.el6.x86_64

Comment 4 Dor Laor 2011-09-11 11:19:59 UTC
Moving to spice owners, please test w/o spice/qxl too.

Comment 5 Golita Yue 2011-09-13 04:43:04 UTC
Four jobs with vnc have already been created to retest this bug.
I will update with the test results when the jobs finish.

Comment 6 Alon Levy 2011-09-13 08:37:09 UTC
Can you provide a backtrace of all threads, not just the crashing one? I suspect this has something to do with an I/O write after vm stop.
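
For reference, an all-thread backtrace can be collected with standard gdb commands (the binary path and PID below are placeholders; opening a core file works the same way):

  gdb /usr/libexec/qemu-kvm <qemu-pid>
  (gdb) set pagination off
  (gdb) thread apply all bt full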

Comment 7 Uri Lublin 2011-09-13 08:57:01 UTC
This seems like a duplicate of rhbz#729621.
Please provide the qemu-kvm output.
Does it contain a line with "ASSERT worker->running failed"?
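
For example, if the qemu-kvm stdout/stderr is redirected to a log file (the path below is only a placeholder for wherever the test harness writes it):

  grep -n 'ASSERT worker->running failed' /path/to/qemu-kvm-output.log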

Comment 8 Golita Yue 2011-09-15 07:35:44 UTC
(In reply to comment #5)
> Already created four jobs with vnc to retest this bug. 
> I will update the test result when job finished.

Tested with vnc: no qemu crash was hit, and the jobs passed.
the job link:
https://virtlab.englab.nay.redhat.com/job/38140/details/
https://virtlab.englab.nay.redhat.com/job/38139/details/
https://virtlab.englab.nay.redhat.com/job/38137/details/
https://virtlab.englab.nay.redhat.com/job/38136/details/

Comment 9 Golita Yue 2011-09-15 07:41:07 UTC
(In reply to comment #7)
> This seems like a duplicate of rhbz#729621.
> Please provide qemu-kvm output.
> Does it contain a line with "ASSERT worker->running failed" ?

Yes, the output contains that line:
(qemu) handle_dev_update: ASSERT worker->running failed
(qemu) /bin/sh: line 1: 13568 Aborted
(qemu) (Process terminated with status 134)

If you think this is a duplicate of rhbz#729621, please close this bug. Thanks.
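
For context, the assertion message and the backtrace in the description point at the same spot: spice-server's red_worker aborts when qemu asks it to update (render) an area, which a screen dump triggers, while the vm is not running. The following is a minimal, self-contained illustration of that pattern, not the actual spice-server code; the names only mirror the messages above:

  /* Illustration only: reproduces the assert-then-abort behaviour seen in
   * the log ("ASSERT worker->running failed") and in the backtrace
   * (SIGABRT inside handle_dev_update). */
  #include <assert.h>
  #include <stdio.h>

  typedef struct {
      int running;   /* set when the vm starts, cleared when it stops */
  } RedWorker;

  /* Stand-in for the worker's update handler: qemu requests an area update
   * (e.g. for a screen dump) and the worker insists the vm is running. */
  static void handle_dev_update(RedWorker *worker)
  {
      assert(worker->running && "ASSERT worker->running failed");
      printf("update area rendered\n");
  }

  int main(void)
  {
      RedWorker worker = { .running = 1 };
      handle_dev_update(&worker);   /* fine while the vm runs */

      worker.running = 0;           /* vm stopped (stop, migrate to file, ...) */
      handle_dev_update(&worker);   /* screen dump now: assert fails, SIGABRT */
      return 0;
  }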

Comment 10 juzhang 2011-09-15 09:10:21 UTC
Since this issue's component was changed from qemu-kvm to spice-server, resetting qa_ack to ?.

Comment 11 RHEL Program Management 2011-10-07 16:07:10 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 12 David Blechter 2011-10-16 22:31:51 UTC
Proposed for 6.3 based on comment 11.

Comment 13 Dor Laor 2011-10-17 08:02:46 UTC

*** This bug has been marked as a duplicate of bug 729621 ***

Comment 19 Yonit Halperin 2011-10-23 08:05:58 UTC
Options for a solution:
(1) If screen_dump occurs when the vm is stopped, qxl will return an empty screen dump.
(2) A different API for screen_dump requests (or an additional flag to update_area). When we receive a screen dump request, we return the screen dump according to the latest spice server state, which can be older than that of the device.
(3) Holding a flag that indicates whether it is OK to access the guest memory. This flag will be false between pre_load and post_load. We will allow screen_dumps when the flag is true (I don't think screen dumps are even possible otherwise, since the migration target is blocked). Still, I think we need a different API for screen_dumps, since calls for update_area should be triggered from the vm when it is stopped. (A minimal sketch of this flag idea is appended at the end of this comment.)
(4) Upon stopping the vm, read and empty the command rings. If a client is connected, the time it takes to empty the command ring depends on sending all the commands to the client (or a timeout), and it will affect migration. Then, when we receive an update_area command, we don't need to read the command rings.
     

CC'ing gerd for comments.
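
As promised above, a minimal, self-contained sketch of the option (3) flag idea. The names (guest_mem_accessible, pre_load/post_load, screen_dump) are placeholders for illustration, not the actual spice-server or qemu API:

  /* Sketch only: gate screen dumps on a flag saying whether guest memory
   * may be touched; the flag is false between pre_load and post_load. */
  #include <stdbool.h>
  #include <stdio.h>

  static bool guest_mem_accessible = true;

  static void pre_load(void)  { guest_mem_accessible = false; }
  static void post_load(void) { guest_mem_accessible = true;  }

  /* Hypothetical screen-dump entry point: only walk the qxl command ring
   * (guest memory) when the flag allows it; otherwise serve the last
   * rendered surface kept on the host side. */
  static void screen_dump(const char *filename)
  {
      if (guest_mem_accessible) {
          printf("render from the guest command ring -> %s\n", filename);
      } else {
          printf("serve the last host-side surface -> %s\n", filename);
      }
  }

  int main(void)
  {
      screen_dump("before-load.ppm");   /* normal case: guest state accessible */
      pre_load();                       /* incoming migration starts */
      screen_dump("during-load.ppm");   /* guest memory must not be read */
      post_load();                      /* device state restored */
      screen_dump("after-load.ppm");
      return 0;
  }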

Comment 20 Alon Levy 2011-10-23 08:50:38 UTC
(In reply to comment #19)
> Options for solution:
> (1) If screen_dump occurs when the vm is stopped, qxl will return an empty
> screen dump.

Not an option, misleading to users.

> (2) 
> Different api for screen_dumps requests (or an additional flag to update_area).
> And when we receive screen dump request, we return the screen dump according to
> the latest spice server state, which can be older than the one of the device.
> (3) Holding a flag that indicates if it is o.k to access the guest memory. This
> flag will be false between pre_load and post_load. We will allow screen_dumps
> when the flag is true (don't think screen dumps are even possible otherwise
> since the migration target is blocked).

I think so too. Not sure if this is something we can rely on to stay in the future.

> Still I think we need a different api for screen_dumps, since calls for
> update_area should be triggered from the vm when it is stopped.

The right thing to do, it seems to me, is to read the command ring (if we can, with the flag) and not force the to-client pipe size to be bounded. With the current implementation that would mean we may exceed the 50 items in the outgoing pipe. I think it would still be OK, since the next update to the client (not a screen_dump or update_area, if we keep the old API) will flush it.

> (4) upon stopping the vm, reading and emptying the command rings. If a client
> is connected, the time it will take to empty the command ring, depends on
> sending all the commands to the client (or a timeout), and it will affect
> migration. Then, when we receive an update_area command, we don't need to read
> the command rings. 

Unbounded time to complete, not a good idea.

> 
> 
> CC'ing gerd for comments.

Comment 21 Yonit Halperin 2011-10-23 09:03:06 UTC
(In reply to comment #20)
> (In reply to comment #19)
> > Options for solution:
> > (1) If screen_dump occurs when the vm is stopped, qxl will return an empty
> > screen dump.
> 
> Not an option, misleading to users.
> 
> > (2) 
> > Different api for screen_dumps requests (or an additional flag to update_area).
> > And when we receive screen dump request, we return the screen dump according to
> > the latest spice server state, which can be older than the one of the device.
> > (3) Holding a flag that indicates if it is o.k to access the guest memory. This
> > flag will be false between pre_load and post_load. We will allow screen_dumps
> > when the flag is true (don't think screen dumps are even possible otherwise
> > since the migration target is blocked).
> 
> I think so too. Not sure if this is something we can rely on to stay in the
> future.
> 
> > Still I think we need a different api for screen_dumps, since calls for
> > update_area should be triggered from the vm when it is stopped.
> 
> The right thing to do seems to me is to read the command ring (if we can, with
> the flag) and not to force the to-client pipe size to be bounded. With the
> current implementation that would mean we may cross the 50 items in the
> outgoing pipe. I think it would still be ok, since the next update to the
> client (not a screen_dump or update_area if we keep the old api) will flush it.
> 
IMHO, the issue of limiting the pipe size is not related to this bug. The pipe limit affects every update_area, not only the ones triggered by a screen dump. The limit exists in order not to overflow the device memory, and we know it should be made smarter than a fixed 50 items. But I think that is outside the scope of this bug.
> > (4) upon stopping the vm, reading and emptying the command rings. If a client
> > is connected, the time it will take to empty the command ring, depends on
> > sending all the commands to the client (or a timeout), and it will affect
> > migration. Then, when we receive an update_area command, we don't need to read
> > the command rings. 
> 
> Unbounded time to complete, not a good idea.
I don't like it either, but I presented it because it is simple.
> 
> > 
> > 
> > CC'ing gerd for comments.

Comment 22 Yonit Halperin 2011-10-23 09:09:40 UTC
Hi Golita,
can you also attach the qemu output log?

Thanks,
Yonit.

Comment 23 Golita Yue 2011-10-24 10:23:59 UTC
Hi Yonit,

Please refer to comment #9 for the qemu output log.

b.r.
Golita

Comment 29 Yonit Halperin 2012-02-15 15:08:27 UTC
I'm closing this bug since its fix is in qemu-kvm and doesn't involve spice-server (bug 748810).