Bug 700134

Summary: [qemu-kvm] - qxl runs i/o requests synchronously
Product: Red Hat Enterprise Linux 6
Reporter: Ortal <ogvura>
Component: qemu-kvm
Assignee: Gerd Hoffmann <kraxel>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 6.1
CC: alevy, bcao, chayang, dallan, desktop-qa-list, djasa, juzhang, kraxel, marcandre.lureau, mkenneth, mkrcmari, tburke, virt-maint
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: qemu-kvm-0.12.1.2-2.179.el6
Doc Type: Bug Fix
Doc Text:
Cause: qxl submits requests to the spice server thread and waits synchronously for the result. Consequence: qemu is blocked for a potentially long time. Fix: Use completion notification instead of waiting. Result: The spice server thread processes the request asynchronously without stalling qemu.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-12-06 15:41:03 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 565939    
Bug Blocks: 723676    
Attachments:
Description       Flags
libvirt.log       none
NEW Logs.tar      none

Description Ortal 2011-04-27 14:50:35 UTC
Created attachment 495262 [details]
libvirt.log

Description of problem:

LIBVIRT [0.8.7-18.el6.x86_64] - VMs become Not Responding when opening/closing a few Spice client consoles to different VMs
[Can see that the DomainGetXMLDesc command got stuck]

Version-Release number of selected component (if applicable):
Libvirt version: 0.8.7-18.el6.x86_64
VDSM Version: 

How reproducible:
Always

Steps to Reproduce:
1. Start a few VMs
2. Open a Spice console for each VM
3. Attempt to close each VM's Spice console
  
Actual results:

1. The VMs' state changes to Not Responding:


[root@iron-vdsb ~]# vdsClient -s 0 list table
001a6b25-8dfc-4dd2-90ab-20d35a0f0495   5911  VM1_Win2k3           Up*                  10.35.70.6
7c89899e-8df1-4b04-8af5-d92611f366bb   6489  Win2k3-4             Up*
62c0edd4-691f-44c6-8906-382ec36650ba   6249  Win2k3-3             Up*
d5305078-0df6-4400-a137-fa8c39460514   6060  Win-XP               Up*                  10.35.70.7

2. libvirt.log:


17:31:14.761: 5935: debug : virDomainGetBlockInfo:4895 : dom=0x7f4dfc006900, (VM: name=VM1_Win2k3, uuid=001a6b25-8dfc-4dd2-90ab-20d35a0f0495), info=0x7f4de61f9fb0 flags=0
17:31:14.761: 5935: debug : remoteIO:10405 : Do proc=194 serial=1082 length=244 wait=0x7f4dfc423260
17:31:14.761: 5935: debug : remoteIO:10434 : Going to sleep 194 0x7f4dfc423260 0x7f4dfc4a33e0
17:31:14.761: 6093: debug : remoteIOEventLoop:10266 : Woken up from poll by other thread
17:31:14.821: 6514: debug : virDomainBlockStats:4506 : dom=0x7f4dfc267db0, (VM: name=Win2k3-4, uuid=7c89899e-8df1-4b04-8af5-d92611f366bb), path=hda, stats=0x7f4df4df7ae0, size=40
17:31:20.517: 7634: debug : virDomainGetXMLDesc:3256 : dom=0x7f4de8007d10, (VM: name=Win-XP, uuid=d5305078-0df6-4400-a137-fa8c39460514), flags=0
17:31:24.691: 7702: debug : virDomainGetXMLDesc:3256 : dom=0x7f4df0001040, (VM: name=Win2k3-3, uuid=62c0edd4-691f-44c6-8906-382ec36650ba), flags=0
17:34:21.082: 10417: debug : virDomainGetXMLDesc:3256 : dom=0x7f4dfc267db0, (VM: name=Win2k3-4, uuid=7c89899e-8df1-4b04-8af5-d92611f366bb), flags=0
17:34:25.193: 10482: debug : virDomainGetXMLDesc:3256 : dom=0x7f4de8007d10, (VM: name=Win-XP, uuid=d5305078-0df6-4400-a137-fa8c39460514), flags=0


Expected results:

Should be able to open/close the Spice console many times with no problem

Additional info:

** Please see attached libvirt.log

Comment 3 Ortal 2011-04-28 12:02:01 UTC
Created attachment 495519 [details]
NEW Logs.tar

Comment 4 Michal Privoznik 2011-05-03 14:45:32 UTC
Could you please attach the libvirt client logs? The libvirtd logs look good.

Comment 6 Ortal 2011-05-29 11:35:59 UTC
It's been a while since I filed this bug; sorry, but I can't find the relevant logs on that host (it no longer exists).

Comment 7 Marc-Andre Lureau 2011-06-05 18:23:25 UTC
Ortal, I suppose the console wasn't using a spice-gtk display, or was it? I can easily reproduce it with spice-gtk (integrated in virt-manager for f15).

It seems to happen more often with less-optimized QXL drivers in the guest, such as the one in RHEL 6.1. All I have to do is start a rhel61 VM, open the console, and do a lot of display operations (move a terminal window around quickly while doing a "find /", for example).

It always hangs at virDomainGetXMLDesc().

The end of the client libvirt log looks like:

19:56:23.082: 12918: debug : virDomainGetXMLDesc:3040 : dom=0x34ff050, (VM: name=rhel6.1, uuid=e60f906e-ac52-d3a8-dfd4-597fbd0847a2), flags=1
19:56:23.082: 12918: debug : remoteIO:10413 : Do proc=14 serial=616 length=64 wait=(nil)
19:56:23.082: 12918: debug : remoteIO:10485 : We have the buck 14 0x5362750 0x5362750

Comment 8 Marc-Andre Lureau 2011-06-16 17:19:37 UTC
I tracked this down a bit, and found that the bug is in spice server code.

We discussed it with Alon and found that flush_display_commands() is buggy: it waits for the client to read all commands; however, spice-gtk is asynchronous and shares the same event loop, so it hadn't read them all, and virt-manager blocks on a query, resulting in a deadlock.

We don't know how best to solve it, whether by cleaning up the function, releasing the lock, adding more locks, etc. Gerd's help would be welcome.
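
To illustrate, here is a minimal sketch of the problematic pattern, reconstructed from the backtraces below (the helper name client_has_unread_commands() and the timeout value are assumptions, not the actual red_worker.c code):

#include <unistd.h>   /* usleep */

#define DISPLAY_CLIENT_TIMEOUT 1000   /* microseconds; value assumed */

/* Wait for the client to read every pending display command. */
static void flush_display_commands(RedWorker *worker)
{
    while (client_has_unread_commands(worker)) {
        /* Busy-wait; matches frame #2 of thread 24 below.  Meanwhile
         * the qemu vcpu thread is blocked in qxl_worker_update_area()
         * waiting for this worker to answer. */
        usleep(DISPLAY_CLIENT_TIMEOUT);
    }
    /* If the client (spice-gtk) shares its event loop with the
     * management app (virt-manager) that is itself blocked on a
     * libvirt query, the client never reads, and this loop never
     * exits. */
}

The backtraces show exactly this: thread 24 spinning in usleep() inside flush_display_commands(), thread 25 (a vcpu thread) blocked in read() under qxl_worker_update_area(), and the main thread waiting on the iothread mutex.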

Thread 25 (Thread 0x7f8a62e4b700 (LWP 12200)):
#0  0x0000003b3280e57d in read () from /lib64/libpthread.so.0
#1  0x0000003b34c1271a in read (__nbytes=4, __buf=0x7f8a62e4a9ec, __fd=14) at /usr/include/bits/unistd.h:45
#2  receive_data (n=4, in_buf=0x7f8a62e4a9ec, fd=14) at red_worker.h:131
#3  read_message (message=0x7f8a62e4a9ec, fd=14) at red_worker.h:144
#4  qxl_worker_update_area (qxl_worker=<optimized out>, surface_id=938, qxl_area=<optimized out>, qxl_dirty_rects=0x0, num_dirty_rects=0, clear_dirty_region=0)
    at red_dispatcher.c:227
#5  0x00000000005db1a4 in ioport_write (opaque=0x33b4a10, addr=<optimized out>, val=0) at /usr/src/debug/qemu-kvm-0.14.0/hw/qxl.c:991
#6  0x000000000042f4df in kvm_handle_io (count=1, size=1, direction=1, data=<optimized out>, port=49218) at /usr/src/debug/qemu-kvm-0.14.0/kvm-all.c:818
#7  kvm_run (env=0x2f88940) at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:617
#8  0x0000000000430a39 in kvm_cpu_exec (env=<optimized out>) at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1233
#9  0x0000000000430cc1 in kvm_main_loop_cpu (env=0x2f88940) at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1419
#10 ap_main_loop (_env=0x2f88940) at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1466
#11 0x0000003b32807cd1 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003b324e19bd in clone () from /lib64/libc.so.6

Thread 24 (Thread 0x7f8a613e8700 (LWP 12201)):
#0  0x0000003b324adf5d in nanosleep () from /lib64/libc.so.6
#1  0x0000003b324db5b4 in usleep () from /lib64/libc.so.6
#2  0x0000003b34c2d68a in flush_display_commands (worker=0x7f8a61213890) at red_worker.c:8787
#3  0x0000003b34c2f232 in handle_dev_update (worker=0x7f8a61213890) at red_worker.c:9723
#4  handle_dev_input (listener=0x7f8a61213890, events=<optimized out>) at red_worker.c:9936
#5  0x0000003b34c30385 in red_worker_main (arg=<optimized out>) at red_worker.c:10286
#6  0x0000003b32807cd1 in start_thread () from /lib64/libpthread.so.0
#7  0x0000003b324e19bd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f8a630599c0 (LWP 12195)):
#0  0x0000003b3280e29d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003b32809ec7 in _L_lock_862 () from /lib64/libpthread.so.0
#2  0x0000003b32809d1b in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000431758 in kvm_mutex_lock () at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1730
#4  qemu_mutex_lock_iothread () at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1744
#5  0x000000000041e569 in main_loop_wait (nonblocking=<optimized out>) at /usr/src/debug/qemu-kvm-0.14.0/vl.c:1412
#6  0x0000000000431547 in kvm_main_loop () at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1589
#7  0x000000000040b0a5 in main_loop () at /usr/src/debug/qemu-kvm-0.14.0/vl.c:1464
#8  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /usr/src/debug/qemu-kvm-0.14.0/vl.c:3236

Comment 9 Gerd Hoffmann 2011-06-17 07:47:11 UTC
Reassigning to qemu-kvm.

The fundamental issue is that the qxl device runs certain long-running I/O operations synchronously. qemu-kvm needs a fix, and maybe spice-server does too (depends on how we fix it on the host side). The guest drivers will need fixes as well, since we'll have to extend the device interface.
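
To make the direction concrete, here is a minimal sketch of the completion-notification pattern (all identifiers below are illustrative assumptions, not the actual qemu-kvm/spice patch): the I/O port handler queues the request for the spice worker thread and returns immediately, and the worker later signals completion to the guest with an interrupt.

/* Hypothetical sketch only -- names and structure are assumed. */

/* I/O port write handler: queue the request and return at once, so the
 * vcpu thread is never parked in a blocking read() on the worker pipe. */
static void qxl_update_area_async(PCIQXLDevice *d, uint32_t surface_id)
{
    dispatch_to_worker(d->worker, QXL_IO_UPDATE_AREA, surface_id);
}

/* Runs later, on the spice worker thread, once the update has finished:
 * notify the guest driver with an interrupt instead of having qemu wait
 * synchronously for the result. */
static void qxl_update_area_complete(PCIQXLDevice *d)
{
    qxl_raise_irq(d, QXL_INTERRUPT_IO_CMD);
}

This is also why the guest drivers need changes: they have to issue the asynchronous variant of the command and wait for the completion interrupt, rather than assuming the port write blocks until the operation is done.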

Comment 10 Gerd Hoffmann 2011-07-14 14:49:48 UTC
Extending cond nak to upstream.
Fixing this needs a new guest/host interface; we can't commit to that without the bits being upstream first.

Comment 19 Alon Levy 2011-08-24 14:21:10 UTC
Scratch build

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3580575

Comment 23 Mike Cao 2011-10-19 05:51:46 UTC
Tried on qemu-kvm-0.12.1.2-2.172.el6; can *not* reproduce.

Package versions:
RHEV Manager for Servers and Desktops: 3.0.0_0001-40.el6
qemu-kvm-0.12.1.2-2.172.el6.x86_64
libvirt-0.9.4-7.2.el6.x86_64
vdsm-4.9-104.el6.x86_64
spice-server-0.8.2-2.el6.x86_64

Steps:
1. Use RHEV-M to install a RHEL6u1 guest with spice
2. Make it into a template and create 5 new guests from the template
3. Start the 5 guests with spice
4. Find a Win2008R2 host to operate RHEV-M and open the guests' spice consoles.
5. Right-click the taskbar, then click "Close All Windows" to close them.

Actual Results:
After step 4, the status of the guests shows "up".
Can not reproduce this issue.

Hi Ortal,

Have I missed something in the steps above? Could you tell me how to reproduce this issue?

Thanks,
Mike

Comment 24 Ortal 2011-10-23 08:42:47 UTC
Hi Mike,

Those are the exact steps. Try to open more than 5 spice consoles; maybe in that case it will reproduce for you.

Thanks,
Ortal.

Comment 25 Mike Cao 2011-10-25 03:31:36 UTC
Retried this with 12 VMs according to comment #23 and comment #24;
still can not reproduce this issue. When I try to reproduce this one, I always hit bug 697441.

bcao --> Gerd:
Hi Gerd,

As the reporter said, closing the spice client causes the guest to go into the *unknown* status that RHEV-M shows.
Under which circumstances will qemu-kvm cause this? Could you provide me a way to reproduce it?

TIA,
Mike

Comment 26 Mike Cao 2011-10-28 08:27:53 UTC
Tried on qemu-kvm-0.12.1.2-2.200.el6.x86_64

Repeated the steps in comment #0 more than 30 times (both Windows and Linux guests).

Actual Results:
On qemu-kvm-0.12.1.2-2.200.el6, everything works fine: no errors in libvirt.log, and the guests still show "Up" both in the RHEV-M GUI and in "vdsClient -s 0 list table".

Comment 27 juzhang 2011-10-28 10:55:37 UTC
Hi Gerd,

According to comment 25, we failed to reproduce this issue but did reproduce bz697441; it seems bz697441 blocked us from reproducing this issue. According to comment 26, with qemu-kvm-0.12.1.2-2.200.el6.x86_64, both bz697441 and this issue work fine. Would you please double-check our steps? If they are right, can we change this issue's status to verified?

Comment 28 Gerd Hoffmann 2011-10-28 11:18:05 UTC
I never reproduced it myself as I have no local RHEV-M setup.
So if Ortal says in comment 24 that the steps are correct, and you can't reproduce it this way, it should really be fixed -> setting it to verified is ok.

Comment 30 Gerd Hoffmann 2011-11-18 11:50:14 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
    qxl submits requests to the spice server thread and waits synchronously for the result.
Consequence
    qemu is blocked for a potentially long time.
Fix
    Use completion notification instead of waiting.
Result
    The spice server thread processes the request asynchronously without stalling qemu.

Comment 31 errata-xmlrpc 2011-12-06 15:41:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1531.html