Created attachment 495262 [details]
Description of problem:
LIBVIRT [0.8.7-18.el6.x86_64] - VMs become Not-Responding when opening/closing a few Spice-client consoles to different VMs
[Can see that DomainGetXMLDesc command got stuck]
Version-Release number of selected component (if applicable):
Libvirt version: 0.8.7-18.el6.x86_64
Steps to Reproduce:
1. Start a few VMs
2. Open a Spice console for each VM
3. Close the VMs' Spice console windows
Actual results:
1. VM state changes to Not Responding:
[root@iron-vdsb ~]# vdsClient -s 0 list table
001a6b25-8dfc-4dd2-90ab-20d35a0f0495 5911 VM1_Win2k3 Up* 10.35.70.6
7c89899e-8df1-4b04-8af5-d92611f366bb 6489 Win2k3-4 Up*
62c0edd4-691f-44c6-8906-382ec36650ba 6249 Win2k3-3 Up*
d5305078-0df6-4400-a137-fa8c39460514 6060 Win-XP Up* 10.35.70.7
17:31:14.761: 5935: debug : virDomainGetBlockInfo:4895 : dom=0x7f4dfc006900, (VM: name=VM1_Win2k3, uuid=001a6b25-8dfc-4dd2-90ab-20d35a0f0495), info=0x7f4de61f9fb0 flags=0
17:31:14.761: 5935: debug : remoteIO:10405 : Do proc=194 serial=1082 length=244 wait=0x7f4dfc423260
17:31:14.761: 5935: debug : remoteIO:10434 : Going to sleep 194 0x7f4dfc423260 0x7f4dfc4a33e0
17:31:14.761: 6093: debug : remoteIOEventLoop:10266 : Woken up from poll by other thread
17:31:14.821: 6514: debug : virDomainBlockStats:4506 : dom=0x7f4dfc267db0, (VM: name=Win2k3-4, uuid=7c89899e-8df1-4b04-8af5-d92611f366bb), path=hda, stats=0x7f4df4df7ae0, size=40
17:31:20.517: 7634: debug : virDomainGetXMLDesc:3256 : dom=0x7f4de8007d10, (VM: name=Win-XP, uuid=d5305078-0df6-4400-a137-fa8c39460514), flags=0
17:31:24.691: 7702: debug : virDomainGetXMLDesc:3256 : dom=0x7f4df0001040, (VM: name=Win2k3-3, uuid=62c0edd4-691f-44c6-8906-382ec36650ba), flags=0
17:34:21.082: 10417: debug : virDomainGetXMLDesc:3256 : dom=0x7f4dfc267db0, (VM: name=Win2k3-4, uuid=7c89899e-8df1-4b04-8af5-d92611f366bb), flags=0
17:34:25.193: 10482: debug : virDomainGetXMLDesc:3256 : dom=0x7f4de8007d10, (VM: name=Win-XP, uuid=d5305078-0df6-4400-a137-fa8c39460514), flags=0
Expected results:
Should be able to open/close the Spice console many times with no problem.
** Please see attached libvirt.log
Created attachment 495519 [details]
Could you please attach the libvirt client logs? The libvirtd logs seem fine.
It's been a while since this bug was filed; sorry, but I can't find the relevant logs on that host (it no longer exists).
Ortal, I suppose the console didn't have a spice-gtk display, or did it? I can easily reproduce it with spice-gtk (integrated in virt-manager for F15).
It seems to happen more often with less optimized QXL drivers in the guest, such as the one in RHEL 6.1. All I have to do is start a RHEL 6.1 VM, open the console, and do a lot of display operations (move a terminal window around quickly while running "find /", for example).
It always hangs at virDomainGetXMLDesc().
The end of the client libvirt log looks like:
19:56:23.082: 12918: debug : virDomainGetXMLDesc:3040 : dom=0x34ff050, (VM: name=rhel6.1, uuid=e60f906e-ac52-d3a8-dfd4-597fbd0847a2), flags=1
19:56:23.082: 12918: debug : remoteIO:10413 : Do proc=14 serial=616 length=64 wait=(nil)
19:56:23.082: 12918: debug : remoteIO:10485 : We have the buck 14 0x5362750 0x5362750
I tracked this down a bit, and found that the bug is in spice server code.
We discussed it with Alon and found that flush_display_commands() is buggy: it waits for the client to read all pending commands. However, spice-gtk is asynchronous and shares its event loop with virt-manager; the client hadn't read everything, and virt-manager was blocked on a query, resulting in a deadlock.
We don't know the best way to solve it: cleaning up the function, releasing the lock, adding more locks, etc. Gerd's help would be welcome.
Thread 25 (Thread 0x7f8a62e4b700 (LWP 12200)):
#0 0x0000003b3280e57d in read () from /lib64/libpthread.so.0
#1 0x0000003b34c1271a in read (__nbytes=4, __buf=0x7f8a62e4a9ec, __fd=14) at /usr/include/bits/unistd.h:45
#2 receive_data (n=4, in_buf=0x7f8a62e4a9ec, fd=14) at red_worker.h:131
#3 read_message (message=0x7f8a62e4a9ec, fd=14) at red_worker.h:144
#4 qxl_worker_update_area (qxl_worker=<optimized out>, surface_id=938, qxl_area=<optimized out>, qxl_dirty_rects=0x0, num_dirty_rects=0, clear_dirty_region=0)
#5 0x00000000005db1a4 in ioport_write (opaque=0x33b4a10, addr=<optimized out>, val=0) at /usr/src/debug/qemu-kvm-0.14.0/hw/qxl.c:991
#6 0x000000000042f4df in kvm_handle_io (count=1, size=1, direction=1, data=<optimized out>, port=49218) at /usr/src/debug/qemu-kvm-0.14.0/kvm-all.c:818
#7 kvm_run (env=0x2f88940) at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:617
#8 0x0000000000430a39 in kvm_cpu_exec (env=<optimized out>) at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1233
#9 0x0000000000430cc1 in kvm_main_loop_cpu (env=0x2f88940) at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1419
#10 ap_main_loop (_env=0x2f88940) at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1466
#11 0x0000003b32807cd1 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003b324e19bd in clone () from /lib64/libc.so.6
Thread 24 (Thread 0x7f8a613e8700 (LWP 12201)):
#0 0x0000003b324adf5d in nanosleep () from /lib64/libc.so.6
#1 0x0000003b324db5b4 in usleep () from /lib64/libc.so.6
#2 0x0000003b34c2d68a in flush_display_commands (worker=0x7f8a61213890) at red_worker.c:8787
#3 0x0000003b34c2f232 in handle_dev_update (worker=0x7f8a61213890) at red_worker.c:9723
#4 handle_dev_input (listener=0x7f8a61213890, events=<optimized out>) at red_worker.c:9936
#5 0x0000003b34c30385 in red_worker_main (arg=<optimized out>) at red_worker.c:10286
#6 0x0000003b32807cd1 in start_thread () from /lib64/libpthread.so.0
#7 0x0000003b324e19bd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f8a630599c0 (LWP 12195)):
#0 0x0000003b3280e29d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003b32809ec7 in _L_lock_862 () from /lib64/libpthread.so.0
#2 0x0000003b32809d1b in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000431758 in kvm_mutex_lock () at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1730
#4 qemu_mutex_lock_iothread () at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1744
#5 0x000000000041e569 in main_loop_wait (nonblocking=<optimized out>) at /usr/src/debug/qemu-kvm-0.14.0/vl.c:1412
#6 0x0000000000431547 in kvm_main_loop () at /usr/src/debug/qemu-kvm-0.14.0/qemu-kvm.c:1589
#7 0x000000000040b0a5 in main_loop () at /usr/src/debug/qemu-kvm-0.14.0/vl.c:1464
#8 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /usr/src/debug/qemu-kvm-0.14.0/vl.c:3236
Reassigning to qemu-kvm.
The fundamental issue is that the qxl device runs certain long-running I/O operations synchronously. qemu-kvm needs a fix, and maybe spice-server too (depending on how we fix it on the host side). The guest drivers will also need fixes, since we'll have to extend the device interface.
Extending the conditional NAK to upstream.
Fixing this needs a new guest/host interface; we can't commit to that without the bits being upstream first.
Tried on qemu-kvm-0.12.1.2-2.172.el6; can *not* reproduce.
RHEV Manager for Servers and Desktops :3.0.0_0001-40.el6
1. Use RHEVM to install a RHEL6u1 guest with Spice
2. Make it into a template and create 5 new guests from the template
3. Start the 5 guests with Spice
4. Find a Win2008R2 host to operate RHEVM, and open the guests' Spice consoles
5. Right-click the task bar, click "Close All Windows", then close them
After step 4, the status of the guests shows "Up".
Cannot reproduce this issue.
Have I missed something in the steps above? Could you tell me how to reproduce this issue?
Those are the exact steps. Try to open more than 5 Spice consoles; maybe in that case it will reproduce for you.
Retried this with 12 VMs according to comment #23 and comment #24.
Still cannot reproduce this issue. Whenever I try to reproduce it, I always hit bug 697441.
As the reporter said, closing the Spice client causes the guest to show as *unknown* status in RHEVM.
Under which circumstances would qemu-kvm cause this? Could you provide a way to reproduce it?
Tried on qemu-kvm-0.12.1.2-2.200.el6.x86_64.
Repeated the steps in comment #0 more than 30 times (with both Windows and Linux guests).
On qemu-kvm-0.12.1.2-2.200.el6, everything works fine: no errors in libvirt.log, and the guests still show "Up" both in the RHEVM GUI and in "vdsClient -s 0 list table".
According to comment #25, we failed to reproduce this issue but did reproduce bz697441; it seems bz697441 blocked us from reproducing this one. According to comment #26, with qemu-kvm-0.12.1.2-2.200.el6.x86_64 both bz697441 and this issue work fine. Would you please double-check our steps? If they are right, can we change this issue's status to VERIFIED?
I never reproduced it myself, as I have no local RHEV-M setup.
So if Ortal says in comment 24 that the steps are OK and you can't reproduce it this way, it should really be fixed; setting it to VERIFIED is OK.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
qxl submits requests to the spice server thread and waits synchronously for the result, so
qemu is blocked for a potentially long time.
Fix: use completion notification instead of waiting, so that the
spice server thread processes the request asynchronously without stalling qemu.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.