Description of problem:
After live-migrating a Windows VM several times between two hypervisor nodes, the SPICE and RDP connections suddenly stop working.

Version-Release number of selected component (if applicable):
oVirt 3.4.1
Hypervisor nodes: Fedora 20
QEMU: 1.6.2
SPICE: 0.12.5
VM: Windows XP
VM drivers: guest tools 0.74

How reproducible:
100%

Steps to Reproduce:
1. Create a SPICE-based Windows VM
2. Install XP into that VM
3. Migrate the VM several times between two hypervisor hosts

Actual results:
After 10-20 migrations the VM console can no longer be opened, neither through RDP nor through SPICE.

Expected results:
Connection should work.

Additional info:
SPICE log (level 4) attached; screenshot of SPICE connection attempt attached.
Created attachment 902192 [details] qemu/spice log
Created attachment 902193 [details] spice console
To clarify: I have spice-server 0.12.5 installed from the Fedora testing repositories. In 0.12.4 we saw several endless loops because 64-bit timestamps were truncated to 32 bits.
A working VM shows the following SPICE log lines around the error:

((null):19914): SpiceWorker-Info **: red_worker.c:9635:display_channel_client_wait_for_init: creating encoder with id == 0

<----- Here the connection hangs ----->
<----- Below, the lines of a normally working VM ----->

((null):19914): Spice-Info **: reds.c:2014:reds_handle_auth_mechanism: Auth method: 1
((null):19914): Spice-Info **: reds.c:1421:reds_info_new_channel: channel 3:0, connected successfully, over Secure link
inputs_connect: inputs channel client create
Compiled spice-server 0.12.5 from scratch and added further logging instructions. Output:

((null):10731): SpiceWorker-Info **: red_worker.c:9635:display_channel_client_wait_for_init: MST creating encoder with id == 0
((null):10731): SpiceWorker-Info **: red_worker.c:9637:display_channel_client_wait_for_init: MST glz_encoder_created
((null):10731): SpiceWorker-Info **: red_worker.c:9668:on_new_display_channel_client: MST red_channel_client_ack_zero_messages_window
((null):10731): SpiceWorker-Info **: red_worker.c:9678:on_new_display_channel_client: MST end
((null):10731): SpiceWorker-Info **: red_worker.c:10652:handle_new_display_channel: MST end
((null):10731): SpiceWorker-Info **: red_worker.c:11449:handle_dev_display_connect: MST connected
((null):10731): SpiceWorker-Info **: red_worker.c:10119:display_channel_handle_message: MST before_handle_message
((null):10731): Spice-Info **: red_channel.c:1531:red_channel_client_handle_message: MST start type=1
<---- The last tracepoint that is reached in the error case ---->

The next tracepoints, reached only in the OK case:

((null):10731): Spice-Info **: reds.c:2266:reds_accept_ssl_connection: MST start
((null):10731): Spice-Info **: reds.c:2230:reds_init_client_ssl_connection: MST start

Conclusion: the SPICE connection to the bad VM does not reach reds_accept_ssl_connection() after the display_channel_client_wait_for_init() call.

I'm willing to enhance the tracepoints further. Any ideas where to search next? I'm having trouble understanding the watcher concept and finding a good point to debug the sent/received packets.
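For reference, here is a minimal self-contained sketch of the tracepoint technique used above: bracketing a suspect call with info-level log lines so the last line seen tells you where execution stopped. The spice_info() below is a stand-in macro I define myself for the sketch; the real macro comes from spice-common and already prints file/line/function:

    // Sketch only, NOT spice-server code: a printf stand-in mimicking
    // the "Spice-Info **: file:line:function" format of the log above.
    #include <cstdio>

    #define spice_info(fmt, ...) \
        std::fprintf(stderr, "Spice-Info **: %s:%d:%s: " fmt "\n", \
                     __FILE__, __LINE__, __func__, ##__VA_ARGS__)

    static void suspect_call() { /* stands in for the real work */ }

    int main()
    {
        spice_info("MST before suspect_call");  // last line seen => hang is at/after the call
        suspect_call();
        spice_info("MST after suspect_call");   // line missing => call never returned
        return 0;
    }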
please attach vdsm.log as well
Created attachment 902418 [details] vdsm
VDSM log from the migration target host attached. The log starts before the migration is initiated and ends after I tried to log on to the SPICE console via webadmin.
Remark: the following behaviour can be reproduced even without opening a SPICE console connection; an online migration of the VM to the SPICE-debug-enabled Fedora 20 hypervisor is enough.

I narrowed the error down a bit in red_worker_main(). In a normally working VM the endless loop in this function runs permanently. In the "damaged" VM this loop suddenly halts. I placed two debugging messages around the call:

    spice_info("before_poll");
    num_events = poll(worker->poll_fds, MAX_EVENT_SOURCES, worker->event_timeout);
    spice_info("after_poll");

On the bad VM that call suddenly stops returning and stalls; the last log line shows "before_poll". On the good VM the call always returns and log lines keep being written.
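For context, here is a self-contained sketch of the poll()-driven loop pattern that red_worker_main() follows. This is simplified and NOT the actual spice-server code (the fd setup and timeout are illustrative); it only shows why a stalled poll() leaves "before_poll" as the last log line:

    // Simplified illustration of a poll()-driven worker loop; NOT the
    // real red_worker_main(). With timeout -1 and no fd activity,
    // poll() blocks forever, so "before_poll" is the last line printed.
    #include <poll.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
        struct pollfd fds[1];
        fds[0].fd = STDIN_FILENO;  // stands in for the worker's channel fds
        fds[0].events = POLLIN;

        for (;;) {
            std::printf("before_poll\n");
            int num_events = poll(fds, 1, -1 /* block until activity */);
            std::printf("after_poll: %d\n", num_events);
            if (num_events < 0)
                break;             // poll() error
            if (fds[0].revents & POLLIN) {
                char buf[256];
                if (read(fds[0].fd, buf, sizeof(buf)) <= 0)
                    break;         // EOF or read error: leave the loop
            }
        }
        return 0;
    }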
Please ignore my last comments. I did some further analysis and the outcome is as follows: when doing live migrations in the setup described above, SPICE performance may deteriorate on the target host. This can lead to a stall of the console display. Several problems can occur:

1) The SPICE console lags. If you drag windows over the WinXP desktop very fast, they react slowly and run "behind" the mouse pointer.

2) The SPICE console is no longer responsive. You cannot select any item on the WinXP desktop. Nevertheless, the console display can still be opened from webadmin.

3) The SPICE console stays blank with the text "connecting ...".

This has been verified with the updated 0.12.5 spice-server and the original FC20 spice-server 0.12.4.

Videos of case 2 are attached.
Created attachment 902559 [details] normal performance before migration
Created attachment 902561 [details] bad performance after migration
(In reply to Markus Stockhausen from comment #5)
> Conclusion: the SPICE connection to the bad VM does not reach
> reds_accept_ssl_connection() after the
> display_channel_client_wait_for_init() call.
>
> I'm willing to enhance the tracepoints further. Any ideas where to search
> next? I'm having trouble understanding the watcher concept and finding a
> good point to debug the sent/received packets.

Have you tried getting a gdb backtrace while the server is hanging? ('debuginfo-install qemu-kvm', then 'gdb --pid $qemu_pid' followed by 'thread apply all bt')

(In reply to Markus Stockhausen from comment #10)
> Please ignore my last comments.

All of them? So there is no actual hang, but 'only' some very slow behaviour?

> This has been verified with the updated 0.12.5 spice-server and the original
> FC20 spice-server 0.12.4.

Just to make sure I got you right: 0.12.4 was working OK (save for the 32-bit timer overflow), and this issue appeared in 0.12.5?
Some further tests showed that this problem occurs only in Windows XP VMs. To be precise, we see the system performance degrading after each migration, so we can divide the error into several stages:

1) After 1-5 migrations the SPICE display gets slower. The system otherwise seems to work normally. See the video above.

2) Another 5-10 migrations later the system halts completely.

Exchanging SPICE 0.12.5 with 0.12.4 and vice versa does not change the situation.

To get closer to the error I will follow your advice and analyze what is going on when the system stalls completely. Depending on that result I will open a new BZ or continue this one.
Created attachment 907289 [details] gdb output
GDB output attached. The version state in that situation is:
- qemu 1.6.2
- libspice-server 0.12.4
BZ 1107835 has been created to analyze the major problem: the VM stalls after some migrations.
During long tests we isolated several bugs in the oVirt environment. I have therefore changed this bug's topic to better reflect a really reproducible test case; hope this is OK for the assignees. Currently it boils down to a very simple and unwanted behaviour.

Server side:
- oVirt 3.4.2
- Fedora 20 hypervisor nodes
- qemu 1.6.2
- libspice-server 0.12.4

Client side:
- Windows 7
- virt-viewer 0.6.0

Steps to reproduce:
1) Start a Windows XP SPICE VM from the oVirt webadmin
2) Connect to the console of the VM
3) virt-viewer will start and open the console window
4) Admin can control the VM "seamlessly". That means the mouse is not trapped in the console window: by simply dragging the mouse over the console window you can interact with the VM.
5) Log out from the console by closing virt-viewer
6) Repeat steps 2-5 exactly 11 times
7) Log on to the VM for the 12th time

Result: when the admin wants to interact with the console, the mouse gets trapped in the console window. The admin must press SHIFT+F12 to get out of the window.

Expected result: seamless integration, no trapped mouse.
When the mouse is trapped, can you verify whether the vdagent is still running? Most likely it is not. Could you get the vdagent & vdservice logs from when the bug occurs? Thanks
Created attachment 908784 [details] vdservice
Created attachment 908785 [details] vdagent
Logs attached - please don't tell me vdagent/vdservice are stopped if a user logs into a server too often :(
...
if (!normal_restart && ++_agent_restarts > VD_AGENT_MAX_RESTARTS) {
    vd_printf("Agent restarted too many times");
    ret = false;
    stop();
}
if (ret && kill_agent() && launch_agent()) {
    if (time - _last_agent_restart_time > VD_AGENT_RESTART_COUNT_RESET_INTERVAL) {
        _agent_restarts = 0;
    }
...

That accounting looks too aggressive. With this setup, vdservice will stop the agent for good once a user logs into the console repeatedly at intervals of < 60 seconds. Is there no better way to handle that?
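To illustrate the accounting, below is a small self-contained simulation of my reading of the quoted logic. The constants are assumptions (I have not verified the real values of VD_AGENT_MAX_RESTARTS or VD_AGENT_RESTART_COUNT_RESET_INTERVAL), and the loop itself is hypothetical; it only shows that the counter can never reset while console logins arrive less than the reset interval apart, which would be consistent with exactly the 12th login tripping the limit:

    // Hypothetical stand-alone simulation -- NOT vdservice code.
    // Assumed constants: VD_AGENT_MAX_RESTARTS = 10 and a 60 s
    // VD_AGENT_RESTART_COUNT_RESET_INTERVAL; adjust to the real values.
    #include <iostream>

    int main()
    {
        const int max_restarts = 10;      // assumed VD_AGENT_MAX_RESTARTS
        const int reset_interval_s = 60;  // assumed reset interval
        const int login_spacing_s = 30;   // user reconnects every 30 s

        int agent_restarts = 0;
        int last_restart_time = -reset_interval_s;  // "long ago"

        for (int login = 1; login <= 12; ++login) {
            int now = login * login_spacing_s;
            // every console login kills the agent -> a non-normal restart
            if (++agent_restarts > max_restarts) {
                std::cout << "login " << login
                          << ": agent stopped, mouse stays trapped\n";
                return 0;
            }
            // the counter only resets if the last restart was > 60 s ago,
            // which never happens at 30 s login spacing (except the first)
            if (now - last_restart_time > reset_interval_s) {
                agent_restarts = 0;
            }
            last_restart_time = now;
            std::cout << "login " << login
                      << ": restarts = " << agent_restarts << "\n";
        }
        return 0;
    }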
Windows XP is not treated as gracefully on agent restarts as newer versions:

case WAIT_OBJECT_0 + VD_STATIC_EVENTS_COUNT:
    vd_printf("Agent killed");
    if (_system_version == SYS_VER_WIN_XP_CLASS) {
        restart_agent(false);
    } else if (_system_version == SYS_VER_WIN_7_CLASS) {
        kill_agent();
        // Assume agent was killed due to console disconnect, and wait for agent
        // normal restart due to console connect. If the agent is not alive yet,
        // it was killed manually (or crashed), so let's restart it.
        if (WaitForSingleObject(_control_event, VD_AGENT_RESTART_INTERVAL) == WAIT_OBJECT_0) {
            handle_control_event();
        }
        if (_running && !_agent_alive) {
            restart_agent(false);
        }

Any reason for that?
The bit of code you quote was added as part of https://bugzilla.redhat.com/show_bug.cgi?id=845222 Thanks for the very detailed investigation! I'm not sure if we have a spice-vdagent-win upstream bugzilla component though :((
(In reply to Markus Stockhausen from comment #23)

The logic of this code is: when spice-vdagent exits, spice-vdservice runs a new instance of spice-vdagent. If something prevents spice-vdagent from starting, for example a problem with the virtio-serial device/driver, spice-vdservice tries VD_AGENT_MAX_RESTARTS consecutive times and then stops trying. The code assumes that, if there is no problem, at least one instance out of the VD_AGENT_MAX_RESTARTS vdagents will live for more than VD_AGENT_RESTART_COUNT_RESET_INTERVAL.

The problem is partially due to making the Windows spice vdagent exit when the spice client disconnects.
(In reply to Christophe Fergeau from comment #25)
> The bit of code you quote was added as part of
> https://bugzilla.redhat.com/show_bug.cgi?id=845222
> Thanks for the very detailed investigation!
> I'm not sure if we have a spice-vdagent-win upstream bugzilla component
> though :((

Is there some better place to ask for a fix (outside RH bugzilla)?
Markus, there is an upstream Spice w32 agent bug tracker, but no guarantee a fix will be generated. At least the bug has a better chance of surviving there, since this one will be autoclosed when F20 reaches end-of-life:

https://bugs.freedesktop.org/buglist.cgi?component=win32%20agent&product=Spice
This message is a reminder that Fedora 20 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 20. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '20'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 20 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to this bug being closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
I've opened https://bugs.freedesktop.org/show_bug.cgi?id=90819 fwiw
Since the bug was moved to the upstream tracker, I'm just closing this one; F20 would autoclose it soon anyway.