Bug 1104697 - SPICE connection to VM no longer seamless with 12th login
Summary: SPICE connection to VM no longer seamless with 12th login
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: spice
Version: 20
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Christophe Fergeau
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: virt
Depends On:
Blocks:
 
Reported: 2014-06-04 14:04 UTC by Markus Stockhausen
Modified: 2015-06-05 20:59 UTC (History)
CC List: 18 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-06-05 20:59:18 UTC
Type: Bug
Embargoed:


Attachments
qemu/spice log (231.72 KB, text/plain)
2014-06-04 14:07 UTC, Markus Stockhausen
spice console (26.61 KB, image/png)
2014-06-04 14:08 UTC, Markus Stockhausen
vdsm (76.50 KB, text/plain)
2014-06-05 07:28 UTC, Markus Stockhausen
normal performance before migration (711.77 KB, video/x-ms-wmv)
2014-06-05 14:10 UTC, Markus Stockhausen
bad performance after migration (1.07 MB, video/x-ms-wmv)
2014-06-05 14:11 UTC, Markus Stockhausen
gdb output (2.35 KB, text/plain)
2014-06-10 14:28 UTC, Markus Stockhausen
vdservice (1.03 KB, text/plain)
2014-06-14 14:35 UTC, Markus Stockhausen
vdagent (35.04 KB, text/plain)
2014-06-14 14:36 UTC, Markus Stockhausen

Description Markus Stockhausen 2014-06-04 14:04:06 UTC
Description of problem:

After live-migrating a Windows VM several times between two hypervisor nodes, the SPICE and RDP connections suddenly stop working.

Version-Release number of selected component (if applicable):

Ovirt 3.4.1
Hypervisor nodes: Fedora 20
QEMU: 1.6.2
SPICE: 0.12.5
VM: Windows XP
VM drivers: guest tools 0.74

How reproducible:

100% 

Steps to Reproduce:
1. Create a spice based windows VM 
2. Install XP into that VM
3. Migrate VM several times between two hypervisor hosts

Actual results:

4. After 10-20 migrations the VM console cannot be opened any more, neither through RDP nor through SPICE.

Expected results:

Connection should work.

Additional info:

SPICE log (level 4) attached

Screenshot of the SPICE connection attempt attached

Comment 1 Markus Stockhausen 2014-06-04 14:07:36 UTC
Created attachment 902192 [details]
qemu/spice log

Comment 2 Markus Stockhausen 2014-06-04 14:08:23 UTC
Created attachment 902193 [details]
spice console

Comment 3 Markus Stockhausen 2014-06-04 14:24:53 UTC
To clarify: I have spice-server 0.12.5 installed from the Fedora testing repositories. In 0.12.4 we hit several endless loops because 64-bit timestamps were truncated to 32 bits.
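
For illustration only (not the actual spice-server code), this is the kind of hazard meant by that: once a 64-bit nanosecond timestamp is truncated to 32 bits, deadline comparisons can give the wrong answer as soon as a value crosses the 32-bit boundary, which is how waits end up spinning or never completing:

/* Illustrative only; demonstrates why truncating 64-bit timestamps to 32 bits
 * breaks deadline comparisons. Not the actual spice-server code. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t now_ns      = 4294000000ULL;        /* just below 2^32 nanoseconds */
    uint64_t deadline_ns = now_ns + 2000000ULL;  /* 2 ms in the future, crosses 2^32 */

    uint32_t now32      = (uint32_t)now_ns;      /* truncated copies wrap around */
    uint32_t deadline32 = (uint32_t)deadline_ns;

    /* 64-bit math: the deadline is still in the future (correct). */
    printf("64-bit: deadline passed? %s\n", now_ns >= deadline_ns ? "yes" : "no");
    /* 32-bit math: the truncated deadline wrapped, so the check flips (wrong). */
    printf("32-bit: deadline passed? %s\n", now32 >= deadline32 ? "yes" : "no");
    return 0;
}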

Comment 4 Markus Stockhausen 2014-06-04 14:46:26 UTC
A working VM shows the following SPICE log lines around the error:

((null):19914): SpiceWorker-Info **: red_worker.c:9635:display_channel_client_wait_for_init: creating encoder with id == 0
<----- Here the connection hangs               ---->
<----- Below the lines of a normal working VM  ---->
((null):19914): Spice-Info **: reds.c:2014:reds_handle_auth_mechanism: Auth method: 1
((null):19914): Spice-Info **: reds.c:1421:reds_info_new_channel: channel 3:0, connected successfully, over Secure link
inputs_connect: inputs channel client create

Comment 5 Markus Stockhausen 2014-06-04 19:30:01 UTC
Compiled spice-server 0.12.5 from scratch and added further logging statements:

Output:

((null):10731): SpiceWorker-Info **: red_worker.c:9635:display_channel_client_wait_for_init: MST creating encoder with id == 0
((null):10731): SpiceWorker-Info **: red_worker.c:9637:display_channel_client_wait_for_init: MST glz_encoder_created
((null):10731): SpiceWorker-Info **: red_worker.c:9668:on_new_display_channel_client: MST red_channel_client_ack_zero_messages_window
((null):10731): SpiceWorker-Info **: red_worker.c:9678:on_new_display_channel_client: MST end
((null):10731): SpiceWorker-Info **: red_worker.c:10652:handle_new_display_channel: MST end
((null):10731): SpiceWorker-Info **: red_worker.c:11449:handle_dev_display_connect: MST connected
((null):10731): SpiceWorker-Info **: red_worker.c:10119:display_channel_handle_message: MST before_handle_message
((null):10731): Spice-Info **: red_channel.c:1531:red_channel_client_handle_message: MST start type=1
<---- The last tracepoint that is called in the error case
      The next tracepoint that is called in the OK case --->
((null):10731): Spice-Info **: reds.c:2266:reds_accept_ssl_connection: MST start
((null):10731): Spice-Info **: reds.c:2230:reds_init_client_ssl_connection: MST start

Conclusion: The SPICE connection to the bad VM does not jump into the function reds_accept_ssl_connection() after the display_channel_client_wait_for_init() call.

I'm willing to enhance the tracepoints further. Any ideas where to search next? I'm having some trouble understanding the watcher concept and finding a good point to debug the sent/received packets.

Comment 6 Omer Frenkel 2014-06-05 07:06:45 UTC
Please attach vdsm.log as well.

Comment 7 Markus Stockhausen 2014-06-05 07:28:24 UTC
Created attachment 902418 [details]
vdsm

Comment 8 Markus Stockhausen 2014-06-05 07:30:19 UTC
VDSM log from the migration target host attached. The log starts before the migration is initiated and ends after I tried to log on to the SPICE console via webadmin.

Comment 9 Markus Stockhausen 2014-06-05 08:42:14 UTC
Remark: The following behaviour can be reproduced even without opening a SPICE console connection, just by an online migration of the VM to the SPICE-debug-enabled Fedora 20 hypervisor.

I narrowed the error down a bit to the function red_worker_main(). In a normally working VM the endless loop in this function runs permanently. In the "damaged" VM this loop suddenly halts. I placed two debug messages before and after the poll() call:

spice_info("before_poll");
num_events = poll(worker->poll_fds, MAX_EVENT_SOURCES, worker->event_timeout);
spice_info("after_poll");

On the bad VM that call suddenly stalls and never returns; the last log line shows "before_poll". On the good VM the call always returns and the log keeps growing.

Comment 10 Markus Stockhausen 2014-06-05 14:10:13 UTC
Please ignore my last comments. 

I made some further analysis and the outcome is as follows:

When doing live migrations in the setup described above, SPICE performance may deteriorate on the target host. This can lead to a stalled console display. Several problems may occur:

1) The SPICE console lags. If you drag windows across the WinXP desktop very fast, they react slowly and run "behind" the mouse pointer.

2) The SPICE console is no longer responsive. You cannot select any item on the WinXP desktop. Nevertheless the console display can still be opened from webadmin.

3) The SPICE console stays blank with the text "connecting ...".

This has been verified with the updated 0.12.5 spice-server and the original FC20 spice-server 0.12.4.

Videos of case 2 are attached.

Comment 11 Markus Stockhausen 2014-06-05 14:10:49 UTC
Created attachment 902559 [details]
normal performance before migration

Comment 12 Markus Stockhausen 2014-06-05 14:11:18 UTC
Created attachment 902561 [details]
bad performance after migration

Comment 13 Christophe Fergeau 2014-06-10 13:46:43 UTC
(In reply to Markus Stockhausen from comment #5)

> Conclusion: The SPICE connection to the bad VM does not jump into the function
> reds_accept_ssl_connection() after the
> display_channel_client_wait_for_init() call.
> 
> I'm willing to enhance the tracepoints further. Any ideas where to search
> next? I'm having some trouble understanding the watcher concept and finding
> a good point to debug the sent/received packets.

Have you tried getting a gdb backtrace when the server is hanging? ('debuginfo-install qemu-kvm', then 'gdb --pid $qemu_pid' followed by 'thread apply all bt')



(In reply to Markus Stockhausen from comment #10)
> Please ignore my last comments. 

All of them? There is no actual hang, but 'only' some very slow behaviour?

> 
> This has been verified with the updated 0.12.5 spice-server and the original
> FC20 spice-server 0.12.4
> 

Just to make sure I got you right, 0.12.4 was working ok (save for this 32 bit timer overflow), and this issue appeared in 0.12.5?

Comment 14 Markus Stockhausen 2014-06-10 14:10:47 UTC
Some further tests showed that this problem occurs only in Windows XP VMs. To be precise, we see the system performance degrading after each migration, so we can divide the error into several stages.

1) After 1-5 migrations the SPICE display gets slower. The system itself seems to work normally. See the video above.

2) Another 5-10 migrations later the system halts completely.

Exchanging SPICE 0.12.5 with 0.12.4 and vice versa does not change the situation.

So to get closer to the error I will follow your advice and analyze what is going on when the system stalls completely. Depending on the result I will open a new BZ or continue with this one.

Comment 15 Markus Stockhausen 2014-06-10 14:28:19 UTC
Created attachment 907289 [details]
gdb output

Comment 16 Markus Stockhausen 2014-06-10 14:30:26 UTC
GDB output attached. The versions in that situation are:

- qemu 1.6.2
- libspice-server 0.12.4

Comment 17 Markus Stockhausen 2014-06-10 18:30:03 UTC
BZ 1107835 has been created to analyze the major problem -> the VM stalls after some migrations.

Comment 18 Markus Stockhausen 2014-06-13 19:22:36 UTC
During long tests we isolated several bugs we encountered in the Ovirt environment. Therefore I changed this bug's topic to better reflect a reliably reproducible test case. I hope this is OK for the assignees.

Currently it boils down to a very simple and unwanted behaviour.

Server side:

Ovirt 3.4.2
Fedora 20 hypervisor nodes
qemu 1.6.2
libspice-server 0.12.4

Client side: 

Windows 7
virt-viewer 0.6.0

Steps to reproduce:

1) Start a Windows XP SPICE VM from Ovirt webadmin
2) Connect to the console of the VM
3) virt-viewer will start and open the console window
4) The admin can control the VM "seamlessly". That means the mouse is not trapped in the console window; simply by moving the mouse over the console window you can interact with the VM.
5) Log out from the console by closing virt-viewer

6) Repeat steps 2-5 exactly 11 times

7) Logon to the VM for the 12th time.

Result:

When the admin wants to interact with the console, the mouse gets trapped in the console window. The admin must press SHIFT+F12 to get out of the window.

Expected result: 

Seamless integration. No trapped mouse.

Comment 19 Marc-Andre Lureau 2014-06-13 21:21:33 UTC
When the mouse is trapped, can you verify whether the vdagent is still running? Most likely it is not. Could you get the vdagent & vdservice logs when the bug occurs? Thanks.

Comment 20 Markus Stockhausen 2014-06-14 14:35:52 UTC
Created attachment 908784 [details]
vdservice

Comment 21 Markus Stockhausen 2014-06-14 14:36:22 UTC
Created attachment 908785 [details]
vdagent

Comment 22 Markus Stockhausen 2014-06-14 14:37:36 UTC
Logs attached - please don't tell me vdagent/vdservice are stopped if a user logs into a server too often :(

Comment 23 Markus Stockhausen 2014-06-14 14:49:16 UTC
...
    if (!normal_restart && ++_agent_restarts > VD_AGENT_MAX_RESTARTS) {
        vd_printf("Agent restarted too many times");
        ret = false;
        stop();
    }
    if (ret && kill_agent() && launch_agent()) {
        if (time - _last_agent_restart_time > VD_AGENT_RESTART_COUNT_RESET_INTERVAL) {
            _agent_restarts = 0;
        }
...

That logic looks too aggressive. In this setup vdservice will stop the agent for good if a user logs into the console at intervals of less than 60 seconds. Is there no better way to handle that?
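
For illustration, here is a small standalone simulation of this counting logic (constants are assumptions for this sketch: VD_AGENT_MAX_RESTARTS = 10 and a 60-second reset interval; this is not the actual vdservice code). It shows how rapid login/logout cycles exhaust the restart budget:

/* Illustrative simulation only, not the actual vdservice code; the constants
 * are assumptions for this sketch. */
#include <stdio.h>

int main(void)
{
    const int max_restarts   = 10;  /* assumed VD_AGENT_MAX_RESTARTS */
    const int reset_interval = 60;  /* assumed reset interval, seconds */
    int agent_restarts = 0;
    int last_restart_time = 0;
    int cycle, now;

    /* Every console logout kills the XP vdagent and vdservice restarts it.
     * Assume the admin logs in and out every 30 seconds, i.e. faster than
     * the reset interval, so the counter is never cleared. */
    for (cycle = 1, now = 0; cycle <= 12; cycle++, now += 30) {
        if (++agent_restarts > max_restarts) {
            printf("cycle %2d: restarted too many times -> vdservice gives up\n", cycle);
            break;  /* from here on the console mouse stays trapped */
        }
        if (now - last_restart_time > reset_interval)
            agent_restarts = 0;  /* never reached with 30-second cycles */
        last_restart_time = now;
        printf("cycle %2d: vdagent restarted (count = %d)\n", cycle, agent_restarts);
    }
    return 0;
}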

Comment 24 Markus Stockhausen 2014-06-14 14:55:01 UTC
Windows XP is not handled as gracefully on agent restarts as newer versions:

  case WAIT_OBJECT_0 + VD_STATIC_EVENTS_COUNT:
    vd_printf("Agent killed");
    if (_system_version == SYS_VER_WIN_XP_CLASS) {
      restart_agent(false);
    } else if (_system_version == SYS_VER_WIN_7_CLASS) {
      kill_agent();
      // Assume agent was killed due to console disconnect, and wait for agent
      // normal restart due to console connect. If the agent is not alive yet,
      // it was killed manually (or crashed), so let's restart it.
      if (WaitForSingleObject(_control_event, VD_AGENT_RESTART_INTERVAL) ==
                       WAIT_OBJECT_0) {
        handle_control_event();
      }
      if (_running && !_agent_alive) {
        restart_agent(false);
      }

Any reason for that?

Comment 25 Christophe Fergeau 2014-06-16 08:50:34 UTC
The bit of code you quote was added as part of https://bugzilla.redhat.com/show_bug.cgi?id=845222
Thanks for the very detailed investigation!
I'm not sure if we have a spice-vdagent-win upstream bugzilla component though :((

Comment 26 Uri Lublin 2014-06-16 09:32:41 UTC
(In reply to Markus Stockhausen from comment #23)

The logic of this code is:
When spice-vdagent exits, spice-vdservice runs a new instance of spice-vdagent.
If something prevents the spice-vdagent from starting, for example a problem with the virtio-serial device/driver, spice-vdservice tries VD_AGENT_MAX_RESTARTS consecutive times and stops trying.
The code assumes that if there is no problem, at least one instance out of the VD_AGENT_MAX_RESTARTS vdagents will live for more than VD_AGENT_RESTART_COUNT_RESET_INTERVAL.

The problem is partially due to making the Windows spice vdagent exit when the spice client disconnects.

Comment 27 Markus Stockhausen 2014-06-26 10:08:25 UTC
(In reply to Christophe Fergeau from comment #25)
> The bit of code you quote was added as part of
> https://bugzilla.redhat.com/show_bug.cgi?id=845222
> Thanks for the very detailed investigation!
> I'm not sure if we have a spice-vdagent-win upstream bugzilla component
> though :((

Is there some better place to ask for a fix (outside RH bugzilla)?

Comment 28 Cole Robinson 2015-03-31 19:41:02 UTC
Markus, there is an upstream spice w32 agent bug tracker, but no guarantee a fix will be generated. At least the bug has a better chance of surviving there, since this one will be auto-closed when F20 reaches end of life.

https://bugs.freedesktop.org/buglist.cgi?component=win32%20agent&product=Spice

Comment 29 Fedora End Of Life 2015-05-29 12:01:26 UTC
This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 20 reached end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 30 Christophe Fergeau 2015-06-02 13:43:27 UTC
I've opened https://bugs.freedesktop.org/show_bug.cgi?id=90819, FWIW.

Comment 31 Cole Robinson 2015-06-05 20:59:18 UTC
Since the bug was moved to the upstream tracker, I'm closing this one, as F20 will auto-close it soon anyway.

