Bug 1274575

Summary: vm qemu process crashed with spice-server assertion failure
Product: Red Hat Enterprise Linux 6
Reporter: shangxu <sllone>
Component: spice-server
Assignee: Victor Toso <victortoso>
Status: CLOSED ERRATA
QA Contact: SPICE QE bug list <spice-qe-bugs>
Severity: low
Docs Contact:
Priority: low
Version: 6.5
CC: cfergeau, dblechte, djasa, fziglio, mkenneth, qguo, qingyu.yang, rbalakri, rduda, rh-spice-bugs, sllone, tpelka, victortoso
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: spice-server-0.12.4-15.el6
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-21 09:20:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1269194
Attachments (Description: Flags):
qemu log: none
backtrace with steps: none
proposal patch to avoid spice-server crash: none

Description shangxu 2015-10-23 03:31:27 UTC
Description of problem:
In a Windows 7 virtual machine, play a video with potplayer while changing the transparency of the player and frequently switching to and dragging the potplayer window. After a period of time, the Windows 7 VM is shut off, and rhevm logs "Windows 7 VM is the Exit message: Lost connection with qemu process."

Version-Release number of selected component (if applicable):
spice-server-0.12.4-12.el6.src.rpm
spice-gtk-0.26-4.el6.src.rpm
qemu-kvm-rhev-0.12.1.2-2
QXL 6.1.0.10018

How reproducible:
frequently

Steps to Reproduce:
1. Connect remotely to the win7 VM.
2. Use potplayer to play a video with sound.
3. Frequently switch to and drag the potplayer window; after a period of time, the Windows 7 VM is shut off.

Actual results:
The Windows 7 VM is shut off, and rhevm logs "Windows 7 VM is the Exit message: Lost connection with qemu process."

Expected results:
Normal 

Additional info:

Comment 1 shangxu 2015-10-23 03:39:14 UTC
Created attachment 1085694 [details]
qemu log

Comment 3 Victor Toso 2015-10-23 06:38:53 UTC
Hi, not the first time seeing this issue:
https://bugzilla.redhat.com/show_bug.cgi?id=1172036

How much time is needed for the VM to shut off?
Can you reproduce this using VNC?
As you mentioned 'sound', does this not occur when sound is off?

Comment 4 shangxu 2015-10-23 07:18:43 UTC
I have the same problem as 1172036.
I can reproduce every time.
This time I disabled the sound card; the problem still occurs.
Which logs do you need?

Comment 5 Victor Toso 2015-10-23 13:23:12 UTC
(In reply to shangxu from comment #4)
> I have the same problem as 1172036.
> I can reproduce every time.
> This time I disable the sound card, problems still occur.
> which log do you need?

I'm interested in how to reproduce this problem. rhbz#1172036 did not mention anything besides playing video with potplayer. I did that in different VMs for several days, without issue.

Do you mean that it is important to change potplayer transparency and also keep moving it around the desktop while playing the video in order for this bug to happen?

How long does it usually take to crash? Minutes? Hours? Days?

Comment 6 Victor Toso 2015-10-23 13:29:59 UTC
*** Bug 1172036 has been marked as a duplicate of this bug. ***

Comment 7 shangxu 2015-10-24 03:30:34 UTC
The way to reproduce it is as I described: repeatedly change the potplayer transparency, then drag the window, then open a few other applications. The most important step is repeatedly changing the potplayer transparency. Sometimes it takes two minutes, sometimes 10 minutes; it seems to be related to the performance of the client machine. When I connect with virt-viewer from a Red Hat 6.5 VMware virtual machine, the crash appears quickly. When I connect from my win7 PC, it takes longer to appear.
I also tested Red Hat 7.1 (spice-server-12.4-9), and the crash does not appear there. I noticed that when testing on Red Hat 7, /var/log/libvirt/qemu/win7.log does not contain "Application transferred too many scanlines". Perhaps the bug appears only when that message occurs.
I searched for "...scanlines" and found that it is JPEG related. The libjpeg version in my environment is libjpeg-turbo-1.2.1-3.el6_5.x86_64; on Red Hat 7 it is libjpeg-turbo-1.2.90-5.el7.x86_64.

Comment 8 shangxu 2015-10-27 01:32:57 UTC
(In reply to Victor Toso from comment #5)
> (In reply to shangxu from comment #4)
> > I have the same problem as 1172036.
> > I can reproduce every time.
> > This time I disable the sound card, problems still occur.
> > which log do you need?
> 
> I'm interested in how to reproduce this problem. rhbz#1172036 did not
> mention anything besides playing video with potplayer. I did that in
> different VM's for several days, without issue.
> 
> Do you mean that it is important to change potplayer transparency and also
> keep moving it around in the desktop while playing the video in order to
> this bug happen?
> 
> How long it usually takes to the crash? Minutes? Hours? Days?

The following method also reproduces this problem:
1. Connect to the virtual machine through the Windows virt-viewer.
2. Play a video with potplayer, then select another video in the playlist; keep switching between videos.
3. Occasionally change the transparency of a video.

I changed the versions of spice-server and libjpeg; the problem still exists.

Comment 10 David Jaša 2015-11-19 16:13:57 UTC
Created attachment 1096766 [details]
backtrace with steps

The catch here is that spice server makes qemu exit so you have to connect gdb early and set a breakpoint. For spice-server-0.12.4-12.el6_7.1.x86_64, the function in question is this one:

static void red_channel_remove_client(RedChannelClient *rcc)
{
    if (!pthread_equal(pthread_self(), rcc->channel->thread_id)) {
        spice_warning("channel type %d id %d - "
                      "channel->thread_id (0x%lx) != pthread_self (0x%lx)."
                      "If one of the threads is != io-thread && != vcpu-thread, "
                      "this might be a BUG",
                      rcc->channel->type, rcc->channel->id,
                      rcc->channel->thread_id, pthread_self());
    }
    ring_remove(&rcc->channel_link);
    spice_assert(rcc->channel->clients_num > 0);
    rcc->channel->clients_num--;
    // TODO: should we set rcc->channel to NULL???
}

I managed to hit the condition several times; the time needed to reproduce it varies. I couldn't find any reliable trigger, but transparency adjustment does seem to make the bug happen.

Attached is gdb output with full backtrace of all threads followed by "step 10000" command - note that this assertion is not present in normal qemu log:
> ((null):19928): Spice-ERROR **: snd_worker.c:1088:spice_server_playback_get_buffer: assertion `playback_channel->base.active' failed

Comment 11 David Jaša 2015-11-19 16:56:56 UTC
Addendum:
To catch the log, I run:
gdb --pid $(pgrep -f $VM_NAME) -x gdb_potplayer_commands

with gdb_potplayer_commands file contents:

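# Send gdb output to the qemu log (${VM_NAME} stands for the VM name)
set logging file /var/log/libvirt/qemu/${VM_NAME}.log
# Breakpoint in spice-server's red_channel.c
b red_channel.c:1800
# On hit: start logging, dump a full backtrace of all threads
# ("t a a bt full" is short for "thread apply all bt full"), then step
commands
set logging on
t a a bt full
step 10000
end
continue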

(aka teeing gdb output to qemu log)

Is the backtrace helpful, or do we need some more heavyweight tools for getting information?

Comment 13 Victor Toso 2016-02-01 14:32:20 UTC
So, I think I've found a step-by-step to reproduce this:

1-) connect to the w7 machine (not fullscreen)
2-) start potplayer and set transparency (with the slider in the top-right)
3-) start the video
4-) increase the size of remote-viewer (the widget itself) and wait for the guest to auto-resize
5-) increase the size of potplayer in the guest
6-) decrease potplayer transparency (with the slider in the top-right)
7-) crash seems to happen after step 6 due to too little data in the jpeg.

I have a patch that avoids the crash but might leave glitches in the stream, so I am waiting for feedback on it.

PS: It seems that this does not happen with the upstream qxl driver. I was only able to reproduce it with rhevm-tools 3.5.9 so far.

Comment 14 Victor Toso 2016-02-04 13:36:15 UTC
For now, making spice-server not crash seems the best option. The stream code in spice-server needs improvements, and for this reason it is disabled in RHEL7 [0], which is probably why this crash is not reproducible there.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1294564#c3

With this fix applied, I noticed that at the point where the crash would have happened the stream gets a bit slower, and if you move the player in the guest some glitches can be noticed.

Some context regarding the error messages:

1-) From libjpeg_turbo, "Application transferred too many scanlines" happens when the stream is bigger than what we set for the libjpeg encoder, so the encoder ignores part of the stream;

2-) From libjpeg_turbo, "Application transferred too few scanlines" happens when the stream is smaller than what we set in the libjpeg encoder, so the encoder does not have enough data, which causes the crash;

Case (2) is now being handled, so the error will be avoided.
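
The two cases above can be illustrated with a toy model of the row accounting. This is a sketch, not libjpeg or spice-server code: it assumes the encoder is configured with a fixed height and is then fed one row at a time, and `ToyEncoder`, `toy_write_row`, `toy_finish` and `toy_encode_frame` are hypothetical names used only for illustration. As far as I understand libjpeg, extra rows only produce a warning and are dropped, while finishing with rows still missing is reported as a fatal error, which would match (1) and (2).

```c
#include <assert.h>

/* Toy model only: all names here are hypothetical, not libjpeg API. */
typedef enum { ROWS_OK, ROWS_TOO_MANY, ROWS_TOO_FEW } RowCheck;

typedef struct {
    unsigned image_height;   /* rows the encoder was configured for */
    unsigned next_scanline;  /* rows accepted so far */
    int warned_too_many;     /* set when a row arrives after the last one */
} ToyEncoder;

/* Feed one row. Returns 1 if the row was accepted, 0 if it was dropped. */
static int toy_write_row(ToyEncoder *enc)
{
    assert(enc->next_scanline <= enc->image_height);
    if (enc->next_scanline >= enc->image_height) {
        /* Case (1): "too many scanlines", a warning; the row is ignored. */
        enc->warned_too_many = 1;
        return 0;
    }
    enc->next_scanline++;
    return 1;
}

/* Finish the frame and report which of the cases applies. */
static RowCheck toy_finish(const ToyEncoder *enc)
{
    if (enc->next_scanline < enc->image_height) {
        /* Case (2): "too few scanlines", the fatal one. */
        return ROWS_TOO_FEW;
    }
    return enc->warned_too_many ? ROWS_TOO_MANY : ROWS_OK;
}

/* Encode one frame: configure for configured_height, feed supplied_rows. */
static RowCheck toy_encode_frame(unsigned configured_height,
                                 unsigned supplied_rows)
{
    ToyEncoder enc = { configured_height, 0, 0 };
    for (unsigned i = 0; i < supplied_rows; i++) {
        toy_write_row(&enc);
    }
    return toy_finish(&enc);
}
```

In this model, `toy_encode_frame(480, 300)` returns `ROWS_TOO_FEW`, the situation that corresponds to the crash, while `toy_encode_frame(480, 600)` returns `ROWS_TOO_MANY`, where the surplus rows are simply ignored.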

Comment 16 shangxu 2016-02-18 07:11:58 UTC
(In reply to Victor Toso from comment #14)
> So far, making the spice-server not crash seems the best for now. The stream
> code in spice-server needs improvements and for this reason it is disable in
> RHEL7 [0] which is the probable reason for this crash not being reproducible
> there.
> 
> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1294564#c3
> 
> With this fix applied I noticed that when the crash should happen the stream
> gets a bit slower and if you move the player in the guest some glitches
> could be noticed.
> 
> Some context regarding the error messages:
> 
> 1-) From libjpeg_turbo "Application transferred too many scanlines" happens
> when the stream is bigger then what we set for libjpeg encoder so the
> encoder is ignoring part of the stream;
> 
> 2-) From libjpeg_turbo "Application transferred too few scanlines" happens
> when the stream is smaller then what we set in libjpeg encoder so the
> encoder does not have enough data and causes the crash;
> 
> The (2) is being handled so error will be avoided.
Does el7 avoid this problem by turning off mjpeg?
How should el6.5 users solve this problem?
With a patch?

Comment 17 Victor Toso 2016-02-19 10:56:33 UTC
(In reply to shangxu from comment #16)
> el7 is by closing mjpeg avoids this problem?

On el7 the stream detection is disabled (in order to enable it, you must change the qemu command line).

> If it is el6.5 users should be how to solve this problem?
> patch?

The patches are still under review and being tested, so I would recommend that customers wait for the release. In any case, I'll attach the proposed patch here.

Comment 18 Victor Toso 2016-02-19 11:00:07 UTC
Created attachment 1128510 [details]
proposal patch to avoid spice-server crash

Comment 20 Victor Toso 2016-02-27 15:27:54 UTC
(In reply to Frediano Ziglio from comment #19)
> See
> https://lists.freedesktop.org/archives/spice-devel/2016-February/026852.html

Indeed, it seems that this could be a better way to avoid the crash.

I tested it, and performance seems better: the stream is not as slow as with the previous patch. I guess that glitches could still happen but, as I said in comment #14, I would prefer to avoid the spice-server crash now and fix the sized stream upstream.

Comment 22 Mike McCune 2016-03-28 23:43:17 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 23 Christophe Fergeau 2016-04-06 15:07:47 UTC
Moving to 6.9.

Comment 24 Victor Toso 2016-08-08 14:03:56 UTC
Patch from comment #20 is upstream [0]. Moving to ASSIGNED to double check it early in the next development phase.

[0] https://cgit.freedesktop.org/spice/spice/commit/?id=1b69198c4ec73110251e0ebf969275e98950808e

Comment 25 Victor Toso 2016-09-02 13:22:52 UTC
Backported the following patches and tested them by following the steps in comment #13; no more crashes.

28f2e425c4e9d86570970d49a7a3eee43e24134e
Francois Gouget (1):
      streaming: Rework red_marshall_stream_data a bit

42a5794845d0ee4b34ac523b8ad5a6c453d2203c
Francois Gouget (1):
      streaming: Remove the Drawable.sized_stream field

032cb0ce85b44da3ee5d0308909164452e25bff5
Francois Gouget (1):
      mjpeg: Use src_area as the authoritative source for the frame dimensions
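
Going only by the title of the last patch, the idea can be sketched as follows: take the frame dimensions from the source area of the frame being encoded, so that the height the encoder is configured with and the number of rows it is fed come from the same place. This is an assumption-laden illustration, not the backported code; `ToyRect`, `ToyFrameSize` and `toy_frame_size_from_area` are hypothetical names, and the left/top/right/bottom layout of the rectangle is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rectangle type with an assumed left/top/right/bottom layout. */
typedef struct {
    int32_t left, top, right, bottom;
} ToyRect;

/* Hypothetical result type: the one size used for both the encoder
 * setup and the loop that feeds rows to it. */
typedef struct {
    uint32_t width;
    uint32_t height;
} ToyFrameSize;

/* Derive the frame size from the source area alone.
 * Returns 0 on success, -1 if the area is empty or inverted. */
static int toy_frame_size_from_area(const ToyRect *src, ToyFrameSize *out)
{
    assert(src != NULL && out != NULL);
    if (src->right <= src->left || src->bottom <= src->top) {
        return -1;
    }
    /* Subtract in 64 bits so extreme coordinates cannot overflow. */
    out->width = (uint32_t)((int64_t)src->right - src->left);
    out->height = (uint32_t)((int64_t)src->bottom - src->top);
    return 0;
}
```

With a single size derived this way, a mismatch between the configured height and the rows supplied, as described in comment #14, should not arise from two different sources disagreeing about the frame dimensions.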

Comment 31 errata-xmlrpc 2017-03-21 09:20:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0588.html