Bug 1594456 - spice GL causes zombie qemu process
Summary: spice GL causes zombie qemu process
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact:
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
 
Reported: 2018-06-23 05:56 UTC by Toolybird
Modified: 2019-07-03 08:42 UTC (History)
Clone Of:
Last Closed: 2019-07-03 08:42:56 UTC


Attachments
xml dump (4.47 KB, text/plain)
2018-06-23 05:56 UTC, Toolybird
log file (3.25 KB, text/plain)
2018-06-23 05:58 UTC, Toolybird
debug log file with 1:qemu (19.11 KB, text/plain)
2018-06-24 00:55 UTC, Toolybird

Description Toolybird 2018-06-23 05:56:59 UTC
Created attachment 1453897 [details]
xml dump

Description of problem:
  
This is on Arch Linux so the bug could well be in Arch configuration. Would appreciate any pointers on how to debug.

. virgl works with plain qemu (-vga virtio -display gtk,gl=on)
. virgl works with spice / remote-viewer (-device virtio-vga,virgl=on \
    -spice gl=on,unix,addr=/run/user/1000/spice.sock,disable-ticketing)

However, when using virt-manager / virsh, the VM does not start and I am instead left with an apparently crashed qemu process:

# ps aux | grep qemu
nobody    7675  0.0  0.0      0     0 ?        Zl   15:35   0:00 [qemu-system-x86] <defunct>

(Arch build uses --with-qemu-user=nobody --with-qemu-group=kvm)

Latest released versions of everything:

libvirt-4.4.0
qemu-2.12.0
virglrenderer-0.6.0
mesa-18.1.2
linux-4.17.2

xml dump and log file attached.
Thanks

Comment 1 Toolybird 2018-06-23 05:58 UTC
Created attachment 1453898 [details]
log file

Comment 2 Toolybird 2018-06-23 06:15:38 UTC
PS. The permission denied thing at the end of log file is a red herring. I had previously added env var MESA_GLSL_CACHE_DISABLE=true to the xml which got rid of the error but didn't solve the problem.

Comment 3 Toolybird 2018-06-24 00:55 UTC
Created attachment 1454087 [details]
debug log file with 1:qemu

I added some logging to libvirtd (log_filters="1:qemu"), and it seems the process chokes when libvirtd attaches to the QEMU monitor and executes "qmp_capabilities".

New debug log file attached.

Comment 4 Toolybird 2018-07-01 01:29:48 UTC
This also happens with qemu:///session.

To be clear, this is the trigger:

--- Original XML
+++ New XML
@@ -84,7 +84,7 @@
     <graphics type="spice">
       <listen type="none"/>
       <image compression="off"/>
-      <gl enable="no"/>
+      <gl enable="yes"/>
     </graphics>
     <video>
       <model type="virtio" heads="1" primary="yes">

Tried the following to no avail:

 - an earlier version of qemu
 - <listen type='socket'/> instead of <listen type='none'/>

Outside of libvirt, I also manually set up spice GL with QMP monitor port and connected to it via telnet and executed "qmp_capabilities" which worked fine.

The closest related thing I can find is here:

https://forums.gentoo.org/viewtopic-t-954294-start-0.html

which seems to indicate that libvirt's use of "-daemonize" can lead to qemu zombies.

My hardware is Ryzen 7 2700X + Radeon RX 550 + NVMe storage, i.e. quick. I'm thinking I might be hitting a timing-related bug in libvirt.

Would be grateful for any input on where to go from here.

(re-titling bug report to be more accurate)

Comment 5 Toolybird 2018-07-09 23:15:55 UTC
Found the culprit!

It's the -sandbox arg to qemu, in particular the "resourcecontrol" param.

Changing as follows allows qemu binary to run:

-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=allow

seccomp related apparently. Any thoughts?

Comment 6 Erik Skultety 2018-07-10 06:11:52 UTC
Thanks for doing such a thorough examination, this is an interesting issue I'd like to investigate further, so I'll start looking at it, but I probably won't get to it until next week (if you happen to fix it by then, perfect, patches are always welcome). Anyhow, this might affect downstream too, eventually we might want to move this BZ to a related product in order to get more attention and priority.

Comment 7 Marc-Andre Lureau 2018-07-10 13:51:59 UTC
commenting out this line in qemu-seccomp.c works around it.

    /* { SCMP_SYS(sched_setscheduler),     QEMU_SECCOMP_SET_RESOURCECTL }, */

I suppose we should allow changing the scheduling to a lower priority.

A second issue is finding out why libvirt doesn't receive a HUP on the monitor socket when qemu dies with the seccomp rule...
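
One plausible explanation, which this report does not confirm for libvirt itself, is that a thread-level kill leaves the other threads of the process running, so every file descriptor (including the monitor socket) stays open and the peer never sees a hangup. The sketch below imitates that situation without seccomp: the child's main thread ends via pthread_exit() while a second thread lives on. The function and thread names are hypothetical, and pthread_exit() is only a stand-in for a thread being killed:

```c
#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread that never exits and so keeps the process (and its fds) alive. */
static void *keep_alive_thread(void *arg)
{
    (void)arg;
    for (;;)
        pause();
    return NULL;
}

/* Fork a two-threaded child that holds one end of a socketpair.
 * whole_process == 0: only the child's main thread ends.
 * whole_process != 0: the entire child process exits.
 * Returns 1 if the parent sees EOF/hangup within 500 ms, 0 if not,
 * and -1 on a setup error. */
static int parent_sees_hangup(int whole_process)
{
    int sv[2];
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        pthread_t tid;

        close(sv[0]);
        if (pthread_create(&tid, NULL, keep_alive_thread, NULL) != 0)
            _exit(2);
        if (whole_process)
            _exit(0);          /* all threads end, sv[1] is closed */
        pthread_exit(NULL);    /* main thread ends, sv[1] stays open */
    }

    close(sv[1]);
    struct pollfd pfd = { .fd = sv[0], .events = POLLIN };
    int n = poll(&pfd, 1, 500);
    int seen = n > 0 && (pfd.revents & (POLLHUP | POLLIN)) != 0;

    /* Clean up whatever is left of the child. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    close(sv[0]);
    return n < 0 ? -1 : seen;
}
```

On Linux, parent_sees_hangup(0) is expected to time out without any event, while parent_sees_hangup(1) should report the hangup almost immediately.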

Comment 8 Marc-Andre Lureau 2018-07-10 14:27:18 UTC
simple reproducer:
qemu-system-x86_64 -sandbox on,resourcecontrol=deny  -spice gl=on

Comment 9 Marc-Andre Lureau 2018-07-11 13:04:16 UTC
here is a simple reproducer for the hang (sigsys not received)

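/*
 * Only the main thread trips the seccomp rule. SCMP_ACT_KILL kills just
 * that thread, so the worker thread keeps the process alive and it hangs.
 * Link with -lseccomp -lpthread.
 */
#include <unistd.h>
#include <seccomp.h>
#include <pthread.h>
#include <sched.h>

/* Worker thread that never exits; it outlives the killed main thread. */
static void *thread_fn(void *args)
{
    while(1)
        sleep(1);
    return NULL;
}

int main(void)
{
    /* Start the second thread before the filter is loaded. */
    {
        pthread_attr_t attr;
        pthread_t tid;
        pthread_attr_init(&attr);
        pthread_create(&tid, &attr, thread_fn, NULL);
    }

    /* Allow everything except sched_getscheduler, which kills the caller. */
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
        seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(sched_getscheduler), 0);
        seccomp_load(ctx);
    }

    /* Trips the rule: the main thread dies here, the process does not. */
    sched_getscheduler(0);
    return 0;
}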

Comment 10 Marc-Andre Lureau 2018-07-12 16:33:11 UTC
we need SECCOMP_RET_KILL_PROCESS, see https://github.com/seccomp/libseccomp/issues/96
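
To illustrate the difference between the two kill actions, here is a minimal sketch that uses a raw seccomp-BPF filter through prctl() instead of libseccomp, so it needs only the kernel headers. It assumes Linux 4.14 or later (where SECCOMP_RET_KILL_PROCESS exists) and an environment that permits installing seccomp filters. The helper names are hypothetical, and the filter skips the architecture check that a production filter should perform:

```c
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Second thread that keeps the thread group alive, as in the reproducer. */
static void *idle_worker(void *arg)
{
    (void)arg;
    for (;;)
        pause();
    return NULL;
}

/* Apply `action` to sched_getscheduler and allow every other syscall.
 * Demo only: no architecture check is done. */
static int install_filter(unsigned int action)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_sched_getscheduler, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, action),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns the signal that terminated the child process, 0 if the process
 * was still alive after about one second (the hang), -1 on setup failure. */
static int run_filtered_child(unsigned int action)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        pthread_t tid;

        if (pthread_create(&tid, NULL, idle_worker, NULL) != 0)
            _exit(2);
        if (install_filter(action) != 0)
            _exit(2);
        syscall(SYS_sched_getscheduler, 0);  /* trips the filter */
        _exit(3);                            /* only reached if it did not */
    }

    for (int i = 0; i < 20; i++) {
        struct timespec ts = { 0, 50 * 1000 * 1000 };
        pid_t r = waitpid(pid, &status, WNOHANG);

        if (r == pid)
            return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
        if (r < 0)
            return -1;
        nanosleep(&ts, NULL);
    }

    /* Still running: only one thread was killed. Clean up by force. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return 0;
}
```

Under those assumptions, run_filtered_child(SECCOMP_RET_KILL_THREAD) should return 0 because the worker thread keeps the process alive, while run_filtered_child(SECCOMP_RET_KILL_PROCESS) should return SIGSYS because the whole thread group is terminated and the parent can reap it.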

Comment 11 Marc-Andre Lureau 2019-07-03 08:42:56 UTC
The seccomp problem was fixed upstream (commits 6f2231e9b0931e1998d9ed0c509adf7aedc02db2 and bda08a5764d470f101fa38635d30b41179a313e1) and backported to various RHEL versions. No change identified in libvirt. Closing.

