Bug 1594456

Summary: spice GL causes zombie qemu process
Product: [Community] Virtualization Tools
Reporter: Toolybird <toolybird>
Component: libvirt
Assignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: unspecified
CC: eskultet, fjin, libvirt-maint, marcandre.lureau, yafu
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-07-03 08:42:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                   Flags
xml dump                      none
log file                      none
debug log file with 1:qemu    none

Description Toolybird 2018-06-23 05:56:59 UTC
Created attachment 1453897 [details]
xml dump

Description of problem:
  
This is on Arch Linux so the bug could well be in Arch configuration. Would appreciate any pointers on how to debug.

. virgl works with plain qemu (-vga virtio -display gtk,gl=on)
. virgl works with spice / remote-viewer (-device virtio-vga,virgl=on \
    -spice gl=on,unix,addr=/run/user/1000/spice.sock,disable-ticketing)

However, when using virt-manager / virsh, the VM does not start and I am instead left with an apparently crashed qemu process:

# ps aux | grep qemu
nobody    7675  0.0  0.0      0     0 ?        Zl   15:35   0:00 [qemu-system-x86] <defunct>

(Arch build uses --with-qemu-user=nobody --with-qemu-group=kvm)

Latest released versions of everything:

libvirt-4.4.0
qemu-2.12.0
virglrenderer-0.6.0
mesa-18.1.2
linux-4.17.2

xml dump and log file attached.
Thanks

Comment 1 Toolybird 2018-06-23 05:58:21 UTC
Created attachment 1453898 [details]
log file

Comment 2 Toolybird 2018-06-23 06:15:38 UTC
PS. The "permission denied" message at the end of the log file is a red herring. I had previously added the env var MESA_GLSL_CACHE_DISABLE=true to the XML, which got rid of the error but didn't solve the problem.

Comment 3 Toolybird 2018-06-24 00:55:51 UTC
Created attachment 1454087 [details]
debug log file with 1:qemu

I added some logging to libvirtd (log_filters="1:qemu") and it seems the process chokes when libvirtd attaches to the QEMU monitor and executes "qmp_capabilities".

New debug log file attached.

Comment 4 Toolybird 2018-07-01 01:29:48 UTC
Also happens with qemu:///session.

To be clear, this is the trigger:

--- Original XML
+++ New XML
@@ -84,7 +84,7 @@
     <graphics type="spice">
       <listen type="none"/>
       <image compression="off"/>
-      <gl enable="no"/>
+      <gl enable="yes"/>
     </graphics>
     <video>
       <model type="virtio" heads="1" primary="yes">

Tried the following to no avail:

 - an earlier version of qemu
 - <listen type='socket'/> instead of <listen type='none'/>

Outside of libvirt, I also manually set up spice GL with a QMP monitor port, connected to it via telnet, and executed "qmp_capabilities", which worked fine.

The closest related thing I can find is here:

https://forums.gentoo.org/viewtopic-t-954294-start-0.html

which seems to indicate that libvirt's use of "-daemonize" can lead to qemu zombies.

My hardware is Ryzen 7 2700X + Radeon RX 550 + NVMe storage, i.e. quick. I'm thinking I might be hitting a timing-related bug in libvirt.

Would be grateful for any input on where to go from here.

(re-titling bug report to be more accurate)

Comment 5 Toolybird 2018-07-09 23:15:55 UTC
Found the culprit!

It's the -sandbox arg to qemu, in particular the "resourcecontrol" param.

Changing as follows allows qemu binary to run:

-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=allow

Seccomp-related, apparently. Any thoughts?

Comment 6 Erik Skultety 2018-07-10 06:11:52 UTC
Thanks for such a thorough examination. This is an interesting issue that I'd like to investigate further, but I probably won't get to it until next week (if you happen to fix it by then, perfect; patches are always welcome). This might affect downstream too, so eventually we might want to move this BZ to a related product to get it more attention and priority.

Comment 7 Marc-Andre Lureau 2018-07-10 13:51:59 UTC
Commenting out this line in qemu-seccomp.c works around it:

    /* { SCMP_SYS(sched_setscheduler),     QEMU_SECCOMP_SET_RESOURCECTL }, */

I suppose we should allow changing the scheduling to a lower priority.

A second issue is finding out why libvirt doesn't receive a HUP on the monitor socket when qemu dies with the seccomp rule...

Comment 8 Marc-Andre Lureau 2018-07-10 14:27:18 UTC
simple reproducer:
qemu-system-x86_64 -sandbox on,resourcecontrol=deny  -spice gl=on

Comment 9 Marc-Andre Lureau 2018-07-11 13:04:16 UTC
Here is a simple reproducer for the hang (SIGSYS not received):

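/* Build (assumed command line): gcc repro.c -o repro -lseccomp -lpthread */
#include <unistd.h>
#include <seccomp.h>
#include <pthread.h>
#include <sched.h>

/* Second thread that simply stays alive forever. */
static void *thread_fn(void *args)
{
    while (1)
        sleep(1);
}

int main(void)
{
    /* Start the extra thread before the filter is installed. */
    {
        pthread_attr_t attr;
        pthread_t tid;
        pthread_attr_init(&attr);
        pthread_create(&tid, &attr, thread_fn, 0);
    }

    /* Allow everything except sched_getscheduler, which gets SCMP_ACT_KILL. */
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
        seccomp_rule_add(ctx, SCMP_ACT_KILL, SCMP_SYS(sched_getscheduler), 0);
        seccomp_load(ctx);
    }

    /* Only the calling (main) thread is killed here; the other thread
     * keeps running, so the process never goes away. */
    sched_getscheduler(0);
    return 0;
}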
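
For context on the `Zl` state in the original `ps` output: a thread-level seccomp kill removes only the offending thread, and when that thread is the main one, the thread-group leader shows up as defunct while the remaining threads keep the process alive. The following is a minimal sketch of that effect only, assuming Linux with /proc mounted. It does not use seccomp at all: `pthread_exit()` in the main thread stands in for the seccomp kill, and the names `lingering_thread`, `leader_state` and `leader_becomes_zombie` are made up for this illustration.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Secondary thread: outlives the main thread for a bounded time. */
static void *lingering_thread(void *arg)
{
    (void)arg;
    sleep(10);
    _exit(0);
}

/* Return the state letter of the thread-group leader from /proc/<pid>/stat,
 * or '?' if it cannot be read. The state follows the last ')' in the line. */
static char leader_state(pid_t pid)
{
    char path[64], buf[512];
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return '?';
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    char *p = strrchr(buf, ')');
    return (p && p[1] == ' ' && p[2]) ? p[2] : '?';
}

/* Fork a child whose main thread exits while another thread keeps running.
 * Returns 1 if the child's leader was seen in state 'Z' while the process
 * as a whole was still alive (not yet reapable), 0 if not, -1 on fork error. */
int leader_becomes_zombie(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        pthread_t tid;
        if (pthread_create(&tid, NULL, lingering_thread, NULL) != 0)
            _exit(1);
        pthread_exit(NULL); /* main thread goes away, process stays */
    }

    int seen = 0, reaped = 0;
    for (int i = 0; i < 200; i++) {          /* poll for up to ~2 seconds */
        if (leader_state(pid) == 'Z') {
            /* A fully exited child would be reaped here; a child with a
             * live secondary thread is not reapable yet. */
            reaped = (waitpid(pid, NULL, WNOHANG) == pid);
            seen = !reaped;
            break;
        }
        usleep(10000);
    }
    if (!reaped) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
    }
    return seen;
}
```

Calling `leader_becomes_zombie()` from a small `main()` should return 1 on a typical Linux system. While the child is in that state, `ps` would be expected to list it much like the defunct qemu process in the description.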

Comment 10 Marc-Andre Lureau 2018-07-12 16:33:11 UTC
we need SECCOMP_RET_KILL_PROCESS, see https://github.com/seccomp/libseccomp/issues/96
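
To illustrate what SECCOMP_RET_KILL_PROCESS changes, here is a hedged sketch that installs a raw BPF seccomp filter through prctl() instead of libseccomp, so it is not the QEMU or libseccomp fix itself. It assumes Linux 4.14 or later. It omits the architecture check that a production filter would need, and the helper names (`bounded_sleeper`, `child_body`, `whole_process_killed_by_sigsys`) are invented for the example. The secondary thread sleeps for a bounded time so that the sketch cannot hang the way the reproducer does.

```c
#define _GNU_SOURCE
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SECCOMP_RET_KILL_PROCESS
#define SECCOMP_RET_KILL_PROCESS 0x80000000U /* value from linux/seccomp.h */
#endif

/* Filter: kill the whole process on sched_getscheduler, allow the rest.
 * NOTE: no seccomp_data.arch check; acceptable only for a demo. */
static struct sock_filter insns[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_sched_getscheduler, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
static struct sock_fprog prog = {
    .len = sizeof(insns) / sizeof(insns[0]),
    .filter = insns,
};

/* Secondary thread; exits with code 3 if it survives the filter action. */
static void *bounded_sleeper(void *arg)
{
    (void)arg;
    sleep(5);
    _exit(3);
}

static void child_body(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, bounded_sleeper, NULL) != 0)
        _exit(2);
    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        _exit(2);
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        _exit(2);
    syscall(__NR_sched_getscheduler, 0); /* should trigger the filter */
    _exit(4);                            /* reached only if not blocked */
}

/* Returns 1 if the child (all threads) died from SIGSYS, 0 if it ended
 * some other way, -1 if fork or the seccomp setup was not possible. */
int whole_process_killed_by_sigsys(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        child_body();

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSYS;
}
```

With a process-wide kill action the parent's waitpid() returns promptly and reports SIGSYS, which is the kind of notification a supervising process such as libvirtd needs. With a thread-only kill the surviving thread would keep the child alive instead.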

Comment 11 Marc-Andre Lureau 2019-07-03 08:42:56 UTC
The seccomp problem was fixed upstream (commits 6f2231e9b0931e1998d9ed0c509adf7aedc02db2 and bda08a5764d470f101fa38635d30b41179a313e1) and backported to various RHEL versions. No change identified in libvirt. Closing.