Bug 1303031

Summary: libvirtd hangs because fork() was called while another thread had the security manager locked
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: yafu <yafu>
Component: libvirt
Assignee: Virtualization Maintenance <virt-maint>
Status: CLOSED CANTFIX
QA Contact: yafu <yafu>
Severity: unspecified
Priority: unspecified
Version: 8.0
CC: dyuan, fjin, jsuchane, mprivozn, rbalakri, rjones, xuzhang, yafu
Target Milestone: rc
Keywords: Triaged
Hardware: x86_64
OS: Unspecified
Last Closed: 2020-11-05 15:18:35 UTC
Type: Bug
Bug Blocks: 1401400

Description yafu 2016-01-29 11:22:44 UTC
Description of problem:
Run test-virt-alignment-scan-guests.sh from libguestfs (https://github.com/libguestfs/libguestfs/blob/master/align/test-virt-alignment-scan-guests.sh ; the script starts many parallel libvirt instances). The libvirtd process hangs because fork() was called while another thread had the security manager locked.

Version-Release number of selected component (if applicable):
libvirt-1.2.17-13.el7_2.2.x86_64
qemu-kvm-rhev-2.3.0-31.el7_2.5.x86_64

How reproducible:
sometimes

Steps to Reproduce:
1. Run test-virt-alignment-scan-guests.sh from libguestfs in a loop:
# while true ; do ./test-virt-alignment-scan-guests.sh ; done

2. Check the output of 'virsh list' at the same time:
# watch virsh list

3. After about 20 hours, the libvirtd daemon hung and the output of 'virsh list' stopped changing.

4. Use gdb to print the backtrace of the libvirtd process:
(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fec0cafed02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007fec0cafec08 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7febec1d92b0) at pthread_mutex_lock.c:64
#3  0x00007fec0f491ba5 in virMutexLock (m=m@entry=0x7febec1d92b0) at util/virthread.c:89
#4  0x00007fec0f47922e in virObjectLock (anyobj=anyobj@entry=0x7febec1d92a0) at util/virobject.c:323
#5  0x00007fec0f60db1c in virSecurityManagerSetSocketLabel (mgr=0x7febec1d92a0, vm=vm@entry=0x7febd800bb20) at security/security_manager.c:431
#6  0x00007fec0f60ade3 in virSecurityStackSetSocketLabel (mgr=<optimized out>, vm=0x7febd800bb20) at security/security_stack.c:456
#7  0x00007fec0f60db29 in virSecurityManagerSetSocketLabel (mgr=0x7febec1d91f0, vm=0x7febd800bb20) at security/security_manager.c:432
#8  0x00007febf66e896e in qemuProcessHook (data=0x7febfee740f0) at qemu/qemu_process.c:3227
#9  0x00007fec0f43f8fa in virExec (cmd=cmd@entry=0x7febd8007a70) at util/vircommand.c:692
#10 0x00007fec0f442267 in virCommandRunAsync (cmd=cmd@entry=0x7febd8007a70, pid=pid@entry=0x0) at util/vircommand.c:2429
#11 0x00007fec0f442616 in virCommandRun (cmd=cmd@entry=0x7febd8007a70, exitstatus=exitstatus@entry=0x0) at util/vircommand.c:2261
#12 0x00007febf66f053d in qemuProcessStart (conn=conn@entry=0x7febe0008110, driver=driver@entry=0x7febec11d7b0, vm=<optimized out>, asyncJob=asyncJob@entry=0, 
    migrateFrom=migrateFrom@entry=0x0, stdin_fd=stdin_fd@entry=-1, stdin_path=stdin_path@entry=0x0, snapshot=snapshot@entry=0x0, vmop=vmop@entry=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, 
    flags=flags@entry=5) at qemu/qemu_process.c:4859
#13 0x00007febf673d7df in qemuDomainCreateXML (conn=0x7febe0008110, xml=<optimized out>, flags=<optimized out>) at qemu/qemu_driver.c:1768
#14 0x00007fec0f520a11 in virDomainCreateXML (conn=0x7febe0008110, 
    xmlDesc=0x7febd8000cb0 "<?xml version=\"1.0\"?>\n<domain type=\"kvm\" xmlns:qemu=\"http://libvirt.org/schemas/domain/qemu/1.0\">\n  <name>guestfs-fxyf5s4hmc0iipn6</name>\n  <memory unit=\"MiB\">500</memory>\n  <currentMemory unit=\"MiB\">"..., flags=2) at libvirt-domain.c:180
#15 0x00007fec1018411a in remoteDispatchDomainCreateXML (server=0x7fec11e1bb60, msg=0x7fec11e362f0, ret=0x7febd800d0b0, args=0x7febd80087f0, rerr=0x7febfee74c30, client=0x7fec11e3e320)
    at remote_dispatch.h:3754
#16 remoteDispatchDomainCreateXMLHelper (server=0x7fec11e1bb60, client=0x7fec11e3e320, msg=0x7fec11e362f0, rerr=0x7febfee74c30, args=0x7febd80087f0, ret=0x7febd800d0b0)
    at remote_dispatch.h:3732
#17 0x00007fec0f59c3c2 in virNetServerProgramDispatchCall (msg=0x7fec11e362f0, client=0x7fec11e3e320, server=0x7fec11e1bb60, prog=0x7fec11e30000) at rpc/virnetserverprogram.c:437
#18 virNetServerProgramDispatch (prog=0x7fec11e30000, server=server@entry=0x7fec11e1bb60, client=0x7fec11e3e320, msg=0x7fec11e362f0) at rpc/virnetserverprogram.c:307
#19 0x00007fec0f59763d in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x7fec11e1bb60) at rpc/virnetserver.c:135
#20 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x7fec11e1bb60) at rpc/virnetserver.c:156
#21 0x00007fec0f4924f5 in virThreadPoolWorker (opaque=opaque@entry=0x7fec11e10de0) at util/virthreadpool.c:145
#22 0x00007fec0f491a18 in virThreadHelper (data=<optimized out>) at util/virthread.c:206
#23 0x00007fec0cafcdc5 in start_thread (arg=0x7febfee75700) at pthread_create.c:308
#24 0x00007fec0c82a1cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Actual results:


Expected results:
libvirtd should not hang when many threads try to lock the security manager at the same time.

Additional info:
Because the libvirtd log_level was 3 when the bug happened, I cannot provide the libvirtd debug log now. I will try to reproduce the bug and attach the debug log if I succeed.

Comment 1 Richard W.M. Jones 2017-09-26 10:17:39 UTC
We haven't seen any hangs recently, and this was tested against
a very old version of libvirt.

Is it possible to retest this using RHEL 7.4 / 7.5 libvirt to see
if it still happens (even rarely after 20+ hours)?

Otherwise I suggest closing this and if we find problems with locking
we can reopen or open a new bug.

Comment 2 yafu 2017-10-11 06:12:21 UTC
(In reply to Richard W.M. Jones from comment #1)

I retested the bug with libvirt-3.8.0-1.el7.x86_64. libvirtd hangs after about 5 minutes. The libvirtd backtrace is the same as in comment 0.

Comment 3 Peter Krempa 2020-02-11 15:52:16 UTC
I've looked at the code and the issue has not been fixed yet:

qemuProcessHook calls qemuSecurityPostFork, which is supposed to force-unlock the security manager so that it can be used in the forked process. The problem is that we have the "stack" security driver, which internally manages other security drivers, so the only one that gets unlocked is the top one and none of the nested ones. Thus, if one of the nested drivers ("dac" or "selinux") is still locked, the above "workaround" for post-fork locking will not fix it.
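
The hazard described here can be reproduced in isolation. The following is a minimal sketch, not libvirt code: every `sketch_*` name is hypothetical, and a single global mutex stands in for the lock of one nested driver. A helper thread holds the lock while the main thread forks, and the child then probes its copy of the lock with `pthread_mutex_trylock()` so that the demonstration reports the problem instead of hanging.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the lock of a nested driver such as "dac". */
static pthread_mutex_t sketch_nested_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t sketch_locked;   /* posted once the helper thread holds the lock */
static sem_t sketch_forked;   /* posted once the main thread has forked */

/* Helper thread: hold the lock across the fork() done by the main thread. */
static void *sketch_holder(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&sketch_nested_lock);
    sem_post(&sketch_locked);
    sem_wait(&sketch_forked);
    pthread_mutex_unlock(&sketch_nested_lock);
    return NULL;
}

/* Returns 0 if the child found the lock busy, 1 if it was free, -1 on error. */
static int sketch_fork_while_locked(void)
{
    pthread_t tid;
    pid_t pid;
    int status = 0;

    if (sem_init(&sketch_locked, 0, 0) < 0 || sem_init(&sketch_forked, 0, 0) < 0)
        return -1;
    if (pthread_create(&tid, NULL, sketch_holder, NULL) != 0)
        return -1;
    sem_wait(&sketch_locked);            /* the helper now owns the lock */

    pid = fork();
    if (pid == 0) {
        /* Only the forking thread exists in the child, so nobody is left
         * to release the copied lock. A plain lock call would block
         * forever; trylock reports EBUSY instead. */
        _exit(pthread_mutex_trylock(&sketch_nested_lock) == EBUSY ? 0 : 1);
    }

    sem_post(&sketch_forked);            /* let the helper unlock in the parent */
    pthread_join(tid, NULL);
    if (pid < 0 || waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `sketch_fork_while_locked()` from `main()` should return 0 on a typical Linux system: the parent's lock is released normally, while the child's copy stays locked. That is the state the child in the backtrace above appears to be in when it blocks in virSecurityManagerSetSocketLabel.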

This means that qemuSecurityPostFork must be fixed such that it also iterates through all the nested security drivers and unlocks the manager object for every single nested driver.
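
One way to picture the traversal proposed above is the sketch below. It is an illustration under stated assumptions, not libvirt's implementation: `struct sketch_mgr` and all `sketch_*` functions are hypothetical, and the sketch takes every lock in the forking thread before fork() so that the unlock after fork() is performed by the thread that owns the locks.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a security manager; not libvirt's real type. */
struct sketch_mgr {
    pthread_mutex_t lock;
    struct sketch_mgr **nested;   /* NULL for leaf drivers ("dac", "selinux") */
    size_t n_nested;
};

/* Lock the manager and every nested manager, top down, before fork(). */
static void sketch_pre_fork(struct sketch_mgr *mgr)
{
    size_t i;

    pthread_mutex_lock(&mgr->lock);
    for (i = 0; i < mgr->n_nested; i++)
        sketch_pre_fork(mgr->nested[i]);
}

/* Unlock in reverse order; run in both parent and child after fork(). */
static void sketch_post_fork(struct sketch_mgr *mgr)
{
    size_t i;

    for (i = mgr->n_nested; i > 0; i--)
        sketch_post_fork(mgr->nested[i - 1]);
    pthread_mutex_unlock(&mgr->lock);
}

/* Fork with all locks held, release them on both sides, and let the child
 * check that the top-level and first-level nested locks are usable.
 * Returns 0 if they are, 1 if one is stuck, -1 on error. */
static int sketch_fork_and_probe(struct sketch_mgr *top)
{
    pid_t pid;
    int status = 0;

    sketch_pre_fork(top);
    pid = fork();
    sketch_post_fork(top);        /* runs in parent and child (and on failure) */

    if (pid < 0)
        return -1;
    if (pid == 0) {
        size_t i;
        int rc = pthread_mutex_trylock(&top->lock);

        for (i = 0; rc == 0 && i < top->n_nested; i++)
            rc = pthread_mutex_trylock(&top->nested[i]->lock);
        _exit(rc == 0 ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Holding the locks across fork() means no other thread can own them at the moment the address space is copied, which is why the child can safely release them. Whether this ordering is acceptable inside libvirtd is a separate question that this sketch does not answer.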

Comment 4 Peter Krempa 2020-02-11 16:07:05 UTC
Okay, so the code actually locks and then unlocks the nested drivers as well prior to fork, but it does not defer the fork until both are locked. That would mean that the above scenario can happen only if one of the nested drivers was in use. I haven't found a code path for that yet.

Comment 7 Michal Privoznik 2020-11-05 15:18:35 UTC
I don't think this is a solvable problem. One might suggest using pthread_atfork() to unlock all mutexes in the child. But the problem with that approach is that the majority of our mutexes are not global (as in a global variable) but are contained in the structure they guard. Therefore it might not be possible to unlock all mutexes in the child handler of pthread_atfork(). And even if it were, what about all the libraries libvirt is linked with? Not to mention that atfork handlers are not called when creating a child process via clone(). Sorry.
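
For reference, the pthread_atfork() approach mentioned above looks roughly like this for a single global mutex. This is a minimal sketch with hypothetical `sketch_*` names, not code from libvirt. It works only because the handlers can name the mutex directly, which is the limitation described above for mutexes embedded in dynamically allocated structures.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A global mutex: the only kind an atfork handler can reach by name. */
static pthread_mutex_t sketch_global_lock = PTHREAD_MUTEX_INITIALIZER;

/* prepare handler: runs in the forking thread before fork(). */
static void sketch_prepare(void)
{
    pthread_mutex_lock(&sketch_global_lock);
}

/* parent and child handler: runs after fork() on each side. */
static void sketch_release(void)
{
    pthread_mutex_unlock(&sketch_global_lock);
}

/* Register the handlers once; returns 0 on success. */
static int sketch_install_atfork(void)
{
    return pthread_atfork(sketch_prepare, sketch_release, sketch_release);
}

/* Returns 1 if a forked child can take the lock, 0 if not, -1 on error. */
static int sketch_child_can_lock(void)
{
    pid_t pid = fork();
    int status = 0;

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(pthread_mutex_trylock(&sketch_global_lock) == 0 ? 0 : 1);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

After `sketch_install_atfork()` succeeds, `sketch_child_can_lock()` is expected to return 1. A mutex that lives inside a per-object structure has no such fixed address for the handlers to use, and mutexes owned by linked libraries are out of reach entirely.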