Description of problem:
Run test-virt-alignment-scan-guests.sh from libguestfs (https://github.com/libguestfs/libguestfs/blob/master/align/test-virt-alignment-scan-guests.sh ; the script starts many parallel libvirt instances). The libvirtd process hangs because fork() was called while another thread held the security manager lock.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run test-virt-alignment-scan-guests.sh from libguestfs:
# while true ; do ./test-virt-alignment-scan-guests.sh ; done
2. Check the output of 'virsh list' at the same time:
# watch virsh list
3. After about 20 hours, the libvirtd daemon hung and the output of 'virsh list' no longer changed.
4. Use gdb to print the libvirtd process backtrace:
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007fec0cafed02 in _L_lock_791 () from /lib64/libpthread.so.0
#2 0x00007fec0cafec08 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7febec1d92b0) at pthread_mutex_lock.c:64
#3 0x00007fec0f491ba5 in virMutexLock (m=m@entry=0x7febec1d92b0) at util/virthread.c:89
#4 0x00007fec0f47922e in virObjectLock (anyobj=anyobj@entry=0x7febec1d92a0) at util/virobject.c:323
#5 0x00007fec0f60db1c in virSecurityManagerSetSocketLabel (mgr=0x7febec1d92a0, vm=vm@entry=0x7febd800bb20) at security/security_manager.c:431
#6 0x00007fec0f60ade3 in virSecurityStackSetSocketLabel (mgr=<optimized out>, vm=0x7febd800bb20) at security/security_stack.c:456
#7 0x00007fec0f60db29 in virSecurityManagerSetSocketLabel (mgr=0x7febec1d91f0, vm=0x7febd800bb20) at security/security_manager.c:432
#8 0x00007febf66e896e in qemuProcessHook (data=0x7febfee740f0) at qemu/qemu_process.c:3227
#9 0x00007fec0f43f8fa in virExec (cmd=cmd@entry=0x7febd8007a70) at util/vircommand.c:692
#10 0x00007fec0f442267 in virCommandRunAsync (cmd=cmd@entry=0x7febd8007a70, pid=pid@entry=0x0) at util/vircommand.c:2429
#11 0x00007fec0f442616 in virCommandRun (cmd=cmd@entry=0x7febd8007a70, exitstatus=exitstatus@entry=0x0) at util/vircommand.c:2261
#12 0x00007febf66f053d in qemuProcessStart (conn=conn@entry=0x7febe0008110, driver=driver@entry=0x7febec11d7b0, vm=<optimized out>, asyncJob=asyncJob@entry=0,
migrateFrom=migrateFrom@entry=0x0, stdin_fd=stdin_fd@entry=-1, stdin_path=stdin_path@entry=0x0, snapshot=snapshot@entry=0x0, vmop=vmop@entry=VIR_NETDEV_VPORT_PROFILE_OP_CREATE,
flags=flags@entry=5) at qemu/qemu_process.c:4859
#13 0x00007febf673d7df in qemuDomainCreateXML (conn=0x7febe0008110, xml=<optimized out>, flags=<optimized out>) at qemu/qemu_driver.c:1768
#14 0x00007fec0f520a11 in virDomainCreateXML (conn=0x7febe0008110,
xmlDesc=0x7febd8000cb0 "<?xml version=\"1.0\"?>\n<domain type=\"kvm\" xmlns:qemu=\"http://libvirt.org/schemas/domain/qemu/1.0\">\n <name>guestfs-fxyf5s4hmc0iipn6</name>\n <memory unit=\"MiB\">500</memory>\n <currentMemory unit=\"MiB\">"..., flags=2) at libvirt-domain.c:180
#15 0x00007fec1018411a in remoteDispatchDomainCreateXML (server=0x7fec11e1bb60, msg=0x7fec11e362f0, ret=0x7febd800d0b0, args=0x7febd80087f0, rerr=0x7febfee74c30, client=0x7fec11e3e320)
#16 remoteDispatchDomainCreateXMLHelper (server=0x7fec11e1bb60, client=0x7fec11e3e320, msg=0x7fec11e362f0, rerr=0x7febfee74c30, args=0x7febd80087f0, ret=0x7febd800d0b0)
#17 0x00007fec0f59c3c2 in virNetServerProgramDispatchCall (msg=0x7fec11e362f0, client=0x7fec11e3e320, server=0x7fec11e1bb60, prog=0x7fec11e30000) at rpc/virnetserverprogram.c:437
#18 virNetServerProgramDispatch (prog=0x7fec11e30000, server=server@entry=0x7fec11e1bb60, client=0x7fec11e3e320, msg=0x7fec11e362f0) at rpc/virnetserverprogram.c:307
#19 0x00007fec0f59763d in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x7fec11e1bb60) at rpc/virnetserver.c:135
#20 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x7fec11e1bb60) at rpc/virnetserver.c:156
#21 0x00007fec0f4924f5 in virThreadPoolWorker (opaque=opaque@entry=0x7fec11e10de0) at util/virthreadpool.c:145
#22 0x00007fec0f491a18 in virThreadHelper (data=<optimized out>) at util/virthread.c:206
#23 0x00007fec0cafcdc5 in start_thread (arg=0x7febfee75700) at pthread_create.c:308
#24 0x00007fec0c82a1cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
libvirtd should not hang when many threads try to lock the security manager at the same time.
Because the libvirtd log_level was 3 when the bug happened, I cannot provide the libvirtd debug log now. I will try to reproduce the bug and provide the debug log if I succeed.
We haven't seen any hangs recently, and this was tested against
a very old version of libvirt.
Is it possible to retest this using RHEL 7.4 / 7.5 libvirt to see
if it still happens (even rarely after 20+ hours)?
Otherwise I suggest closing this and if we find problems with locking
we can reopen or open a new bug.
(In reply to Richard W.M. Jones from comment #1)
> We haven't seen any hangs recently, and this was tested against
> a very old version of libvirt.
> Is it possible to retest this using RHEL 7.4 / 7.5 libvirt to see
> if it still happens (even rarely after 20+ hours)?
> Otherwise I suggest closing this and if we find problems with locking
> we can reopen or open a new bug.
I retested the bug with libvirt-3.8.0-1.el7.x86_64. libvirtd hangs in about 5 minutes. The backtrace of libvirtd is the same as in comment 0.
I've looked at the code and the issue is not fixed yet:
qemuProcessHook calls qemuSecurityPostFork, which is supposed to force-unlock the security manager for use in the forked process. The problem is that since we have the "stack" security driver, which internally manages other security drivers, the only manager that gets unlocked is the top one, but none of the nested ones. Thus, if one of the nested drivers ("dac" or "selinux") is still locked, the above "workaround" for post-fork locking will not help.
This means that qemuSecurityPostFork must be fixed so that it also iterates through all the nested security drivers and unlocks the manager object for every single nested driver.
Okay, so the code actually locks and later unlocks the nested drivers around the fork as well, and it forks only after both are locked. That would mean the above scenario can happen only if one of the nested drivers was in use elsewhere. I haven't found a code path for that yet.