Bug 1974550

Summary: glibc: Deadlock in atfork handler which calls dlclose
Product: Red Hat Enterprise Linux 7
Reporter: yili <33186108>
Component: glibc
Assignee: glibc team <glibc-bugzilla>
Status: CLOSED WONTFIX
QA Contact: qe-baseos-tools-bugs
Severity: urgent
Priority: unspecified
Version: 7.6
CC: ashankar, codonell, dj, fweimer, mnewsome, pfrankli, sipoyare
Target Milestone: rc
Keywords: Bugfix
Hardware: x86_64
OS: Linux
Last Closed: 2022-05-13 17:39:05 UTC
Type: Bug

Description yili 2021-06-22 03:37:05 UTC
Description of problem:

dlclose of a library that registered an atfork handler can cause libvirtd to hang.

Version-Release number of selected component (if applicable):
glibc-2.17-317.el7.x86_64
libvirt 4.5


Steps to Reproduce:
I have not found a way to reproduce it; it just happens occasionally.


Additional info:

(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fc3ce85ae9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007fc3ce85ad68 in __GI___pthread_mutex_lock (mutex=0x7fc3840017b0) at ../nptl/pthread_mutex_lock.c:78
#3  0x00007fc3d19d95a5 in virMutexLock (m=<optimized out>) at util/virthread.c:89
#4  0x00007fc3d19b190b in virObjectLock (anyobj=anyobj@entry=0x7fc3840017a0) at util/virobject.c:429
#5  0x00007fc3d1a61241 in virStoragePoolObjListSearchCb (payload=0x7fc3840017a0, name=<optimized out>, opaque=0x7fc38bffe9b0)
    at conf/virstorageobj.c:495
#6  0x00007fc3d1983855 in virHashSearch (ctable=0x7fc3a40e90a0, iter=iter@entry=0x7fc3d1a61230 <virStoragePoolObjListSearchCb>,
    data=data@entry=0x7fc38bffe9b0, name=name@entry=0x0) at util/virhash.c:696
#7  0x00007fc3d1a624dc in virStoragePoolObjListSearch (pools=0x7fc3a40e9ca0,
    searcher=searcher@entry=0x7fc3bb49ea20 <storageVolLookupByPathCallback>, opaque=opaque@entry=0x7fc38bffea10)
    at conf/virstorageobj.c:525
#8  0x00007fc3bb49e8b5 in storageVolLookupByPath (conn=0x7fc31c000ef0,
    path=0x7fc31c000fe0 "hygonstoragepool/1014ce0d-28d9-4312-9ad2-1dd288f1f12c_10") at storage/storage_driver.c:1676
#9  0x00007fc3d1b98dd7 in virStorageVolLookupByPath (conn=0x7fc31c000ef0,
    path=0x7fc31c000fe0 "hygonstoragepool/1014ce0d-28d9-4312-9ad2-1dd288f1f12c_10") at libvirt-storage.c:1414
#10 0x000056061edce21b in remoteDispatchStorageVolLookupByPath (server=0x56061f95c030, msg=0x56061f9ccf20, ret=0x7fc31c000980,
    args=0x7fc31c000940, rerr=0x7fc38bffec10, client=0x56061f9d6870) at remote/remote_daemon_dispatch_stubs.h:18277
#11 remoteDispatchStorageVolLookupByPathHelper (server=0x56061f95c030, client=0x56061f9d6870, msg=0x56061f9ccf20,
    rerr=0x7fc38bffec10, args=0x7fc31c000940, ret=0x7fc31c000980) at remote/remote_daemon_dispatch_stubs.h:18255
#12 0x00007fc3d1aa4315 in virNetServerProgramDispatchCall (msg=0x56061f9ccf20, client=0x56061f9d6870, server=0x56061f95c030,
    prog=0x56061f9c2dc0) at rpc/virnetserverprogram.c:437
#13 virNetServerProgramDispatch (prog=0x56061f9c2dc0, server=server@entry=0x56061f95c030, client=0x56061f9d6870, msg=0x56061f9ccf20)
    at rpc/virnetserverprogram.c:304
#14 0x00007fc3d1aaab4d in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>,
    srv=0x56061f95c030) at rpc/virnetserver.c:144
#15 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x56061f95c030) at rpc/virnetserver.c:165
#16 0x00007fc3d19da091 in virThreadPoolWorker (opaque=opaque@entry=0x56061f9d66f0) at util/virthreadpool.c:167
#17 0x00007fc3d19d9418 in virThreadHelper (data=<optimized out>) at util/virthread.c:206
#18 0x00007fc3ce858ea5 in start_thread (arg=0x7fc38bfff700) at pthread_create.c:307
#19 0x00007fc3ce58196d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) frame 5
#5  0x00007fc3d1a61241 in virStoragePoolObjListSearchCb (payload=0x7fc3840017a0, name=<optimized out>, opaque=0x7fc38bffe9b0)
    at conf/virstorageobj.c:495
495         virObjectLock(obj);
(gdb) p *obj
$1 = {parent = {parent = {u = {dummy_align1 = 16290545681, dummy_align2 = 0x3cafe0011, s = {magic = 3405643793, refs = 3}},
      klass = 0x7fc3a40e9c40}, lock = {lock = {__data = {__lock = 2, __count = 0, __owner = 20024, __nusers = 1, __kind = 512,
          __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
        __size = "\002\000\000\000\000\000\000\000\070N\000\000\001\000\000\000\000\002", '\000' <repeats 21 times>, __align = 2}}},
  configFile = 0x7fc384001290 "/etc/libvirt/storage/hygonstoragepool.xml",
  autostartLink = 0x7fc384004530 "/etc/libvirt/storage/autostart/hygonstoragepool.xml", active = true, starting = false,
  autostart = true, asyncjobs = 0, def = 0x7fc384001e50, newDef = 0x0, volumes = 0x7fc384004640}
(gdb) thread 26
[Switching to thread 26 (Thread 0x7fc3bf5c8700 (LWP 20024))]
#0  __unregister_atfork (dso_handle=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/unregister-atfork.c:115
115           while ((val = deleted->handler->refcntr) != 0)
(gdb) bt
#0  __unregister_atfork (dso_handle=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/unregister-atfork.c:115
#1  0x00007fc3ac01d283 in __do_global_dtors_aux () from /lib64/libsoftokn3.so
#2  0x00007fc3bf5c6db0 in ?? ()
#3  0x00007fc3d22d3229 in _dl_close_worker (map=<optimized out>) at dl-close.c:266
Backtrace stopped: frame did not save the PC
(gdb) p *deleted->handler
$2 = {next = 0x7fc3ce84de78 <fork_handler_pool+56>, prepare_handler = 0x0, parent_handler = 0x0,
  child_handler = 0x7fc3ac0238f0 <ForkedChild>, dso_handle = 0x7fc3ac254708, refcntr = 0, need_signal = 1}
(gdb)

Comment 3 yili 2021-06-22 03:39:55 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1746928 notes that the problematic commit went into 2.28 and is therefore included in Red Hat Enterprise Linux 8.

Comment 4 Carlos O'Donell 2021-06-22 04:42:07 UTC
The issue fixed in bug 1746928 is about calling fork from an asynchronous signal handler and that doesn't appear to be the case here.

The thread in question, thread #26, is currently deregistering a fork handler as part of dlclose, and another thread might be calling fork (and running ForkedChild registered by /lib64/libsoftokn3.so).

The dlclose operation will block until the current user of the handler completes the fork call.

Is thread #26 inside lll_futex_wait?

Can you determine exactly in the instruction stream where thread #26 is waiting?

Please be aware that Red Hat Enterprise Linux 7 is currently in Maintenance Support 2 Phase and support is limited in scope to Urgent priority bug fixes.

Any fixes in the area of locking for fork and dlclose would likely be considered unsuitable for Maintenance Support 2 Phase.

Have you been able to reproduce this in Red Hat Enterprise Linux 8? In RHEL 8 the fork handler implementation was simplified significantly, and handler deletion is done with the lock held, which serializes the deletion so that it happens entirely before or after the fork. This removes some of the complexity present in the RHEL 7 implementation.
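
A minimal sketch of the racing pattern described above may help make this concrete. It is illustrative only: the file and symbol names are hypothetical, and it is not a confirmed reproducer (the reporter could not reproduce on demand). One thread repeatedly calls fork() while another dlclose()s a DSO that registered an atfork handler, mirroring ForkedChild in /lib64/libsoftokn3.so:

/* atfork_dso.c -- hypothetical DSO standing in for libsoftokn3.
   Build: gcc -shared -fPIC -o libatfork.so atfork_dso.c -lpthread

   #include <pthread.h>
   static void forked_child (void) { }
   __attribute__((constructor))
   static void init (void) { pthread_atfork (NULL, NULL, forked_child); }
*/

/* race.c -- driver.  Build: gcc -o race race.c -ldl -lpthread */
#include <dlfcn.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void *
forker (void *arg)
{
  (void) arg;
  for (;;)
    {
      /* fork() takes a reference (refcntr in the backtrace above) on
         each registered handler while running it.  */
      pid_t pid = fork ();
      if (pid == 0)
        _exit (0);
      if (pid > 0)
        waitpid (pid, NULL, 0);
    }
  return NULL;
}

int
main (void)
{
  pthread_t t;
  pthread_create (&t, NULL, forker, NULL);
  for (;;)
    {
      void *h = dlopen ("./libatfork.so", RTLD_NOW);
      if (h == NULL)
        abort ();
      /* dlclose runs the DSO's destructors; __unregister_atfork then
         waits for any in-flight fork() still holding a reference,
         which is the wait seen in thread #26.  */
      dlclose (h);
    }
}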

Comment 5 yili 2021-06-22 13:17:29 UTC
(In reply to Carlos O'Donell from comment #4)
> The issue fixed in bug 1746928 is about calling fork from an asynchronous
> signal handler and that doesn't appear to be the case here.
> 
> The thread in question, thread #26, is currently deregistering a fork
> handler as part of dlclose, and another thread might be calling fork (and
> running ForkedChild registered by /lib64/libsoftokn3.so).
> 
> The dlclose operation will block until the current user of the handler
> completes the fork call.
> 
> Is thread #26 inside lll_futex_wait?
> 
> Can you determine exactly in the instruction stream where thread #26 is
> waiting?
> 
(gdb) thread 26
[Switching to thread 26 (Thread 0x7fc3bf5c8700 (LWP 20024))]
#0  __unregister_atfork (dso_handle=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/unregister-atfork.c:115
115           while ((val = deleted->handler->refcntr) != 0)
(gdb) list
110
111           /* Decrement the reference counter.  If it does not reach zero
112              wait for the last user.  */
113           atomic_decrement (&deleted->handler->refcntr);
114           unsigned int val;
115           while ((val = deleted->handler->refcntr) != 0)
116             lll_futex_wait (&deleted->handler->refcntr, val, LLL_PRIVATE);
117
118           deleted = deleted->next;
119         }
(gdb) p *val
Cannot access memory at address 0x1
(gdb) p val
$1 = 1
(gdb) p *deleted->handler
$2 = {next = 0x7fc3ce84de78 <fork_handler_pool+56>, prepare_handler = 0x0, parent_handler = 0x0, child_handler = 0x7fc3ac0238f0 <ForkedChild>, dso_handle = 0x7fc3ac254708, refcntr = 0,
  need_signal = 1}
(gdb)


val is 1, while deleted->handler->refcntr is 0.
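
For discussion, here is a condensed model of the two sides of the refcntr protocol. The unregister side follows the unregister-atfork.c lines 113-116 listed above; the fork side paraphrases the RHEL 7 (glibc 2.17) nptl fork.c from memory and should be treated as an assumption, not a verified quote. Note that FUTEX_WAIT only sleeps if the word still equals val, so a stale val of 1 against a refcntr that is already 0 should return EAGAIN immediately rather than block:

#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

struct fork_handler_model
{
  atomic_uint refcntr;   /* registration ref + users currently inside fork() */
  int need_signal;       /* set when an unregistering thread is waiting */
};

static long
futex (atomic_uint *uaddr, int op, unsigned int val)
{
  return syscall (SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* dlclose path, as in unregister-atfork.c:113-116 above.  */
static void
unregister_side (struct fork_handler_model *h)
{
  atomic_fetch_sub (&h->refcntr, 1);
  unsigned int val;
  while ((val = atomic_load (&h->refcntr)) != 0)
    futex (&h->refcntr, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, val);
}

/* fork path after running a handler (paraphrased; assumption).  */
static void
fork_side (struct fork_handler_model *h)
{
  if (atomic_fetch_sub (&h->refcntr, 1) == 1 && h->need_signal)
    futex (&h->refcntr, FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1);
}

static void *
fork_thread (void *arg)
{
  usleep (1000);               /* let the unregister side start waiting */
  fork_side (arg);
  return NULL;
}

int
main (void)
{
  /* One registration reference plus one fork() in flight; need_signal
     is preset here for simplicity (the real code sets it atomically).  */
  struct fork_handler_model h;
  atomic_init (&h.refcntr, 2);
  h.need_signal = 1;
  pthread_t t;
  pthread_create (&t, NULL, fork_thread, &h);
  unregister_side (&h);        /* returns once the fork side drops its ref */
  pthread_join (t, NULL);
  return 0;
}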


> Please be aware that Red Hat Enterprise Linux 7 is currently in Maintenance
> Support 2 Phase and support is limited in scope to Urgent priority bug fixes.
> 
> Any fixes in the area of locking for fork and dlclose would likely be
> considered unsuitable for Maintenance Support 2 Phase.
> 
> Have you been able to reproduce this in Red Hat Enterprise Linux 8? In RHEL 8
> the fork handler implementation was simplified significantly, and handler
> deletion is done with the lock held, which serializes the deletion so that it
> happens entirely before or after the fork. This removes some of the
> complexity present in the RHEL 7 implementation.

I will check RHEL 8, but maybe we can dig deeper into this issue.

Comment 6 yili 2021-06-22 13:40:38 UTC
   0x00007fc3ce58fa4d <+285>:   lock decl 0x28(%rax)
   0x00007fc3ce58fa51 <+289>:   mov    (%r8),%rdi
   0x00007fc3ce58fa54 <+292>:   mov    0x28(%rdi),%edx
   0x00007fc3ce58fa57 <+295>:   test   %edx,%edx
   0x00007fc3ce58fa59 <+297>:   je     0x7fc3ce58fa76 <__unregister_atfork+326>
   0x00007fc3ce58fa5b <+299>:   nopl   0x0(%rax,%rax,1)
   0x00007fc3ce58fa60 <+304>:   xor    %r10d,%r10d
   0x00007fc3ce58fa63 <+307>:   add    $0x28,%rdi
   0x00007fc3ce58fa67 <+311>:   mov    %r9d,%eax
   0x00007fc3ce58fa6a <+314>:   syscall
=> 0x00007fc3ce58fa6c <+316>:   mov    (%r8),%rdi
   0x00007fc3ce58fa6f <+319>:   mov    0x28(%rdi),%edx
   0x00007fc3ce58fa72 <+322>:   test   %edx,%edx
   0x00007fc3ce58fa74 <+324>:   jne    0x7fc3ce58fa60 <__unregister_atfork+304>
   0x00007fc3ce58fa76 <+326>:   mov    0x8(%r8),%r8
   0x00007fc3ce58fa7a <+330>:   test   %r8,%r8
   0x00007fc3ce58fa7d <+333>:   jne    0x7fc3ce58fa40 <__unregister_atfork+272>
   0x00007fc3ce58fa7f <+335>:   leaveq
   0x00007fc3ce58fa80 <+336>:   retq
   0x00007fc3ce58fa81 <+337>:   mov    (%rdx),%rcx
   0x00007fc3ce58fa84 <+340>:   mov    %rdx,%rax
   0x00007fc3ce58fa87 <+343>:   cmpl   $0x0,%fs:0x18
   0x00007fc3ce58fa90 <+352>:   je     0x7fc3ce58fa93 <__unregister_atfork+355>
   0x00007fc3ce58fa92 <+354>:   lock cmpxchg %rcx,0x2c02fd(%rip)        # 0x7fc3ce84fd98 <__fork_handlers>
   0x00007fc3ce58fa9b <+363>:   cmp    %rax,%rdx
   0x00007fc3ce58fa9e <+366>:   je     0x7fc3ce58f9c0 <__unregister_atfork+144>
   0x00007fc3ce58faa4 <+372>:   mov    0x2c02ed(%rip),%rdx        # 0x7fc3ce84fd98 <__fork_handlers>
   0x00007fc3ce58faab <+379>:   jmpq   0x7fc3ce58f9ab <__unregister_atfork+123>
End of assembler dump.
(gdb)
(gdb) info reg
rax            0xfffffffffffffe00       -512
rbx            0x7fc3ac2545d0   140478383474128
rcx            0x7fc3ce58fa6c   140478957288044
rdx            0x1      1
rsi            0x80     128
rdi            0x7fc3ce84ded0   140478960164560
rbp            0x7fc3bf5c6870   0x7fc3bf5c6870
rsp            0x7fc3bf5c6850   0x7fc3bf5c6850
r8             0x7fc3bf5c6850   140478705854544
r9             0xca     202
r10            0x0      0
r11            0x212    530
r12            0x7fc370008640   140477374432832
r13            0x6d     109
r14            0x0      0
r15            0x6d     109
rip            0x7fc3ce58fa6c   0x7fc3ce58fa6c <__unregister_atfork+316>
eflags         0x212    [ AF IF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

According to this, thread #26 is inside lll_futex_wait: eax is loaded from r9d (0xca = 202 = __NR_futex on x86_64) before the syscall at +314, rsi is 0x80 (FUTEX_WAIT | FUTEX_PRIVATE_FLAG), rdx is 1 (the stale val), and rip points just past the syscall instruction.

Comment 7 Carlos O'Donell 2022-05-13 17:39:05 UTC
RHEL 7 is currently in Maintenance Support 2 Phase. During this phase only urgent priority bug fixes will be considered. This particular issue has always been present given the current implementation of fork and pthread_atfork handlers. We do not plan to fix this for RHEL 7, but we are actively evaluating a solution for RHEL 8 in Bug 1888660.

I am marking this as CLOSED / WONTFIX for RHEL 7. Please feel free to follow bug 1888660 for the RHEL8 fix.