Bug 1811632

Summary: pl_DefWrite() crashes in callback *PRWriteFN
Product: Red Hat Enterprise Linux 7 Reporter: amitkuma
Component: 389-ds-baseAssignee: thierry bordaz <tbordaz>
Status: CLOSED DUPLICATE QA Contact: RHDS QE <ds-qe-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 7.6CC: aheverle, apeddire, hkhot, jwooten, mreynolds, msauton, pasik, rrelyea, spichugi, tbordaz, tmihinto, vashirov
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-23 00:13:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Comment 5 thierry bordaz 2020-03-12 08:46:07 UTC
Thanks for the debug environment

    - Crashing thread is sasl bind sending back data
[29/Feb/2020:00:11:37.005277765 -0500] conn=1759899 fd=169 slot=169 connection from 10.8.4.59 to 10.241.127.197
[29/Feb/2020:00:11:37.005922272 -0500] conn=1759899 op=0 SRCH base="" scope=0 filter="(objectClass=*)" attrs="* altServer namingContexts supportedControl supportedExtension supportedFeatures supportedLDAPVersion supportedSASLMechanisms domaincontrollerfunctionality defaultnamingcontext lastusn highestcommittedusn aci"
[29/Feb/2020:00:11:37.007175023 -0500] conn=1759899 op=0 RESULT err=0 tag=101 nentries=1 etime=0.0001492091
[29/Feb/2020:00:11:37.112671578 -0500] conn=1759899 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI
[29/Feb/2020:00:11:37.170833924 -0500] conn=1759899 op=1 RESULT err=14 tag=97 nentries=0 etime=0.1835246772, SASL bind in progress
[29/Feb/2020:00:11:37.172148132 -0500] conn=1759899 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI
[29/Feb/2020:00:11:37.172995298 -0500] conn=1759899 op=2 RESULT err=14 tag=97 nentries=0 etime=0.0000872759, SASL bind in progress
[29/Feb/2020:00:11:37.174364349 -0500] conn=1759899 op=3 BIND dn="" method=sasl version=3 mech=GSSAPI

(gdb) where
#0  0x00007f11a4d411e4 in pl_DefWrite (fd=<optimized out>, buf=0x56395a100000, amount=14) at ../../../nspr/pr/src/io/prlayer.c:87
#1  0x00005638dbd9c431 in write_function (ignore=0, handle=0x563952c0f4a0, count=14, buffer=0x56395a100000) at ldap/servers/slapd/daemon.c:2002
#2  0x00005638dbd9c431 in openldap_write_function (sbiod=<optimized out>, buf=0x56395a100000, len=14) at ldap/servers/slapd/daemon.c:2076
#3  0x00007f11a590292e in ber_int_sb_write (sb=sb@entry=0x5638dfd75770, buf=0x56395a100000, len=len@entry=14) at sockbuf.c:445
#4  0x00007f11a58fec0b in ber_flush2 (sb=0x5638dfd75770, ber=ber@entry=0x5639032e1220, freeit=freeit@entry=1) at io.c:246
#5  0x00007f11a58fed1d in ber_flush (sb=<optimized out>, ber=ber@entry=0x5639032e1220, freeit=freeit@entry=1) at io.c:211
#6  0x00007f11a6f982ec in flush_ber (pb=pb@entry=0x5638dfd691a0, conn=0x5638dfd85d80, op=0x563903430000, ber=ber@entry=0x5639032e1220, type=type@entry=0)
    at ldap/servers/slapd/result.c:1761
#7  0x00007f11a6f9a4f0 in send_ldap_result_ext (pb=pb@entry=0x5638dfd691a0, err=err@entry=0, matched=matched@entry=0x0, text=<optimized out>, 
    text@entry=0x0, nentries=nentries@entry=0, urls=urls@entry=0x0, ber=0x5639032e1220, ber@entry=0x0) at ldap/servers/slapd/result.c:600
#8  0x00007f11a6f9a6b1 in send_ldap_result (pb=pb@entry=0x5638dfd691a0, err=err@entry=0, matched=matched@entry=0x0, text=text@entry=0x0, nentries=nentries@entry=0, urls=urls@entry=0x0) at ldap/servers/slapd/result.c:199
#9  0x00005638dbdabbfd in ids_sasl_check_bind (pb=pb@entry=0x5638dfd691a0) at ldap/servers/slapd/saslbind.c:1125
#10 0x00005638dbd93871 in do_bind (pb=pb@entry=0x5638dfd691a0) at ldap/servers/slapd/bind.c:363
#11 0x00005638dbd9a4b8 in connection_dispatch_operation (pb=0x5638dfd691a0, op=0x563903430000, conn=0x5638dfd85d80) at ldap/servers/slapd/connection.c:600
#12 0x00005638dbd9a4b8 in connection_threadmain () at ldap/servers/slapd/connection.c:1785
#13 0x00007f11a4d5abab in _pt_root (arg=0x5638dfc446c0) at ../../../nspr/pr/src/pthreads/ptthread.c:201
#14 0x00007f11a46fadd5 in start_thread (arg=0x7f11781b9700) at pthread_create.c:307
#15 0x00007f11a3da6ead in __libc_ifunc_impl_list (name=<optimized out>, array=0x7f11781b9700, max=<optimized out>) at ../sysdeps/x86_64/multiarch/ifunc-impl-list.c:338
#16 0x0000000000000000 in None ()

    - It crashed because the lower layer (containing callbacks) was NULL
(gdb) frame 0
(gdb) x/i pl_DefWrite
   0x7f11a4d411e0 <pl_DefWrite>:	mov    0x10(%rdi),%rdi
(gdb) x/10i pl_DefWrite
   0x7f11a4d411e0 <pl_DefWrite>:	mov    0x10(%rdi),%rdi   <--- 'rdi' was the fd handle. 0x10(%rdi) was fd->lower
=> 0x7f11a4d411e4 <pl_DefWrite+4>:	mov    (%rdi),%rax
(gdb) print $rdi
$8 = 0
(gdb) ptype fd
type = struct PRFileDesc {
    const PRIOMethods *methods;
    PRFilePrivate *secret;
    PRFileDesc *lower;      <--- 'lower' layer was NULL
    PRFileDesc *higher;
    void (*dtor)(PRFileDesc *);
    PRDescIdentity identity;
} *

    - But in the dumped core we can see that it is *not* NULL
(gdb) thread 1
(gdb) frame 1
(gdb) print *((PRFileDesc *)handle)->lower
$6 = {methods = 0x7f11a4f6d580 <ipv6_to_v4_tcpMethods>, secret = 0x0, lower = 0x563903e198c0, higher = 0x563952c0f4a0, dtor = 0x7f11a4d41390 <pl_FDDestructor>, 
  identity = 1}
(gdb) print *((PRFileDesc *)handle)->lower->methods
$7 = {file_type = PR_DESC_LAYERED, close = 0x7f11a4d41bc0 <pl_TopClose>, read = 0x7f11a4d411d0 <pl_DefRead>, write = 0x7f11a4d411e0 <pl_DefWrite>, 
  available = 0x7f11a4d411f0 <pl_DefAvailable>, available64 = 0x7f11a4d41200 <pl_DefAvailable64>, fsync = 0x7f11a4d41210 <pl_DefFsync>, 
  seek = 0x7f11a4d41220 <pl_DefSeek>, seek64 = 0x7f11a4d41230 <pl_DefSeek64>, fileInfo = 0x7f11a4d41240 <pl_DefFileInfo>, 
  fileInfo64 = 0x7f11a4d41250 <pl_DefFileInfo64>, writev = 0x7f11a4d41260 <pl_DefWritev>, connect = 0x7f11a4d40f20 <Ipv6ToIpv4SocketConnect>, 
  accept = 0x7f11a4d40c80 <Ipv6ToIpv4SocketAccept>, bind = 0x7f11a4d40e70 <Ipv6ToIpv4SocketBind>, listen = 0x7f11a4d412b0 <pl_DefListen>, 
  shutdown = 0x7f11a4d412c0 <pl_DefShutdown>, recv = 0x7f11a4d412d0 <pl_DefRecv>, send = 0x7f11a4d412e0 <pl_DefSend>, recvfrom = 0x7f11a4d412f0 <pl_DefRecvfrom>, 
  sendto = 0x7f11a4d41300 <pl_DefSendto>, poll = 0x7f11a4d41310 <pl_DefPoll>, acceptread = 0x7f11a4d40ae0 <Ipv6ToIpv4SocketAcceptRead>, 
  transmitfile = 0x7f11a4d41330 <pl_DefTransmitfile>, getsockname = 0x7f11a4d40a70 <Ipv6ToIpv4SocketGetName>, 
  getpeername = 0x7f11a4d40a00 <Ipv6ToIpv4SocketGetPeerName>, reserved_fn_6 = 0x7f11a4d40230 <_PR_InvalidInt>, reserved_fn_5 = 0x7f11a4d40230 <_PR_InvalidInt>, 
  getsocketoption = 0x7f11a4d41360 <pl_DefGetsocketoption>, setsocketoption = 0x7f11a4d41370 <pl_DefSetsocketoption>, sendfile = 0x7f11a4d41380 <pl_DefSendfile>, 
  connectcontinue = 0x7f11a4d41280 <pl_DefConnectcontinue>, reserved_fn_3 = 0x7f11a4d40230 <_PR_InvalidInt>, reserved_fn_2 = 0x7f11a4d40230 <_PR_InvalidInt>, 
  reserved_fn_1 = 0x7f11a4d40230 <_PR_InvalidInt>, reserved_fn_0 = 0x7f11a4d40230 <_PR_InvalidInt>}

    In conclusion:
        I suspect two threads (thread 1 and another one not identified in the core) competed to use the connect.
        Thread 1 (sasl bind) registered callbacks and at the same time the other thread pushed
        them as layer callback.
        The push reset fd->lower for a transient period, that triggered the crash
        It matches BZ https://bugzilla.redhat.com/show_bug.cgi?id=1756182

Next step:
    - How frequently does it crash ?
    - Did you identify a reproducer ?
    - Please update 389-ds version to 389-ds-base-1.3.10.1-3.el7 (bz 1756182) and check if crash happens again

Comment 6 amitkuma 2020-03-13 11:47:58 UTC
Response from Customer:

- How frequently does it crash ?
ns-slapd is crashing every few days, this may be more frequent under heavy load, but this is anecdotal at the moment.
    - Did you identify a reproducer ?
Unfortunately not
    - Please update 389-ds version to 389-ds-base-1.3.10.1-3.el7 (bz 1756182) and check if crash happens again "
I will discuss this internal.

Will revert once customer discusses internally comes back

Comment 20 Marc Sauton 2020-06-04 00:49:23 UTC
*** Bug 1843676 has been marked as a duplicate of this bug. ***

Comment 24 joel 2020-06-04 19:15:55 UTC
customer has verified their package:

 "The version is indeed 1.3.9.1-12.el7_7. 

So what's the best course of action for now?
"