Response from Customer:
- How frequently does it crash ?
ns-slapd is crashing every few days, this may be more frequent under heavy load, but this is anecdotal at the moment.
- Did you identify a reproducer ?
Unfortunately not
- Please update 389-ds version to 389-ds-base-1.3.10.1-3.el7 (bz 1756182) and check if crash happens again "
I will discuss this internal.
Will revert once customer discusses internally comes back
Thanks for the debug environment - Crashing thread is sasl bind sending back data [29/Feb/2020:00:11:37.005277765 -0500] conn=1759899 fd=169 slot=169 connection from 10.8.4.59 to 10.241.127.197 [29/Feb/2020:00:11:37.005922272 -0500] conn=1759899 op=0 SRCH base="" scope=0 filter="(objectClass=*)" attrs="* altServer namingContexts supportedControl supportedExtension supportedFeatures supportedLDAPVersion supportedSASLMechanisms domaincontrollerfunctionality defaultnamingcontext lastusn highestcommittedusn aci" [29/Feb/2020:00:11:37.007175023 -0500] conn=1759899 op=0 RESULT err=0 tag=101 nentries=1 etime=0.0001492091 [29/Feb/2020:00:11:37.112671578 -0500] conn=1759899 op=1 BIND dn="" method=sasl version=3 mech=GSSAPI [29/Feb/2020:00:11:37.170833924 -0500] conn=1759899 op=1 RESULT err=14 tag=97 nentries=0 etime=0.1835246772, SASL bind in progress [29/Feb/2020:00:11:37.172148132 -0500] conn=1759899 op=2 BIND dn="" method=sasl version=3 mech=GSSAPI [29/Feb/2020:00:11:37.172995298 -0500] conn=1759899 op=2 RESULT err=14 tag=97 nentries=0 etime=0.0000872759, SASL bind in progress [29/Feb/2020:00:11:37.174364349 -0500] conn=1759899 op=3 BIND dn="" method=sasl version=3 mech=GSSAPI (gdb) where #0 0x00007f11a4d411e4 in pl_DefWrite (fd=<optimized out>, buf=0x56395a100000, amount=14) at ../../../nspr/pr/src/io/prlayer.c:87 #1 0x00005638dbd9c431 in write_function (ignore=0, handle=0x563952c0f4a0, count=14, buffer=0x56395a100000) at ldap/servers/slapd/daemon.c:2002 #2 0x00005638dbd9c431 in openldap_write_function (sbiod=<optimized out>, buf=0x56395a100000, len=14) at ldap/servers/slapd/daemon.c:2076 #3 0x00007f11a590292e in ber_int_sb_write (sb=sb@entry=0x5638dfd75770, buf=0x56395a100000, len=len@entry=14) at sockbuf.c:445 #4 0x00007f11a58fec0b in ber_flush2 (sb=0x5638dfd75770, ber=ber@entry=0x5639032e1220, freeit=freeit@entry=1) at io.c:246 #5 0x00007f11a58fed1d in ber_flush (sb=<optimized out>, ber=ber@entry=0x5639032e1220, freeit=freeit@entry=1) at io.c:211 #6 0x00007f11a6f982ec in flush_ber (pb=pb@entry=0x5638dfd691a0, conn=0x5638dfd85d80, op=0x563903430000, ber=ber@entry=0x5639032e1220, type=type@entry=0) at ldap/servers/slapd/result.c:1761 #7 0x00007f11a6f9a4f0 in send_ldap_result_ext (pb=pb@entry=0x5638dfd691a0, err=err@entry=0, matched=matched@entry=0x0, text=<optimized out>, text@entry=0x0, nentries=nentries@entry=0, urls=urls@entry=0x0, ber=0x5639032e1220, ber@entry=0x0) at ldap/servers/slapd/result.c:600 #8 0x00007f11a6f9a6b1 in send_ldap_result (pb=pb@entry=0x5638dfd691a0, err=err@entry=0, matched=matched@entry=0x0, text=text@entry=0x0, nentries=nentries@entry=0, urls=urls@entry=0x0) at ldap/servers/slapd/result.c:199 #9 0x00005638dbdabbfd in ids_sasl_check_bind (pb=pb@entry=0x5638dfd691a0) at ldap/servers/slapd/saslbind.c:1125 #10 0x00005638dbd93871 in do_bind (pb=pb@entry=0x5638dfd691a0) at ldap/servers/slapd/bind.c:363 #11 0x00005638dbd9a4b8 in connection_dispatch_operation (pb=0x5638dfd691a0, op=0x563903430000, conn=0x5638dfd85d80) at ldap/servers/slapd/connection.c:600 #12 0x00005638dbd9a4b8 in connection_threadmain () at ldap/servers/slapd/connection.c:1785 #13 0x00007f11a4d5abab in _pt_root (arg=0x5638dfc446c0) at ../../../nspr/pr/src/pthreads/ptthread.c:201 #14 0x00007f11a46fadd5 in start_thread (arg=0x7f11781b9700) at pthread_create.c:307 #15 0x00007f11a3da6ead in __libc_ifunc_impl_list (name=<optimized out>, array=0x7f11781b9700, max=<optimized out>) at ../sysdeps/x86_64/multiarch/ifunc-impl-list.c:338 #16 0x0000000000000000 in None () - It crashed because the lower layer (containing callbacks) was NULL (gdb) frame 0 (gdb) x/i pl_DefWrite 0x7f11a4d411e0 <pl_DefWrite>: mov 0x10(%rdi),%rdi (gdb) x/10i pl_DefWrite 0x7f11a4d411e0 <pl_DefWrite>: mov 0x10(%rdi),%rdi <--- 'rdi' was the fd handle. 0x10(%rdi) was fd->lower => 0x7f11a4d411e4 <pl_DefWrite+4>: mov (%rdi),%rax (gdb) print $rdi $8 = 0 (gdb) ptype fd type = struct PRFileDesc { const PRIOMethods *methods; PRFilePrivate *secret; PRFileDesc *lower; <--- 'lower' layer was NULL PRFileDesc *higher; void (*dtor)(PRFileDesc *); PRDescIdentity identity; } * - But in the dumped core we can see that it is *not* NULL (gdb) thread 1 (gdb) frame 1 (gdb) print *((PRFileDesc *)handle)->lower $6 = {methods = 0x7f11a4f6d580 <ipv6_to_v4_tcpMethods>, secret = 0x0, lower = 0x563903e198c0, higher = 0x563952c0f4a0, dtor = 0x7f11a4d41390 <pl_FDDestructor>, identity = 1} (gdb) print *((PRFileDesc *)handle)->lower->methods $7 = {file_type = PR_DESC_LAYERED, close = 0x7f11a4d41bc0 <pl_TopClose>, read = 0x7f11a4d411d0 <pl_DefRead>, write = 0x7f11a4d411e0 <pl_DefWrite>, available = 0x7f11a4d411f0 <pl_DefAvailable>, available64 = 0x7f11a4d41200 <pl_DefAvailable64>, fsync = 0x7f11a4d41210 <pl_DefFsync>, seek = 0x7f11a4d41220 <pl_DefSeek>, seek64 = 0x7f11a4d41230 <pl_DefSeek64>, fileInfo = 0x7f11a4d41240 <pl_DefFileInfo>, fileInfo64 = 0x7f11a4d41250 <pl_DefFileInfo64>, writev = 0x7f11a4d41260 <pl_DefWritev>, connect = 0x7f11a4d40f20 <Ipv6ToIpv4SocketConnect>, accept = 0x7f11a4d40c80 <Ipv6ToIpv4SocketAccept>, bind = 0x7f11a4d40e70 <Ipv6ToIpv4SocketBind>, listen = 0x7f11a4d412b0 <pl_DefListen>, shutdown = 0x7f11a4d412c0 <pl_DefShutdown>, recv = 0x7f11a4d412d0 <pl_DefRecv>, send = 0x7f11a4d412e0 <pl_DefSend>, recvfrom = 0x7f11a4d412f0 <pl_DefRecvfrom>, sendto = 0x7f11a4d41300 <pl_DefSendto>, poll = 0x7f11a4d41310 <pl_DefPoll>, acceptread = 0x7f11a4d40ae0 <Ipv6ToIpv4SocketAcceptRead>, transmitfile = 0x7f11a4d41330 <pl_DefTransmitfile>, getsockname = 0x7f11a4d40a70 <Ipv6ToIpv4SocketGetName>, getpeername = 0x7f11a4d40a00 <Ipv6ToIpv4SocketGetPeerName>, reserved_fn_6 = 0x7f11a4d40230 <_PR_InvalidInt>, reserved_fn_5 = 0x7f11a4d40230 <_PR_InvalidInt>, getsocketoption = 0x7f11a4d41360 <pl_DefGetsocketoption>, setsocketoption = 0x7f11a4d41370 <pl_DefSetsocketoption>, sendfile = 0x7f11a4d41380 <pl_DefSendfile>, connectcontinue = 0x7f11a4d41280 <pl_DefConnectcontinue>, reserved_fn_3 = 0x7f11a4d40230 <_PR_InvalidInt>, reserved_fn_2 = 0x7f11a4d40230 <_PR_InvalidInt>, reserved_fn_1 = 0x7f11a4d40230 <_PR_InvalidInt>, reserved_fn_0 = 0x7f11a4d40230 <_PR_InvalidInt>} In conclusion: I suspect two threads (thread 1 and another one not identified in the core) competed to use the connect. Thread 1 (sasl bind) registered callbacks and at the same time the other thread pushed them as layer callback. The push reset fd->lower for a transient period, that triggered the crash It matches BZ https://bugzilla.redhat.com/show_bug.cgi?id=1756182 Next step: - How frequently does it crash ? - Did you identify a reproducer ? - Please update 389-ds version to 389-ds-base-1.3.10.1-3.el7 (bz 1756182) and check if crash happens again