Bug 1615586

Summary: segfault in ns-slapd during FreeIPA server deployment (or post-deployment operation) on Rawhide
Product: [Fedora] Fedora Reporter: Adam Williamson <awilliam>
Component: 389-ds-baseAssignee: mreynolds
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 29CC: abokovoy, awilliam, gmarr, mhonek, mreynolds, normand, rcritten, rmeggins, robatino, ssorce, tbordaz, vashirov
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: AcceptedBlocker
Fixed In Version: 389-ds-base-1.4.0.15-1.fc29 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-07 23:54:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1517011    
Attachments:
Description Flags
stacktrace.txt
none
core.ns-slapd.389.c2d5d91a0ca341219d1c21ee196232b1.19895.1534273439000000.lz4 none

Description Adam Williamson 2018-08-13 18:54:09 UTC
Most FreeIPA server deployment tests failed on both prod and staging openQA in the last two Rawhide composes, Fedora-Rawhide-20180812.n.0 and Fedora-Rawhide-20180813.n.0. In all cases, it seems like this was caused by a segfault in ns-slapd, like this (0813.n.0 prod):

======

Aug 13 09:48:27 ipa002.domain.local ns-slapd[32126]: [13/Aug/2018:12:48:27.184902543 -0400] - INFO - ldbm_back_ldbm2index - userRoot: Finished indexing.
Aug 13 09:48:27 ipa002.domain.local audit[32126]: ANOM_ABEND auid=4294967295 uid=389 gid=389 ses=4294967295 subj=system_u:system_r:dirsrv_t:s0 pid=32126 comm="ns-slapd" exe="/usr/sbin/ns-slapd" sig=11 res=1
Aug 13 09:48:27 ipa002.domain.local kernel: ns-slapd[32156]: segfault at 1a2000 ip 00007f44055a651f sp 00007f43ebbf9360 error 4 in libjemalloc.so.2[7f44055a4000+6c000]
Aug 13 09:48:27 ipa002.domain.local kernel: Code: 28 41 81 e4 f0 00 00 00 49 01 dc 4d 8b 4c 24 28 4d 39 ce 0f 85 da 01 00 00 48 89 e8 48 c1 e8 09 25 f8 ff 1f 00 49 03 44 24 30 <4c> 8b 00 4c 8d 15 97 c8 06 00 4d 89 c1 49 c1 e9 30 44 89 ca 49 8b 
Aug 13 09:48:27 ipa002.domain.local systemd[1]: dirsrv: Main process exited, code=killed, status=11/SEGV
Aug 13 09:48:27 ipa002.domain.local systemd[1]: dirsrv: Failed with result 'signal'.
Aug 13 09:48:27 ipa002.domain.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dirsrv@DOMAIN-LOCAL comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'

======

or this (0813.n.0 stg):

Aug 13 09:19:46 ipa002.domain.local systemd[1]: Started The Apache HTTP Server.
Aug 13 09:19:46 ipa002.domain.local audit[32244]: ANOM_ABEND auid=4294967295 uid=389 gid=389 ses=4294967295 subj=system_u:system_r:dirsrv_t:s0 pid=32244 comm="ns-slapd" exe="/usr/sbin/ns-slapd" sig=11 res=1
Aug 13 09:19:46 ipa002.domain.local kernel: ns-slapd[32257]: segfault at bd800 ip 00007efc81ad651f sp 00007efc6c5fb360 error 4 in libjemalloc.so.2[7efc81ad4000+6c000]
Aug 13 09:19:46 ipa002.domain.local kernel: Code: 28 41 81 e4 f0 00 00 00 49 01 dc 4d 8b 4c 24 28 4d 39 ce 0f 85 da 01 00 00 48 89 e8 48 c1 e8 09 25 f8 ff 1f 00 49 03 44 24 30 <4c> 8b 00 4c 8d 15 97 c8 06 00 4d 89 c1 49 c1 e9 30 44 89 ca 49 8b 
Aug 13 09:19:46 ipa002.domain.local systemd[1]: dirsrv: Main process exited, code=killed, status=11/SEGV
Aug 13 09:19:46 ipa002.domain.local systemd[1]: dirsrv: Failed with result 'signal'.
Aug 13 09:19:46 ipa002.domain.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dirsrv@DOMAIN-LOCAL comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'

======

or this (0812.n.0 stg):

Aug 12 08:22:59 ipa002.domain.local [32363]: GSSAPI client step 1
Aug 12 08:22:59 ipa002.domain.local kernel: ns-slapd[32250]: segfault at 6b18 ip 00007f10fb8be51f sp 00007f10e79fd360 error 4 in libjemalloc.so.2[7f10fb8bc000+6c000]
Aug 12 08:22:59 ipa002.domain.local kernel: Code: 28 41 81 e4 f0 00 00 00 49 01 dc 4d 8b 4c 24 28 4d 39 ce 0f 85 da 01 00 00 48 89 e8 48 c1 e8 09 25 f8 ff 1f 00 49 03 44 24 30 <4c> 8b 00 4c 8d 15 97 c8 06 00 4d 89 c1 49 c1 e9 30 44 89 ca 49 8b 
Aug 12 08:22:59 ipa002.domain.local audit[32239]: ANOM_ABEND auid=4294967295 uid=389 gid=389 ses=4294967295 subj=system_u:system_r:dirsrv_t:s0 pid=32239 comm="ns-slapd" exe="/usr/sbin/ns-slapd" sig=11 res=1
Aug 12 08:22:59 ipa002.domain.local systemd[1]: dirsrv: Main process exited, code=killed, status=11/SEGV
Aug 12 08:22:59 ipa002.domain.local named-pkcs11[32658]: LDAP error: Can't contact LDAP server: ldap_sync_poll() failed
Aug 12 08:22:59 ipa002.domain.local systemd[1]: dirsrv: Failed with result 'signal'.
Aug 12 08:22:59 ipa002.domain.local named-pkcs11[32658]: ldap_syncrepl will reconnect in 60 seconds

======

The segfault doesn't seem to happen at the exact same time or point of the tests each run, and in a couple of cases didn't happen at all (the non-replica server tests for 0813.n.0). It happened in 6 out of 8 total FreeIPA server deployment tests run on those two composes across prod and staging openQA.

Comment 1 Adam Williamson 2018-08-13 18:55:27 UTC
*** Bug 1615452 has been marked as a duplicate of this bug. ***

Comment 2 Adam Williamson 2018-08-13 19:05:13 UTC
Correction: crash happened in at least 7 out of 8 cases, it also happened in the 'successful' https://openqa.fedoraproject.org/tests/265344 , probably while one of the client tests was trying to do something, likely causing client test failure. It's a bit difficult trying to unpick all the different test failures ATM, but I suspect a lot if not all are down to this.

Comment 3 mreynolds 2018-08-13 19:13:12 UTC
Which version of 389-ds-base was used?

We just added changes to satisfy this bug around jemalloc (1.4.0.14-2):

https://bugzilla.redhat.com/show_bug.cgi?id=1615302

Perhaps this is part of the problem?

Comment 4 mreynolds 2018-08-13 19:25:05 UTC
Also, is there a core file or stack trace that I can look at?

Comment 5 mreynolds 2018-08-13 19:43:58 UTC
https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180813.n.0/compose/Everything/x86_64/os/Packages/3/

This is actually 389-ds-base-1.4.0.14-1. So this is not related to https://bugzilla.redhat.com/show_bug.cgi?id=1615302 (which is good).

But there are no known crashes in 1.4.0.14-1, so this is a new issue.  Really need a reproducer or corefile/stacktrace to look at.

Comment 6 Adam Williamson 2018-08-13 21:21:18 UTC
Well, it seems it quite often is reproduced just by running a FreeIPA server deployment on current Rawhide. I will try to reproduce it in a local VM and get at the core dump somehow (it seems like it's not being placed in /var/lib/systemd/coredump or /var/spool/abrt , where openQA looks for them from systemd and abrt respectively).

Comment 7 mreynolds 2018-08-13 21:49:56 UTC
(In reply to Adam Williamson from comment #6)
> Well, it seems it quite often is reproduced just by running a FreeIPA server
> deployment on current Rawhide. 

Do you mean just installing FreeIPA or running some kind of test suite?  If it's just installing FreeIPA then I can work on that.

Comment 8 Adam Williamson 2018-08-13 22:01:47 UTC
Just installing necessary packages and doing 'ipa-server-install' with appropriate options. The test that's failing the most runs it like this:

ipa-server-install -U --realm=DOMAIN.LOCAL --domain=domain.local --ds-password=monkeys123 --admin-password=monkeys123 --setup-dns --reverse-zone=2.0.10.in-addr.arpa --allow-zone-overlap --forwarder=10.5.126.21 --forwarder=10.5.126.22

obviously adjust host / realm names and IP addresses as appropriate.

Comment 9 Jan Kurik 2018-08-14 10:25:09 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle.
Changing version to '29'.

Comment 10 Viktor Ashirov 2018-08-14 19:16:40 UTC
Created attachment 1475948 [details]
stacktrace.txt

Comment 11 Viktor Ashirov 2018-08-14 19:17:59 UTC
Created attachment 1475949 [details]
core.ns-slapd.389.c2d5d91a0ca341219d1c21ee196232b1.19895.1534273439000000.lz4

Comment 13 mreynolds 2018-08-14 19:23:56 UTC
Looks like its a double free in the connection code:

#8  0x000055e04c06d1fb in connection_threadmain () at ldap/servers/slapd/connection.c:1585
1585	                        slapi_sdn_free(&anon_sdn);


(gdb) p *anon_sdn
$3 = {flag = 46 '.', udn = 0x65 <error: Cannot access memory at address 0x65>, 
  dn = 0x7fb49264a04000 <error: Cannot access memory at address 0x7fb49264a04000>, ndn = 0x0, ndn_len = 10496}


This is possibly related to nunc-stans being disabled.  Needs more investigation...

Comment 14 mreynolds 2018-08-14 19:34:01 UTC
Here is the crashing thread

#0  0x00007fb498f3f51f in atomic_load_p (mo=atomic_memory_order_relaxed, a=0x125020)
    at include/jemalloc/internal/atomic_gcc_atomic.h:18
#1  0x00007fb498f3f51f in rtree_leaf_elm_bits_read (dependent=true, elm=0x125020, rtree=<optimized out>, tsdn=<optimized out>)
    at include/jemalloc/internal/rtree.h:175
#2  0x00007fb498f3f51f in rtree_szind_slab_read
    (r_slab=<synthetic pointer>, r_szind=<synthetic pointer>, dependent=true, key=35945862888701952, rtree_ctx=0x7fb482172f28, rtree=<optimized out>, tsdn=<optimized out>) at include/jemalloc/internal/rtree.h:464
#3  0x00007fb498f3f51f in ifree (slow_path=false, tcache=0x7fb4821730c0, ptr=0x7fb49264a04000, tsd=<optimized out>)
    at src/jemalloc.c:2206
#4  0x00007fb498f3f51f in free (ptr=0x7fb49264a04000) at src/jemalloc.c:2393
#5  0x00007fb498e36fb5 in slapi_ch_free () at /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007fb498e40479 in slapi_sdn_done () at /usr/lib64/dirsrv/libslapd.so.0
#7  0x00007fb498e40afe in slapi_sdn_free () at /usr/lib64/dirsrv/libslapd.so.0
#8  0x000055e04c06d1fb in connection_threadmain () at ldap/servers/slapd/connection.c:1585
#9  0x00007fb498a00ac8 in _pt_root (arg=0x7fb493fb8740) at ../../.././nspr/pr/src/pthreads/ptthread.c:201
#10 0x00007fb49899e58e in start_thread (arg=<optimized out>) at pthread_create.c:486
#11 0x00007fb4987d0513 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

So the anon_sdn is corrupted, but it should not be looking at the code.  Going to try and reproduce using valgrind...

Comment 15 Adam Williamson 2018-08-14 20:01:42 UTC
Thanks for getting the core dump, Viktor!

Comment 16 Viktor Ashirov 2018-08-15 12:36:33 UTC
Update:
I've tested with ASan build:
* with 389-ds-base-1.4.0.13-1.fc29.x86_64 install succeeds without a crash.
* with 389-ds-base-1.4.0.14-2.fc29.x86_64 install fails with the stack trace: https://pagure.io/389-ds-base/issue/49029#comment-526297

With halt_on_error=false ASan build doesn't continue and exits anyway. I also tried to attach with gdb (with sleep_before_dying=1000), but gdb coredumps :|

Anyway, looks like a newly added internal operations logging code caused the issue.

Comment 17 mreynolds 2018-08-16 19:08:48 UTC
Fixed upstream:

https://pagure.io/389-ds-base/issue/49029

New fedora build complete.

Comment 18 Geoffrey Marr 2018-08-20 19:45:53 UTC
Discussed during the 2018-08-20 blocker review meeting: [1]

The decision to classify this bug as an "AcceptedBlocker" was made as it violates the following criteria:

"It must be possible to configure a Fedora Server system installed according to the above criteria as a FreeIPA domain controller, using the official deployment tools provided in the distribution FreeIPA packages. Once deployed, the system must handle multiple client enrolments and unenrolments, and client authentication via Kerberos..." 

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2018-08-20/f29-blocker-review.2018-08-20-16.00.txt

Comment 19 Adam Williamson 2018-09-07 23:54:07 UTC
FreeIPA tests have been passing quite reliably of late, so this was clearly fixed.