Most FreeIPA server deployment tests failed on both prod and staging openQA in the last two Rawhide composes, Fedora-Rawhide-20180812.n.0 and Fedora-Rawhide-20180813.n.0. In all cases, it seems like this was caused by a segfault in ns-slapd, like this (0813.n.0 prod): ====== Aug 13 09:48:27 ipa002.domain.local ns-slapd[32126]: [13/Aug/2018:12:48:27.184902543 -0400] - INFO - ldbm_back_ldbm2index - userRoot: Finished indexing. Aug 13 09:48:27 ipa002.domain.local audit[32126]: ANOM_ABEND auid=4294967295 uid=389 gid=389 ses=4294967295 subj=system_u:system_r:dirsrv_t:s0 pid=32126 comm="ns-slapd" exe="/usr/sbin/ns-slapd" sig=11 res=1 Aug 13 09:48:27 ipa002.domain.local kernel: ns-slapd[32156]: segfault at 1a2000 ip 00007f44055a651f sp 00007f43ebbf9360 error 4 in libjemalloc.so.2[7f44055a4000+6c000] Aug 13 09:48:27 ipa002.domain.local kernel: Code: 28 41 81 e4 f0 00 00 00 49 01 dc 4d 8b 4c 24 28 4d 39 ce 0f 85 da 01 00 00 48 89 e8 48 c1 e8 09 25 f8 ff 1f 00 49 03 44 24 30 <4c> 8b 00 4c 8d 15 97 c8 06 00 4d 89 c1 49 c1 e9 30 44 89 ca 49 8b Aug 13 09:48:27 ipa002.domain.local systemd[1]: dirsrv: Main process exited, code=killed, status=11/SEGV Aug 13 09:48:27 ipa002.domain.local systemd[1]: dirsrv: Failed with result 'signal'. Aug 13 09:48:27 ipa002.domain.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dirsrv@DOMAIN-LOCAL comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' ====== or this (0813.n.0 stg): Aug 13 09:19:46 ipa002.domain.local systemd[1]: Started The Apache HTTP Server. Aug 13 09:19:46 ipa002.domain.local audit[32244]: ANOM_ABEND auid=4294967295 uid=389 gid=389 ses=4294967295 subj=system_u:system_r:dirsrv_t:s0 pid=32244 comm="ns-slapd" exe="/usr/sbin/ns-slapd" sig=11 res=1 Aug 13 09:19:46 ipa002.domain.local kernel: ns-slapd[32257]: segfault at bd800 ip 00007efc81ad651f sp 00007efc6c5fb360 error 4 in libjemalloc.so.2[7efc81ad4000+6c000] Aug 13 09:19:46 ipa002.domain.local kernel: Code: 28 41 81 e4 f0 00 00 00 49 01 dc 4d 8b 4c 24 28 4d 39 ce 0f 85 da 01 00 00 48 89 e8 48 c1 e8 09 25 f8 ff 1f 00 49 03 44 24 30 <4c> 8b 00 4c 8d 15 97 c8 06 00 4d 89 c1 49 c1 e9 30 44 89 ca 49 8b Aug 13 09:19:46 ipa002.domain.local systemd[1]: dirsrv: Main process exited, code=killed, status=11/SEGV Aug 13 09:19:46 ipa002.domain.local systemd[1]: dirsrv: Failed with result 'signal'. Aug 13 09:19:46 ipa002.domain.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dirsrv@DOMAIN-LOCAL comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed' ====== or this (0812.n.0 stg): Aug 12 08:22:59 ipa002.domain.local [32363]: GSSAPI client step 1 Aug 12 08:22:59 ipa002.domain.local kernel: ns-slapd[32250]: segfault at 6b18 ip 00007f10fb8be51f sp 00007f10e79fd360 error 4 in libjemalloc.so.2[7f10fb8bc000+6c000] Aug 12 08:22:59 ipa002.domain.local kernel: Code: 28 41 81 e4 f0 00 00 00 49 01 dc 4d 8b 4c 24 28 4d 39 ce 0f 85 da 01 00 00 48 89 e8 48 c1 e8 09 25 f8 ff 1f 00 49 03 44 24 30 <4c> 8b 00 4c 8d 15 97 c8 06 00 4d 89 c1 49 c1 e9 30 44 89 ca 49 8b Aug 12 08:22:59 ipa002.domain.local audit[32239]: ANOM_ABEND auid=4294967295 uid=389 gid=389 ses=4294967295 subj=system_u:system_r:dirsrv_t:s0 pid=32239 comm="ns-slapd" exe="/usr/sbin/ns-slapd" sig=11 res=1 Aug 12 08:22:59 ipa002.domain.local systemd[1]: dirsrv: Main process exited, code=killed, status=11/SEGV Aug 12 08:22:59 ipa002.domain.local named-pkcs11[32658]: LDAP error: Can't contact LDAP server: ldap_sync_poll() failed Aug 12 08:22:59 ipa002.domain.local systemd[1]: dirsrv: Failed with result 'signal'. Aug 12 08:22:59 ipa002.domain.local named-pkcs11[32658]: ldap_syncrepl will reconnect in 60 seconds ====== The segfault doesn't seem to happen at the exact same time or point of the tests each run, and in a couple of cases didn't happen at all (the non-replica server tests for 0813.n.0). It happened in 6 out of 8 total FreeIPA server deployment tests run on those two composes across prod and staging openQA.
*** Bug 1615452 has been marked as a duplicate of this bug. ***
Correction: crash happened in at least 7 out of 8 cases, it also happened in the 'successful' https://openqa.fedoraproject.org/tests/265344 , probably while one of the client tests was trying to do something, likely causing client test failure. It's a bit difficult trying to unpick all the different test failures ATM, but I suspect a lot if not all are down to this.
Which version of 389-ds-base was used? We just added changes to satisfy this bug around jemalloc (1.4.0.14-2): https://bugzilla.redhat.com/show_bug.cgi?id=1615302 Perhaps this is part of the problem?
Also, is there a core file or stack trace that I can look at?
https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180813.n.0/compose/Everything/x86_64/os/Packages/3/ This is actually 389-ds-base-1.4.0.14-1. So this is not related to https://bugzilla.redhat.com/show_bug.cgi?id=1615302 (which is good). But there are no known crashes in 1.4.0.14-1, so this is a new issue. Really need a reproducer or corefile/stacktrace to look at.
Well, it seems it quite often is reproduced just by running a FreeIPA server deployment on current Rawhide. I will try to reproduce it in a local VM and get at the core dump somehow (it seems like it's not being placed in /var/lib/systemd/coredump or /var/spool/abrt , where openQA looks for them from systemd and abrt respectively).
(In reply to Adam Williamson from comment #6) > Well, it seems it quite often is reproduced just by running a FreeIPA server > deployment on current Rawhide. Do you mean just installing FreeIPA or running some kind of test suite? If it's just installing FreeIPA then I can work on that.
Just installing necessary packages and doing 'ipa-server-install' with appropriate options. The test that's failing the most runs it like this: ipa-server-install -U --realm=DOMAIN.LOCAL --domain=domain.local --ds-password=monkeys123 --admin-password=monkeys123 --setup-dns --reverse-zone=2.0.10.in-addr.arpa --allow-zone-overlap --forwarder=10.5.126.21 --forwarder=10.5.126.22 obviously adjust host / realm names and IP addresses as appropriate.
This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle. Changing version to '29'.
Created attachment 1475948 [details] stacktrace.txt
Created attachment 1475949 [details] core.ns-slapd.389.c2d5d91a0ca341219d1c21ee196232b1.19895.1534273439000000.lz4
Looks like its a double free in the connection code: #8 0x000055e04c06d1fb in connection_threadmain () at ldap/servers/slapd/connection.c:1585 1585 slapi_sdn_free(&anon_sdn); (gdb) p *anon_sdn $3 = {flag = 46 '.', udn = 0x65 <error: Cannot access memory at address 0x65>, dn = 0x7fb49264a04000 <error: Cannot access memory at address 0x7fb49264a04000>, ndn = 0x0, ndn_len = 10496} This is possibly related to nunc-stans being disabled. Needs more investigation...
Here is the crashing thread #0 0x00007fb498f3f51f in atomic_load_p (mo=atomic_memory_order_relaxed, a=0x125020) at include/jemalloc/internal/atomic_gcc_atomic.h:18 #1 0x00007fb498f3f51f in rtree_leaf_elm_bits_read (dependent=true, elm=0x125020, rtree=<optimized out>, tsdn=<optimized out>) at include/jemalloc/internal/rtree.h:175 #2 0x00007fb498f3f51f in rtree_szind_slab_read (r_slab=<synthetic pointer>, r_szind=<synthetic pointer>, dependent=true, key=35945862888701952, rtree_ctx=0x7fb482172f28, rtree=<optimized out>, tsdn=<optimized out>) at include/jemalloc/internal/rtree.h:464 #3 0x00007fb498f3f51f in ifree (slow_path=false, tcache=0x7fb4821730c0, ptr=0x7fb49264a04000, tsd=<optimized out>) at src/jemalloc.c:2206 #4 0x00007fb498f3f51f in free (ptr=0x7fb49264a04000) at src/jemalloc.c:2393 #5 0x00007fb498e36fb5 in slapi_ch_free () at /usr/lib64/dirsrv/libslapd.so.0 #6 0x00007fb498e40479 in slapi_sdn_done () at /usr/lib64/dirsrv/libslapd.so.0 #7 0x00007fb498e40afe in slapi_sdn_free () at /usr/lib64/dirsrv/libslapd.so.0 #8 0x000055e04c06d1fb in connection_threadmain () at ldap/servers/slapd/connection.c:1585 #9 0x00007fb498a00ac8 in _pt_root (arg=0x7fb493fb8740) at ../../.././nspr/pr/src/pthreads/ptthread.c:201 #10 0x00007fb49899e58e in start_thread (arg=<optimized out>) at pthread_create.c:486 #11 0x00007fb4987d0513 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 So the anon_sdn is corrupted, but it should not be looking at the code. Going to try and reproduce using valgrind...
Thanks for getting the core dump, Viktor!
Update: I've tested with ASan build: * with 389-ds-base-1.4.0.13-1.fc29.x86_64 install succeeds without a crash. * with 389-ds-base-1.4.0.14-2.fc29.x86_64 install fails with the stack trace: https://pagure.io/389-ds-base/issue/49029#comment-526297 With halt_on_error=false ASan build doesn't continue and exits anyway. I also tried to attach with gdb (with sleep_before_dying=1000), but gdb coredumps :| Anyway, looks like a newly added internal operations logging code caused the issue.
Fixed upstream: https://pagure.io/389-ds-base/issue/49029 New fedora build complete.
Discussed during the 2018-08-20 blocker review meeting: [1] The decision to classify this bug as an "AcceptedBlocker" was made as it violates the following criteria: "It must be possible to configure a Fedora Server system installed according to the above criteria as a FreeIPA domain controller, using the official deployment tools provided in the distribution FreeIPA packages. Once deployed, the system must handle multiple client enrolments and unenrolments, and client authentication via Kerberos..." [1] https://meetbot.fedoraproject.org/fedora-blocker-review/2018-08-20/f29-blocker-review.2018-08-20-16.00.txt
FreeIPA tests have been passing quite reliably of late, so this was clearly fixed.