Bug 1615586
| Summary: | segfault in ns-slapd during FreeIPA server deployment (or post-deployment operation) on Rawhide | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> | ||||||
| Component: | 389-ds-base | Assignee: | mreynolds | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
| Severity: | urgent | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 29 | CC: | abokovoy, awilliam, gmarr, mhonek, mreynolds, normand, rcritten, rmeggins, robatino, ssorce, tbordaz, vashirov | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | x86_64 | ||||||||
| OS: | Linux | ||||||||
| Whiteboard: | AcceptedBlocker | ||||||||
| Fixed In Version: | 389-ds-base-1.4.0.15-1.fc29 | Doc Type: | If docs needed, set a value | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2018-09-07 23:54:07 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 1517011 | ||||||||
| Attachments: |
|
||||||||
|
Description
Adam Williamson
2018-08-13 18:54:09 UTC
*** Bug 1615452 has been marked as a duplicate of this bug. *** Correction: crash happened in at least 7 out of 8 cases, it also happened in the 'successful' https://openqa.fedoraproject.org/tests/265344 , probably while one of the client tests was trying to do something, likely causing client test failure. It's a bit difficult trying to unpick all the different test failures ATM, but I suspect a lot if not all are down to this. Which version of 389-ds-base was used? We just added changes to satisfy this bug around jemalloc (1.4.0.14-2): https://bugzilla.redhat.com/show_bug.cgi?id=1615302 Perhaps this is part of the problem? Also, is there a core file or stack trace that I can look at? https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180813.n.0/compose/Everything/x86_64/os/Packages/3/ This is actually 389-ds-base-1.4.0.14-1. So this is not related to https://bugzilla.redhat.com/show_bug.cgi?id=1615302 (which is good). But there are no known crashes in 1.4.0.14-1, so this is a new issue. Really need a reproducer or corefile/stacktrace to look at. Well, it seems it quite often is reproduced just by running a FreeIPA server deployment on current Rawhide. I will try to reproduce it in a local VM and get at the core dump somehow (it seems like it's not being placed in /var/lib/systemd/coredump or /var/spool/abrt , where openQA looks for them from systemd and abrt respectively). (In reply to Adam Williamson from comment #6) > Well, it seems it quite often is reproduced just by running a FreeIPA server > deployment on current Rawhide. Do you mean just installing FreeIPA or running some kind of test suite? If it's just installing FreeIPA then I can work on that. Just installing necessary packages and doing 'ipa-server-install' with appropriate options. The test that's failing the most runs it like this: ipa-server-install -U --realm=DOMAIN.LOCAL --domain=domain.local --ds-password=monkeys123 --admin-password=monkeys123 --setup-dns --reverse-zone=2.0.10.in-addr.arpa --allow-zone-overlap --forwarder=10.5.126.21 --forwarder=10.5.126.22 obviously adjust host / realm names and IP addresses as appropriate. This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle. Changing version to '29'. Created attachment 1475948 [details]
stacktrace.txt
Created attachment 1475949 [details]
core.ns-slapd.389.c2d5d91a0ca341219d1c21ee196232b1.19895.1534273439000000.lz4
Looks like its a double free in the connection code:
#8 0x000055e04c06d1fb in connection_threadmain () at ldap/servers/slapd/connection.c:1585
1585 slapi_sdn_free(&anon_sdn);
(gdb) p *anon_sdn
$3 = {flag = 46 '.', udn = 0x65 <error: Cannot access memory at address 0x65>,
dn = 0x7fb49264a04000 <error: Cannot access memory at address 0x7fb49264a04000>, ndn = 0x0, ndn_len = 10496}
This is possibly related to nunc-stans being disabled. Needs more investigation...
Here is the crashing thread
#0 0x00007fb498f3f51f in atomic_load_p (mo=atomic_memory_order_relaxed, a=0x125020)
at include/jemalloc/internal/atomic_gcc_atomic.h:18
#1 0x00007fb498f3f51f in rtree_leaf_elm_bits_read (dependent=true, elm=0x125020, rtree=<optimized out>, tsdn=<optimized out>)
at include/jemalloc/internal/rtree.h:175
#2 0x00007fb498f3f51f in rtree_szind_slab_read
(r_slab=<synthetic pointer>, r_szind=<synthetic pointer>, dependent=true, key=35945862888701952, rtree_ctx=0x7fb482172f28, rtree=<optimized out>, tsdn=<optimized out>) at include/jemalloc/internal/rtree.h:464
#3 0x00007fb498f3f51f in ifree (slow_path=false, tcache=0x7fb4821730c0, ptr=0x7fb49264a04000, tsd=<optimized out>)
at src/jemalloc.c:2206
#4 0x00007fb498f3f51f in free (ptr=0x7fb49264a04000) at src/jemalloc.c:2393
#5 0x00007fb498e36fb5 in slapi_ch_free () at /usr/lib64/dirsrv/libslapd.so.0
#6 0x00007fb498e40479 in slapi_sdn_done () at /usr/lib64/dirsrv/libslapd.so.0
#7 0x00007fb498e40afe in slapi_sdn_free () at /usr/lib64/dirsrv/libslapd.so.0
#8 0x000055e04c06d1fb in connection_threadmain () at ldap/servers/slapd/connection.c:1585
#9 0x00007fb498a00ac8 in _pt_root (arg=0x7fb493fb8740) at ../../.././nspr/pr/src/pthreads/ptthread.c:201
#10 0x00007fb49899e58e in start_thread (arg=<optimized out>) at pthread_create.c:486
#11 0x00007fb4987d0513 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
So the anon_sdn is corrupted, but it should not be looking at the code. Going to try and reproduce using valgrind...
Thanks for getting the core dump, Viktor! Update: I've tested with ASan build: * with 389-ds-base-1.4.0.13-1.fc29.x86_64 install succeeds without a crash. * with 389-ds-base-1.4.0.14-2.fc29.x86_64 install fails with the stack trace: https://pagure.io/389-ds-base/issue/49029#comment-526297 With halt_on_error=false ASan build doesn't continue and exits anyway. I also tried to attach with gdb (with sleep_before_dying=1000), but gdb coredumps :| Anyway, looks like a newly added internal operations logging code caused the issue. Fixed upstream: https://pagure.io/389-ds-base/issue/49029 New fedora build complete. Discussed during the 2018-08-20 blocker review meeting: [1] The decision to classify this bug as an "AcceptedBlocker" was made as it violates the following criteria: "It must be possible to configure a Fedora Server system installed according to the above criteria as a FreeIPA domain controller, using the official deployment tools provided in the distribution FreeIPA packages. Once deployed, the system must handle multiple client enrolments and unenrolments, and client authentication via Kerberos..." [1] https://meetbot.fedoraproject.org/fedora-blocker-review/2018-08-20/f29-blocker-review.2018-08-20-16.00.txt FreeIPA tests have been passing quite reliably of late, so this was clearly fixed. |