Hello, krb5 (libgssapi) dlopen()s a file which dlopen()s krb5. This has in the past caused problems [0], and so in [1] we started dlopen()ing with RTLD_NODELETE. However, in rawhide the problem is back. Here's a trace:

(gdb) bt
#0  _dl_close (_map=0xafa160) at dl-close.c:811
#1  0x00007f7b0ca85164 in _dl_catch_error (objname=0xafa0c0, errstring=0xafa0c8, mallocedp=0xafa0b8, operate=0x7f7b0ba07090 <dlclose_doit>, args=0xafa160) at dl-error.c:187
#2  0x00007f7b0ba07631 in _dlerror_run (operate=operate@entry=0x7f7b0ba07090 <dlclose_doit>, args=0xafa160) at dlerror.c:163
#3  0x00007f7b0ba070bf in __dlclose (handle=<optimized out>) at dlclose.c:46
#4  0x00007f7b0b0e17c5 in krb5int_close_plugin (h=0xb020e0) at plugins.c:416
#5  0x00007f7b0c4275aa in ?? ()
#6  0x000000000cc8c040 in ?? ()
#7  0x335b2b426230eb00 in ?? ()
#8  0x0000000000afa0b0 in ?? ()
#9  0x0000000000000000 in ?? ()
(gdb) p map->l_init_called
$6 = 0
(gdb)

To reiterate, this is during the exit phase of the program, after main has returned, and the program otherwise ran correctly.

[0]: http://krbdev.mit.edu/rt/Ticket/Display.html?id=7135
[1]: http://krbdev.mit.edu/rt/Ticket/Display.html?id=7947
This looks like dlclose racing against exit. It is undefined behaviour to call anything after exit has been started (for many reasons, including the fact that destructors are starting to run for all not-yet-closed objects). Only exit will call _dl_fini, which sets map->l_init_called to zero to indicate that shutdown processing has handled that particular finalization for the object. Therefore this looks like a user bug. You're going to have to track down whether there is a still-active thread calling dlclose while the rest of the program has already called exit to shut down. For the sake of robustness you should ensure all threads are stopped and joined before you call exit. Given my description, is it possible you have a dlclose and an exit happening near each other in the way I've described?
Sorry, I mentioned "undefined behaviour" and that's not quite true. In the broadest sense you get data races, which make your program undefined per the C standard, but that need not be the case. You could in theory have a thread that survives the exit process and is terminated only by the final kill signal. An atexit-registered function might even load a value from that thread using an atomic read of some shared global state. Either way, I just wanted to correct the slight language issue. The real issue remains: dlclose and exit without any happens-before relationship between them is going to cause problems. However, because I know nothing about krb5 and libgssapi, I have no idea which threads you might have active. I'm relying on your help there.
Upstream discussion here: https://www.sourceware.org/ml/libc-alpha/2015-09/msg00536.html

I had my suspicions about glibc's behaviour here, and upstream agrees with me: while this is dangerous to do, it is standards-mandated behaviour via POSIX that dlclose and exit shall both be thread-safe. Therefore any data races are flaws in glibc.

Here is a build with fixes for this issue: http://koji.fedoraproject.org/koji/taskinfo?taskID=11172534
The next step here is to get access again to a machine that reproduces the issue and work out the concurrency fixes required to resolve the race between exit and dlclose.
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle. Changing version to '24'. More information and reason for this action is here: https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase
I've hit this in ipa-client-install and ssh:

[root@omiday ~]# ipa-client-install
Using existing certificate '/etc/ipa/ca.crt'.
Discovery was successful!
Client hostname: omiday.can.local
Realm: CAN.LOCAL
DNS Domain: can.local
IPA Server: stuff.can.local
BaseDN: dc=can,dc=local

Continue to configure the system with these values? [no]: yes
Removed old keys for realm CAN.LOCAL from /etc/krb5.keytab
Synchronizing time with KDC...
Attempting to sync time using ntpd.  Will timeout after 15 seconds
Attempting to sync time using ntpd.  Will timeout after 15 seconds
Unable to sync time with NTP server, assuming the time is in sync. Please check that 123 UDP port is opened.
User authorized to enroll computers: admin
Password for admin:
Joining realm failed: Host is already joined.
Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!
Installation failed. Rolling back changes.
IPA client is not configured on this system.

omiday ~ $ ssh -S none stuff.can.local
Last login: Sat Aug 20 22:03:41 2016 from 192.168.0.11
[root@stuff ~]# logout
Connection to 192.168.0.9 closed.
Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!

This is on a system upgraded earlier this month from F24 to F25-alpha:

[root@omiday ~]# dnf history list 100
ID  | Command line             | Date and time    | Action(s)     | Altered
-------------------------------------------------------------------------------
100 | --releasever=25 system-u | 2016-08-11 13:56 | D, E, I, O, U | 1492 EE

Running glibc-2.24-3.fc25.x86_64.
While investigating hit-and-miss Ansible playbook connection errors I discovered that disabling 'gssapi-with-mic' works around the ssh issue:

omiday ~ $ ssh -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey stuff.can.local ; echo $?
Last login: Wed Oct  5 22:53:08 2016 from 192.168.0.11
[root@stuff ~]# logout
Connection to 192.168.0.9 closed.
Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!
127

omiday ~ $ ssh -o PreferredAuthentications=gssapi-keyex,hostbased,publickey stuff.can.local ; echo $?
Last login: Wed Oct  5 22:55:16 2016 from 192.168.0.11
[root@stuff ~]# logout
Connection to 192.168.0.9 closed.
0

So I don't know if this helps in any way with the request in comment 4, but if there is anything else I can help with let me know.
*** Bug 1189856 has been marked as a duplicate of this bug. ***
To comment #4: the ssh server git.gnome.org exhibits the issue. I get an error each time I clone a repository. (I've opened the related bug #
(In reply to Baptiste Mille-Mathias from comment #12)
> to comment #4 the ssh server git.gnome.org exhibits the issue. I have an
> error each time I clone a repository. (I've open the related bug #

I've worked around it by setting PreferredAuthentications=publickey in ssh_config so scripts relying on the ssh exit code still work. It's a client-side issue.
I'm seeing this after upgrading to F25, but only on one x86_64 system, which is a VM running on ESXi. The other two x86_64 systems, which are bare metal, do not seem to show this problem. It appears for both ssh and rpc.gssd, if it matters.
Does anyone know how to fix this for good in glibc? Or krb5? This is 100% repeatable on my F25 x86_64 VM that uses Kerberos against AD. PS. Have another F25 x86_64 machine (bare metal) that uses Kerberos against MIT krb5 server and I haven't seen it there. In fact, rpc-gssd.service fires up on it just fine.
(In reply to Bojan Smojver from comment #15)
> Does anyone know how to fix this for good in glibc? Or krb5?

This is a glibc bug, not a krb5 bug. It is a race condition between dlclose() and exit() that trips an assert in glibc.

A possible workaround might be to LD_PRELOAD something that turns dlclose() into a no-op, but I can't speak to what the consequences of that would be. Alternately, a build with the assert patched out may work; I do not know.
Hmm, wouldn't patching out the assert just mask the real issue? Anyhow, if this is a race, is there a way to fix the race, or is the nature of the problem such that this is simply unfixable?
(In reply to Bojan Smojver from comment #17)
> Hmm, wouldn't patching out assert just mask the real issue?
>
> Anyhow, if this is a race, is there a way to fix the race or is the nature
> of the problem such that this is simply unfixable?

It is fixable; we (the Fedora glibc team) have just prioritized other issues over this one. Thank you for commenting on this issue, though, since it helps us keep the priority current.
(In reply to Carlos O'Donell from comment #18)
> It is fixable, we (Fedora glibc team) have just prioritized other issues
> over this one. Thank you for commenting on this issue though since it helps
> us keep the priority current.

No worries. I can keep pestering you with questions, then. :-)

PS. For people who are hitting this problem: the system runs in a degraded state, because rpc-gssd.service won't start.
I am hitting this one after upgrading from f24 to f25 yesterday on a t450 laptop.
Have we actually confirmed this is a race involving multiple threads?
(In reply to Florian Weimer from comment #21)
> Have we actually confirmed this is a race involving multiple threads?

How do you do that?
(In reply to Viorel Tabara from comment #22)
> (In reply to Florian Weimer from comment #21)
> > Have we actually confirmed this is a race involving multiple threads?
>
> How do you do that?

Run “info threads” if you can reproduce the assertion failure under GDB, or when examining the core file. If there is just one thread, the program may still have been multi-threaded at the time the bug was triggered, so it is only a reliable indicator when multiple threads are present. If you can trigger the bug while running under GDB, then you can watch out for the LWP messages.

(The question was more directed towards Carlos.)
Downgrading to glibc-2.23.1-11.fc24 makes the assert stop. OpenSSH does not have multiple threads running when this occurs. It seems to only affect OpenSSH. Other apps that dlopen() gssapi krb5 libs, such as the OpenSSL and cURL clients, do not hit this assert.
(In reply to Michael Cronenworth from comment #24)
> It seems to only affect OpenSSH.

Try starting rpc-gssd.service. I have a machine where it works and one where it won't start because of the assert.
I think we have two different bugs here. The first one is the multi-threading race between dlclose and process exit. The other bug is some other form of link map corruption, probably along the lines of bug 1395758.

Can those who encounter the other bug (with ssh and rpc-gssd.service) please run these processes with LD_DEBUG=all and attach the output? Thanks.
Created attachment 1223171 [details] rpc-gssd output with LD_DEBUG=all As requested by Florian Weimer.
Created attachment 1223172 [details] Output of ssh session with LD_DEBUG=all This is the output of ssh with LD_DEBUG=all.
I think the problem with ssh and krb5 is that _exit calls _dl_fini for /usr/lib64/gssntlmssp/gssntlmssp.so first, and then it calls _dl_fini for /lib64/libgssapi_krb5.so.2, which does freeMechList -> releaseMechInfo, which calls krb5int_close_plugin for the already-closed gssntlmssp.so.
Created attachment 1223334 [details]
ssh with LD_DEBUG=all

(In reply to Florian Weimer from comment #26)
> Can those who encounter the other bug (with ssh and rpc-ssd.service) please
> run these processes with LD_DEBUG=all and attach the output? Thanks.

There you go. Thanks!
The gssntlmssp.so plugin is opened at gss_init_sec_context -> gssint_get_mechanism, but since mech->gss_init_sec_context is NULL the function returns without closing the plugin.
The ssh client assert has been filed as bug 1398370.
I see rpc.gssd failing to start from boot in F25. Do you still need debug output on this?

systemctl status rpc-gssd.service
● rpc-gssd.service - RPC security service for NFS client and server
   Loaded: loaded (/usr/lib/systemd/system/rpc-gssd.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2016-11-25 14:15:34 GMT; 7s ago
  Process: 5528 ExecStart=/usr/sbin/rpc.gssd $RPCSVCGSSDARGS (code=exited, status=127)

Nov 25 14:15:34 shader systemd[1]: Starting RPC security service for NFS client and server...
Nov 25 14:15:34 shader rpc.gssd[5528]: Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!
Nov 25 14:15:34 shader systemd[1]: rpc-gssd.service: Control process exited, code=exited status=127
Nov 25 14:15:34 shader systemd[1]: Failed to start RPC security service for NFS client and server.
Nov 25 14:15:34 shader systemd[1]: rpc-gssd.service: Unit entered failed state.
Nov 25 14:15:34 shader systemd[1]: rpc-gssd.service: Failed with result 'exit-code'.
Also bumped into the same issue after upgrading to F25:

rpc.gssd[1179]: Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called'
I'm running into this in a very persistent way using git-over-ssh.
*** Bug 1398370 has been marked as a duplicate of this bug. ***
This seems to be fixed with the updates described in bug 1398370.
(In reply to Anthony Messina from comment #37)
> This seems to be fixed with the updates described in #1398370

No, bug 1398370 and bug 1264556 are different bugs, and bug 1264556 is not yet fixed.
(In reply to Florian Weimer from comment #38)
> No, bug 1398370 and bug 1264556 are different bugs, and bug 1264556 is not
> yet fixed.

Interesting. That gss service now starts fine for me, and openssh has stopped showing the errors as well.
This message is a reminder that Fedora 25 is nearing its end of life. Approximately four weeks from now Fedora will stop maintaining and issuing updates for Fedora 25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '25'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 25 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Reproducer is gone, but the issue remains, and may come back.
I believe we have fixed the most egregious problems just recently upstream. Therefore I think this is fixed for Rawhide. I'm going to mark this CLOSED/RAWHIDE, and it can be reopened if someone shows a testcase that fails in Rawhide.

commit e5e4d7cc056ffae51fc55b66d9dd0abd99927486
Author: Paul Pluzhnikov <ppluzhnikov>
Date:   Thu Sep 21 12:14:41 2017 -0700

    Fix BZ# 22180.

    POSIX requires that dlclose() and exit() be thread safe, therefore you can
    have one thread in the middle of dlclose() and another thread executing
    exit() without causing any undefined behaviour on the part of the
    implementation. The existing implementation had a flaw: exit() exit handler
    processing did not consider a concurrent dlclose() and would not mark
    already-run exit handlers using the ef_free flavour. The consequence of this
    is that an exit() concurrent with dlclose() will run all the exit handlers
    that dlclose() had not yet run, but then will block on the loader lock. The
    concurrent dlclose() will continue and run all those exit handlers again
    (twice), in violation of the Itanium C++ ABI requirements for
    __cxa_atexit(). This commit fixes this by having exit() mark all handlers
    with ef_free, ensuring that a concurrent dlclose() won't re-run registered
    exit handlers that have already run.
Interestingly, I have just observed the "Inconsistency detected by ld.so" message when running cvlc (vlc-core-3.0.5-6.fc30.x86_64, built on buildvm-06.virt.rpmfusion.net, so the source is clear, yet still reproducible if need be). Sadly, I was resizing the terminal and lost the message, but ISTR there was indeed a "map" variable mentioned.

(Cannot reproduce it anymore now; it happened when cvlc was run for the first time after boot, which was several hours earlier.)

$ rpm -q glibc
> glibc-2.29.9000-9.fc31.x86_64
$ file /usr/bin/cvlc
> /usr/bin/cvlc: a /usr/bin/sh script, ASCII text executable
$ cat /usr/bin/cvlc
> #!/usr/bin/sh
> exec /usr/bin/vlc -I "dummy" "$@"
$ ldd /usr/bin/vlc
> linux-vdso.so.1 (0x00007ffde278a000)
> libvlc.so.5 => /lib64/libvlc.so.5 (0x00007fa0cd846000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa0cd825000)
> libdl.so.2 => /lib64/libdl.so.2 (0x00007fa0cd81e000)
> libc.so.6 => /lib64/libc.so.6 (0x00007fa0cd659000)
> libvlccore.so.9 => /lib64/libvlccore.so.9 (0x00007fa0cd566000)
> libm.so.6 => /lib64/libm.so.6 (0x00007fa0cd420000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa0cd404000)
> /lib64/ld-linux-x86-64.so.2 (0x00007fa0cd8a8000)
> libidn.so.12 => /lib64/libidn.so.12 (0x00007fa0cd3cf000)
> libdbus-1.so.3 => /lib64/libdbus-1.so.3 (0x00007fa0cd379000)
> libsystemd.so.0 => /lib64/libsystemd.so.0 (0x00007fa0cd2c9000)
> librt.so.1 => /lib64/librt.so.1 (0x00007fa0cd2be000)
> liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fa0cd293000)
> liblz4.so.1 => /lib64/liblz4.so.1 (0x00007fa0cd274000)
> libgcrypt.so.20 => /lib64/libgcrypt.so.20 (0x00007fa0cd156000)
> libgpg-error.so.0 => /lib64/libgpg-error.so.0 (0x00007fa0cd133000
$ ldd /lib64/libvlc.so.5
> linux-vdso.so.1 (0x00007ffe8d16d000)
> libvlccore.so.9 => /lib64/libvlccore.so.9 (0x00007f9b58364000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9b58343000)
> libm.so.6 => /lib64/libm.so.6 (0x00007f9b581fd000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f9b581e3000)
> libc.so.6 => /lib64/libc.so.6 (0x00007f9b5801e000)
> libidn.so.12 => /lib64/libidn.so.12 (0x00007f9b57fe9000)
> libdl.so.2 => /lib64/libdl.so.2 (0x00007f9b57fe0000)
> libdbus-1.so.3 => /lib64/libdbus-1.so.3 (0x00007f9b57f8a000)
> /lib64/ld-linux-x86-64.so.2 (0x00007f9b584b4000)
> libsystemd.so.0 => /lib64/libsystemd.so.0 (0x00007f9b57eda000)
> librt.so.1 => /lib64/librt.so.1 (0x00007f9b57ecf000)
> liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f9b57ea6000)
> liblz4.so.1 => /lib64/liblz4.so.1 (0x00007f9b57e87000)
> libgcrypt.so.20 => /lib64/libgcrypt.so.20 (0x00007f9b57d67000)
> libgpg-error.so.0 => /lib64/libgpg-error.so.0 (0x00007f9b57d44000)
$ ldd /lib64/libvlccore.so.9
> linux-vdso.so.1 (0x00007ffe07fd9000)
> libidn.so.12 => /lib64/libidn.so.12 (0x00007f2d05346000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f2d05325000)
> libdl.so.2 => /lib64/libdl.so.2 (0x00007f2d0531e000)
> libm.so.6 => /lib64/libm.so.6 (0x00007f2d051d8000)
> libdbus-1.so.3 => /lib64/libdbus-1.so.3 (0x00007f2d05182000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f2d05168000)
> libc.so.6 => /lib64/libc.so.6 (0x00007f2d04fa1000)
> /lib64/ld-linux-x86-64.so.2 (0x00007f2d054a7000)
> libsystemd.so.0 => /lib64/libsystemd.so.0 (0x00007f2d04ef1000)
> librt.so.1 => /lib64/librt.so.1 (0x00007f2d04ee6000)
> liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f2d04ebd000)
> liblz4.so.1 => /lib64/liblz4.so.1 (0x00007f2d04e9e000)
> libgcrypt.so.20 => /lib64/libgcrypt.so.20 (0x00007f2d04d80000)
> libgpg-error.so.0 => /lib64/libgpg-error.so.0 (0x00007f2d04d5b000)
(In reply to Jan Pokorný [poki] from comment #43)
> Interestingly, have just observed "Inconsistency detected by ld.so"
> message when running cvlc (vlc-core-3.0.5-6.fc30.x86_64, built on
> buildvm-06.virt.rpmfusion.net so the source is clear, yet still
> reproducible if need be). Sadly, I've been resizing the terminal
> and got the message lost, but ISTR there was indeed "map" variable
> mentioned.
>
> Cannot reproduce that anymore now, it hapenned when cvlc was run for
> the first time after boot, which happened several hours earlier).

We need a core from the abort. If you can get one we can try to fix it :-)
Yep, I would like to, but there's no trace of the coredump or even the crash in sight. My only guess is that either the glibc source of that message does not actually abort loudly (the program did terminate, though), or vlc catches SIGABRT and turns it into a graceful stop. Either of those will make any troubleshooting tough for sure. My main intention is to keep track of this or a similar problem recurring, so it can be linked to any future observations, hopefully with more info available.
(In reply to Jan Pokorný [poki] from comment #45)
> Yep, would like to, but there's no trace of the coredump or even the
> crash at sight. The only idea is then that either the glibc source
> of that message is not actually aborting loudly (the program has
> terminated, though), or that vlc catches SIGABRT and turns that into
> a graceful stop. Either of that will make any troubleshooting tough
> for sure. Main intention is to keep trace of this or similar problem
> recidivating to be possibly linked to any future observations,
> hopefully with more info available.

Agreed, it helps to get reports of seeing this in the wild.