Red Hat Bugzilla – Full Text Bug Listing
|Summary:||Make dlclose and exit concurrent calls work correctly per POSIX.|
|Product:||[Fedora] Fedora||Reporter:||Robbie Harwood <rharwood>|
|Component:||glibc||Assignee:||Carlos O'Donell <codonell>|
|Status:||CLOSED RAWHIDE||QA Contact:||Fedora Extras Quality Assurance <extras-qa>|
|Version:||rawhide||CC:||apfaffeneder, arjun.is, baptiste.millemathias, bojan, codonell, Colin.Simpson, erinn.looneytriggs, fherrera, fweimer, gedwards, gregory.lee.bartholomew, ioan, jakub, law, lorenzo.dalrio, mats, mfabian, mike, more.andres, nicolas.rolans, pfrankli, rharwood, ricardo.cordeiro, ryan.r.ware, samuel-rhbugs, shawn.starr, sid, steved, swehack, viorel.tabara, wpoely86, zing|
|Fixed In Version:||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2017-11-17 18:59:16 EST||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Bug Depends On:|
|Bug Blocks:||1189856, 1274922|
Description Robbie Harwood 2015-09-18 15:31:28 EDT
Hello, krb5 (libgssapi) dlopen()s a file which dlopen()s krb5. This has in the past caused problems and so in  we started dlopen()ing with RTLD_NODELETE. However, in rawhide the problem is back. Here's a trace: (gdb) bt #0 _dl_close (_map=0xafa160) at dl-close.c:811 #1 0x00007f7b0ca85164 in _dl_catch_error (objname=0xafa0c0, errstring=0xafa0c8, mallocedp=0xafa0b8, operate=0x7f7b0ba07090 <dlclose_doit>, args=0xafa160) at dl-error.c:187 #2 0x00007f7b0ba07631 in _dlerror_run (operate=operate@entry=0x7f7b0ba07090 <dlclose_doit>, args=0xafa160) at dlerror.c:163 #3 0x00007f7b0ba070bf in __dlclose (handle=<optimized out>) at dlclose.c:46 #4 0x00007f7b0b0e17c5 in krb5int_close_plugin (h=0xb020e0) at plugins.c:416 #5 0x00007f7b0c4275aa in ?? () #6 0x000000000cc8c040 in ?? () #7 0x335b2b426230eb00 in ?? () #8 0x0000000000afa0b0 in ?? () #9 0x0000000000000000 in ?? () (gdb) p map->l_init_called $6 = 0 (gdb) To reiterate, this is during the exit phase of the program, after main has returned, and the program otherwise ran correctly. : http://krbdev.mit.edu/rt/Ticket/Display.html?id=7135 : http://krbdev.mit.edu/rt/Ticket/Display.html?id=7947
Comment 1 Carlos O'Donell 2015-09-18 22:40:10 EDT
This looks like dlclose racing against exit. It is undefined behaviour to call anything after exit has been started (for many reasons including the fact that destructors are starting to run for all not-yet closed objects). Only exit will call _dl_fini which will set map->l_init_called to zero, to indicate that the shtudown processing has handled that particular finalization for the object. Therefore this looks like a user bug. You're going to have to track down if there is a thread running that's still active and calling dlclose, while the rest of the program has already called exit to shutdown. For the sake of robustness you should ensure all threads are stopped and joined before you call exit. Given my description, is it possible you have a dlclose and an exit happening near eachother in the way I've described?
Comment 2 Carlos O'Donell 2015-09-18 22:54:30 EDT
Sorry, I mentioned "undefined behaviour" and that's not quite true. In the broadest sense you get data races which make your program undefined per the C standard, but that need not be the case. You could in theory have a thread that survives the exit process and is terminated only by the final kill signal. An atexit registered function might even load a value from that thread using an atomic read of some shared global state. Either way, I just wanted to correct the slight language issue. There is still the real issue that dlclose and exit without any happens-before relationships is going to cause problems. However, because I know nothing about krb5 and libgssapi I have no idea about the threads you might have active. I'm relying on your help there.
Comment 3 Carlos O'Donell 2015-09-22 07:43:02 EDT
Upstream discussion here: https://www.sourceware.org/ml/libc-alpha/2015-09/msg00536.html I had my suspicious about glibc's behaviour here, and upstream agrees with me, that while this is dangerous to do in glibc, it is standards mandated behaviour via POSIX that dlclose and exit shall both be thread safe. Therefore any data races are flaws in glibc Here is a build with fixes for this issue: http://koji.fedoraproject.org/koji/taskinfo?taskID=11172534
Comment 4 Carlos O'Donell 2015-10-30 10:40:24 EDT
The next step here is to get access again to a machine that reproduce the issue and work out the concurrency fixes required to fix exit and dlclose racing.
Comment 8 Jan Kurik 2016-02-24 08:46:28 EST
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle. Changing version to '24'. More information and reason for this action is here: https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase
Comment 9 Viorel Tabara 2016-08-21 00:39:13 EDT
I've hit this in ipa-client-install and ssh: [root@omiday ~]# ipa-client-install Using existing certificate '/etc/ipa/ca.crt'. Discovery was successful! Client hostname: omiday.can.local Realm: CAN.LOCAL DNS Domain: can.local IPA Server: stuff.can.local BaseDN: dc=can,dc=local Continue to configure the system with these values? [no]: yes Removed old keys for realm CAN.LOCAL from /etc/krb5.keytab Synchronizing time with KDC... Attempting to sync time using ntpd. Will timeout after 15 seconds Attempting to sync time using ntpd. Will timeout after 15 seconds Unable to sync time with NTP server, assuming the time is in sync. Please check that 123 UDP port is opened. User authorized to enroll computers: admin Password for admin@CAN.LOCAL: Joining realm failed: Host is already joined. Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed! Installation failed. Rolling back changes. IPA client is not configured on this system. omiday ~ $ ssh -S none stuff.can.local Last login: Sat Aug 20 22:03:41 2016 from 192.168.0.11 [root@stuff ~]# logout Connection to 192.168.0.9 closed. Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed! This is on a system upgraded earlier this month from F24 to F25-alpha: [root@omiday ~]# dnf history list 100 ID | Command line | Date and time | Action(s) | Altered ------------------------------------------------------------------------------- 100 | --releasever=25 system-u | 2016-08-11 13:56 | D, E, I, O, U | 1492 EE Running glibc-2.24-3.fc25.x86_64.
Comment 10 Viorel Tabara 2016-10-06 01:23:59 EDT
While investigating hit-and-miss Ansible playbooks connection errors I discovered that disabling 'gssapi-with-mic' works around the ssh issue: omiday ~ $ ssh -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey stuff.can.local ; echo $? Last login: Wed Oct 5 22:53:08 2016 from 192.168.0.11 [root@stuff ~]# logout Connection to 192.168.0.9 closed. Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed! 127 omiday ~ $ omiday ~ $ omiday ~ $ omiday ~ $ ssh -o PreferredAuthentications=gssapi-keyex,hostbased,publickey stuff.can.local ; echo $? Last login: Wed Oct 5 22:55:16 2016 from 192.168.0.11 [root@stuff ~]# logout Connection to 192.168.0.9 closed. 0 So I don't know if this helps in any way with the request in comment 4 but if there is anything else I can help with let me know.
Comment 11 Robbie Harwood 2016-10-12 14:07:42 EDT
*** Bug 1189856 has been marked as a duplicate of this bug. ***
Comment 12 Baptiste Mille-Mathias 2016-10-31 13:40:16 EDT
to comment #4 the ssh server git.gnome.org exhibits the issue. I have an error each time I clone a repository. (I've open the related bug #
Comment 13 Viorel Tabara 2016-10-31 14:13:48 EDT
(In reply to Baptiste Mille-Mathias from comment #12) > to comment #4 the ssh server git.gnome.org exhibits the issue. I have an > error each time I clone a repository. (I've open the related bug # I've worked around it by setting PreferredAuthentications=publickey in ssh_config so scripts relying on ssh exit code still work. It's a client side issue.
Comment 14 Bojan Smojver 2016-11-12 01:47:16 EST
I'm seeing this after upgrade to F25. But only on one x86_64 system, which is a VM running on ESXi. The other two x86_64 systems, which are bare metal, do not seem to be showing this problem. It appears for both ssh and rcp.gssd, if it matters.
Comment 15 Bojan Smojver 2016-11-15 22:16:06 EST
Does anyone know how to fix this for good in glibc? Or krb5? This is 100% repeatable on my F25 x86_64 VM that uses Kerberos against AD. PS. Have another F25 x86_64 machine (bare metal) that uses Kerberos against MIT krb5 server and I haven't seen it there. In fact, rpc-gssd.service fires up on it just fine.
Comment 16 Robbie Harwood 2016-11-15 23:15:51 EST
(In reply to Bojan Smojver from comment #15) > Does anyone know how to fix this for good in glibc? Or krb5? This is a glibc bug, not a krb5 bug. It is a race condition between dlclose() and exit() that trips an assert in glibc. A possible workaround might be to LD_PRELOAD something to turn dlclose() into a no-op, but I can't speak to what the consequences of this are. Alternately, a build with the assert patched out may work; I do not know.
Comment 17 Bojan Smojver 2016-11-16 00:37:55 EST
Hmm, wouldn't patching out assert just mask the real issue? Anyhow, if this is a race, is there a way to fix the race or is the nature of the problem such that this is simply unfixable?
Comment 18 Carlos O'Donell 2016-11-16 08:04:18 EST
(In reply to Bojan Smojver from comment #17) > Hmm, wouldn't patching out assert just mask the real issue? > > Anyhow, if this is a race, is there a way to fix the race or is the nature > of the problem such that this is simply unfixable? It is fixable, we (Fedora glibc team) have just prioritized other issues over this one. Thank you for commenting on this issue though since it helps us keep the priority current.
Comment 19 Bojan Smojver 2016-11-16 16:58:58 EST
(In reply to Carlos O'Donell from comment #18) > It is fixable, we (Fedora glibc team) have just prioritized other issues > over this one. Thank you for commenting on this issue though since it helps > us keep the priority current. No worries. I can pester you with questions all you want. :-) PS. For people that are hitting this problem, the system runs in a degraded state, because rpc-gssd.service won't start.
Comment 20 Lorenzo Dalrio 2016-11-20 05:39:42 EST
I am hitting this one after upgrading from f24 to f25 yesterday on a t450 laptop.
Comment 21 Florian Weimer 2016-11-20 08:36:59 EST
Have we actually confirmed this is a race involving multiple threads?
Comment 22 Viorel Tabara 2016-11-20 14:59:52 EST
(In reply to Florian Weimer from comment #21) > Have we actually confirmed this is a race involving multiple threads? How do you do that?
Comment 23 Florian Weimer 2016-11-20 15:29:20 EST
(In reply to Viorel Tabara from comment #22) > (In reply to Florian Weimer from comment #21) > > Have we actually confirmed this is a race involving multiple threads? > > How do you do that? Run “info threads” if you can reproduce the assertion failure under GDB, or when examining the core file. If there is just one thread, the program may still have been multi-threaded at the time the bug was trigger, so it is only a reliable indicator when multiple threads are present. If you can trigger the bug while running under GDB, then you can watch out for the LWP messages. (The question was more directed towards Carlos.)
Comment 24 Michael Cronenworth 2016-11-23 00:45:28 EST
Downgrading to glibc-2.23.1-11.fc24 makes the assert stop. OpenSSH does not have multiple threads running when this occurs. It seems to only affect OpenSSH. Other apps that dlopen() gssapi krb5 libs, such as the OpenSSL and cURL clients, do not hit this assert.
Comment 25 Bojan Smojver 2016-11-23 01:06:21 EST
(In reply to Michael Cronenworth from comment #24) > It seems to only affect OpenSSH. Try starting rpc-gssd.service. I have a machine where it works and one where it won't start because of the assert.
Comment 26 Florian Weimer 2016-11-23 05:55:26 EST
I think we have two different bugs here. The first one is the multi-threading race between dlclose and process exit. The other bug is some other form of link map corruption, probably along the lines of bug 1395758. Can those who encounter the other bug (with ssh and rpc-ssd.service) please run these processes with LD_DEBUG=all and attach the output? Thanks.
Comment 27 nicolas.rolans 2016-11-23 06:08 EST
Created attachment 1223171 [details] rpc-gssd output with LD_DEBUG=all As requested by Florian Weimer.
Comment 28 Lorenzo Dalrio 2016-11-23 06:09 EST
Created attachment 1223172 [details] Output of ssh session with LD_DEBUG=all This is the output of ssh with LD_DEBUG=all.
Comment 29 Fernando Herrera 2016-11-23 10:57:30 EST
I think the problem with ssh and krb is that _exit calls _dl_fini for /usr/lib64/gssntlmssp/gssntlmssp.so first and then it calls _dl_fini for /lib64/libgssapi_krb5.so.2 which does freeMechList -> releaseMechInfo which calls krb5int_close_plugin for the already closed gssntlmssp.so
Comment 30 Viorel Tabara 2016-11-23 11:18 EST
Created attachment 1223334 [details] ssh with LD_DEBUG=all (In reply to Florian Weimer from comment #26) > > Can those who encounter the other bug (with ssh and rpc-ssd.service) please > run these processes with LD_DEBUG=all and attach the output? Thanks. There you go. Thanks!
Comment 31 Fernando Herrera 2016-11-23 11:46:38 EST
gssntlmssp.so plugin is opened at gss_init_sec_context -> gssint_get_mechanism but as mech->gss_init_sec_context is NULL the function returns without closing the plugin
Comment 32 Florian Weimer 2016-11-24 10:15:18 EST
The ssh client assert has been filed as bug 1398370.
Comment 33 Colin.Simpson 2016-11-25 09:17:58 EST
I see rpc.gssd failing to start from boot in the F25. Do you still need debug on this? systemctl status rpc-gssd.service ● rpc-gssd.service - RPC security service for NFS client and server Loaded: loaded (/usr/lib/systemd/system/rpc-gssd.service; static; vendor preset: disabled) Active: failed (Result: exit-code) since Fri 2016-11-25 14:15:34 GMT; 7s ago Process: 5528 ExecStart=/usr/sbin/rpc.gssd $RPCSVCGSSDARGS (code=exited, status=127) Nov 25 14:15:34 shader systemd: Starting RPC security service for NFS client and server... Nov 25 14:15:34 shader rpc.gssd: Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed! Nov 25 14:15:34 shader systemd: rpc-gssd.service: Control process exited, code=exited status=127 Nov 25 14:15:34 shader systemd: Failed to start RPC security service for NFS client and server. Nov 25 14:15:34 shader systemd: rpc-gssd.service: Unit entered failed state. Nov 25 14:15:34 shader systemd: rpc-gssd.service: Failed with result 'exit-code'.
Comment 34 more.andres 2016-12-05 20:19:54 EST
Also bumped into the same issue after upgrading into F25. rpc.gssd: Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called'
Comment 35 Mats Wichmann 2016-12-07 18:33:04 EST
I'm running into this in a very persistent way using git-over-ssh.
Comment 36 Robbie Harwood 2016-12-12 14:23:00 EST
*** Bug 1398370 has been marked as a duplicate of this bug. ***
Comment 37 Anthony Messina 2016-12-27 08:18:32 EST
This seems to be fixed with the updates described in #1398370
Comment 38 Florian Weimer 2016-12-27 08:26:04 EST
(In reply to Anthony Messina from comment #37) > This seems to be fixed with the updates described in #1398370 No, bug 1398370 and bug 1264556 are different bugs, and bug 1264556 is not yet fixed.
Comment 39 Bojan Smojver 2016-12-27 20:07:20 EST
(In reply to Florian Weimer from comment #38) > No, bug 1398370 and bug 1264556 are different bugs, and bug 1264556 is not > yet fixed. Interesting. That gss service now starts fine for me. And openssh stopped with the errors as well.
Comment 40 Fedora End Of Life 2017-11-16 14:45:40 EST
This message is a reminder that Fedora 25 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '25'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 25 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Comment 41 Robbie Harwood 2017-11-16 15:59:13 EST
Reproducer is gone, but the issue remains, and may come back.
Comment 42 Carlos O'Donell 2017-11-17 18:59:16 EST
I believe we have fixed the most egregious problems just recently in upstream. Therefore I think this is fixed for Rawhide. I'm going to mark this CLOSED/RAWHIDE, and it can be reopened if someone shows a testcase that fails in Rawhide. commit e5e4d7cc056ffae51fc55b66d9dd0abd99927486 Author: Paul Pluzhnikov <email@example.com> Date: Thu Sep 21 12:14:41 2017 -0700 Fix BZ# 22180. POSIX requires that dlclose() and exit() be thread safe, therefore you can have one thread in the middle of dlclose() and another thread executing exit() without causing any undefined behaviour on the part of the implementation. The existing implementation had a flaw that exit() exit handler processing did not consider a concurrent dlclose() and would not mark already run exit handlers using the ef_free flavour. The consequence of this is that a concurrent exit() with dlclose() will run all the exit handlers that dlclose() had not yet run, but then will block on the loader lock. The concurrent dlclose() will continue to run all the exit handlers again (twice) in violation of the Itanium C++ ABI requirements for __cxa_atexit(). This commit fixes this by having exit() mark all handlers with ef_free to ensure that concurrent dlclose() won't re-run registered exit handlers that have already run.