Bug 1264556 - Make dlclose and exit concurrent calls work correctly per POSIX.
Make dlclose and exit concurrent calls work correctly per POSIX.
Status: ASSIGNED
Product: Fedora
Classification: Fedora
Component: glibc (Show other bugs)
25
Unspecified Linux
unspecified Severity unspecified
: ---
: ---
Assigned To: Carlos O'Donell
Fedora Extras Quality Assurance
:
: 1189856 (view as bug list)
Depends On:
Blocks: 1189856 1274922
  Show dependency treegraph
 
Reported: 2015-09-18 15:31 EDT by Robbie Harwood
Modified: 2017-01-14 13:24 EST (History)
32 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
rpc-gssd output with LD_DEBUG=all (1.46 MB, text/plain)
2016-11-23 06:08 EST, nicolas.rolans
no flags Details
Output of ssh session with LD_DEBUG=all (7.51 MB, text/plain)
2016-11-23 06:09 EST, Lorenzo Dalrio
no flags Details
ssh with LD_DEBUG=all (7.88 MB, text/plain)
2016-11-23 11:18 EST, Viorel Tabara
no flags Details

  None (edit)
Description Robbie Harwood 2015-09-18 15:31:28 EDT
Hello,

krb5 (libgssapi) dlopen()s a file which dlopen()s krb5.  This has in the past caused problems[0] and so in [1] we started dlopen()ing with RTLD_NODELETE.

However, in rawhide the problem is back.  Here's a trace:

(gdb) bt
#0  _dl_close (_map=0xafa160) at dl-close.c:811
#1  0x00007f7b0ca85164 in _dl_catch_error (objname=0xafa0c0, errstring=0xafa0c8, mallocedp=0xafa0b8, operate=0x7f7b0ba07090 <dlclose_doit>, args=0xafa160) at dl-error.c:187
#2  0x00007f7b0ba07631 in _dlerror_run (operate=operate@entry=0x7f7b0ba07090 <dlclose_doit>, args=0xafa160) at dlerror.c:163
#3  0x00007f7b0ba070bf in __dlclose (handle=<optimized out>) at dlclose.c:46
#4  0x00007f7b0b0e17c5 in krb5int_close_plugin (h=0xb020e0) at plugins.c:416
#5  0x00007f7b0c4275aa in ?? ()
#6  0x000000000cc8c040 in ?? ()
#7  0x335b2b426230eb00 in ?? ()
#8  0x0000000000afa0b0 in ?? ()
#9  0x0000000000000000 in ?? ()
(gdb) p map->l_init_called
$6 = 0
(gdb) 

To reiterate, this is during the exit phase of the program, after main has returned, and the program otherwise ran correctly.

[0]: http://krbdev.mit.edu/rt/Ticket/Display.html?id=7135
[1]: http://krbdev.mit.edu/rt/Ticket/Display.html?id=7947
Comment 1 Carlos O'Donell 2015-09-18 22:40:10 EDT
This looks like dlclose racing against exit. It is undefined behaviour to call anything after exit has been started (for many reasons including the fact that destructors are starting to run for all not-yet closed objects).

Only exit will call _dl_fini which will set map->l_init_called to zero, to indicate that the shtudown processing has handled that particular finalization for the object.

Therefore this looks like a user bug. You're going to have to track down if there is a thread running that's still active and calling dlclose, while the rest of the program has already called exit to shutdown. For the sake of robustness you should ensure all threads are stopped and joined before you call exit.

Given my description, is it possible you have a dlclose and an exit happening near eachother in the way I've described?
Comment 2 Carlos O'Donell 2015-09-18 22:54:30 EDT
Sorry, I mentioned "undefined behaviour" and that's not quite true. In the broadest sense you get data races which make your program undefined per the C standard, but that need not be the case. You could in theory have a thread that survives the exit process and is terminated only by the final kill signal. An atexit registered function might even load a value from that thread using an atomic read of some shared global state. Either way, I just wanted to correct the slight language issue.

There is still the real issue that dlclose and exit without any happens-before relationships is going to cause problems. However, because I know nothing about krb5 and libgssapi I have no idea about the threads you might have active. I'm relying on your help there.
Comment 3 Carlos O'Donell 2015-09-22 07:43:02 EDT
Upstream discussion here:
https://www.sourceware.org/ml/libc-alpha/2015-09/msg00536.html

I had my suspicious about glibc's behaviour here, and upstream agrees with me, that while this is dangerous to do in glibc, it is standards mandated behaviour via POSIX that dlclose and exit shall both be thread safe. Therefore any data races are flaws in glibc

Here is a build with fixes for this issue:
http://koji.fedoraproject.org/koji/taskinfo?taskID=11172534
Comment 4 Carlos O'Donell 2015-10-30 10:40:24 EDT
The next step here is to get access again to a machine that reproduce the issue and work out the concurrency fixes required to fix exit and dlclose racing.
Comment 8 Jan Kurik 2016-02-24 08:46:28 EST
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle.
Changing version to '24'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase
Comment 9 Viorel Tabara 2016-08-21 00:39:13 EDT
I've hit this in ipa-client-install and ssh:

   [root@omiday ~]# ipa-client-install 
   Using existing certificate '/etc/ipa/ca.crt'.
   Discovery was successful!
   Client hostname: omiday.can.local
   Realm: CAN.LOCAL
   DNS Domain: can.local
   IPA Server: stuff.can.local
   BaseDN: dc=can,dc=local

   Continue to configure the system with these values? [no]: yes
   Removed old keys for realm CAN.LOCAL from /etc/krb5.keytab
   Synchronizing time with KDC...
   Attempting to sync time using ntpd.  Will timeout after 15 seconds
   Attempting to sync time using ntpd.  Will timeout after 15 seconds
   Unable to sync time with NTP server, assuming the time is in sync. Please check that 123 UDP port is opened.
   User authorized to enroll computers: admin
   Password for admin@CAN.LOCAL: 
   Joining realm failed: Host is already joined.
   Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!

   Installation failed. Rolling back changes.
   IPA client is not configured on this system.


   omiday ~ $ ssh -S none stuff.can.local 
   Last login: Sat Aug 20 22:03:41 2016 from 192.168.0.11
   [root@stuff ~]# logout
   Connection to 192.168.0.9 closed.
   Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!
 
This is on a system upgraded earlier this month from F24 to F25-alpha:

   [root@omiday ~]# dnf history list 100
   ID     | Command line             | Date and time    | Action(s)      | Altered
   -------------------------------------------------------------------------------
      100 | --releasever=25 system-u | 2016-08-11 13:56 | D, E, I, O, U  | 1492 EE
 

Running glibc-2.24-3.fc25.x86_64.
Comment 10 Viorel Tabara 2016-10-06 01:23:59 EDT
While investigating hit-and-miss Ansible playbooks connection errors I 
discovered that disabling 'gssapi-with-mic' works around the ssh issue:

   omiday ~ $ ssh -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey stuff.can.local ; echo $?
   Last login: Wed Oct  5 22:53:08 2016 from 192.168.0.11
   [root@stuff ~]# logout
   Connection to 192.168.0.9 closed.
   Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!
   127
   omiday ~ $ 
   omiday ~ $ 
   omiday ~ $ 
   omiday ~ $ ssh -o PreferredAuthentications=gssapi-keyex,hostbased,publickey stuff.can.local ; echo $?
   Last login: Wed Oct  5 22:55:16 2016 from 192.168.0.11
   [root@stuff ~]# logout
   Connection to 192.168.0.9 closed.
   0

So I don't know if this helps in any way with the request in comment 4 but if 
there is anything else I can help with let me know.
Comment 11 Robbie Harwood 2016-10-12 14:07:42 EDT
*** Bug 1189856 has been marked as a duplicate of this bug. ***
Comment 12 Baptiste Mille-Mathias 2016-10-31 13:40:16 EDT
to comment #4 the ssh server git.gnome.org exhibits the issue. I have an error each time I clone a repository. (I've open the related bug #
Comment 13 Viorel Tabara 2016-10-31 14:13:48 EDT
(In reply to Baptiste Mille-Mathias from comment #12)
> to comment #4 the ssh server git.gnome.org exhibits the issue. I have an
> error each time I clone a repository. (I've open the related bug #

I've worked around it by setting PreferredAuthentications=publickey in ssh_config so scripts relying on ssh exit code still work. It's a client side issue.
Comment 14 Bojan Smojver 2016-11-12 01:47:16 EST
I'm seeing this after upgrade to F25. But only on one x86_64 system, which is a VM running on ESXi. The other two x86_64 systems, which are bare metal, do not seem to be showing this problem.

It appears for both ssh and rcp.gssd, if it matters.
Comment 15 Bojan Smojver 2016-11-15 22:16:06 EST
Does anyone know how to fix this for good in glibc? Or krb5?

This is 100% repeatable on my F25 x86_64 VM that uses Kerberos against AD.

PS. Have another F25 x86_64 machine (bare metal) that uses Kerberos against MIT krb5 server and I haven't seen it there. In fact, rpc-gssd.service fires up on it just fine.
Comment 16 Robbie Harwood 2016-11-15 23:15:51 EST
(In reply to Bojan Smojver from comment #15)
> Does anyone know how to fix this for good in glibc? Or krb5?

This is a glibc bug, not a krb5 bug.  It is a race condition between dlclose() and exit() that trips an assert in glibc.

A possible workaround might be to LD_PRELOAD something to turn dlclose() into a no-op, but I can't speak to what the consequences of this are.

Alternately, a build with the assert patched out may work; I do not know.
Comment 17 Bojan Smojver 2016-11-16 00:37:55 EST
Hmm, wouldn't patching out assert just mask the real issue?

Anyhow, if this is a race, is there a way to fix the race or is the nature of the problem such that this is simply unfixable?
Comment 18 Carlos O'Donell 2016-11-16 08:04:18 EST
(In reply to Bojan Smojver from comment #17)
> Hmm, wouldn't patching out assert just mask the real issue?
> 
> Anyhow, if this is a race, is there a way to fix the race or is the nature
> of the problem such that this is simply unfixable?

It is fixable, we (Fedora glibc team) have just prioritized other issues over this one. Thank you for commenting on this issue though since it helps us keep the priority current.
Comment 19 Bojan Smojver 2016-11-16 16:58:58 EST
(In reply to Carlos O'Donell from comment #18)

> It is fixable, we (Fedora glibc team) have just prioritized other issues
> over this one. Thank you for commenting on this issue though since it helps
> us keep the priority current.

No worries. I can pester you with questions all you want. :-)

PS. For people that are hitting this problem, the system runs in a degraded state, because rpc-gssd.service won't start.
Comment 20 Lorenzo Dalrio 2016-11-20 05:39:42 EST
I am hitting this one after upgrading from f24 to f25 yesterday on a t450 laptop.
Comment 21 Florian Weimer 2016-11-20 08:36:59 EST
Have we actually confirmed this is a race involving multiple threads?
Comment 22 Viorel Tabara 2016-11-20 14:59:52 EST
(In reply to Florian Weimer from comment #21)
> Have we actually confirmed this is a race involving multiple threads?

How do you do that?
Comment 23 Florian Weimer 2016-11-20 15:29:20 EST
(In reply to Viorel Tabara from comment #22)
> (In reply to Florian Weimer from comment #21)
> > Have we actually confirmed this is a race involving multiple threads?
> 
> How do you do that?

Run “info threads” if you can reproduce the assertion failure under GDB, or when examining the core file.  If there is just one thread, the program may still have been multi-threaded at the time the bug was trigger, so it is only a reliable indicator when multiple threads are present.  If you can trigger the bug while running under GDB, then you can watch out for the LWP messages.

(The question was more directed towards Carlos.)
Comment 24 Michael Cronenworth 2016-11-23 00:45:28 EST
Downgrading to glibc-2.23.1-11.fc24 makes the assert stop.

OpenSSH does not have multiple threads running when this occurs.

It seems to only affect OpenSSH. Other apps that dlopen() gssapi krb5 libs, such as the OpenSSL and cURL clients, do not hit this assert.
Comment 25 Bojan Smojver 2016-11-23 01:06:21 EST
(In reply to Michael Cronenworth from comment #24)

> It seems to only affect OpenSSH.

Try starting rpc-gssd.service. I have a machine where it works and one where it won't start because of the assert.
Comment 26 Florian Weimer 2016-11-23 05:55:26 EST
I think we have two different bugs here.  The first one is the multi-threading race between dlclose and process exit.

The other bug is some other form of link map corruption, probably along the lines of bug 1395758.

Can those who encounter the other bug (with ssh and rpc-ssd.service) please run these processes with LD_DEBUG=all and attach the output?  Thanks.
Comment 27 nicolas.rolans 2016-11-23 06:08 EST
Created attachment 1223171 [details]
rpc-gssd output with LD_DEBUG=all

As requested by Florian Weimer.
Comment 28 Lorenzo Dalrio 2016-11-23 06:09 EST
Created attachment 1223172 [details]
Output of ssh session with LD_DEBUG=all

This is the output of ssh with LD_DEBUG=all.
Comment 29 Fernando Herrera 2016-11-23 10:57:30 EST
I think the problem with ssh and krb is that _exit calls _dl_fini for /usr/lib64/gssntlmssp/gssntlmssp.so first and then it calls _dl_fini for /lib64/libgssapi_krb5.so.2 which does freeMechList -> releaseMechInfo which calls krb5int_close_plugin for the already closed gssntlmssp.so
Comment 30 Viorel Tabara 2016-11-23 11:18 EST
Created attachment 1223334 [details]
ssh with LD_DEBUG=all

(In reply to Florian Weimer from comment #26)
> > Can those who encounter the other bug (with ssh and rpc-ssd.service) please
> run these processes with LD_DEBUG=all and attach the output?  Thanks.

There you go. Thanks!
Comment 31 Fernando Herrera 2016-11-23 11:46:38 EST
gssntlmssp.so plugin is opened at gss_init_sec_context -> gssint_get_mechanism but as mech->gss_init_sec_context is NULL the function returns without closing the plugin
Comment 32 Florian Weimer 2016-11-24 10:15:18 EST
The ssh client assert has been filed as bug 1398370.
Comment 33 Colin Simpson 2016-11-25 09:17:58 EST
I see rpc.gssd failing to start from boot in the F25.

Do you still need debug on this?



systemctl status rpc-gssd.service
● rpc-gssd.service - RPC security service for NFS client and server
   Loaded: loaded (/usr/lib/systemd/system/rpc-gssd.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2016-11-25 14:15:34 GMT; 7s ago
  Process: 5528 ExecStart=/usr/sbin/rpc.gssd $RPCSVCGSSDARGS (code=exited, status=127)

Nov 25 14:15:34 shader systemd[1]: Starting RPC security service for NFS client and server...
Nov 25 14:15:34 shader rpc.gssd[5528]: Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!
Nov 25 14:15:34 shader systemd[1]: rpc-gssd.service: Control process exited, code=exited status=127
Nov 25 14:15:34 shader systemd[1]: Failed to start RPC security service for NFS client and server.
Nov 25 14:15:34 shader systemd[1]: rpc-gssd.service: Unit entered failed state.
Nov 25 14:15:34 shader systemd[1]: rpc-gssd.service: Failed with result 'exit-code'.
Comment 34 more.andres 2016-12-05 20:19:54 EST
Also bumped into the same issue after upgrading into F25.

rpc.gssd[1179]: Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called'
Comment 35 Mats Wichmann 2016-12-07 18:33:04 EST
I'm running into this in a very persistent way using git-over-ssh.
Comment 36 Robbie Harwood 2016-12-12 14:23:00 EST
*** Bug 1398370 has been marked as a duplicate of this bug. ***
Comment 37 Anthony Messina 2016-12-27 08:18:32 EST
This seems to be fixed with the updates described in #1398370
Comment 38 Florian Weimer 2016-12-27 08:26:04 EST
(In reply to Anthony Messina from comment #37)
> This seems to be fixed with the updates described in #1398370

No, bug 1398370 and bug 1264556 are different bugs, and bug 1264556 is not yet fixed.
Comment 39 Bojan Smojver 2016-12-27 20:07:20 EST
(In reply to Florian Weimer from comment #38)

> No, bug 1398370 and bug 1264556 are different bugs, and bug 1264556 is not
> yet fixed.

Interesting. That gss service now starts fine for me. And openssh stopped with the errors as well.

Note You need to log in before you can comment on or make changes to this bug.