Bug 1398370 - Plugin unloading causes undefined behavior: Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!
Plugin unloading causes undefined behavior: Inconsistency detected by ld.so: ...
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: glibc (Show other bugs)
25
Unspecified Linux
unspecified Severity unspecified
: ---
: ---
Assigned To: Carlos O'Donell
Fedora Extras Quality Assurance
: Reopened
: 1389881 1399594 1401341 (view as bug list)
Depends On:
Blocks: 1406666
  Show dependency treegraph
 
Reported: 2016-11-24 10:11 EST by Shawn Starr
Modified: 2016-12-27 10:51 EST (History)
43 users (show)

See Also:
Fixed In Version: glibc-2.24-4.fc25
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-12-27 10:51:11 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)


External Trackers
Tracker ID Priority Status Summary Last Updated
Sourceware 11941 None None None 2016-12-13 08:33 EST

  None (edit)
Description Shawn Starr 2016-11-24 10:11:58 EST
Description of problem:

After logging out via ssh another machine, I get this notice from the ld loader:

 Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!

Version-Release number of selected component (if applicable):
glibc-2.24-3.fc25.x86_64


How reproducible: 

Noticed only when connecting to another machine running same GNU libc(?)

Steps to Reproduce:
1. ssh to another machine running Fedora 25
2. Logout, notice message
3.

Actual results:
Shows ld loader message

Expected results:
None should be shown

Additional info:

I don't know what triggers this.
Comment 1 Florian Weimer 2016-11-24 10:15:56 EST
This was previously reported as a rider on bug 1264556, which is actually about a different issue.
Comment 2 Tomas Mraz 2016-11-28 04:24:45 EST
*** Bug 1389881 has been marked as a duplicate of this bug. ***
Comment 3 Ward 2016-11-28 05:09:35 EST
I'm also getting this after upgrading to Fedora 25.

When doing:
ssh -o GSSAPIAuthentication=no <host>
the message is gone.
Comment 4 Leif Madsen 2016-11-28 14:42:54 EST
Also confirmed I see it here as well. There was some discussion (possible red herring) about kerberos auth being an issue. Just FYI that I authenticate to work via kerberos (kinit ...)

Disabling GSSAPIAuthentication (per comment #3) also removes the message when logging off of a remote ssh connection. I initially saw this message when using rsync+ssh.
Comment 5 Jakub Jelen 2016-11-29 06:18:22 EST
*** Bug 1399594 has been marked as a duplicate of this bug. ***
Comment 6 Florian Weimer 2016-11-29 07:32:27 EST
I cannot reproduce this with my variant of the Red Hat Kerberos configuration (default Fedora 25 Workstation installation plus krb5-workstation plus adding the REDHAT.COM realm to /etc/krb5.conf).  gssproxy is running, but I have not done anything to configure it.  I tried logging into interactive SSH servers both using public key authentication and GSSAPI Kerberos authentication, and after logging out, I did not observe a crash.
Comment 7 Roman Kagan 2016-11-29 07:38:57 EST
(In reply to Florian Weimer from comment #6)
> I cannot reproduce this with my variant of the Red Hat Kerberos
> configuration (default Fedora 25 Workstation installation plus
> krb5-workstation plus adding the REDHAT.COM realm to /etc/krb5.conf). 
> gssproxy is running, but I have not done anything to configure it.  I tried
> logging into interactive SSH servers both using public key authentication
> and GSSAPI Kerberos authentication, and after logging out, I did not observe
> a crash.

Make sure to have gssntlmssp installed.  gssproxy doesn't trigger the problem in ssh.
Comment 8 Ward 2016-11-29 07:41:25 EST
Hi Florian,

I also have krb5-workstation installed and gssproxy running but all configs are at the default settings (untouched). I have it when logging into CentOS 6 and 7 machines but not on Ubuntu. All are not configured to use GSSAPI.
Comment 9 Florian Weimer 2016-11-29 07:52:01 EST
(In reply to Roman Kagan from comment #7)
> (In reply to Florian Weimer from comment #6)
> > I cannot reproduce this with my variant of the Red Hat Kerberos
> > configuration (default Fedora 25 Workstation installation plus
> > krb5-workstation plus adding the REDHAT.COM realm to /etc/krb5.conf). 
> > gssproxy is running, but I have not done anything to configure it.  I tried
> > logging into interactive SSH servers both using public key authentication
> > and GSSAPI Kerberos authentication, and after logging out, I did not observe
> > a crash.
> 
> Make sure to have gssntlmssp installed.  gssproxy doesn't trigger the
> problem in ssh.

Still no luck.  I even tried rebooting the system.  gssntlmssp is not loaded during the SSH session, I only see proxymech.
Comment 10 Florian Weimer 2016-11-29 07:52:36 EST
(In reply to Ward from comment #8)
> Hi Florian,
> 
> I also have krb5-workstation installed and gssproxy running but all configs
> are at the default settings (untouched). I have it when logging into CentOS
> 6 and 7 machines but not on Ubuntu. All are not configured to use GSSAPI.

Did you upgrade from an earlier Fedora release?
Comment 11 Ward 2016-11-29 07:56:21 EST
Hi Florian, 

Yes, this is after an upgrade F24->F25.
Comment 12 Roman Kagan 2016-11-29 08:22:19 EST
Another repro:

install krb5-devel and run

# gss-server foo &
# gss-client -spnego localhost bar baz

This results in loading gssntlmssp and triggers that error message.
(W/o "-spnego" gssntlmssp is not loaded and the error is not triggered).

FWIW it all is on a F25 machine upgraded from F24 (and from F23 before).

/etc/krb5.conf is left at default but a TGT is present in the session.
Comment 13 Florian Weimer 2016-11-29 08:30:01 EST
(In reply to Roman Kagan from comment #12)
> Another repro:
> 
> install krb5-devel and run
> 
> # gss-server foo &
> # gss-client -spnego localhost bar baz
> 
> This results in loading gssntlmssp and triggers that error message.
> (W/o "-spnego" gssntlmssp is not loaded and the error is not triggered).

Like this?

# gss-client -spnego localhost bar baz
GSS-API error initializing context: Unspecified GSS failure.  Minor code may provide more information
GSS-API error initializing context: SPNEGO cannot find mechanisms to negotiate
Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!

Thanks, that should allow us to figure out what's going on.
Comment 14 Andy Wang 2016-11-30 13:28:19 EST
I'm seeing this same problem when quitting an sftp/ssh session against some AWS instances I have. (running Amazon Linux in the AWs instances).

If I remove gssntlmssp the problem goes away.  That package is installed as a dependency of pidgin-sipe that I need to use though.
Comment 15 Roman Kagan 2016-12-01 04:19:05 EST
I just took another look at this and I can confirm what Fernando Herrera said in bug 1264556#c29:

libgssapi defines a destructor, gssint_mechglue_fini, which calls freeMechList() which iterates the list of mechanisms and calls releaseMechInfo() on each.  That, in turn, calls krb5int_close_plugin() -> dlclose() for every mech loaded from a plugin.

The problem is that there are no guarantees about the ordering of destructors from different shared objects.  In this case gssntlmssp.so gets destructed by the loader before libgssapi, so dlclose() hits the assertion.

A quick workaround seems to be to set a global variable in gssapi_mechglue_fini and then check it in krb5int_close_plugin() and skip dlclose() if it's set.
Comment 16 Roman Kagan 2016-12-01 16:31:35 EST
BTW this problem is not in glibc but in krb5 which is relying on undefined behavior.

Can someone with the rights please adjust the component?
Comment 17 Roman Kagan 2016-12-01 16:41:02 EST
Sorry but it's not in gssntlmssp.  It's in krb5 whose plugin system is broken.
Comment 18 Florian Weimer 2016-12-03 01:48:46 EST
(In reply to Roman Kagan from comment #15)
> I just took another look at this and I can confirm what Fernando Herrera
> said in bug 1264556#c29:
> 
> libgssapi defines a destructor, gssint_mechglue_fini, which calls
> freeMechList() which iterates the list of mechanisms and calls
> releaseMechInfo() on each.  That, in turn, calls krb5int_close_plugin() ->
> dlclose() for every mech loaded from a plugin.
> 
> The problem is that there are no guarantees about the ordering of
> destructors from different shared objects.  In this case gssntlmssp.so gets
> destructed by the loader before libgssapi, so dlclose() hits the assertion.

The order is not unspecified according to the base ELF specification:

“The order in which the dynamic linker calls termination functions is the exact reverse order of their corresponding initialization functions. […] The dynamic linker ensures that it will not execute any initialization or termination functions more than once.”

dlopen obviously introduces complications here.  What libgssapi does looks reasonable (dlcose the plugins it dlopened).  But during process termination, the plugin modules have their termination functions called before the ELF destructor in libgssapi runs, and apparently, the dlclose call on the plugin modules asserts in this scenario.

I have not yet built a standalone test case, but it's not entirely clear to me why this wouldn't be considered a dynamic linker bug.

> A quick workaround seems to be to set a global variable in gssapi_mechglue_fini
> and then check it in krb5int_close_plugin() and skip dlclose() if it's set.

But this essentially disables plugin unloading.
Comment 19 Roman Kagan 2016-12-03 04:59:52 EST
(In reply to Florian Weimer from comment #18)
> (In reply to Roman Kagan from comment #15)
> > I just took another look at this and I can confirm what Fernando Herrera
> > said in bug 1264556#c29:
> > 
> > libgssapi defines a destructor, gssint_mechglue_fini, which calls
> > freeMechList() which iterates the list of mechanisms and calls
> > releaseMechInfo() on each.  That, in turn, calls krb5int_close_plugin() ->
> > dlclose() for every mech loaded from a plugin.
> > 
> > The problem is that there are no guarantees about the ordering of
> > destructors from different shared objects.  In this case gssntlmssp.so gets
> > destructed by the loader before libgssapi, so dlclose() hits the assertion.
> 
> The order is not unspecified according to the base ELF specification:
> 
> “The order in which the dynamic linker calls termination functions is the
> exact reverse order of their corresponding initialization functions. […] The
> dynamic linker ensures that it will not execute any initialization or
> termination functions more than once.”

I guess you're right, but in practice on unload glibc just sorts the list of shared objects in reverse dependency order (so that dependent objects are unloaded before dependencies).  For normal dynamic linking this matches the behavior you quote.

> dlopen obviously introduces complications here.

Exactly.  In particular, the current glibc implementation puts dlopened objects at the end of the shared object list, and on unload only applies the dependency sorting.

> What libgssapi does looks
> reasonable (dlcose the plugins it dlopened).  But during process
> termination, the plugin modules have their termination functions called
> before the ELF destructor in libgssapi runs, and apparently, the dlclose
> call on the plugin modules asserts in this scenario.

Not exactly: as dlclose hasn't been called on the plugin, it just gets unloaded by the dynamic loader along with other shared objects.  If the dependency ordering puts the plugin unload before the libgssapi unload, dlclose in libgssapi's destructor is run on an already unloaded plugin.

> I have not yet built a standalone test case, but it's not entirely clear to
> me why this wouldn't be considered a dynamic linker bug.

The point here is that gssntlmssp.so is linked against libgssapi_krb5.so, so the former is unloaded before the latter, triggering the problem.

Here's the repro:

### "library":
>>> cat y.c
#include <stdio.h>
#include <dlfcn.h>

static void *lib;

void y()
{
        void (*x)();

        lib = dlopen("libx.so", RTLD_NOW | RTLD_LOCAL | RTLD_NODELETE);
        if (!lib)
                goto dlerr;

        x = dlsym(lib, "x");
        if (!x)
                goto dlerr;

        return;
dlerr:
        fprintf(stderr, "dlsym: %s\n", dlerror());
}

static __attribute__((destructor)) void dtor()
{
        fprintf(stderr, "%s:%s >>>\n", __FILE__, __func__);
        if (lib)
                dlclose(lib);
        fprintf(stderr, "%s:%s <<<\n", __FILE__, __func__);
}
>>> gcc -g -O2 -fpic -shared y.c -o liby.so -ldl

### "plugin": note that it's linked against the "library"
>>> cat x.c
#include <stdio.h>

void x(void)
{
}

static __attribute__((destructor)) void dtor()
{
        fprintf(stderr, "%s:%s ===\n", __FILE__, __func__);
}
>>> gcc -g -O2 -fpic -shared x.c -o libx.so -ly -L.

### main:
>>> cat z.c
extern void y();

int main()
{
        y();
}
>>> gcc -g -O2 z.c -o z -ly -L.


### now run it:
>>> LD_LIBRARY_PATH=. ./z
x.c:dtor ===
y.c:dtor >>>
Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!
>>> echo $?
127


> > A quick workaround seems to be to set a global variable in gssapi_mechglue_fini
> > and then check it in krb5int_close_plugin() and skip dlclose() if it's set.
> 
> But this essentially disables plugin unloading.

Not really, because
a) at this point the program is exiting already so at worst you'd make valgrind unhappy by "leaking" a bit of glibc-internal state
b) due to RTLD_NODELETE this dlclose() is almost a no-op anyway

An alternative is to stop using RTLD_NODELETE (what is it for here BTW?) because the assertion in dlclose() only triggers on such handles.

Still a proper fix IMO is to adjust the design of the gssapi plugin system somehow.
Comment 20 Roman Kagan 2016-12-03 05:37:15 EST
Another workaround would be to test whether the plugin object is still loaded via dlopen(RTLD_NOLOAD) or dlinfo() before attempting dlclose().
Comment 21 Simo Sorce 2016-12-05 06:51:51 EST
If someone can suggest a proper fix for libkrb5 I can propose it upstream.
Comment 22 Florian Weimer 2016-12-05 07:07:06 EST
(In reply to Roman Kagan from comment #19)

> > What libgssapi does looks
> > reasonable (dlcose the plugins it dlopened).  But during process
> > termination, the plugin modules have their termination functions called
> > before the ELF destructor in libgssapi runs, and apparently, the dlclose
> > call on the plugin modules asserts in this scenario.
> 
> Not exactly: as dlclose hasn't been called on the plugin, it just gets
> unloaded by the dynamic loader along with other shared objects.  If the
> dependency ordering puts the plugin unload before the libgssapi unload,
> dlclose in libgssapi's destructor is run on an already unloaded plugin.

Well, that's the bug.  ld.so can run the ELF destructor for the object, but must not invalidate any opened handle for it, so that it is safe to call dlclose on them.

> > > A quick workaround seems to be to set a global variable in gssapi_mechglue_fini
> > > and then check it in krb5int_close_plugin() and skip dlclose() if it's set.
> > 
> > But this essentially disables plugin unloading.
> 
> Not really, because
> a) at this point the program is exiting already so at worst you'd make
> valgrind unhappy by "leaking" a bit of glibc-internal state
> b) due to RTLD_NODELETE this dlclose() is almost a no-op anyway

I think that's just an argument for not unloading plugins.  If you think this is unnecessary (and plugins should use ELF destructors to implement any cleanup actions on their own), then you shouldn't introduce a flag, but remove the unloading completely.

But the intent is probably that you can clean up after a dlopen/dclose of the Kerberos libraries themselves, which is why I'm not yet convinced your argument is right.
Comment 23 Roman Kagan 2016-12-05 09:18:05 EST
(In reply to Florian Weimer from comment #22)
> (In reply to Roman Kagan from comment #19)
> 
> > > What libgssapi does looks
> > > reasonable (dlcose the plugins it dlopened).  But during process
> > > termination, the plugin modules have their termination functions called
> > > before the ELF destructor in libgssapi runs, and apparently, the dlclose
> > > call on the plugin modules asserts in this scenario.
> > 
> > Not exactly: as dlclose hasn't been called on the plugin, it just gets
> > unloaded by the dynamic loader along with other shared objects.  If the
> > dependency ordering puts the plugin unload before the libgssapi unload,
> > dlclose in libgssapi's destructor is run on an already unloaded plugin.
> 
> Well, that's the bug.  ld.so can run the ELF destructor for the object, but
> must not invalidate any opened handle for it, so that it is safe to call
> dlclose on them.

You certainly want to mark it unusable.  Perhaps you can special-case dlclose() though as it's not quite "using" the handle. In the end this seems to work when dlopen-ed without RTLD_NODELETE.

> > > > A quick workaround seems to be to set a global variable in gssapi_mechglue_fini
> > > > and then check it in krb5int_close_plugin() and skip dlclose() if it's set.
> > > 
> > > But this essentially disables plugin unloading.
> > 
> > Not really, because
> > a) at this point the program is exiting already so at worst you'd make
> > valgrind unhappy by "leaking" a bit of glibc-internal state
> > b) due to RTLD_NODELETE this dlclose() is almost a no-op anyway
> 
> I think that's just an argument for not unloading plugins.  If you think
> this is unnecessary (and plugins should use ELF destructors to implement any
> cleanup actions on their own), then you shouldn't introduce a flag, but
> remove the unloading completely.
> 
> But the intent is probably that you can clean up after a dlopen/dclose of
> the Kerberos libraries themselves, which is why I'm not yet convinced your
> argument is right.

I'm not sure I'm following...

If the plugin is linked against libgssapi there's no way to ensure the plugin's _fini doesn't call into libgssapi, so the plugin has to be unloaded first.

As to the intent I'm not clear what it was but the commit that introduced RTLD_NODELETE, https://github.com/krb5/krb5/commit/0f46175d632ae03ab7d4cfba5e62534d31e128e0, mentioned http://krbdev.mit.edu/rt/Ticket/Display.html?id=7135 where there's some relevant discussion.
Comment 24 Florian Weimer 2016-12-05 09:33:37 EST
(In reply to Roman Kagan from comment #23)
> (In reply to Florian Weimer from comment #22)
> > (In reply to Roman Kagan from comment #19)
> > 
> > > > What libgssapi does looks
> > > > reasonable (dlcose the plugins it dlopened).  But during process
> > > > termination, the plugin modules have their termination functions called
> > > > before the ELF destructor in libgssapi runs, and apparently, the dlclose
> > > > call on the plugin modules asserts in this scenario.
> > > 
> > > Not exactly: as dlclose hasn't been called on the plugin, it just gets
> > > unloaded by the dynamic loader along with other shared objects.  If the
> > > dependency ordering puts the plugin unload before the libgssapi unload,
> > > dlclose in libgssapi's destructor is run on an already unloaded plugin.
> > 
> > Well, that's the bug.  ld.so can run the ELF destructor for the object, but
> > must not invalidate any opened handle for it, so that it is safe to call
> > dlclose on them.
> 
> You certainly want to mark it unusable.  Perhaps you can special-case
> dlclose() though as it's not quite "using" the handle. In the end this seems
> to work when dlopen-ed without RTLD_NODELETE.

Right, all calls for functions except dlclose have to fail.  It's similar to a two-way shutdown for sockets.

> > > > > A quick workaround seems to be to set a global variable in gssapi_mechglue_fini
> > > > > and then check it in krb5int_close_plugin() and skip dlclose() if it's set.
> > > > 
> > > > But this essentially disables plugin unloading.
> > > 
> > > Not really, because
> > > a) at this point the program is exiting already so at worst you'd make
> > > valgrind unhappy by "leaking" a bit of glibc-internal state
> > > b) due to RTLD_NODELETE this dlclose() is almost a no-op anyway
> > 
> > I think that's just an argument for not unloading plugins.  If you think
> > this is unnecessary (and plugins should use ELF destructors to implement any
> > cleanup actions on their own), then you shouldn't introduce a flag, but
> > remove the unloading completely.
> > 
> > But the intent is probably that you can clean up after a dlopen/dclose of
> > the Kerberos libraries themselves, which is why I'm not yet convinced your
> > argument is right.
> 
> I'm not sure I'm following...

I didn't talk about NODELETE.  I meant the loop which closes plugin modules.  it seems rather unnecessary, based on your explanation.
Comment 25 Robbie Harwood 2016-12-05 12:39:09 EST
(In reply to Roman Kagan from comment #19)
> An alternative is to stop using RTLD_NODELETE (what is it for here BTW?)
> because the assertion in dlclose() only triggers on such handles.

Due to the same assertion, triggered a different way: http://krbdev.mit.edu/rt/Ticket/Display.html?id=7135
Comment 26 Simo Sorce 2016-12-05 12:59:22 EST
> > You certainly want to mark it unusable.  Perhaps you can special-case
> > dlclose() though as it's not quite "using" the handle. In the end this seems
> > to work when dlopen-ed without RTLD_NODELETE.

> Right, all calls for functions except dlclose have to fail.  It's similar to a > two-way shutdown for sockets.

Wouldn't it be more correct to make it so that all the plugins are finilized first and only then libgssapi is unloaded ?
Comment 27 Simo Sorce 2016-12-05 15:47:32 EST
*** Bug 1401341 has been marked as a duplicate of this bug. ***
Comment 28 Roman Kagan 2016-12-05 15:56:38 EST
(In reply to Robbie Harwood from comment #25)
> (In reply to Roman Kagan from comment #19)
> > An alternative is to stop using RTLD_NODELETE (what is it for here BTW?)
> > because the assertion in dlclose() only triggers on such handles.
> 
> Due to the same assertion, triggered a different way:
> http://krbdev.mit.edu/rt/Ticket/Display.html?id=7135

Well, no.  Exactly per that discussion, it was added to address some openssl-related issue but somehow was thought to fix the original problem, which has never been confirmed.

OTOH, as I mentioned in my comment #23, that discussion points out the uselessness of dlclose()-ing plugins in libgssapi_krb5 _fini.
Comment 29 Robbie Harwood 2016-12-05 16:06:53 EST
(In reply to Roman Kagan from comment #28)
> (In reply to Robbie Harwood from comment #25)
> > (In reply to Roman Kagan from comment #19)
> > > An alternative is to stop using RTLD_NODELETE (what is it for here BTW?)
> > > because the assertion in dlclose() only triggers on such handles.
> > 
> > Due to the same assertion, triggered a different way:
> > http://krbdev.mit.edu/rt/Ticket/Display.html?id=7135
> 
> Well, no.  Exactly per that discussion, it was added to address some
> openssl-related issue but somehow was thought to fix the original problem,
> which has never been confirmed.

Sorry, what?  The question is why we pass RTLD_NODELETE, and the answer is that we hit this assertion when we didn't pass it.

As per http://krbdev.mit.edu/rt/Ticket/Display.html?id=7947 it definitely did cause the assertion to disappear for a while.
Comment 30 Simo Sorce 2016-12-05 18:58:47 EST
FWIW, I now hit this with just the gssproxy interposer plugin.
Comment 31 Carlos O'Donell 2016-12-06 00:03:54 EST
The questionable logic is in elf/dl-close.c (_dl_close).

In 2006 the logic was changed from:

-  /* First see whether we can remove the object at all.  */
-  if (__builtin_expect (map->l_flags_1 & DF_1_NODELETE, 0)
-      && map->l_init_called)
-    /* Nope.  Do nothing.  */
-    return;

to:

+  /* First see whether we can remove the object at all.  */
+  if (__builtin_expect (map->l_flags_1 & DF_1_NODELETE, 0))
+    {
+      assert (map->l_init_called);
+      /* Nope.  Do nothing.  */
+      return;
+    }

It is possible to be in the middle of exit() and have already run the destructors for a DF_1_NODELETE object. Subsequent destructors from dependencies that try to dlclose() a valid handle to such a dependent object would assert with the post-2006 change (which means it works fine in RHEL5 :-)).

I posit the assert is nonsense because the object is DF_1_NODELETE and cannot be removed anyway, the implementation is free to ignore the dlclose hint.

Cleaning up the orignal logic:

diff --git a/elf/dl-close.c b/elf/dl-close.c
index 6489703..29a074c 100644
--- a/elf/dl-close.c
+++ b/elf/dl-close.c
@@ -807,11 +807,8 @@ _dl_close (void *_map)
 
   /* First see whether we can remove the object at all.  */
   if (__glibc_unlikely (map->l_flags_1 & DF_1_NODELETE))
-    {
-      assert (map->l_init_called);
-      /* Nope.  Do nothing.  */
-      return;
-    }
+    /* Nope.  Do nothing.  */
+    return;
 
   if (__builtin_expect (map->l_direct_opencount, 1) == 0)
     _dl_signal_error (0, map->l_name, NULL, N_("shared object not open"));
---

Fixes the issue.

[carlos@athas rhbz1398370]$ LD_LIBRARY_PATH=. ./z-normal 
x.c:dtor ===
y.c:dtor >>>
y.c:dtor <<<

We absolutely need a test case for:
- dlopen DF_1_NODELETE object L...
- Destructors processed during exit processing for L...
- Then the L is dlclose'd via valid earlier handle by dependencies destructor.

And the provided minimal test in comment #19 is almost this.

I understand DF_1_NODELETE was added to avoid destructor ordering issues, and I don't understand the original problem. So someone will have to come up with another reproducer for the original problem (if there is one).

It's late here in EST so I'm going to leave this here for Florian to double check in the morning.

COPR build with the fix for testing (build in progress):
https://copr.fedorainfracloud.org/coprs/codonell/glibc-f25/
Comment 32 Florian Weimer 2016-12-06 04:50:07 EST
(In reply to Carlos O'Donell from comment #31)
> I posit the assert is nonsense because the object is DF_1_NODELETE and
> cannot be removed anyway, the implementation is free to ignore the dlclose
> hint.
> 
> Cleaning up the orignal logic:
> 
> diff --git a/elf/dl-close.c b/elf/dl-close.c
> index 6489703..29a074c 100644
> --- a/elf/dl-close.c
> +++ b/elf/dl-close.c
> @@ -807,11 +807,8 @@ _dl_close (void *_map)
>  
>    /* First see whether we can remove the object at all.  */
>    if (__glibc_unlikely (map->l_flags_1 & DF_1_NODELETE))
> -    {
> -      assert (map->l_init_called);
> -      /* Nope.  Do nothing.  */
> -      return;
> -    }
> +    /* Nope.  Do nothing.  */
> +    return;
>  
>    if (__builtin_expect (map->l_direct_opencount, 1) == 0)
>      _dl_signal_error (0, map->l_name, NULL, N_("shared object not open"));
> ---
> 
> Fixes the issue.

Based on the comments around updates to l_init_called, I tend to agree that the assert is bogus.

But the code has other issues as well: The open counter is incremented by dlopen with RTLD_NODELETE, but not decremented in dlclose, which is rather suspicious.  It is also accessed outside of the rtld lock, which is almost certainly a bug.
Comment 33 Simo Sorce 2016-12-06 09:54:36 EST
I have opened a PR (https://github.com/krb5/krb5/pull/576) with a potential solution for libgssapi_krb5.so, it requires also minor changes to pugins to void leaks, but should at least avoid the current assert() even if the plugins are not changed, by simply avoid a call to dlclose() in that case.

Comments welcome.
Comment 34 Greg Hudson 2016-12-06 11:41:01 EST
I could have been mistaken when committing http://krbdev.mit.edu/rt/Ticket/Display.html?id=7947 that using RTLD_NODELETE would help with http://krbdev.mit.edu/rt/Ticket/Display.html?id=7135 .  (But people definitely did see this assertion before we started using RTLD_NODELETE.)
We still benefit from RTLD_NODELETE for plugin modules which link against OpenSSL, due to that library's initialization and finalization semantics.

Roman asserted in comment 19 that gss_krb5int_lib_fini() doesn't need to unload GSS plugin modules because the program is exiting.  This is not necessarily true; there are definitely circumstances (typically involving PAM) where libgssapi_krb5 is unloaded without the program exiting.  If we do not dlclose() the handles we got from dlopen(), we would presumably leak memory in those cases.  The use of RTLD_NODELETE means the GSS plugin modules stick around in the address space, but we don't leak a succession of libdl handles to them when repeatedly loading and unloading libgssapi_krb5.
Comment 35 Adam Williamson 2016-12-06 13:36:17 EST
I'm seeing this one on our openQA worker boxes with rpc-gssd.service, for the record:

-- Unit rpc-gssd.service has begun starting up.
Dec 06 18:35:14 qa05.qa.fedoraproject.org rpc.gssd[6407]: Inconsistency detected by ld.so: dl-close.c: 811: _dl_close: Assertion `map->l_init_called' failed!
Dec 06 18:35:14 qa05.qa.fedoraproject.org systemd[1]: rpc-gssd.service: Control process exited, code=exited status=127
Dec 06 18:35:14 qa05.qa.fedoraproject.org audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=rpc-gssd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Dec 06 18:35:14 qa05.qa.fedoraproject.org systemd[1]: Failed to start RPC security service for NFS client and server.
-- Subject: Unit rpc-gssd.service has failed
Comment 36 Carlos O'Donell 2016-12-07 19:05:56 EST
(In reply to Florian Weimer from comment #32)
> (In reply to Carlos O'Donell from comment #31)
> > I posit the assert is nonsense because the object is DF_1_NODELETE and
> > cannot be removed anyway, the implementation is free to ignore the dlclose
> > hint.
> > 
> > Cleaning up the orignal logic:
> > 
> > diff --git a/elf/dl-close.c b/elf/dl-close.c
> > index 6489703..29a074c 100644
> > --- a/elf/dl-close.c
> > +++ b/elf/dl-close.c
> > @@ -807,11 +807,8 @@ _dl_close (void *_map)
> >  
> >    /* First see whether we can remove the object at all.  */
> >    if (__glibc_unlikely (map->l_flags_1 & DF_1_NODELETE))
> > -    {
> > -      assert (map->l_init_called);
> > -      /* Nope.  Do nothing.  */
> > -      return;
> > -    }
> > +    /* Nope.  Do nothing.  */
> > +    return;
> >  
> >    if (__builtin_expect (map->l_direct_opencount, 1) == 0)
> >      _dl_signal_error (0, map->l_name, NULL, N_("shared object not open"));
> > ---
> > 
> > Fixes the issue.
> 
> Based on the comments around updates to l_init_called, I tend to agree that
> the assert is bogus.

Thank you for the review.

> But the code has other issues as well: The open counter is incremented by
> dlopen with RTLD_NODELETE, but not decremented in dlclose, which is rather
> suspicious.  It is also accessed outside of the rtld lock, which is almost
> certainly a bug.

I agree, technically you are allowed to use dlopen _again_ to promote flags and therefore you could be adding to l_f_flags while the dlclose is reading it. Therefore we need to take the lock.

I'll put together a v2 patch and test it.
Comment 37 Michael Liao 2016-12-12 03:01:53 EST
Just for your reference. I had this issue as well but found a workaround. For my fc25 installation, once I removed 'gssntlmssp-0.7.0-1.fc25.x86_64', that annoying assertion disappeared. I found that as I cannot found this issue in other two machines just upgraded to fc25. After detailed examination, gssntlmssp is the only difference.
Comment 38 Simo Sorce 2016-12-12 08:22:14 EST
Carlos, any news on this, it is a bit disruptive as I am doing development on F25 and use multiple gssapi plugins.
Comment 39 Florian Weimer 2016-12-12 10:05:42 EST
(In reply to Simo Sorce from comment #38)
> Carlos, any news on this, it is a bit disruptive as I am doing development
> on F25 and use multiple gssapi plugins.

At I assumed you wanted to fix this or workaround it in krb5.  If that's not the case, you might want to reassign the bug back to glibc.
Comment 40 Robbie Harwood 2016-12-12 14:21:38 EST
I do not consider this a krb5 bug and neither does our upstream.  I may work around it if this goes on too much longer.  Apologies for not noticing the assignment.
Comment 41 Robbie Harwood 2016-12-12 14:23:00 EST

*** This bug has been marked as a duplicate of bug 1264556 ***
Comment 42 Florian Weimer 2016-12-12 14:33:32 EST
This is *not* a duplicate of bug 1264556, which refers to the concurrency issue between dlclose and exit, not this bug, which is about calling dlclose during process shutdown.
Comment 43 Robbie Harwood 2016-12-12 15:06:28 EST
(In reply to Florian Weimer from comment #42)
> This is *not* a duplicate of bug 1264556, which refers to the concurrency
> issue between dlclose and exit, not this bug, which is about calling dlclose
> during process shutdown.

As long as it gets fixed I don't care how it gets tracked.  In both cases the assert is triggered during process cleanup, and it's the same assert.  Apologies for the noise.
Comment 44 Carlos O'Donell 2016-12-13 23:02:37 EST
I have correct fix for this going upstream with a self containted test case. I won't get this finished until tomorrow though.

For those who have the problem it would be nice if you tested the following glibc f25 COPR repo build to confirm the fix:
https://copr.fedorainfracloud.org/coprs/codonell/glibc-f25/
Comment 45 Michael Cronenworth 2016-12-13 23:17:43 EST
(In reply to Carlos O'Donell from comment #44)
> For those who have the problem it would be nice if you tested the following
> glibc f25 COPR repo build to confirm the fix:
> https://copr.fedorainfracloud.org/coprs/codonell/glibc-f25/

Fixes the issue for me. Tested a few other apps and I don't see regressions. Thanks.
Comment 46 Simo Sorce 2016-12-14 06:51:43 EST
Tested and seem to work fine and resolve the issue (no more messages in the logs and gssapi with loaded plugins seem to be working fine).
Comment 47 Viorel Tabara 2016-12-19 16:35:54 EST
(In reply to Carlos O'Donell from comment #44)

LGTM, thanks.
Comment 48 Christopher Sullivan 2016-12-20 20:34:46 EST
(In reply to Carlos O'Donell from comment #44)


Tested using SSH to a Fedora 23/24/25 and CentOS 7 Hosts, and it fixed the issue, thanks.
Comment 49 Simo Sorce 2016-12-21 06:48:31 EST
Any word on when this is going to land in F25 ? I just discovered it is breaking my backups :-)
Comment 50 Carlos O'Donell 2016-12-21 08:36:36 EST
(In reply to Simo Sorce from comment #49)
> Any word on when this is going to land in F25 ? I just discovered it is
> breaking my backups :-)

I plan to do upstream, Rawhide, and F25 today. Working on this right now.
Comment 51 Carlos O'Donell 2016-12-22 00:38:54 EST
Upstream fix with regression test posted:
https://www.sourceware.org/ml/libc-alpha/2016-12/msg00859.html
Comment 52 Carlos O'Donell 2016-12-23 15:09:19 EST
Upstream master is now fixed.
I have synchronized rawhide to master and I'm testing right now.

Next steps:
- Sync rawhide to master.
- Backport to glibc 2.24 stable branch.
- Sync glibc F25 to glibc 2.24 stable branch.
Comment 53 Carlos O'Donell 2016-12-23 16:02:05 EST
Upstream glibc 2.24 backport and testing complete.
Final rawhide scratch build in progress.
Final F25 scratch build in progress.
Comment 54 Carlos O'Donell 2016-12-23 18:33:56 EST
Fixed in Fedora Rawhide, and final build here:
https://koji.fedoraproject.org/koji/taskinfo?taskID=17042959

Fedora 25 still building/testing.
Comment 55 Carlos O'Donell 2016-12-23 20:18:51 EST
Fixed in F25. Final build in progress.
Comment 56 Carlos O'Donell 2016-12-23 22:15:05 EST
Oddly the builds are done but they are staying open...
https://koji.fedoraproject.org/koji/taskinfo?taskID=17043458
Comment 57 Fedora Update System 2016-12-23 22:23:12 EST
glibc-2.24-4.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2016-b49fa138f4
Comment 58 Fedora Update System 2016-12-24 23:22:09 EST
glibc-2.24-4.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-b49fa138f4
Comment 59 Fedora Update System 2016-12-27 10:51:11 EST
glibc-2.24-4.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.