Bug 2251557 - glibc: Corrupt DTV after reuse of a TLS module ID following dlclose with unused TLS
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: 39
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Carlos O'Donell
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-11-26 12:15 UTC by Hector Martin
Modified: 2024-02-09 14:36 UTC

Fixed In Version: glibc-2.38.9000-26.fc40 glibc-2.38-16.fc39 glibc-2.37-18.fc38
Clone Of:
Environment:
Last Closed: 2024-02-09 14:36:31 UTC
Type: ---
Embargoed:


Attachments
Proposed patch (confirmed fixes the issue) (404 bytes, patch)
2023-11-26 12:16 UTC, Hector Martin


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Sourceware | 29039 | 0 | P2 | UNCONFIRMED | _dl_tlsdesc_dynamic (sometimes) returns garbage offsets | 2023-11-26 12:20:03 UTC

Description Hector Martin 2023-11-26 12:15:31 UTC
Upstream bug report (from 2022-04 but nothing was done): https://sourceware.org/bugzilla/show_bug.cgi?id=29039

Here's the sequence of events:

- Early on during dynamic module load, `_dl_assign_tls_modid` is called to find a free module ID for the module. It finds a free slotinfo entry, and sets its map to point to the module, marking it as used. At this point the generation ID is the default zero.
- Relocations are processed. Dynamic TLS relocations create their associated data by calling `_dl_make_tlsdesc_dynamic`. This calls `map_generation` to identify the generation for this module. That goes through the slot info, looking for an entry that already has the current map (module) assigned and a valid generation. We do have an entry, but the generation is zero, so that fails. The fallback path then assigns the current global generation + 1.
- Once all relocations have been processed, the linker calls `_dl_add_to_slotinfo`. This finally sets the slot info entry's map to the current map (which was already set anyway) and its generation to the current global generation + 1.
- Whenever a TLS symbol is used for this module, the access calls `_dl_tlsdesc_dynamic`. The first time, that checks the DTV generation against the symbol generation. Since the DTV generation is too old, it calls into the slow path via `__tls_get_addr`, which checks the generation and calls `_dl_update_slotinfo`.
- `_dl_update_slotinfo` finally expands the DTV if needed, and initializes the DTV entry with the pointer `TLS_DTV_UNALLOCATED`. The caller then checks that, and finally allocates the TLS block for this thread/module.
- When a module is unloaded, `remove_slotinfo` will find its slot info entry, and set the map to NULL and the generation to, again, the current generation + 1 (which will have incremented by now if a DTV update happened in the interim).
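
The bookkeeping in the steps above can be sketched as a toy Python model. This is not glibc code: the method names only mirror the glibc functions named in this report, the dict-based slot layout and the `bump_generation` helper are assumptions made for illustration, and module IDs are 0-based here purely for simplicity.

```python
class SlotInfo:
    """Toy stand-in for the slotinfo list; not the glibc data structure."""

    def __init__(self, nslots=4):
        # Each slot records the module occupying it and a generation.
        self.slots = [{"map": None, "gen": 0} for _ in range(nslots)]
        self.global_gen = 0  # models the global TLS generation counter

    def assign_tls_modid(self, module):
        # Claim the first free slot. The slot's generation is left untouched.
        for modid, slot in enumerate(self.slots):
            if slot["map"] is None:
                slot["map"] = module
                return modid
        raise RuntimeError("toy model ran out of slots")

    def map_generation(self, module):
        # Use the slot's generation if the module already has a nonzero one,
        # otherwise fall back to the global generation + 1.
        for slot in self.slots:
            if slot["map"] == module and slot["gen"] != 0:
                return slot["gen"]
        return self.global_gen + 1

    def add_to_slotinfo(self, modid, module):
        # Runs after relocation processing.
        self.slots[modid]["map"] = module
        self.slots[modid]["gen"] = self.global_gen + 1

    def remove_slotinfo(self, modid):
        # Unload: clear the map, but stamp a fresh (nonzero) generation.
        self.slots[modid]["map"] = None
        self.slots[modid]["gen"] = self.global_gen + 1

    def bump_generation(self):
        # Assumption: the global counter advances once a load/unload is done.
        self.global_gen += 1


info = SlotInfo()
modid = info.assign_tls_modid("libfoo.so")    # slot claimed, gen still 0
desc_gen = info.map_generation("libfoo.so")   # gen is 0, so fallback path
info.add_to_slotinfo(modid, "libfoo.so")      # slot gen set after relocs
info.bump_generation()
print(modid, desc_gen, info.slots[modid]["gen"], info.global_gen)  # 0 1 1 1
```

On a fresh slot, the generation stored in the relocation data (`desc_gen`) and the generation later written to the slot agree, which is why the first use of a module ID behaves correctly in this model.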

This all goes wrong when a modid is reused:

- A module with TLS is dynamically loaded. It gets assigned a slot, a generation, everything. So far so good.
- That module **never uses its TLS symbols**, so the DTV resize/allocation never happens.
- The module is unloaded, which replaces its slot pointer with NULL and bumps up the generation.
- A bunch of other stuff happens involving TLS, which causes the generation to increase and the DTV to be updated. Since the module was unloaded, its DTV entry is no longer in use and is not initialized.
- Another module is loaded. It is assigned the same module ID. **`_dl_assign_tls_modid` assigns it to the same slot, but leaves the generation untouched, which is now nonzero since it was never cleared.**
- Since the generation is nonzero, relocation processing goes ahead and **puts that old generation into the TLS relocation entries**. This generation is now obsolete, since the global generation increased in the interim.
- After relocations, `_dl_add_to_slotinfo` is finally called, which bumps up the generation in the slot info to the current one + 1. **But the relocations were already processed with an old generation!**
- On first access to TLS data, `_dl_tlsdesc_dynamic` checks the generation. Since the DTV generation is newer than the symbol generation, it checks the pointer against `TLS_DTV_UNALLOCATED`, **which is (void*)-1**. Since the DTV entry was never initialized, its value is zero, which does not match `TLS_DTV_UNALLOCATED`, so the function takes the fast path and returns.
- Some pointer arithmetic nonsense later, that NULL pointer is dereferenced.
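
The failing sequence can be replayed with concrete numbers in a small Python sketch. Everything here is a simplified model rather than glibc code: a single slot and a single thread's DTV entry are tracked, the generation bumps are arbitrary illustrative values, and the `reset_gen_on_reuse` flag is a hypothetical way to break the chain in this model (the contents of the attached patch are not shown in this report).

```python
TLS_DTV_UNALLOCATED = -1   # stands in for (void *) -1
UNINITIALIZED = 0          # a DTV entry that nobody has written yet


def run_scenario(reset_gen_on_reuse):
    global_gen = 0
    slot = {"map": None, "gen": 0}
    dtv = {"gen": 0, "entry": UNINITIALIZED}  # one thread, one module ID

    # Load module A. Its relocations see gen 0 and use the fallback, which
    # matches what _dl_add_to_slotinfo writes afterwards.
    slot["map"] = "A"
    slot["gen"] = global_gen + 1
    global_gen += 1                      # now 1

    # A never touches its TLS, so the DTV entry stays uninitialized.

    # Unload A: map cleared, generation stamped with a nonzero value.
    slot["map"] = None
    slot["gen"] = global_gen + 1         # slot gen is now 2
    global_gen += 1                      # now 2

    # Unrelated TLS activity: the generation grows and this thread's DTV
    # catches up. The freed slot's entry is not initialized along the way.
    global_gen += 3                      # now 5
    dtv["gen"] = global_gen

    # Load module B into the same slot.
    slot["map"] = "B"
    if reset_gen_on_reuse:
        slot["gen"] = 0                  # hypothetical fix in this model
    # Relocation processing: a nonzero slot generation is taken as-is.
    desc_gen = slot["gen"] if slot["gen"] != 0 else global_gen + 1
    slot["gen"] = global_gen + 1         # too late to affect desc_gen
    global_gen += 1                      # now 6

    # First TLS access from B: the fast-path check described above.
    if dtv["gen"] >= desc_gen and dtv["entry"] != TLS_DTV_UNALLOCATED:
        return ("fast path", dtv["entry"])   # returns the garbage entry
    # Slow path: the DTV is updated and the block is allocated on demand.
    dtv["gen"] = global_gen
    dtv["entry"] = TLS_DTV_UNALLOCATED
    return ("slow path", "allocated")


print(run_scenario(reset_gen_on_reuse=False))  # ('fast path', 0)
print(run_scenario(reset_gen_on_reuse=True))   # ('slow path', 'allocated')
```

With the stale generation (2) recorded in the relocation data and the DTV already at generation 5, the check passes and the zero entry is handed back. Clearing the slot generation on reuse makes the descriptor generation (6) newer than the DTV, which forces the slow path in this model.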

This is currently breaking GNOME in Fedora Asahi.

Reproducible: Always

Steps to Reproduce:
1. Install Fedora Asahi GNOME
2. Try to start an X11 GNOME session

Actual Results:
Mutter crashloops

Expected Results:  
Mutter does not crashloop

Comment 1 Hector Martin 2023-11-26 12:16:05 UTC
Created attachment 2001506 [details]
Proposed patch (confirmed fixes the issue)

Comment 2 Neal Gompa 2023-11-26 12:20:03 UTC
Added some links to GNOME bugs ultimately caused by this.

Comment 3 Neal Gompa 2023-11-26 12:23:55 UTC
On further reflection, similar but not caused by this bug, so removed.

Comment 4 Szabolcs Nagy 2023-11-27 10:17:11 UTC
(In reply to Hector Martin from comment #1)
> Created attachment 2001506 [details]
> Proposed patch (confirmed fixes the issue)

can you please submit this patch to the libc-alpha list.

Comment 5 Hector Martin 2023-11-28 06:23:47 UTC
(In reply to Szabolcs Nagy from comment #4)
> (In reply to Hector Martin from comment #1)
> > Created attachment 2001506 [details]
> > Proposed patch (confirmed fixes the issue)
> 
> can you please submit this patch to the libc-alpha list.

Done.

Comment 6 Neal Gompa 2023-11-28 08:30:05 UTC
Please can we get this committed to the Fedora glibc package in the meantime? This is seriously breaking our systems...

Comment 7 Hector Martin 2023-11-28 08:53:29 UTC
It's probably breaking other things too, seeing as it's a fully reproducible bug that doesn't depend on race conditions or anything like that, and is architecture-independent. I'm sure we're not the first to hit this, just nobody bothered to debug it until now (even though it was known for over 1.5 years)...

Comment 8 Szabolcs Nagy 2023-11-28 09:55:43 UTC
(In reply to Hector Martin from comment #7)
> It's probably breaking other things too, seeing as it's a fully reproducible
> bug that doesn't depend on race conditions or anything like that, and is
> architecture-independent.

note that aarch64 is the only target defaulting to tlsdesc, other targets would
only hit it with specific modules that explicitly build with tlsdesc (and only
x86, arm, aarch64 has tlsdesc support currently).

traditional tls does not look at the generation count at reloc processing time,
only at __tls_get_addr call time. tlsdesc does read it at reloc processing time,
so that the dynamic tlsdesc fast path can use it.

Comment 9 Hector Martin 2023-11-28 14:06:38 UTC
> note that aarch64 is the only target defaulting to tlsdesc, other targets would
> only hit it with specific modules that explicitly build with tlsdesc (and only
> x86, arm, aarch64 has tlsdesc support currently).

I guess Mesa must build with that option on then (or is this an Arch thing?), since the original bug report was on x86-64.

Comment 10 Carlos O'Donell 2023-11-28 14:11:27 UTC
Upstream review for this is complete with a Reviewed-by from Szabolcs who will commit this:
https://patchwork.sourceware.org/project/glibc/patch/20231128-tls-modid-reuse-v1-1-431c73f37fc7@marcan.st/

Once the patch is in upstream we can sync this with Rawhide and F39.

Comment 11 Hector Martin 2023-11-28 14:30:10 UTC
I would hope this also gets fixed in F38, since that is still supported.

Comment 12 Florian Weimer 2023-11-28 14:35:36 UTC
(In reply to Hector Martin from comment #11)
> I would hope this also gets fixed in F38, since that is still supported.

Yes, we will backport it upstream and sync Fedora to the upstream release branch.

Comment 13 Carlos O'Donell 2023-12-05 14:40:29 UTC
This is fixed in Rawhide (F40) and still needs to be backported into F39 and F38 (as requested).

Comment 14 Carlos O'Donell 2024-01-19 14:33:59 UTC
This has been backported to upstream glibc 2.37 and 2.38, so it can be brought directly into the F39 and F38 builds of glibc.

commit 874d4186975560fb79d5ebd46a4f378a2e3f7657
Author:     Hector Martin <marcan>
AuthorDate: Tue Nov 28 15:23:07 2023 +0900
Commit:     Szabolcs Nagy <szabolcs.nagy>
CommitDate: Fri Dec 22 14:37:46 2023 +0000

    elf: Fix TLS modid reuse generation assignment (BZ 29039)

Comment 15 Florian Weimer 2024-02-09 14:36:31 UTC
Fixed in all supported Fedora releases.

