Upstream bug report (from 2022-04 but nothing was done): https://sourceware.org/bugzilla/show_bug.cgi?id=29039

Here's the sequence of events:

- Early on during dynamic module load, `_dl_assign_tls_modid` is called to find a free module ID for the module. It finds a free slotinfo entry, and sets its map to point to the module, marking it as used. At this point the generation ID is the default zero.
- Relocations are processed. Dynamic TLS relocations create their associated data by calling `_dl_make_tlsdesc_dynamic`. This calls `map_generation` to identify the generation for this module. That goes through the slot info, looking for an entry that already has the current map (module) assigned and a valid generation. We do have an entry, but the generation is zero, so that fails. The fallback path then assigns the current global generation + 1.
- Once all relocations have been processed, the linker calls `_dl_add_to_slotinfo`. This finally sets up the slot info to the current map (which was already set anyway), and the current global generation + 1.
- Whenever a TLS symbol is used for this module, it will call `_dl_tlsdesc_dynamic`. The first time, that will check the DTV generation against the symbol generation. Since it will be too old, it calls into the slow path via `__tls_get_addr`, which checks the generation and calls `_dl_update_slotinfo`.
- `_dl_update_slotinfo` finally expands the DTV if needed, and initializes the DTV entry with the pointer `TLS_DTV_UNALLOCATED`. The caller then checks that, and finally allocates the TLS block for this thread/module.
- When a module is unloaded, `remove_slotinfo` will find its slot info entry, and set the map to NULL and the generation to, again, the current generation + 1 (which will have incremented by now if a DTV update happened in the interim).

This all goes wrong when a modid is reused:

- A module with TLS is dynamically loaded. It gets assigned a slot, a generation, everything. So far so good.
- That module **never uses its TLS symbols**, so the DTV resize/allocation never happens.
- The module is unloaded, which replaces its slot pointer with NULL and bumps up the generation.
- A bunch of other stuff happens involving TLS, which causes the generation to increase and the DTV to be updated. Since the module was unloaded, its DTV entry is no longer in use and is not initialized.
- Another module is loaded. It is assigned the same module ID. **`_dl_assign_tls_modid` assigns it to the same slot, but leaves the generation untouched, which is now nonzero since it was never cleared.**
- Since the generation is nonzero, relocation processing goes ahead and **puts that old generation into the TLS relocation entries**. This generation is now obsolete, since the global generation increased in the interim.
- After relocations, `_dl_add_to_slotinfo` is finally called, which bumps up the generation in the slot info to the current one + 1. **But the relocations were already processed with an old generation!**
- On first access to TLS data, `_dl_tlsdesc_dynamic` checks the generation. Since the DTV generation is newer than the symbol generation, it checks the pointer against `TLS_DTV_UNALLOCATED`, **which is (void*)-1**. Since the DTV entry was never initialized, its value is zero, so it returns from the happy path.
- Some pointer arithmetic nonsense later, that NULL pointer is dereferenced.

This is currently breaking GNOME in Fedora Asahi.

Reproducible: Always

Steps to Reproduce:
1. Install Fedora Asahi GNOME
2. Try to start an X11 GNOME session

Actual Results:
Mutter crashloops

Expected Results:
Mutter does not crashloop
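The modid reuse sequence in the report can be replayed in a toy model, which may make the generation bookkeeping easier to follow. This is only a sketch under stated assumptions: it has a single slot and a single thread, the struct layout and the `reuse_scenario` and `first_access_uses_null` helpers are hypothetical, and the `clear_gen` flag models one plausible fix (zeroing the slot generation when the slot is claimed) rather than reproducing the attached patch. The other function names mirror the glibc functions named in the report, but the bodies are simplifications, not glibc code.

```c
#include <stdbool.h>
#include <stddef.h>

#define TLS_DTV_UNALLOCATED ((void *) -1)

/* Toy stand-ins for the dynamic linker state: one slot, one thread. */
struct slot { const void *map; size_t gen; };

struct model {
  struct slot slot;    /* the single slotinfo entry that gets reused */
  size_t global_gen;   /* global TLS generation counter */
  size_t dtv_gen;      /* generation this thread's DTV was last synced to */
  void *dtv_entry;     /* DTV entry for the modid; zero until initialized */
};

/* Models _dl_assign_tls_modid: claim the slot for MAP.
   clear_gen is the ASSUMED fix; without it the old generation survives. */
static void assign_modid (struct model *m, const void *map, bool clear_gen)
{
  m->slot.map = map;
  if (clear_gen)
    m->slot.gen = 0;
}

/* Models map_generation: use the slot's generation if it is nonzero,
   otherwise fall back to the current global generation + 1. */
static size_t map_generation (const struct model *m, const void *map)
{
  if (m->slot.map == map && m->slot.gen != 0)
    return m->slot.gen;
  return m->global_gen + 1;
}

/* Models _dl_add_to_slotinfo. The bump of the global counter is folded
   in here for brevity; that is a simplification of this sketch. */
static void add_to_slotinfo (struct model *m, const void *map)
{
  m->slot.map = map;
  m->slot.gen = m->global_gen + 1;
  m->global_gen++;
}

/* Models remove_slotinfo on unload, with the same folded-in bump. */
static void remove_slotinfo (struct model *m)
{
  m->slot.map = NULL;
  m->slot.gen = m->global_gen + 1;
  m->global_gen++;
}

/* Hypothetical helper: returns true if the first TLS access takes the
   fast path and ends up using a zero (never initialized) DTV entry. */
static bool first_access_uses_null (struct model *m, size_t desc_gen)
{
  static char tls_block[16];

  if (desc_gen <= m->dtv_gen && m->dtv_entry != TLS_DTV_UNALLOCATED)
    return m->dtv_entry == NULL;   /* fast path trusts the entry */

  /* Slow path: sync the DTV and allocate the block. */
  m->dtv_entry = tls_block;
  m->dtv_gen = m->global_gen;
  return false;
}

/* Hypothetical driver: load A, unload A without touching its TLS,
   let the generation advance, then load B into the same slot. */
static bool reuse_scenario (bool clear_gen, size_t *desc_gen_out)
{
  struct model m = { { NULL, 0 }, 0, 0, NULL };
  int mod_a = 0, mod_b = 0;        /* only their addresses are used */

  assign_modid (&m, &mod_a, clear_gen);
  (void) map_generation (&m, &mod_a);
  add_to_slotinfo (&m, &mod_a);
  remove_slotinfo (&m);

  m.global_gen += 3;               /* unrelated TLS activity */
  m.dtv_gen = m.global_gen;        /* DTV caught up; entry still zero */

  assign_modid (&m, &mod_b, clear_gen);
  *desc_gen_out = map_generation (&m, &mod_b);
  add_to_slotinfo (&m, &mod_b);
  return first_access_uses_null (&m, *desc_gen_out);
}
```

Calling `reuse_scenario(false, &gen)` follows the failing path from the report: the descriptor picks up the stale slot generation, which is older than the DTV generation, so the fast path is taken with a zero entry. With `clear_gen` set, `map_generation` falls back to the global generation + 1 and the first access goes through the slow path instead.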
Created attachment 2001506
Proposed patch (confirmed fixes the issue)
Added some links to GNOME bugs ultimately caused by this.
On further reflection, similar but not caused by this bug, so removed.
(In reply to Hector Martin from comment #1)
> Created attachment 2001506 [details]
> Proposed patch (confirmed fixes the issue)

Can you please submit this patch to the libc-alpha list?
(In reply to Szabolcs Nagy from comment #4)
> (In reply to Hector Martin from comment #1)
> > Created attachment 2001506 [details]
> > Proposed patch (confirmed fixes the issue)
>
> can you please submit this patch to the libc-alpha list.

Done.
Please can we get this committed to the Fedora glibc package in the meantime? This is seriously breaking our systems...
It's probably breaking other things too, seeing as it's a fully reproducible bug that doesn't depend on race conditions or anything like that, and is architecture-independent. I'm sure we're not the first to hit this, just nobody bothered to debug it until now (even though it was known for over 1.5 years)...
(In reply to Hector Martin from comment #7)
> It's probably breaking other things too, seeing as it's a fully reproducible
> bug that doesn't depend on race conditions or anything like that, and is
> architecture-independent.

note that aarch64 is the only target defaulting to tlsdesc. other targets would only hit it with specific modules that are explicitly built with tlsdesc (and only x86, arm and aarch64 have tlsdesc support currently).

traditional tls does not look at the generation count at reloc processing time, only at __tls_get_addr call time, while tlsdesc does look at it during reloc processing, so that it can use it in the dynamic tlsdesc fast path.
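For anyone wanting to check which dialect a given module ends up with, a minimal TLS-using module might look like the sketch below. The file name `tlsmod.c`, the variable `counter` and the function `bump` are hypothetical; the flags in the comment are GCC's `-mtls-dialect` options. The sketch says nothing about which distributions or projects actually enable them.

```c
/* tlsmod.c (hypothetical): a module with one thread-local variable.
 *
 * Built as a shared object, accesses to `counter` go through the
 * dynamic TLS machinery. Which dialect is emitted depends on the
 * target and flags:
 *   aarch64: TLS descriptors by default (-mtls-dialect=desc)
 *   x86-64:  traditional by default; -mtls-dialect=gnu2 selects
 *            TLS descriptors
 *
 * Example build and check (assumed file names):
 *   gcc -fPIC -shared -mtls-dialect=gnu2 tlsmod.c -o tlsmod.so
 *   readelf -r tlsmod.so | grep TLSDESC
 */
__thread int counter;

/* Each thread sees its own copy of `counter`. */
int bump (void)
{
  return ++counter;
}
```

If the `readelf` output lists TLSDESC relocations for the module, its TLS accesses use descriptors and the generation is read at relocation processing time, as described above.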
> note that aarch64 is the only target defaulting to tlsdesc, other targets would
> only hit it with specific modules that explicitly build with tlsdesc (and only
> x86, arm, aarch64 has tlsdesc support currently).

I guess Mesa must build with that option on then (or is this an Arch thing?), since the original bug report was on x86-64.
Upstream review for this is complete with a Reviewed-by from Szabolcs, who will commit this:
https://patchwork.sourceware.org/project/glibc/patch/20231128-tls-modid-reuse-v1-1-431c73f37fc7@marcan.st/

Once the patch is in upstream we can sync this with Rawhide and F39.
I would hope this also gets fixed in F38, since that is still supported.
(In reply to Hector Martin from comment #11)
> I would hope this also gets fixed in F38, since that is still supported.

Yes, we will backport it upstream and sync Fedora to the upstream release branch.
This is fixed in Rawhide (F40) and still needs to be backported into F39 and F38 (as requested).
This has been backported to upstream glibc 2.37 and 2.38, so this can be brought directly into the F39 and F38 builds of glibc.

commit 874d4186975560fb79d5ebd46a4f378a2e3f7657
Author:     Hector Martin <marcan>
AuthorDate: Tue Nov 28 15:23:07 2023 +0900
Commit:     Szabolcs Nagy <szabolcs.nagy>
CommitDate: Fri Dec 22 14:37:46 2023 +0000

    elf: Fix TLS modid reuse generation assignment (BZ 29039)
Fixed in all supported Fedora releases.