Bug 2388910 - Race condition in pyanaconda concurrent DBus module initialization causes "Illegal instruction" errors
Keywords:
Status: CLOSED DUPLICATE of bug 2247319
Alias: None
Product: Fedora
Classification: Fedora
Component: anaconda
Version: 42
Hardware: aarch64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: anaconda-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2025-08-16 07:29 UTC by Tomas Dabašinskas
Modified: 2025-09-03 16:15 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2025-08-27 15:55:39 UTC
Type: ---
Embargoed:


Attachments
anaconda_race_condition_logs.tar.gz (2.47 MB, application/gzip)
2025-08-16 07:31 UTC, Tomas Dabašinskas

Description Tomas Dabašinskas 2025-08-16 07:29:04 UTC
A race condition occurs when multiple DBus modules are started concurrently during anaconda initialization, specifically affecting the Localization module. The issue manifests as "Illegal instruction" errors when multiple processes simultaneously access the thread-unsafe langtable C extension library.

Reproducible: Sometimes

Steps to Reproduce:
Reproduction
Environment: Live media creation via livemedia-creator
Frequency: Intermittent (race condition dependent)
Log evidence: Found in live/mock-live-*/build/logs/anaconda/dbus.log
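
To triage a batch of builds, the dbus.log files can be scanned for the error signatures quoted in this report. This is a minimal sketch, not part of anaconda or livemedia-creator: the function name find_failed_builds and the SIGNATURES tuple are hypothetical, and the directory layout is assumed to match the log path given above.

```python
from pathlib import Path

# Error signatures quoted in this report (assumed to appear verbatim in dbus.log).
SIGNATURES = ("Illegal instruction", "exited with status 132")


def find_failed_builds(root):
    """Map each build directory name to the dbus.log lines matching a signature."""
    failed = {}
    pattern = "mock-live-*/build/logs/anaconda/dbus.log"
    for log in sorted(Path(root).glob(pattern)):
        hits = [
            line.strip()
            for line in log.read_text(errors="replace").splitlines()
            if any(sig in line for sig in SIGNATURES)
        ]
        if hits:
            # <root>/mock-live-XXXX/build/logs/anaconda/dbus.log:
            # parents[3] is the mock-live-XXXX directory.
            failed[log.parents[3].name] = hits
    return failed
```

Calling find_failed_builds("live") from the directory that holds the live/ tree returns only the builds whose dbus.log contains one of the signatures.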

Actual Results:
Build failures: ~40% of live media builds fail due to this race condition in anaconda


Expected Results:
Live media builds complete successfully.

Additional Information:
*Affected Component*

Module: pyanaconda/modules/localization/
Function: _build_layout_infos() in pyanaconda/localization.py
Library: langtable C extension

*Symptoms*

Failed builds: Show "Illegal instruction (core dumped)" errors in DBus logs
Successful builds: Complete without langtable-related errors
Error pattern: Service org.fedoraproject.Anaconda.Modules.Localization has failed to start: Process exited with status 132
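
For reference, status 132 matches the common 128 + N convention for a process killed by signal N, which gives signal 4 (SIGILL on Linux). The sketch below decodes such a status; describe_exit_status is a hypothetical helper, and it assumes the DBus launcher reports statuses using that convention.

```python
import signal


def describe_exit_status(status):
    """Translate a shell-style exit status into a readable description."""
    if status > 128:
        # By convention, 128 + N means the process was killed by signal N.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            return f"killed by unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited normally with code {status}"


print(describe_exit_status(132))  # on Linux: killed by SIGILL (signal 4)
```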

*Root Cause*

Concurrent module startup: StartModulesTask starts multiple DBus modules simultaneously
Thread-unsafe C extension: langtable library is not safe for concurrent access
Race condition: Multiple processes call _build_layout_infos() during LocalizationService initialization
Memory corruption: Simultaneous access to langtable's C functions causes "Illegal instruction" errors

*Code Locations*

Race trigger: pyanaconda/modules/localization/localization.py:82 - self._layout_infos = _build_layout_infos()
Concurrent startup: pyanaconda/modules/boss/module_manager/start_modules.py:125-135
langtable usage: pyanaconda/localization.py:397-420 - _build_layout_infos() function

*Impact*

Build failures: ~40% of live media builds fail due to this race condition
Installation blocking: Prevents successful system installation
Resource waste: Failed builds consume time and resources

*Proposed Solutions*

Add synchronization: Implement locks around langtable usage
Lazy initialization: Move _build_layout_infos() to on-demand execution
Process isolation: Restrict langtable access to main process only
Retry mechanism: Add automatic retry for failed module starts
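
The first two proposals could be combined roughly as follows. This is only a sketch under stated assumptions, not a tested patch: get_layout_infos, interprocess_lock and LOCK_PATH are hypothetical names, the _build_layout_infos() body here is a stand-in for the real function, and a file lock is used because the modules run as separate processes, where a threading lock would have no effect. Whether serializing the calls actually avoids the crash has not been verified.

```python
import fcntl
import functools
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock file shared by all module processes.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "anaconda-langtable.lock")


@contextmanager
def interprocess_lock(path):
    """Hold an exclusive advisory lock on path for the duration of the block."""
    with open(path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def _build_layout_infos():
    # Stand-in for the real function in pyanaconda/localization.py,
    # which queries langtable; the data here is made up.
    return {"us": "English (US)"}


@functools.lru_cache(maxsize=None)
def get_layout_infos():
    """Build the layout infos on first use, one process at a time."""
    with interprocess_lock(LOCK_PATH):
        return _build_layout_infos()
```

With this shape, LocalizationService would call get_layout_infos() when the data is first needed instead of building it in its constructor, so module startup itself no longer touches langtable.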

*Evidence Files*

Failed build logs: live/mock-live-l4564cra/build/logs/anaconda/dbus.log
Successful build logs: live/mock-live-rt9e6oqx/build/logs/anaconda/dbus.log
Source code: pyanaconda/modules/localization/localization.py

*Builds performed*

Build Results Summary with Timestamps

✅ SUCCESSFUL Builds (9/15 = 60%)
live/mock-live-ouzs7l0z ✅ 2025-08-15 20:30:31,096 - Results are in /build/results
live/mock-live-rt9e6oqx ✅ 2025-08-15 22:10:31,624 - Results are in /build/results
live/mock-live-4p_pgtlt ✅ 2025-08-15 23:16:53,699 - Results are in /build/results
live/mock-live-i4usnz84 ✅ 2025-08-16 00:22:39,141 - Results are in /build/results
live/mock-live-z84u4uj7 ✅ 2025-08-16 01:36:36,836 - Results are in /build/results
live/mock-live-1fivcx53 ✅ 2025-08-16 02:50:16,707 - Results are in /build/results
live/mock-live-6r6p6xnx ✅ 2025-08-16 04:03:33,094 - Results are in /build/results
live/mock-live-25p6kg_d ✅ 2025-08-16 05:17:10,352 - Results are in /build/results
live/mock-live-r8lwbe6s ✅ 2025-08-16 05:52:30,257 - Complete! (installation finished)

❌ FAILED Builds (6/15 = 40%)

live/mock-live-l4564cra ❌ 2025-08-15 16:15:51,069 - Running anaconda failed
live/mock-live-j8qj1qom ❌ 2025-08-15 21:06:53,178 - Running anaconda failed
live/mock-live-mnfl0m82 ❌ 2025-08-16 00:34:13,983 - Running anaconda failed
live/mock-live-0qtpriq3 ❌ 2025-08-16 01:47:45,440 - Running anaconda failed
live/mock-live-ahbl9ew7 ❌ 2025-08-16 03:01:47,062 - Running anaconda failed
live/mock-live-_arsq4gf ❌ 2025-08-16 04:15:01,157 - Running anaconda failed

Key Statistics
Total builds: 15
Successful: 9 (60%)
Failed: 6 (40%)
Failure rate: 40%

Timeline Analysis
Builds spanned: 2025-08-15 16:15 to 2025-08-16 05:52 (approximately 13.5 hours)
Failure pattern: Failures occurred throughout the timeline, not clustered
Success pattern: Successful builds also distributed across the timeline

Race condition: Intermittent failures are consistent with a race condition
The timestamps show failures and successes interleaved across the whole build window, with no clear pattern based on timing. This points to a timing-dependent race rather than a systematic issue such as a specific bad package set or a configuration change.

Comment 1 Tomas Dabašinskas 2025-08-16 07:31:36 UTC
Created attachment 2103817 [details]
anaconda_race_condition_logs.tar.gz

This archive contains sanitized logs and configuration files from 15 live media creation attempts that demonstrate a race condition in pyanaconda's Localization module. The data shows a clear pattern of intermittent failures (46.7% failure rate) caused by concurrent access to the thread-unsafe langtable C extension library.

Contents
15 build directories with complete logs from both successful and failed builds
Sanitized log files: build.log, anaconda.log, dbus.log, packaging.log, storage.log
Sanitized kickstart files: live-aarch64.ks (personal information removed)
Timing analysis: Build timestamps showing success/failure patterns
Error evidence: "Illegal instruction" errors in failed builds

Key Evidence
Race Condition Pattern: 7 failed builds vs 8 successful builds over ~13 hours
Error Signature: "Illegal instruction" errors in Localization module initialization
Root Cause: Concurrent access to langtable library during _build_layout_infos() calls
Reproducibility: Consistent failure pattern across multiple builds

Sanitization Applied
✅ Personal usernames replaced with [USERNAME]
✅ SSH keys replaced with [SSH_KEY_REDACTED]
✅ Hostnames replaced with [HOSTNAME]
✅ Password hashes replaced with [PASSWORD_HASH_REDACTED]
✅ Large debug logs replaced with explanatory messages
✅ Post-installation scripts removed from kickstart files

Technical Details
Compression: Maximum pigz compression (6.4x ratio)
Files: 265 total files, 2.5M compressed from 16M original
Time Range: 2025-08-15 16:15 to 2025-08-16 05:17
Architecture: aarch64 builds

Comment 2 Tomas Dabašinskas 2025-08-18 07:51:28 UTC
This is not limited to the Localization module:

    Build: mock-live-0_rl_ix1
    Failed Module: org.fedoraproject.Anaconda.Modules.Payloads
    Error: Process org.fedoraproject.Anaconda.Modules.Payloads exited with status 139 (Segmentation fault)
    Timing: Failed at 21:29:34 during concurrent module startup

    Build: mock-live-7d5thxk3
    Failed Module: org.fedoraproject.Anaconda.Modules.Storage
    Error: Process org.fedoraproject.Anaconda.Modules.Storage exited with status 132 (Illegal instruction)
    Timing: Failed at 22:36:41 during concurrent module startup

    Build: mock-live-funuh0c9
    Failed Module: org.fedoraproject.Anaconda.Modules.Services
    Error: Process org.fedoraproject.Anaconda.Modules.Services exited with status 132 (Illegal instruction)
    Timing: Failed at 21:25:53 during concurrent module startup

Comment 3 Adam Williamson 2025-08-27 15:55:39 UTC
This is a dupe, but can you add any new information to the original bug? Thanks for looking into it.

Note the bug doesn't affect Kiwi, so a good way to avoid this is, build your lives with Kiwi. :D

*** This bug has been marked as a duplicate of bug 2247319 ***

Comment 4 Brian Lane 2025-09-03 15:29:41 UTC
Is this really a dupe though? It looks to me like this report clearly finds the cause and suggests fixes, while the other bug is against libblockdev and not anaconda.
Also 'use kiwi' isn't really an option for some users, especially with something that's been working. It also seems that this bug could possibly hit regular installer users.

Comment 5 Adam Williamson 2025-09-03 16:15:01 UTC
It seems like pretty clearly the same bug to me, yeah: SIGILL when building live images on aarch64 (only) with livemedia-creator.

I agree that this bug seems to have a new diagnosis that's worth investigating, but that doesn't mean it's a different bug. Closing a later bug as a dupe of an earlier one isn't a judgement of how "good" the report is, it's just done to ensure the discussion doesn't split and everyone who is interested in the bug is aware of this new information.

The libblockdev assignment on the original bug was done in https://bugzilla.redhat.com/show_bug.cgi?id=2247319#c16 based on analysis of a backtrace that Kevin got, but after we dug through that for quite some time, we kinda started thinking around https://bugzilla.redhat.com/show_bug.cgi?id=2247319#c75 that we might be barking up the wrong tree. This definitely looks like a different tree to bark up.

> Also 'use kiwi' isn't really an option for some users

Is it not? We are kind of generally trying to push things lightly towards Kiwi, for various reasons: it's *one* tool that does a lot of the things we want, it's actively maintained by more than one person, there are multiple Fedora-aligned people who vaguely understand how it works (Neal, me, Kevin), and the kiwi template repository has *actual CI* so when we change something in it, we can tell whether it's broken before we break a real compose. If there are any remaining reasons to use lmc beyond "I'm used to it", I think we'd quite like to get rid of them for reasons far beyond this bug.

