Bug 1689037 - anaconda sometimes crashes with a signal 11 quite early in install process
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: anaconda
Version: 30
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Anaconda Maintenance Team
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: openqa
Keywords:
Duplicates: 1691016
Depends On:
Blocks: PPCTracker
Reported: 2019-03-15 02:03 UTC by Adam Williamson
Modified: 2019-03-29 10:39 UTC

Attachments
one of the backtraces (158.74 KB, text/plain)
2019-03-15 02:05 UTC, Adam Williamson
the other backtrace (385.98 KB, text/plain)
2019-03-15 02:05 UTC, Adam Williamson

Description Adam Williamson 2019-03-15 02:03:50 UTC
This is a bit of a fuzzy problem, but it definitely happens enough that it seems to be a real thing.

Sometimes, openQA tests fail because anaconda just suddenly dies, usually quite early in the install. The visible symptom is that the installer disappears and you get a black screen instead (but can switch to a tty successfully and poke about). The logs show it crashed on signal 11. A core dump is saved.

Here are two recent x86_64 tests that failed this way:

https://openqa.fedoraproject.org/tests/364044
https://openqa.fedoraproject.org/tests/363986

You can get log and core dump files from the 'Logs & Assets' tab for each test. I have downloaded the core dumps from each and run them through gdb. They produce similar but not identical backtraces, which I will attach, that *seem* to suggest the crash may be somewhere in GTK+: both run through gtk_css_static_style_compute_value.

This same sort of thing seems to happen very often on aarch64. For instance, for the same compose (Fedora-30-20190314.n.0), I can see at least four tests that failed in what looks like the same way on aarch64:

https://openqa.stg.fedoraproject.org/tests/494898
https://openqa.stg.fedoraproject.org/tests/494897
https://openqa.stg.fedoraproject.org/tests/494895
https://openqa.stg.fedoraproject.org/tests/494885

The core dumps can be found on the Logs & Assets tabs again, but I don't have backtraces as I don't have an aarch64 host handy to generate them on.

Comment 1 Adam Williamson 2019-03-15 02:05 UTC
Created attachment 1544247
one of the backtraces

Comment 2 Adam Williamson 2019-03-15 02:05 UTC
Created attachment 1544248
the other backtrace

Comment 3 Vendula Poncova 2019-03-15 10:04:12 UTC
From trace1:

#2  <signal handler called>
#3  0x0000000000000000 in ?? ()
#4  0x00007f11d718fda7 in gtk_css_static_style_compute_value at gtkcssstaticstyle.c:237

From trace2:

#2  <signal handler called>
#3  0x0000000000000000 in ?? ()
#4  0x00007f72777a35fa in gtk_css_value_position_compute at gtkcsspositionvalue.c:48
#5  0x00007f7277788ca8 in gtk_css_value_array_compute at gtkcssarrayvalue.c:59
#6  0x00007f72777b1da7 in gtk_css_static_style_compute_value at gtkcssstaticstyle.c:237

Based on the backtraces, the error is not triggered by the same code in Anaconda. The problem really seems to be in the function gtk_css_static_style_compute_value. Reassigning to gtk3.
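
For readers comparing the two traces: in both, the frame directly above the signal handler sits at address 0, which is consistent with a call through a null function pointer, and the next frame is the caller. A minimal Python sketch of pulling that caller out of gdb-style frame lines follows. The helper names (`parse_frames`, `caller_of_null_frame`) are made up for illustration, and the parser only handles the simple frame layout quoted in this comment, not every form gdb can print.

```python
import re

# Frame lines copied from the two traces quoted in this comment.
TRACE1 = """\
#2  <signal handler called>
#3  0x0000000000000000 in ?? ()
#4  0x00007f11d718fda7 in gtk_css_static_style_compute_value at gtkcssstaticstyle.c:237
"""

TRACE2 = """\
#2  <signal handler called>
#3  0x0000000000000000 in ?? ()
#4  0x00007f72777a35fa in gtk_css_value_position_compute at gtkcsspositionvalue.c:48
#5  0x00007f7277788ca8 in gtk_css_value_array_compute at gtkcssarrayvalue.c:59
#6  0x00007f72777b1da7 in gtk_css_static_style_compute_value at gtkcssstaticstyle.c:237
"""


def parse_frames(trace):
    """Return a list of (frame_number, address_or_None, function_text) tuples."""
    frames = []
    for line in trace.splitlines():
        frame = re.match(r"#(\d+)\s+(.*)", line.strip())
        if not frame:
            continue  # not a frame line
        number, rest = int(frame.group(1)), frame.group(2)
        with_addr = re.match(r"(0x[0-9a-fA-F]+) in (\S+)", rest)
        if with_addr:
            address = int(with_addr.group(1), 16)
            function = with_addr.group(2)
        else:
            # e.g. "<signal handler called>", which carries no address
            address, function = None, rest
        frames.append((number, address, function))
    return frames


def caller_of_null_frame(frames):
    """Return the function in the frame just below the first frame at address 0."""
    for index, (_number, address, _function) in enumerate(frames):
        if address == 0 and index + 1 < len(frames):
            return frames[index + 1][2]
    return None


if __name__ == "__main__":
    for name, trace in (("trace1", TRACE1), ("trace2", TRACE2)):
        print(name, caller_of_null_frame(parse_frames(trace)))
```

The two callers differ (one is gtk_css_static_style_compute_value itself, the other is gtk_css_value_position_compute further down the same call chain), which fits the "similar but not identical" description of the backtraces.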

Comment 4 Michael Catanzaro 2019-03-15 14:43:41 UTC
I've seen this many times before. It's never a problem in gtk_css_static_style_compute_value. Always turns out to be memory corruption in some unrelated code. Could be Anaconda, could be GTK, but the backtrace is almost certainly useless. You need to catch this under asan or valgrind to have any chance.
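
As a rough illustration of the valgrind route, here is a small Python sketch that only assembles a memcheck command line and environment; it does not run anything. The valgrind options and the GLib variables G_SLICE and G_DEBUG are real, but the target command (`["anaconda"]`), the log path, and the helper names are assumptions made for this example. How the installer is actually started inside the install environment is not covered in this bug.

```python
import os


def valgrind_command(target_argv, log_file="/tmp/valgrind-%p.log"):
    """Build an argv list that runs target_argv under valgrind's memcheck tool.

    %p in the log file name is expanded by valgrind to the process ID.
    """
    return [
        "valgrind",
        "--tool=memcheck",
        "--track-origins=yes",    # report where uninitialised values came from
        "--num-callers=50",       # deeper stacks, useful for GTK call chains
        "--trace-children=yes",   # follow child processes
        f"--log-file={log_file}",
        *target_argv,
    ]


def valgrind_env(base_env=None):
    """Return a copy of the environment with GLib settings that help memcheck."""
    env = dict(os.environ if base_env is None else base_env)
    env["G_SLICE"] = "always-malloc"  # bypass the GLib slice allocator
    env["G_DEBUG"] = "gc-friendly"    # zero freed memory in GLib containers
    return env


if __name__ == "__main__":
    # "anaconda" here is a placeholder for however the installer is launched.
    print(" ".join(valgrind_command(["anaconda"])))
```

The resulting list and environment could be passed to `subprocess.run(cmd, env=env)`. Since the crash is intermittent, several runs would probably be needed before memcheck reports the corrupting write.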

Comment 5 Michael Catanzaro 2019-03-15 14:56:59 UTC
Company says: "the GTK CSS stack does a lot of memory allocations, so it's always a common place where corruptions are found"

It's just something we've learned again and again the hard way. Very hard. Memory corruption is the worst. :(

Comment 6 Adam Williamson 2019-03-15 15:07:08 UTC
For the record, it seems debugging memory corruption is very hard, especially for a non-native distribution installer :/ We need to run it through valgrind and hit the crash, apparently.

Owen asked when this started happening, and if we assume the aarch64 issue is the same thing, I *think* we can pin it down to some time between 2018-10-28 and 2018-11-14 in Rawhide; I can't see any aarch64 fails that look like this bug in Fedora-Rawhide-20181028.n.0 or earlier composes, while I *do* see multiple failures that look like this (at least, sudden black screen early in the install process - the logs have aged out, unfortunately) in Fedora-Rawhide-20181114.n.0.

Unfortunately bisecting the packages that changed between those dates may be hard as we probably don't have a 20181114.n.0 tree or images lying around anywhere to work from :/ releng cleans out the nightly composes every couple of weeks to save space.
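
If the intermediate composes were still available, the search itself would be a plain bisection over the ordered compose IDs. The Python sketch below assumes one nightly compose per day named like the two IDs mentioned above (the intermediate IDs are hypothetical) and a caller-supplied `is_bad` check. Because the crash is intermittent, a real `is_bad` would have to repeat the install several times before calling a compose good.

```python
from datetime import date, timedelta


def nightly_ids(start, end):
    """Generate hypothetical Rawhide nightly compose IDs, one per day, inclusive."""
    ids = []
    day = start
    while day <= end:
        ids.append(f"Fedora-Rawhide-{day:%Y%m%d}.n.0")
        day += timedelta(days=1)
    return ids


def first_bad_compose(composes, is_bad):
    """Bisect an oldest-to-newest list of composes.

    Assumes composes[0] is known good, composes[-1] is known bad, and that
    every compose after the first bad one is also bad.
    """
    good, bad = 0, len(composes) - 1
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(composes[middle]):
            bad = middle
        else:
            good = middle
    return composes[bad]


if __name__ == "__main__":
    composes = nightly_ids(date(2018, 10, 28), date(2018, 11, 14))

    # Stand-in for real testing: pretend the regression landed on 2018-11-05.
    def pretend_bad(compose_id):
        return compose_id >= "Fedora-Rawhide-20181105.n.0"

    print(first_bad_compose(composes, pretend_bad))
```

For the 18 nightlies in that window this needs at most five compose checks, each of which may be several install attempts.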

Comment 7 Dan Horák 2019-03-28 14:09:56 UTC
Adam, this reminds me of my ticket https://pagure.io/releng/issue/7763 for defining a retention policy for older composes. Being able to bisect between composes would be useful here.

Comment 8 Michel Normand 2019-03-29 10:28:34 UTC
*** Bug 1691016 has been marked as a duplicate of this bug. ***
