Bug 1689037 - anaconda sometimes crashes with a signal 11 quite early in install process
Keywords:
Status: ASSIGNED
Alias: None
Product: Fedora
Classification: Fedora
Component: anaconda
Version: 33
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Anaconda Maintenance Team
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: openqa
Duplicates: 1691016
Depends On:
Blocks: PPCTracker
 
Reported: 2019-03-15 02:03 UTC by Adam Williamson
Modified: 2020-08-11 15:20 UTC (History)
CC List: 24 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug


Attachments
one of the backtraces (158.74 KB, text/plain)
2019-03-15 02:05 UTC, Adam Williamson
the other backtrace (385.98 KB, text/plain)
2019-03-15 02:05 UTC, Adam Williamson
better backtrace(?) (615.61 KB, text/plain)
2019-07-10 23:50 UTC, Adam Williamson
slightly different backtrace from commit 4035158de46ce373b7521daf61c5b6df83312968 (660.45 KB, text/plain)
2019-07-11 23:00 UTC, Adam Williamson

Description Adam Williamson 2019-03-15 02:03:50 UTC
This is a bit of a fuzzy problem, but it definitely happens enough that it seems to be a real thing.

Sometimes, openQA tests fail because anaconda just suddenly dies, usually quite early in the install. The visible symptom is that the installer disappears and you get a black screen instead (but can switch to a tty successfully and poke about). The logs show it crashed on signal 11. A core dump is saved.

Here are two recent x86_64 tests that failed this way:

https://openqa.fedoraproject.org/tests/364044
https://openqa.fedoraproject.org/tests/363986

You can get log and core dump files from the 'Logs & Assets' tab for each test. I have downloaded the core dumps from each and run them through gdb. They produce similar but not identical backtraces, which I will attach, that *seem* to suggest the crash may be in GTK+ somewhere - both seem to run through gtk_css_static_style_compute_value .

This same sort of thing seems to happen very often on aarch64. For instance, for the same compose (Fedora-30-20190314.n.0), I can see at least four tests that failed in what looks like the same way on aarch64:

https://openqa.stg.fedoraproject.org/tests/494898
https://openqa.stg.fedoraproject.org/tests/494897
https://openqa.stg.fedoraproject.org/tests/494895
https://openqa.stg.fedoraproject.org/tests/494885

The core dumps can be found on the Logs & Assets tabs again, but I don't have backtraces as I don't have an aarch64 host handy to generate them on.

Comment 1 Adam Williamson 2019-03-15 02:05:09 UTC
Created attachment 1544247 [details]
one of the backtraces

Comment 2 Adam Williamson 2019-03-15 02:05:27 UTC
Created attachment 1544248 [details]
the other backtrace

Comment 3 Vendula Poncova 2019-03-15 10:04:12 UTC
From trace1:

#2  <signal handler called>
#3  0x0000000000000000 in ?? ()
#4  0x00007f11d718fda7 in gtk_css_static_style_compute_value at gtkcssstaticstyle.c:237

From trace2:

#2  <signal handler called>
#3  0x0000000000000000 in ?? ()
#4  0x00007f72777a35fa in gtk_css_value_position_compute at gtkcsspositionvalue.c:48
#5  0x00007f7277788ca8 in gtk_css_value_array_compute at gtkcssarrayvalue.c:59
#6  0x00007f72777b1da7 in gtk_css_static_style_compute_value at gtkcssstaticstyle.c:237

Based on the backtraces, the error is not triggered by the same code in Anaconda. The problem really seems to be in the function gtk_css_static_style_compute_value. Reassigning to gtk3.

Comment 4 Michael Catanzaro 2019-03-15 14:43:41 UTC
I've seen this many times before. It's never a problem in gtk_css_static_style_compute_value. Always turns out to be memory corruption in some unrelated code. Could be Anaconda, could be GTK, but the backtrace is almost certainly useless. You need to catch this under asan or valgrind to have any chance.

Comment 5 Michael Catanzaro 2019-03-15 14:56:59 UTC
Company says: "the GTK CSS stack does a lot of memory allocations, so it's always a common place where corruptions are found"

It's just something we've learned again and again the hard way. Very hard. Memory corruption is the worst. :(

Comment 6 Adam Williamson 2019-03-15 15:07:08 UTC
For the record, it seems debugging memory corruption is very hard, especially for a non-native distribution installer :/ We need to run it through valgrind and hit the crash, apparently.

Owen asked when this started happening, and if we assume the aarch64 issue is the same thing, I *think* we can pin it down to some time between 2018-10-28 and 2018-11-14 in Rawhide; I can't see any aarch64 fails that look like this bug in Fedora-Rawhide-20181028.n.0 or earlier composes, while I *do* see multiple failures that look like this (at least, sudden black screen early in the install process - the logs have aged out, unfortunately) in Fedora-Rawhide-20181114.n.0.

Unfortunately bisecting the packages that changed between those dates may be hard as we probably don't have a 20181114.n.0 tree or images lying around anywhere to work from :/ releng cleans out the nightly composes every couple of weeks to save space.

Comment 7 Dan Horák 2019-03-28 14:09:56 UTC
Adam, it reminds me of my ticket https://pagure.io/releng/issue/7763 for defining a retention policy for older composes, to allow bisecting between composes as would be useful here.

Comment 8 Michel Normand 2019-03-29 10:28:34 UTC
*** Bug 1691016 has been marked as a duplicate of this bug. ***

Comment 9 Adam Williamson 2019-07-10 23:07:10 UTC
So, I may actually have got *somewhere* with this. I came up with an anaconda updates.img that (if I got it right, anyway) runs anaconda with PYTHONMALLOC=debug (per https://stackoverflow.com/questions/20112989/how-to-use-valgrind-with-python - how we ever did our jobs before Stack Overflow I have no idea), then I hacked up openQA staging to run 8 ppc64le install tests at a time with that updates image and triggered it until I hit the bug:

https://openqa.stg.fedoraproject.org/tests/566449

the system logs give us this tantalizing traceback, which looks a lot more useful than the previous one:

22:48:39,818 CRIT systemd-coredump:Process 2145 (anaconda) of user 0 dumped core.

Stack trace of thread 2145:
#0  0x00007fff9d8f1ddc malloc (libc.so.6)
#1  0x00007fff6cf924f4 n/a (librsvg-2.so.2)
#2  0x00007fff6ccb4b90 n/a (librsvg-2.so.2)
#3  0x00007fff6ccb5704 n/a (librsvg-2.so.2)
#4  0x00007fff6ccb5214 n/a (librsvg-2.so.2)
#5  0x00007fff9d89ba14 __call_tls_dtors (libc.so.6)
#6  0x00007fff9d89afd8 __run_exit_handlers (libc.so.6)
#7  0x00007fff9d89b038 exit (libc.so.6)
#8  0x00007fff8cb51510 sync_signal_handler (_isys.so)
#9  0x00007fff9da704d8 __kernel_sigtramp_rt64 (linux-vdso64.so.1)
#10 0x00007fff811e7660 hb_blob_destroy (libharfbuzz.so.0)
#11 0x00007fff812b1068 _hb_graphite2_shaper_face_data_destroy (libharfbuzz.so.0)
#12 0x00007fff812aaebc hb_shape_plan_create_cached2 (libharfbuzz.so.0)
#13 0x00007fff812abc9c hb_shape_full (libharfbuzz.so.0)
#14 0x00007fff812abd2c hb_shape (libharfbuzz.so.0)
#15 0x00007fff813219a4 n/a (libpangoft2-1.0.so.0)
#16 0x00007fff81318b14 n/a (libpangoft2-1.0.so.0)
#17 0x00007fff81a34a6c n/a (libpango-1.0.so.0)
#18 0x00007fff81a4f758 pango_shape_full (libpango-1.0.so.0)
#19 0x00007fff81a3a328 n/a (libpango-1.0.so.0)
#20 0x00007fff81a3cb4c n/a (libpango-1.0.so.0)
#21 0x00007fff81a40260 n/a (libpango-1.0.so.0)
#22 0x00007fff81a42678 n/a (libpango-1.0.so.0)
#23 0x00007fff718c31d0 gtk_cell_renderer_text_get_preferred_width (libgtk-3.so.0)
#24 0x00007fff718b7ab0 gtk_cell_renderer_get_preferred_width (libgtk-3.so.0)
#25 0x00007fff718aa124 gtk_cell_area_request_renderer (libgtk-3.so.0)
#26 0x00007fff718ab348 compute_size (libgtk-3.so.0)
#27 0x00007fff718ad9b4 gtk_cell_area_box_get_preferred_width (libgtk-3.so.0)
#28 0x00007fff718a47c0 gtk_cell_area_get_preferred_width (libgtk-3.so.0)
#29 0x00007fff71c1469c gtk_tree_view_column_cell_get_size (libgtk-3.so.0)
#30 0x00007fff71bf1974 validate_row (libgtk-3.so.0)
#31 0x00007fff71bfc988 do_validate_rows (libgtk-3.so.0)
#32 0x00007fff71bfd35c gtk_tree_view_get_preferred_width (libgtk-3.so.0)
#33 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#34 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#35 0x00007fff71b02214 gtk_scrolled_window_measure (libgtk-3.so.0)
#36 0x00007fff718fd9dc gtk_css_custom_gadget_get_preferred_size (libgtk-3.so.0)
#37 0x00007fff71904434 gtk_css_gadget_get_preferred_size (libgtk-3.so.0)
#38 0x00007fff71afcd38 gtk_scrolled_window_get_preferred_width (libgtk-3.so.0)
#39 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#40 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#41 0x00007fff719d28a4 gtk_grid_request_run (libgtk-3.so.0)
#42 0x00007fff719d2bcc gtk_grid_get_size (libgtk-3.so.0)
#43 0x00007fff718fd9dc gtk_css_custom_gadget_get_preferred_size (libgtk-3.so.0)
#44 0x00007fff71904434 gtk_css_gadget_get_preferred_size (libgtk-3.so.0)
#45 0x00007fff719cf8b8 gtk_grid_get_preferred_width (libgtk-3.so.0)
#46 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#47 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#48 0x00007fff7187c93c gtk_box_get_content_size (libgtk-3.so.0)
#49 0x00007fff718fd9dc gtk_css_custom_gadget_get_preferred_size (libgtk-3.so.0)
#50 0x00007fff71904434 gtk_css_gadget_get_preferred_size (libgtk-3.so.0)
#51 0x00007fff7187d928 gtk_box_get_preferred_width (libgtk-3.so.0)
#52 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#53 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#54 0x00007fff717c9a20 gtk_alignment_get_preferred_size (libgtk-3.so.0)
#55 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#56 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#57 0x00007fff7187c93c gtk_box_get_content_size (libgtk-3.so.0)
#58 0x00007fff718fd9dc gtk_css_custom_gadget_get_preferred_size (libgtk-3.so.0)
#59 0x00007fff71904434 gtk_css_gadget_get_preferred_size (libgtk-3.so.0)
#60 0x00007fff7187d928 gtk_box_get_preferred_width (libgtk-3.so.0)
#61 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#62 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#63 0x00007fff718756f0 gtk_bin_get_preferred_width (libgtk-3.so.0)

Stack trace of thread 2208:
#0  0x00007fff9d971a7c __poll (libc.so.6)
#1  0x00007fff8f352a08 g_poll (libglib-2.0.so.0)
#2  0x00007fff8f33b038 g_main_context_iterate.isra.0 (libglib-2.0.so.0)
#3  0x00007fff8f33b1e8 g_main_context_iteration (libglib-2.0.so.0)
#4  0x00007fff8f33b28c glib_worker_main (libglib-2.0.so.0)
#5  0x00007fff8f37963c g_thread_proxy (libglib-2.0.so.0)
#6  0x00007fff9d4199a8 start_thread (libpthread.so.0)
#7  0x00007fff9d981d18 __clone (libc.so.6)

i.e. (unless I'm way off, which is not unpossible!) it looks like we're crashing on a malloc in libc, via librsvg. Significantly, librsvg is something that *did* change between 2018-10-28 and 2018-11-14: exactly on 2018-11-14 it went from librsvg2-2.44.8-1.fc30 to librsvg2-2.44.9-1.fc30, and that build was in the 20181114.n.0 compose when this bug seems to have started happening. And also significantly, there *do* seem to be some possibly-relevant changes between 2.44.8 and 2.44.9, like these:

https://github.com/GNOME/librsvg/commit/c81739dc1049218e44283d65132af7d8d1a66386
https://github.com/GNOME/librsvg/commit/c353e713ae9e3d5c6ef42e17d787c9b02f641b8f

Obviously it'd be good if I could get a full traceback, but that's made a bit complicated because I'm reproducing the bug on ppc64le (it happens *way* more often there than on x86_64) but do not have a native ppc64le environment handy right now to get a traceback. I will try and sort that out somehow, and I might also try reverting suspicious-looking commits from the 2.44.8 to 2.44.9 range in librsvg to see if that makes the bug go away. If anyone else wants to poke at it, the coredump is at https://openqa.stg.fedoraproject.org/tests/566449/file/_boot_to_anaconda-anaconda.core.tar.gz .

Comment 10 Adam Williamson 2019-07-10 23:50:18 UTC
Created attachment 1589256 [details]
better backtrace(?)

Hmm, well, when I get a backtrace out of gdb (realized I could do it in a mock env on the worker host), the librsvg bits don't show up. Not sure why not. But the harfbuzz stuff does. So I'm fiddling about with harfbuzz. Here's the backtrace I got.

Comment 11 Adam Williamson 2019-07-11 00:12:04 UTC
Hmm. I think I somehow messed up the Rawhide compose range where this seems to have started happening, above. I believe it's actually between 20181021.n.0 and 20181120.n.0.

Comment 12 Adam Williamson 2019-07-11 00:41:20 UTC
OK, so I think I have a suspect! I think it's harfbuzz 2.0.0.

We got a bit lucky: harfbuzz's API and dependencies have apparently stayed sufficiently static for the last several months that you can just drop the harfbuzz 1.8.8 package into an anaconda updates image and it works. So I can test a current Rawhide image, but with an "updates" image which overwrites the harfbuzz files with the files from harfbuzz-1.8.8-1.fc30. So, I did that, and ran the test 32 times (so far): it has not crashed once. Before this, I got at least one failure in every 16 attempts, usually at least one in every 8. That seems a pretty strong indicator that we're looking at a change between harfbuzz 1.8.8 and 2.0.0 as the culprit here.

Re-assigning to harfbuzz at least till someone tells me I'm wrong. :D I'll try and bisect this further (but it'll be tomorrow unless we get very lucky, as I have to go out in 30 mins or so).

Comment 13 Adam Williamson 2019-07-11 00:43:04 UTC
Oh, forgot to mention, I also did a similar test but dropping in the files from harfbuzz-2.0.0-1.fc30, and in *that* case the crash still happens. That's why I think the bug is specifically between 1.8.8 and 2.0.0.

Comment 14 Behdad Esfahbod 2019-07-11 18:58:15 UTC
Nothing pops out to me.  But you definitely should try with the latest HarfBuzz (2.5.3) and see if it fixes that.  It should be a trivial drop-in replacement.

Comment 15 Behdad Esfahbod 2019-07-11 18:59:32 UTC
_hb_graphite2_shaper_face_data_destroy

That's suspect...  I mean.  We should not be using Graphite fonts for boot for sure.

Comment 16 Adam Williamson 2019-07-11 20:27:21 UTC
behdad: we know it happens with 2.5.3, because that's been in Rawhide since 2019-06-27, and this is still happening commonly to Rawhide tests since then. Here it is happening on the most recent compose, for example:

https://openqa.stg.fedoraproject.org/tests/566132

I'm still working on bisecting this; I screwed up my first bisect run somehow, probably didn't throw in enough repeats of every build. I'm giving it another shot now.

Comment 17 Behdad Esfahbod 2019-07-11 20:36:43 UTC
I see.

My suspicion is this commit:

  https://github.com/harfbuzz/harfbuzz/commit/e4e74c2751ac24178086cce2811d34d8019b6f85

But seriously, the real question is why we get into graphite at all.

Comment 18 Behdad Esfahbod 2019-07-11 20:37:29 UTC
Multiple threads involved?  That would definitely make sense with the code I linked to being problematic.
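
To illustrate why multiple threads would matter here -- this is a hypothetical C++ sketch, not HarfBuzz's actual code -- lazily created, globally shared state (such as cached shaper face data) is only safe if its creation and teardown are synchronized; a plain unguarded pointer check lets two threads allocate, and lets teardown free an object another thread is still using, which matches the crash inside _hb_graphite2_shaper_face_data_destroy in the backtrace above:

```cpp
#include <atomic>

// Illustrative stand-in for lazily initialized shaper data.
struct face_data { int value; };

static std::atomic<face_data*> cache{nullptr};

// A safe variant of lazy init: if two threads race past the null check,
// compare_exchange_strong ensures exactly one allocation is installed and
// the loser frees its copy. Replacing the CAS with a plain store (or using
// a non-atomic pointer) reintroduces the leak/double-free window.
face_data* get_face_data() {
  face_data* d = cache.load(std::memory_order_acquire);
  if (!d) {
    d = new face_data{1};
    face_data* expected = nullptr;
    if (!cache.compare_exchange_strong(expected, d)) {
      delete d;        // lost the race; use the winner's object
      d = expected;
    }
  }
  return d;
}
```
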

Comment 19 Behdad Esfahbod 2019-07-11 20:45:52 UTC
Filed https://github.com/harfbuzz/harfbuzz/issues/1829

Comment 20 Matthias Clasen 2019-07-11 20:55:20 UTC
graphite might be in the picture because we pick a font such as Gentium for some off-color character. I've seen that happening for 0x2028 (line separator), recently.

Comment 21 Matthias Clasen 2019-07-11 21:01:31 UTC
Here is the text that is shaped: မြန်မာ

Comment 22 Behdad Esfahbod 2019-07-11 21:05:18 UTC
Okay maybe that's picking up Padauk graphite font.

Comment 23 Adam Williamson 2019-07-11 21:59:02 UTC
FWIW my current bisect does not have that commit in its range. At present the range is bee93e269711a3eda4e7d762b730522564fe6e87 to 7003b601afd02b0ba7e839510a7d0b886da09aaa . It's really tricky to have confidence in the results, as the bug doesn't happen *super* often - I'm currently running 40 tests on each revision, and sometimes for a 'bad' revision I only get 1 failure. I'm 100% confident in the 'bad' results; the 'good' results are a bit questionable, and I might have to go up even further to 80. But that's what I have ATM.

At least it seems pretty certain the bad commit is before 7003b601afd02b0ba7e839510a7d0b886da09aaa .
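
The confidence in a "good" verdict can be quantified with a quick back-of-the-envelope calculation (a sketch, assuming each run fails independently with some probability p; the roughly 1-in-16 failure rate is taken from the numbers above):

```cpp
#include <cmath>

// If a "bad" revision fails each run independently with probability p,
// the chance that all n runs pass anyway (a false "good" verdict) is
// (1 - p)^n. With p = 1/16, 40 runs still leave about a 7.6% chance of
// wrongly calling a bad revision good, which is why more repeats help.
double false_good_probability(double p, int n) {
  return std::pow(1.0 - p, n);
}
```
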

Comment 24 Adam Williamson 2019-07-11 23:00:33 UTC
Created attachment 1589700 [details]
slightly different backtrace from commit 4035158de46ce373b7521daf61c5b6df83312968

Still bisecting, but an interesting result: with commit 4035158de46ce373b7521daf61c5b6df83312968 we get what looks like the same failure, but with a slightly different backtrace. It still involves _hb_graphite2_shaper_face_data_destroy , it's just a bit of a different path.

Comment 25 Adam Williamson 2019-07-12 00:24:45 UTC
e640f3a6b16f41cee5f7868ec738fda01244e96a crashes the same way as 4035158de46ce373b7521daf61c5b6df83312968 .

So...my bisection hit a somewhat surprising result. It pretty strongly says that this commit is the cause:

https://github.com/harfbuzz/harfbuzz/commit/e4f27f368f8f0509fa47f6a28f3984e90b40588f

We have a definite fail with that commit:

https://openqa.stg.fedoraproject.org/tests/567465

With the previous commit, bee93e269711a3eda4e7d762b730522564fe6e87 , I have tried the test 120 times now - because I was so surprised at this result - and it has not failed once. So I'm really pretty sure this is it.

On the face of it, all this does is move a struct definition out from being inline in _hb_ot_shape_fallback_kern , if I'm reading it right. There's no obvious functional change at all.

However, having stared at it until I went cross-eyed...and bearing in mind that my C is pretty shaky and I am sort of applying knowledge from Python scoping here, which for all I know is completely different...is it possible that the difference could be to do with 'font'? 'font' is the name assigned to one of the arguments for `_hb_ot_shape_fallback_kern`, and then - again, based on my very shaky C knowledge - in the old code, the inline struct definition did some stuff with 'font'. Which...I dunno C scoping, but wouldn't that be the 'font' that was passed in as an argument? Whereas once the struct definition is taken out of line, it wouldn't have that 'font' in scope any more?

Again, I may be way off here, that's just all I could think of based on my limited knowledge. If this really doesn't seem to make any sense, I can try the bisection *again*, but at this point the result seems pretty solid.
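
For what it's worth, the scoping worry can be checked against the language rules: in C++ (which HarfBuzz is written in), a struct defined inside a function is not a closure and cannot implicitly read the enclosing function's parameters, so any 'font' it uses must be passed in explicitly either way. A minimal sketch with hypothetical names, not HarfBuzz's real code:

```cpp
// Illustrative stand-in for a font object.
struct font_t { int scale; };

// "Before": the struct lives inside the function. A local class in C++ has
// no implicit access to the enclosing 'font' parameter; the data it needs
// must be stored or passed explicitly.
static int kern_inline(font_t* font) {
  struct machine_t {
    font_t* font;                       // passed in explicitly
    int run() const { return font->scale * 2; }
  };
  machine_t m{font};
  return m.run();
}

// "After": the same struct, moved to file scope. Since the local version
// never captured anything from its enclosing scope, the two variants are
// semantically identical.
struct machine_t {
  font_t* font;
  int run() const { return font->scale * 2; }
};

static int kern_outofline(font_t* font) {
  machine_t m{font};
  return m.run();
}
```
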

Comment 26 Behdad Esfahbod 2019-07-12 00:50:38 UTC
Not really.  It's exact same code.

Try bisecting again?  Skip a few commits forward / backward?

Comment 27 Adam Williamson 2019-07-12 01:06:59 UTC
Yeah, it's the same code, which is why I got to thinking the *things it's working with* may be different, i.e. scoping. But it's only an idea.

So, I tried doing a build of 2.5.3 with a patch that basically 'reverts' e4f27f by moving the struct definition back inline...and it hits the bug. So now I'm just entirely baffled, and I've spent the whole day on this. Fun!

I'm going to re-do the tests of bee93e26 and e4f27f by hand just in case my test script somehow screwed up...

Comment 28 Adam Williamson 2019-07-12 01:57:28 UTC
OK, on the manual re-run I got a fail for bee93e26. So, back to bisecting...

Comment 29 Jens Petersen 2019-07-12 06:52:53 UTC
I am a bit confused: which arches are affected, exactly? Only aarch64 and ppc64le?

What happens if you remove say all other fonts than /usr/share/fonts/sil-padauk/ ?

Comment 30 Adam Williamson 2019-07-12 14:18:29 UTC
It seems to affect all arches, but it happens *much more often* on ppc64le and aarch64, which is why I mainly use them for investigation/reproduction. It only happens very, very occasionally on x86_64.

I'll try removing other fonts in a bit, still trying to get a proper bisect first. I'm now up to running the test 160 times on every tested revision...

Comment 31 Adam Williamson 2019-07-12 18:02:12 UTC
My current bisect looks like it's gonna land on the same commit Behdad identified - e4e74c2751ac24178086cce2811d34d8019b6f85 .

Comment 32 Adam Williamson 2019-07-12 22:34:48 UTC
OK, indeed, as expected, with 200 runs of the tests on every frickin' commit, my bisect comes down to e4e74c2751ac24178086cce2811d34d8019b6f85 . I've also just confirmed that building the current Rawhide package with a manual revert of that patch avoids the bug: ran that test 200 times as well, and it passed every one.

I've sent an official Rawhide build with the revert, since it'd be nice to not have this flake happening to the aarch64/ppc64le tests. Once it's fixed properly upstream we can drop the revert and pull the fix instead.

Comment 33 Behdad Esfahbod 2019-07-12 22:37:13 UTC
Thanks.  I'm reverting upstream until we figure out a proper fix.

Comment 34 Behdad Esfahbod 2019-07-12 22:40:36 UTC
I still can't quite reason why that piece of code becomes a problem.  Are multiple threads involved?  That's the only way I can see this *possibly* related.  Even then not sure why.

Comment 35 Adam Williamson 2019-07-13 04:03:33 UTC
anaconda did get redesigned, about a year ago, into modules that communicate via dbus:

https://fedoraproject.org/wiki/Changes/AnacondaModularization

I'm guessing that could possibly be involved?

Comment 36 Dan Horák 2019-07-13 06:48:25 UTC
(In reply to Behdad Esfahbod from comment #34)
> I still can't quite reason why that piece of code becomes a problem.  Are
> multiple threads involved?  That's the only way I can see this *possibly*
> related.  Even then not sure why.

yes, anaconda is a multithreaded application with each spoke handled by a separate thread

And awesome work, Adam, thanks :-)

Comment 37 Adam Williamson 2019-09-13 20:25:29 UTC
So, something like this is happening again :/

But this time even with PYTHONMALLOC=debug set, the backtrace is in gtk_css:

#0  0x0000ffff81bd34c4 in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0xffffef7a8d6c, 
    options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1  0x0000ffff7184f154 in sync_signal_handler (signum=<optimized out>) at isys.c:143
#2  <signal handler called>
#3  0x0000000000000000 in ?? ()
#4  0x0000ffff60221f30 in gtk_css_static_style_compute_value (style=0xaaaaf1fc6660, provider=0xaaaaf4d71260, 
    parent_style=0xaaaaf2a00be0, id=52, specified=0xffff64030a40, section=0x0) at gtkcssstaticstyle.c:237
#5  0x0000ffff6020dadc in _gtk_css_lookup_resolve (lookup=lookup@entry=0xaaaaf5431d90, 
    provider=provider@entry=0xaaaaf4d71260, style=style@entry=0xaaaaf1fc6660, 
    parent_style=parent_style@entry=0xaaaaf2a00be0) at gtkcsslookup.c:122
#6  0x0000ffff60221e30 in gtk_css_static_style_new_compute (provider=0xaaaaf4d71260, 
    matcher=matcher@entry=0xffffef7aa278, parent=parent@entry=0xaaaaf2a00be0) at gtkcssstaticstyle.c:195
#7  0x0000ffff6020fff0 in gtk_css_node_create_style (cssnode=0xaaaaf4e22740) at gtkcssnode.c:371
#8  gtk_css_node_real_update_style (cssnode=0xaaaaf4e22740, change=8598372560, timestamp=107356270, 
    style=0xaaaaf52fa890) at gtkcssnode.c:425
#9  0x0000ffff6020eeb4 in gtk_css_node_ensure_style (cssnode=0xaaaaf4e22740, 
    current_time=current_time@entry=107356270) at gtkcssnode.c:1007

etc. etc. Is this still useless and indicative of memory corruption we're not finding?

Comment 38 Michael Catanzaro 2019-09-13 23:05:04 UTC
(In reply to Adam Williamson from comment #37)
> etc. etc. Is this still useless and indicative of memory corruption we're
> not finding?

Yes indeed, sadly:

(In reply to Michael Catanzaro from comment #4)
> I've seen this many times before. It's never a problem in
> gtk_css_static_style_compute_value. Always turns out to be memory corruption
> in some unrelated code. Could be Anaconda, could be GTK, but the backtrace
> is almost certainly useless. You need to catch this under asan or valgrind
> to have any chance.

Memory corruption is the absolute worst. Very hard to track down. :/

Comment 39 Adam Williamson 2019-09-16 15:21:42 UTC
Hmm, here it is apparently happening on x86_64 even:

https://openqa.fedoraproject.org/tests/451841

but it seems we didn't store the coredump on that occasion :(

sigh, I love these bugs. I did check if harfbuzz regressed, but it doesn't look like it (the revert still looks to be applied).

Comment 40 Jens Petersen 2019-11-04 05:09:41 UTC
It still happens with the F31 final release, presumably?

Comment 41 Adam Williamson 2019-11-04 15:56:07 UTC
The new crash still happens, yes. The harfbuzz one is still fixed by the reversion, at least last I checked.

Comment 42 Adam Williamson 2020-03-13 00:51:32 UTC
We still see this failure case quite commonly on aarch64 in openQA. Commonly enough that I'm writing a hack into the openQA package to restart all aarch64 tests that fail on the first module :/ Up to and including Rawhide.

Comment 43 Adam Williamson 2020-03-13 00:52:43 UTC
moving to gtk3 for now as we have no reason to suspect harfbuzz and I don't really know what else to point at.

Comment 44 Michael Catanzaro 2020-03-13 13:25:31 UTC
Again, this is memory corruption, so the provided backtraces are not actionable and do not indicate anything wrong in GTK. I've seen crashes in gtk_css_static_style_compute_value() many times and it *always* turns out to be the application corrupting memory somehow. The GTK CSS machinery shows up in the backtrace just because it gets called very frequently, but the game was lost much earlier when the memory corruption first occurred.

Moving this back to anaconda for now, as that's the only reasonable component to use until we know where the memory corruption is actually occurring. I doubt the problem is somewhere in anaconda's codebase, because anaconda is written in python, but until we know where the memory corruption is coming from, there's no better component to assign the bug to. The problem could be anywhere in any library that anaconda uses (most likely), or in the python interpreter itself (much less likely). It could even be somewhere in GTK (just not where the backtrace is pointing to). We're not going to find out without either (a) running anaconda under valgrind, or (b) asan builds of everything (python and every library anaconda links to). Obviously (a) would be easier.

Comment 45 Adam Williamson 2020-03-13 15:40:23 UTC
"I doubt the problem is somewhere in anaconda's codebase, because anaconda is written in python, but until we know where the memory corruption is coming from, there's no better component to assign the bug to."

I figured gtk3 was a better catch-all than anaconda for precisely this reason :) but it doesn't really matter, it's just that it needs to be assigned *somewhere*.

Comment 46 Dan Horák 2020-03-13 15:50:02 UTC
option (a) should be doable with an updates image using the following change in anaconda

diff --git a/data/tmux.conf b/data/tmux.conf
index 87c9cb7c7..ac5f5cfbb 100644
--- a/data/tmux.conf
+++ b/data/tmux.conf
@@ -23,7 +23,7 @@ set-option -g history-limit 10000
 # For more infromation see:
 # rhbz#1764666
 # rhbz#1722181
-new-session -d -s anaconda -n main "LD_PRELOAD=libgomp.so.1 anaconda"
+new-session -d -s anaconda -n main "valgrind <some options> anaconda"
 
 set-option status-right '#[fg=blue]#(echo -n "Switch tab: Alt+Tab | Help: F1 ")'

Comment 47 Adam Williamson 2020-05-22 22:20:36 UTC
So I gave this a preliminary shot, but it's not flying. I tried both this:

-new-session -d -s anaconda -n main "LD_PRELOAD=libgomp.so.1 anaconda"
+new-session -d -s anaconda -n main "LD_PRELOAD=libgomp.so.1 valgrind --tool=memcheck --leak-check=full --leak-resolution=high --num-callers=20 --log-file=/tmp/vgdump.log anaconda"

and this:

-new-session -d -s anaconda -n main "LD_PRELOAD=libgomp.so.1 anaconda"
+new-session -d -s anaconda -n main "valgrind --tool=memcheck --leak-check=full --leak-resolution=high --num-callers=20 --log-file=/tmp/vgdump.log anaconda"

but neither makes it to the installer within 50 minutes of booting (on an aarch64 VM), which means they're either not working at all or running so slowly as to be useless. I didn't get any logs, so I can't tell which.

I took those valgrind args from the GNOME docs, I am no expert on valgrind so didn't know what else to try. Anyone have any other suggestions?

Comment 48 Jerry James 2020-06-12 15:03:25 UTC
If I understand correctly, you are not looking for a memory leak, but rather for memory corruption.  It may very well be that, with the options in comment 47, it is running so slowly as to be useless.  Try this:

new-session -d -s anaconda -n main "LD_PRELOAD=libgomp.so.1 valgrind --tool=memcheck --leak-check=no --num-callers=10 --log-file=/tmp/vgdump.log anaconda"

If that finds a problem and 10 callers is not enough to diagnose the issue, repeat with --num-callers set to a higher value.

Comment 49 Adam Williamson 2020-06-19 16:53:25 UTC
Thanks, Jerry. Yeah, I figured that might be the issue, but I don't really know valgrind at all so I didn't know what to change. I'll try it that way, thanks.

Comment 50 Adam Williamson 2020-07-03 20:54:41 UTC
note, can't get to this ATM because it's easiest to reproduce on aarch64 or ppc64le, but we don't have those back up in the new infra yet, we're running on reduced capacity. Once those workers are back I can try and look at this again.

Comment 51 Michael Catanzaro 2020-07-03 23:04:09 UTC
It's worth a try even on x86_64. Most likely, the underlying bug occurs on all architectures and it's just a timing difference or something. With luck, valgrind might reveal the problem even on x86_64.

Comment 52 Jerry James 2020-07-03 23:24:49 UTC
Also, if it is still too slow with the options in comment 48, try reducing --num-callers a bit.  You probably don't want to go lower than 5; it becomes too hard to figure out what's going on with such small values.

Comment 53 Ben Cotton 2020-08-11 15:20:21 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle.
Changing version to 33.

