This is a bit of a fuzzy problem, but it definitely happens enough that it seems to be a real thing.
Sometimes, openQA tests fail because anaconda just suddenly dies, usually quite early in the install. The visible symptom is that the installer disappears and you get a black screen instead (but can switch to a tty successfully and poke about). The logs show it crashed on signal 11. A core dump is saved.
Here are two recent x86_64 tests that failed this way:
You can get log and core dump files from the 'Logs & Assets' tab for each test. I have downloaded the core dumps from each and run them through gdb. They produce similar but not identical backtraces, which I will attach, that *seem* to suggest the crash may be in GTK+ somewhere - both seem to run through gtk_css_static_style_compute_value.
The same sort of thing seems to happen very often on aarch64. For instance, for the same compose (Fedora-30-20190314.n.0), I can see at least four tests that failed in what looks like the same way on aarch64:
The core dumps can be found on the Logs & Assets tabs again, but I don't have backtraces as I don't have an aarch64 host handy to generate them on.
Created attachment 1544247 [details]
one of the backtraces
Created attachment 1544248 [details]
the other backtrace
#2 <signal handler called>
#3 0x0000000000000000 in ?? ()
#4 0x00007f11d718fda7 in gtk_css_static_style_compute_value at gtkcssstaticstyle.c:237
#2 <signal handler called>
#3 0x0000000000000000 in ?? ()
#4 0x00007f72777a35fa in gtk_css_value_position_compute at gtkcsspositionvalue.c:48
#5 0x00007f7277788ca8 in gtk_css_value_array_compute at gtkcssarrayvalue.c:59
#6 0x00007f72777b1da7 in gtk_css_static_style_compute_value at gtkcssstaticstyle.c:237
Based on the backtraces, the error is not triggered by the same code in Anaconda. The problem really seems to be in the function gtk_css_static_style_compute_value. Reassigning to gtk3.
I've seen this many times before. It's never a problem in gtk_css_static_style_compute_value. Always turns out to be memory corruption in some unrelated code. Could be Anaconda, could be GTK, but the backtrace is almost certainly useless. You need to catch this under asan or valgrind to have any chance.
Company says: "the GTK CSS stack does a lot of memory allocations, so it's always a common place where corruptions are found"
It's just something we've learned again and again the hard way. Very hard. Memory corruption is the worst. :(
For the record, it seems debugging memory corruption is very hard, especially for a non-native distribution installer :/ We need to run it through valgrind and hit the crash, apparently.
Owen asked when this started happening, and if we assume the aarch64 issue is the same thing, I *think* we can pin it down to some time between 2018-10-28 and 2018-11-14 in Rawhide; I can't see any aarch64 fails that look like this bug in Fedora-Rawhide-20181028.n.0 or earlier composes, while I *do* see multiple failures that look like this (at least, sudden black screen early in the install process - the logs have aged out, unfortunately) in Fedora-Rawhide-20181114.n.0.
Unfortunately bisecting the packages that changed between those dates may be hard as we probably don't have a 20181114.n.0 tree or images lying around anywhere to work from :/ releng cleans out the nightly composes every couple of weeks to save space.
Adam, this reminds me of my ticket https://pagure.io/releng/issue/7763 about defining a retention policy for older composes, which would allow the kind of bisecting between composes that would be useful here.
*** Bug 1691016 has been marked as a duplicate of this bug. ***
So, I may actually have got *somewhere* with this. I came up with an anaconda updates.img that (if I got it right, anyway) runs anaconda with PYTHONMALLOC=debug (per https://stackoverflow.com/questions/20112989/how-to-use-valgrind-with-python - how we ever did our jobs before Stack Overflow I have no idea), then I hacked up openQA staging to run 8 ppc64le install tests at a time with that updates image and triggered it until I hit the bug:
the system logs give us this tantalizing traceback, which looks a lot more useful than the previous one:
22:48:39,818 CRIT systemd-coredump:Process 2145 (anaconda) of user 0 dumped core.

Stack trace of thread 2145:
#0  0x00007fff9d8f1ddc malloc (libc.so.6)
#1  0x00007fff6cf924f4 n/a (librsvg-2.so.2)
#2  0x00007fff6ccb4b90 n/a (librsvg-2.so.2)
#3  0x00007fff6ccb5704 n/a (librsvg-2.so.2)
#4  0x00007fff6ccb5214 n/a (librsvg-2.so.2)
#5  0x00007fff9d89ba14 __call_tls_dtors (libc.so.6)
#6  0x00007fff9d89afd8 __run_exit_handlers (libc.so.6)
#7  0x00007fff9d89b038 exit (libc.so.6)
#8  0x00007fff8cb51510 sync_signal_handler (_isys.so)
#9  0x00007fff9da704d8 __kernel_sigtramp_rt64 (linux-vdso64.so.1)
#10 0x00007fff811e7660 hb_blob_destroy (libharfbuzz.so.0)
#11 0x00007fff812b1068 _hb_graphite2_shaper_face_data_destroy (libharfbuzz.so.0)
#12 0x00007fff812aaebc hb_shape_plan_create_cached2 (libharfbuzz.so.0)
#13 0x00007fff812abc9c hb_shape_full (libharfbuzz.so.0)
#14 0x00007fff812abd2c hb_shape (libharfbuzz.so.0)
#15 0x00007fff813219a4 n/a (libpangoft2-1.0.so.0)
#16 0x00007fff81318b14 n/a (libpangoft2-1.0.so.0)
#17 0x00007fff81a34a6c n/a (libpango-1.0.so.0)
#18 0x00007fff81a4f758 pango_shape_full (libpango-1.0.so.0)
#19 0x00007fff81a3a328 n/a (libpango-1.0.so.0)
#20 0x00007fff81a3cb4c n/a (libpango-1.0.so.0)
#21 0x00007fff81a40260 n/a (libpango-1.0.so.0)
#22 0x00007fff81a42678 n/a (libpango-1.0.so.0)
#23 0x00007fff718c31d0 gtk_cell_renderer_text_get_preferred_width (libgtk-3.so.0)
#24 0x00007fff718b7ab0 gtk_cell_renderer_get_preferred_width (libgtk-3.so.0)
#25 0x00007fff718aa124 gtk_cell_area_request_renderer (libgtk-3.so.0)
#26 0x00007fff718ab348 compute_size (libgtk-3.so.0)
#27 0x00007fff718ad9b4 gtk_cell_area_box_get_preferred_width (libgtk-3.so.0)
#28 0x00007fff718a47c0 gtk_cell_area_get_preferred_width (libgtk-3.so.0)
#29 0x00007fff71c1469c gtk_tree_view_column_cell_get_size (libgtk-3.so.0)
#30 0x00007fff71bf1974 validate_row (libgtk-3.so.0)
#31 0x00007fff71bfc988 do_validate_rows (libgtk-3.so.0)
#32 0x00007fff71bfd35c gtk_tree_view_get_preferred_width (libgtk-3.so.0)
#33 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#34 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#35 0x00007fff71b02214 gtk_scrolled_window_measure (libgtk-3.so.0)
#36 0x00007fff718fd9dc gtk_css_custom_gadget_get_preferred_size (libgtk-3.so.0)
#37 0x00007fff71904434 gtk_css_gadget_get_preferred_size (libgtk-3.so.0)
#38 0x00007fff71afcd38 gtk_scrolled_window_get_preferred_width (libgtk-3.so.0)
#39 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#40 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#41 0x00007fff719d28a4 gtk_grid_request_run (libgtk-3.so.0)
#42 0x00007fff719d2bcc gtk_grid_get_size (libgtk-3.so.0)
#43 0x00007fff718fd9dc gtk_css_custom_gadget_get_preferred_size (libgtk-3.so.0)
#44 0x00007fff71904434 gtk_css_gadget_get_preferred_size (libgtk-3.so.0)
#45 0x00007fff719cf8b8 gtk_grid_get_preferred_width (libgtk-3.so.0)
#46 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#47 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#48 0x00007fff7187c93c gtk_box_get_content_size (libgtk-3.so.0)
#49 0x00007fff718fd9dc gtk_css_custom_gadget_get_preferred_size (libgtk-3.so.0)
#50 0x00007fff71904434 gtk_css_gadget_get_preferred_size (libgtk-3.so.0)
#51 0x00007fff7187d928 gtk_box_get_preferred_width (libgtk-3.so.0)
#52 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#53 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#54 0x00007fff717c9a20 gtk_alignment_get_preferred_size (libgtk-3.so.0)
#55 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#56 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#57 0x00007fff7187c93c gtk_box_get_content_size (libgtk-3.so.0)
#58 0x00007fff718fd9dc gtk_css_custom_gadget_get_preferred_size (libgtk-3.so.0)
#59 0x00007fff71904434 gtk_css_gadget_get_preferred_size (libgtk-3.so.0)
#60 0x00007fff7187d928 gtk_box_get_preferred_width (libgtk-3.so.0)
#61 0x00007fff71b23a08 gtk_widget_query_size_for_orientation (libgtk-3.so.0)
#62 0x00007fff71b242e0 gtk_widget_compute_size_for_orientation (libgtk-3.so.0)
#63 0x00007fff718756f0 gtk_bin_get_preferred_width (libgtk-3.so.0)

Stack trace of thread 2208:
#0  0x00007fff9d971a7c __poll (libc.so.6)
#1  0x00007fff8f352a08 g_poll (libglib-2.0.so.0)
#2  0x00007fff8f33b038 g_main_context_iterate.isra.0 (libglib-2.0.so.0)
#3  0x00007fff8f33b1e8 g_main_context_iteration (libglib-2.0.so.0)
#4  0x00007fff8f33b28c glib_worker_main (libglib-2.0.so.0)
#5  0x00007fff8f37963c g_thread_proxy (libglib-2.0.so.0)
#6  0x00007fff9d4199a8 start_thread (libpthread.so.0)
#7  0x00007fff9d981d18 __clone (libc.so.6)
i.e. (unless I'm way off, which is not unpossible!) it looks like we're crashing on a malloc in libc, via librsvg. Significantly, librsvg is something that *did* change between 2018-10-28 and 2018-11-14: exactly on 2018-11-14 it went from librsvg2-2.44.8-1.fc30 to librsvg2-2.44.9-1.fc30, and that build was in the 20181114.n.0 compose when this bug seems to have started happening. And also significantly, there *do* seem to be some possibly-relevant changes between 2.44.8 and 2.44.9, like these:
Obviously it'd be good if I could get a full traceback, but that's made a bit complicated because I'm reproducing the bug on ppc64le (it happens *way* more often there than on x86_64) but do not have a native ppc64le environment handy right now to get a traceback. I will try and sort that out somehow, and I might also try reverting suspicious-looking commits from the 2.44.8 to 2.44.9 range in librsvg to see if that makes the bug go away. If anyone else wants to poke at it, the coredump is at https://openqa.stg.fedoraproject.org/tests/566449/file/_boot_to_anaconda-anaconda.core.tar.gz .
Created attachment 1589256 [details]
Hmm, well, when I get a backtrace out of gdb (realized I could do it in a mock env on the worker host), the librsvg bits don't show up. Not sure why not. But the harfbuzz stuff does. So I'm fiddling about with harfbuzz. Here's the backtrace I got.
Hmm. I think I somehow messed up the Rawhide compose range where this seems to have started happening, above. I believe it's actually between 20181021.n.0 and 20181120.n.0.
OK, so I think I have a suspect! I think it's harfbuzz 2.0.0.
We got a bit lucky: harfbuzz's API and dependencies have apparently stayed sufficiently stable for the last several months that you can just drop the harfbuzz 1.8.8 package into an anaconda updates image and it works. So I can test a current Rawhide image with an "updates" image that overwrites the harfbuzz files with the files from harfbuzz-1.8.8-1.fc30. I did that and ran the test 32 times (so far): it has not crashed once. Until now, I got at least one failure in every 16 attempts, usually at least one in every 8. That seems a pretty strong indicator that a change between harfbuzz 1.8.8 and 2.0.0 is the culprit here.
Re-assigning to harfbuzz at least till someone tells me I'm wrong. :D I'll try and bisect this further (but it'll be tomorrow unless we get very lucky, as I have to go out in 30 mins or so).
Oh, forgot to mention, I also did a similar test but dropping in the files from harfbuzz-2.0.0-1.fc30, and in *that* case the crash still happens. That's why I think the bug is specifically between 1.8.8 and 2.0.0.
Nothing pops out at me. But you should definitely try with the latest HarfBuzz (2.5.3) and see if it fixes this. It should be a trivial drop-in replacement.
That's suspect... I mean, we should definitely not be using Graphite fonts at boot.
behdad: we know it happens with 2.5.3, because that's been in Rawhide since 2019-06-27, and this has still been happening commonly in Rawhide tests since then. Here it is happening on the most recent compose, for example:
I'm still working on bisecting this; I screwed up my first bisect run somehow, probably didn't throw in enough repeats of every build. I'm giving it another shot now.
My suspicion is this commit:
But seriously, the real question is why we get into Graphite at all.
Multiple threads involved? That would definitely make sense with the code I linked to being problematic.
graphite might be in the picture because we pick a font such as Gentium for some off-color character. I've seen that happening for 0x2028 (line separator), recently.
Here is the text that is shaped: မြန်မာ
Okay maybe that's picking up Padauk graphite font.
FWIW my current bisect does not have that commit in its range. At present the range is bee93e269711a3eda4e7d762b730522564fe6e87 to 7003b601afd02b0ba7e839510a7d0b886da09aaa. It's tricky to have confidence in the results because the bug doesn't happen *super* often: I'm currently running 40 tests on each revision, and sometimes a 'bad' revision only produces 1 failure. I'm 100% confident in the 'bad' results; the 'good' results are a bit questionable, and I might have to go up even further, to 80 runs. But that's what I have ATM.
At least it seems pretty certain the bad commit is before 7003b601afd02b0ba7e839510a7d0b886da09aaa .
Created attachment 1589700 [details]
slightly different backtrace from commit 4035158de46ce373b7521daf61c5b6df83312968
Still bisecting, but an interesting result: with commit 4035158de46ce373b7521daf61c5b6df83312968 we get what looks like the same failure, but with a slightly different backtrace. It still involves _hb_graphite2_shaper_face_data_destroy , it's just a bit of a different path.
e640f3a6b16f41cee5f7868ec738fda01244e96a crashes the same way as 4035158de46ce373b7521daf61c5b6df83312968 .
So...my bisection hit a somewhat surprising result. It pretty strongly says that this commit is the cause:
We have a definite fail with that commit:
With the previous commit, bee93e269711a3eda4e7d762b730522564fe6e87 , I have tried the test 120 times now - because I was so surprised at this result - and it has not failed once. So I'm really pretty sure this is it.
On the face of it, all this does is move a struct definition out from being inline in _hb_ot_shape_fallback_kern, if I'm reading it right. There's no obvious functional change at all.
However, having stared at it until I went cross-eyed, and bearing in mind that my C is pretty shaky and I'm sort of applying knowledge of Python scoping here, which for all I know works completely differently: is it possible that the difference has to do with 'font'? 'font' is the name of one of the arguments to `_hb_ot_shape_fallback_kern`, and in the old code the inline struct definition did some stuff with 'font'. I don't know C scoping, but wouldn't that be the 'font' that was passed in as an argument? Whereas once the struct definition is taken out of line, it wouldn't have that 'font' in scope any more?
Again, I may be way off here, that's just all I could think of based on my limited knowledge. If this really doesn't seem to make any sense, I can try the bisection *again*, but at this point the result seems pretty solid.
Not really. It's exact same code.
Try bisecting again? Skip a few commits forward / backward?
Yeah, it's the same code, which is why I got to thinking the *things it's working with* may be different, i.e. scoping. But it's only an idea.
So, I tried doing a build of 2.5.3 with a patch that basically 'reverts' e4f27f by moving the struct definition back inline...and it hits the bug. So now I'm just entirely baffled, and I've spent the whole day on this. Fun!
I'm going to re-do the tests of bee93e26 and e4f27f by hand just in case my test script somehow screwed up...
OK, on the manual re-run I got a fail for bee93e26. So, back to bisecting...
I am a bit confused: which arches are affected, exactly? Only aarch64 and ppc64le?
What happens if you remove say all other fonts than /usr/share/fonts/sil-padauk/ ?
It seems to affect all arches, but it happens *much more often* on ppc64le and aarch64, which is why I mainly use them for investigation/reproduction. It only happens very, very occasionally on x86_64.
I'll try removing other fonts in a bit, still trying to get a proper bisect first. I'm now up to running the test 160 times on every tested revision...
My current bisect looks like it's gonna land on the same commit Behdad identified - e4e74c2751ac24178086cce2811d34d8019b6f85 .
OK, indeed, as expected, with 200 runs of the tests on every frickin' commit, my bisect comes down to e4e74c2751ac24178086cce2811d34d8019b6f85 . I've also just confirmed that building the current Rawhide package with a manual revert of that patch avoids the bug: ran that test 200 times as well, and it passed every one.
I've sent an official Rawhide build with the revert, since it'd be nice to not have this flake happening to the aarch64/ppc64le tests. Once it's fixed properly upstream we can drop the revert and pull the fix instead.
Thanks. I'm reverting upstream until we figure out a proper fix.
I still can't quite reason why that piece of code becomes a problem. Are multiple threads involved? That's the only way I can see this *possibly* related. Even then not sure why.
anaconda did get redesigned, about a year ago, into modules that communicate via dbus:
I'm guessing that could possibly be involved?
(In reply to Behdad Esfahbod from comment #34)
> I still can't quite reason why that piece of code becomes a problem. Are
> multiple threads involved? That's the only way I can see this *possibly*
> related. Even then not sure why.
yes, anaconda is a multithreaded application with each spoke handled by a separate thread
And awesome work, Adam, thanks :-)
So, something like this is happening again :/
But this time even with PYTHONMALLOC=debug set, the backtrace is in gtk_css:
#0 0x0000ffff81bd34c4 in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0xffffef7a8d6c,
options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1 0x0000ffff7184f154 in sync_signal_handler (signum=<optimized out>) at isys.c:143
#2 <signal handler called>
#3 0x0000000000000000 in ?? ()
#4 0x0000ffff60221f30 in gtk_css_static_style_compute_value (style=0xaaaaf1fc6660, provider=0xaaaaf4d71260,
parent_style=0xaaaaf2a00be0, id=52, specified=0xffff64030a40, section=0x0) at gtkcssstaticstyle.c:237
#5 0x0000ffff6020dadc in _gtk_css_lookup_resolve (lookup=lookup@entry=0xaaaaf5431d90,
parent_style=parent_style@entry=0xaaaaf2a00be0) at gtkcsslookup.c:122
#6 0x0000ffff60221e30 in gtk_css_static_style_new_compute (provider=0xaaaaf4d71260,
matcher=matcher@entry=0xffffef7aa278, parent=parent@entry=0xaaaaf2a00be0) at gtkcssstaticstyle.c:195
#7 0x0000ffff6020fff0 in gtk_css_node_create_style (cssnode=0xaaaaf4e22740) at gtkcssnode.c:371
#8 gtk_css_node_real_update_style (cssnode=0xaaaaf4e22740, change=8598372560, timestamp=107356270,
style=0xaaaaf52fa890) at gtkcssnode.c:425
#9 0x0000ffff6020eeb4 in gtk_css_node_ensure_style (cssnode=0xaaaaf4e22740,
current_time=current_time@entry=107356270) at gtkcssnode.c:1007
etc. etc. Is this still useless and indicative of memory corruption we're not finding?
(In reply to Adam Williamson from comment #37)
> etc. etc. Is this still useless and indicative of memory corruption we're
> not finding?
Yes indeed, sadly:
(In reply to Michael Catanzaro from comment #4)
> I've seen this many times before. It's never a problem in
> gtk_css_static_style_compute_value. Always turns out to be memory corruption
> in some unrelated code. Could be Anaconda, could be GTK, but the backtrace
> is almost certainly useless. You need to catch this under asan or valgrind
> to have any chance.
Memory corruption is the absolute worst. Very hard to track down. :/
Hmm, here it is apparently happening on x86_64 even:
but it seems we didn't store the coredump on that occasion :(
sigh, I love these bugs. I did check if harfbuzz regressed, but it doesn't look like it (the revert still looks to be applied).
It still happens with F31 final release presumably?
The new crash still happens, yes. The harfbuzz one is still fixed by the reversion, at least last I checked.
We still see this failure case quite commonly on aarch64 in openQA. Commonly enough that I'm writing a hack into the openQA package to restart all aarch64 tests that fail on the first module :/ Up to and including Rawhide.
moving to gtk3 for now as we have no reason to suspect harfbuzz and I don't really know what else to point at.
Again, this is memory corruption, so the provided backtraces are not actionable and do not indicate anything wrong in GTK. I've seen crashes in gtk_css_static_style_compute_value() many times and it *always* turns out to be the application corrupting memory somehow. The GTK CSS machinery shows up in the backtrace just because it gets called very frequently, but the game was lost much earlier when the memory corruption first occurred.
Moving this back to anaconda for now, as that's the only reasonable component to use until we know where the memory corruption is actually occurring. I doubt the problem is somewhere in anaconda's codebase, because anaconda is written in python, but until we know where the memory corruption is coming from, there's no better component to assign the bug to. The problem could be anywhere in any library that anaconda uses (most likely), or in the python interpreter itself (much less likely). It could even be somewhere in GTK (just not where the backtrace is pointing to). We're not going to find out without either (a) running anaconda under valgrind, or (b) asan builds of everything (python and every library anaconda links to). Obviously (a) would be easier.
"I doubt the problem is somewhere in anaconda's codebase, because anaconda is written in python, but until we know where the memory corruption is coming from, there's no better component to assign the bug to."
I figured gtk3 was a better catch-all than anaconda for precisely this reason :) but it doesn't really matter, it's just that it needs to be assigned *somewhere*.
option (a) should be doable with an updates image using the following change in anaconda
diff --git a/data/tmux.conf b/data/tmux.conf
index 87c9cb7c7..ac5f5cfbb 100644
@@ -23,7 +23,7 @@ set-option -g history-limit 10000
# For more infromation see:
-new-session -d -s anaconda -n main "LD_PRELOAD=libgomp.so.1 anaconda"
+new-session -d -s anaconda -n main "valgrind <some options> anaconda"
set-option status-right '#[fg=blue]#(echo -n "Switch tab: Alt+Tab | Help: F1 ")'
So I gave this a preliminary shot, but it's not flying. I tried both this:
-new-session -d -s anaconda -n main "LD_PRELOAD=libgomp.so.1 anaconda"
+new-session -d -s anaconda -n main "LD_PRELOAD=libgomp.so.1 valgrind --tool=memcheck --leak-check=full --leak-resolution=high --num-callers=20 --log-file=/tmp/vgdump.log anaconda"
and this:
-new-session -d -s anaconda -n main "LD_PRELOAD=libgomp.so.1 anaconda"
+new-session -d -s anaconda -n main "valgrind --tool=memcheck --leak-check=full --leak-resolution=high --num-callers=20 --log-file=/tmp/vgdump.log anaconda"
but neither makes it to the installer within 50 minutes of booting (on an aarch64 VM), which means they're either not working at all or running so slowly as to be useless. I didn't get any logs, so I can't tell which.
I took those valgrind args from the GNOME docs, I am no expert on valgrind so didn't know what else to try. Anyone have any other suggestions?
If I understand correctly, you are not looking for a memory leak, but rather for memory corruption. It may very well be that, with the options in comment 47, it is running so slowly as to be useless. Try this:
new-session -d -s anaconda -n main "LD_PRELOAD=libgomp.so.1 valgrind --tool=memcheck --leak-check=no --num-callers=10 --log-file=/tmp/vgdump.log anaconda"
If that finds a problem and 10 callers is not enough to diagnose the issue, repeat with --num-callers set to a higher value.
Thanks, Jerry. Yeah, I figured that might be the issue, but I don't really know valgrind at all so I didn't know what to change. I'll try it that way, thanks.
note, can't get to this ATM because it's easiest to reproduce on aarch64 or ppc64le, but we don't have those back up in the new infra yet, we're running on reduced capacity. Once those workers are back I can try and look at this again.
It's worth a try even on x86_64. Most likely, the underlying bug occurs on all architectures and it's just a timing difference or something. With luck, valgrind might reveal the problem even on x86_64.
Also, if it is still too slow with the options in comment 48, try reducing --num-callers a bit. You probably don't want to go lower than 5; it becomes too hard to figure out what's going on with such small values.
This bug appears to have been reported against 'rawhide' during the Fedora 33 development cycle.
Changing version to 33.
Assigned to mail list is nonsense, returning to new.
I don't know if it's related to this issue, but I had a similar problem running anaconda in my CentOS remix:
Starting installer, one moment...
anaconda 18.104.22.168-1.el8 for CentOS Stream 8 started.
* installation log files are stored in /tmp during the installation
* shell is available on TTY2 and in second TMUX pane (ctrl+b, then press 2)
* when reporting a bug add logs from /tmp as separate text/plain attachments
12:09:07 Not asking for VNC because we don't have a network
No protocol specified
No protocol specified
Anaconda received signal 11!.
[New LWP 6738]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fed1cd094a2 in waitpid () from /lib64/libpthread.so.0
Saved corefile /tmp/anaconda.core.6730
[Inferior 1 (process 6730) detached]
In my case, using a GNOME Xorg session (instead of Wayland) solves the problem. Using a Wayland session makes Anaconda always fail. KDE remix, which uses Xorg, has no issue.
Hope this helps.
This message is a reminder that Fedora 33 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 33 on 2021-11-30.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '33'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 33 reached end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
It does seem possible that this has stopped happening, actually. It's a bit hard to be sure, because it's a very intermittent bug, but I checked several hundred aarch64 tests and it doesn't look like there has been a case of this recently...