Bug 2218885 - Firefox crashes with Illegal instruction on Intel Core 2 Duo CPU T9600
Summary: Firefox crashes with Illegal instruction on Intel Core 2 Duo CPU T9600
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: gcc
Version: 38
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: Jakub Jelinek
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
: 2218919 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2023-06-30 12:27 UTC by Matti Linnanvuori
Modified: 2023-07-04 16:54 UTC (History)
19 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-04 11:36:09 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
Core dump (8.62 MB, application/octet-stream)
2023-06-30 12:27 UTC, Matti Linnanvuori
no flags Details
Core dump (12.73 MB, application/octet-stream)
2023-06-30 14:15 UTC, Matti Linnanvuori
no flags Details

Description Matti Linnanvuori 2023-06-30 12:27:05 UTC
Created attachment 1973397 [details]
Core dump

Description of problem:
Firefox crashes with Illegal instruction on Intel Core 2 Duo CPU T9600

Version-Release number of selected component (if applicable):
114.0.2-2.fc38

How reproducible:
Always.

Steps to Reproduce:
1. Install Firefox on a computer with Intel Core 2 Duo.
2. Try to start Firefox

Actual results:
Firefox crashes with Illegal instruction on Intel Core 2 Duo.

Expected results:
Firefox starts without errors.

Comment 1 Martin Stransky 2023-06-30 12:35:52 UTC
Can you please try latest build? https://koji.fedoraproject.org/koji/buildinfo?buildID=2223514
Thanks.

Comment 2 Matti Linnanvuori 2023-06-30 14:13:47 UTC
Still a core dump but Illegal instruction no longer written to a shell.

Comment 3 Matti Linnanvuori 2023-06-30 14:15:37 UTC
Created attachment 1973409 [details]
Core dump

Comment 4 Martin Stransky 2023-06-30 14:47:51 UTC
*** Bug 2218919 has been marked as a duplicate of this bug. ***

Comment 5 Martin Stransky 2023-06-30 14:48:15 UTC
More info at https://bodhi.fedoraproject.org/updates/FEDORA-2023-07fdb0c13e

Comment 6 Martin Stransky 2023-06-30 14:52:52 UTC
According to updates, illegal instruction perfectly reproducible on

Core i-540 (Clarkdale, 2009/2010) - should have exactly the same instruction set as a Nehalem.

AMD E-450 mobile APU (Bobcat, 2011) - lacks full SSE 4/4.1/4.2 support (SSE 4a only), otherwise instruction-wise comparable to a Nehalem.

Comment 7 Martin Stransky 2023-06-30 14:53:28 UTC
This affects Fedora 38/Rawhide only, Fedora 37 should be fine.

Comment 8 samoht0 2023-06-30 20:29:25 UTC
In OpenSUSE builds this root of the problem was found:

>Instruction vshufps is a SSE instruction coded in AVX notation, that is why it requires AVX support.
>
>ILL as gcc or code optimiser have changed instruction. With that change code works faster on compatible CPUs by using >AVX, avoiding penalties for hopping between ordinary and VEX coding scheme.

So right instruction, but bad ("optimized") coding scheme.

https://bugzilla.opensuse.org/show_bug.cgi?id=1212101#c25
https://bugzilla.mozilla.org/show_bug.cgi?id=1838323#c2

We also have
-O3 -mavx /builddir/build/BUILD/firefox-114.0.2/gfx/skia/skia/src/opts/SkOpts_avx.cpp
in logs which is suspected to be the bad-encoded part of source.

OpenSUSE disabled LTO to make it work for now:

https://bugzilla.opensuse.org/show_bug.cgi?id=1212101#c62

Seems like, disabling PGO/LTO is the only solution for now, as this analysis directs to a optimizer regression, requiring AVX support for SSE opcode decoding. Weird.

Comment 9 samoht0 2023-07-01 06:17:08 UTC
GCC upstream bug with master branch patch:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110334

Comment 10 samoht0 2023-07-01 09:54:42 UTC
To my understanding there might be currently one option to fix compiling firefox without fixing GCC.

Cut -mavx from custom compiler options, if that's possible.

Upstream has some point, saying different ISA flags is general a user error. While that's technically right, building systems of complex software aren't exactly handy in real world.

@Martin
What do you think about this approach?

Comment 11 samoht0 2023-07-01 12:51:02 UTC
Bad flag is added here:

https://searchfox.org/mozilla-central/source/gfx/skia/generate_mozbuild.py

SOURCES['skia/src/opts/SkOpts_avx.cpp'].flags += ['-mavx']

SOURCES['skia/src/opts/SkOpts_avx.cpp'].flags += ['']
should do, I guess.

But can't be patched, as it's auto generated. Sed before build stage, maybe.

Comment 12 Jakub Jelinek 2023-07-01 20:38:08 UTC
My understanding of the problem is that this is a general incompatibility between using different ISA options (or even optimization options) for different compilation units vs. C++ comdats (e.g. function template instantiations or non-static inline functions), which if unlucky one can get with any compiler.
If some function template instantiation or non-static inline function etc. which happens to be not inlined for whatever reason is compiled in more than one compilation unit with different flags, linker chooses just one copy from those.
If it chooses one with smaller ISA requirements and the out of line comdat copy is called from both compilation units, one just gets perhaps less performant version (or say if it is -O0 vs. -O2 or something similar, one could get unoptimized code called even from optimized CUs etc.),
but if the linker chooses one with higher ISA requirements, such as -mavx vs. -msse2, you can get the above problems.
This can happen I believe with both GCC and clang, and with or without LTO, everything depends on the fine details what each compiler decides to inline and what it doesn't.
Without LTO, perhaps the user has slightly more control over what is happening, I think the linker chooses the first copy of set of comdat sections it sees, so if e.g. the -msse2 objects are linked before the -mavx objects, it might happen to work without LTO.
With LTO, which copy is linked first depends on the LTO partitioning and could differ from the user supplied order of object files to be linked.

While some changes have been committed for https://gcc.gnu.org/PR110334 , I wouldn't call any of those fixes for this problem, because it is not really fixable (well, perhaps ELF and linker could be extended to have some more advanced ways of choosing priorities of comdat sections).
Disabling -mavx can be one possibility of a workaround, another one as Richard Biener said in the GCC PR would be replace -mavx with -mavx -fno-lto (assuming -flto isn't added later on), so that those skia objects with extra flags wouldn't be handled by LTO and there could be greater chance the linker would choose the non-avx version.  Or use always_inline attribute on whatever function template or non-static inline function is problematic, or use less always_inline attributes (I believe part of the reason why it hasn't been inlined was too many functions marked always_inline into which GCC doesn't inline during early inlining non-always_inline functions).

Comment 13 Jakub Jelinek 2023-07-03 06:59:33 UTC
Perhaps best fix would be not enable -mavx for the whole compilation units, but instead enable it inside of the compilation unit on specific functions with pragmas/target attribute (e.g.
#pragma GCC push_options
#pragma GCC target("avx")
after #include directives in the CU and
#pragma GCC pop_options
at the end) or its clang counterparts, that way the comdats defined in headers don't have AVX enabled, so if they are not inlined, they can be called from other CUs without requiring AVX.
But, what gets inlined into the AVX routines will use AVX.

Comment 14 Martin Stransky 2023-07-03 14:08:44 UTC
Thanks. For the workaround, do I need to handle -mavx2 too? or -mavx is enough?

Comment 15 Jakub Jelinek 2023-07-03 14:49:41 UTC
Depends on if you want a minimal workaround or something longer term.
The problem certainly isn't limited to just -mavx, it is about different CUs with different ISA options and comdat functions in them.
Without LTO, one could e.g. find all the *.o files built with extra -m* options which are linked together, and for those gather
with
readelf -WS those_objects*.o | grep AXG | sed 's/^.*\.text\.//;s/ .*$//'
the list of the comdat functions and then for each item in the list search all the other objects
whether they don't contain the same symbols as well.
With LTO, this would need to be done on the objects from the LTO partitions (one could e.g. use -save-temps
during the LTO link to leave tons of files around and search those).
The problem is that both in case of using #pragma GCC target/target attribute etc. or when using LTO,
different functions in one LTO partition can have different compilation options, so one could gather
the list of comdat functions across all objects and then either compare them using objdump -dr whether
the contain the same instructions, or try to map them back to the compilation options using debuginfo.

Nick, could one use linker dumps to find out which comdat sections have been picked and which have been
eliminated and perhaps find out if those sections were the same or different?

Comment 16 Martin Stransky 2023-07-03 15:28:07 UTC
As this affects bundled skia library only, I expect it may also help to disable LTO for it?

Comment 17 samoht0 2023-07-03 16:31:59 UTC
(In reply to Martin Stransky from comment #14)
> Thanks. For the workaround, do I need to handle -mavx2 too? or -mavx is
> enough?

According to build logs, we have also compiler flags from:
SOURCES['skia/src/opts/SkOpts_hsw.cpp'].flags += ['-mavx2', '-mf16c', '-mfma']
SOURCES['skia/src/opts/SkOpts_skx.cpp'].flags += ['-mavx512f', '-mavx512dq', '-mavx512cd', '-mavx512bw', '-mavx512vl']

For some quick fix, to ship 115.0, I like the idea to cut ISA flags from this two lines plus

SOURCES['skia/src/opts/SkOpts_avx.cpp'].flags += ['-mavx']

But Jakub is probably be right log term. We've plenty of -mavx and -mavx2 ISA flags in video decoders (av1, aom, vpx). Anyway, even if a similar bug appears there, this shouldn't affect browsing and basic usage.

(In reply to Martin Stransky from comment #16)
> As this affects bundled skia library only, I expect it may also help to
> disable LTO for it?

To my understanding, Jakub said, it depends. But as firefox-114.0.2-1.fc38 worked fine without LTO, I'm confident, the odds are good.

Comment 18 Martin Stransky 2023-07-04 08:15:54 UTC
Let's se if disabled LTO help.

Comment 19 Fedora Update System 2023-07-04 10:41:43 UTC
FEDORA-2023-8ffb38cd0c has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2023-8ffb38cd0c

Comment 20 Fedora Update System 2023-07-04 11:36:09 UTC
FEDORA-2023-8ffb38cd0c has been pushed to the Fedora 39 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 21 samoht0 2023-07-04 16:54:55 UTC
Works on non-VAX CPUs now.
Disabling LTO just for problematic skia compiler calls (adding -fno-lto) might be an option to test, but I'm fine with LTO disabled globally.


Note You need to log in before you can comment on or make changes to this bug.