Bug 2226905 - curve25519-dalek tests compiled with rust 1.71+ crash with SIGSEGV on s390x
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: llvm
Version: rawhide
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Tom Stellard
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-07-26 22:29 UTC by Fabio Valentini
Modified: 2024-10-07 15:48 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2024-10-07 15:48:04 UTC
Type: ---
Embargoed:


Description Fabio Valentini 2023-07-26 22:29:15 UTC
The curve25519-dalek package failed to build during the F39 mass rebuild with Rust 1.71, on s390x only: one of its tests crashes with SIGSEGV.

This happens on both Rawhide and Fedora 38, but not on Fedora 37 (which still has Rust 1.70).

Running cargo test with "--test-threads 1" in mock / qemu points to the following test as the culprit:

test edwards::test::basepoint_tables ... qemu: uncaught target signal 11 (Segmentation fault) - core dumped


Reproducible: Always

Steps to Reproduce:
1. fedpkg clone rust-curve25519-dalek
2. cd rust-curve25519-dalek
3. fedpkg srpm
4. mock -r fedora-rawhide-s390x ./*.src.rpm

Actual Results:
Tests crash with a segmentation fault on s390x.

Expected Results:  
Tests pass (as they did with Rust <= 1.70).

Comment 1 Josh Stone 2023-07-27 19:50:20 UTC
I can reproduce this on the latest curve25519-dalek in its git repo, with upstream builds of 1.71.0, 1.72-beta, and 1.73-nightly, but upstream 1.70.0 is fine.

RUSTFLAGS=-Ccodegen-units=1 cargo test --lib --release

Comment 2 Josh Stone 2023-07-27 22:40:05 UTC
cargo-bisect-rustc found this:

searched nightlies: from nightly-2023-02-28 to nightly-2023-07-27
regressed nightly: nightly-2023-05-09
searched commit range: https://github.com/rust-lang/rust/compare/c4190f2d3a46a59f435f7b42f58bc22b2f4d6917...2f2c438dce75d8cc532c3baa849eeddc0901802c
regressed commit: https://github.com/rust-lang/rust/commit/dfe31889e10e36eed53327d1ca624fbf21b475a5

But it's pretty surprising that turning OFF an optimization would cause problems, unless it was masking something else.

Comment 3 Fabio Valentini 2023-07-27 22:45:13 UTC
Thanks for investigating! That seems suspicious, yes.

Looking at the code for the failing test, I don't see anything suspicious though ...
https://github.com/dalek-cryptography/curve25519-dalek/blob/3.2.1/src/edwards.rs#L1408-L1432

Comment 4 Josh Stone 2023-07-28 00:42:36 UTC
Well, I can confirm that forcing that pass on with -Zmir-enable-passes=+RenameReturnPlace fixes the test on all toolchains, and likewise forcing it off (-) breaks it with earlier toolchains that were working. I'll try more testing with that off to see if there's an underlying regression point.
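
For readers unfamiliar with the pass: RenameReturnPlace is rustc's MIR-level named return value optimization. When a function always returns the same local, the pass lets that local be built directly in the return place instead of being copied there at the end. The sketch below only illustrates the code shape the pass targets. `Table` and `build_table` are hypothetical names and are not taken from curve25519-dalek.

```rust
/// Hypothetical stand-in for a large value returned by value
/// (not the real curve25519-dalek table type).
#[derive(Clone, Copy)]
pub struct Table {
    pub entries: [u64; 256],
}

/// Builds the table in a named local and returns that local.
/// With RenameReturnPlace enabled, `table` can be constructed directly
/// in the caller-provided return slot; with the pass disabled, the MIR
/// keeps a separate local plus a copy into the return place, so later
/// stages see a different stack layout.
pub fn build_table(seed: u64) -> Table {
    let mut table = Table { entries: [0; 256] };
    for (i, entry) in table.entries.iter_mut().enumerate() {
        *entry = seed.wrapping_mul(i as u64 + 1);
    }
    table
}
```

Both variants compute the same result, which is consistent with the pass only masking or exposing a problem further down the pipeline rather than causing it.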

Comment 5 Josh Stone 2023-07-28 05:32:09 UTC
With -Zmir-opt-level=0 (because -Zmir-enable-passes doesn't exist in toolchains that old), I bisected to nightly-2022-03-10 working, nightly-2022-03-11 crashing.

https://github.com/rust-lang/rust/compare/458262b1315e0de7be940fe95e111bb045e4a2a4...5f4e0677190b82e61dc507e3e72caf89da8e5e28

That includes commit 0c7d0a1 "Use new pass manager on s390x with LLVM 14", and indeed adding -Znew-llvm-pass-manager=no makes it work again. But current Rust and LLVM don't have that option anymore, and anyway I should stop black-boxing this and figure out what's actually changing in the codegen output to make this crash. :)

Comment 6 Fedora Release Engineering 2023-08-16 08:13:39 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 39 development cycle.
Changing version to 39.

Comment 7 Fabio Valentini 2024-05-06 17:57:52 UTC
This is still happening with the latest version of rustc and LLVM in Rawhide and the latest version of curve25519-dalek.

I actually need to build this package because something I'm working on needs a newer version, so I'll need to disable tests on s390x for now.

Comment 8 Nikita Popov 2024-05-08 06:25:30 UTC
I think this is an issue in post-RA pseudo expansion. We go from

  renamable $r2q = L128 $r15d, 14920, killed $r2d :: (load (s128) from %stack.13, align 8)

to

  $r2d = LG $r15d, 14920, $r2d :: (load (s128) from %stack.13, align 8)
  $r3d = LG $r15d, 14928, killed $r2d :: (load (s128) from %stack.13, align 8)

Note how $r2d now gets overwritten by the first load, even though the second load still uses it as an address operand.
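
The clobbering can be modeled with a toy simulation. This is only a sketch, not LLVM code: `Regs`, `lg`, `expand_clobbering`, `expand_safe` and `sample_state` are hypothetical names, memory is a flat array of 8-byte slots, and an out-of-range access (returned as `None`) stands in for the SIGSEGV. The reordered variant shows one way to avoid the clobber and is not necessarily what the upstream fix does.

```rust
/// Toy machine state: only the three registers named in the MIR above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Regs {
    pub r2: u64,  // index register on entry; first half of the pair r2q afterwards
    pub r3: u64,  // second half of the pair r2q
    pub r15: u64, // stack pointer
}

/// Simulated 8-byte load from base + disp + index.
/// Returns None for an address outside the toy memory (models a fault).
fn lg(mem: &[u64], base: u64, disp: u64, index: u64) -> Option<u64> {
    let addr = base.wrapping_add(disp).wrapping_add(index);
    let slot = addr / 8;
    if slot < mem.len() as u64 {
        Some(mem[slot as usize])
    } else {
        None
    }
}

/// Expansion order shown in the comment: the first load overwrites r2,
/// which the second load still reads as its index.
pub fn expand_clobbering(mut r: Regs, mem: &[u64]) -> Option<Regs> {
    r.r2 = lg(mem, r.r15, 14920, r.r2)?;
    r.r3 = lg(mem, r.r15, 14928, r.r2)?; // r2 no longer holds the index
    Some(r)
}

/// Hypothetical safe order: load the half that does not overlap the
/// index register first, and overwrite the index register last.
pub fn expand_safe(mut r: Regs, mem: &[u64]) -> Option<Regs> {
    r.r3 = lg(mem, r.r15, 14928, r.r2)?;
    r.r2 = lg(mem, r.r15, 14920, r.r2)?;
    Some(r)
}

/// Sample state (made-up values): index 32 in r2, stack pointer 0,
/// and a 128-bit value stored at byte address 14920 + 32.
pub fn sample_state() -> (Regs, Vec<u64>) {
    let mut mem = vec![0u64; 2048];
    mem[1869] = 0xDEAD_BEEF_0000_0000; // byte address 14952
    mem[1870] = 0x1234;                // byte address 14960
    (Regs { r2: 32, r3: 0, r15: 0 }, mem)
}
```

With the sample state, `expand_clobbering` fails on its second load because the loaded data is reused as an index, while `expand_safe` reads both halves correctly.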

Comment 9 Nikita Popov 2024-05-08 07:29:11 UTC
I've filed https://github.com/llvm/llvm-project/issues/91437 with that finding for now.

Comment 10 Fabio Valentini 2024-05-15 17:14:52 UTC
Thank you! Looks like it's fixed in the development branch and on its way to being backported to the LLVM 18 branch :party:

Comment 11 Fabio Valentini 2024-10-07 15:48:04 UTC
As far as I can tell, the fix for this is in LLVM 18.1.6+, which is available in Fedora 40+. Thanks!

