Bug 2226905

Summary: curve25519-dalek tests compiled with rust 1.71.0 crash with SIGSEGV on s390x
Product: [Fedora] Fedora
Reporter: Fabio Valentini <decathorpe>
Component: rust
Assignee: Rust SIG <rust-sig>
Status: NEW
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: unspecified
Version: 39
CC: amulhern, igor.raits, jistone, rust-sig, TicoTimo
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Bug Blocks: 2226527    

Description Fabio Valentini 2023-07-26 22:29:15 UTC
The curve25519-dalek package failed to build during the F39 mass rebuild with Rust 1.71, on s390x only: one of its tests crashes with SIGSEGV.

This happens on both Rawhide and Fedora 38, but not on Fedora 37 (which still has Rust 1.70).

Running cargo test with "--test-threads 1" in mock / qemu points to the following test as the culprit:

test edwards::test::basepoint_tables ... qemu: uncaught target signal 11 (Segmentation fault) - core dumped


Reproducible: Always

Steps to Reproduce:
1. fedpkg clone rust-curve25519-dalek
2. cd rust-curve25519-dalek
3. fedpkg srpm
4. mock -r fedora-rawhide-s390x ./*.src.rpm

Actual Results:
Tests crash with a segmentation fault on s390x.

Expected Results:  
Tests pass (as they did with Rust <= 1.70).

Comment 1 Josh Stone 2023-07-27 19:50:20 UTC
I can reproduce this on the latest curve25519-dalek in its git repo, with upstream builds of 1.71.0, 1.72-beta, and 1.73-nightly, but upstream 1.70.0 is fine.

RUSTFLAGS=-Ccodegen-units=1 cargo test --lib --release
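To narrow the run to the one test named in the description, cargo's test-name filter can be combined with the flags above. A minimal sketch, assuming the working directory is a checkout of curve25519-dalek; the `repro_cmd` helper name is made up for illustration, and it only prints the command so it can be inspected first:

```shell
# Hypothetical helper: print the cargo invocation for a single test.
# The flags are the ones quoted in this report; the default test name
# is the one that crashed in the description.
repro_cmd() {
    test_name="${1:-edwards::test::basepoint_tables}"
    printf 'RUSTFLAGS=-Ccodegen-units=1 cargo test --lib --release %s -- --test-threads 1\n' "$test_name"
}

repro_cmd
# To actually run it inside the checkout: eval "$(repro_cmd)"
```

Passing a different test name as the first argument prints the matching command for that test instead.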

Comment 2 Josh Stone 2023-07-27 22:40:05 UTC
cargo-bisect-rustc found this:

searched nightlies: from nightly-2023-02-28 to nightly-2023-07-27
regressed nightly: nightly-2023-05-09
searched commit range: https://github.com/rust-lang/rust/compare/c4190f2d3a46a59f435f7b42f58bc22b2f4d6917...2f2c438dce75d8cc532c3baa849eeddc0901802c
regressed commit: https://github.com/rust-lang/rust/commit/dfe31889e10e36eed53327d1ca624fbf21b475a5

But it's pretty surprising that turning OFF an optimization would cause problems, unless it was masking something else.
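A bisection like this one could be started roughly as follows. This is not necessarily the invocation used for this report: the date range is copied from the output above, reusing the RUSTFLAGS and test arguments from Comment 1 is an assumption, and the `bisect_cmd` helper name is hypothetical. It prints the command rather than running it, since a bisection downloads many toolchains:

```shell
# Hypothetical helper: print a cargo-bisect-rustc invocation over a nightly range.
# Arguments after "--" are handed to cargo, so each nightly runs "cargo test".
bisect_cmd() {
    start="${1:-2023-02-28}"
    end="${2:-2023-07-27}"
    printf 'RUSTFLAGS=-Ccodegen-units=1 cargo bisect-rustc --start=%s --end=%s -- test --lib --release\n' "$start" "$end"
}

bisect_cmd
# To actually run it on an s390x host: eval "$(bisect_cmd)"
```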

Comment 3 Fabio Valentini 2023-07-27 22:45:13 UTC
Thanks for investigating! That seems suspicious, yes.

Looking at the code for the failing test, I don't see anything suspicious though ...
https://github.com/dalek-cryptography/curve25519-dalek/blob/3.2.1/src/edwards.rs#L1408-L1432

Comment 4 Josh Stone 2023-07-28 00:42:36 UTC
Well, I can confirm that forcing that pass on with -Zmir-enable-passes=+RenameReturnPlace fixes the test on all toolchains, and likewise forcing it off (-Zmir-enable-passes=-RenameReturnPlace) breaks it on earlier toolchains that were working. I'll try more testing with that off to see if there's an underlying regression point.
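The on/off comparison can be sketched as a pair of commands. Assumptions here: a rustup-managed nightly toolchain (since -Z flags are unstable), the -Ccodegen-units=1 setting carried over from Comment 1, and a hypothetical `mir_pass_cmd` helper that only prints the command:

```shell
# Hypothetical helper: print a test command that forces a MIR pass on (+) or off (-).
# Assumes "cargo +nightly" is available via rustup; -Z flags are nightly-only.
mir_pass_cmd() {
    sign="$1"                        # "+" forces the pass on, "-" forces it off
    pass="${2:-RenameReturnPlace}"   # pass name taken from this comment
    case "$sign" in
        +|-) ;;
        *) echo "usage: mir_pass_cmd +|- [PassName]" >&2; return 1 ;;
    esac
    printf 'RUSTFLAGS="-Ccodegen-units=1 -Zmir-enable-passes=%s%s" cargo +nightly test --lib --release\n' "$sign" "$pass"
}

mir_pass_cmd +   # the variant reported to fix the test
mir_pass_cmd -   # the variant reported to break it
```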

Comment 5 Josh Stone 2023-07-28 05:32:09 UTC
With -Zmir-opt-level=0 (because -Zmir-enable-passes doesn't exist in toolchains that old), I bisected to nightly-2022-03-10 working, nightly-2022-03-11 crashing.

https://github.com/rust-lang/rust/compare/458262b1315e0de7be940fe95e111bb045e4a2a4...5f4e0677190b82e61dc507e3e72caf89da8e5e28

That includes commit 0c7d0a1 "Use new pass manager on s390x with LLVM 14", and indeed adding -Znew-llvm-pass-manager=no makes it work again. But current Rust and LLVM don't have that option anymore, and anyway I should stop black-boxing this and figure out what's actually changing in the codegen output to make this crash. :)

Comment 6 Fedora Release Engineering 2023-08-16 08:13:39 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 39 development cycle.
Changing version to 39.