Bug 2058803 - rust 1.59.0 on fedora-rawhide-s390x gets stuck compiling some crates
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: rust
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Rust SIG
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: ZedoraTracker
 
Reported: 2022-02-25 23:01 UTC by Fabio Valentini
Modified: 2022-03-02 17:05 UTC (History)

Fixed In Version: rust-1.59.0-2.fc37
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-02 08:33:19 UTC
Type: Bug
Embargoed:


Description Fabio Valentini 2022-02-25 23:01:32 UTC
This rawhide build has been running for almost 5 hours now:
https://koji.fedoraproject.org/koji/taskinfo?taskID=83328158

The build for the same version on Fedora 36 with Rust 1.58.1 finished within 13 minutes, and that included the snail-speed armv7hl, which is no longer even built on rawhide.

I have built other crates on rawhide with Rust 1.59.0 without problems, but the build for rust-inferno-0.10.8-3.fc37 seems to be stuck: the log shows the crate dependencies starting to compile, but none of them ever finishing.

from the log (in %build):

+ /usr/bin/env CARGO_HOME=.cargo RUSTC_BOOTSTRAP=1 'RUSTFLAGS=-Copt-level=3 -Cdebuginfo=2 -Ccodegen-units=1 -Clink-arg=-Wl,-z,relro -Clink-arg=-Wl,-z,now -Clink-arg=-Wl,-dT,/builddir/build/BUILD/inferno-0.10.8/.package_note-rust-inferno-0.10.8-3.fc37.s390x.ld --cap-lints=warn' /usr/bin/cargo build -j2 -Z avoid-dev-deps --release
   Compiling version_check v0.9.4
   Compiling libc v0.2.119
   Compiling proc-macro2 v1.0.36
(..)
   Compiling clap v2.34.0
   Compiling dashmap v4.0.2
   Compiling structopt-derive v0.4.18

(and then nothing happens)

Comment 1 Fabio Valentini 2022-02-26 15:11:01 UTC
It happened again, with clap_derive:
https://koji.fedoraproject.org/koji/taskinfo?taskID=83366868

Looks like it might be related to the compilation of procedural macros?
inferno got stuck at structopt-derive, and clap_derive is also a proc-macro crate.

Comment 2 Fabio Valentini 2022-02-26 15:18:58 UTC
Looks like it also affects alacritty:
https://koji.fedoraproject.org/koji/taskinfo?taskID=83353501

Comment 3 Zbigniew Jędrzejewski-Szmek 2022-02-26 18:13:45 UTC
I think I'll rebuild with ExcludeArch for now. I seriously doubt that anyone is using alacritty on s390x.

Comment 4 Josh Stone 2022-02-28 21:17:19 UTC
CCing a few LLVM folks, as that's where it appears to be stuck, according to "perf top":

Overhead  Shared Object                        Symbol
  33.23%  libLLVM-13.so                        [.] llvm::CodeMetrics::collectEphemeralValues
  20.13%  libLLVM-13.so                        [.] llvm::isSafeToSpeculativelyExecute
  19.12%  libLLVM-13.so                        [.] llvm::SmallPtrSetImplBase::insert_imp_big
   8.35%  libLLVM-13.so                        [.] llvm::SmallPtrSetImplBase::Grow
   6.71%  libLLVM-13.so                        [.] llvm::SmallPtrSetImplBase::FindBucketFor
   2.09%  libLLVM-13.so                        [.] llvm::CallGraphNode::removeCallEdgeFor
   1.24%  libLLVM-13.so                        [.] 0x0000000002913e26

I'll refrain from re-assigning the component for now, as I'm not sure why this would only start with Rust 1.59...

Comment 5 Fabio Valentini 2022-02-28 23:10:32 UTC
I just confirmed that this is affecting rust 1.59 on fedora-{rawhide,36,35,34}-s390x.

A good test case is the "rust-clap_derive" package (without my "revert building on s390x" commit from the rawhide branch), as it has few dependencies, and compiles relatively fast, even in QEMU (or at least, it *would* be fast, if it worked).

So it appears that LLVM 13 is not to blame, since Rust on Fedora 34 is using LLVM 12.

Comment 6 Josh Stone 2022-03-01 02:54:22 UTC
Using cargo-bisect-rustc, I found a regression point between 1.58.1 and 1.59.0.

---

searched nightlies: from nightly-2021-12-01 to nightly-2022-02-28
regressed nightly: nightly-2021-12-24
searched commit range: https://github.com/rust-lang/rust/compare/34926f0a1681458588a2d4240c0715ef9eff7d35...c09a9529c51cde41c1101e56049d418edb07bf71
regressed commit: https://github.com/rust-lang/rust/commit/e98309298d927307c5184f4869604bd068d26183

<details>
<summary>bisected with <a href='https://github.com/rust-lang/cargo-bisect-rustc'>cargo-bisect-rustc</a> v0.6.1</summary>


Host triple: s390x-unknown-linux-gnu
Reproduce with:
```bash
cargo bisect-rustc --access github --start 2021-12-01 --end 2022-02-28 --timeout 300 -- build --release
```
</details>

---

That merge commit is for https://github.com/rust-lang/rust/pull/90408

The change to ItemSortKey looks suspicious to me, though I'm not sure why that would get it stuck in LLVM, but I'm testing a possible fix for that now. If that doesn't work, I'll file a Rust issue and follow up that way.

Comment 7 Nikita Popov 2022-03-01 08:42:30 UTC
@Josh Do you happen to have an IR reproducer that hangs opt?

My first suspicion here was more NewPM catastrophic inlining (I believe that landed in 1.59), but we do disable that for both s390x and LLVM < 13, so that can't be it.

Here's an old report for a hang in collectEphemeralValues(): https://github.com/rust-lang/rust/issues/66617

Comment 8 Josh Stone 2022-03-01 23:24:45 UTC
My fix does appear to solve it: https://github.com/rust-lang/rust/pull/94505
And a scratch build: https://koji.fedoraproject.org/koji/taskinfo?taskID=83519938

But I'm not entirely satisfied with how LLVM would be affected, so I still captured bitcode for it.
https://jistone.fedorapeople.org/bz2058803/

We don't need rpmbuild to reproduce the problem. In clean source of "clap_derive 3.12", for each compiler I ran:

$ RUSTFLAGS=-Ccodegen-units=1 cargo rustc --release -- -Csave-temps

For 1.59.0-1.fc37 and nightly-2022-03-01, the file I uploaded is the last one it wrote before hanging. Then I picked the equivalent no-opt.bc for 1.59.0-2.fc37 and "patched" (same nightly), both with my PR fix.
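
One way to pick out that last-written file, sketched under the assumption that -Csave-temps leaves its .bc files in Cargo's target/release/deps directory. The helper name newest_bitcode is hypothetical and not part of any tool mentioned here.

```shell
#!/bin/sh
# newest_bitcode DIR
# Prints the most recently modified *.bc file directly inside DIR,
# or nothing if DIR has no such files.
# Relies on `ls -t` ordering, so it assumes file names without newlines.
newest_bitcode() {
    ls -t "$1"/*.bc 2>/dev/null | head -n 1
}

# Hypothetical usage after interrupting a hung build (path is an assumption):
#   newest_bitcode target/release/deps
```

Since the modification time is what decides the result, this should be run before anything else touches the directory.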

However, I haven't found any way to make opt hang on the "bad" ones...

Comment 9 Josh Stone 2022-03-02 01:25:04 UTC
> In clean source of "clap_derive 3.12",

Oops, typo, that should be 3.1.2.

Comment 10 Fedora Update System 2022-03-02 07:44:33 UTC
FEDORA-2022-c9bd6f0053 has been submitted as an update to Fedora 37. https://bodhi.fedoraproject.org/updates/FEDORA-2022-c9bd6f0053

Comment 11 Fedora Update System 2022-03-02 08:33:19 UTC
FEDORA-2022-c9bd6f0053 has been pushed to the Fedora 37 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 12 Nikita Popov 2022-03-02 08:56:18 UTC
> However, I haven't found any way to make opt hang on the "bad" ones...

I can reproduce the hang with "opt -O2 -enable-new-pm=0" on current LLVM HEAD.
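
A minimal sketch of how the comparison between the two pass managers could be scripted, assuming GNU coreutils timeout is installed and an opt build that still accepts -enable-new-pm. The helper name hangs_within, the 300-second limit, and the file name bad.no-opt.bc are illustrative placeholders, not taken from this report.

```shell
#!/bin/sh
# hangs_within SECONDS CMD [ARGS...]
# Runs CMD under GNU coreutils `timeout` and returns 0 only when the command
# was killed for exceeding the limit (`timeout` reports that as status 124).
hangs_within() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    [ "$status" -eq 124 ]
}

# Hypothetical usage; bad.no-opt.bc stands in for one of the captured files:
#   hangs_within 300 opt -O2 bad.no-opt.bc -o /dev/null \
#       && echo "hangs with the default pass manager"
#   hangs_within 300 opt -O2 -enable-new-pm=0 bad.no-opt.bc -o /dev/null \
#       && echo "hangs with the legacy pass manager"
```

A command that fails quickly for some other reason (a missing input file, for example) is not reported as a hang, because only status 124 counts.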

Comment 13 Nikita Popov 2022-03-02 11:08:46 UTC
Based on the inlining debug log, I strongly suspect that this is the same catastrophic cross-SCC inlining issue that we've previously seen with the new pass manager (it's another instance involving recursive drop glue). This just happens to be a case where it occurs with the legacy pass manager, but not the new pass manager.

I don't see an obvious way to port the NewPM fix from https://reviews.llvm.org/D120584 to the LegacyPM inliner, because we don't have a direct way to fetch the SCC for a CG node there. Though that should become a moot point soon(TM) anyway, with the legacy PM going away.

Comment 14 Josh Stone 2022-03-02 17:05:42 UTC
> I can reproduce the hang with "opt -O2 -enable-new-pm=0" on current LLVM HEAD.

Oh, is it default-enabled in opt now too? Maybe that option should be renamed, but it's going away anyway...

Too bad if this is the inlining thing again, because then the Rust change is just incidental bad luck. I do think that fix is still correct in its own right, but the LLVM problem will remain lurking until 15, I guess. Or if D120584 is feasible for the 14 release branch, maybe we can wean Rust from oldPM even on s390x with 14.

