Attempting to build the OpenJDK project on Fedora 40; there's a configure option to enable link-time optimisation. With this option enabled, the build is unable to complete: the `lto1` process consumes too much memory, exhausting ~150GB of virtual memory before it gets killed. I don't know what the upper limit would be, as I can't let it go higher. A colleague tried this on Fedora 39 and was able to build it successfully.

Versions on the failing machine:

Tools summary:
* Boot JDK: openjdk version "22.0.1" 2024-04-16 OpenJDK Runtime Environment (Red_Hat-22.0.1.0.8-1) (build 22.0.1+8) OpenJDK 64-Bit Server VM (Red_Hat-22.0.1.0.8-1) (build 22.0.1+8, mixed mode, sharing) (at /usr/lib/jvm/jre-22-openjdk-22.0.1.0.8-1.rolling.fc40.x86_64)
* Toolchain: gcc (GNU Compiler Collection)
* C Compiler: Version 14.1.1 (at /usr/bin/gcc)
* C++ Compiler: Version 14.1.1 (at /usr/bin/g++)

Reproducible: Always

Steps to Reproduce:
1. git clone --branch jdk23 --depth 1 https://github.com/openjdk/jdk.git
2. cd jdk
3. bash configure --with-jvm-variants=custom --with-jvm-features=cds,compiler1,compiler2,g1gc,serialgc,jfr,jni-check,jvmci,jvmti,management,services,link-time-opt --enable-generate-classlist --disable-manpages --with-vendor-name=Experiments
4. make images

Be careful: this will exhaust all system memory, in some cases crashing the Gnome UI before safeguards kick in to terminate the build.

Actual Results: The build process gets killed as the system runs out of memory.

Expected Results: The build should complete successfully.

Also reported to the OpenJDK project:
- https://bugs.openjdk.org/browse/JDK-8334616
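To run the reproducer without risking the whole desktop session, one option is to cap the build shell's virtual memory before running step 4, so that lto1 fails against the limit instead of exhausting the machine. This is just a sketch; the 40 GiB figure is an arbitrary example, not a known-good threshold:

```shell
# Run "make images" under a virtual-memory cap so lto1 hits the limit and
# fails, rather than taking the whole system down. ulimit -v is in KiB;
# 40 GiB here is an arbitrary example value.
LIMIT_KIB=$((40 * 1024 * 1024))
bash -c "ulimit -Sv $LIMIT_KIB && make images"
```

With the cap in place, the build should die with allocation failures from lto1 instead of triggering the system OOM killer.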
For reference, I suspect it's a regression on the linker side, because a colleague using Fedora 39 could build the same version of OpenJDK just fine. He's using GCC version 13.2.1.
Hi Sanne,

I think it is unlikely that the linker is to blame, since the lto1 process is actually the compiler rather than the linker.

If you are able/willing to risk a test, it would be interesting to see whether adding "-fuse-ld=gold" or "-fuse-ld=lld" to the gcc command line makes any difference (i.e. does using the GOLD linker or the LLD linker make the problem go away?).

For reference, I did try running the steps you outline on my Fedora 39 system, and it did run out of memory there as well. (This was with gcc 13.3.1 and binutils 2.40.) But then I only have a measly 32GB of RAM installed on my machine, so I am guessing that building OpenJDK requires big iron no matter what.

Does the "make images" command build lots of things in parallel? If so, then maybe another workaround would be to build the images sequentially.

Cheers
Nick
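The two suggestions above could be tried roughly as follows. This is a sketch only: it assumes OpenJDK's `--with-extra-ldflags` configure option (to pass `-fuse-ld=...` to the link step) and the `JOBS` make variable (to reduce parallelism), as described in the OpenJDK build documentation; neither has been verified against this exact tree:

```shell
# Sketch: same reproducer configure line, plus an extra linker flag asking
# gcc to link with the gold linker (swap in -fuse-ld=lld to try LLD instead).
bash configure --with-jvm-variants=custom \
  --with-jvm-features=cds,compiler1,compiler2,g1gc,serialgc,jfr,jni-check,jvmci,jvmti,management,services,link-time-opt \
  --with-extra-ldflags='-fuse-ld=gold' \
  --enable-generate-classlist --disable-manpages --with-vendor-name=Experiments

# Sketch: build with a single job to test whether parallelism matters
make images JOBS=1
```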
Thanks for your help Nick! I'll try switching the linker, great idea.

I also tried on a Fedora 39 machine via podman, with memory limited to 60GB (my host only has 64), and it also wasn't able to complete, so perhaps it's not a regression after all. I've asked the colleague who was able to build it on Fedora 39 to double-check.

I'm not sure whether "make images" attempts to run multiple things in parallel, as I don't normally work with the OpenJDK codebase, but having observed system resources just before the build gets terminated, I could see only a single lto1 process consuming all the memory: it starts low and then grows fast and steadily until memory runs out and it is terminated.
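For anyone wanting to reproduce the observation above, a small helper like this (my own sketch, not part of the build) can track how much resident memory the lto1 process is using while the build runs:

```shell
# Hypothetical monitoring helper: sum the resident set size (in KiB) of all
# processes with the given command name. Prints 0 if none are running.
mem_of() {
  ps -o rss= -C "$1" 2>/dev/null | awk '{ s += $1 } END { print s + 0 }'
}

# Example: poll every 5 seconds during the build
# while sleep 5; do echo "lto1 RSS: $(mem_of lto1) KiB"; done
```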
It was suggested to me that I should include exactly which packages I have installed. The Fedora 40 container I used to reproduce the problem is built from:

FROM quay.io/fedora/fedora:40
RUN dnf -y update
RUN dnf -y install git maven java-22-openjdk-devel gcc autoconf automake binutils glibc-devel make libtool pkgconf unzip zip gcc-c++ alsa-lib-devel cups-devel fontconfig-devel libXtst-devel libXt-devel libXrender-devel libXrandr-devel libXi-devel --nodocs

I also reproduced the problem on my host machine, which is also running an up-to-date Fedora 40 but probably has many more packages installed that are not relevant to this.
I've been able to further narrow down the differences between the successful and failing build scripts we had, and it now seems clear that it's not a regression on the gcc side but rather something triggered by the unusual set of build parameters I was using. So, not a regression; apologies for that.

However, the reproducer steps described here can still get the lto1 process into a state where it quickly consumes all available memory.

In step 3. above, if one adds "shenandoahgc" to the list of jvm features the problem is avoided; I'm not sure why that could be the case:

bash configure --with-jvm-variants=custom --with-jvm-features=cds,compiler1,compiler2,g1gc,serialgc,jfr,jni-check,jvmci,jvmti,management,services,link-time-opt,shenandoahgc --enable-generate-classlist --disable-manpages --with-vendor-name=Experiments

In conclusion: not a regression, and not as critical as I initially thought. I've lowered the severity. Thanks
> In step 3. above, if one adds the "shenandoahgc" to the list of jvm features the problem is avoided; I'm not sure why that could be the case:

shenandoahgc appears to be a garbage collector feature, so my guess is that enabling it eliminates a lot of unneeded code and hence makes the lto1 compilation use fewer resources.

https://wiki.openjdk.org/display/shenandoah/Main
Correct, Shenandoah is one of the garbage collection implementations. But the build only succeeds when the "shenandoahgc" feature is NOT removed: when it is removed, I would expect unneeded code to be dropped, yet what we see is the lto1 process consuming far more memory; it grows really fast until it gets terminated. I honestly have no idea how, but removing some code seems to trigger a very specific situation.
Hang on - I think I am missing something here - is the shenandoahgc feature part of the build process for OpenJDK, or is it part of the run time once OpenJDK has been built?

If it is part of the build process, then the build working when it is enabled makes sense - it removes code from the build, so the final lto1 process has less work to do and consumes fewer resources. If instead shenandoahgc is only part of the run time, then enabling it would add more code to the build and should - in theory - make the lto1 process even more likely to run out of resources.

One theory - actually more of a guess - is that when the shenandoahgc feature is not enabled, the OpenJDK code base uses a different method for allocating memory. It might be conservative and try to allocate as much as possible, since it cannot rely on garbage collection making more memory available later on, and so the code ends up with huge chunks of allocated memory that, err, somehow cause the lto1 process to blow up. OK, I am grasping at straws here...

All of which might be moot. If it is safe to enable the shenandoahgc feature, and doing so makes the lto1 process work, then why not just leave it enabled and close this ticket?
Ah, yes, that could be confusing, apologies. No, Shenandoah is not part of the build process for OpenJDK; it's a capability that users of the built OpenJDK binaries can optionally enable at runtime, and the availability of this capability can be excluded at build time. I don't expect its presence among the built features to reduce the amount of code being built - rather the opposite - but this might of course have other consequences for the compiler. That's what is puzzling us.

That said, I'm not an expert on the OpenJDK build process, and I can't say with 100% confidence that enabling the shenandoahgc feature has no other side effects on the build flags. Looking at the parameters being passed to the `lto1` process this seems unlikely, but please have a look as well.

So the leading theory is that the shape of the code being compiled in this state, although less code overall, makes lto1 manifest some form of asymptotic complexity.
I have not been able to track down why the shenandoahgc feature changes the build so dramatically (but I am still looking). In the meantime, I did find another potential workaround for the problem: omit the "link-time-opt" feature from the configuration.
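Concretely, that workaround amounts to running the reproducer's configure line with "link-time-opt" dropped from the feature list, so no LTO is requested and lto1 never runs:

```shell
# Same configure line as in the reproducer, but with link-time-opt omitted
# from --with-jvm-features, so the build does not use LTO at all
bash configure --with-jvm-variants=custom \
  --with-jvm-features=cds,compiler1,compiler2,g1gc,serialgc,jfr,jni-check,jvmci,jvmti,management,services \
  --enable-generate-classlist --disable-manpages --with-vendor-name=Experiments
make images
```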
Nope - sorry - I cannot figure out why the build is behaving so differently when shenandoahgc is not enabled. It makes no sense. :-(
This message is a reminder that Fedora Linux 40 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 40 on 2025-05-13. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '40'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 40 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Fedora Linux 40 entered end-of-life (EOL) status on 2025-05-13. Fedora Linux 40 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field. If you are unable to reopen this bug, please file a new report against an active release.

Thank you for reporting this bug and we are sorry it could not be fixed.