Bug 1954375

Summary: libLLVM.so fails to link using LTO with ld.bfd + LLVMGold plugin
Product: [Fedora] Fedora Reporter: Tom Stellard <tstellar>
Component: binutilsAssignee: Nick Clifton <nickc>
Status: CLOSED UPSTREAM QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rawhideCC: adscvr, aoliva, dvlasenk, fweimer, hongjiu.lu, jakub, nickc, sipoyare, tbaeder
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-05 15:59:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Tom Stellard 2021-04-28 03:53:24 UTC
Description of problem:
Trying to build the Fedora LLVM package using clang -flto and ld.bfd with the LLVMgold plugin fails with this error message:

/usr/bin/ld: lib64/libLLVMMC.a: member lib64/libLLVMMC.a(SubtargetFeature.cpp.o) in archive is not an object

I spent some time debugging this, and it looks like the root cause is that ld.bfd opens too many files while linking.  The open call that fails is here:

https://github.com/bminor/binutils-gdb/blob/master/bfd/plugin.c#L210

Version-Release number of selected component (if applicable):
GNU ld version 2.36.1-8.fc35


How reproducible:
Always

Steps to Reproduce:
1. git clone https://src.fedoraproject.org/forks/tstellar/rpms/llvm.git
2. cd llvm 
3. git checkout build-with-clang
4. fedpkg build --release rawhide scratch-buid --srpm

Comment 1 Nick Clifton 2021-04-29 12:22:45 UTC
(In reply to Tom Stellard from comment #0)

> 4. fedpkg build --release rawhide scratch-buid --srpm

Is there a wat to recreate this build in a local environment, where I can have access to the object files and build logs ?

I did try "rpmbuild -bb llvm.spec" but this fails with:

  CMake Error at CMakeLists.txt:245 (message):
    In-source builds are not allowed.

Comment 2 Tom Stellard 2021-04-29 16:20:19 UTC
It should work with mockbuild:

fedpkg mockbuild --no-cleanup-after

Comment 3 Nick Clifton 2021-04-30 15:52:31 UTC
(In reply to Tom Stellard from comment #0)
Hi Tom,

  Thanks - I can reproduce the failure now.

> I spent some time debugging this, and it looks like the root cause is that
> ld.bfd opens too many files while linking.  The open call that fails is here:
> 
> https://github.com/bminor/binutils-gdb/blob/master/bfd/plugin.c#L210

  There is a comment in the code just above this line, which implies that this behaviour is intended:

  /* The plugin API expects that the file descriptor won't be closed
     and reused as done by the bfd file cache.  So open it again.
     dup isn't good enough.  plugin IO uses lseek/read while BFD uses
     fseek/fread.  It isn't wise to mix the unistd and stdio calls on
     the same underlying file descriptor.  */

  If this is true, and the API really does expect this, then I am not sure that there is much that can be done.  But.. I have not been able to find any official API documentation for LTO plugins, so maybe this is common practice, rather than required behaviour.

  Question - can the LLVMgold plugin implement its own file descriptor caching ?

  Of course if the plugin works with gold and not with ld.bfd, then this implies that the problem is with the bfd plugin code.  I am going to see if I can change the code to use the bfd library's file descriptor cache, and what breaks when I do that...

Cheers
  Nick

Comment 4 Tom Stellard 2021-04-30 16:26:30 UTC
(In reply to Nick Clifton from comment #3)
>   Of course if the plugin works with gold and not with ld.bfd, then this
> implies that the problem is with the bfd plugin code.  I am going to see if
> I can change the code to use the bfd library's file descriptor cache, and
> what breaks when I do that...
> 

It seems like that might help. Looking at the output of lsof, I noticed that each archive is re-opened (and kept open) every time one of it's objects is read.

Comment 5 Nick Clifton 2021-05-12 15:10:40 UTC
Hi Tom,

  I have made some progress, but not enough.  I have a patch to only use one file descriptor per archive, but now the linker is failing with a message from the plugin:

   error: Failed to get a view of /usr/lib/gcc/x86_64-redhat-linux/11/../../../../lib64/crti.o
   LLVMgold.so: plugin reported error claiming file file /usr/lib64/crti.o 

  This comes from the linker's get_view() function (in ld/plugin.c:get_view) where the file descriptor is used to read in data.  So now I am investigating what happens there...

Cheers
  Nick

Comment 6 Nick Clifton 2021-05-14 13:28:56 UTC
Hi Tom,

  Are you able to test out a scratch build of a new binutils:

 https://koji.fedoraproject.org/koji/taskinfo?taskID=67894924

  I was not able to find a cure for the file descriptor issue - I think that it is a fundamental design flaw in the plugin API.  Instead this patch simply tries to increase the number of available file descriptors, should the initial limit be reached.  In my tests however, the linker process continued until a message simply saying "Terminated" appeared.  I assume that this was due to some other kind of resource exhaustion - memory maybe - but I was running in a mock chroot, so maybe you will have better luck.

Cheers
  Nick

Comment 7 Tom Stellard 2021-05-19 04:57:24 UTC
I was able to test the builds, but it didn't change anything for me.  I'm also running in a mock, but maybe there is something about my environment that is preventing binutils from increasing the file descriptor limit?  Do you think there are any changes we can make to the LLVM plugin that might help fix this issue?

Comment 8 Nick Clifton 2021-05-21 14:45:20 UTC
(In reply to Tom Stellard from comment #7)
> I was able to test the builds, but it didn't change anything for me. 

Darn.

> I'm
> also running in a mock, but maybe there is something about my environment
> that is preventing binutils from increasing the file descriptor limit?

Maybe does "prlimit" say about NOFILE when run inside the mock ?

> Do
> you think there are any changes we can make to the LLVM plugin that might
> help fix this issue?

Possibly.  I did wonder if the plugin needs to keep the files open for as long as it does.  If we could find some way for it call release_input_file() on each archive member after it has finished loading it, then that might help.  I am not an expert on the plugin's code though, so you might be better able to make that kind of change....

Comment 9 Tom Stellard 2021-05-21 21:57:34 UTC
(In reply to Nick Clifton from comment #6)
> Hi Tom,
> 
>   Are you able to test out a scratch build of a new binutils:
> 
>  https://koji.fedoraproject.org/koji/taskinfo?taskID=67894924
> 

Is it possible you forgot to add the patch to this build?  Which patch was supposed to increase the file limit?

Comment 10 Nick Clifton 2021-05-24 10:33:41 UTC
Hi Tom,

Oh foo!  You are right - the patch is not there.  I have no idea what I did wrong, but I will try again...

Cheers
  Nick

Comment 11 Nick Clifton 2021-05-24 12:04:22 UTC
Hi Tom,

  Right, let's try again.  Here is a build which does contain the plugin file descriptor increasing patch...

https://koji.fedoraproject.org/koji/taskinfo?taskID=68595229

Cheers
  Nick

Comment 12 Timm Bäder 2021-06-02 14:36:12 UTC
So, I've been investigating this a bit and came up with the following script to reproduce the problem locally:

https://paste.centos.org/view/f4cfffa8

I can reliably reproduce the fd exhaustion with N = 450 with clang, but I need N = 900 with gcc (or clang with LTO disabled).


I also found the following in ld's ld/plugin.c: https://github.com/bminor/binutils-gdb/blob/master/ld/plugin.c#L1229-L1241

That seems pretty important for this case. I tried doing the following change in LLVM's llvm/tools/gold/gold-plugin.cpp:

@@ -547,9 +552,10 @@ static ld_plugin_status claim_file_hook(const ld_plugin_input_file *file,
     handleAllErrors(ObjOrErr.takeError(), [&](const ErrorInfoBase &EI) {
       std::error_code EC = EI.convertToErrorCode();
       if (EC == object::object_error::invalid_file_type ||
-          EC == object::object_error::bitcode_section_not_found)
+          EC == object::object_error::bitcode_section_not_found) {
         *claimed = 0;
-      else
+        close(file->fd);
+      } else
         message(LDPL_FATAL,
                 "LLVM gold plugin has failed to create LTO module: %s",
                 EI.message().c_str());


(with an appropriate include at the top of the file) and... it just worked.

This is basically just doing what the ld/plugin.c explicitly doesn't do for the LLVM plugin, i.e. just close()'ing the file again, since the plugin didn't claim it.
(I've also added the file->id = -1 assignment by casting the const away but that doesn't seem to change anything).

It seems like that comment was introduced in https://github.com/bminor/binutils-gdb/commit/f4b78d1 but that's about the end of my knowledge here.

Should we re-evaluate whether the comment in ld/plugin.c is still true? Or is the patch above for LLVM's gold-plugin.cpp the way to go?
In any case, it needs some testing, in particular with thinlto enabled.

Comment 13 Timm Bäder 2021-06-13 10:45:28 UTC
Okay, my previous reproducer was reproducing a different problem.

This patch:

diff --git a/llvm/tools/gold/gold-plugin.cpp b/llvm/tools/gold/gold-plugin.cpp
index 036da3f2ecad..d986d074b571 100644
--- a/llvm/tools/gold/gold-plugin.cpp
+++ b/llvm/tools/gold/gold-plugin.cpp
@@ -556,6 +556,14 @@ static ld_plugin_status claim_file_hook(const ld_plugin_input_file *file,
     return *claimed ? LDPS_ERR : LDPS_OK;
   }
 
+  if (!options::thinlto) {
+    // Close the fd now. The linker opens it specifically for this claim_file
+    // call and doesn't close it.
+    // When thinlto is enabled, we need to keep it open but we can still close it
+    // when thinlto is disabled.
+    release_input_file(file->handle);
+  }
+
   std::unique_ptr<InputFile> Obj = std::move(*ObjOrErr);
 
   Modules.emplace_back();


leads me to...

         | LTO    | thinlto
ld.bfd   | Works  | Broken
ld.gold  | Works  | Works

It does not affect the ld.bfd+thinlto case naturally.

Nick, is the comment in the patch appropriate? Is there any documentation that confirms this, so I can
point upstream at it?


Thanks

Comment 14 Nick Clifton 2021-06-14 11:05:11 UTC
(In reply to Timm Bäder from comment #13)
Hi Timm,

> +    // Close the fd now. The linker opens it specifically for this
> claim_file
> +    // call and doesn't close it.
> +    // When thinlto is enabled, we need to keep it open but we can still
> close it
> +    // when thinlto is disabled.

> Nick, is the comment in the patch appropriate?

Yes.

> Is there any documentation that confirms this, so I can
> point upstream at it?

Sadly no.  This was my biggest problem investigating this issue from my side.
It appears that the linker plugin architecture is completely undocumented.

On the plus side however this does mean that you should be able to make any
changes that you need to without having to worry that they will be blocked
by a poorly thought out specification...

Cheers
  Nick

Comment 15 H.J. Lu 2021-07-05 15:59:30 UTC
Fixed in binutils 2.37.