Bug 809616

Summary: default enabled --build-id causes excessive memory utilization for f77 codes with a large BSS
Product: Red Hat Enterprise Linux 6
Reporter: Travis Gummels <tgummels>
Component: binutils
Assignee: Nick Clifton <nickc>
Status: CLOSED ERRATA
QA Contact: qe-baseos-tools-bugs
Severity: medium
Priority: medium
Version: 6.2
CC: jan.kratochvil, law, mnowak, rdassen, syeghiay, woodard
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2012-06-20 14:03:47 UTC
Bug Blocks: 787802    
Attachments:
  Reproducer (flags: none)
  Skip sections with no contents (flags: none)

Description Travis Gummels 2012-04-03 20:00:10 UTC
Created attachment 574941 [details]
Reproducer

Description of problem:

Fortran77 requires static arrays. These end up as BSS sections in the binary. 
This particular test has a large array in memory:
[25] .bss              NOBITS           0000000000608380  00008374
      00000004fb4d1528  0000000000000000  WA       0     0     32
This 0x4fb4d1528 (21396002088 bytes) ends up being about 19.9 GiB.
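
As a quick check of the size arithmetic, here is a minimal sketch that converts the .bss size from the readelf output into bytes and GiB. The variable names are only illustrative.

```python
# Size of .bss as reported in the readelf section header above.
bss_size = 0x4fb4d1528

# Convert to GiB (2**30 bytes) for a human-readable figure.
gib = bss_size / 2**30

print(bss_size, round(gib, 2))  # prints: 21396002088 19.93
```

The byte count matches the sh_size value that gdb prints for the same section later in this report.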

When this program is being linked and the linker gets to the part where it calculates the build-id, it allocates all that space (which is nothing but zeros) and then calculates the build-id including that. This greatly increases the time needed to link the binary, can trigger the OOM killer on this shared diskless machine, and impacts other users who are trying to get work done.

When we attach to the bloated process with gdb:

(gdb) where
#0  0x00000000004282a3 in sha1_process_block (buffer=0x2aaf56f99bde,  len=<value optimized out>, ctx=0x7fffffffd2c0) at ./sha1.c:355
#1  0x000000000042925b in sha1_process_bytes (buffer=<value optimized out>,  len=<value optimized out>, ctx=0x7fffffffd2c0) at ./sha1.c:245
#2  0x00002aaaaad1beca in bfd_elf64_checksum_contents (abfd=0x6a5be0,  process=0x4291b0 <sha1_process_bytes>, arg=0x7fffffffd2c0) at elfcode.h:1206
#3  0x0000000000420ee7 in gldelf_x86_64_write_build_id_section (abfd=0x6a5be0) at eelf_x86_64.c:906
#4  0x00002aaaaad26a2f in _bfd_elf_write_object_contents (abfd=0x6a5be0) at elf.c:5155
#5  0x00002aaaaad01437 in bfd_close (abfd=0x6a5be0) at opncls.c:692
#6  0x0000000000417f7c in main (argc=46, argv=0x7fffffffd6b8) at ./ldmain.c:515

Looking at where the problem seems to be:
#2  0x00002aaaaad1beca in bfd_elf64_checksum_contents (abfd=0x6a5be0,  process=0x4291b0 <sha1_process_bytes>, arg=0x7fffffffd2c0) at elfcode.h:1206
1206                    (*process) (sec->contents, i_shdr.sh_size, arg);
(gdb) p *i_shdr 

$3 = {sh_name = 240, sh_type = 8, sh_flags = 3, sh_addr = 6325120,  sh_offset = 0, sh_size = 21396002088, sh_link = 0, sh_info = 0,  sh_addralign = 32, sh_entsize = 0, bfd_section = 0xbd2b30, contents = 0x0}

There is our 19GB.
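
The header fields in that gdb output can be decoded with a minimal sketch like the one below. The constant values come from the ELF specification; the dictionary is just a hand-copied subset of the fields printed above, not anything read from a real file.

```python
# ELF constants (values from the ELF gABI).
SHT_NOBITS = 8   # section occupies no space in the file (e.g. .bss)
SHF_WRITE = 0x1  # "W" in readelf output
SHF_ALLOC = 0x2  # "A" in readelf output

# Fields copied by hand from the gdb output of the section header.
shdr = {"sh_type": 8, "sh_flags": 3, "sh_addr": 6325120, "sh_size": 21396002088}

is_nobits = shdr["sh_type"] == SHT_NOBITS
flags = "".join(
    name for name, bit in (("W", SHF_WRITE), ("A", SHF_ALLOC))
    if shdr["sh_flags"] & bit
)

print(is_nobits, flags, hex(shdr["sh_addr"]), hex(shdr["sh_size"]))
# prints: True WA 0x608380 0x4fb4d1528
```

This lines up with the readelf entry for .bss: type NOBITS, flags WA, address 0x608380, size 0x4fb4d1528. Note also that contents is 0x0 in the gdb output, so the section has no data of its own to checksum.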

Thus we can see clearly that the problem occurs when it is calculating the checksum for the BSS.

We have a workaround, passing in -Wl,--build-id=none or -Wl,--build-id=uuid, but we believe it would be better to have an optimized build-id calculation that doesn't allocate the BSS when it calculates the checksum.
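
To illustrate the kind of optimization we have in mind, here is a rough sketch in Python. It is not binutils code: the Section class and checksum_sections function are hypothetical, and mixing the section name and size into the hash is an assumption of this sketch. The only point is that a NOBITS section contributes its header fields but never a buffer of zeros.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

SHT_PROGBITS = 1
SHT_NOBITS = 8  # section with no bytes in the file, such as .bss


@dataclass
class Section:
    """Hypothetical stand-in for a section header plus its data."""
    name: str
    sh_type: int
    sh_size: int
    contents: Optional[bytes]  # None when the section has no file contents


def checksum_sections(sections):
    """SHA-1 over section headers and contents, skipping empty contents."""
    h = hashlib.sha1()
    for sec in sections:
        # Header fields are cheap to hash, so the size still affects the ID.
        h.update(sec.name.encode())
        h.update(sec.sh_size.to_bytes(8, "little"))
        # Skip sections with no contents instead of allocating zeros.
        if sec.sh_type == SHT_NOBITS or sec.contents is None:
            continue
        h.update(sec.contents)
    return h.hexdigest()


sections = [
    Section(".text", SHT_PROGBITS, 4, b"\x90\x90\x90\xc3"),
    Section(".bss", SHT_NOBITS, 0x4fb4d1528, None),
]
digest = checksum_sections(sections)
print(digest)
```

With this approach the memory used during the checksum is bounded by the sections that actually have bytes in the file, regardless of how large the BSS is.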

The problem appears to happen only on RHEL 6, not on RHEL 5 or F16.

The reproducer seems to be geared toward mpich2 rather than openmpi.

Version-Release number of selected component (if applicable):
binutils-2.20.51.0.2-5.28.el6.x86_64

How reproducible:
100%

Steps to Reproduce:

The attached reproducer requires MPI.
  
Actual results:
Excessive memory utilization.

Expected results:
Appropriate memory utilization.

Additional info:

Comment 1 Nick Clifton 2012-04-05 09:03:21 UTC
Created attachment 575329 [details]
Skip sections with no contents

Comment 2 Nick Clifton 2012-04-05 09:05:39 UTC
This bug has also been reported against the FSF binutils:

  http://sourceware.org/bugzilla/show_bug.cgi?id=12451

The uploaded patch is a simplified version of the patch that fixes that PR.  Once some internal networking problems are resolved I will add it to the RHEL6.3 binutils rpm.

Cheers
  Nick

Comment 3 Ben Woodard 2012-04-05 18:17:46 UTC
Thanks Nick, 

I never would have made the association between 12451 and the reported bug. The patch looks really simple, and I think the users are OK with the current workaround for a bit. If you would like some testing of alpha/beta test packages for 6.3, just let us know where to find the packages and we'll run them through our environment for a while.

Comment 5 Michal Nowak 2012-04-10 14:33:14 UTC
(In reply to comment #0)
> Created attachment 574941 [details]
> Reproducer

$ sh REPRODUCER_SCRIPT.sh
[...]
mpif77 -c  -O3 -fPIC cg.f
make[1]: mpif77: Command not found
[...]

I have openmpi and mpich2 packages installed. How do I compile it on RHEL6?

Comment 7 Ben Woodard 2012-04-10 17:24:02 UTC
Right or wrong, mpif77 is in /usr/lib64/mpich2/bin/mpif77, so you need to add /usr/lib64/mpich2/bin to your PATH.

Comment 10 Jeff Law 2012-04-17 21:23:20 UTC
*** Bug 691347 has been marked as a duplicate of this bug. ***

Comment 15 errata-xmlrpc 2012-06-20 14:03:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0872.html