Bug 809616 - default enabled --build-id causes excessive memory utilization for f77 codes with a large BSS
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: binutils
Version: 6.2
Hardware: All, OS: Linux
Priority: medium, Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Nick Clifton
QA Contact: qe-baseos-tools
Duplicates: 691347
Depends On:
Blocks: 787802
Reported: 2012-04-03 16:00 EDT by Travis Gummels
Modified: 2017-04-06 09:28 EDT
Doc Type: Bug Fix
Last Closed: 2012-06-20 10:03:47 EDT

Attachments
Reproducer (29.17 KB, application/x-bzip), 2012-04-03 16:00 EDT, Travis Gummels
Skip sections with no contents (843 bytes, patch), 2012-04-05 05:03 EDT, Nick Clifton

Description Travis Gummels 2012-04-03 16:00:10 EDT
Created attachment 574941 [details]
Reproducer

Description of problem:

Fortran77 requires static arrays. These end up as BSS sections in the binary. 
This particular test has a large array in memory:
[25] .bss              NOBITS           0000000000608380  00008374
      00000004fb4d1528  0000000000000000  WA       0     0     32
This size, 0x4fb4d1528, is 21396002088 bytes: 19GB, or more precisely about 19.9 GiB.
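
The conversion can be checked with a few lines of Python. This is only a convenience sketch; the helper name describe_size is made up for illustration, and the hex string is the size column copied from the readelf output above.

```python
# Size column copied from the .bss line of the readelf output above.
BSS_SIZE_HEX = "00000004fb4d1528"


def describe_size(hex_size):
    """Return (size in bytes, size in GiB) for a hexadecimal section size."""
    size_bytes = int(hex_size, 16)
    return size_bytes, size_bytes / 2**30


if __name__ == "__main__":
    size_bytes, size_gib = describe_size(BSS_SIZE_HEX)
    # prints "21396002088 bytes = 19.93 GiB"
    print(f"{size_bytes} bytes = {size_gib:.2f} GiB")
```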

When the program is being linked and the linker gets to the part where it calculates the build-id, it allocates all that space (which is nothing but zeros) and then calculates the build-id over it. This greatly increases the time needed to link the binary, can cause the OOM killer to fire on this shared diskless machine, and impacts other users who are trying to get work done.

When we attach to the bloated process with gdb:

(gdb) where
#0  0x00000000004282a3 in sha1_process_block (buffer=0x2aaf56f99bde,  len=<value optimized out>, ctx=0x7fffffffd2c0) at ./sha1.c:355
#1  0x000000000042925b in sha1_process_bytes (buffer=<value optimized out>,  len=<value optimized out>, ctx=0x7fffffffd2c0) at ./sha1.c:245
#2  0x00002aaaaad1beca in bfd_elf64_checksum_contents (abfd=0x6a5be0,  process=0x4291b0 <sha1_process_bytes>, arg=0x7fffffffd2c0) at elfcode.h:1206
#3  0x0000000000420ee7 in gldelf_x86_64_write_build_id_section (abfd=0x6a5be0) at eelf_x86_64.c:906
#4  0x00002aaaaad26a2f in _bfd_elf_write_object_contents (abfd=0x6a5be0) at elf.c:5155
#5  0x00002aaaaad01437 in bfd_close (abfd=0x6a5be0) at opncls.c:692
#6  0x0000000000417f7c in main (argc=46, argv=0x7fffffffd6b8) at ./ldmain.c:515

Looking at where the problem seems to be:
#2  0x00002aaaaad1beca in bfd_elf64_checksum_contents (abfd=0x6a5be0,  process=0x4291b0 <sha1_process_bytes>, arg=0x7fffffffd2c0) at elfcode.h:1206
1206                    (*process) (sec->contents, i_shdr.sh_size, arg);
(gdb) p *i_shdr 

$3 = {sh_name = 240, sh_type = 8, sh_flags = 3, sh_addr = 6325120,  sh_offset = 0, sh_size = 21396002088, sh_link = 0, sh_info = 0,  sh_addralign = 32, sh_entsize = 0, bfd_section = 0xbd2b30, contents = 0x0}

There is our 19GB.
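
The other fields in that structure line up with the readelf output. A short Python sketch to decode them, using the standard ELF constants (SHT_NOBITS = 8, SHF_WRITE = 1, SHF_ALLOC = 2); the dictionary and the helper names are just for illustration, with the values copied from the gdb output above:

```python
SHT_NOBITS = 8   # ELF section type: section occupies no space in the file
SHF_WRITE = 0x1  # ELF section flag: writable at run time
SHF_ALLOC = 0x2  # ELF section flag: occupies memory at run time

# Values copied from the gdb "p *i_shdr" output above.
I_SHDR = {
    "sh_type": 8,
    "sh_flags": 3,
    "sh_addr": 6325120,
    "sh_size": 21396002088,
    "contents": 0x0,
}


def decode_flags(flags):
    """Render the two flags of interest the way readelf does (W, A)."""
    letters = ""
    if flags & SHF_WRITE:
        letters += "W"
    if flags & SHF_ALLOC:
        letters += "A"
    return letters


def summarize(shdr):
    """Translate the raw header numbers into readelf-style values."""
    return {
        "is_nobits": shdr["sh_type"] == SHT_NOBITS,
        "flags": decode_flags(shdr["sh_flags"]),
        "addr": hex(shdr["sh_addr"]),
        "size": hex(shdr["sh_size"]),
        "has_contents": shdr["contents"] != 0,
    }


if __name__ == "__main__":
    for key, value in summarize(I_SHDR).items():
        print(f"{key}: {value}")
```

So this is the NOBITS, WA section at 0x608380 with size 0x4fb4d1528 from the readelf listing, and it has no contents pointer of its own.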

Thus we can clearly see that the problem occurs when it is calculating the checksum for the BSS.
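
To illustrate the cost, here is a small Python model of that checksum step. It is only a sketch and not BFD code: Section, header_bytes and both checksum functions are made-up names, and hashing a small header record ahead of each section's contents is a simplifying assumption. The first function behaves the way the link appears to behave here (a NOBITS section becomes sh_size zero bytes in memory before hashing); the second never materializes data for a section that has no contents.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

SHT_PROGBITS = 1
SHT_NOBITS = 8


@dataclass
class Section:
    """Hypothetical stand-in for a section header plus its data."""
    name: str
    sh_type: int
    sh_size: int
    contents: Optional[bytes]  # None models "contents = 0x0"


def header_bytes(sec):
    """Assumed header record: name, type and size, so the size still counts."""
    return f"{sec.name}:{sec.sh_type}:{sec.sh_size}".encode()


def checksum_materializing(sections):
    """Model of the problem: allocate sh_size zero bytes, then hash them."""
    ctx = hashlib.sha1()
    for sec in sections:
        ctx.update(header_bytes(sec))
        data = sec.contents if sec.contents is not None else bytes(sec.sh_size)
        ctx.update(data)
    return ctx.hexdigest()


def checksum_skipping(sections):
    """Model of the alternative: hash the header, skip sections with no data."""
    ctx = hashlib.sha1()
    for sec in sections:
        ctx.update(header_bytes(sec))
        if sec.sh_type == SHT_NOBITS or sec.contents is None:
            continue  # nothing in the file to checksum, nothing to allocate
        ctx.update(sec.contents)
    return ctx.hexdigest()


if __name__ == "__main__":
    text = Section(".text", SHT_PROGBITS, 4, b"\x90\x90\x90\xc3")
    bss = Section(".bss", SHT_NOBITS, 21396002088, None)
    # Returns immediately; checksum_materializing would try to allocate ~20 GiB.
    print(checksum_skipping([text, bss]))
```

In this model the two functions give different digests for the same input, so a linker that switched from one behaviour to the other would emit a different build-id for the same binary. The BSS size still influences the result through the header record.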

We have a workaround, passing in -Wl,--build-id={none,uuid} (that is, either --build-id=none or --build-id=uuid), but we believe it would be better to have an optimized build-id calculation that does not allocate the BSS when it calculates the checksum.
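
For anyone scripting their builds, the workaround amounts to appending one option to the link line. A minimal Python sketch, assuming a hypothetical link command (mpif77 -o cg cg.o is made up for illustration and is not taken from the reproducer's makefile); the helper name with_build_id is also made up:

```python
import shlex

# The two build-id styles named in the workaround above.
WORKAROUND_STYLES = ("none", "uuid")


def with_build_id(link_cmd, style):
    """Return a copy of link_cmd with -Wl,--build-id=<style> appended."""
    if style not in WORKAROUND_STYLES:
        raise ValueError(f"not one of the workaround styles: {style!r}")
    return list(link_cmd) + [f"-Wl,--build-id={style}"]


if __name__ == "__main__":
    link_cmd = ["mpif77", "-o", "cg", "cg.o"]  # hypothetical link line
    # prints "mpif77 -o cg cg.o -Wl,--build-id=none"
    print(shlex.join(with_build_id(link_cmd, "none")))
```

With none no build-id note is emitted at all; with uuid the linker generates a random identifier instead of hashing the output, so neither touches the BSS.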

The problem appears to happen only on RHEL6, not on RHEL5 or F16.

The reproducer seems to be geared toward mpich2 rather than openmpi.

Version-Release number of selected component (if applicable):
binutils-2.20.51.0.2-5.28.el6.x86_64

How reproducible:
100%

Steps to Reproduce:

The attached reproducer requires MPI.
  
Actual results:
Excessive memory utilization.

Expected results:
Appropriate memory utilization.

Additional info:
Comment 1 Nick Clifton 2012-04-05 05:03:21 EDT
Created attachment 575329 [details]
Skip sections with no contents
Comment 2 Nick Clifton 2012-04-05 05:05:39 EDT
This bug has also been reported against the FSF binutils:

  http://sourceware.org/bugzilla/show_bug.cgi?id=12451

The uploaded patch is a simplified version of the patch that fixes that PR.  Once some internal networking problems are resolved I will add it to the RHEL6.3 binutils rpm.

Cheers
  Nick
Comment 3 Ben Woodard 2012-04-05 14:17:46 EDT
Thanks Nick, 

I never would have made the association between 12451 and the reported bug. The patch looks really simple, and I think the users are OK with the current workaround for a bit. If you would like some testing of alpha/beta test packages for 6.3, just let us know where to find them and we'll run them through our environment for a while.
Comment 5 Michal Nowak 2012-04-10 10:33:14 EDT
(In reply to comment #0)
> Created attachment 574941 [details]
> Reproducer

$ sh REPRODUCER_SCRIPT.sh
[...]
mpif77 -c  -O3 -fPIC cg.f
make[1]: mpif77: Command not found
[...]

I have openmpi and mpich2 packages installed. How do I compile it on RHEL6?
Comment 7 Ben Woodard 2012-04-10 13:24:02 EDT
Right or wrong, mpif77 is at /usr/lib64/mpich2/bin/mpif77, so you need to add /usr/lib64/mpich2/bin to your PATH.
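
A quick way to check whether that will work before re-running the reproducer, sketched in Python. The function name find_mpif77 is made up; the directory is the one given above, and appending it after the existing PATH (rather than prepending) is an arbitrary choice.

```python
import os
import shutil

MPICH2_BIN = "/usr/lib64/mpich2/bin"  # directory named in the comment above


def find_mpif77(extra_dir=MPICH2_BIN, base_path=None):
    """Look for mpif77 on PATH extended with extra_dir; None if not found."""
    base = os.environ.get("PATH", "") if base_path is None else base_path
    search = os.pathsep.join(p for p in (base, extra_dir) if p)
    return shutil.which("mpif77", path=search)


if __name__ == "__main__":
    found = find_mpif77()
    print(found if found else "mpif77 not found; is mpich2 installed?")
```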
Comment 10 Jeff Law 2012-04-17 17:23:20 EDT
*** Bug 691347 has been marked as a duplicate of this bug. ***
Comment 15 errata-xmlrpc 2012-06-20 10:03:47 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0872.html
