Bug 1662656 - valgrind slow loading of binaries with debuginfo
Summary: valgrind slow loading of binaries with debuginfo
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: valgrind
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Mark Wielaard
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-12-31 09:52 UTC by Richard W.M. Jones
Modified: 2021-10-02 01:27 UTC (History)
8 users (show)

Fixed In Version: valgrind-3.17.0-13.fc35 valgrind-3.17.0-13.fc34
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-24 20:30:48 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
perf report (35.84 KB, text/plain)
2018-12-31 10:09 UTC, Richard W.M. Jones
no flags Details

Description Richard W.M. Jones 2018-12-31 09:52:44 UTC
Description of problem:

Since a recent upgrade to valgrind, it is very slow to start up
on certain binaries.  For example:

$ valgrind --log-file=/dev/null /bin/gcc
   <---- very long pause here
gcc: fatal error: no input files
compilation terminated.

It takes 7.6 seconds to load the GCC binary on my machine.

Other binaries are not affected so much (eg. /bin/ls is quite fast).

I have another program which I'm testing which also shows the
same slow start-up behaviour, and I modified it to insert a
print statement at the beginning of main().  The pause appears
to happen before main() is entered.

Version-Release number of selected component (if applicable):

valgrind-3.14.0-7.fc29.x86_64

How reproducible:

Unknown - only happens on some machines.

Steps to Reproduce:
1. See description above.

Comment 1 Richard W.M. Jones 2018-12-31 10:09:15 UTC
Created attachment 1517584 [details]
perf report

perf report from:

# perf record -g -F 99 -- valgrind --log-file=/dev/null /bin/gcc

Comment 2 Richard W.M. Jones 2018-12-31 10:18:34 UTC
Ah ha!  I've worked out what changed which caused the slowdown.  It was the
installation of debuginfo for the binary.

On another machine of mine:

$ time valgrind --log-file=/dev/null /bin/gcc
gcc: fatal error: no input files
compilation terminated.

real	0m0.863s
user	0m0.674s
sys	0m0.035s

Then after installing gcc-debuginfo:

$ time valgrind --log-file=/dev/null /bin/gcc
gcc: fatal error: no input files
compilation terminated.

real	0m18.058s
user	0m17.949s
sys	0m0.054s

Comment 3 Mark Wielaard 2018-12-31 12:51:29 UTC
Yeah valgrind is slow reading debuginfo and somewhat eager. Although this seems somewhat excessive.
Thanks for the perf report. It looks like a lot of time goes into vgPlain_debugLog_vprintf. I wonder if valgrind is trying to tell us something about the debuginfo it is seeing. Could you run without --log-file=/dev/null or put the output in a file and attach it if there is something odd in it?

Also could you try with --read-inline-info=no to see if it might be related to getting better backtraces?

Comment 4 Richard W.M. Jones 2018-12-31 13:16:27 UTC
The complete output without --log-file is:

==13709== Memcheck, a memory error detector
==13709== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==13709== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==13709== Command: /bin/gcc
==13709== 
gcc: fatal error: no input files
compilation terminated.
==13709== 
==13709== HEAP SUMMARY:
==13709==     in use at exit: 169,913 bytes in 43 blocks
==13709==   total heap usage: 187 allocs, 144 frees, 197,242 bytes allocated
==13709== 
==13709== LEAK SUMMARY:
==13709==    definitely lost: 4,253 bytes in 4 blocks
==13709==    indirectly lost: 0 bytes in 0 blocks
==13709==      possibly lost: 0 bytes in 0 blocks
==13709==    still reachable: 165,660 bytes in 39 blocks
==13709==         suppressed: 0 bytes in 0 blocks
==13709== Rerun with --leak-check=full to see details of leaked memory
==13709== 
==13709== For counts of detected and suppressed errors, rerun with: -v
==13709== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

There don't appear to be any errors about bad debuginfo.

The --read-inline-info option DOES make a big difference:

$ time valgrind --log-file=/dev/null --read-inline-info=yes /bin/gcc 
gcc: fatal error: no input files
compilation terminated.

real	0m7.904s
user	0m7.840s
sys	0m0.048s

$ time valgrind --log-file=/dev/null --read-inline-info=no /bin/gcc
gcc: fatal error: no input files
compilation terminated.

real	0m0.511s
user	0m0.475s
sys	0m0.035s

Comment 5 Ben Cotton 2019-08-13 17:11:33 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.

Comment 6 Ben Cotton 2019-08-13 19:38:37 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to 31.

Comment 7 Mark Wielaard 2019-08-26 12:45:04 UTC
We have to discuss upstream what to do about this. The defaults are now so that more debuginfo is read, which does improve warning/error reporting. But the slowdown is pretty big.

Comment 8 Ben Cotton 2020-11-03 15:06:55 UTC
This message is a reminder that Fedora 31 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 31 on 2020-11-24.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '31'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 31 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 9 Mark Wielaard 2021-06-09 08:21:01 UTC
This has become more urgent now that valgrind (3.17.0) supports debuginfod and that has been enabled by default in rawhide: https://fedoraproject.org/wiki/Changes/DebuginfodByDefault

Comment 10 Daniel Berrangé 2021-06-09 10:15:04 UTC
The debuginfod by default behaviour certainly feels like a good step for something like GDB because that's not so performance critical in the paths where it acquires the data. For valgrind though, I think maintaining high performance is more important than use of debuginfod, because valgrind already has a significant penalty compared to native execution.  

So if we can't get the valgrind performance improved in the near term (ie before F35 release GA), can we consider disabling use of debuginfod by default, specifically for valgrind only, as a temporary measure ?

Comment 11 Mark Wielaard 2021-06-09 14:47:32 UTC
Yes, if we cannot make the valgrind debuginfo reader faster before F35 GA then we should disable debuginfod downloading by default for valgrind. But I believe I can fix it (or at least make it a lot faster). But now I like to keep it enabled in rawhide because it already found some issues.

Comment 12 Fedora Update System 2021-09-20 12:26:09 UTC
FEDORA-2021-04b130ee33 has been submitted as an update to Fedora 35. https://bodhi.fedoraproject.org/updates/FEDORA-2021-04b130ee33

Comment 13 Mark Wielaard 2021-09-20 12:42:53 UTC
Please test the proposed update valgrind-3.17.0-13.fc35. For the original testcase, valgrind -q /usr/bin/gcc, on my local system, with gcc-debuginfo installed, runtime goes from 12 seconds to 0.6 seconds. Without any debuginfo installed runtime under valgrind is 0.4 seconds. So debuginfo reading for the default case should now have an overhead of only ~0.2 seconds.

Comment 14 Richard W.M. Jones 2021-09-20 13:13:38 UTC
I'm using tests/test-nbd-tls.sh from nbdkit.
https://gitlab.com/nbdkit/nbdkit/-/blob/master/tests/test-nbd-tls.sh
With DEBUGINFOD_URLS unset this test takes 9 seconds, so I'll take
that as my baseline.

Using nbdkit + this commit reverted:
https://gitlab.com/nbdkit/nbdkit/-/commit/8c94611ca3854c4e

With valgrind-3.17.0-11.fc35.x86_64 the following test takes 30 seconds:

$ make && time make check-valgrind TESTS=test-nbd-tls.sh

With valgrind-3.17.0-13.fc36.x86_64 it takes about 15 seconds
(consistent over several runs).

It seems as if debuginfod is still taking about 6 seconds, although
it's an improvement on before.

Oh wait!  I don't actually have elfutils-debuginfod installed!
Let's try that again with debuginfod installed and running ...

It still takes 15 seconds.

I would say it is an improvement, but there may still be something
I'm missing here.

I have a few questions:

 - Does debuginfod globally install the debuginfo RPMs that it downloads?
   (It seems like "no", because the above test uses gnutls but I do not
   see any gnutls-debuginfo* packages installed.)

 - Does debuginfod cache the RPMs, so at least they don't need to be
   downloaded twice?

Comment 15 Mark Wielaard 2021-09-20 13:28:56 UTC
(In reply to Richard W.M. Jones from comment #14)
> I'm using tests/test-nbd-tls.sh from nbdkit.
> https://gitlab.com/nbdkit/nbdkit/-/blob/master/tests/test-nbd-tls.sh
> With DEBUGINFOD_URLS unset this test takes 9 seconds, so I'll take
> that as my baseline.
> 
> Using nbdkit + this commit reverted:
> https://gitlab.com/nbdkit/nbdkit/-/commit/8c94611ca3854c4e
> 
> With valgrind-3.17.0-11.fc35.x86_64 the following test takes 30 seconds:
> 
> $ make && time make check-valgrind TESTS=test-nbd-tls.sh
> 
> With valgrind-3.17.0-13.fc36.x86_64 it takes about 15 seconds
> (consistent over several runs).
> 
> It seems as if debuginfod is still taking about 6 seconds, although
> it's an improvement on before.
> 
> Oh wait!  I don't actually have elfutils-debuginfod installed!
> Let's try that again with debuginfod installed and running ...
> 
> It still takes 15 seconds.
> 
> I would say it is an improvement, but there may still be something
> I'm missing here.

The 3.17.0-133 patches improve debuginfo reading however it is installed (either build into the binary, installed as separate debuginfo packages or automatically downloaded through debuginfod).
From the above I don't really know which scenario you are testing. You can see if the patches help by looking at the different between running valgrind --read-inline-info=yes (the default) and valgrind --read-inline-info=no. There is also --read-var-info=yes/no (but that defaults to no, so should not impact the normal case unless explicitly enabled).

> I have a few questions:
> 
>  - Does debuginfod globally install the debuginfo RPMs that it downloads?
>    (It seems like "no", because the above test uses gnutls but I do not
>    see any gnutls-debuginfo* packages installed.)

debuginfod-client (which valgrind uses when installed and DEBUGINFOD_URLS is set) doesn't globally install debuginfo RPMS.
It downloads individual .debug/dwz files (and source files, but valgrind never requests those) under
$XDG_CACHE_HOME/debuginfod_client/ (or $HOME/.cache/debuginfod_client/)
 
>  - Does debuginfod cache the RPMs, so at least they don't need to be
>    downloaded twice?

See above. It does cache downloaded files, but those aren't RPMs.

Comment 16 Richard W.M. Jones 2021-09-20 14:27:15 UTC
I see a lot of <hash>/debuginfo files under $HOME/.cache/debuginfod_client/
some of which binary match "gnutls".

Comment 17 Mark Wielaard 2021-09-20 18:11:47 UTC
(In reply to Richard W.M. Jones from comment #16)
> I see a lot of <hash>/debuginfo files under $HOME/.cache/debuginfod_client/
> some of which binary match "gnutls".

That is good, those will be used when you have debuginfod-client installed and DEBUGINFOD_URLS set.

But the question really is not about debuginfod but how well the updates to the valgrind DWARF reader is working.
The best way to test this is to see (with debuginfo installed, either through rpms or through debuginfod) what
the timing difference is between valgrind --read-inline-info=no and valgrind --read-inline-info=yes (the default).

Comment 18 Fedora Update System 2021-09-20 20:54:40 UTC
FEDORA-2021-04b130ee33 has been pushed to the Fedora 35 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-04b130ee33`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-04b130ee33

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 19 Richard W.M. Jones 2021-09-21 08:12:20 UTC
I'm still unclear how this is supposed to work (why doesn't debuginfod
just install the debuginfo RPMs?), or how --read-inline-info could affect
debuginfo, however I did some simple testing and there's no obvious
slowdown so I guess it's fixed.

Comment 20 Mark Wielaard 2021-09-21 12:45:39 UTC
(In reply to Richard W.M. Jones from comment #19)
> I'm still unclear how this is supposed to work (why doesn't debuginfod
> just install the debuginfo RPMs?), or how --read-inline-info could affect
> debuginfo, however I did some simple testing and there's no obvious
> slowdown so I guess it's fixed.

Thanks for testing.

Lets see if I can explain how "this" works.

The fix itself has nothing to do with how you got the debuginfo associated with the binary
(and libraries) you want to run under valgrind. Lets just assume the debuginfo is there,
either build into the binary itself, installed through (system wide) rpms, or through debuginfod
(because you installed elfutils-debuginfod-client and have DEBUGINFOD_URLS set).

The --read-inline-info=yes setting is the default for valgrind (in most cases). This is the description from the manual:

       --read-inline-info=<yes|no> [default: see below]
           When enabled, Valgrind will read information about inlined function
           calls from DWARF3 debug info. This slows Valgrind startup and makes
           it use more memory (typically for each inlined piece of code, 6
           words and space for the function name), but it results in more
           descriptive stacktraces. Currently, this functionality is enabled
           by default only for Linux, Android and Solaris targets and only for
           the tools Memcheck, Massif, Helgrind and DRD. Here is an example of
           some stacktraces with --read-inline-info=no:

               ==15380== Conditional jump or move depends on uninitialised value(s)
               ==15380==    at 0x80484EA: main (inlinfo.c:6)
               ==15380==
               ==15380== Conditional jump or move depends on uninitialised value(s)
               ==15380==    at 0x8048550: fun_noninline (inlinfo.c:6)
               ==15380==    by 0x804850E: main (inlinfo.c:34)
               ==15380==
               ==15380== Conditional jump or move depends on uninitialised value(s)
               ==15380==    at 0x8048520: main (inlinfo.c:6)

           And here are the same errors with --read-inline-info=yes:

               ==15377== Conditional jump or move depends on uninitialised value(s)
               ==15377==    at 0x80484EA: fun_d (inlinfo.c:6)
               ==15377==    by 0x80484EA: fun_c (inlinfo.c:14)
               ==15377==    by 0x80484EA: fun_b (inlinfo.c:20)
               ==15377==    by 0x80484EA: fun_a (inlinfo.c:26)
               ==15377==    by 0x80484EA: main (inlinfo.c:33)
               ==15377==
               ==15377== Conditional jump or move depends on uninitialised value(s)
               ==15377==    at 0x8048550: fun_d (inlinfo.c:6)
               ==15377==    by 0x8048550: fun_noninline (inlinfo.c:41)
               ==15377==    by 0x804850E: main (inlinfo.c:34)
               ==15377==
               ==15377== Conditional jump or move depends on uninitialised value(s)
               ==15377==    at 0x8048520: fun_d (inlinfo.c:6)
               ==15377==    by 0x8048520: main (inlinfo.c:35)

So if the debuginfo is available then valgrind is able to provide much more descriptive
backtraces for issues. But this does come with a startup penalty. This fix makes this
penalty much lower (from several seconds to less than a second).

If the patches work well it should be hard to even measure the (extra) slowdown from having
debuginfo available. Before the patch there could be a significant different between running
with the default --read-inline-info=yes and --read-inline-info=no.

This doesn't really have much to do with debuginfod, except that because valgrind now uses and
recommends elfutils-debuginfod-client and on Fedora 35 DEBUGINFOD_URLS is setup by default it
is more likely the debuginfo (or at least the system libraries) for your program is available.

For how debuginfod works and why it provides individual debug files and not whole RPMs see https://fedoraproject.org/wiki/Debuginfod and https://sourceware.org/elfutils/Debuginfod.html

Comment 21 Richard W.M. Jones 2021-09-21 13:15:28 UTC
Thanks - I understand better now.  I was confused because I thought the
fix was something to do with the time taken to download the debuginfo data.

Comment 22 Fedora Update System 2021-09-23 22:59:30 UTC
FEDORA-2021-eb2442ba01 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-eb2442ba01

Comment 23 Fedora Update System 2021-09-24 20:30:48 UTC
FEDORA-2021-04b130ee33 has been pushed to the Fedora 35 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 24 Fedora Update System 2021-09-24 21:48:17 UTC
FEDORA-2021-eb2442ba01 has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-eb2442ba01`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-eb2442ba01

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 25 Fedora Update System 2021-10-02 01:27:27 UTC
FEDORA-2021-eb2442ba01 has been pushed to the Fedora 34 stable repository.
If problem still persists, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.