Bug 1965374 - glibc: valgrind suppressions no longer active after debuginfo removal
Summary: glibc: valgrind suppressions no longer active after debuginfo removal
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: 34
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Florian Weimer
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-05-27 15:08 UTC by Mattias Ellert
Modified: 2021-07-08 08:56 UTC (History)
22 users (show)

Fixed In Version: glibc-2.33.9000-21.fc35 glibc-2.32-8.fc33 glibc-2.33-18.fc34
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-06 13:25:38 UTC
Type: Bug


Attachments (Terms of Use)

Description Mattias Ellert 2021-05-27 15:08:52 UTC
Description of problem:

Referring to the koschei monitoring of the HepMC3 package build in Fedora:

https://koschei.fedoraproject.org/package/HepMC3?collection=f34
https://koschei.fedoraproject.org/package/HepMC3?collection=f33

The package has started failing to build on Fedora 33 and 34.
For both F33 and F34 the first failing build list glibc as a package that was updated since the previous (successful) attempt.

From 2.33-5.fc34 to 2.33-8.fc34 and
from 2.32-4.fc33 to 2.32-6.fc33 respectively.

None of the other listed updated dependencies looks relevant.

The line of code that triggers the errors is line 107 in HepMC3-3.2.3/src/WriterAscii.cc:
```
    m_cursor += sprintf(m_cursor, "U %s %s\n", Units::name(evt.momentum_unit()).c_str(), Units::name(evt.length_unit()).c_str());
```
I do not see any obviously wrong with this line.

I made some test and replaced the above line with:
```
    std::cerr << Units::name(evt.momentum_unit()).c_str() << std::endl;
    std::cerr << Units::name(evt.length_unit()).c_str() << std::endl;
    m_cursor += sprintf(m_cursor, "U %s %s\n", "GEV", "MM");
```
which print outs the two strings (thereby accessing the same memory) and replacing the strings with hardcoded values in the sprintf call. This does not trigger any errors from valgrind.

Just to test that the use of temporaries in the call is not the cause I also tried:
```
    std::string mu = Units::name(evt.momentum_unit());
    std::string lu = Units::name(evt.length_unit());
    m_cursor += sprintf(m_cursor, "U %s %s\n", mu.c_str(), lu.c_str());
```
This triggers the same errors from valgrind as the original line.

I can't tell what is the cause of the issue here. Is it a problem in the updated glibc or its headers, is it a problem with g++ that was not seen before the glibc update or is it a false positive from valgrind or is there something wrong with the code I do not see.

Version-Release number of selected component (if applicable):

F33:
gcc-c++-10.3.1-1.fc33.aarch64
glibc-devel-2.32-6.fc33.aarch64
valgrind-1:3.16.1-18.fc33.aarch64

F34:
gcc-c++-11.1.1-1.fc34.aarch64
glibc-devel-2.33-8.fc34.aarch64
valgrind-1:3.17.0-1.fc34.aarch64

F32 (not affected):
gcc-c++-10.3.1-1.fc32.aarch64
glibc-devel-2.31-6.fc32.aarch64
valgrind-1:3.16.1-18.fc32.aarch64

F35/rawhide (also not affected):
gcc-c++-11.1.1-2.fc35.aarch64
glibc-devel-2.33.9000-2.fc35.aarch64
valgrind-1:3.17.0-3.fc35.aarch64

How reproducible:

Always on Fedora 33 and Fedora 34.

Steps to Reproduce:
1. Build HepMC3 package on Fedora 33 or 34 for aarch64

Actual results:
Failure

Expected results:
Success

Additional info:

I tried to download old glibc packages from koji:

glibc-2.33-5.fc34.aarch64.rpm
glibc-common-2.33-5.fc34.aarch64.rpm
glibc-devel-2.33-5.fc34.aarch64.rpm
glibc-langpack-en-2.33-5.fc34.aarch64.rpm

Installing these in a mock chroot on aarch64-test01.fedorainfracloud.org the HemMC3 package is still buildable.

Please reassing to a different component if you think another component is more appropriate.

Comment 1 Florian Weimer 2021-05-27 15:15:01 UTC
It seems that valgrind had suppressions that relied on glibc debugging information:

==20030== Conditional jump or move depends on uninitialised value(s)
==20030==    at 0x4C195EC: ??? (in /usr/lib64/libc-2.33.so)
==20030==    by 0x4BED35F: ??? (in /usr/lib64/libc-2.33.so)
==20030==    by 0x4BF753B: ??? (in /usr/lib64/libc-2.33.so)
==20030==    by 0x4C7506B: __sprintf_chk (in /usr/lib64/libc-2.33.so)
==20030==    by 0x48CF4F7: HepMC3::WriterAscii::write_event(HepMC3::GenEvent const&) (stdio2.h:38)
==20030==    by 0x10A50B: main (testIO6.cc:27)

After removal of that debugging information, these suppressions no longer work. Perhaps there is a fix for this without bringing back all the debugging information.

Comment 2 andrii.verbytskyi 2021-06-01 09:05:53 UTC
I've just tried to rebuild 

valgrind-3.17.0-3.fc35.src.rpm

for aarch64 on f33 and f34 and see 

-- Running  tests in gdbserver_tests -----------------------------------
hginfo:          valgrind   --tool=helgrind --vgdb=yes --vgdb-error=0 --vgdb-prefix=./vgdb-prefix-hginfo -q  ./../helgrind/tests/hg01_all_ok  (progB: ./gdb --quiet -l 60 --nx ../helgrind/tests/hg01_all_ok)
*** hginfo failed (stderr) ***
*** hginfo failed (stdoutB) ***
*** hginfo failed (stderrB) ***
hgtls:           valgrind   --tool=helgrind --vgdb=yes --vgdb-error=0 --vgdb-prefix=./vgdb-prefix-hgtls -q  ./../none/tests/tls  (progB: ./gdb --quiet -l 60 --nx ../none/tests/tls)
*** hgtls failed (stdoutB) ***
mcblocklistsearch: valgrind   --tool=memcheck --vgdb=yes --vgdb-error=0 --vgdb-prefix=./vgdb-prefix-mcblocklistsearch -q  ./../memcheck/tests/leak-tree  (progB: ./gdb --quiet -l 60 --nx 1>&2 ../memcheck/tests/leak-tree)
mcbreak:         valgrind   --tool=memcheck --vgdb=yes --vgdb-error=0 --vgdb-prefix=./vgdb-prefix-mcbreak ./t  (progB: ./gdb --quiet -l 60 --nx ./t)
sh: line 1: 14371 Aborted                 (core dumped) ./gdb --quiet -l 60 --nx ./t < mcbreak.stdinB.gdb > mcbreak.stdoutB.out 2> mcbreak.stderrB.out



Best regards,

Andrii

Comment 3 Mark Wielaard 2021-06-01 10:05:50 UTC
Is this an arm64 only issue? Or are you seeing this on other architectures too?

Comment 4 andrii.verbytskyi 2021-06-01 13:26:58 UTC
I've tried for arm64 only. The f33/f34 builds failed, but f32 and rawhide were ok. Just restarted the builds for other architectures.

Comment 5 andrii.verbytskyi 2021-06-01 22:50:02 UTC
The rebuild fails on multiple architectures.
It succeeds only for all *86 and for aarch64@rawhide/f32.

Comment 6 Florian Weimer 2021-06-15 19:11:47 UTC
glibc-2.33.9000-16.fc35 may improve this situation: https://koji.fedoraproject.org/koji/buildinfo?buildID=1772153

It is available in the f35-build-side-42221 side tag. I may not have guessed the relevant symbols correctly, but we can extend the symbol set easily enough.

If rawhide is confirmed fixed by this, I will backport to Fedora 34 and 33.

Comment 7 Mattias Ellert 2021-06-16 15:31:23 UTC
The HepMC3 build in koschei for f35 (rawhide) failed with glibc-2.33.9000-18.fc35.aarch64
https://koschei.fedoraproject.org/package/HepMC3?collection=f35

Comment 8 Florian Weimer 2021-06-16 15:39:16 UTC
(In reply to Mattias Ellert from comment #7)
> The HepMC3 build in koschei for f35 (rawhide) failed with
> glibc-2.33.9000-18.fc35.aarch64
> https://koschei.fedoraproject.org/package/HepMC3?collection=f35

The valgrind errors suggest that the required all symbols are present, so maybe the issue wasn't the missing symbols after all. Hmm.

Unfortunately I do not have the resources to analyze this further at this time, sorry.

Comment 9 Florian Weimer 2021-06-17 11:00:30 UTC
I had to revert the addition of the __str* symbols due to bug 1973026.

I spent some time figuring out why __strlen_mte is apparently selected:

==2874915== Conditional jump or move depends on uninitialised value(s)
==2874915==    at 0x4C29AEC: __strlen_mte (in /usr/lib64/libc.so.6)
==2874915==    by 0x4BF008F: __vfprintf_internal (in /usr/lib64/libc.so.6)
==2874915==    by 0x4BFA0FB: ??? (in /usr/lib64/libc.so.6)
==2874915==    by 0x4C84E9B: __sprintf_chk (in /usr/lib64/libc.so.6)
==2874915==    by 0x48CF037: HepMC3::WriterAscii::write_event(HepMC3::GenEvent const&) (stdio2.h:38)
==2874915==    by 0x10D71F: main (testBoost.cc:132)

This may be a binutils bug. It looks like the IFUNC is bypassed and the weak definition of __strlen_mte is used to bind strlen locally. For example:

Dump of assembler code for function __strdup:
   0x0000000000093e90 <+0>:     paciasp
   0x0000000000093e94 <+4>:     stp     x29, x30, [sp, #-32]!
   0x0000000000093e98 <+8>:     mov     x29, sp
   0x0000000000093e9c <+12>:    stp     x19, x20, [sp, #16]
   0x0000000000093ea0 <+16>:    mov     x20, x0
   0x0000000000093ea4 <+20>:    bl      0x9dac0 <__strlen_mte>
   0x0000000000093ea8 <+24>:    add     x19, x0, #0x1
   0x0000000000093eac <+28>:    mov     x0, x19
   0x0000000000093eb0 <+32>:    bl      0x27e50 <malloc@plt>
   0x0000000000093eb4 <+36>:    cbz     x0, 0x93ed0 <__strdup+64>
   0x0000000000093eb8 <+40>:    mov     x2, x19
   0x0000000000093ebc <+44>:    mov     x1, x20
   0x0000000000093ec0 <+48>:    ldp     x19, x20, [sp, #16]
   0x0000000000093ec4 <+52>:    ldp     x29, x30, [sp], #32
   0x0000000000093ec8 <+56>:    autiasp
   0x0000000000093ecc <+60>:    b       0x9b780 <__memcpy_generic>
   0x0000000000093ed0 <+64>:    ldp     x19, x20, [sp, #16]
   0x0000000000093ed4 <+68>:    ldp     x29, x30, [sp], #32
   0x0000000000093ed8 <+72>:    autiasp
   0x0000000000093edc <+76>:    ret
End of assembler dump.

I really don't want to bring back the full symbol table for libc.so.6 because it is so large.

Maybe __strlen_asimd can be emulated better by valgrind, but it is significantly more complex.

Comment 10 Florian Weimer 2021-06-17 11:07:04 UTC
__strlen_mte is used deliberately:

/* It doesn't make sense to send libc-internal strlen calls through a PLT. */
	.globl __GI_strlen; __GI_strlen = __strlen_mte

So no binutils bug.

Comment 11 Mark Wielaard 2021-06-17 11:23:45 UTC
These are the intercepted symbols from valgrind shared/vg_replace_strmem.c:

bcmp
bcopy
__GI_memchr
__GI_memcmp
__GI_memcpy
__GI_memmove
__GI_mempcpy
__GI___rawmemchr
__GI_stpcpy
__GI_strcasecmp
__GI___strcasecmp_l
__GI_strcasecmp_l
__GI_strcat
__GI_strchr
__GI_strcmp
__GI_strcpy
__GI_strcspn
__GI_strlen
__GI_strncasecmp
__GI___strncasecmp_l
__GI_strncasecmp_l
__GI_strncmp
__GI_strncpy
__GI_strnlen
__GI_strrchr
__GI_wcsnlen
__GI_wmemchr
index
memchr
memcmp
__memcmp_sse2
__memcmp_sse4_1
memcpy
__memcpy_avx_unaligned_erms
__memcpy_chk
__memcpy_sse2
memcpyZAGLIBCZu2Zd2Zd5
memcpyZAZAGLIBCZu2Zd14
memcpyZDVARIANTZDsse3x
memcpyZDVARIANTZDsse42
memcpyZPZa
memmove
__memmove_chk
memmoveZDVARIANTZDsse3x
memmoveZDVARIANTZDsse42
memmoveZPZa
mempcpy
memrchr
memset
memsetZPZa
putenv
rawmemchr
rindex
setenv
stpcpy
__stpcpy_chk
__stpcpy_sse2
__stpcpy_sse2_unaligned
stpncpy
strcasecmp
strcasecmp_l
strcasestr
strcat
strchr
strchrnul
__strchr_sse2
__strchr_sse2_no_bsf
strcmp
__strcmp_sse2
__strcmp_sse42
strcpy
__strcpy_chk
strcspn
strlcat
strlcpy
strlen
__strlen_sse2
__strlen_sse2_no_bsf
__strlen_sse42
strncasecmp
strncasecmp_l
strncat
strncmp
__strncmp_sse2
__strncmp_sse42
strncpy
__strncpy_sse2
__strncpy_sse2_unaligned
strnlen
strpbrk
strrchr
__strrchr_sse2
__strrchr_sse2_no_bsf
__strrchr_sse42
strspn
strstr
__strstr_sse2
__strstr_sse42
unsetenv
wcschr
wcscmp
wcscpy
wcslen
wcsncmp
wcsnlen
wcsrchr
wmemchr

Note that some of these might be ancient, non-existing or unneeded these days.
The intercepts are there for functions that are so optimized that valgrind/memcheck cannot proof them correct (at the time).

Comment 12 Florian Weimer 2021-06-17 12:32:49 UTC
Thanks. I will not include the public symbols, only the __ variants, assuming that valgrind can lift them from .dynsym.

Comment 13 Mattias Ellert 2021-06-17 15:16:06 UTC
HepMC3 scratch build in f35-build-side-42819 with glibc-2.33.9000-21.fc35:

https://koji.fedoraproject.org/koji/taskinfo?taskID=70305267

Succeeds on all architectures, including aarch64 that used to fail.

Comment 14 andrii.verbytskyi 2021-06-17 15:19:49 UTC
Thank you all!

Comment 15 Florian Weimer 2021-06-18 07:23:13 UTC
Another HepMC3 scratch build against glibc-2.33.9000-22.fc35:

https://koji.fedoraproject.org/koji/taskinfo?taskID=70343510

I'm going to treat this as a glibc bug (and not valgrind improperly depending on glibc internals). I plan to backport the changes to Fedora 34 and 33, which also received the glibc debuginfo updates.

Comment 16 Fedora Update System 2021-06-18 15:07:09 UTC
FEDORA-2021-483200b24b has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-483200b24b

Comment 17 Fedora Update System 2021-06-19 01:11:09 UTC
FEDORA-2021-483200b24b has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-483200b24b`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-483200b24b

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 18 Fedora Update System 2021-06-20 01:07:34 UTC
FEDORA-2021-483200b24b has been pushed to the Fedora 34 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 19 Mattias Ellert 2021-06-20 05:51:37 UTC
HepMC3 build in koschei for Fedora 34 now OK:

https://koschei.fedoraproject.org/package/HepMC3?collection=f34

Only Fedora 33 is still broken.

Comment 20 Fedora Update System 2021-06-21 10:33:42 UTC
FEDORA-2021-f29b4643c7 has been submitted as an update to Fedora 33. https://bodhi.fedoraproject.org/updates/FEDORA-2021-f29b4643c7

Comment 21 Fedora Update System 2021-06-22 01:11:07 UTC
FEDORA-2021-f29b4643c7 has been pushed to the Fedora 33 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-f29b4643c7`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-f29b4643c7

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 22 Jan Kratochvil 2021-06-27 07:16:41 UTC
If there are needed more symbols they need to be added to .dynsym, not to an incomplete .symtab: Bug 1975895 Comment 2

Comment 23 Florian Weimer 2021-06-27 08:45:20 UTC
(In reply to Jan Kratochvil from comment #22)
> If there are needed more symbols they need to be added to .dynsym, not to an
> incomplete .symtab: Bug 1975895 Comment 2

Many of these .symtab symbols are hidden, so they cannot be added to .dynsym. Ideally, tools would be fixed not to assume that .symtab is a superset of .dynsym.

Comment 24 Jan Kratochvil 2021-06-27 09:16:14 UTC
(In reply to Florian Weimer from comment #23)
> Many of these .symtab symbols are hidden, so they cannot be added to
> .dynsym. Ideally, tools would be fixed not to assume that .symtab is a
> superset of .dynsym.

Why the STB_LOCAL symbols cannot be in .dynsym? Maybe some tool does not support that and it could be fixed? Isn't it another option instead of the .symtab+.dynsym merging? Wouldn't be the STB_LOCAL in .dynsym more compatible across consumers? Or I suspect it would break existing ld.so (I have not checked)?

Comment 25 Fedora Update System 2021-07-07 01:03:59 UTC
FEDORA-2021-f29b4643c7 has been pushed to the Fedora 33 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 26 Florian Weimer 2021-07-07 15:58:03 UTC
Reproducer (build with -O2):

#define _FORTIFY_SOURCE 2
#include <string>
#include <stdio.h>

void __attribute__((weak))
test(const char *a, const char *b)
{
  std::string mu = a;
  std::string lu = b;
  char buf[256];
  sprintf(buf, "U %s %s", mu.c_str(), lu.c_str());
  puts(buf);
}

int
main()
{
  test("GEV", "MM");
}

Comment 27 Mark Wielaard 2021-07-08 08:50:51 UTC
(In reply to Florian Weimer from comment #26)
> Reproducer (build with -O2):
> 
> #define _FORTIFY_SOURCE 2
> #include <string>
> #include <stdio.h>
> 
> void __attribute__((weak))
> test(const char *a, const char *b)
> {
>   std::string mu = a;
>   std::string lu = b;
>   char buf[256];
>   sprintf(buf, "U %s %s", mu.c_str(), lu.c_str());
>   puts(buf);
> }
> 
> int
> main()
> {
>   test("GEV", "MM");
> }

I cannot reproduce this on either fedora 34 or rawhide on x86_64.
In which situations does this cause valgrind memcheck errors?

P.S. Note that this bug is already closed.

Comment 28 Florian Weimer 2021-07-08 08:56:07 UTC
(In reply to Mark Wielaard from comment #27)
> (In reply to Florian Weimer from comment #26)
> > Reproducer (build with -O2):
> > 
> > #define _FORTIFY_SOURCE 2
> > #include <string>
> > #include <stdio.h>
> > 
> > void __attribute__((weak))
> > test(const char *a, const char *b)
> > {
> >   std::string mu = a;
> >   std::string lu = b;
> >   char buf[256];
> >   sprintf(buf, "U %s %s", mu.c_str(), lu.c_str());
> >   puts(buf);
> > }
> > 
> > int
> > main()
> > {
> >   test("GEV", "MM");
> > }
> 
> I cannot reproduce this on either fedora 34 or rawhide on x86_64.
> In which situations does this cause valgrind memcheck errors?

The bug only occurred on aarch64. I made some changes to .symtab in Fedora 34, so I wanted to re-run the reproducer. I verified that it fails with glibc-2.33-12.fc34 (if I recall correctly; not all the relevant builds are still in Koji, but that one is still there). The reproducer passes on aarch64 on current rawhide and Fedora 34, and Fedora 34 with my upcoming changes in glibc-2.33-19.fc34 (essentially backporting the .symtab construction to Fedora 34).

So there is nothing to do here really. I just kept rewriting that reproducer from scratch over and over again, so I wanted to have something that I can cut-and-paste easily. Sorry for any confusion caused.


Note You need to log in before you can comment on or make changes to this bug.