1796433 – valgrind must ignore main thread's DTV and DTV entries.

Bug 1796433 - valgrind must ignore main thread's DTV and DTV entries.

Summary: valgrind must ignore main thread's DTV and DTV entries.

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	valgrind
Sub Component:
Version:	32
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Mark Wielaard
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1796559 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2020-01-30 12:32 UTC by Pavel Březina
Modified:	2020-03-05 13:06 UTC (History)
CC List:	16 users (show)
Fixed In Version:	valgrind-3.15.0-17.fc32
Clone Of:
Environment:
Last Closed:	2020-02-14 13:29:35 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Pavel Březina 2020-01-30 12:32:54 UTC

Description of problem:
Hi Florian,
I see these two valgrind errors on recent Fedora rawhide when testing SSSD. Is this a known issue?

==752382== 24 bytes in 1 blocks are possibly lost in loss record 2 of 18
==752382==    at 0x483A809: malloc (vg_replace_malloc.c:309)
==752382==    by 0x4013D84: tls_get_addr_tail (in /usr/lib64/ld-2.30.9000.so)
==752382==    by 0x4019EEB: __tls_get_addr (in /usr/lib64/ld-2.30.9000.so)
==752382==    by 0x6D3FAB5: ???
==752382==    by 0x6D2ECFA: ???
==752382==    by 0x6D1FB74: ???
==752382==    by 0x6CF3A14: ???
==752382==    by 0x4011781: call_init.part.0 (in /usr/lib64/ld-2.30.9000.so)
==752382==    by 0x4011890: _dl_init (in /usr/lib64/ld-2.30.9000.so)
==752382==    by 0x49CB034: _dl_catch_exception (in /usr/lib64/libc-2.30.9000.so)
==752382==    by 0x4015C33: dl_open_worker (in /usr/lib64/ld-2.30.9000.so)
==752382==    by 0x49CAFD7: _dl_catch_exception (in /usr/lib64/libc-2.30.9000.so)
==752382==
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: possible
   fun:malloc
   fun:tls_get_addr_tail
   fun:__tls_get_addr
   obj:*
   obj:*
   obj:*
   obj:*
   fun:call_init.part.0
   fun:_dl_init
   fun:_dl_catch_exception
   fun:dl_open_worker
   fun:_dl_catch_exception
}

==752382== 512 bytes in 1 blocks are possibly lost in loss record 10 of 18
==752382==    at 0x483A809: malloc (vg_replace_malloc.c:309)
==752382==    by 0x4013C7C: _dl_resize_dtv (in /usr/lib64/ld-2.30.9000.so)
==752382==    by 0x40146BD: _dl_update_slotinfo (in /usr/lib64/ld-2.30.9000.so)
==752382==    by 0x401481B: update_get_addr (in /usr/lib64/ld-2.30.9000.so)
==752382==    by 0x4019EEB: __tls_get_addr (in /usr/lib64/ld-2.30.9000.so)
==752382==    by 0x6D3FAB5: ???
==752382==    by 0x6D2ECFA: ???
==752382==    by 0x6D1FB74: ???
==752382==    by 0x6CF3A14: ???
==752382==    by 0x4011781: call_init.part.0 (in /usr/lib64/ld-2.30.9000.so)
==752382==    by 0x4011890: _dl_init (in /usr/lib64/ld-2.30.9000.so)
==752382==    by 0x49CB034: _dl_catch_exception (in /usr/lib64/libc-2.30.9000.so)
==752382== 
{
   <insert_a_suppression_name_here>
   Memcheck:Leak
   match-leak-kinds: possible
   fun:malloc
   fun:_dl_resize_dtv
   fun:_dl_update_slotinfo
   fun:update_get_addr
   fun:__tls_get_addr
   obj:*
   obj:*
   obj:*
   obj:*
   fun:call_init.part.0
   fun:_dl_init
   fun:_dl_catch_exception
}

Version-Release number of selected component (if applicable):
glibc-2.30.9000-31.fc32.x86_64

How reproducible:
Always.

Comment 1 Pavel Březina 2020-01-30 12:51:37 UTC

Steps to reproduce:
1. Clone SSSD https://github.com/SSSD/sssd
2. autoreconf -if && ./configure && make && make check
3. valgrind --leak-check=full ./dlopen-tests

Comment 2 Florian Weimer 2020-01-30 12:53:09 UTC

(In reply to Pavel Březina from comment #0)
> Description of problem:
> Hi Florian,
> I see these two valgrind errors on recent Fedora rawhide when testing SSSD.
> Is this a known issue?
> 
> ==752382== 24 bytes in 1 blocks are possibly lost in loss record 2 of 18
> ==752382==    at 0x483A809: malloc (vg_replace_malloc.c:309)
> ==752382==    by 0x4013D84: tls_get_addr_tail (in /usr/lib64/ld-2.30.9000.so)
> ==752382==    by 0x4019EEB: __tls_get_addr (in /usr/lib64/ld-2.30.9000.so)
> ==752382==    by 0x6D3FAB5: ???
> ==752382==    by 0x6D2ECFA: ???
> ==752382==    by 0x6D1FB74: ???
> ==752382==    by 0x6CF3A14: ???
> ==752382==    by 0x4011781: call_init.part.0 (in /usr/lib64/ld-2.30.9000.so)
> ==752382==    by 0x4011890: _dl_init (in /usr/lib64/ld-2.30.9000.so)
> ==752382==    by 0x49CB034: _dl_catch_exception (in
> /usr/lib64/libc-2.30.9000.so)
> ==752382==    by 0x4015C33: dl_open_worker (in /usr/lib64/ld-2.30.9000.so)
> ==752382==    by 0x49CAFD7: _dl_catch_exception (in
> /usr/lib64/libc-2.30.9000.so)
> ==752382==

Would you please install additional debuginfo packages?

I suspect this is just a lazy TLS allocation. Depending on how the application shuts down, this memory cannot be deallocated by glibc, and thus shows up as a valgrind false positive.

Comment 3 Pavel Březina 2020-01-30 14:49:38 UTC

(In reply to Florian Weimer from comment #2)
> 
> Would you please install additional debuginfo packages?

Do you have any specific in mind? I tried installing glibc and kernel debuginfo but I got the same result.

Comment 4 Florian Weimer 2020-01-30 14:51:22 UTC

(In reply to Pavel Březina from comment #3)
> (In reply to Florian Weimer from comment #2)
> > 
> > Would you please install additional debuginfo packages?
> 
> Do you have any specific in mind? I tried installing glibc and kernel
> debuginfo but I got the same result.

GDB should list the missing packages you need to install. You will have to run the test under GDB to see them.

Comment 5 Florian Weimer 2020-01-30 15:40:03 UTC

(In reply to Pavel Březina from comment #1)
> Steps to reproduce:
> 1. Clone SSSD https://github.com/SSSD/sssd
> 2. autoreconf -if && ./configure && make && make check

I run into a build error:

src/providers/ad/ad_gpo_ndr.c: In function ‘ndr_pull_security_ace_object_type’:
src/providers/ad/ad_gpo_ndr.c:108:13: error: implicit declaration of function  ndr_pull_get_switch_value’; did you mean ‘ndr_pull_set_switch_value’? [-Werror=implicit-function-declaration]
  108 |     level = ndr_pull_get_switch_value(ndr, r);
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~
      |             ndr_pull_set_switch_value
cc1: some warnings being treated as errors

HEAD is at d3d72b907b12448c52004a1b3aabfee0b516bf2f.

Comment 6 Pavel Březina 2020-01-31 09:51:10 UTC

Sorry, about that. You actually need to use this branch that fix build on rawhide:
https://github.com/pbrezina/sssd/tree/ndr

GDB remains silent in this case so I do not know what I need to install.

This is the test source:
https://github.com/SSSD/sssd/blob/master/src/tests/dlopen-tests.c

Comment 7 Florian Weimer 2020-01-31 12:29:24 UTC

After installing debugging information, the test runs into a timeout under valgrind:

Running suite(s): dlopen
0%: Checks: 1, Failures: 0, Errors: 1
src/tests/dlopen-tests.c:239:E:dlopen:test_dlopen_base:0: (after this point) Test timeout expired

How do I increase the test timeout?

Comment 8 Florian Weimer 2020-01-31 12:40:19 UTC

This appears to bypass the timeout:

CK_FORK=no CK_DEFAULT_TIMEOUT=0 valgrind --leak-check=full ./dlopen-tests

Comment 9 Florian Weimer 2020-01-31 14:20:47 UTC

I think the reason why valgrind does not report a full backtrace is that the object which performs the TLS allocation is dlclose'd. valgrind cannot interpret the stack trace in this case because the object is gone when it is printed.

It seems to be this TLS allocation:

#0  tls_get_addr_tail (ti=0x7ffff5d2dbf8, dtv=0x7ffff7c480a0, the_map=0x43a130) at ../elf/dl-tls.c:739
#1  0x00007ffff7fe9eec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#2  0x00007ffff5bddab6 in _gnutls_fips_mode_enabled () at fips.c:68
#3  0x00007ffff5bcccfb in _gnutls_rnd_preinit () at random.c:110
#4  0x00007ffff5bbdb75 in _gnutls_global_init (constructor=constructor@entry=1) at global.c:308
#5  0x00007ffff5b91a15 in lib_init () at global.c:511
#6  0x00007ffff7fe1782 in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffd8e8, env=env@entry=0x7fffffffd8f8) at dl-init.c:72

Running GDB like this disables fork and the test suite timeout:

CK_FORK=no CK_DEFAULT_TIMEOUT=0 gdb ./dlopen-tests

I don't know if this a real leak.

Comment 12 Lukas Slebodnik 2020-02-06 10:31:48 UTC

Much simpler reproducer is in https://bugzilla.redhat.com/show_bug.cgi?id=1796559
so this BZ should be closed as duplicate of BZ1796559

But it still be good to if glibc guys could check whether issue is in glibc
(revealed be new version of samba) or it is abug in samba itself.

Comment 13 Carlos O'Donell 2020-02-06 18:56:15 UTC

(In reply to Lukas Slebodnik from comment #12)
> Much simpler reproducer is in
> https://bugzilla.redhat.com/show_bug.cgi?id=1796559
> so this BZ should be closed as duplicate of BZ1796559
> 
> But it still be good to if glibc guys could check whether issue is in glibc
> (revealed be new version of samba) or it is abug in samba itself.

That's easier to debug. I'm checking to see if I can show this is a real leak or not.

Comment 14 Carlos O'Donell 2020-02-06 21:45:13 UTC

(In reply to Carlos O'Donell from comment #13)
> (In reply to Lukas Slebodnik from comment #12)
> > Much simpler reproducer is in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1796559
> > so this BZ should be closed as duplicate of BZ1796559
> > 
> > But it still be good to if glibc guys could check whether issue is in glibc
> > (revealed be new version of samba) or it is abug in samba itself.
> 
> That's easier to debug. I'm checking to see if I can show this is a real
> leak or not.

Thankfully we enter allocate_dtv_entry only twice.

The first time is here:

(gdb) bt
#0  allocate_dtv_entry (size=24, alignment=8) at ../elf/dl-tls.c:579
#1  allocate_and_init (map=0x415910) at ../elf/dl-tls.c:607
#2  tls_get_addr_tail (ti=0x7ffff70a6bf0, dtv=0x7ffff7df20a0, the_map=0x415910) at ../elf/dl-tls.c:787
#3  0x00007ffff7fe9eec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#4  0x00007ffff6f4a166 in _gnutls_fips_mode_enabled () from /lib64/libgnutls.so.30
#5  0x00007ffff6f38f6b in _gnutls_rnd_preinit () from /lib64/libgnutls.so.30
#6  0x00007ffff6f2a0c5 in _gnutls_global_init () from /lib64/libgnutls.so.30
#7  0x00007ffff6efdaf5 in lib_init () from /lib64/libgnutls.so.30
#8  0x00007ffff7fe1782 in call_init (l=<optimized out>, argc=argc@entry=3, argv=argv@entry=0x7fffffffe588, env=env@entry=0x7fffffffe5a8)
    at dl-init.c:72
#9  0x00007ffff7fe1891 in call_init (env=0x7fffffffe5a8, argv=0x7fffffffe588, argc=3, l=<optimized out>) at dl-init.c:30
#10 _dl_init (main_map=0x4052d0, argc=3, argv=0x7fffffffe588, env=0x7fffffffe5a8) at dl-init.c:119
#11 0x00007ffff7f30ff5 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:182
#12 0x00007ffff7fe5c34 in dl_open_worker (a=a@entry=0x7fffffffe200) at dl-open.c:758
#13 0x00007ffff7f30f98 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:208
#14 0x00007ffff7fe514e in _dl_open (file=0x7fffffffe7e7 "/usr/lib64/samba/libauth-samba4.so", mode=-2147483390,
    caller_dlopen=0x4011c0 <main+74>, nsid=-2, argc=3, argv=<optimized out>, env=0x7fffffffe5a8) at dl-open.c:837
#15 0x00007ffff7fc039c in dlopen_doit (a=a@entry=0x7fffffffe420) at dlopen.c:66
#16 0x00007ffff7f30f98 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffe3c0, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:208
#17 0x00007ffff7f31063 in __GI__dl_catch_error (objname=0x7ffff7fc4050 <last_result+16>, errstring=0x7ffff7fc4058 <last_result+24>,
    mallocedp=0x7ffff7fc4048 <last_result+8>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:227
#18 0x00007ffff7fc0bd9 in _dlerror_run (operate=operate@entry=0x7ffff7fc0340 <dlopen_doit>, args=args@entry=0x7fffffffe420) at dlerror.c:170
#19 0x00007ffff7fc0428 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#20 0x00000000004011c0 in main (argc=3, argv=0x7fffffffe588) at dlopen-check.c:13

I inspect the returned malloc pointer in $rax, and do 'break free if $rdi == [value of $rax]' to get help from gdb's conditional breakpoint handling.

If were to have to track this by hand I've have to wade through ~8000+ API calls, and I know because I used our libmtrace.so to look into this first (but the record accounting is too loose to catch this 1 lost allocation).

(gdb) bt
#0  __GI___libc_free (mem=0x445340) at malloc.c:3087
#1  0x00007ffff7fe4769 in _dl_update_slotinfo (req_modid=11) at ../elf/dl-tls.c:687
#2  0x00007ffff7fe481c in update_get_addr (ti=0x7ffff70a6bf0) at ../elf/dl-tls.c:799
#3  0x00007ffff7fe9eec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#4  0x00007ffff6f4a166 in _gnutls_fips_mode_enabled () from /lib64/libgnutls.so.30
#5  0x00007ffff6f38f6b in _gnutls_rnd_preinit () from /lib64/libgnutls.so.30
#6  0x00007ffff6f2a0c5 in _gnutls_global_init () from /lib64/libgnutls.so.30
#7  0x00007ffff6efdaf5 in lib_init () from /lib64/libgnutls.so.30
#8  0x00007ffff7fe1782 in call_init (l=<optimized out>, argc=argc@entry=3, argv=argv@entry=0x7fffffffe588, env=env@entry=0x7fffffffe5a8)
    at dl-init.c:72
#9  0x00007ffff7fe1891 in call_init (env=0x7fffffffe5a8, argv=0x7fffffffe588, argc=3, l=<optimized out>) at dl-init.c:30
#10 _dl_init (main_map=0x405900, argc=3, argv=0x7fffffffe588, env=0x7fffffffe5a8) at dl-init.c:119
#11 0x00007ffff7f30ff5 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:182
#12 0x00007ffff7fe5c34 in dl_open_worker (a=a@entry=0x7fffffffe200) at dl-open.c:758
#13 0x00007ffff7f30f98 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:208
#14 0x00007ffff7fe514e in _dl_open (file=0x7fffffffe7e7 "/usr/lib64/samba/libauth-samba4.so", mode=-2147483390,
    caller_dlopen=0x4011c0 <main+74>, nsid=-2, argc=3, argv=<optimized out>, env=0x7fffffffe5a8) at dl-open.c:837
#15 0x00007ffff7fc039c in dlopen_doit (a=a@entry=0x7fffffffe420) at dlopen.c:66
#16 0x00007ffff7f30f98 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffe3c0, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:208
#17 0x00007ffff7f31063 in __GI__dl_catch_error (objname=0x7ffff7fc4050 <last_result+16>, errstring=0x7ffff7fc4058 <last_result+24>,
    mallocedp=0x7ffff7fc4048 <last_result+8>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:227
#18 0x00007ffff7fc0bd9 in _dlerror_run (operate=operate@entry=0x7ffff7fc0340 <dlopen_doit>, args=args@entry=0x7fffffffe420) at dlerror.c:170
#19 0x00007ffff7fc0428 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#20 0x00000000004011c0 in main (argc=3, argv=0x7fffffffe588) at dlopen-check.c:13

We free the first allocation here in _dl_update_slotinfo.

Then we allocate another dtv entry.

The next dtv entry is never freed because it's the last in-use entry.

Next is to validate that __libc_freeres can reach this last entry and free it.

This happens here:

537 void
538 _dl_deallocate_tls (void *tcb, bool dealloc_tcb)
539 {
540   dtv_t *dtv = GET_DTV (tcb);
541 
542   /* We need to free the memory allocated for non-static TLS.  */
543   for (size_t cnt = 0; cnt < dtv[-1].counter; ++cnt)
544     free (dtv[1 + cnt].pointer.to_free);
545 
546   /* The array starts with dtv[-1].  */
547   if (dtv != GL(dl_initial_dtv))
548     free (dtv - 1);
549 
550   if (dealloc_tcb)
551     free (*tcb_to_pointer_to_free_location (tcb));
552 }
553 rtld_hidden_def (_dl_deallocate_tls)

I make an artificial call to __libc_freeres() from a destructor.

At this point libpthread.so.0 is still loaded, so __libpthread_freeres should be visible and have a valid value.

The debugger can see it:

(gdb) p __libpthread_freeres
$12 = {void (void)} 0x7ffff7916160 <__libpthread_freeres>

Which is loaded:

7ffff790e000-7ffff7915000 r--p 00000000 09:00 5921721                    /usr/lib64/libpthread-2.31.so
7ffff7915000-7ffff7925000 r-xp 00007000 09:00 5921721                    /usr/lib64/libpthread-2.31.so
7ffff7925000-7ffff792a000 r--p 00017000 09:00 5921721                    /usr/lib64/libpthread-2.31.so
7ffff792a000-7ffff792b000 r--p 0001b000 09:00 5921721                    /usr/lib64/libpthread-2.31.so
7ffff792b000-7ffff792c000 rw-p 0001c000 09:00 5921721                    /usr/lib64/libpthread-2.31.so

=> 0x00007ffff7f67f5a <+90>:	cmpq   $0x0,0x4ee7e(%rip)        # 0x7ffff7fb6de0
   0x00007ffff7f67f62 <+98>:	je     0x7ffff7f67f69 <__GI___libc_freeres+105>
   0x00007ffff7f67f64 <+100>:	callq  0x7ffff7e193a0 <__libpthread_freeres@plt>

The pointer to __libpthread_freeres is in libc:

7ffff7df4000-7ffff7e19000 r--p 00000000 09:00 5921707                    /usr/lib64/libc-2.31.so
7ffff7e19000-7ffff7f69000 r-xp 00025000 09:00 5921707                    /usr/lib64/libc-2.31.so
7ffff7f69000-7ffff7fb4000 r--p 00175000 09:00 5921707                    /usr/lib64/libc-2.31.so
7ffff7fb4000-7ffff7fb7000 r--p 001bf000 09:00 5921707                    /usr/lib64/libc-2.31.so
^^^^^^^^^^^^^^^^^^^^^^^^^ (this is GNU_RELRO RO data)
7ffff7fb7000-7ffff7fba000 rw-p 001c2000 09:00 5921707                    /usr/lib64/libc-2.31.so


00000000001c2de0  0000000200000006 R_X86_64_GLOB_DAT      0000000000000000 __libpthread_freeres + 0
00000000001c2bf8  0000000200000007 R_X86_64_JUMP_SLOT     0000000000000000 __libpthread_freeres + 0
     2: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __libpthread_freeres
 25754: 0000000000000000     0 NOTYPE  WEAK   DEFAULT  UND __libpthread_freeres

The value 0x7ffff7fb6de0 *roughly* matches a R_X86_64_GLOB_DAT relocation against __libpthread_freeres.

Is it really though?

LOAD           0x1bf420 0x00000000001c0420 0x00000000001c0420 0x005050 0x008f48 RW  0x1000

VirtAddr is 0x1c0420, with reloc offset at 00000000001c2de0, for a difference of +0x29c0

PT_LOAD at 0x7ffff7fb4000
VirtAddr is 0x1c0000 (rounded to even page) with reloc offset at 0x1c2de0 for a difference of +0x2de0

Reloc load address is 0x7ffff7fb4000 + 0x2de0 = 0x7ffff7fb6de0

Matches relocation location. Confirmed.

We are actually looking at the R_X86_64_GLOB_DAT result in libc.so.6 in .got, namely:
00000000001c2de0  0000000200000006 R_X86_64_GLOB_DAT      0000000000000000 __libpthread_freeres + 0

(gdb) disassemble 0x7ffff7e193a0
Dump of assembler code for function __libpthread_freeres@plt:
   0x00007ffff7e193a0 <+0>:	endbr64 
   0x00007ffff7e193a4 <+4>:	bnd jmpq *0x19d84d(%rip)        # 0x7ffff7fb6bf8 <__libpthread_freeres>
   0x00007ffff7e193ab <+11>:	nopl   0x0(%rax,%rax,1)
End of assembler dump.

(gdb) x/2x 0x7ffff7fb6bf8
0x7ffff7fb6bf8 <__libpthread_freeres>:	0x00000000	0x00000000

Definately zero, so we never call __libpthread_freeres.

The binary never links against libpthread, therefore libc is loaded first, and __libpthread_freeres is set to 0 while processing R_X86_64_GLOB_DAT.

Even if we link the application against libpthread.so.0 (or preload it), and we make it into free_stacks(), the main thread is always active until the the point at which we call the exit syscall, and therefore we don't normally free unused dtv entries for the main thread. That is to say we never call _dl_deallocate_tls() on the main thread.

Likewise I expect the memory resizing for a new DTV for the main thread has exactly the same behaviour as this. We will never free that DTV because it's the main thread and it's alive until the end.

In summary:

- DTV entries created for the main thread are not normally freed because the main thread is still running. Therefore there is always some lossage with DTV entries that are allocated for the main thread. I tracked a single allocation of 24 bytes, aligned 8, and it is *indeed* lost, but it's part of the main thread's DTV entries and reused. Likewise the 512 bytes lost are from _dl_resize_dtv() (resizing of the dtv entry array) which I can confirm match the execution pattern in the test case, that we expand to 30 entries + 2, so 32*(sizeof(dtv_t))=512 bytes (dtv_t is 16-bytes).

- This issue is an old issue, the main thread has never freed it's own DTV or DTV entries. Perhaps Florian's recent NODELETE work, which moved around some code (particularly that which updated slotinfo), now requires valgrind to update its own supression lists.

- In the future we could possibly free even more things as a last step if: (a) all signals were blocked, and (b) we were running the last bit of code in __libc_freeres, and we knew no TLS accesses would ever come after that point.

I think Mark Wielaard should look at this next from the valgrind side. Leaving needinfo for mark.

Comment 15 Carlos O'Donell 2020-02-06 21:57:24 UTC

*** Bug 1796559 has been marked as a duplicate of this bug. ***

Comment 16 Mark Wielaard 2020-02-06 22:10:00 UTC

We could add a suppression. But I don't fully understand yet why we didn't need that in the past.
If you look at the runs in bug #1796559 you'll see that on Fedora 31 valgrind says:

==94839== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

So it didn't suppress any errors there.

Comment 17 Carlos O'Donell 2020-02-06 22:17:40 UTC

(In reply to Mark Wielaard from comment #16)
> We could add a suppression. But I don't fully understand yet why we didn't
> need that in the past.
> If you look at the runs in bug #1796559 you'll see that on Fedora 31
> valgrind says:
> 
> ==94839== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
> 
> So it didn't suppress any errors there.

In f31 we also have the NODELETE handling, so that can't be the difference.

I can retry the same debugging in an f31 mock chroot and tell you what I see.

I expect to see the same thing since this hasn't changed in glibc.

I'm telling you that from first principles we don't free the main thread's dtv or dtv entries, and those are the things that valgrind complains about.

Comment 18 Mark Wielaard 2020-02-06 22:22:17 UTC

I think what happened is that the DTV entry moved from Still Reachable (there is a pointer to the start of the malloc block) to Possibly lost (there is a pointer into the malloc block which isn't at the start). Still Reachable blocks are normally not reported (they could have been freed, the programmer was just lazy), but Possibly lost blocks are reported (since they cannot be freed directly unless the program somehow gets back to the start of the block). There are also Definitely lost block (to which there isn't any pointer, so they really have been lost and can never be freed).

Comment 19 Carlos O'Donell 2020-02-07 16:02:34 UTC

(In reply to Mark Wielaard from comment #18)
> I think what happened is that the DTV entry moved from Still Reachable
> (there is a pointer to the start of the malloc block) to Possibly lost
> (there is a pointer into the malloc block which isn't at the start). Still
> Reachable blocks are normally not reported (they could have been freed, the
> programmer was just lazy), but Possibly lost blocks are reported (since they
> cannot be freed directly unless the program somehow gets back to the start
> of the block). There are also Definitely lost block (to which there isn't
> any pointer, so they really have been lost and can never be freed).

I'm going to thoroughly investigate this in f31 and see what's going on there. And I'll report back here.

We haven't changed any of the way in which we handle the slotinfo list in the loader or the DTV and DTV entries in the threads, so I find it odd that we've had a change in reporting.

Comment 20 Lukas Slebodnik 2020-02-07 16:30:51 UTC

(In reply to Carlos O'Donell from comment #19)
> (In reply to Mark Wielaard from comment #18)
> > I think what happened is that the DTV entry moved from Still Reachable
> > (there is a pointer to the start of the malloc block) to Possibly lost
> > (there is a pointer into the malloc block which isn't at the start). Still
> > Reachable blocks are normally not reported (they could have been freed, the
> > programmer was just lazy), but Possibly lost blocks are reported (since they
> > cannot be freed directly unless the program somehow gets back to the start
> > of the block). There are also Definitely lost block (to which there isn't
> > any pointer, so they really have been lost and can never be freed).
> 
> I'm going to thoroughly investigate this in f31 and see what's going on
> there. And I'll report back here.
> 
> We haven't changed any of the way in which we handle the slotinfo list in
> the loader or the DTV and DTV entries in the threads, so I find it odd that
> we've had a change in reporting.

I do not think it is related to f31.
I used it as an example in BZ1796559 taht older version of samba does not cause any problem.
So test-case pass with older version of samba even on rawhide

e.g.
[root@f8dc2d658c24 ~]# rpm -q gcc valgrind samba-client-libs
gcc-10.0.1-0.7.fc32.x86_64
valgrind-3.15.0-15.fc32.x86_64
samba-client-libs-4.11.5-0.fc32.x86_64
[root@f8dc2d658c24 ~]# valgrind --leak-check=full ./dlopen-check 2 /usr/lib64/samba/libauth-samba4.so && echo OK
==318== Memcheck, a memory error detector
==318== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==318== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==318== Command: ./dlopen-check 2 /usr/lib64/samba/libauth-samba4.so
==318== 
==318== 
==318== HEAP SUMMARY:
==318==     in use at exit: 12,586 bytes in 24 blocks
==318==   total heap usage: 3,845 allocs, 3,821 frees, 667,093 bytes allocated
==318== 
==318== LEAK SUMMARY:
==318==    definitely lost: 0 bytes in 0 blocks
==318==    indirectly lost: 0 bytes in 0 blocks
==318==      possibly lost: 0 bytes in 0 blocks
==318==    still reachable: 12,586 bytes in 24 blocks
==318==         suppressed: 0 bytes in 0 blocks
==318== Reachable blocks (those to which a pointer was found) are not shown.
==318== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==318== 
==318== For lists of detected and suppressed errors, rerun with: -s
==318== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
OK

I used following command in rawhide container to prepare env
sh#  dnf update -y --nogpgcheck
sh#  dnf install -y --nogpgcheck valgrind gcc
sh#  dnf install samba-client-libs
sh#  dnf install -y --nogpgcheck https://kojipkgs.fedoraproject.org//packages/samba/4.11.5/0.fc32/x86_64/samba-client-libs-4.11.5-0.fc32.x86_64.rpm https://kojipkgs.fedoraproject.org//packages/samba/4.11.5/0.fc32/x86_64/samba-common-libs-4.11.5-0.fc32.x86_64.rpm https://kojipkgs.fedoraproject.org//packages/samba/4.11.5/0.fc32/x86_64/libwbclient-4.11.5-0.fc32.x86_64.rpm https://kojipkgs.fedoraproject.org//packages/samba/4.11.5/0.fc32/noarch/samba-common-4.11.5-0.fc32.noarch.rpm

Comment 21 Pavel Březina 2020-02-10 10:33:59 UTC

SSSD uses some samba libraries and this started to occur after samba update to 4.12.0rc1 which broke SSSD build due to removing some functionality from libndr. So perhaps samba is doing something wrong?

See https://github.com/SSSD/sssd/pull/976 and https://github.com/SSSD/sssd/pull/972

Comment 22 Mark Wielaard 2020-02-10 13:02:51 UTC

(In reply to Lukas Slebodnik from comment #20)
> (In reply to Carlos O'Donell from comment #19)
> > (In reply to Mark Wielaard from comment #18)
> > > I think what happened is that the DTV entry moved from Still Reachable
> > > (there is a pointer to the start of the malloc block) to Possibly lost
> > > (there is a pointer into the malloc block which isn't at the start). Still
> > > Reachable blocks are normally not reported (they could have been freed, the
> > > programmer was just lazy), but Possibly lost blocks are reported (since they
> > > cannot be freed directly unless the program somehow gets back to the start
> > > of the block). There are also Definitely lost block (to which there isn't
> > > any pointer, so they really have been lost and can never be freed).
> > 
> > I'm going to thoroughly investigate this in f31 and see what's going on
> > there. And I'll report back here.
> > 
> > We haven't changed any of the way in which we handle the slotinfo list in
> > the loader or the DTV and DTV entries in the threads, so I find it odd that
> > we've had a change in reporting.
> 
> I do not think it is related to f31.
> I used it as an example in BZ1796559 taht older version of samba does not
> cause any problem.
> So test-case pass with older version of samba even on rawhide

I did the opposite, on an f31 system I installed (unpacked) the new samba libraries and ran the dlopen-check testcase against it with f31 valgrind and glibc. Then I can also replicate:

$ LD_LIBRARY_PATH=./lib64:./usr/lib64:./usr/lib64/samba:./usr/lib64/samba/wbclient valgrind --leak-check=full ./dlopen-check 2 ./usr/lib64/samba/libauth-samba4.so 
==242714== Memcheck, a memory error detector
==242714== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==242714== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==242714== Command: ./dlopen-check 2 ./usr/lib64/samba/libauth-samba4.so
==242714== 
==242714== 
==242714== HEAP SUMMARY:
==242714==     in use at exit: 95,881 bytes in 39 blocks
==242714==   total heap usage: 3,814 allocs, 3,775 frees, 749,623 bytes allocated
==242714== 
==242714== 24 bytes in 1 blocks are possibly lost in loss record 1 of 10
==242714==    at 0x483980B: malloc (vg_replace_malloc.c:309)
==242714==    by 0x4013547: tls_get_addr_tail.isra.0 (in /usr/lib64/ld-2.30.so)
==242714==    by 0x40195AB: __tls_get_addr (in /usr/lib64/ld-2.30.so)
==242714==    by 0x5C09A85: ???
==242714==    by 0x5BF8CCA: ???
==242714==    by 0x5BE9B3C: ???
==242714==    by 0x5BBDA14: ???
==242714==    by 0x4010F29: call_init.part.0 (in /usr/lib64/ld-2.30.so)
==242714==    by 0x4011030: _dl_init (in /usr/lib64/ld-2.30.so)
==242714==    by 0x49AE494: _dl_catch_exception (in /usr/lib64/libc-2.30.so)
==242714==    by 0x40154F3: dl_open_worker (in /usr/lib64/ld-2.30.so)
==242714==    by 0x49AE437: _dl_catch_exception (in /usr/lib64/libc-2.30.so)
==242714== 
==242714== 512 bytes in 1 blocks are possibly lost in loss record 5 of 10
==242714==    at 0x483980B: malloc (vg_replace_malloc.c:309)
==242714==    by 0x401342C: _dl_resize_dtv (in /usr/lib64/ld-2.30.so)
==242714==    by 0x4013E4B: _dl_update_slotinfo (in /usr/lib64/ld-2.30.so)
==242714==    by 0x4013FAB: update_get_addr (in /usr/lib64/ld-2.30.so)
==242714==    by 0x40195AB: __tls_get_addr (in /usr/lib64/ld-2.30.so)
==242714==    by 0x5C09A85: ???
==242714==    by 0x5BF8CCA: ???
==242714==    by 0x5BE9B3C: ???
==242714==    by 0x5BBDA14: ???
==242714==    by 0x4010F29: call_init.part.0 (in /usr/lib64/ld-2.30.so)
==242714==    by 0x4011030: _dl_init (in /usr/lib64/ld-2.30.so)
==242714==    by 0x49AE494: _dl_catch_exception (in /usr/lib64/libc-2.30.so)
==242714== 
==242714== LEAK SUMMARY:
==242714==    definitely lost: 0 bytes in 0 blocks
==242714==    indirectly lost: 0 bytes in 0 blocks
==242714==      possibly lost: 536 bytes in 2 blocks
==242714==    still reachable: 95,345 bytes in 37 blocks
==242714==         suppressed: 0 bytes in 0 blocks
==242714== Reachable blocks (those to which a pointer was found) are not shown.
==242714== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==242714== 
==242714== For lists of detected and suppressed errors, rerun with: -s
==242714== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

I think this shows that the leak is real and already in glibc 2.30. That the new samba libraries just manage to trip it where the old ones didn't (it looks like the leaks are there actually, just inder "still reachable" instead of under "possibly lost").

Comment 23 Mark Wielaard 2020-02-10 13:05:55 UTC

(In reply to Pavel Březina from comment #21)
> SSSD uses some samba libraries and this started to occur after samba update
> to 4.12.0rc1 which broke SSSD build due to removing some functionality from
> libndr. So perhaps samba is doing something wrong?
> 
> See https://github.com/SSSD/sssd/pull/976 and
> https://github.com/SSSD/sssd/pull/972

No, I don't think samba is doing something wrong. And those issues are not directly related.

We'll have to add a default (system) suppression to valgrind for this.
I'll work on that.

Comment 24 Carlos O'Donell 2020-02-10 17:16:45 UTC

My detailed analysis of Fedora 31 looks like this:

* We get a call to allocate_dtv_entry which allocates memory at 0x444620 for the dtv entry.
* Again in _dl_update_slotinfo we free the first entry.
* Again in allocate_dtv_entry we allocate another DTV entry.
* Again we never free the entry in the main thread.

The difference is this:

* We do not call _dl_resize_dtv because the number of libraries in use by the implementation is not yet large enough.
**  Fedora 31: _rtld_local._dl_tls_max_dtv_idx rises to max 9 only which is less than 15 (the initial static max).
**  Fedora Rawhide: _rtld_local._dl_tls_max_dtv_idx rises to max 16 which is one more than 15 (the initial static max) and triggers a resize.

The original DTV entry allocation is no different between f31 and rawhide.

The resize doesn't happen in f31 because of the library count.

Again, nothing has changed on the glibc side.

Comment 25 Mark Wielaard 2020-02-10 23:03:39 UTC

Proposed new default suppressions:

# The main thread dynamic thread vector, DTV, which contains pointers
# to thread local variables, isn't freed.  There are a couple of call
# patterns that can cause it to be extended.
{
  dtv-addr-tail
  Memcheck:Leak
  match-leak-kinds: possible,reachable
  fun:malloc
  fun:tls_get_addr_tail*
  fun:__tls_get_addr
}

{
  dtv-addr-resize
  Memcheck:Leak
  match-leak-kinds: possible,reachable
  fun:malloc
  fun:_dl_resize_dtv
  fun:_dl_update_slotinfo
  fun:update_get_addr
  fun:__tls_get_addr
}

{
  dtv-addr-init
  Memcheck:Leak
  match-leak-kinds: possible,reachable
  fun:malloc
  fun:allocate_dtv_entry
  fun:allocate_and_init
  fun:tls_get_addr_tail*
  fun:__tls_get_addr
}

Comment 26 Ben Cotton 2020-02-11 16:32:05 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 32 development cycle.
Changing version to 32.

Comment 27 Mark Wielaard 2020-02-14 13:29:35 UTC

valgrind-3.15.0-17.fc32 has the default suppressions added from comment #25.
Please let me know if this doesn't resolve the issue for you.

Comment 28 Lukas Slebodnik 2020-02-19 08:21:56 UTC

(In reply to Mark Wielaard from comment #27)
> valgrind-3.15.0-17.fc32 has the default suppressions added from comment #25.
> Please let me know if this doesn't resolve the issue for you.

Thank you very much

Comment 29 Mark Wielaard 2020-03-05 13:06:41 UTC

Just a note that the suppression was accepted upstream and will make it into the next 3.16.0 release.
https://bugs.kde.org/show_bug.cgi?id=417578

And I have backported and submitted updates for f31 and f30 with this fix included:

valgrind-3.15.0-20.fc30
https://bodhi.fedoraproject.org/updates/FEDORA-2020-766b9ef681

valgrind-3.15.0-20.fc31
https://bodhi.fedoraproject.org/updates/FEDORA-2020-63737ee159

Note You need to log in before you can comment on or make changes to this bug.