Description of problem: Hi Florian, I see these two valgrind errors on recent Fedora rawhide when testing SSSD. Is this a known issue? ==752382== 24 bytes in 1 blocks are possibly lost in loss record 2 of 18 ==752382== at 0x483A809: malloc (vg_replace_malloc.c:309) ==752382== by 0x4013D84: tls_get_addr_tail (in /usr/lib64/ld-2.30.9000.so) ==752382== by 0x4019EEB: __tls_get_addr (in /usr/lib64/ld-2.30.9000.so) ==752382== by 0x6D3FAB5: ??? ==752382== by 0x6D2ECFA: ??? ==752382== by 0x6D1FB74: ??? ==752382== by 0x6CF3A14: ??? ==752382== by 0x4011781: call_init.part.0 (in /usr/lib64/ld-2.30.9000.so) ==752382== by 0x4011890: _dl_init (in /usr/lib64/ld-2.30.9000.so) ==752382== by 0x49CB034: _dl_catch_exception (in /usr/lib64/libc-2.30.9000.so) ==752382== by 0x4015C33: dl_open_worker (in /usr/lib64/ld-2.30.9000.so) ==752382== by 0x49CAFD7: _dl_catch_exception (in /usr/lib64/libc-2.30.9000.so) ==752382== { <insert_a_suppression_name_here> Memcheck:Leak match-leak-kinds: possible fun:malloc fun:tls_get_addr_tail fun:__tls_get_addr obj:* obj:* obj:* obj:* fun:call_init.part.0 fun:_dl_init fun:_dl_catch_exception fun:dl_open_worker fun:_dl_catch_exception } ==752382== 512 bytes in 1 blocks are possibly lost in loss record 10 of 18 ==752382== at 0x483A809: malloc (vg_replace_malloc.c:309) ==752382== by 0x4013C7C: _dl_resize_dtv (in /usr/lib64/ld-2.30.9000.so) ==752382== by 0x40146BD: _dl_update_slotinfo (in /usr/lib64/ld-2.30.9000.so) ==752382== by 0x401481B: update_get_addr (in /usr/lib64/ld-2.30.9000.so) ==752382== by 0x4019EEB: __tls_get_addr (in /usr/lib64/ld-2.30.9000.so) ==752382== by 0x6D3FAB5: ??? ==752382== by 0x6D2ECFA: ??? ==752382== by 0x6D1FB74: ??? ==752382== by 0x6CF3A14: ??? ==752382== by 0x4011781: call_init.part.0 (in /usr/lib64/ld-2.30.9000.so) ==752382== by 0x4011890: _dl_init (in /usr/lib64/ld-2.30.9000.so) ==752382== by 0x49CB034: _dl_catch_exception (in /usr/lib64/libc-2.30.9000.so) ==752382== { <insert_a_suppression_name_here> Memcheck:Leak match-leak-kinds: possible fun:malloc fun:_dl_resize_dtv fun:_dl_update_slotinfo fun:update_get_addr fun:__tls_get_addr obj:* obj:* obj:* obj:* fun:call_init.part.0 fun:_dl_init fun:_dl_catch_exception } Version-Release number of selected component (if applicable): glibc-2.30.9000-31.fc32.x86_64 How reproducible: Always.
Steps to reproduce: 1. Clone SSSD https://github.com/SSSD/sssd 2. autoreconf -if && ./configure && make && make check 3. valgrind --leak-check=full ./dlopen-tests
(In reply to Pavel Březina from comment #0) > Description of problem: > Hi Florian, > I see these two valgrind errors on recent Fedora rawhide when testing SSSD. > Is this a known issue? > > ==752382== 24 bytes in 1 blocks are possibly lost in loss record 2 of 18 > ==752382== at 0x483A809: malloc (vg_replace_malloc.c:309) > ==752382== by 0x4013D84: tls_get_addr_tail (in /usr/lib64/ld-2.30.9000.so) > ==752382== by 0x4019EEB: __tls_get_addr (in /usr/lib64/ld-2.30.9000.so) > ==752382== by 0x6D3FAB5: ??? > ==752382== by 0x6D2ECFA: ??? > ==752382== by 0x6D1FB74: ??? > ==752382== by 0x6CF3A14: ??? > ==752382== by 0x4011781: call_init.part.0 (in /usr/lib64/ld-2.30.9000.so) > ==752382== by 0x4011890: _dl_init (in /usr/lib64/ld-2.30.9000.so) > ==752382== by 0x49CB034: _dl_catch_exception (in > /usr/lib64/libc-2.30.9000.so) > ==752382== by 0x4015C33: dl_open_worker (in /usr/lib64/ld-2.30.9000.so) > ==752382== by 0x49CAFD7: _dl_catch_exception (in > /usr/lib64/libc-2.30.9000.so) > ==752382== Would you please install additional debuginfo packages? I suspect this is just a lazy TLS allocation. Depending on how the application shuts down, this memory cannot be deallocated by glibc, and thus shows up as a valgrind false positive.
(In reply to Florian Weimer from comment #2) > > Would you please install additional debuginfo packages? Do you have any specific in mind? I tried installing glibc and kernel debuginfo but I got the same result.
(In reply to Pavel Březina from comment #3) > (In reply to Florian Weimer from comment #2) > > > > Would you please install additional debuginfo packages? > > Do you have any specific in mind? I tried installing glibc and kernel > debuginfo but I got the same result. GDB should list the missing packages you need to install. You will have to run the test under GDB to see them.
(In reply to Pavel Březina from comment #1) > Steps to reproduce: > 1. Clone SSSD https://github.com/SSSD/sssd > 2. autoreconf -if && ./configure && make && make check I run into a build error: src/providers/ad/ad_gpo_ndr.c: In function ‘ndr_pull_security_ace_object_type’: src/providers/ad/ad_gpo_ndr.c:108:13: error: implicit declaration of function ndr_pull_get_switch_value’; did you mean ‘ndr_pull_set_switch_value’? [-Werror=implicit-function-declaration] 108 | level = ndr_pull_get_switch_value(ndr, r); | ^~~~~~~~~~~~~~~~~~~~~~~~~ | ndr_pull_set_switch_value cc1: some warnings being treated as errors HEAD is at d3d72b907b12448c52004a1b3aabfee0b516bf2f.
Sorry, about that. You actually need to use this branch that fix build on rawhide: https://github.com/pbrezina/sssd/tree/ndr GDB remains silent in this case so I do not know what I need to install. This is the test source: https://github.com/SSSD/sssd/blob/master/src/tests/dlopen-tests.c
After installing debugging information, the test runs into a timeout under valgrind: Running suite(s): dlopen 0%: Checks: 1, Failures: 0, Errors: 1 src/tests/dlopen-tests.c:239:E:dlopen:test_dlopen_base:0: (after this point) Test timeout expired How do I increase the test timeout?
This appears to bypass the timeout: CK_FORK=no CK_DEFAULT_TIMEOUT=0 valgrind --leak-check=full ./dlopen-tests
I think the reason why valgrind does not report a full backtrace is that the object which performs the TLS allocation is dlclose'd. valgrind cannot interpret the stack trace in this case because the object is gone when it is printed. It seems to be this TLS allocation: #0 tls_get_addr_tail (ti=0x7ffff5d2dbf8, dtv=0x7ffff7c480a0, the_map=0x43a130) at ../elf/dl-tls.c:739 #1 0x00007ffff7fe9eec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55 #2 0x00007ffff5bddab6 in _gnutls_fips_mode_enabled () at fips.c:68 #3 0x00007ffff5bcccfb in _gnutls_rnd_preinit () at random.c:110 #4 0x00007ffff5bbdb75 in _gnutls_global_init (constructor=constructor@entry=1) at global.c:308 #5 0x00007ffff5b91a15 in lib_init () at global.c:511 #6 0x00007ffff7fe1782 in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffd8e8, env=env@entry=0x7fffffffd8f8) at dl-init.c:72 Running GDB like this disables fork and the test suite timeout: CK_FORK=no CK_DEFAULT_TIMEOUT=0 gdb ./dlopen-tests I don't know if this a real leak.
Much simpler reproducer is in https://bugzilla.redhat.com/show_bug.cgi?id=1796559 so this BZ should be closed as duplicate of BZ1796559 But it still be good to if glibc guys could check whether issue is in glibc (revealed be new version of samba) or it is abug in samba itself.
(In reply to Lukas Slebodnik from comment #12) > Much simpler reproducer is in > https://bugzilla.redhat.com/show_bug.cgi?id=1796559 > so this BZ should be closed as duplicate of BZ1796559 > > But it still be good to if glibc guys could check whether issue is in glibc > (revealed be new version of samba) or it is abug in samba itself. That's easier to debug. I'm checking to see if I can show this is a real leak or not.
(In reply to Carlos O'Donell from comment #13) > (In reply to Lukas Slebodnik from comment #12) > > Much simpler reproducer is in > > https://bugzilla.redhat.com/show_bug.cgi?id=1796559 > > so this BZ should be closed as duplicate of BZ1796559 > > > > But it still be good to if glibc guys could check whether issue is in glibc > > (revealed be new version of samba) or it is abug in samba itself. > > That's easier to debug. I'm checking to see if I can show this is a real > leak or not. Thankfully we enter allocate_dtv_entry only twice. The first time is here: (gdb) bt #0 allocate_dtv_entry (size=24, alignment=8) at ../elf/dl-tls.c:579 #1 allocate_and_init (map=0x415910) at ../elf/dl-tls.c:607 #2 tls_get_addr_tail (ti=0x7ffff70a6bf0, dtv=0x7ffff7df20a0, the_map=0x415910) at ../elf/dl-tls.c:787 #3 0x00007ffff7fe9eec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55 #4 0x00007ffff6f4a166 in _gnutls_fips_mode_enabled () from /lib64/libgnutls.so.30 #5 0x00007ffff6f38f6b in _gnutls_rnd_preinit () from /lib64/libgnutls.so.30 #6 0x00007ffff6f2a0c5 in _gnutls_global_init () from /lib64/libgnutls.so.30 #7 0x00007ffff6efdaf5 in lib_init () from /lib64/libgnutls.so.30 #8 0x00007ffff7fe1782 in call_init (l=<optimized out>, argc=argc@entry=3, argv=argv@entry=0x7fffffffe588, env=env@entry=0x7fffffffe5a8) at dl-init.c:72 #9 0x00007ffff7fe1891 in call_init (env=0x7fffffffe5a8, argv=0x7fffffffe588, argc=3, l=<optimized out>) at dl-init.c:30 #10 _dl_init (main_map=0x4052d0, argc=3, argv=0x7fffffffe588, env=0x7fffffffe5a8) at dl-init.c:119 #11 0x00007ffff7f30ff5 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:182 #12 0x00007ffff7fe5c34 in dl_open_worker (a=a@entry=0x7fffffffe200) at dl-open.c:758 #13 0x00007ffff7f30f98 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:208 #14 0x00007ffff7fe514e in _dl_open (file=0x7fffffffe7e7 "/usr/lib64/samba/libauth-samba4.so", mode=-2147483390, caller_dlopen=0x4011c0 <main+74>, nsid=-2, argc=3, argv=<optimized out>, env=0x7fffffffe5a8) at dl-open.c:837 #15 0x00007ffff7fc039c in dlopen_doit (a=a@entry=0x7fffffffe420) at dlopen.c:66 #16 0x00007ffff7f30f98 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffe3c0, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:208 #17 0x00007ffff7f31063 in __GI__dl_catch_error (objname=0x7ffff7fc4050 <last_result+16>, errstring=0x7ffff7fc4058 <last_result+24>, mallocedp=0x7ffff7fc4048 <last_result+8>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:227 #18 0x00007ffff7fc0bd9 in _dlerror_run (operate=operate@entry=0x7ffff7fc0340 <dlopen_doit>, args=args@entry=0x7fffffffe420) at dlerror.c:170 #19 0x00007ffff7fc0428 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87 #20 0x00000000004011c0 in main (argc=3, argv=0x7fffffffe588) at dlopen-check.c:13 I inspect the returned malloc pointer in $rax, and do 'break free if $rdi == [value of $rax]' to get help from gdb's conditional breakpoint handling. If were to have to track this by hand I've have to wade through ~8000+ API calls, and I know because I used our libmtrace.so to look into this first (but the record accounting is too loose to catch this 1 lost allocation). (gdb) bt #0 __GI___libc_free (mem=0x445340) at malloc.c:3087 #1 0x00007ffff7fe4769 in _dl_update_slotinfo (req_modid=11) at ../elf/dl-tls.c:687 #2 0x00007ffff7fe481c in update_get_addr (ti=0x7ffff70a6bf0) at ../elf/dl-tls.c:799 #3 0x00007ffff7fe9eec in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55 #4 0x00007ffff6f4a166 in _gnutls_fips_mode_enabled () from /lib64/libgnutls.so.30 #5 0x00007ffff6f38f6b in _gnutls_rnd_preinit () from /lib64/libgnutls.so.30 #6 0x00007ffff6f2a0c5 in _gnutls_global_init () from /lib64/libgnutls.so.30 #7 0x00007ffff6efdaf5 in lib_init () from /lib64/libgnutls.so.30 #8 0x00007ffff7fe1782 in call_init (l=<optimized out>, argc=argc@entry=3, argv=argv@entry=0x7fffffffe588, env=env@entry=0x7fffffffe5a8) at dl-init.c:72 #9 0x00007ffff7fe1891 in call_init (env=0x7fffffffe5a8, argv=0x7fffffffe588, argc=3, l=<optimized out>) at dl-init.c:30 #10 _dl_init (main_map=0x405900, argc=3, argv=0x7fffffffe588, env=0x7fffffffe5a8) at dl-init.c:119 #11 0x00007ffff7f30ff5 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:182 #12 0x00007ffff7fe5c34 in dl_open_worker (a=a@entry=0x7fffffffe200) at dl-open.c:758 #13 0x00007ffff7f30f98 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:208 #14 0x00007ffff7fe514e in _dl_open (file=0x7fffffffe7e7 "/usr/lib64/samba/libauth-samba4.so", mode=-2147483390, caller_dlopen=0x4011c0 <main+74>, nsid=-2, argc=3, argv=<optimized out>, env=0x7fffffffe5a8) at dl-open.c:837 #15 0x00007ffff7fc039c in dlopen_doit (a=a@entry=0x7fffffffe420) at dlopen.c:66 #16 0x00007ffff7f30f98 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffe3c0, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:208 #17 0x00007ffff7f31063 in __GI__dl_catch_error (objname=0x7ffff7fc4050 <last_result+16>, errstring=0x7ffff7fc4058 <last_result+24>, mallocedp=0x7ffff7fc4048 <last_result+8>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:227 #18 0x00007ffff7fc0bd9 in _dlerror_run (operate=operate@entry=0x7ffff7fc0340 <dlopen_doit>, args=args@entry=0x7fffffffe420) at dlerror.c:170 #19 0x00007ffff7fc0428 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87 #20 0x00000000004011c0 in main (argc=3, argv=0x7fffffffe588) at dlopen-check.c:13 We free the first allocation here in _dl_update_slotinfo. Then we allocate another dtv entry. The next dtv entry is never freed because it's the last in-use entry. Next is to validate that __libc_freeres can reach this last entry and free it. This happens here: 537 void 538 _dl_deallocate_tls (void *tcb, bool dealloc_tcb) 539 { 540 dtv_t *dtv = GET_DTV (tcb); 541 542 /* We need to free the memory allocated for non-static TLS. */ 543 for (size_t cnt = 0; cnt < dtv[-1].counter; ++cnt) 544 free (dtv[1 + cnt].pointer.to_free); 545 546 /* The array starts with dtv[-1]. */ 547 if (dtv != GL(dl_initial_dtv)) 548 free (dtv - 1); 549 550 if (dealloc_tcb) 551 free (*tcb_to_pointer_to_free_location (tcb)); 552 } 553 rtld_hidden_def (_dl_deallocate_tls) I make an artificial call to __libc_freeres() from a destructor. At this point libpthread.so.0 is still loaded, so __libpthread_freeres should be visible and have a valid value. The debugger can see it: (gdb) p __libpthread_freeres $12 = {void (void)} 0x7ffff7916160 <__libpthread_freeres> Which is loaded: 7ffff790e000-7ffff7915000 r--p 00000000 09:00 5921721 /usr/lib64/libpthread-2.31.so 7ffff7915000-7ffff7925000 r-xp 00007000 09:00 5921721 /usr/lib64/libpthread-2.31.so 7ffff7925000-7ffff792a000 r--p 00017000 09:00 5921721 /usr/lib64/libpthread-2.31.so 7ffff792a000-7ffff792b000 r--p 0001b000 09:00 5921721 /usr/lib64/libpthread-2.31.so 7ffff792b000-7ffff792c000 rw-p 0001c000 09:00 5921721 /usr/lib64/libpthread-2.31.so => 0x00007ffff7f67f5a <+90>: cmpq $0x0,0x4ee7e(%rip) # 0x7ffff7fb6de0 0x00007ffff7f67f62 <+98>: je 0x7ffff7f67f69 <__GI___libc_freeres+105> 0x00007ffff7f67f64 <+100>: callq 0x7ffff7e193a0 <__libpthread_freeres@plt> The pointer to __libpthread_freeres is in libc: 7ffff7df4000-7ffff7e19000 r--p 00000000 09:00 5921707 /usr/lib64/libc-2.31.so 7ffff7e19000-7ffff7f69000 r-xp 00025000 09:00 5921707 /usr/lib64/libc-2.31.so 7ffff7f69000-7ffff7fb4000 r--p 00175000 09:00 5921707 /usr/lib64/libc-2.31.so 7ffff7fb4000-7ffff7fb7000 r--p 001bf000 09:00 5921707 /usr/lib64/libc-2.31.so ^^^^^^^^^^^^^^^^^^^^^^^^^ (this is GNU_RELRO RO data) 7ffff7fb7000-7ffff7fba000 rw-p 001c2000 09:00 5921707 /usr/lib64/libc-2.31.so 00000000001c2de0 0000000200000006 R_X86_64_GLOB_DAT 0000000000000000 __libpthread_freeres + 0 00000000001c2bf8 0000000200000007 R_X86_64_JUMP_SLOT 0000000000000000 __libpthread_freeres + 0 2: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __libpthread_freeres 25754: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __libpthread_freeres The value 0x7ffff7fb6de0 *roughly* matches a R_X86_64_GLOB_DAT relocation against __libpthread_freeres. Is it really though? LOAD 0x1bf420 0x00000000001c0420 0x00000000001c0420 0x005050 0x008f48 RW 0x1000 VirtAddr is 0x1c0420, with reloc offset at 00000000001c2de0, for a difference of +0x29c0 PT_LOAD at 0x7ffff7fb4000 VirtAddr is 0x1c0000 (rounded to even page) with reloc offset at 0x1c2de0 for a difference of +0x2de0 Reloc load address is 0x7ffff7fb4000 + 0x2de0 = 0x7ffff7fb6de0 Matches relocation location. Confirmed. We are actually looking at the R_X86_64_GLOB_DAT result in libc.so.6 in .got, namely: 00000000001c2de0 0000000200000006 R_X86_64_GLOB_DAT 0000000000000000 __libpthread_freeres + 0 (gdb) disassemble 0x7ffff7e193a0 Dump of assembler code for function __libpthread_freeres@plt: 0x00007ffff7e193a0 <+0>: endbr64 0x00007ffff7e193a4 <+4>: bnd jmpq *0x19d84d(%rip) # 0x7ffff7fb6bf8 <__libpthread_freeres> 0x00007ffff7e193ab <+11>: nopl 0x0(%rax,%rax,1) End of assembler dump. (gdb) x/2x 0x7ffff7fb6bf8 0x7ffff7fb6bf8 <__libpthread_freeres>: 0x00000000 0x00000000 Definately zero, so we never call __libpthread_freeres. The binary never links against libpthread, therefore libc is loaded first, and __libpthread_freeres is set to 0 while processing R_X86_64_GLOB_DAT. Even if we link the application against libpthread.so.0 (or preload it), and we make it into free_stacks(), the main thread is always active until the the point at which we call the exit syscall, and therefore we don't normally free unused dtv entries for the main thread. That is to say we never call _dl_deallocate_tls() on the main thread. Likewise I expect the memory resizing for a new DTV for the main thread has exactly the same behaviour as this. We will never free that DTV because it's the main thread and it's alive until the end. In summary: - DTV entries created for the main thread are not normally freed because the main thread is still running. Therefore there is always some lossage with DTV entries that are allocated for the main thread. I tracked a single allocation of 24 bytes, aligned 8, and it is *indeed* lost, but it's part of the main thread's DTV entries and reused. Likewise the 512 bytes lost are from _dl_resize_dtv() (resizing of the dtv entry array) which I can confirm match the execution pattern in the test case, that we expand to 30 entries + 2, so 32*(sizeof(dtv_t))=512 bytes (dtv_t is 16-bytes). - This issue is an old issue, the main thread has never freed it's own DTV or DTV entries. Perhaps Florian's recent NODELETE work, which moved around some code (particularly that which updated slotinfo), now requires valgrind to update its own supression lists. - In the future we could possibly free even more things as a last step if: (a) all signals were blocked, and (b) we were running the last bit of code in __libc_freeres, and we knew no TLS accesses would ever come after that point. I think Mark Wielaard should look at this next from the valgrind side. Leaving needinfo for mark.
*** Bug 1796559 has been marked as a duplicate of this bug. ***
We could add a suppression. But I don't fully understand yet why we didn't need that in the past. If you look at the runs in bug #1796559 you'll see that on Fedora 31 valgrind says: ==94839== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) So it didn't suppress any errors there.
(In reply to Mark Wielaard from comment #16) > We could add a suppression. But I don't fully understand yet why we didn't > need that in the past. > If you look at the runs in bug #1796559 you'll see that on Fedora 31 > valgrind says: > > ==94839== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) > > So it didn't suppress any errors there. In f31 we also have the NODELETE handling, so that can't be the difference. I can retry the same debugging in an f31 mock chroot and tell you what I see. I expect to see the same thing since this hasn't changed in glibc. I'm telling you that from first principles we don't free the main thread's dtv or dtv entries, and those are the things that valgrind complains about.
I think what happened is that the DTV entry moved from Still Reachable (there is a pointer to the start of the malloc block) to Possibly lost (there is a pointer into the malloc block which isn't at the start). Still Reachable blocks are normally not reported (they could have been freed, the programmer was just lazy), but Possibly lost blocks are reported (since they cannot be freed directly unless the program somehow gets back to the start of the block). There are also Definitely lost block (to which there isn't any pointer, so they really have been lost and can never be freed).
(In reply to Mark Wielaard from comment #18) > I think what happened is that the DTV entry moved from Still Reachable > (there is a pointer to the start of the malloc block) to Possibly lost > (there is a pointer into the malloc block which isn't at the start). Still > Reachable blocks are normally not reported (they could have been freed, the > programmer was just lazy), but Possibly lost blocks are reported (since they > cannot be freed directly unless the program somehow gets back to the start > of the block). There are also Definitely lost block (to which there isn't > any pointer, so they really have been lost and can never be freed). I'm going to thoroughly investigate this in f31 and see what's going on there. And I'll report back here. We haven't changed any of the way in which we handle the slotinfo list in the loader or the DTV and DTV entries in the threads, so I find it odd that we've had a change in reporting.
(In reply to Carlos O'Donell from comment #19) > (In reply to Mark Wielaard from comment #18) > > I think what happened is that the DTV entry moved from Still Reachable > > (there is a pointer to the start of the malloc block) to Possibly lost > > (there is a pointer into the malloc block which isn't at the start). Still > > Reachable blocks are normally not reported (they could have been freed, the > > programmer was just lazy), but Possibly lost blocks are reported (since they > > cannot be freed directly unless the program somehow gets back to the start > > of the block). There are also Definitely lost block (to which there isn't > > any pointer, so they really have been lost and can never be freed). > > I'm going to thoroughly investigate this in f31 and see what's going on > there. And I'll report back here. > > We haven't changed any of the way in which we handle the slotinfo list in > the loader or the DTV and DTV entries in the threads, so I find it odd that > we've had a change in reporting. I do not think it is related to f31. I used it as an example in BZ1796559 taht older version of samba does not cause any problem. So test-case pass with older version of samba even on rawhide e.g. [root@f8dc2d658c24 ~]# rpm -q gcc valgrind samba-client-libs gcc-10.0.1-0.7.fc32.x86_64 valgrind-3.15.0-15.fc32.x86_64 samba-client-libs-4.11.5-0.fc32.x86_64 [root@f8dc2d658c24 ~]# valgrind --leak-check=full ./dlopen-check 2 /usr/lib64/samba/libauth-samba4.so && echo OK ==318== Memcheck, a memory error detector ==318== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==318== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info ==318== Command: ./dlopen-check 2 /usr/lib64/samba/libauth-samba4.so ==318== ==318== ==318== HEAP SUMMARY: ==318== in use at exit: 12,586 bytes in 24 blocks ==318== total heap usage: 3,845 allocs, 3,821 frees, 667,093 bytes allocated ==318== ==318== LEAK SUMMARY: ==318== definitely lost: 0 bytes in 0 blocks ==318== indirectly lost: 0 bytes in 0 blocks ==318== possibly lost: 0 bytes in 0 blocks ==318== still reachable: 12,586 bytes in 24 blocks ==318== suppressed: 0 bytes in 0 blocks ==318== Reachable blocks (those to which a pointer was found) are not shown. ==318== To see them, rerun with: --leak-check=full --show-leak-kinds=all ==318== ==318== For lists of detected and suppressed errors, rerun with: -s ==318== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) OK I used following command in rawhide container to prepare env sh# dnf update -y --nogpgcheck sh# dnf install -y --nogpgcheck valgrind gcc sh# dnf install samba-client-libs sh# dnf install -y --nogpgcheck https://kojipkgs.fedoraproject.org//packages/samba/4.11.5/0.fc32/x86_64/samba-client-libs-4.11.5-0.fc32.x86_64.rpm https://kojipkgs.fedoraproject.org//packages/samba/4.11.5/0.fc32/x86_64/samba-common-libs-4.11.5-0.fc32.x86_64.rpm https://kojipkgs.fedoraproject.org//packages/samba/4.11.5/0.fc32/x86_64/libwbclient-4.11.5-0.fc32.x86_64.rpm https://kojipkgs.fedoraproject.org//packages/samba/4.11.5/0.fc32/noarch/samba-common-4.11.5-0.fc32.noarch.rpm
SSSD uses some samba libraries and this started to occur after samba update to 4.12.0rc1 which broke SSSD build due to removing some functionality from libndr. So perhaps samba is doing something wrong? See https://github.com/SSSD/sssd/pull/976 and https://github.com/SSSD/sssd/pull/972
(In reply to Lukas Slebodnik from comment #20) > (In reply to Carlos O'Donell from comment #19) > > (In reply to Mark Wielaard from comment #18) > > > I think what happened is that the DTV entry moved from Still Reachable > > > (there is a pointer to the start of the malloc block) to Possibly lost > > > (there is a pointer into the malloc block which isn't at the start). Still > > > Reachable blocks are normally not reported (they could have been freed, the > > > programmer was just lazy), but Possibly lost blocks are reported (since they > > > cannot be freed directly unless the program somehow gets back to the start > > > of the block). There are also Definitely lost block (to which there isn't > > > any pointer, so they really have been lost and can never be freed). > > > > I'm going to thoroughly investigate this in f31 and see what's going on > > there. And I'll report back here. > > > > We haven't changed any of the way in which we handle the slotinfo list in > > the loader or the DTV and DTV entries in the threads, so I find it odd that > > we've had a change in reporting. > > I do not think it is related to f31. > I used it as an example in BZ1796559 taht older version of samba does not > cause any problem. > So test-case pass with older version of samba even on rawhide I did the opposite, on an f31 system I installed (unpacked) the new samba libraries and ran the dlopen-check testcase against it with f31 valgrind and glibc. Then I can also replicate: $ LD_LIBRARY_PATH=./lib64:./usr/lib64:./usr/lib64/samba:./usr/lib64/samba/wbclient valgrind --leak-check=full ./dlopen-check 2 ./usr/lib64/samba/libauth-samba4.so ==242714== Memcheck, a memory error detector ==242714== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==242714== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info ==242714== Command: ./dlopen-check 2 ./usr/lib64/samba/libauth-samba4.so ==242714== ==242714== ==242714== HEAP SUMMARY: ==242714== in use at exit: 95,881 bytes in 39 blocks ==242714== total heap usage: 3,814 allocs, 3,775 frees, 749,623 bytes allocated ==242714== ==242714== 24 bytes in 1 blocks are possibly lost in loss record 1 of 10 ==242714== at 0x483980B: malloc (vg_replace_malloc.c:309) ==242714== by 0x4013547: tls_get_addr_tail.isra.0 (in /usr/lib64/ld-2.30.so) ==242714== by 0x40195AB: __tls_get_addr (in /usr/lib64/ld-2.30.so) ==242714== by 0x5C09A85: ??? ==242714== by 0x5BF8CCA: ??? ==242714== by 0x5BE9B3C: ??? ==242714== by 0x5BBDA14: ??? ==242714== by 0x4010F29: call_init.part.0 (in /usr/lib64/ld-2.30.so) ==242714== by 0x4011030: _dl_init (in /usr/lib64/ld-2.30.so) ==242714== by 0x49AE494: _dl_catch_exception (in /usr/lib64/libc-2.30.so) ==242714== by 0x40154F3: dl_open_worker (in /usr/lib64/ld-2.30.so) ==242714== by 0x49AE437: _dl_catch_exception (in /usr/lib64/libc-2.30.so) ==242714== ==242714== 512 bytes in 1 blocks are possibly lost in loss record 5 of 10 ==242714== at 0x483980B: malloc (vg_replace_malloc.c:309) ==242714== by 0x401342C: _dl_resize_dtv (in /usr/lib64/ld-2.30.so) ==242714== by 0x4013E4B: _dl_update_slotinfo (in /usr/lib64/ld-2.30.so) ==242714== by 0x4013FAB: update_get_addr (in /usr/lib64/ld-2.30.so) ==242714== by 0x40195AB: __tls_get_addr (in /usr/lib64/ld-2.30.so) ==242714== by 0x5C09A85: ??? ==242714== by 0x5BF8CCA: ??? ==242714== by 0x5BE9B3C: ??? ==242714== by 0x5BBDA14: ??? ==242714== by 0x4010F29: call_init.part.0 (in /usr/lib64/ld-2.30.so) ==242714== by 0x4011030: _dl_init (in /usr/lib64/ld-2.30.so) ==242714== by 0x49AE494: _dl_catch_exception (in /usr/lib64/libc-2.30.so) ==242714== ==242714== LEAK SUMMARY: ==242714== definitely lost: 0 bytes in 0 blocks ==242714== indirectly lost: 0 bytes in 0 blocks ==242714== possibly lost: 536 bytes in 2 blocks ==242714== still reachable: 95,345 bytes in 37 blocks ==242714== suppressed: 0 bytes in 0 blocks ==242714== Reachable blocks (those to which a pointer was found) are not shown. ==242714== To see them, rerun with: --leak-check=full --show-leak-kinds=all ==242714== ==242714== For lists of detected and suppressed errors, rerun with: -s ==242714== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0) I think this shows that the leak is real and already in glibc 2.30. That the new samba libraries just manage to trip it where the old ones didn't (it looks like the leaks are there actually, just inder "still reachable" instead of under "possibly lost").
(In reply to Pavel Březina from comment #21) > SSSD uses some samba libraries and this started to occur after samba update > to 4.12.0rc1 which broke SSSD build due to removing some functionality from > libndr. So perhaps samba is doing something wrong? > > See https://github.com/SSSD/sssd/pull/976 and > https://github.com/SSSD/sssd/pull/972 No, I don't think samba is doing something wrong. And those issues are not directly related. We'll have to add a default (system) suppression to valgrind for this. I'll work on that.
My detailed analysis of Fedora 31 looks like this: * We get a call to allocate_dtv_entry which allocates memory at 0x444620 for the dtv entry. * Again in _dl_update_slotinfo we free the first entry. * Again in allocate_dtv_entry we allocate another DTV entry. * Again we never free the entry in the main thread. The difference is this: * We do not call _dl_resize_dtv because the number of libraries in use by the implementation is not yet large enough. ** Fedora 31: _rtld_local._dl_tls_max_dtv_idx rises to max 9 only which is less than 15 (the initial static max). ** Fedora Rawhide: _rtld_local._dl_tls_max_dtv_idx rises to max 16 which is one more than 15 (the initial static max) and triggers a resize. The original DTV entry allocation is no different between f31 and rawhide. The resize doesn't happen in f31 because of the library count. Again, nothing has changed on the glibc side.
Proposed new default suppressions: # The main thread dynamic thread vector, DTV, which contains pointers # to thread local variables, isn't freed. There are a couple of call # patterns that can cause it to be extended. { dtv-addr-tail Memcheck:Leak match-leak-kinds: possible,reachable fun:malloc fun:tls_get_addr_tail* fun:__tls_get_addr } { dtv-addr-resize Memcheck:Leak match-leak-kinds: possible,reachable fun:malloc fun:_dl_resize_dtv fun:_dl_update_slotinfo fun:update_get_addr fun:__tls_get_addr } { dtv-addr-init Memcheck:Leak match-leak-kinds: possible,reachable fun:malloc fun:allocate_dtv_entry fun:allocate_and_init fun:tls_get_addr_tail* fun:__tls_get_addr }
This bug appears to have been reported against 'rawhide' during the Fedora 32 development cycle. Changing version to 32.
valgrind-3.15.0-17.fc32 has the default suppressions added from comment #25. Please let me know if this doesn't resolve the issue for you.
(In reply to Mark Wielaard from comment #27) > valgrind-3.15.0-17.fc32 has the default suppressions added from comment #25. > Please let me know if this doesn't resolve the issue for you. Thank you very much
Just a note that the suppression was accepted upstream and will make it into the next 3.16.0 release. https://bugs.kde.org/show_bug.cgi?id=417578 And I have backported and submitted updates for f31 and f30 with this fix included: valgrind-3.15.0-20.fc30 https://bodhi.fedoraproject.org/updates/FEDORA-2020-766b9ef681 valgrind-3.15.0-20.fc31 https://bodhi.fedoraproject.org/updates/FEDORA-2020-63737ee159