Bug 1596537
| Summary: | __libc_freeres / pthread_exit interactions with valgrind. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | EML <sa212+redhat> |
| Component: | valgrind | Assignee: | Mark Wielaard <mjw> |
| valgrind sub component: | system-version | QA Contact: | qe-baseos-tools-bugs |
| Status: | CLOSED UPSTREAM | Docs Contact: | |
| Severity: | low | | |
| Priority: | unspecified | CC: | ashankar, codonell, dj, dsmith, fweimer, jakub, mjw, mnewsome, ohudlick, pfrankli |
| Version: | 8.3 | Flags: | pm-rhel: mirror+ |
| Target Milestone: | rc | | |
| Target Release: | 8.3 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-12-10 21:02:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
(In reply to EML from comment #0)
> ==21152== 560 bytes in 1 blocks are possibly lost in loss record 1 of 1
> ==21152==    at 0x4C2B975: calloc (vg_replace_malloc.c:711)
> ==21152==    by 0x4011F04: _dl_allocate_tls (in /usr/lib64/ld-2.17.so)
> ==21152==    by 0x4E3D9C0: pthread_create@@GLIBC_2.2.5 (in /usr/lib64/libpthread-2.17.so)
> ==21152==    by 0x400876: main (in /home/elavelle/work/tests/pthreads/a.out)

Adding Mark Wielaard to the CC (Valgrind maintainer).

With a detached thread there is no easy way to recover the thread's resources, because the thread itself becomes the owner of the thread descriptor and there are some things it cannot free (because it is using them, like thread-local storage memory). Once the thread is detached, the runtime (glibc) no longer has ownership of the thread's descriptor. The detached state indicates we will never join the thread to recover any of the resources it is using. However, because there is coordination between the kernel and the thread, particularly via clone's CLONE_CHILD_CLEARTID (and CLONE_CHILD_SETTID/set_tid_address()), the kernel knows when the thread is dead and can clear the tid memory; within glibc we use the cleared memory to recover the stack.

What valgrind *appears* to be saying here is:

* A thread was created, memory was allocated for TLS storage via calloc, and then that memory was lost.

This is odd in the case of a detached thread because we entirely reuse the thread's stack, and *everything*, from the thread descriptor to the TLS memory, lives on that stack. The second thread calls the thread-stack reuse into question: if the first thread exits fast enough, we can reuse its stack for the second thread, and instead of allocating everything anew we clear the memory and set it all up again for the new thread.
In this case both calls to _dl_allocate_tls() in allocate_stack() (from nptl/allocatestack.c) would not be made, because the TLS was already set up (it is the same for all threads, and might be realloc'd if newly opened DSOs require more memory, e.g. an out-of-sync generation counter). Either way I don't see a reason for valgrind to complain that the memory is lost. Perhaps Valgrind is unable to see the reuse as the next thread's stack? Mark, any thoughts?

Note that valgrind thinks this is only possibly lost, not definitely (or indirectly) lost. "possibly lost" means your program is leaking memory, unless you're doing unusual things with pointers that could cause them to point into the middle of an allocated block; see the user manual for some possible causes. Use --show-possibly-lost=no if you don't want to see these reports. I think this falls under "unusual things with pointers". valgrind doesn't mind you reusing any memory allocated through calloc(), but it needs to see a matching free() for that allocation eventually. Where/when is this memory supposed to be freed (either when it is reused or when it isn't)?

Also note that this memory block always seems to leak (also with the sleep uncommented) when running with valgrind --run-libc-freeres=no.

==31214== 560 bytes in 1 blocks are possibly lost in loss record 6 of 7
==31214==    at 0x4C2B955: calloc (vg_replace_malloc.c:711)
==31214==    by 0x4012884: allocate_dtv (dl-tls.c:317)
==31214==    by 0x4012884: _dl_allocate_tls (dl-tls.c:533)
==31214==    by 0x4E3E7AB: allocate_stack (allocatestack.c:574)
==31214==    by 0x4E3E7AB: pthread_create@@GLIBC_2.2.5 (pthread_create.c:451)
==31214==    by 0x400806: main (in /tmp/a.out)

So I think the problem is that __libc_freeres somehow loses the pointer to the dtv (or part of it for a particular thread) and so forgets to free it.
(In reply to Mark Wielaard from comment #4)
> Also note that this memory block always seems to leak (also with the sleep
> uncommented) when running with valgrind --run-libc-freeres=no.
>
> ==31214== 560 bytes in 1 blocks are possibly lost in loss record 6 of 7
> ==31214==    at 0x4C2B955: calloc (vg_replace_malloc.c:711)
> ==31214==    by 0x4012884: allocate_dtv (dl-tls.c:317)
> ==31214==    by 0x4012884: _dl_allocate_tls (dl-tls.c:533)
> ==31214==    by 0x4E3E7AB: allocate_stack (allocatestack.c:574)
> ==31214==    by 0x4E3E7AB: pthread_create@@GLIBC_2.2.5 (pthread_create.c:451)
> ==31214==    by 0x400806: main (in /tmp/a.out)
>
> So I think the problem is that __libc_freeres somehow loses the pointer to
> the dtv (or part of it for a particular thread) and so forgets to free it.

This memory is not leaked; it is freed at process exit. Let me explain.

The allocation call chain looks like this:

pthread_create
-> allocate_stack
-> _dl_allocate_tls
-> allocate_dtv
-> calloc

This matches the valgrind report. The free call chain should be:

__libc_freeres
-> nptl_freeres
-> __free_stacks(0) [free all stacks not "in use"]
-> _dl_deallocate_tls
-> free (dtv - 1); [free the dtv itself]

And to determine "in use" we use:

/* Check whether the stack is still used or not.  */
#define FREE_P(descr) ((descr)->tid <= 0)

If one or more detached threads are still alive, we will not free their stack usage via __libc_freeres. Only the kernel writing 0 to the tid in the thread's descriptor would make it safe for __libc_freeres to free that stack. Therefore, for N detached threads there may be N outstanding callocs which allocated the N dtvs for those threads.

How does valgrind handle thread shutdown? Does it kill the threads, wait for the kernel to say they are dead, and *then* call __libc_freeres()? That would be the only way __libc_freeres would free all the in-use stacks and dtvs.
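The "in use" test above can be sketched in isolation. The following is a hypothetical, simplified model for illustration only: `struct descr` and `stack_is_free` are invented names (glibc's real `struct pthread` is far larger), but the predicate mirrors the quoted FREE_P macro.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-in for glibc's thread descriptor. */
struct descr {
    pid_t tid;  /* > 0 while the thread is alive; the kernel writes 0
                   at thread exit via CLONE_CHILD_CLEARTID. */
};

/* Mirrors glibc's FREE_P(): the stack may be reclaimed only once the
   tid is no longer positive, i.e. the kernel has cleared it (0). */
static int stack_is_free(const struct descr *d)
{
    return d->tid <= 0;
}
```

This is why a still-running detached thread blocks __libc_freeres from freeing that stack (and the dtv living on it): its descriptor still holds a positive tid.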
(In reply to Carlos O'Donell from comment #5)
> The free call chain should be:
>
> __libc_freeres
> -> nptl_freeres
> -> __free_stacks(0) [free all stacks not "in use"]
> -> _dl_deallocate_tls
> -> free (dtv - 1); [free the dtv itself]
>
> And to determine "in use" we use:
>
> /* Check whether the stack is still used or not. */
> #define FREE_P(descr) ((descr)->tid <= 0)
>
> If one or more detached threads are still alive we will not free their stack
> usage via __libc_freeres. Only the kernel writing 0 to the tid in the
> thread's descriptor would make it safe for __libc_freeres to free that stack.

When and how does the kernel write 0 to a thread descriptor?

> Therefore for a given N detached threads there may be outstanding N callocs
> which allocated the N dtvs for the threads.
>
> How does valgrind handle thread shutdown? Does it kill the threads and then
> wait for the kernel to say they are dead and *then* call __libc_freeres()?
> That would be the only way __libc_freeres would free all the in-use stacks
> and dtvs.

valgrind/the program just runs on native threads. There is only ever one native thread running (valgrind uses a big lock).

Whenever a thread exits (gets to a point where valgrind knows it will fall off the end of the execution path), it checks whether any other threads are still alive; if none are, it tears down everything, including running __libc_freeres. Otherwise it simply does a (real) _exit syscall.

When valgrind sees an exit syscall it will tell the thread itself to exit; if it sees an exit_group syscall it will tell all threads to exit. That then triggers the "last thread alive calls __libc_freeres" behavior explained above.
(In reply to Mark Wielaard from comment #6)
> (In reply to Carlos O'Donell from comment #5)
> > The free call chain should be:
> >
> > __libc_freeres
> > -> nptl_freeres
> > -> __free_stacks(0) [free all stacks not "in use"]
> > -> _dl_deallocate_tls
> > -> free (dtv - 1); [free the dtv itself]
> >
> > And to determine "in use" we use:
> >
> > /* Check whether the stack is still used or not. */
> > #define FREE_P(descr) ((descr)->tid <= 0)
> >
> > If one or more detached threads are still alive we will not free their stack
> > usage via __libc_freeres. Only the kernel writing 0 to the tid in the
> > thread's descriptor would make it safe for __libc_freeres to free that stack.
>
> When and how does the kernel write 0 to a thread descriptor?

This is the protocol between userspace and the kernel via clone's CLONE_CHILD_CLEARTID (and CLONE_CHILD_SETTID/set_tid_address()). When the thread dies, the kernel writes a 0 to the tid address and then issues a futex wakeup. This is also used to implement pthread_join.

> > Therefore for a given N detached threads there may be outstanding N callocs
> > which allocated the N dtvs for the threads.
> >
> > How does valgrind handle thread shutdown? Does it kill the threads and then
> > wait for the kernel to say they are dead and *then* call __libc_freeres()?
> > That would be the only way __libc_freeres would free all the in-use stacks
> > and dtvs.
>
> valgrind/the program just runs on native threads.
> There is only ever one native thread running (valgrind uses a big lock).
>
> Whenever a thread exits (gets to a point where valgrind knows it will fall
> off the end of the execution path) it checks if there are any other threads
> still alive, if there are no others alive it will tear down everything,
> including running __libc_freeres. Otherwise it simply does a (real) _exit
> syscall.

So there is a race here then.
Valgrind needs to wait for the kernel to process all CLONE_CHILD_CLEARTID requests, or __libc_freeres may not clean up all the thread stacks.

> When valgrind sees an exit syscall it will tell the thread itself to exit,
> if it sees an exit_group syscall it will tell all threads to exit. Which
> will then trigger the above "the last thread alive will call __libc_freeres"
> as explained above.

... and wait for all tids to be cleared by the kernel before calling __libc_freeres. There may be more things in the future hooked on that tid being cleared by the kernel upon thread death.

(In reply to Carlos O'Donell from comment #7)
> (In reply to Mark Wielaard from comment #6)
> > When and how does the kernel write 0 to a thread descriptor?
>
> This is the protocol between userspace and the kernel via clone's
> CLONE_CHILD_CLEARTID (and CLONE_CHILD_SETTID/set_tid_address()).
>
> When the thread dies the kernel writes a 0 to the tid address and then
> issues a futex wakeup. This is also used to implement pthread_join.
>
> > > Therefore for a given N detached threads there may be outstanding N callocs
> > > which allocated the N dtvs for the threads.
> > >
> > > How does valgrind handle thread shutdown? Does it kill the threads and then
> > > wait for the kernel to say they are dead and *then* call __libc_freeres()?
> > > That would be the only way __libc_freeres would free all the in-use stacks
> > > and dtvs.
> >
> > valgrind/the program just runs on native threads.
> > There is only ever one native thread running (valgrind uses a big lock).
> >
> > Whenever a thread exits (gets to a point where valgrind knows it will fall
> > off the end of the execution path) it checks if there are any other threads
> > still alive, if there are no others alive it will tear down everything,
> > including running __libc_freeres. Otherwise it simply does a (real) _exit
> > syscall.
>
> So there is a race here then.
> Valgrind needs to wait for the kernel to process all CLONE_CHILD_CLEARTID
> requests or __libc_freeres may not clean up all the thread stacks.
>
> > When valgrind sees an exit syscall it will tell the thread itself to exit,
> > if it sees an exit_group syscall it will tell all threads to exit. Which
> > will then trigger the above "the last thread alive will call __libc_freeres"
> > as explained above.
>
> ... and wait for all tids to be cleared by the kernel before calling
> __libc_freeres. There may be more things in the future hooked on that tid
> being cleared by the kernel upon thread death.

OK, so from http://man7.org/linux/man-pages/man2/clone.2.html

CLONE_CHILD_CLEARTID (since Linux 2.5.49)
    Clear (zero) the child thread ID at the location ctid in child memory
    when the child exits, and do a wakeup on the futex at that address.
    The address involved may be changed by the set_tid_address(2) system
    call. This is used by threading libraries.

That is certainly interesting. valgrind indeed does nothing special with the memory pointed to by ctid, except note whether or not it is defined before/after a clone syscall or set_tid_address. The ctid is called child_tidptr (pointed to by ARG_CHILD_TIDPTR) in coregrind/m_syswrap/syswrap-linux.c. It is actually given to start_thread_NORETURN, but not used and not passed to run_a_thread_NORETURN (probably because main_thread_wrapper_NORETURN cannot pass it on).

The issue is that that address is not owned by valgrind. And it might be hard to wrap and emulate it, given that we would then also have to emulate the full futex dance on it. And by the time we call __libc_freeres we won't allow any other thread to run anymore (since we really believe they are dead and gone and we are the only live thread).
If we could fix the tracking (see the discrepancy between main_thread_wrapper_NORETURN and start_thread_NORETURN above), then we might forcibly clear the memory for each thread that has set up ctid just before calling __libc_freeres, to indicate that all threads are really and truly dead. Would that work, or does __libc_freeres also depend on the futex being called?

(In reply to Mark Wielaard from comment #8)
> Would that work, or does __libc_freeres also depend on the futex being
> called?

Only pthread_join depends on the futex wakeup being called. In the case of __libc_freeres and the stack freeing, all we need is a zero to be written to the registered address.

I wonder if part of the original issue wasn't simply https://sourceware.org/PR23329: the __libc_freeres infrastructure is not properly run across DSO boundaries.

It is certainly harder to trigger on Fedora, but it can still be triggered. Correctly handling CLONE_CHILD_CLEARTID under valgrind is tricky and must be done upstream first. Nobody is working on it atm.

(In reply to Mark Wielaard from comment #10)
> I wonder if part of the original issue wasn't simply
> https://sourceware.org/PR23329: the __libc_freeres infrastructure is not
> properly run across DSO boundaries.

No, in the case of libpthread we called directly via the ptr_freeres hook, which worked across the DSO boundary because libpthread initialized the hook when it started up. There was no such initialization and hook setup for any other DSO. The upstream fix was to get rid of all __libc_freeres hooks and statically call the handlers for each library from libc via weak-ref-and-check code. So there is no more exit handler registration in use.

> It is certainly harder to trigger on Fedora. But it can still be triggered.
> Correctly handling CLONE_CHILD_CLEARTID under valgrind is tricky and must be
> done upstream first. Nobody is working on it atm.
Right, that needs to happen if you're going to wait for all threads to die and then try to free all of the stacks; otherwise the race is always there: some thread is still alive and holding the stack.

Unlikely we will fix this for RHEL7, so moving to RHEL8.

Also this looks a bit like https://bugzilla.redhat.com/show_bug.cgi?id=1796433, for which we ended up with a suppression rule.

Needs to be handled upstream first: https://bugs.kde.org/show_bug.cgi?id=415141

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Created attachment 1455453 [details]
test code

Description of problem:
Possible race condition in pthread_exit

Version-Release number of selected component (if applicable):
libc-2.17

How reproducible:
Always

Steps to Reproduce:
1. gcc -lpthread test1.c
2. valgrind --leak-check=full ./a.out

Actual results:
See Valgrind output below

Expected results:
No leak reported by valgrind

Additional info:
If the attached code is compiled and run under Valgrind, a possible leak is reported consistently on 7.3, and on about 50% of runs on 6.9. However, if the 'sleep(1)' line is uncommented, then Valgrind reports *no* issues on both 6.9 and 7.3. In other words, if 'pthread_exit' is called immediately after creating the required (detached) threads, then Valgrind may or may not report a possible memory leak. Note that '--tool=helgrind' does not report any issues. The same results are reported with the current Valgrind (compiled from source) on 6.9.

-------------
==21152== Memcheck, a memory error detector
==21152== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==21152== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==21152== Command: ./a.out
==21152==
in new thread 0
in new thread 1
==21152==
==21152== HEAP SUMMARY:
==21152==     in use at exit: 560 bytes in 1 blocks
==21152==   total heap usage: 7 allocs, 6 frees, 2,752 bytes allocated
==21152==
==21152== 560 bytes in 1 blocks are possibly lost in loss record 1 of 1
==21152==    at 0x4C2B975: calloc (vg_replace_malloc.c:711)
==21152==    by 0x4011F04: _dl_allocate_tls (in /usr/lib64/ld-2.17.so)
==21152==    by 0x4E3D9C0: pthread_create@@GLIBC_2.2.5 (in /usr/lib64/libpthread-2.17.so)
==21152==    by 0x400876: main (in /home/elavelle/work/tests/pthreads/a.out)
==21152==
==21152== LEAK SUMMARY:
==21152==    definitely lost: 0 bytes in 0 blocks
==21152==    indirectly lost: 0 bytes in 0 blocks
==21152==      possibly lost: 560 bytes in 1 blocks
==21152==    still reachable: 0 bytes in 0 blocks
==21152==         suppressed: 0 bytes in 0 blocks
==21152==
==21152== For counts of detected and suppressed errors, rerun with: -v
==21152== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)