Bug 1994068 - glibc: pthread_cancel fails with ESRCH yet subsequent pthread_join passes
Summary: glibc: pthread_cancel fails with ESRCH yet subsequent pthread_join passes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: glibc
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: beta
Target Release: ---
Assignee: Florian Weimer
QA Contact: qe-baseos-tools-bugs
URL:
Whiteboard:
Depends On: 1994653
Blocks:
Reported: 2021-08-16 16:03 UTC by Václav Kadlčík
Modified: 2023-07-18 14:29 UTC (History)
CC List: 7 users

Fixed In Version: glibc-2.34-7.el9
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-21 13:30:21 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
reproducer (2.43 KB, text/plain)
2021-08-16 16:03 UTC, Václav Kadlčík


Links
* Red Hat Issue Tracker RHELPLAN-93744 (last updated 2021-08-17 06:26:39 UTC)
* Sourceware 12889 (last updated 2021-08-17 06:26:07 UTC)
* Sourceware 19193 (last updated 2021-08-17 06:24:17 UTC)
* Sourceware 28407, priority P2, status ASSIGNED: "pthread_kill assumes that kill (getpid ()) is equivalent to tgkill (getpid (), gettid())" (last updated 2021-10-01 14:14:21 UTC)

Description Václav Kadlčík 2021-08-16 16:03:59 UTC
Created attachment 1814520
reproducer

Description of problem:

I have a reproducer for an old, unrelated gcc bug. With glibc
2.34, the reproducer has started to behave differently than before,
and this looks to me like a (glibc?) bug.

I'm attaching the program (rep.c), but its key part is:

  err = pthread_create(&th, NULL, tf, NULL);
  err = pthread_cancel(th);
  err = pthread_join(th, NULL);

repeated many times over.
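The attached rep.c is not reproduced here. The following is a hypothetical sketch of the loop described above, not the reporter's program. It assumes a thread function tf that returns immediately (the real tf may differ), and a made-up helper, run_trials, that counts how often pthread_cancel returns ESRCH and how often pthread_join fails.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical thread function: returns at once, so pthread_cancel
   frequently races with the thread's own exit. */
static void *tf(void *arg)
{
    (void)arg;
    return NULL;
}

/* Hypothetical helper: runs create/cancel/join n times, counting
   pthread_cancel calls that returned ESRCH and pthread_join calls that
   failed.  Returns -1 if pthread_create fails, 0 otherwise. */
static int run_trials(int n, int *cancel_esrch, int *join_failed)
{
    *cancel_esrch = 0;
    *join_failed = 0;
    for (int i = 0; i < n; i++) {
        pthread_t th;
        int err = pthread_create(&th, NULL, tf, NULL);
        if (err != 0) {
            fprintf(stderr, "pthread_create failed: %s\n", strerror(err));
            return -1;
        }
        err = pthread_cancel(th);
        if (err == ESRCH)
            (*cancel_esrch)++;
        /* The thread must be joined regardless of what pthread_cancel
           returned; its ID stays valid until then. */
        err = pthread_join(th, NULL);
        if (err != 0)
            (*join_failed)++;
    }
    return 0;
}
```

Called from a main function, for example as run_trials(200, &c, &j), this would be built like the reproducer (gcc -O2 file.c -o rep -lpthread). On an affected glibc, c can be nonzero while j stays at zero, which is the mismatch reported here.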

Sometimes pthread_cancel returns ESRCH. This is what's actually
new: I haven't observed these failures before glibc 2.34. When
pthread_cancel returns ESRCH, I'd expect the subsequent
pthread_join to fail with ESRCH, too. However, that never
happens in the program.

To sum up: pthread_cancel looks suspicious to me. It has
started to fail recently, and the reported reason (ESRCH) is not
"confirmed" by the subsequent pthread_join call.


Version-Release number of selected component (if applicable):

glibc-2.34-2.el9
gcc-11.2.1-2.2.el9
annobin-9.83-3.el9
kernel-5.14.0-0.rc4.35.el9


Steps to Reproduce:

1. gcc -O2 rep.c -g -o rep -lpthread
2. ./rep 200


Actual results:

A varying number of lines like the following:
pthread_cancel failed: No thread with the ID thread could be found


Expected results:

I'd expect either
  * no pthread_cancel failures, or
  * every pthread_cancel failure (err=ESRCH) to be followed by
    a pthread_join failure (err=ESRCH).
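POSIX says the lifetime of a thread ID ends only after the thread has been joined or detached (or if it was created detached). A joinable thread that has already returned is therefore still a valid target for pthread_cancel, which is why the first expectation above is the consistent one. The sketch below checks that case directly. It assumes a glibc with the fix tracked here (glibc-2.34-7.el9 or later), where pthread_cancel is expected to return 0 for such a thread. The names quick_exit_tf and cancel_after_exit are hypothetical. The 100 ms delay is a heuristic that makes kernel-side thread exit very likely, not a guarantee.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int finished;

/* Hypothetical worker: records that it is about to return, then exits. */
static void *quick_exit_tf(void *arg)
{
    (void)arg;
    atomic_store(&finished, 1);
    return NULL;
}

/* Hypothetical helper: cancels a joinable thread that has (very likely)
   already terminated, then joins it.  Stores the return codes of
   pthread_cancel and pthread_join; returns -1 if creation fails. */
static int cancel_after_exit(int *cancel_rc, int *join_rc)
{
    pthread_t th;
    atomic_store(&finished, 0);
    if (pthread_create(&th, NULL, quick_exit_tf, NULL) != 0)
        return -1;

    /* Wait until the start routine has run, then give the thread time
       to finish exiting on the kernel side (heuristic). */
    while (!atomic_load(&finished))
        sched_yield();
    struct timespec delay = { .tv_sec = 0, .tv_nsec = 100L * 1000 * 1000 };
    nanosleep(&delay, NULL);

    /* Not joined yet, so the thread ID is still valid. */
    *cancel_rc = pthread_cancel(th);
    *join_rc = pthread_join(th, NULL);
    return 0;
}
```

On an affected glibc, *cancel_rc may come back as ESRCH even though *join_rc is 0. That is the inconsistency the reporter describes.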


Additional info:

* "Slow" machines tend to produce more occurrences of the problem.
* Not architecture-specific.

Comment 2 Florian Weimer 2021-08-17 06:24:18 UTC
Nice catch. The changed pthread_cancel implementation in glibc 2.34 triggers a known bug in pthread_kill.

Comment 3 Florian Weimer 2021-08-17 13:51:56 UTC
Patches posted upstream: https://sourceware.org/pipermail/libc-alpha/2021-August/130207.html

Comment 5 Florian Weimer 2021-09-13 12:20:30 UTC
Upstream patches have been committed (also to the 2.34 release branch).

Comment 7 Florian Weimer 2021-10-01 14:14:21 UTC
The most recent fix introduced a regression, so we need to respin this one.

Comment 8 Florian Weimer 2021-10-21 13:30:21 UTC
The regression fix is in glibc-upstream-2.34-28.patch, part of glibc-2.34-7.el9:

    commit 40bade26d5bcbda3d21fb598c5063d9df62de966
    Author: Florian Weimer <fweimer>
    Date:   Fri Oct 1 18:16:41 2021 +0200
    
        nptl: pthread_kill must send signals to a specific thread [BZ #28407]
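The commit title says pthread_kill must send signals to a specific thread. The linked Sourceware 28407 summary adds that kill (getpid ()), which is process-directed, is not equivalent to tgkill (getpid (), gettid()). Below is a minimal sketch of the property being restored, assuming Linux and POSIX signals. A worker thread calls pthread_kill(pthread_self(), SIGUSR1), and a thread-local flag set by the handler shows which thread handled the signal. The names self_signal_tf and self_signal_stays_in_thread are hypothetical. The sketch does not reproduce the race; it only checks that a self-directed signal is handled by the calling thread before pthread_kill returns.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <string.h>

/* Per-thread flag: the handler runs in whichever thread receives the
   signal, so only that thread's copy is set. */
static _Thread_local volatile sig_atomic_t got_usr1;

static void on_usr1(int sig)
{
    (void)sig;
    got_usr1 = 1;
}

/* Hypothetical worker: signals itself and records whether the handler
   ran on this thread by the time pthread_kill returned. */
static void *self_signal_tf(void *arg)
{
    int *result = arg;
    got_usr1 = 0;
    if (pthread_kill(pthread_self(), SIGUSR1) != 0) {
        *result = -1;
        return NULL;
    }
    *result = got_usr1;
    return NULL;
}

/* Hypothetical helper: returns 1 if the self-directed signal was handled
   by the sending thread and not by the main thread, 0 if not, and -1 on
   setup errors. */
static int self_signal_stays_in_thread(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    got_usr1 = 0; /* main thread's copy */
    int result = -1;
    pthread_t th;
    if (pthread_create(&th, NULL, self_signal_tf, &result) != 0)
        return -1;
    if (pthread_join(th, NULL) != 0 || result < 0)
        return -1;
    return result == 1 && got_usr1 == 0;
}
```

A process-directed kill() may be delivered to any thread that does not block the signal. That is why sending signals with tgkill to the exact target thread matters for pthread_kill, and for pthread_cancel, which per Comment 2 now depends on it.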

