Bug 1994068

Summary: glibc: pthread_cancel fails with ESRCH yet subsequent pthread_join passes
Product: Red Hat Enterprise Linux 9 Reporter: Václav Kadlčík <vkadlcik>
Component: glibc Assignee: Florian Weimer <fweimer>
Status: CLOSED CURRENTRELEASE QA Contact: qe-baseos-tools-bugs
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 9.0 CC: ashankar, codonell, dj, fweimer, mnewsome, pfrankli, sipoyare
Target Milestone: beta Keywords: Bugfix, Patch, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glibc-2.34-7.el9 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-21 13:30:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1994653    
Bug Blocks:    
Attachments:
Description Flags
reproducer none

Description Václav Kadlčík 2021-08-16 16:03:59 UTC
Created attachment 1814520
reproducer

Description of problem:

I have a reproducer for an old and unrelated gcc bug. With glibc
2.34, the reproducer has started to behave differently than before
and it looks to me like a (glibc?) bug.

I'm attaching the program (rep.c) but the key functionality is:

  err = pthread_create(&th, NULL, tf, NULL);
  err = pthread_cancel(th);
  err = pthread_join(th, NULL);

repeated many times over.

Sometimes pthread_cancel returns ESRCH. (This is what's actually
new: I hadn't observed these failures before glibc 2.34.) When
pthread_cancel returns ESRCH, I'd expect the subsequent
pthread_join to fail with ESRCH, too. However, that never
happens in the program.

To sum it up, pthread_cancel looks suspicious to me. It has
started to fail recently and the reported reason (ESRCH) doesn't
get "confirmed" by a subsequent pthread_join call.


Version-Release number of selected component (if applicable):

glibc-2.34-2.el9
gcc-11.2.1-2.2.el9
annobin-9.83-3.el9
kernel-5.14.0-0.rc4.35.el9


Steps to Reproduce:

1. gcc -O2 rep.c -g -o rep -lpthread
2. ./rep 200


Actual results:

A varying number of occurrences of the following line:
pthread_cancel failed: No thread with the ID thread could be found


Expected results:

I'd expect either
  * no pthread_cancel failures or
  * every pthread_cancel failure (err=ESRCH) to be followed by
    a pthread_join failure (err=ESRCH).


Additional info:

* "Slow" machines tend to produce more occurrences of the problem.
* The problem is not architecture-specific.

Comment 2 Florian Weimer 2021-08-17 06:24:18 UTC
Nice catch. The changed pthread_cancel implementation in glibc 2.34 triggers a known bug in pthread_kill.

Comment 3 Florian Weimer 2021-08-17 13:51:56 UTC
Patches posted upstream: https://sourceware.org/pipermail/libc-alpha/2021-August/130207.html

Comment 5 Florian Weimer 2021-09-13 12:20:30 UTC
Upstream patches have been committed (also to the 2.34 release branch).

Comment 7 Florian Weimer 2021-10-01 14:14:21 UTC
The most recent fix introduced a regression, so we need to respin this one.

Comment 8 Florian Weimer 2021-10-21 13:30:21 UTC
Regression fix is in glibc-upstream-2.34-28.patch, part of glibc-2.34-7.el9:

    commit 40bade26d5bcbda3d21fb598c5063d9df62de966
    Author: Florian Weimer <fweimer>
    Date:   Fri Oct 1 18:16:41 2021 +0200
    
        nptl: pthread_kill must send signals to a specific thread [BZ #28407]