Bug 1405071

Summary: getaddrinfo loses internal lock with deferred cancellation.
Product: Red Hat Enterprise Linux 7
Component: glibc
Version: 7.0
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: unspecified
Priority: unspecified
Target Milestone: pre-dev-freeze
Target Release: ---
Reporter: Keyue Hu <rwindz0>
Assignee: glibc team <glibc-bugzilla>
QA Contact: qe-baseos-tools-bugs
CC: ashankar, codonell, fweimer, mnewsome, pfrankli
Last Closed: 2019-06-18 19:34:34 UTC
Type: Bug
Attachments: backtrace of deadlock (flags: none)

Description Keyue Hu 2016-12-15 14:09:45 UTC
Created attachment 1232178
backtrace of deadlock

Description of problem:
When pthread_cancel() is called on a thread that is inside getaddrinfo(), the libc lock in check_pf.c may be left locked. The next getaddrinfo() call then hangs forever.


Version-Release number of selected component (if applicable):
glibc-2.17-106.el7_2.8.x86_64


How reproducible:
Easy to reproduce.

Steps to Reproduce:
1. Start a thread that calls zookeeper_init on 127.0.0.1, which in turn calls getaddrinfo.
2. Call pthread_cancel on this thread.
3. Repeat steps 1-2.

Actual results:
The process hangs in getaddrinfo.

Expected results:
getaddrinfo never hangs.


Additional info:
[root@3b3cfab6b378 /]# uname -a
Linux 3b3cfab6b378 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@3b3cfab6b378 /]# rpm -q glibc
glibc-2.17-106.el7_2.8.x86_64

Comment 1 Keyue Hu 2016-12-15 14:14:48 UTC
In the glibc source, see sysdeps/unix/sysv/linux/check_pf.c.

Between L322 and L356 there are pthread cancellation points in __socket, __bind, or make_request. If pthread_cancel takes effect while the code is executing L322-L356, the check_pf lock is left locked.
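
The pattern described here can be modelled outside glibc. What follows is only a minimal sketch using the public pthread API, not the check_pf.c code: the names model_lock, lock_taken, block_pipe, locked_worker and lock_state_after_cancel are hypothetical, and a blocking read() on an empty pipe stands in for whichever call is the real cancellation point.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

/* Hypothetical stand-in for the internal lock in check_pf.c. */
static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t lock_taken;    /* posted once the worker holds model_lock */
static int block_pipe[2];   /* nothing is ever written, so read() blocks */

/* Takes the lock, then blocks in a cancellation point while holding it. */
static void *locked_worker(void *arg)
{
    char byte;
    ssize_t n;

    (void)arg;
    pthread_mutex_lock(&model_lock);
    sem_post(&lock_taken);
    /* read() is a cancellation point: a deferred cancel is acted on
       here, while the lock is still held. */
    n = read(block_pipe[0], &byte, 1);
    (void)n;
    pthread_mutex_unlock(&model_lock);  /* skipped when cancelled */
    return NULL;
}

/* Returns the pthread_mutex_trylock() result observed after the worker
   has been cancelled and joined, or -1 if setup failed. */
static int lock_state_after_cancel(void)
{
    pthread_t tid;
    int rc;

    if (pipe(block_pipe) != 0)
        return -1;
    if (sem_init(&lock_taken, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, locked_worker, NULL) != 0)
        return -1;

    /* Wait until the worker really holds the lock before cancelling. */
    while (sem_wait(&lock_taken) != 0 && errno == EINTR)
        ;
    pthread_cancel(tid);
    pthread_join(tid, NULL);

    rc = pthread_mutex_trylock(&model_lock);
    if (rc == 0)
        pthread_mutex_unlock(&model_lock);

    close(block_pipe[0]);
    close(block_pipe[1]);
    sem_destroy(&lock_taken);
    return rc;
}
```

In this model the cancelled thread never reaches its unlock call, so lock_state_after_cancel() should report EBUSY. A later caller using a blocking lock instead of trylock would wait forever, which matches the hang reported above.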

By the way, upstream glibc does not seem to have this issue.

Comment 2 Keyue Hu 2016-12-15 14:36:34 UTC
Correction: upstream might have the same issue.

Comment 3 Carlos O'Donell 2016-12-16 01:50:12 UTC
There are no cancellation points in __socket or __bind.

Cancellation points in those functions would violate the POSIX requirements that no additional cancellation points be present other than those here:
 2.9.5 Thread Cancellation
http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_09.html
"An implementation shall not introduce cancellation points into any other functions specified in this volume of IEEE Std 1003.1-2001."

However, make_request calls __sendto, __recvmsg, and __netlink_assert_response, all of which could be cancellable. A cancellation there would cause the lock to be lost and the subsequent __check_pf call to hang.

There is a _lot_ of code running in make_request; the simplest solution is to push a cleanup handler that unlocks the lock.
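
The cleanup-handler approach can be illustrated with the public pthread API. This is a hedged sketch of the general technique, not the glibc patch: guarded_lock, guarded_ready, guarded_pipe, unlock_on_cancel, guarded_worker and lock_state_with_cleanup are hypothetical names, a blocking read() stands in for the cancellable calls in make_request, and glibc's internal locking macros may differ from what is shown.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

/* Hypothetical stand-in for the internal lock in check_pf.c. */
static pthread_mutex_t guarded_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t guarded_ready;   /* posted once the worker holds the lock */
static int guarded_pipe[2];   /* nothing is ever written, so read() blocks */

/* Cleanup handler: runs if the thread is cancelled between push and pop. */
static void unlock_on_cancel(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *guarded_worker(void *arg)
{
    char byte;
    ssize_t n;

    (void)arg;
    pthread_mutex_lock(&guarded_lock);
    pthread_cleanup_push(unlock_on_cancel, &guarded_lock);
    sem_post(&guarded_ready);
    /* Cancellation point: if the cancel is acted on here, the handler
       above releases the lock during thread teardown. */
    n = read(guarded_pipe[0], &byte, 1);
    (void)n;
    /* Normal path: pop the handler and run it, which unlocks. */
    pthread_cleanup_pop(1);
    return NULL;
}

/* Returns the pthread_mutex_trylock() result observed after the worker
   has been cancelled and joined, or -1 if setup failed. */
static int lock_state_with_cleanup(void)
{
    pthread_t tid;
    int rc;

    if (pipe(guarded_pipe) != 0)
        return -1;
    if (sem_init(&guarded_ready, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, guarded_worker, NULL) != 0)
        return -1;

    /* Wait until the worker holds the lock and has pushed the handler. */
    while (sem_wait(&guarded_ready) != 0 && errno == EINTR)
        ;
    pthread_cancel(tid);
    pthread_join(tid, NULL);

    rc = pthread_mutex_trylock(&guarded_lock);
    if (rc == 0)
        pthread_mutex_unlock(&guarded_lock);

    close(guarded_pipe[0]);
    close(guarded_pipe[1]);
    sem_destroy(&guarded_ready);
    return rc;
}
```

With the handler in place, lock_state_with_cleanup() should return 0, meaning the lock was released even though the worker was cancelled while holding it. A single handler covers every cancellation point between push and pop, which is why it suits a region with a lot of code.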

I've filed an upstream bug for this.
https://sourceware.org/bugzilla/show_bug.cgi?id=20975

Thanks for the bug report.

Comment 4 Keyue Hu 2016-12-16 02:36:23 UTC
Yeah, only __sendto and __recvmsg are cancellable. 

And it is kind of you to file the upstream bug. Thanks!

Comment 6 Carlos O'Donell 2019-06-18 19:34:34 UTC
Red Hat Enterprise Linux 7 is entering Maintenance Phase Support 1 this year and as such this issue will not be considered for fixing in RHEL 7 and is being closed. If you still encounter this issue with Red Hat Enterprise Linux 8, then please open a new issue with such details. Note that the upstream issue will remain for upstream tracking.