Bug 1405071

Summary:

getaddrinfo loses internal lock with deferred cancellation.

Product:

Red Hat Enterprise Linux 7

Reporter:

Keyue Hu <rwindz0>

Component:

glibc

Assignee:

glibc team <glibc-bugzilla>

Status:

CLOSED WONTFIX

QA Contact:

qe-baseos-tools-bugs

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

7.0

CC:

ashankar, codonell, fweimer, mnewsome, pfrankli

Target Milestone:

pre-dev-freeze

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2019-06-18 19:34:34 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
backtrace of deadlock	none

Description Keyue Hu 2016-12-15 14:09:45 UTC

Created attachment 1232178
backtrace of deadlock

Description of problem:
When pthread_cancel() is called on a thread that is executing getaddrinfo(), the libc lock in check_pf.c can be left locked. The next getaddrinfo() call then hangs forever.


Version-Release number of selected component (if applicable):
glibc-2.17-106.el7_2.8.x86_64


How reproducible:
Easy to reproduce.

Steps to Reproduce:
1. Start a thread that calls zookeeper_init on 127.0.0.1, which in turn calls getaddrinfo.
2. Call pthread_cancel on this thread.
3. Repeat steps 1-2.

Actual results:
The process hangs in getaddrinfo.

Expected results:
getaddrinfo never hangs.


Additional info:
[root@3b3cfab6b378 /]# uname -a
Linux 3b3cfab6b378 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@3b3cfab6b378 /]# rpm -q glibc
glibc-2.17-106.el7_2.8.x86_64

Comment 1 Keyue Hu 2016-12-15 14:14:48 UTC

In the glibc source code, sysdeps/unix/sysv/linux/check_pf.c:

Between L322 and L356 there are pthread cancellation points in __socket, __bind, or make_request. If the thread is cancelled while executing L322-L356, the check_pf lock is left locked.

By the way, upstream glibc does not seem to have this issue.

Comment 2 Keyue Hu 2016-12-15 14:36:34 UTC

Correction: upstream might have the same issue.

Comment 3 Carlos O'Donell 2016-12-16 01:50:12 UTC

There are no cancellation points in __socket or __bind.

Cancellation points in those functions would violate the POSIX requirements that no additional cancellation points be present other than those here:
 2.9.5 Thread Cancellation
http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_09.html
"An implementation shall not introduce cancellation points into any other functions specified in this volume of IEEE Std 1003.1-2001."

However, make_request calls __sendto, __recvmsg, and __netlink_assert_response, all of which could be cancellable. A cancellation there would leave the lock held and cause the subsequent __check_pf to hang.
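The failure mode can be illustrated outside glibc with a minimal sketch. It assumes a plain pthread mutex as a stand-in for the internal check_pf lock and pause() as a stand-in for the cancellable __sendto/__recvmsg calls. The names leaked_lock, leaky_worker, and demo_lock_leak are hypothetical and do not appear in glibc.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

/* Hypothetical stand-in for the internal lock in check_pf.c. */
static pthread_mutex_t leaked_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t lock_taken;

static void *leaky_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&leaked_lock);
    sem_post(&lock_taken);              /* tell the caller the lock is held */
    pause();                            /* cancellation point; stands in for __sendto/__recvmsg */
    pthread_mutex_unlock(&leaked_lock); /* not reached when the thread is cancelled */
    return NULL;
}

/* Cancels the worker while it holds the lock, then probes the lock.
   Returns the pthread_mutex_trylock result: EBUSY means the lock was leaked.
   Call this only once per process: a second worker would block forever
   on the leaked lock, which is the hang reported in this bug. */
int demo_lock_leak(void)
{
    pthread_t tid;
    void *result = NULL;

    if (sem_init(&lock_taken, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, leaky_worker, NULL) != 0)
        return -1;

    /* Wait until the worker owns the lock before cancelling it. */
    while (sem_wait(&lock_taken) != 0 && errno == EINTR)
        ;

    pthread_cancel(tid);                /* deferred: acted on inside pause() */
    pthread_join(tid, &result);
    assert(result == PTHREAD_CANCELED);
    sem_destroy(&lock_taken);

    return pthread_mutex_trylock(&leaked_lock);
}
```

In this sketch the worker is terminated inside pause() with the mutex still held, so the later trylock from another thread is expected to report EBUSY.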

There is a _lot_ of code running in make_request, so the simplest solution is to push a cleanup handler that unlocks the lock.
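A minimal sketch of the cleanup-handler approach, using the public pthread_cleanup_push/pthread_cleanup_pop API rather than glibc's internal cleanup macros. The mutex is an assumed stand-in for the check_pf lock and pause() for the cancellable calls in make_request. The names guarded_lock, unlock_handler, guarded_worker, and demo_lock_released are hypothetical; this is not the actual upstream patch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

/* Hypothetical stand-in for the internal lock in check_pf.c. */
static pthread_mutex_t guarded_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t lock_taken;

/* Cleanup handler: runs if the thread is cancelled while the lock is held. */
static void unlock_handler(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *guarded_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&guarded_lock);
    pthread_cleanup_push(unlock_handler, &guarded_lock);
    sem_post(&lock_taken);      /* tell the caller the lock is held */
    pause();                    /* cancellation point; stands in for make_request */
    pthread_cleanup_pop(1);     /* normal path: pop the handler and run it to unlock */
    return NULL;
}

/* Cancels the worker while it holds the lock, then probes the lock.
   Returns 0 if the lock was released by the cleanup handler. */
int demo_lock_released(void)
{
    pthread_t tid;
    void *result = NULL;
    int rc;

    if (sem_init(&lock_taken, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, guarded_worker, NULL) != 0)
        return -1;

    /* Wait until the worker owns the lock before cancelling it. */
    while (sem_wait(&lock_taken) != 0 && errno == EINTR)
        ;

    pthread_cancel(tid);        /* deferred: acted on inside pause() */
    pthread_join(tid, &result);
    assert(result == PTHREAD_CANCELED);
    sem_destroy(&lock_taken);

    rc = pthread_mutex_trylock(&guarded_lock);
    if (rc == 0)
        pthread_mutex_unlock(&guarded_lock);
    return rc;
}
```

Because the handler runs during cancellation, the lock is free again once the worker has been joined, and the sequence can be repeated without hanging.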

I've filed an upstream bug for this.
https://sourceware.org/bugzilla/show_bug.cgi?id=20975

Thanks for the bug report.

Comment 4 Keyue Hu 2016-12-16 02:36:23 UTC

Yes, only __sendto and __recvmsg are cancellable.

It is kind of you to file the upstream bug. Thanks!

Comment 6 Carlos O'Donell 2019-06-18 19:34:34 UTC

Red Hat Enterprise Linux 7 is entering Maintenance Phase Support 1 this year and as such this issue will not be considered for fixing in RHEL 7 and is being closed. If you still encounter this issue with Red Hat Enterprise Linux 8, then please open a new issue with such details. Note that the upstream issue will remain for upstream tracking.