Description of problem:
Thread issue

Version-Release number of selected component (if applicable):
glibc-2.12-1.130.el6

How reproducible:
Always

Steps to Reproduce:
1. Have only "passwd: files" in /etc/nsswitch.conf
2. su as a local user.
3. Compile and run "client-hang" available from bug 847043, comment 3

Actual results:
$ ./client-hang
Cancelling thread
Joining...
Joined, trying getpwuid_r call
^C

Expected results:
The program should run to completion
$ ./client-hang
Cancelling thread
Joining...
Joined, trying getpwuid_r call
Never get here
$

Additional info:
This issue is not reproducible on RHEL7
Just to add further context, this issue was discovered when QE's tests that Kaushik added above started failing. The tests were initially written to test a mutex lock issue in the sssd client and were passing in previous RHEL-6 updates. So is there any chance the code could have regressed between 6.4 and 6.5 perhaps?
btw the sssd upstream bug was https://fedorahosted.org/sssd/ticket/1460 fixed by https://git.fedorahosted.org/cgit/sssd.git/commit/?id=86b61156743b7ebdc049450a6f88452890fd9a61
Ok, the good news is that I've duplicated the bug on an old 6.1 VM I had lying around, so this is not a regression. The bad news is that we have at least two bugs here: the hangs we get differ depending on whether or not nscd is running.

The deadlock seen without nscd is fixed by upstream commit 312be3f9, which (among other things) adds the "c" option to various fopen calls, so that stdio stream operations on /etc/passwd et al. are not cancellation points. The lock used to guard the data structures that control the internal setent/getent calls then won't leak, because no other cancellation point is exercised.

The hang seen with nscd running turns out to be a bug in the testcase. When nscd is disabled, every call to getpwuid_r attempts to connect to the nscd socket, and connect is a mandatory cancellation point. When nscd is enabled, however, the first call makes the connection and subsequent calls don't hit any cancellation points, so the test loops forever: getpwuid_r is only an optional cancellation point, and glibc doesn't introduce an artificial cancellation point in it. Thus pthread_testcancel() needs to be taken out of the #ifdef/#endif block to fix this bug in the testcase.
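To illustrate the testcase fix, here is a minimal sketch of a worker loop that stays cancellable whether or not getpwuid_r happens to act on a pending cancel. It is not the client-hang source from bug 847043; the names lookup_loop and cancel_lookup_loop are made up for this example, and the 4096-byte buffer is an arbitrary choice.

```c
#include <pthread.h>
#include <pwd.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical worker: loops on getpwuid_r until it is cancelled. */
static void *lookup_loop(void *arg)
{
    struct passwd pwd, *result;
    char buf[4096];

    (void)arg;
    for (;;) {
        /* Optional cancellation point: the call may complete without
           ever acting on a pending cancellation request. */
        getpwuid_r(getuid(), &pwd, buf, sizeof buf, &result);

        /* Unconditional, not inside any #ifdef: guarantees the loop
           acts on a pending cancel at least once per iteration. */
        pthread_testcancel();
    }
    return NULL;
}

/* Returns 1 if the worker was cancelled and joined, -1 on setup failure. */
int cancel_lookup_loop(void)
{
    pthread_t tid;
    void *status = NULL;

    if (pthread_create(&tid, NULL, lookup_loop, NULL) != 0)
        return -1;
    pthread_cancel(tid);
    if (pthread_join(tid, &status) != 0)
        return -1;
    return status == PTHREAD_CANCELED;
}
```

Without the pthread_testcancel() call, whether the join ever returns depends on getpwuid_r reaching a real cancellation point, which is the behaviour difference between the nscd and non-nscd runs described above.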
Upstream commit from comment 6:

commit 312be3f9f5eab1643d7dcc7728c76d413d4f2640
Author: Ulrich Drepper <drepper>
Date: Tue Nov 15 04:24:42 2011 -0500

    Clean up internal fopen uses

    No need to ever not use c and e.
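For readers unfamiliar with the "c" and "e" flags named in that commit message, a small sketch of how application code can use them follows. Both are glibc extensions to the fopen mode string: "c" keeps the open and the subsequent read and write operations on the stream from being thread cancellation points, and "e" opens the descriptor with O_CLOEXEC. The function name count_lines_uncancellable is invented for this example and does not appear in glibc.

```c
#include <stdio.h>

/* Count newline characters in a file. The stream is opened with the
   glibc-specific mode "rce": read-only, "c" so that stdio reads on it
   are not cancellation points, and "e" for close-on-exec.
   Returns -1 if the file cannot be opened. */
long count_lines_uncancellable(const char *path)
{
    FILE *fp = fopen(path, "rce");
    long lines = 0;
    int ch;

    if (fp == NULL)
        return -1;
    while ((ch = getc(fp)) != EOF)
        if (ch == '\n')
            lines++;
    fclose(fp);
    return lines;
}
```

A thread that is cancelled while another part of the code holds a lock around such a read loop will not be torn down in the middle of the read, which is the property the commit relies on for /etc/passwd and similar files.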
Created attachment 1209871 [details] Tested patch. Attached patch posted for review: http://post-office.corp.redhat.com/archives/tools-patches/2016-September/msg00048.html
Created attachment 1227440 [details] tst-cancel-getpwuid_r.c There is no reason for the test for this issue to use sleep or yield; we have a perfectly acceptable semaphore implementation that guarantees that you are as close as possible to issuing a getpwuid_r, without all the overhead of sleeping.
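The semaphore handshake can be sketched roughly as below. This is not the attached tst-cancel-getpwuid_r.c, only an illustration of the idea under the assumption that the worker posts a semaphore right before its first lookup; the names about_to_lookup, worker and cancel_then_lookup are hypothetical.

```c
#include <errno.h>
#include <pthread.h>
#include <pwd.h>
#include <semaphore.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

static sem_t about_to_lookup;

static void *worker(void *arg)
{
    struct passwd pwd, *result;
    char buf[4096];

    (void)arg;
    /* Tell the main thread we are just about to call getpwuid_r. */
    sem_post(&about_to_lookup);
    for (;;) {
        getpwuid_r(getuid(), &pwd, buf, sizeof buf, &result);
        pthread_testcancel();
    }
    return NULL;
}

/* Returns 0 if a lookup after the cancellation completes, -1 on
   setup failure. On an affected glibc the final call would hang. */
int cancel_then_lookup(void)
{
    pthread_t tid;
    void *status = NULL;
    struct passwd pwd, *result;
    char buf[4096];

    if (sem_init(&about_to_lookup, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, worker, NULL) != 0) {
        sem_destroy(&about_to_lookup);
        return -1;
    }

    /* Block until the worker is at the lookup; no sleep or yield. */
    while (sem_wait(&about_to_lookup) == -1 && errno == EINTR)
        continue;

    pthread_cancel(tid);
    pthread_join(tid, &status);
    sem_destroy(&about_to_lookup);
    if (status != PTHREAD_CANCELED)
        return -1;

    /* The call that the original report shows hanging after the join. */
    getpwuid_r(getuid(), &pwd, buf, sizeof buf, &result);
    return 0;
}
```

The semaphore only narrows the window; it does not guarantee the cancel lands inside getpwuid_r, so a real test would presumably repeat this many times.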
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0680.html
Oracle claims the patch was incorrect, causing memory corruption: https://blogs.oracle.com/wim/entry/oracle_linux_6_update_9
(In reply to Anssi Johansson from comment #21)
> Oracle claims the patch was incorrect, causing memory corruption:
> https://blogs.oracle.com/wim/entry/oracle_linux_6_update_9

Thanks, I filed the regression as bug 1437111.
(In reply to Florian Weimer from comment #22)
> (In reply to Anssi Johansson from comment #21)
> > Oracle claims the patch was incorrect, causing memory corruption:
> > https://blogs.oracle.com/wim/entry/oracle_linux_6_update_9
>
> Thanks, I filed the regression as bug 1437111.

I've closed 1437111 as a duplicate. I'm going to use bug 1437147 to track the fix and missing changes.