Bug 1155291 - hang in test_lock
Summary: hang in test_lock
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kyle McMartin
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: ZedoraTracker PPCTracker ARM64, F-ExcludeArch-aarch64 F-ExcludeArch-ppc64le, PPC64LETracker
Reported: 2014-10-21 20:08 UTC by Dan Horák
Modified: 2016-12-22 16:43 UTC
CC List: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1160355
Environment:
Last Closed: 2014-11-04 09:23:54 UTC


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1406031 None None None Never

Internal Links: 1406031

Description Dan Horák 2014-10-21 20:08:57 UTC
We are seeing a hung test_lock process during many builds (for example, in packages that bundle gnulib) on various architectures (arm, s390, but mostly ppc*). It might or might not be a problem in gnulib; other possible culprits are glibc and the kernel. ppc64 and ppc64le are probably the best candidates for reproducing the problem.

See https://fedoraproject.org/wiki/Architectures/PowerPC/WorkQueue for a minimal test case.

Comment 1 Dan Horák 2014-10-21 20:15:39 UTC
What might help to reproduce it:
- use 1 CPU (disable all except one)
- put load on the system, e.g. by running a kernel build in parallel
- retry

Comment 2 Jon Masters 2014-10-22 07:03:54 UTC
Thanks for the reproducer. Generally, we want to run it via this:

./test-driver --test-name test-lock --log-file test-lock.log --trs-file test-lock.trs --color-tests no --enable-hard-errors s --expect-failure no -- ./test-lock

(via the GNU test-driver script rather than directly)

It will hang after an arbitrary number of attempts: that might be the first one, or the third, etc. Some analysis shows that we are failing in a one-shot threading test routine. The test's main "test_once" function spawns THREAD_COUNT (10) "once_contender_thread" threads; each waits for a POSIX rwlock to be fired by the main thread, and this is repeated 50,000 times. After an arbitrary number of iterations, the main thread sees that one (random) thread is not ready. That would be the case if the thread were still waiting to be woken up after blocking in gl_rwlock_rdlock (which ends up as a futex once translated into glibc pthreads). The threads use these rwlocks after the first iteration (repeat).

So. The whole thing smells (sadly) like some kind of kernel futex bug. It's odd that this affects several architectures (I tried this on AArch64 Fedora 21). Has something dramatic changed in futexes upstream in glibc or the kernel very recently? Does anyone have some thoughts about the best way to triage the kernel futex code here perhaps? I'm too tired tonight.

In the interim I can suggest a couple of quick *hacks*. For one, you could disable the test entirely (which you won't like). For another, you can set ENABLE_DEBUGGING to 1 instead of 0 via a small patch to test-lock.c, since the interaction caused by the logging output invariably seems to result in the tests completing in the various quick runs I did here tonight. With debugging on, the behavior of the test would otherwise be identical to running without it. That is the most ugly and nasty approach, I agree.
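
For readers unfamiliar with this kind of switch, the following is a rough sketch of how a compile-time debugging toggle typically works. Only the ENABLE_DEBUGGING name comes from the comment above; the dbgprintf macro and the announce_round function are assumptions made for illustration and are not copied from test-lock.c.

```c
#include <stdio.h>

/* The hack: 1 instead of 0 turns the logging calls on. */
#define ENABLE_DEBUGGING 1

#if ENABLE_DEBUGGING
# define dbgprintf printf
#else
/* The call is still compiled but never executed. */
# define dbgprintf if (0) printf
#endif

/* Hypothetical helper: logs a round when debugging is enabled and
   returns the value of the switch so callers can check it. */
int announce_round(int round)
{
  dbgprintf("round %d fired\n", round);
  return ENABLE_DEBUGGING;
}
```

The extra output changes the timing between threads, which is presumably why the hang stops showing up. It hides the race rather than fixing it.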

Jon.

Comment 3 Marcin Juszkiewicz 2014-10-28 12:16:46 UTC
Reported upstream: http://savannah.gnu.org/bugs/?43487

Comment 4 Jon Masters 2014-10-29 04:29:01 UTC
Please try a kernel after 76835b0ebf8a7fe85beb03c75121419a7dec52f0 has been applied. I believe this is a bug in the futex code due to a missing barrier. Futexes are used to back NPTL POSIX pthreads that are used in the test case.
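
As a rough illustration of why a missing barrier can lose a wake-up, here is a toy model in C11 atomics. It is not the kernel code and makes no claim about what the referenced commit changes. struct toy_futex, waiter_should_sleep and waker_must_wake are hypothetical names, and the sketch assumes the race is of the usual kind, where the waiter announces itself and then re-checks the futex word while the waker updates the word and then looks for waiters.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model only; names and layout are assumptions for illustration. */
struct toy_futex {
  atomic_int value;   /* the user-visible futex word */
  atomic_int waiters; /* threads that have announced they will sleep */
};

/* Waiter side: announce, full barrier, then re-check the word.
   Returns true if the caller should go to sleep. */
bool waiter_should_sleep(struct toy_futex *f, int expected)
{
  atomic_fetch_add_explicit(&f->waiters, 1, memory_order_relaxed);
  /* Without this fence the load below may be ordered before the
     increment above becomes visible to the waker. */
  atomic_thread_fence(memory_order_seq_cst);
  if (atomic_load_explicit(&f->value, memory_order_relaxed) != expected) {
    /* The word already changed: back out instead of sleeping. */
    atomic_fetch_sub_explicit(&f->waiters, 1, memory_order_relaxed);
    return false;
  }
  return true;
}

/* Waker side: publish the new value, full barrier, then look for
   waiters. Returns true if a wake-up has to be issued. */
bool waker_must_wake(struct toy_futex *f, int new_value)
{
  atomic_store_explicit(&f->value, new_value, memory_order_relaxed);
  atomic_thread_fence(memory_order_seq_cst);
  return atomic_load_explicit(&f->waiters, memory_order_relaxed) > 0;
}
```

With both fences in place, at least one side is guaranteed to observe the other: either the waiter sees the new value and does not sleep, or the waker sees the waiter and wakes it. If either fence is dropped, both checks can miss on a weakly ordered CPU and the waiter sleeps with nobody left to wake it, which would look like the test_lock hang.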

Comment 6 Dan Horák 2014-11-04 09:23:54 UTC
jwb: This will get fixed automagically today. It was included in the 3.16.7 and 3.17.2 stable releases that just happened.

