We are seeing hung test_lock process during many builds (like in packages bundling gnulib) on various arches (arm, s390, but mostly on ppc*). It might or might not be a problem in gnulib, other possible places are glibc and kernel. Both ppc64 and ppc64le are probably the best candidates to reproduce the problem.
See https://fedoraproject.org/wiki/Architectures/PowerPC/WorkQueue for a minimalized test case.
what might help
- use 1 CPU (disable all except one)
- utilize the system by eg. build a kernel build in paralllel
Thanks for the reproducer. Generally, we want to run it via this:
./test-driver --test-name test-lock --log-file test-lock.log --trs-file test-lock.trs --color-tests no --enable-hard-errors s --expect-failure no -- ./test-lock
(via the GNU test-driver script rather than directly)
It will lock after an arbitrary number of attempts, where that might be the first one, or the third, etc. Some analysis shows that we are failing in a one shot threading test routine in which the test's main "test_once" function spawns a number of THREAD_COUNT (10) "once_contender_thread"(s) that will each wait for a POSIX rwlock to be fired by the main thread, and then repeat this 50,000 times. After an arbitrary number of iterations the main thread is seeing that one (random) thread is not ready. That would be the case if it was sitting waiting for a signal to wake up following blocking on gl_rwlock_rdlock (which is actually a futex when translated into glibc pthreads). The threads uses these rwlocks after the first iteration (repeat).
So. The whole thing smells (sadly) like some kind of kernel futex bug. It's odd that this affects several architectures (I tried this on AArch64 Fedora 21). Has something dramatic changed in futexes upstream in glibc or the kernel very recently? Does anyone have some thoughts about the best way to triage the kernel futex code here perhaps? I'm too tired tonight.
For the interim I can suggest a couple quick *hacks*. For one, you could disable the test entirely (which you won't like). For another, you can turn on #define ENABLE_DEBUGGING to 1 instead of 0 via a small patch to test-lock.c since the interaction caused by the logging output invariably seems to result in the tests completing in the various quick runs I did here tonight. If you set debugging on the behavior of the test would otherwise be identical to not setting it. That is the most ugly and nasty approach I agree.
Reported upstream: http://savannah.gnu.org/bugs/?43487
Please try a kernel after 76835b0ebf8a7fe85beb03c75121419a7dec52f0 has been applied. I believe this is a bug in the futex code due to a missing barrier. Futexes are used to back NPTL POSIX pthreads that are used in the test case.
jwb: This will get fixed automagically today. It was included in the 3.16.7
and 3.17.2 stable releases that just happened.