1155291 – hang in test_lock

Summary: hang in test_lock

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kyle McMartin
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	ZedoraTracker PPCTracker ARM64, F-ExcludeArch-aarch64 F-ExcludeArch-ppc64le, PPC64LETracker

Reported:	2014-10-21 20:08 UTC by Dan Horák
Modified:	2016-12-22 16:43 UTC
CC List:	17 users
Fixed In Version:
Clone Of:
Clones:	1160355
Environment:
Last Closed:	2014-11-04 09:23:54 UTC
Type:	Bug
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1406031	0	unspecified	CLOSED	RFE: enable again gnulib's test-lock	2024-03-05 05:49:33 UTC

Internal Links: 1406031

Description Dan Horák 2014-10-21 20:08:57 UTC

We are seeing a hung test_lock process during many builds (for example in packages that bundle gnulib) on various architectures (arm, s390, but mostly ppc*). The problem might or might not be in gnulib; the other possible places are glibc and the kernel. ppc64 and ppc64le are probably the best candidates for reproducing it.

See https://fedoraproject.org/wiki/Architectures/PowerPC/WorkQueue for a minimized test case.

Comment 1 Dan Horák 2014-10-21 20:15:39 UTC

What might help to reproduce it:
- use 1 CPU (disable all except one)
- load the system, e.g. by running a kernel build in parallel
- retry
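The "retry" step above can be automated with a small harness. This is only a rough sketch, not part of the original reproducer: the names run_with_timeout and count_hangs are made up for illustration, and it assumes the test binary does not install its own SIGALRM handler. An alarm set before exec stays armed in the new program, so a hung run is killed by SIGALRM and counted.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] once with a time limit.
 * Returns 0 if it exited with status 0, 1 if it was killed by the
 * alarm (treated as a hang), 2 for any other failure, -1 if the
 * harness itself failed. */
int run_with_timeout(char *const argv[], unsigned int seconds)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        alarm(seconds);          /* a pending alarm survives execvp */
        execvp(argv[0], argv);
        _exit(127);              /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGALRM)
        return 1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return 2;
}

/* Hypothetical helper: repeat the run and count how many attempts hung. */
int count_hangs(char *const argv[], int attempts, unsigned int seconds)
{
    int hangs = 0;
    for (int i = 0; i < attempts; i++)
        if (run_with_timeout(argv, seconds) == 1)
            hangs++;
    return hangs;
}
```

Calling count_hangs with an argv of {"./test-lock", NULL}, some number of attempts and a generous timeout would report how often the test got stuck. To cover the single-CPU tip as well, the harness itself could be started under taskset -c 0, since children inherit the CPU affinity.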

Comment 2 Jon Masters 2014-10-22 07:03:54 UTC

Thanks for the reproducer. Generally, we want to run it via this:

./test-driver --test-name test-lock --log-file test-lock.log --trs-file test-lock.trs --color-tests no --enable-hard-errors yes --expect-failure no -- ./test-lock

(via the GNU test-driver script rather than directly)

It will lock up after an arbitrary number of attempts; that might be the first one, or the third, etc. Some analysis shows that we are failing in a one-shot threading test routine: the test's main "test_once" function spawns THREAD_COUNT (10) "once_contender_thread" threads, each of which waits for a POSIX rwlock to be fired by the main thread, and this is repeated 50,000 times. After an arbitrary number of iterations the main thread sees that one (random) thread is not ready. That would be the case if the thread were still waiting for a wakeup after blocking in gl_rwlock_rdlock (which is actually a futex when translated into glibc pthreads). The threads use these rwlocks after the first iteration (repeat).

So. The whole thing smells (sadly) like some kind of kernel futex bug. It's odd that this affects several architectures (I tried this on AArch64 Fedora 21). Has something dramatic changed in futexes upstream in glibc or the kernel very recently? Does anyone have some thoughts about the best way to triage the kernel futex code here perhaps? I'm too tired tonight.

For the interim I can suggest a couple of quick *hacks*. For one, you could disable the test entirely (which you won't like). For another, you can set ENABLE_DEBUGGING to 1 instead of 0 via a small patch to test-lock.c, since the interaction caused by the logging output invariably seems to result in the tests completing in the various quick runs I did here tonight. Apart from the extra output, the behavior of the test with debugging on is identical to the behavior with it off. That is the most ugly and nasty approach, I agree.
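To illustrate why flipping that define changes timing but not logic, here is a minimal sketch of a compile-time debug switch of this kind. It is an assumption about the general shape, not a copy of test-lock.c, and contender_step is a made-up stand-in for the real thread body.

```c
#include <stdio.h>

/* Set to 1 to enable the logging; the test logic is the same either way. */
#define ENABLE_DEBUGGING 0

#if ENABLE_DEBUGGING
# define dbgprintf printf
#else
# define dbgprintf if (0) printf   /* call is compiled but never executed */
#endif

/* Hypothetical stand-in for one round of a contender thread. */
int contender_step(int id, int repeat)
{
    dbgprintf("Contender %d waiting for signal for round %d\n", id, repeat);
    /* ... the real work (rdlock, once call, unlock) would go here ... */
    dbgprintf("Contender %d done with round %d\n", id, repeat);
    return repeat + 1;   /* number of the next round */
}
```

With the define at 1 every step goes through stdio, which takes its own lock and slows the threads down. That is plausibly enough to hide a narrow race window, which is why this is a hack and not a fix.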

Jon.

Comment 3 Marcin Juszkiewicz 2014-10-28 12:16:46 UTC

Reported upstream: http://savannah.gnu.org/bugs/?43487

Comment 4 Jon Masters 2014-10-29 04:29:01 UTC

Please try a kernel after 76835b0ebf8a7fe85beb03c75121419a7dec52f0 has been applied. I believe this is a bug in the futex code due to a missing barrier. Futexes are used to back NPTL POSIX pthreads that are used in the test case.
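As general background on how a missing barrier can lose a wakeup, here is a userspace sketch using C11 atomics. It is not the kernel code and does not claim to mirror what the commit changes; the names futex_word, waiters, waiter_prepare and waker_fire are invented for illustration. The waiter announces itself and then checks the value, the waker publishes the value and then checks for waiters. Each side needs a full fence between its store and its load, otherwise both can read stale data: the waiter goes to sleep and the waker skips the wakeup.

```c
#include <stdatomic.h>

static atomic_int futex_word;  /* 0 = not fired yet, 1 = fired */
static atomic_int waiters;     /* threads that announced they may sleep */

/* Waiter side: returns 1 if the caller has to go to sleep, 0 if the
 * signal has already fired and sleeping would be wrong. */
int waiter_prepare(void)
{
    atomic_fetch_add_explicit(&waiters, 1, memory_order_relaxed);
    /* Full barrier: the increment must be visible before we read the word. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&futex_word, memory_order_relaxed) != 0) {
        atomic_fetch_sub_explicit(&waiters, 1, memory_order_relaxed);
        return 0;
    }
    return 1;
}

/* Waker side: returns 1 if there is somebody to wake up. */
int waker_fire(void)
{
    atomic_store_explicit(&futex_word, 1, memory_order_relaxed);
    /* Full barrier: the store must be visible before we read the count.
     * Without it the waker can see waiters == 0 and skip the wakeup. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&waiters, memory_order_relaxed) > 0;
}
```

With both fences in place at least one side is guaranteed to notice the other. Remove either one and, on weakly ordered machines in particular, a thread can sleep forever, which matches the symptom of one random contender never becoming ready.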

Comment 6 Dan Horák 2014-11-04 09:23:54 UTC

jwb: This will get fixed automagically today.  It was included in the 3.16.7
and 3.17.2 stable releases that just happened.
