Bug 720005 - possible threading lockup issue
Summary: possible threading lockup issue
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-07-08 16:37 UTC by Dan Horák
Modified: 2012-09-04 13:45 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-09-04 13:45:28 UTC
Type: ---
Embargoed:



Description Dan Horák 2011-07-08 16:37:46 UTC
I had a lot of trouble getting glibc-2.14-4 built for Fedora 15 on s390x. From what I was able to find, the builds got stuck when the multi-threaded "sort" utility was called from a Makefile. It happened in mock builds with F-14 and RHEL-6 as the host system; the buildroot was always F-15 (tested with glibc-2.14-2, glibc-2.14-3 and glibc-2.13.90-9), and only the s390x builds showed the problem (s390 was OK). The workaround was to use a virtual machine with only 1 CPU as the host. All these facts make me think that something could be wrong with threading on s390x.

Andreas, please treat this report as a placeholder until more precise information is gathered.

Comment 1 Andreas Schwab 2011-07-11 06:41:43 UTC
Since it's only sort it's more likely a problem in multi-threaded sorting.

Comment 2 Ondrej Vasik 2011-07-11 13:25:21 UTC
...since it's only s390x it might be anywhere... ;)

but keeping on coreutils until Dan gathers more info...

Comment 3 Dan Horák 2011-07-12 08:38:32 UTC
The hanging sort looks like this in gdb (run in the mock chroot):

(gdb) file /bin/sort
Reading symbols from /bin/sort...Reading symbols from /usr/lib/debug/bin/sort.debug...done.
done.
(gdb) attach 2973
Attaching to program: /bin/sort, process 2973
Reading symbols from /lib64/libpthread.so.0...Reading symbols from /usr/lib/debug/lib64/libpthread-2.14.so.debug...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x200080fd910 (LWP 2976)]
[New Thread 0x200076fd910 (LWP 2975)]
[New Thread 0x20006cfd910 (LWP 2974)]
done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...Reading symbols from /usr/lib/debug/lib64/libc-2.14.so.debug...done.
done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib/ld64.so.1...Reading symbols from /usr/lib/debug/lib64/ld-2.14.so.debug...done.
done.
Loaded symbols for /lib/ld64.so.1
0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
156	      lll_futex_wait (&cond->__data.__futex, futex_val, pshared);
(gdb) where
#0  0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
#1  0x000000008000a01a in queue_pop (queue=<optimized out>) at sort.c:3905
#2  merge_loop (temp_output=0x0, tfp=0x200001f5d38, total_lines=157, queue=0x3ffffc17908) at sort.c:4053
#3  sortlines (lines=<optimized out>, nthreads=<optimized out>, total_lines=157, node=0x8001f4d0, queue=0x3ffffc17908, tfp=0x200001f5d38, 
    temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4182
#4  0x000000008000a7ac in sortlines (lines=<optimized out>, nthreads=2, total_lines=157, node=0x8001f3d0, queue=0x3ffffc17908, tfp=0x200001f5d38, 
    temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4159
#5  0x000000008000a7ac in sortlines (lines=<optimized out>, nthreads=4, total_lines=157, node=0x8001f1d0, queue=0x3ffffc17908, tfp=0x200001f5d38, 
    temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4159
#6  0x00000000800050e8 in sort (nthreads=4, output_file=0x0, nfiles=0, files=0x8001db68) at sort.c:4454
#7  main (argc=<optimized out>, argv=<optimized out>) at sort.c:5276

Comment 4 Dan Horák 2011-07-12 08:43:40 UTC
Also, the hangs in sort are more frequent than I originally thought: they are responsible for other stuck builds in different phases, all of them hanging in sort.

Comment 5 Dan Horák 2011-07-12 08:59:59 UTC
And because I see a new hanging build on s390 right now (so even 32-bit this time), the problem is much more frequent and possibly dates back further than I assumed. Something must be really rotten somewhere ...

Comment 6 Dan Horák 2011-07-12 09:08:55 UTC
(gdb) thread apply all where

Thread 4 (Thread 0x20006cfd910 (LWP 2974)):
#0  0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
#1  0x000000008000a01a in queue_pop (queue=<optimized out>) at sort.c:3905
#2  merge_loop (temp_output=0x0, tfp=0x200001f5d38, total_lines=157, queue=0x3ffffc17908) at sort.c:4053
#3  sortlines (lines=<optimized out>, nthreads=<optimized out>, total_lines=157, node=0x8001f350, queue=0x3ffffc17908, tfp=0x200001f5d38, 
    temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4182
#4  0x000000008000a7ac in sortlines (lines=<optimized out>, nthreads=2, total_lines=157, node=0x8001f250, queue=0x3ffffc17908, tfp=0x200001f5d38, 
    temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4159
#5  0x000000008000a83c in sortlines_thread (data=<optimized out>) at sort.c:4112
#6  0x0000020000038298 in start_thread (arg=0x20006cfd910) at pthread_create.c:307
#7  0x0000020000150bd2 in thread_start () from /lib64/libc.so.6

Thread 3 (Thread 0x200076fd910 (LWP 2975)):
#0  0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
#1  0x000000008000a01a in queue_pop (queue=<optimized out>) at sort.c:3905
#2  merge_loop (temp_output=0x0, tfp=0x200001f5d38, total_lines=157, queue=0x3ffffc17908) at sort.c:4053
#3  sortlines (lines=<optimized out>, nthreads=<optimized out>, total_lines=157, node=0x8001f450, queue=0x3ffffc17908, tfp=0x200001f5d38, 
    temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4182
#4  0x000000008000a83c in sortlines_thread (data=<optimized out>) at sort.c:4112
#5  0x0000020000038298 in start_thread (arg=0x200076fd910) at pthread_create.c:307
#6  0x0000020000150bd2 in thread_start () from /lib64/libc.so.6

Thread 2 (Thread 0x200080fd910 (LWP 2976)):
#0  0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
#1  0x000000008000a01a in queue_pop (queue=<optimized out>) at sort.c:3905
#2  merge_loop (temp_output=0x0, tfp=0x200001f5d38, total_lines=157, queue=0x3ffffc17908) at sort.c:4053
#3  sortlines (lines=<optimized out>, nthreads=<optimized out>, total_lines=157, node=0x8001f2d0, queue=0x3ffffc17908, tfp=0x200001f5d38, 
    temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4182
#4  0x000000008000a83c in sortlines_thread (data=<optimized out>) at sort.c:4112
#5  0x0000020000038298 in start_thread (arg=0x200080fd910) at pthread_create.c:307
#6  0x0000020000150bd2 in thread_start () from /lib64/libc.so.6

Thread 1 (Thread 0x200001fbb00 (LWP 2973)):
#0  0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
#1  0x000000008000a01a in queue_pop (queue=<optimized out>) at sort.c:3905
#2  merge_loop (temp_output=0x0, tfp=0x200001f5d38, total_lines=157, queue=0x3ffffc17908) at sort.c:4053
#3  sortlines (lines=<optimized out>, nthreads=<optimized out>, total_lines=157, node=0x8001f4d0, queue=0x3ffffc17908, tfp=0x200001f5d38, 
    temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4182
#4  0x000000008000a7ac in sortlines (lines=<optimized out>, nthreads=2, total_lines=157, node=0x8001f3d0, queue=0x3ffffc17908, tfp=0x200001f5d38, 
    temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4159
#5  0x000000008000a7ac in sortlines (lines=<optimized out>, nthreads=4, total_lines=157, node=0x8001f1d0, queue=0x3ffffc17908, tfp=0x200001f5d38, 
    temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4159
#6  0x00000000800050e8 in sort (nthreads=4, output_file=0x0, nfiles=0, files=0x8001db68) at sort.c:4454
#7  main (argc=<optimized out>, argv=<optimized out>) at sort.c:5276
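
The four backtraces above show every thread, including the main thread, blocked in queue_pop on the same condition variable (cond=0x3ffffc17938), so no thread is left to insert work and signal it. As a rough illustration of that wait/notify pattern, here is a minimal Python sketch. It is not the sort.c code: MergeQueue, insert, pop and run_demo are hypothetical names chosen for this example, and the timeout exists only so that the sketch terminates, whereas the real queue_pop waits indefinitely.

```python
import threading


class MergeQueue:
    """Toy stand-in for a queue guarded by a mutex and a condition variable."""

    def __init__(self):
        self._items = []
        self._cond = threading.Condition()  # bundles the mutex and the condvar

    def insert(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # wake one waiting consumer

    def pop(self, timeout=None):
        with self._cond:
            # wait_for re-checks the predicate after every wakeup and
            # returns a false value if the timeout expires first.
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout=timeout):
                return None  # nothing ever arrived
            return self._items.pop(0)


def run_demo(nworkers=4, produce=True):
    """Start nworkers consumers; optionally insert one item per consumer."""
    queue = MergeQueue()
    results = []
    results_lock = threading.Lock()

    def worker():
        item = queue.pop(timeout=0.5)
        with results_lock:
            results.append(item)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    if produce:
        for i in range(nworkers):
            queue.insert(i)
    for t in threads:
        t.join()
    return results
```

With produce=False every worker gives up after the timeout and records None, which corresponds to the state in the backtraces: all consumers waiting and nobody left to signal. Whether the real hang comes from a missing insert in sort or from a wake-up that was issued but never delivered cannot be decided from the backtraces alone.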

Comment 7 Dan Horák 2011-07-22 10:24:45 UTC
I've seen a deadlocked python recently during a build of gx_head in rawhide. And I had problems with gx_head in the past ...

Comment 8 Dan Horák 2011-07-22 11:14:09 UTC
E.g., 4 tries were required to build gx_head just now - http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=428582
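
Since the hang is intermittent and a retry eventually succeeds, a build step can at least be bounded by a timeout so that a stuck process is detected and retried instead of blocking forever. The following is a minimal sketch of that idea, not something koji or mock is known to do; run_with_retries and its parameters are hypothetical names for this example.

```python
import subprocess
import sys


def run_with_retries(cmd, timeout_s, max_tries):
    """Run cmd; if it exceeds timeout_s, kill it and try again.

    Returns (attempt_number, CompletedProcess) once a run finishes in time.
    Raises RuntimeError if every attempt times out.
    """
    for attempt in range(1, max_tries + 1):
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
            return attempt, proc
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the child at this point.
            print(f"attempt {attempt} timed out after {timeout_s}s",
                  file=sys.stderr)
    raise RuntimeError(f"{cmd!r} timed out {max_tries} times")
```

For example, run_with_retries(["make"], timeout_s=3600, max_tries=4) would allow up to four attempts of one hour each. This only masks the problem; it does nothing to locate the lost wake-up.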

Comment 9 Dan Horák 2011-07-26 14:22:30 UTC
(In reply to comment #7)
> I've seen a deadlocked python recently during a build of gx_head in rawhide.
> And I had problems with gx_head in the past ...

The deadlocked python is from the Python-based waf build system.

Comment 10 Dan Horák 2011-07-28 20:41:11 UTC
And I should also mention that I haven't seen these deadlocks when building F-16 packages with glibc 2.14.90-[34] in the buildroots.

Comment 11 Dan Horák 2011-08-25 18:25:13 UTC
Switched back to glibc, as it looks like a general issue limited to F-15.

Comment 12 Andreas Schwab 2011-08-29 07:29:22 UTC
Must be a kernel bug then.

Comment 13 Dan Horák 2011-09-05 06:50:54 UTC
Hm, probably for the first time I see a deadlock in an F-16 build (using the waf build system) - http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=455615

Comment 14 Henrik Nordström 2011-09-12 22:02:54 UTC
Saw the same on ARMv7 hardfloat today. I have been compiling packages for the ARMv7 bootstrap for many days, and today it started hanging. First when trying to build python3, where the test suite repeatedly hung waiting in futex(0xXXXXXXXX, FUTEX_WAIT_PRIVATE, ..), and then a little later a sort command hung in the same manner.

In all cases stracing the process is sufficient to get it running again, and strace then reports the following pattern:

$ sudo strace -p 9181
Process 9181 attached - interrupt to quit
futex(0xbedeaad8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xbedeaad8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xbedeaabc, FUTEX_WAKE_PRIVATE, 1) = 0
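
One way to read the first line of that trace: FUTEX_WAIT only puts the caller to sleep if the futex word still holds the expected value, and fails with EAGAIN otherwise. If attaching strace interrupts the blocked call and it is restarted, the kernel re-checks the word, finds it no longer equals 2, and returns EAGAIN, after which the thread carries on. The sketch below models only that entry check; it is a toy model, not the kernel implementation, and FutexWord, futex_wait, BLOCK and replay are hypothetical names. That another thread changed the word without the waiter being woken is an assumption consistent with the trace, not an established diagnosis.

```python
import errno

BLOCK = "block"  # marker: the caller would be put to sleep here


class FutexWord:
    """Toy model of the 32-bit word a futex operates on."""

    def __init__(self, value):
        self.value = value


def futex_wait(word, expected):
    """Model of the FUTEX_WAIT entry check only."""
    if word.value != expected:
        return -errno.EAGAIN  # word changed: do not sleep
    return BLOCK


def replay():
    word = FutexWord(2)           # 2 is the expected value seen in the trace
    first = futex_wait(word, 2)   # values match, so the waiter goes to sleep
    word.value = 0                # assumed: another thread changes the word,
                                  # but its wake-up never reaches the waiter
    second = futex_wait(word, 2)  # the call is restarted after strace attaches
    return first, second
```

Under these assumptions the second call returns -EAGAIN, matching the "= -1 EAGAIN" line above, and the thread then issues the two FUTEX_WAKE calls itself.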

Comment 15 Henrik Nordström 2011-09-12 22:06:14 UTC
Relevant software versions on ARMv7:

kernel-tegra-2.6.40.3-0.fc15.armv7hl
glibc-2.13.90-9.armv7hl

Comment 16 Dave Jones 2011-10-06 19:15:10 UTC
If this is a kernel bug (and I don't see any evidence it is), I'd suggest taking this to linux-kernel, where there are people who actually know something about these architectures.

Filing such bugs against the Fedora kernel is just going to leave them sitting here.

Comment 17 Stephen Gallagher 2011-12-09 14:28:29 UTC
I'm also seeing a deadlock on a Fedora 16 x86_64 system while running mock to build libldb (also a WAF buildsystem) for i686.

There are two processes sitting awaiting a futex:

[sgallagh@vm-048 result]$ sudo strace -p 15276
Process 15276 attached - interrupt to quit
[ Process PID=15276 runs in 32 bit mode. ]
futex(0xf4f023f8, FUTEX_WAIT_PRIVATE, 0, NULL

and

[sgallagh@vm-048 result]$ sudo strace -p 15277
Process 15277 attached - interrupt to quit
[ Process PID=15277 runs in 32 bit mode. ]
futex(0x8880418, FUTEX_WAIT_PRIVATE, 0, NULL


I'm not sure if it's important, but I should note that it's running in 32-bit mode on a 64-bit OS (inside mock).

Comment 18 Simon Farnsworth 2012-02-10 23:15:39 UTC
A question - is anyone seeing this without Python involved? If not, bug #787712 might be relevant; Python in F16 (and possibly earlier versions) has a bug in its low-level implementation of os.fork() (as used by all the multi-process Python modules) that results in a deadlock.

Comment 19 Dan Horák 2012-02-13 13:39:10 UTC
(In reply to comment #18)
> A question - is anyone seeing this without Python involved? If not, bug #787712
> might be relevant; Python in F16 (and possibly earlier versions) has a bug in
> its low-level implementation of os.fork() (as used by all the multi-process
> Python modules) that results in a deadlock.

Yes, the sort utility is also affected.

And it looks to me as if the deadlocks are much more common in F-15 than in F >= 16 ...

Comment 20 Warren Togami 2012-02-24 07:41:44 UTC
kernel-3.2.6-3.fc16.x86_64
glibc-2.14.90-24.fc16.4.x86_64

RHEL-6.2 kvm host
/dev/sdb SSD drive exported raw to the kvm guest
  Fedora 16 x86_64 kvm guest
  /dev/vdb is the SSD drive from the host.
  mount -t btrfs -o ssd,compress=lzo /dev/vdb1 /home

Several times a day I randomly experience lockups while running git and some python scripts on this /home filesystem.  They get stuck with FUTEX_WAIT_PRIVATE.  Running these commands under strace -f seems to prevent them from getting stuck.

Comment 21 Simon Farnsworth 2012-02-24 10:44:19 UTC
I would urge anyone who's only seeing this when running Python to first check bug #787712 - Dan Horák is seeing a more general issue on S/390, while there is a *known* Python deadlock bug in Fedora 16, bug #787712.

Please, let's not clutter this bug with people seeing Python lockups due to a known Python bug, and leave it for the S/390 general threading lockup bug.

Comment 22 Henrik Nordström 2012-02-25 13:25:32 UTC
The threading lockup is not isolated to S/390; it has also been seen on ARM, and there are some indications that it occurs on i686 as well. But it's very hard to tell whether these are multiple arch-dependent bugs with similar results or one common bug. The overall frequency is quite low.

But naturally, everything Python goes to bug #787712.

Comment 23 Dave Jones 2012-03-22 16:47:05 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 24 Dave Jones 2012-03-22 16:51:43 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 25 Dave Jones 2012-03-22 17:02:24 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

