I had big trouble getting glibc-2.14-4 built for Fedora 15 on s390x, and from what I was able to find, the builds were stuck when the "sort" utility (multi-threaded) was called from some Makefile. It happened in mock builds with F-14 and RHEL-6 as the hosting system; the buildroot was always F-15 (tested with glibc-2.14-2, glibc-2.14-3 and glibc-2.13.90-9), and only the s390x build showed the problem (s390 was OK). The workaround was using a virtual machine with only 1 CPU as the host. All these facts make me think that something could be wrong with threading on s390x. Andreas, please take this report more as a placeholder until more precise information is gathered.
Since it's only sort, it's more likely a problem in multi-threaded sorting.
...since it's only s390x it might be anywhere... ;) but keeping this on coreutils until Dan gathers more info...
The hanging sort looks like this in gdb (run in the mock chroot):

(gdb) file /bin/sort
Reading symbols from /bin/sort...Reading symbols from /usr/lib/debug/bin/sort.debug...done.
done.
(gdb) attach 2973
Attaching to program: /bin/sort, process 2973
Reading symbols from /lib64/libpthread.so.0...Reading symbols from /usr/lib/debug/lib64/libpthread-2.14.so.debug...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x200080fd910 (LWP 2976)]
[New Thread 0x200076fd910 (LWP 2975)]
[New Thread 0x20006cfd910 (LWP 2974)]
done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...Reading symbols from /usr/lib/debug/lib64/libc-2.14.so.debug...done.
done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib/ld64.so.1...Reading symbols from /usr/lib/debug/lib64/ld-2.14.so.debug...done.
done.
Loaded symbols for /lib/ld64.so.1
0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
156       lll_futex_wait (&cond->__data.__futex, futex_val, pshared);
(gdb) where
#0  0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
#1  0x000000008000a01a in queue_pop (queue=<optimized out>) at sort.c:3905
#2  merge_loop (temp_output=0x0, tfp=0x200001f5d38, total_lines=157, queue=0x3ffffc17908) at sort.c:4053
#3  sortlines (lines=<optimized out>, nthreads=<optimized out>, total_lines=157, node=0x8001f4d0, queue=0x3ffffc17908, tfp=0x200001f5d38, temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4182
#4  0x000000008000a7ac in sortlines (lines=<optimized out>, nthreads=2, total_lines=157, node=0x8001f3d0, queue=0x3ffffc17908, tfp=0x200001f5d38, temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4159
#5  0x000000008000a7ac in sortlines (lines=<optimized out>, nthreads=4, total_lines=157, node=0x8001f1d0, queue=0x3ffffc17908, tfp=0x200001f5d38, temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4159
#6  0x00000000800050e8 in sort (nthreads=4, output_file=0x0, nfiles=0, files=0x8001db68) at sort.c:4454
#7  main (argc=<optimized out>, argv=<optimized out>) at sort.c:5276
Also, the "hangs in sort" are more frequent than I originally thought; they are responsible for other builds stuck in different phases, but all hanging in sort.
And because I see a new hanging build on s390 right now (so even 32-bit this time), the problem is much more frequent and possibly goes further back in time than expected. Something must be really rotten somewhere ...
(gdb) thread apply all where

Thread 4 (Thread 0x20006cfd910 (LWP 2974)):
#0  0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
#1  0x000000008000a01a in queue_pop (queue=<optimized out>) at sort.c:3905
#2  merge_loop (temp_output=0x0, tfp=0x200001f5d38, total_lines=157, queue=0x3ffffc17908) at sort.c:4053
#3  sortlines (lines=<optimized out>, nthreads=<optimized out>, total_lines=157, node=0x8001f350, queue=0x3ffffc17908, tfp=0x200001f5d38, temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4182
#4  0x000000008000a7ac in sortlines (lines=<optimized out>, nthreads=2, total_lines=157, node=0x8001f250, queue=0x3ffffc17908, tfp=0x200001f5d38, temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4159
#5  0x000000008000a83c in sortlines_thread (data=<optimized out>) at sort.c:4112
#6  0x0000020000038298 in start_thread (arg=0x20006cfd910) at pthread_create.c:307
#7  0x0000020000150bd2 in thread_start () from /lib64/libc.so.6

Thread 3 (Thread 0x200076fd910 (LWP 2975)):
#0  0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
#1  0x000000008000a01a in queue_pop (queue=<optimized out>) at sort.c:3905
#2  merge_loop (temp_output=0x0, tfp=0x200001f5d38, total_lines=157, queue=0x3ffffc17908) at sort.c:4053
#3  sortlines (lines=<optimized out>, nthreads=<optimized out>, total_lines=157, node=0x8001f450, queue=0x3ffffc17908, tfp=0x200001f5d38, temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4182
#4  0x000000008000a83c in sortlines_thread (data=<optimized out>) at sort.c:4112
#5  0x0000020000038298 in start_thread (arg=0x200076fd910) at pthread_create.c:307
#6  0x0000020000150bd2 in thread_start () from /lib64/libc.so.6

Thread 2 (Thread 0x200080fd910 (LWP 2976)):
#0  0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
#1  0x000000008000a01a in queue_pop (queue=<optimized out>) at sort.c:3905
#2  merge_loop (temp_output=0x0, tfp=0x200001f5d38, total_lines=157, queue=0x3ffffc17908) at sort.c:4053
#3  sortlines (lines=<optimized out>, nthreads=<optimized out>, total_lines=157, node=0x8001f2d0, queue=0x3ffffc17908, tfp=0x200001f5d38, temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4182
#4  0x000000008000a83c in sortlines_thread (data=<optimized out>) at sort.c:4112
#5  0x0000020000038298 in start_thread (arg=0x200080fd910) at pthread_create.c:307
#6  0x0000020000150bd2 in thread_start () from /lib64/libc.so.6

Thread 1 (Thread 0x200001fbb00 (LWP 2973)):
#0  0x000002000003c894 in __pthread_cond_wait (cond=0x3ffffc17938, mutex=0x3ffffc17910) at pthread_cond_wait.c:156
#1  0x000000008000a01a in queue_pop (queue=<optimized out>) at sort.c:3905
#2  merge_loop (temp_output=0x0, tfp=0x200001f5d38, total_lines=157, queue=0x3ffffc17908) at sort.c:4053
#3  sortlines (lines=<optimized out>, nthreads=<optimized out>, total_lines=157, node=0x8001f4d0, queue=0x3ffffc17908, tfp=0x200001f5d38, temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4182
#4  0x000000008000a7ac in sortlines (lines=<optimized out>, nthreads=2, total_lines=157, node=0x8001f3d0, queue=0x3ffffc17908, tfp=0x200001f5d38, temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4159
#5  0x000000008000a7ac in sortlines (lines=<optimized out>, nthreads=4, total_lines=157, node=0x8001f1d0, queue=0x3ffffc17908, tfp=0x200001f5d38, temp_output=0x0, is_lo_child=<optimized out>) at sort.c:4159
#6  0x00000000800050e8 in sort (nthreads=4, output_file=0x0, nfiles=0, files=0x8001db68) at sort.c:4454
#7  main (argc=<optimized out>, argv=<optimized out>) at sort.c:5276
I've seen a deadlocked python recently during a build of gx_head in rawhide. And I had problems with gx_head in the past ...
E.g. 4 tries were required to build gx_head now - http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=428582
(In reply to comment #7)
> I've seen a deadlocked python recently during a build of gx_head in rawhide.
> And I had problems with gx_head in the past ...

It's from the Python-based waf build system.
And I should also mention that I haven't seen these deadlocks when building F-16 packages with glibc 2.14.90-[34] in the buildroots.
Switched back to glibc as it looks like a general issue, limited to F-15.
Must be a kernel bug then.
Hm, probably for the first time I see a deadlock in an F-16 build (using the waf buildsystem) - http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=455615
Saw the same on ARMv7 hardfloat today. I've been compiling packages for the ARMv7 bootstrap for many days, and today it started hanging. First when trying to build python3, where the test suite repeatedly hung waiting in futex(0xXXXXXXXX, FUTEX_WAIT_PRIVATE, ..), and then a little later a sort command hung in the same manner. In all cases stracing the process is sufficient to get it running again, and strace then reports the following pattern:

$ sudo strace -p 9181
Process 9181 attached - interrupt to quit
futex(0xbedeaad8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xbedeaad8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xbedeaabc, FUTEX_WAKE_PRIVATE, 1) = 0
Relevant software versions on ARMv7: kernel-tegra-2.6.40.3-0.fc15.armv7hl glibc-2.13.90-9.armv7hl
if this is a kernel bug (and I don't see any evidence it is), I'd suggest taking this to linux-kernel, where there are people who actually know something about these architectures. filing such bugs against fedora kernel is just going to leave them sitting here.
I'm also seeing a deadlock on a Fedora 16 x86_64 system while running mock to build libldb (also a WAF buildsystem) for i686. There are two processes sitting waiting on a futex:

[sgallagh@vm-048 result]$ sudo strace -p 15276
Process 15276 attached - interrupt to quit
[ Process PID=15276 runs in 32 bit mode. ]
futex(0xf4f023f8, FUTEX_WAIT_PRIVATE, 0, NULL

and

[sgallagh@vm-048 result]$ sudo strace -p 15277
Process 15277 attached - interrupt to quit
[ Process PID=15277 runs in 32 bit mode. ]
futex(0x8880418, FUTEX_WAIT_PRIVATE, 0, NULL

I'm not sure if it's important, but I should note that it's running in 32-bit mode on a 64-bit OS (inside mock).
A question - is anyone seeing this without Python involved? If not, bug #787712 might be relevant; Python in F16 (and possibly earlier versions) has a bug in its low-level implementation of os.fork() (as used by all the multi-process Python modules) that results in a deadlock.
(In reply to comment #18)
> A question - is anyone seeing this without Python involved? If not, bug
> #787712 might be relevant; Python in F16 (and possibly earlier versions) has
> a bug in its low-level implementation of os.fork() (as used by all the
> multi-process Python modules) that results in a deadlock.

Yes, the sort utility is also affected. And it looks to me like the deadlocks are much more common in F-15 than in F >= 16 ...
kernel-3.2.6-3.fc16.x86_64
glibc-2.14.90-24.fc16.4.x86_64

RHEL-6.2 kvm host; /dev/sdb SSD drive exported raw to the kvm guest.
Fedora 16 x86_64 kvm guest; /dev/vdb is the SSD drive from the host.

mount -t btrfs -o ssd,compress=lzo /dev/vdb1 /home

Several times a day I randomly experience lockups while running git and some python scripts on this /home filesystem. They get stuck with FUTEX_WAIT_PRIVATE. Running these commands under strace -f seems to prevent them from getting stuck.
I would urge anyone who's only seeing this when running Python to first check bug #787712 - Dan Horák is seeing a more general issue on S/390, while there is a *known* Python deadlock bug in Fedora 16, bug #787712. Please, let's not clutter this bug with people seeing Python lockups due to a known Python bug, and leave it for the S/390 general threading lockup bug.
The threading lockup is not isolated to S/390; it has also been seen on ARM, and there are some indications it's seen on i686 as well. But it's very hard to tell whether these are multiple arch-dependent bugs with a similar result or one common bug. The general frequency is quite low. But naturally, everything Python-related goes to bug #787712.
[mass update] kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository. Please retest with this update.