Bug 750419 - Hang/livelock in pthread_create
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Assigned To: Aristeu Rozanski
QA Contact: Red Hat Kernel QE team
Reported: 2011-10-31 23:23 EDT by Todd Lipcon
Modified: 2012-08-01 09:19 EDT (History)
Doc Type: Bug Fix
Last Closed: 2012-02-09 15:52:01 EST
Attachments
strace log of hung process (45.06 KB, application/x-gzip)
2011-11-14 13:31 EST, Todd Lipcon
another strace (44.84 KB, application/x-gzip)
2011-11-14 13:54 EST, Todd Lipcon

Description Todd Lipcon 2011-10-31 23:23:45 EDT
Description of problem:

When doing a load test which spawns a lot of JVMs, sometimes the JVM initialization hangs. One of the threads is blocked with the following stack:

Thread 1 (process 8922):
#0  0x000000341a606ea0 in pthread_create@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#1  0x00007f07738d4759 in os::create_thread(Thread*, os::ThreadType, unsigned long) () from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#2  0x00007f07739cf71b in WatcherThread::start() () from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#3  0x00007f07739d462a in Threads::create_vm(JavaVMInitArgs*, bool*) () from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#4  0x00007f0773701fd0 in JNI_CreateJavaVM () from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#5  0x00000000400035f8 in InitializeJVM ()
#6  0x000000004000206e in JavaMain ()
#7  0x000000341a6077e1 in start_thread () from /lib64/libpthread.so.0
#8  0x000000341a2e18ed in clone () from /lib64/libc.so.6

This thread also takes up 100% CPU.

Breaking into gdb on this thread, it appears to always be stuck on this particular instruction: pthread_create@@GLIBC_2.2.5+1680

   │0x341a606e63 <pthread_create@@GLIBC_2.2.5+1619> syscall
   │0x341a606e65 <pthread_create@@GLIBC_2.2.5+1621> mov    0x30c(%rbx),%ecx
   │0x341a606e6b <pthread_create@@GLIBC_2.2.5+1627> mov    %eax,0x63c(%rbx)
   │0x341a606e71 <pthread_create@@GLIBC_2.2.5+1633> or     $0x40,%ecx
   │0x341a606e74 <pthread_create@@GLIBC_2.2.5+1636> mov    %ecx,0x30c(%rbx)
   │0x341a606e7a <pthread_create@@GLIBC_2.2.5+1642> mov    0x8(%rbp),%eax
   │0x341a606e7d <pthread_create@@GLIBC_2.2.5+1645> jmpq   0x341a606dba <pthread_create@@GLIBC_2.2.5+1450>
   │0x341a606e82 <pthread_create@@GLIBC_2.2.5+1650> mov    0x38(%rsp),%rdi
   │0x341a606e87 <pthread_create@@GLIBC_2.2.5+1655> mov    %r15d,%edx
   │0x341a606e8a <pthread_create@@GLIBC_2.2.5+1658> add    %r10,%rdi
   │0x341a606e8d <pthread_create@@GLIBC_2.2.5+1661> mov    %r10,0x8(%rsp)
   │0x341a606e92 <pthread_create@@GLIBC_2.2.5+1666> callq  0x341a605218 <mprotect@plt>
   │0x341a606e97 <pthread_create@@GLIBC_2.2.5+1671> test   %eax,%eax
   │0x341a606e99 <pthread_create@@GLIBC_2.2.5+1673> mov    0x8(%rsp),%r10
   │0x341a606e9e <pthread_create@@GLIBC_2.2.5+1678> jne    0x341a606eb8 <pthread_create@@GLIBC_2.2.5+1704>
  >│0x341a606ea0 <pthread_create@@GLIBC_2.2.5+1680> mov    %r10,0x6a0(%rbx)
   │0x341a606ea7 <pthread_create@@GLIBC_2.2.5+1687> jmpq   0x341a606a7a <pthread_create@@GLIBC_2.2.5+618>
   │0x341a606eac <pthread_create@@GLIBC_2.2.5+1692> xor    %edx,%edx
   │0x341a606eae <pthread_create@@GLIBC_2.2.5+1694> mov    %r10,%rsi
   │0x341a606eb1 <pthread_create@@GLIBC_2.2.5+1697> mov    0x38(%rsp),%rdi
   │0x341a606eb6 <pthread_create@@GLIBC_2.2.5+1702> jmp    0x341a606e8d <pthread_create@@GLIBC_2.2.5+1661>
   │0x341a606eb8 <pthread_create@@GLIBC_2.2.5+1704> mov    0x2110c1(%rip),%rax        # 0x341a817f80
   │0x341a606ebf <pthread_create@@GLIBC_2.2.5+1711> mov    $0x1,%esi
   │0x341a606ec4 <pthread_create@@GLIBC_2.2.5+1716> mov    %fs:(%rax),%r14d
   │0x341a606ec8 <pthread_create@@GLIBC_2.2.5+1720> xor    %eax,%eax
   │0x341a606eca <pthread_create@@GLIBC_2.2.5+1722> lock cmpxchg %esi,0x2153de(%rip)        # 0x341a81c2b0 <stack_cache_lock>



Version-Release number of selected component (if applicable):
glibc-2.12-1.7.el6_0.5.x86_64


How reproducible:
approximately 0.04% of JVM startups experience this issue
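
To put the 0.04% figure in perspective, here is a rough calculation of how many startups a test run needs. It assumes each JVM startup is an independent trial with the same hang probability, which the report does not establish; the function name is illustrative.

```python
import math

def startups_for_confidence(p_hang: float, confidence: float) -> int:
    """Smallest n such that 1 - (1 - p_hang)**n >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_hang))

P_HANG = 0.0004  # 0.04% per startup, taken from the report

# Mean number of startups between hangs under the independence assumption.
expected_per_hang = 1.0 / P_HANG
# Startups needed for a 95% chance of seeing at least one hang.
n_95 = startups_for_confidence(P_HANG, 0.95)

print(f"expected startups per hang: {expected_per_hang:.0f}")
print(f"startups for a 95% chance of at least one hang: {n_95}")
```

Under that assumption a hang shows up about once per 2,500 startups, so a reproducer has to cycle through several thousand launches before a clean run means much.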

Steps to Reproduce:
I can reproduce it with a Hadoop-based stress test, but I do not have a proper repro script.

This occurs on several different machines. On each machine, when I break into gdb, $pc is on the exact same instruction in libpthread.

Actual results:
Thread spinning

Expected results:
pthread_create should not spin!

Additional info:

[todd@p0117 ~]$ uname -a
Linux p0117 2.6.32-71.el6.x86_64 #1 SMP Fri May 20 03:51:51 BST 2011 x86_64 x86_64 x86_64 GNU/Linux
Comment 2 Todd Lipcon 2011-11-01 11:21:40 EDT
Another odd datapoint is that in gdb, the 'stepi' command just hangs. If I then hit ^C or send SIGCONT, it breaks back into gdb at the exact same instruction.
Comment 3 Andreas Schwab 2011-11-04 04:52:30 EDT
What are the register contents and the memory map?
Comment 4 Todd Lipcon 2011-11-04 11:17:41 EDT
Can you let me know which gdb commands to run for you to get this info? I haven't done much assembly-level debugging in gdb. For the memory map, you just want /proc/<pid>/maps?
Comment 5 Andreas Schwab 2011-11-04 11:31:36 EDT
info registers
info proc mappings
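
For reference, `info proc mappings` shows the same data the kernel exposes in /proc/<pid>/maps. A minimal sketch of reading that file directly, assuming the usual Linux line layout (address range, permissions, offset, device, inode, optional path); the `Mapping` type and function names are illustrative:

```python
from typing import List, NamedTuple

class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    path: str  # empty string for anonymous mappings

def parse_maps_line(line: str) -> Mapping:
    # Fields: "start-end perms offset dev inode [path]"
    fields = line.split(None, 5)
    start_s, end_s = fields[0].split("-")
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(int(start_s, 16), int(end_s, 16), fields[1],
                   int(fields[2], 16), path)

def read_maps(pid: int) -> List[Mapping]:
    # Requires permission to read the target process's /proc entry.
    with open(f"/proc/{pid}/maps") as f:
        return [parse_maps_line(line) for line in f if line.strip()]
```

Unlike the gdb output, the maps file also includes the permission bits of each region.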
Comment 6 Todd Lipcon 2011-11-04 15:13:34 EDT
(gdb) bt
#0  0x000000341a606ea0 in pthread_create@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#1  0x00007f5fbb9d1759 in os::create_thread(Thread*, os::ThreadType, unsigned long) ()
   from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#2  0x00007f5fbbacce0e in JavaThread::JavaThread(void (*)(JavaThread*, Thread*), unsigned long) ()
   from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#3  0x00007f5fbbad09eb in CompilerThread::CompilerThread(CompileQueue*, CompilerCounters*) ()
   from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#4  0x00007f5fbb69868f in CompileBroker::make_compiler_thread(char const*, CompileQueue*, CompilerCounters*, Thread*) ()
   from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#5  0x00007f5fbb698889 in CompileBroker::init_compiler_threads(int) () from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#6  0x00007f5fbb69806e in CompileBroker::compilation_init() () from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#7  0x00007f5fbbad15c4 in Threads::create_vm(JavaVMInitArgs*, bool*) () from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#8  0x00007f5fbb7fefd0 in JNI_CreateJavaVM () from /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#9  0x00000000400035f8 in InitializeJVM ()
#10 0x000000004000206e in JavaMain ()
#11 0x000000341a6077e1 in start_thread () from /lib64/libpthread.so.0
#12 0x000000341a2e18ed in clone () from /lib64/libc.so.6
(gdb) info registers
rax            0x0      0
rbx            0x7f5fb8700700   140049092970240
rcx            0x341a2de3e7     223777514471
rdx            0x0      0
rsi            0x1000   4096
rdi            0x7f5fb8600000   140049091919872
rbp            0x7f5fbb3dd920   0x7f5fbb3dd920
rsp            0x7f5fbb3dd880   0x7f5fbb3dd880
r8             0x0      0
r9             0x0      0
r10            0x1000   4096
r11            0x246    582
r12            0x100000 1048576
r13            0x7f5fb87009c0   140049092970944
r14            0x4      4
r15            0x7      7
rip            0x341a606ea0     0x341a606ea0 <pthread_create@@GLIBC_2.2.5+1680>
eflags         0x10246  [ PF ZF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
(gdb) info proc mappings
process 26942
cmdline = '/usr/java/jdk1.6.0_21/jre/bin/java'
cwd = '/data/4/todd/scratch/tt/local/taskTracker/todd/jobcache/job_201111041159_0002/attempt_201111041159_0002_m_006103_0/work'
exe = '/usr/java/jdk1.6.0_21/jre/bin/java'
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
          0x40000000         0x40009000     0x9000          0                             /usr/java/jdk1.6.0_21/jre/bin/java
          0x40108000         0x4010a000     0x2000     0x8000                             /usr/java/jdk1.6.0_21/jre/bin/java
          0x40223000         0x40244000    0x21000          0                                   [heap]
        0x3419a00000       0x3419a1e000    0x1e000          0                          /lib64/ld-2.12.so
        0x3419c1e000       0x3419c1f000     0x1000    0x1e000                          /lib64/ld-2.12.so
        0x3419c1f000       0x3419c20000     0x1000    0x1f000                          /lib64/ld-2.12.so
        0x3419c20000       0x3419c21000     0x1000          0        
        0x3419e00000       0x3419e02000     0x2000          0                          /lib64/libdl-2.12.so
        0x3419e02000       0x341a002000   0x200000     0x2000                          /lib64/libdl-2.12.so
        0x341a002000       0x341a003000     0x1000     0x2000                          /lib64/libdl-2.12.so
        0x341a003000       0x341a004000     0x1000     0x3000                          /lib64/libdl-2.12.so
        0x341a200000       0x341a375000   0x175000          0                          /lib64/libc-2.12.so
        0x341a375000       0x341a575000   0x200000   0x175000                          /lib64/libc-2.12.so
        0x341a575000       0x341a579000     0x4000   0x175000                          /lib64/libc-2.12.so
        0x341a579000       0x341a57a000     0x1000   0x179000                          /lib64/libc-2.12.so
        0x341a57a000       0x341a57f000     0x5000          0        
        0x341a600000       0x341a617000    0x17000          0                          /lib64/libpthread-2.12.so
        0x341a617000       0x341a817000   0x200000    0x17000                          /lib64/libpthread-2.12.so
        0x341a817000       0x341a818000     0x1000    0x17000                          /lib64/libpthread-2.12.so
        0x341a818000       0x341a819000     0x1000    0x18000                          /lib64/libpthread-2.12.so
        0x341a819000       0x341a81d000     0x4000          0        
        0x341aa00000       0x341aa83000    0x83000          0                          /lib64/libm-2.12.so
        0x341aa83000       0x341ac82000   0x1ff000    0x83000                          /lib64/libm-2.12.so
        0x341ac82000       0x341ac83000     0x1000    0x82000                          /lib64/libm-2.12.so
        0x341ac83000       0x341ac84000     0x1000    0x83000                          /lib64/libm-2.12.so
        0x341b200000       0x341b216000    0x16000          0                          /lib64/libnsl-2.12.so
        0x341b216000       0x341b415000   0x1ff000    0x16000                          /lib64/libnsl-2.12.so
        0x341b415000       0x341b416000     0x1000    0x15000                          /lib64/libnsl-2.12.so
        0x341b416000       0x341b417000     0x1000    0x16000                          /lib64/libnsl-2.12.so
        0x341b417000       0x341b419000     0x2000          0        
        0x37f5c00000       0x37f5c07000     0x7000          0                          /lib64/librt-2.12.so
        0x37f5c07000       0x37f5e06000   0x1ff000     0x7000                          /lib64/librt-2.12.so
        0x37f5e06000       0x37f5e07000     0x1000     0x6000                          /lib64/librt-2.12.so
        0x37f5e07000       0x37f5e08000     0x1000     0x7000                          /lib64/librt-2.12.so
      0x7f5f4a16f000     0x7f5f50000000  0x5e91000          0                     /usr/lib/locale/locale-archive
      0x7f5f50000000     0x7f5f50021000    0x21000          0        
      0x7f5f50021000     0x7f5f54000000  0x3fdf000          0        
      0x7f5f54000000     0x7f5f54021000    0x21000          0        
      0x7f5f54021000     0x7f5f58000000  0x3fdf000          0        
      0x7f5f58000000     0x7f5f58021000    0x21000          0        
      0x7f5f58021000     0x7f5f5c000000  0x3fdf000          0        
      0x7f5f5c000000     0x7f5f5c021000    0x21000          0        
      0x7f5f5c021000     0x7f5f60000000  0x3fdf000          0        
      0x7f5f60000000     0x7f5f60021000    0x21000          0        
      0x7f5f60021000     0x7f5f64000000  0x3fdf000          0        
      0x7f5f64000000     0x7f5f64021000    0x21000          0        
      0x7f5f64021000     0x7f5f68000000  0x3fdf000          0        
      0x7f5f68000000     0x7f5f68021000    0x21000          0        
      0x7f5f68021000     0x7f5f6c000000  0x3fdf000          0        
      0x7f5f6d400000     0x7f5f6e8c0000  0x14c0000          0        
      0x7f5f6e8c0000     0x7f5f72800000  0x3f40000          0        
      0x7f5f72800000     0x7f5f91ec0000 0x1f6c0000          0        
      0x7f5f91ec0000     0x7f5f9c2b0000  0xa3f0000          0        
      0x7f5f9c2b0000     0x7f5fabe10000  0xfb60000          0        
      0x7f5fabe10000     0x7f5fb1000000  0x51f0000          0        
      0x7f5fb1000000     0x7f5fb1270000   0x270000          0        
      0x7f5fb1270000     0x7f5fb40a2000  0x2e32000          0        
      0x7f5fb40a2000     0x7f5fb8000000  0x3f5e000          0        
      0x7f5fb8600000     0x7f5fb8601000     0x1000          0        
      0x7f5fb8601000     0x7f5fb8701000   0x100000          0        
      0x7f5fb8701000     0x7f5fb8704000     0x3000          0        
      0x7f5fb8704000     0x7f5fb8802000    0xfe000          0        
      0x7f5fb8802000     0x7f5fb8805000     0x3000          0        
      0x7f5fb8805000     0x7f5fb8903000    0xfe000          0        
      0x7f5fb8903000     0x7f5fb8906000     0x3000          0        
      0x7f5fb8906000     0x7f5fb8a04000    0xfe000          0        
      0x7f5fb8a04000     0x7f5fb8a05000     0x1000          0        
      0x7f5fb8a05000     0x7f5fb9947000   0xf42000          0        
      0x7f5fb9947000     0x7f5fb9ade000   0x197000  0x3014000                     /usr/java/jdk1.6.0_21/jre/lib/rt.jar
      0x7f5fb9ade000     0x7f5fb9b06000    0x28000          0        
      0x7f5fb9b06000     0x7f5fb9b07000     0x1000          0        
      0x7f5fb9b07000     0x7f5fb9c07000   0x100000          0        
      0x7f5fb9c07000     0x7f5fb9c08000     0x1000          0        
      0x7f5fb9c08000     0x7f5fb9d08000   0x100000          0        
      0x7f5fb9d08000     0x7f5fb9d09000     0x1000          0        
      0x7f5fb9d09000     0x7f5fb9e09000   0x100000          0        
      0x7f5fb9e09000     0x7f5fb9e0a000     0x1000          0        
      0x7f5fb9e0a000     0x7f5fb9f0a000   0x100000          0        
      0x7f5fb9f0a000     0x7f5fb9f0b000     0x1000          0        
      0x7f5fb9f0b000     0x7f5fba00b000   0x100000          0        
      0x7f5fba00b000     0x7f5fba00c000     0x1000          0        
      0x7f5fba00c000     0x7f5fba10c000   0x100000          0        
      0x7f5fba10c000     0x7f5fba10d000     0x1000          0        
      0x7f5fba10d000     0x7f5fba20d000   0x100000          0        
      0x7f5fba20d000     0x7f5fba20e000     0x1000          0        
      0x7f5fba20e000     0x7f5fba30e000   0x100000          0        
      0x7f5fba30e000     0x7f5fba30f000     0x1000          0        
      0x7f5fba30f000     0x7f5fba40f000   0x100000          0        
      0x7f5fba40f000     0x7f5fba410000     0x1000          0        
      0x7f5fba410000     0x7f5fba510000   0x100000          0        
      0x7f5fba510000     0x7f5fba511000     0x1000          0        
      0x7f5fba511000     0x7f5fba611000   0x100000          0        
      0x7f5fba611000     0x7f5fba612000     0x1000          0        
      0x7f5fba612000     0x7f5fba712000   0x100000          0        
      0x7f5fba712000     0x7f5fba713000     0x1000          0        
      0x7f5fba713000     0x7f5fba81e000   0x10b000          0        
      0x7f5fba81e000     0x7f5fba83d000    0x1f000          0        
      0x7f5fba83d000     0x7f5fba939000    0xfc000          0        
      0x7f5fba939000     0x7f5fba98b000    0x52000          0        
      0x7f5fba98b000     0x7f5fba996000     0xb000          0        
      0x7f5fba996000     0x7f5fba9b5000    0x1f000          0        
      0x7f5fba9b5000     0x7f5fbaab1000    0xfc000          0        
      0x7f5fbaab1000     0x7f5fbab02000    0x51000          0        
      0x7f5fbab02000     0x7f5fbab81000    0x7f000          0        
      0x7f5fbab81000     0x7f5fbaba9000    0x28000          0        
      0x7f5fbaba9000     0x7f5fbabb4000     0xb000          0        
      0x7f5fbabb4000     0x7f5fbac6a000    0xb6000          0        
      0x7f5fbac6a000     0x7f5fbac78000     0xe000          0                     /usr/java/jdk1.6.0_21/jre/lib/amd64/libzip.so
      0x7f5fbac78000     0x7f5fbad7a000   0x102000     0xe000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/libzip.so
      0x7f5fbad7a000     0x7f5fbad7d000     0x3000    0x10000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/libzip.so
      0x7f5fbad7d000     0x7f5fbad7e000     0x1000          0        
      0x7f5fbad7e000     0x7f5fbad8a000     0xc000          0                      /lib64/libnss_files-2.12.so
      0x7f5fbad8a000     0x7f5fbaf89000   0x1ff000     0xc000                      /lib64/libnss_files-2.12.so
      0x7f5fbaf89000     0x7f5fbaf8a000     0x1000     0xb000                      /lib64/libnss_files-2.12.so
      0x7f5fbaf8a000     0x7f5fbaf8b000     0x1000     0xc000                      /lib64/libnss_files-2.12.so
      0x7f5fbaf95000     0x7f5fbaf9c000     0x7000          0                     /usr/java/jdk1.6.0_21/jre/lib/amd64/native_threads/libhpi.so
      0x7f5fbaf9c000     0x7f5fbb09d000   0x101000     0x7000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/native_threads/libhpi.so
      0x7f5fbb09d000     0x7f5fbb09f000     0x2000     0x8000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/native_threads/libhpi.so
      0x7f5fbb09f000     0x7f5fbb0a0000     0x1000          0        
      0x7f5fbb0a0000     0x7f5fbb0c8000    0x28000          0                     /usr/java/jdk1.6.0_21/jre/lib/amd64/libjava.so
      0x7f5fbb0c8000     0x7f5fbb1c8000   0x100000    0x28000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/libjava.so
      0x7f5fbb1c8000     0x7f5fbb1cf000     0x7000    0x28000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/libjava.so
      0x7f5fbb1cf000     0x7f5fbb1dc000     0xd000          0                     /usr/java/jdk1.6.0_21/jre/lib/amd64/libverify.so
      0x7f5fbb1dc000     0x7f5fbb2db000    0xff000     0xd000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/libverify.so
      0x7f5fbb2db000     0x7f5fbb2de000     0x3000     0xc000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/libverify.so
      0x7f5fbb2de000     0x7f5fbb2e1000     0x3000          0        
      0x7f5fbb2e1000     0x7f5fbb3df000    0xfe000          0        
      0x7f5fbb3df000     0x7f5fbbbb9000   0x7da000          0                     /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
      0x7f5fbbbb9000     0x7f5fbbcbb000   0x102000   0x7da000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
      0x7f5fbbcbb000     0x7f5fbbe4c000   0x191000   0x7dc000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
      0x7f5fbbe4c000     0x7f5fbbe8a000    0x3e000          0        
      0x7f5fbbe8a000     0x7f5fbbe91000     0x7000          0                     /usr/java/jdk1.6.0_21/jre/lib/amd64/jli/libjli.so
      0x7f5fbbe91000     0x7f5fbbf92000   0x101000     0x7000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/jli/libjli.so
      0x7f5fbbf92000     0x7f5fbbf94000     0x2000     0x8000                     /usr/java/jdk1.6.0_21/jre/lib/amd64/jli/libjli.so
      0x7f5fbbf94000     0x7f5fbbf95000     0x1000          0        
      0x7f5fbbf95000     0x7f5fbbf9d000     0x8000          0                      /tmp/hsperfdata_todd/26939
      0x7f5fbbf9d000     0x7f5fbbf9e000     0x1000          0        
      0x7f5fbbf9e000     0x7f5fbbf9f000     0x1000          0        
      0x7f5fbbf9f000     0x7f5fbbfa0000     0x1000          0        
      0x7fff90992000     0x7fff909a8000    0x16000          0                           [stack]
      0x7fff909ff000     0x7fff90a00000     0x1000          0                           [vdso]
  0xffffffffff600000 0xffffffffff601000     0x1000          0                   [vsyscall]
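
One way to read the register dump against the memory map is to look up which mapping each interesting address falls into. The sketch below does this for rdi and for rbx + 0x6a0, the target of the store at pthread_create+1680. It assumes rdi and rsi still hold the arguments of the mprotect call that precedes that instruction. The mapping list is a hand-copied excerpt of the output above, and reading the two regions as a guard page and a thread stack is an inference, not something confirmed in this report.

```python
from typing import List, Optional, Tuple

# (start, end) pairs copied by hand from the "info proc mappings" output.
MAPPINGS: List[Tuple[int, int]] = [
    (0x7f5fb40a2000, 0x7f5fb8000000),
    (0x7f5fb8600000, 0x7f5fb8601000),
    (0x7f5fb8601000, 0x7f5fb8701000),
    (0x7f5fb8701000, 0x7f5fb8704000),
]

def find_mapping(addr: int,
                 mappings: List[Tuple[int, int]]) -> Optional[Tuple[int, int]]:
    """Return the (start, end) range containing addr, or None."""
    for start, end in mappings:
        if start <= addr < end:
            return (start, end)
    return None

RDI = 0x7f5fb8600000       # assumed: first argument of the mprotect call
RBX = 0x7f5fb8700700
STORE_TARGET = RBX + 0x6a0  # destination of "mov %r10,0x6a0(%rbx)"

for name, addr in [("rdi", RDI), ("rbx+0x6a0", STORE_TARGET)]:
    hit = find_mapping(addr, MAPPINGS)
    if hit is None:
        print(f"{name} = {addr:#x}: not in the excerpt")
    else:
        print(f"{name} = {addr:#x}: in {hit[0]:#x}-{hit[1]:#x} "
              f"(size {hit[1] - hit[0]:#x})")
```

With these inputs, rdi is the start of the 0x1000-byte region at 0x7f5fb8600000, and rbx + 0x6a0 falls in the adjacent 0x100000-byte region, whose size matches the value in r12.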
Comment 7 Andreas Schwab 2011-11-07 10:46:28 EST
strace?
Comment 8 Todd Lipcon 2011-11-07 11:04:43 EST
Looking for an strace after the process is hung, or before it hangs? Given the low percentage of startups during which it hangs, it would be very difficult for me to get the latter, but I can get the former for you if it's useful.
Comment 9 Andreas Schwab 2011-11-07 11:08:14 EST
Anything is useful.
Comment 10 RHEL Product and Program Management 2011-11-11 01:47:12 EST
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.
Comment 11 Todd Lipcon 2011-11-11 18:17:55 EST
After it's hung, strace on the hung thread doesn't give anything:

[todd@p0117 ~]$ strace -p 26942
Process 26942 attached - interrupt to quit
<hangs>

Let me see if I can write a simpler reproducer. We are also going to try to reproduce this on a different JVM version to see if it has something to do with the specific combo of Java 1.6.0u21 with RHEL 6.0.
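
When strace on a busy thread prints nothing, one cheap follow-up is to check whether the thread is accumulating user-mode or kernel-mode CPU time. The sketch below reads utime and stime from /proc/<pid>/task/<tid>/stat, assuming the field layout documented in proc(5). The function names are illustrative, and this check was not part of the original investigation.

```python
import os
import time
from typing import Tuple

def parse_stat(text: str) -> Tuple[str, int, int]:
    """Return (state, utime_ticks, stime_ticks) from a /proc stat line."""
    # Field 2 (comm) is parenthesised and may contain spaces or ')',
    # so split only what follows the last ')'.
    rest = text[text.rindex(")") + 2:].split()
    # rest[0] is field 3 (state); utime is field 14, stime is field 15.
    return rest[0], int(rest[11]), int(rest[12])

def sample_cpu(pid: int, tid: int, interval: float = 1.0) -> Tuple[str, float, float]:
    """Return (state, user_seconds, system_seconds) consumed during interval."""
    path = f"/proc/{pid}/task/{tid}/stat"

    def read() -> Tuple[str, int, int]:
        with open(path) as f:
            return parse_stat(f.read())

    _, u0, s0 = read()
    time.sleep(interval)
    state, u1, s1 = read()
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    return state, (u1 - u0) / hz, (s1 - s0) / hz
```

A thread spinning in user space should show utime growing, while one that keeps re-entering the kernel (for example on a repeated fault) should show stime growing instead.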
Comment 12 Todd Lipcon 2011-11-14 13:31:14 EST
Created attachment 533599
strace log of hung process

Here's strace output from a hung JVM launch. This time it was with 64-bit JDK 6u23. The spinning thread was pid 18207.
Comment 13 Todd Lipcon 2011-11-14 13:54:29 EST
Created attachment 533602
another strace

Another strace from a different node just in case it's helpful: 6661 is the hung process here.
Comment 14 Jeff Law 2011-12-16 00:32:18 EST
Is there any way you can get us a method for reproducing this failure?  I've got several pthread/mutex related fixes that are queued up for RHEL.   It'd be good to be able to test them against this problem.
Comment 15 Todd Lipcon 2011-12-16 00:40:08 EST
Unfortunately not - I tried to write a standalone reproducer but couldn't get one to work. We can reproduce it easily, though, so if you have a candidate glibc RPM we can probably deploy it and give it a shot on a large QA cluster.
Comment 16 Jeff Law 2012-02-07 15:54:28 EST
Todd,
I just realized you reported this against a Red Hat Enterprise Linux 6.0 glibc.  It's probably worth testing whether some of the fixes, particularly those which went into Red Hat Enterprise Linux 6.2, help.  It's also possible this is a kernel issue of some kind.  Hard to really know right now.  We'd kind of be just shooting in the dark hoping to find something.

The backtrace & straces don't really show any significant commonality.   I'm assuming you've got the right spinning PID, though the straces would tend to indicate the PID should be sleeping in the kernel, not spinning.
The backtrace doesn't make a lot of sense for a process that is spinning; the backtrace looks more like the process is sitting in the kernel.  I guess perhaps it could be spinning in the kernel.

I'll note that we don't necessarily need a standalone reproducer, though it obviously helps.  Even a reproducer on top of the JVM would be a step forward.   Even if it just triggers 1 in 100 times, I can reserve a machine in our farm and just have it keep restarting the jvm until it hangs.
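
The "keep restarting until it hangs" loop could look roughly like the following. This is a sketch under assumptions: the command line and timeout are placeholders, and it treats any startup that has not exited within the timeout as hung. The suspect process is deliberately left alive so that gdb can be attached to it.

```python
import subprocess
from typing import List, Optional

def run_until_hang(cmd: List[str], timeout_s: float,
                   max_runs: int) -> Optional[subprocess.Popen]:
    """Launch cmd repeatedly; return the first process that outlives timeout_s."""
    for _ in range(max_runs):
        proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Not killed on purpose: the caller can inspect proc.pid.
            return proc
    return None

# Hypothetical usage ("java -version" stands in for a real JVM startup):
#   hung = run_until_hang(["java", "-version"], timeout_s=60.0, max_runs=10000)
#   print("no hang observed" if hung is None else f"possible hang, pid {hung.pid}")
```

The timeout has to be comfortably longer than a normal startup, otherwise slow but healthy launches are reported as hangs.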
Comment 17 Todd Lipcon 2012-02-07 17:01:43 EST
Definitely got the right pid - I saw this on a bunch of machines, so unless I'm systematically typoing, it's right ;-)

I'm trying to put together a reproducer tarball which would include Hadoop but could run on one machine. But before I send it I want to make sure I can still repro on one of our boxes here that originally exhibited the problem. We've since upgraded to 2.6.32-71.14.1.el6.x86_64 so if it's a kernel bug it may be that we can't reproduce it anymore.
Comment 18 Jeff Law 2012-02-07 17:07:54 EST
Sounds good.  Anxiously waiting...
Comment 19 Todd Lipcon 2012-02-09 13:46:03 EST
Well, in the process of putting together the repro for you, I actually determined this is in fact a kernel bug. After running the repro overnight on 2.6.32-71.el6 I had about 8 or so processes in the hung spinning state. When I tried again on 2.6.32-71.14.1.el6, the bug didn't reproduce.

So, I suppose this can be resolved as fixed somewhere between those two kernel versions. Let me know if you need anything else from me, thanks for the help.
Comment 20 Todd Lipcon 2012-02-09 13:47:10 EST
(I'll also leave the repro job running on the new kernel for a few more days just to be entirely sure)
Comment 21 Jeff Law 2012-02-09 15:52:01 EST
I did a quick scan over the kernel changes in that range to see if any stood out as obvious candidates.  In 2.6.32-71.11.1.el6 there's a rework of posix-cpu-timers related to multithreaded exec.  The rest of the changes don't appear to be likely candidates.
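
The reasoning here rests on ordering kernel release strings: the candidate change has to fall after the last bad kernel and no later than the first good one. A small sketch of that comparison, assuming the releases share the 2.6.32 base and differ only in the numeric components of the release field (a simplification of real RPM version comparison); `release_key` is an illustrative name:

```python
from typing import Tuple

def release_key(kernel: str) -> Tuple[int, ...]:
    """'2.6.32-71.14.1.el6' -> (71, 14, 1); '2.6.32-71.el6' -> (71,)."""
    release = kernel.split("-", 1)[1]
    # Keep only purely numeric components, dropping 'el6', 'x86_64', etc.
    return tuple(int(p) for p in release.split(".") if p.isdigit())

BAD = "2.6.32-71.el6"              # hang reproduced
GOOD = "2.6.32-71.14.1.el6"        # hang did not reproduce
CANDIDATE = "2.6.32-71.11.1.el6"   # contains the posix-cpu-timers rework

in_window = release_key(BAD) < release_key(CANDIDATE) <= release_key(GOOD)
print(f"candidate inside the bad..good window: {in_window}")
```

Being inside the window only makes the change a plausible suspect; it does not by itself show that it is the fix.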

I'm going to go ahead and close; if your reproducer trips, we can reopen and dive further in.  I'm going to use ERRATA as the resolution since it appears this was fixed in a kernel errata.

Thanks for your time,
Jeff
Comment 22 Todd Lipcon 2012-02-09 16:52:02 EST
Yep, I thought the same thing when I skimmed the changelog yesterday. Will keep you posted, thanks!
Comment 23 Todd Lipcon 2012-02-12 22:00:25 EST
Several days later, still no hung tasks; the kernel upgrade definitely fixed it.
