Bug 111840 - [PATCH] LTC5608-kernel hang due to pstack operations
Summary: [PATCH] LTC5608-kernel hang due to pstack operations
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Ingo Molnar
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2003-12-10 19:40 UTC by IBM Bug Proxy
Modified: 2007-11-30 22:06 UTC
4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-02-20 22:51:31 UTC
Target Upstream Version:
Embargoed:


Attachments
bug5608.patch (522 bytes, text/plain)
2003-12-11 15:20 UTC, IBM Bug Proxy

Description IBM Bug Proxy 2003-12-10 19:40:01 UTC
The following has been reported by IBM LTC:
kernel hang due to pstack operations
This is a carryover of information from the initial bugzilla #4943 - RH
107305; I am closing the original bug as pstack is now seeing threads,
which is what the original was written against.

Hardware Environment:

Software Environment:


Steps to Reproduce:
1. export LD_ASSUME_KERNEL=2.4.19
2. start a multi-threaded program
3. run pstack against the program's pid

Actual Results:
only see stack for one thread of app

Expected Results:
should see stacks for all threads of the app

Additional Information:
gdb works fine in LinuxThreads mode, but pstack does not.  No errors are
reported; only one stack is shown.

------- Additional Comment #1 From Kenneth E. Brunsen 2003-10-14 22:42 -------
More Information:

Description of pstack:
pstack is needed for collecting debug information (call stacks) when the
Domino server fails at a customer site. This information is used to
diagnose the root cause of the failure in order to provide a fix for the
customer's problem.

Platforms:
We need a binary version of pstack for all platforms supported by Domino
(currently Intel, zSeries).

Installation:
We would like pstack to be installed by default.

We use pstack for first-failure data collection. The Domino product is a
collection of multi-threaded programs all running together, sharing
resources via shared memory.

When we have a failure, it can be such that one program fails but another
program has caused the failure.

In order to aid us in debugging these crashes, we have an automated test
tool, "nsd", which uses pstack on Linux to obtain the stacks of all
threads in all programs (a rough sketch of this follows at the end of
this comment).

No other tool gives us this ability. Without this feature, remote
debugging (i.e., at customer sites) would be next to impossible except in
the simplest of failures, which we usually catch in-house.

It is likely that other groups will be able to make use of pstack in the
same way we do once it is operational.
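As a rough illustration of this kind of first-failure data collection
(a hedged sketch, not nsd itself; the default target name "server" and
the plain "pstack" invocation are assumptions), a small C program can
scan /proc for every pid whose command name matches a process of
interest and run pstack against each one. Under LinuxThreads, each
thread shows up as its own pid, so this also covers the threads:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    const char *target = (argc > 1) ? argv[1] : "server";
    DIR *proc = opendir("/proc");
    struct dirent *ent;
    char path[64], comm[256], cmd[64];
    FILE *f;

    if (!proc) { perror("opendir /proc"); return 1; }

    while ((ent = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)ent->d_name[0]))
            continue;                   /* skip non-pid entries */
        snprintf(path, sizeof path, "/proc/%s/stat", ent->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;                   /* process may have exited */
        /* field 2 of /proc/pid/stat is the command name in parens */
        if (fscanf(f, "%*d (%255[^)])", comm) == 1 &&
            strcmp(comm, target) == 0) {
            snprintf(cmd, sizeof cmd, "pstack %s", ent->d_name);
            printf("===== %s =====\n", cmd);
            system(cmd);                /* one stack dump per pid */
        }
        fclose(f);
    }
    closedir(proc);
    return 0;
}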

 

------- Additional Comment #2 From Khoa D. Huynh 2003-10-16 01:20 ------- 
Glen/Greg - this is a pstack problem.  Please submit this to Red Hat.
Thanks.

------- Additional Comment #3 From Glen Johnson 2003-10-16 14:08 ------- 
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=107305

------- Additional Comment #4 From Glen Johnson 2003-11-10 10:07 ------- 
------ Additional Comments From roland 2003-11-08 19:39 -------
pstack needs a few updates to match linuxthreads changes.
I will work on these. 

------- Additional Comment #5 From Glen Johnson 2003-11-11 17:32 ------- 
------ Additional Comments From roland 2003-11-11 17:24 -------
*** Bug 106656 has been marked as a duplicate of this bug. *** 

------- Additional Comment #6 From Kenneth E. Brunsen 2003-11-21 12:30 -------
Got the link and downloaded the test pstack from RH, and the results were
not good. Running this pstack, as soon as it hits a java thread within a
Domino process it hangs the entire system. Since it's a hang, there is no
netdump, but everything is hung, including the console of the system.

------- Additional Comment #7 From Khoa D. Huynh 2003-11-24 23:16 -------
------- Additional Comment #9 From Roland McGrath on 2003-11-21 15:55 -------

If the system is hung, that is by definition a kernel issue.
We may be able to learn more if it's possible to use Alt-SysRq on the
console when wedged: the t command will show the nature of the wedge, and
the c command will induce a netdump. Also, you might be able to boot with
nmi_watchdog=1 and get a netdump from the wedge that way (depending on
the nature of the wedge).
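As a hedged aside (an assumption about the setup, not something stated in
this report): on a box that still answers the network while the console
keyboard is wedged, the same SysRq actions can sometimes be fired from a
remote root shell by writing to /proc/sysrq-trigger (requires
CONFIG_MAGIC_SYSRQ; on a fully hung kernel even this may not run):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* 't' dumps every task's state to the console/dmesg, showing where
     * each thread is wedged; writing 'c' instead would force a crash
     * and hence a netdump, as suggested above. */
    int fd = open("/proc/sysrq-trigger", O_WRONLY);
    if (fd < 0) { perror("open /proc/sysrq-trigger"); return 1; }
    if (write(fd, "t", 1) != 1)
        perror("write");
    close(fd);
    return 0;
}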



------- Additional Comment #8 From Khoa D. Huynh 2003-12-03 18:29 ------- 
Kenbo - if we can get from you a multi-threaded program that you know will
cause pstack to see only one thread, then the India team can help debug it.
I know the India team can write a multi-threaded program, but it would be
great if you can provide us a program and instructions to make sure that
we don't waste time trying to recreate it.  Thanks.

------- Additional Comment #9 From Sachin P. Sant 2003-12-04 00:23 -------
Kenneth, I tried using a small multi-threaded program to recreate the
problem. When I ran pstack against the pid of the running program, I was
able to see stacks for all the threads. Here is the test program I used:

----------------------- Start Test Program ------------------------------

#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <assert.h>

static pthread_mutex_t gbl_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gbl_condv = PTHREAD_COND_INITIALIZER;

void waitThread_cleanup(void *arg)
{
  int rc;

  rc = pthread_mutex_unlock(&gbl_mutex);
  assert(rc == 0);

  return;
}

void * waitThread(void *arg)
{
  int rc;

  pthread_cleanup_push(waitThread_cleanup, NULL);

  rc = pthread_mutex_lock(&gbl_mutex);
  assert(rc == 0);

  /* wait until this thread is canceled */

  while (1 == 1) {
    rc = pthread_cond_wait(&gbl_condv, &gbl_mutex);
    assert(rc == 0);
  }

  /* this routine never reaches this point */

  rc = pthread_mutex_unlock(&gbl_mutex);
  assert(rc == 0);

  pthread_cleanup_pop(0);

  return NULL;
}

int main(int argc, char *argv[])
{
  int i, rc;
  pthread_t wait_tid;

  for (i = 0; i < 1000000; i++)
    {
      fprintf(stderr, "loop %d\n", i);

      rc = pthread_create(&wait_tid, NULL, waitThread, NULL);
      assert(rc == 0);
      sleep (3);
      rc = pthread_cancel(wait_tid);
      assert(rc == 0);

      rc = pthread_join(wait_tid, NULL);
      assert(rc == 0);
    }

  return 0;
}

----------------------- End Test Program ------------------------------
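(Assuming the source above is saved as mtcond.c, it can be built with:
cc -o mtcond mtcond.c -lpthread)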

Here is the output of the pstack command I got. The ps -ef | grep mtcond
command showed three threads running with pids 32311, 32312 and 32368. I
used pstack with pid 32311.

#ps -ef | grep mtcond
root     32312 32311  0 10:06 tty3     00:00:00 ./mtcond
root     32311  4765  0 10:06 tty3     00:00:00 ./mtcond
root     32368 32312  0 10:08 tty3     00:00:00 ./mtcond
root     32370 32241  0 10:08 tty2     00:00:00 grep mtcond
#
#
#pstack 32311
32311: ./mtcond
----- Thread 32311 -----
0x400f35b1: __nanosleep + 0x11 (3, 0, 804876c, 0, 40016b4c, bfffe5f4) + 10
0x08048872: main + 0x7a (1, bfffe5f4, bfffe5fc, 804853e, 8048930, 0) + 20
0x40054657: __libc_start_main + 0x93 (80487f8, 1, bfffe5f4, 8048528, 8048930, 4000dcd4) + 40001a18
----- Thread 32312 -----
0x4011e3f7: __poll + 0x23 (804b9c4, 1, 7d0, 4003514c, 0, 3) + 150
0x40029920: __pthread_manager + 0x17c (3, 0, 4e1, 0, 0, 0) + f7fb44cc
----- Thread 32368 -----
0x40066bb5: __sigsuspend + 0x21 (409739cc, 20, 409739cc, 0, 0, 0) + 90
0x4002c1d9: __pthread_wait_for_restart_signal + 0x59 (40973be0, 80499c0, 40973aa4, 40028b46, 80499b8, 0) + 20
0x40028bdc: pthread_cond_wait + 0x118 (80499c0, 80499a8, 0, 0, 8048730, 0) + 20
0x080487d2: waitThread + 0x66 (0, 40973c84, 0, 40029bb1, 0, 0) + d0
0x40029c6f: pthread_start_thread + 0x16f (40973be0, 40973be0, 0, 0, 0, 0) + bf68c40c

Is this the expected output? Can you paste the output you got after
running the pstack command? Also, when you do nm on the thread library,
does it show __pthread_threads_debug as an exported symbol? (A small
check program for this is sketched below.)

I am using pstack-1.1-1 on RHAS 2.1. Maybe I will try this on a RHEL
system as well.
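For reference, the nm check asked about above can also be done
programmatically. Here is a minimal sketch (assuming the LinuxThreads
library is loaded as libpthread.so.0; adjust the name for whatever
library the application actually uses; build with cc check.c -ldl) that
asks the dynamic loader for the symbol:

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Open the thread library and look up the LinuxThreads symbol
     * asked about above. */
    void *lib = dlopen("libpthread.so.0", RTLD_LAZY);
    if (!lib) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }
    void *sym = dlsym(lib, "__pthread_threads_debug");
    printf("__pthread_threads_debug: %s\n",
           sym ? "exported" : "not exported");
    dlclose(lib);
    return 0;
}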

------- Additional Comment #10 From Kenneth E. Brunsen 2003-12-04 10:04 -------
 
In order to reproduce this bug, you need to do the following on a RHEL 3.0
system:

  1. About 30 minutes from now (still uploading linux.tar), ftp to the ltc
ftp server and download from /kenbo a) linux.tar (~715Mb), b) the pstack
RPM we got from RH, and c) nsd.sh.pstk2
  2. unpack linux.tar and use the contents (Pre-Release 6.51 daily
non-production build) to install/set up the Domino server
  3. cd to the notes exec directory (/opt/lotus/notes/latest/linux, for
example), rename nsd.sh to nsd.sh.org and copy nsd.sh.pstk2 to nsd.sh
(make sure execute bits are set)
  4. install the pstack rpm
  5. in a window, start up Domino from the data directory
  6. once the server is up and running, in another window run
/opt/lotus/bin/nsd. During the run, when pstack hits a program which has
the JVM running within it (such as http or amgr if running java agents),
the OS will hang.

If it does not hang, let me know, because it may be that either 1) the JVM
is not running within http or amgr at the time, or 2) nsd is not firing
correctly to use the local pstack.




------- Additional Comment #11 From Khoa D. Huynh 2003-12-08 10:26 -------
Sachin - is there any update from your team on this? Have you been able
to recreate the problem on your end? Thanks.

------- Additional Comment #12 From Sachin P. Sant 2003-12-08 22:42 -------
Khoa / Kenneth, sorry for not updating the bug. I was able to download the
testcase from the LTC ftp site. Thanks to the Lotus team for that. I was
able to install the Domino server on a RHEL 3 system, but I was having
some problems getting the Domino server up and running. It turned out that
the compat-libstdc++ rpm was not installed on the system, and Domino needs
this rpm to be installed. Today I will try to recreate the problem and
will update the bug report.
Thanks

------- Additional Comment #13 From Sachin P. Sant 2003-12-09 00:55 -------
Ok, finally the setup is up and running. I was able to recreate the
problem.

When I used the default pstack which comes with RHEL 3 [ version 1.1-1 ],
I could see the stack for only one thread, even though it was a
multi-threaded program.

Then I upgraded pstack to the one RH had supplied [ version 1.2-3.EL.1 ].
Using this pstack I could see stack information for all the threads. But
with some threads it hangs the machine. I cannot use the keyboard /
mouse, but I can ping the machine, so it seems it is not a total hang.

After playing around with pstack, I found that if I try to display
information about processes like amgr, sched, ldap, calconn, replica,
router, adminp, I get stack information about all the threads. The
problem is only with the event and server processes. If I try to display
stack info for these two processes, it hangs the machine. So it seems
pstack has some problem with these processes.

I will try to go through the source of pstack and see if I can find any
clues.

Thanks

------- Additional Comment #14 From Glen Johnson 2003-12-09 20:59 -------
------ Additional Comments From roland 2003-12-09 20:51 -------
There is no point in examining pstack to understand why the entire
system hangs.  That is by definition a kernel issue.  Please
concentrate on ascertaining exactly what system calls pstack makes
that lead to the wedge.  For example, use strace on pstack with the
output coming to a terminal you can capture, to get all possible info
up to the time of the wedge.  This problem must be filed in RH
bugzilla against the kernel, though it may be a known issue if you are
using the RHEL3-GA kernel.

After some more debugging, this is what I found out. In my scenario the
hang occurred while executing the ptrace function call at line 555 in
pstack.c; that is as far as I have been able to trace it. Every time the
system hangs it becomes very difficult to even log the information. [ I
tried running the strace command on pstack, but after a hard reboot
there was nothing in the output file. ]

Now, since the same pstack code works on RHAS 2.1, I compared the kernel
code for ptrace [ kernel/ptrace.c and arch/i386/kernel/ptrace.c ]. I
found that there were a lot of changes in the function
access_process_vm(), which is called via the PTRACE_PEEKDATA case (a
minimal sketch of this ptrace call pattern follows below). I will
concentrate on these two files and continue my debugging.

Glen/Greg - please submit this to Red Hat.  By the way, since we already
closed Bug 4943, you can now close RH Bug 107305 (if you have not done so
already).  This should be treated as a new bug to address the kernel hang
during pstack usage.  Thanks.

Sachin - please continue to debug this while we submit this to Red Hat.
Thanks.
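For context, here is a minimal sketch of the ptrace pattern at issue (an
illustration, not pstack's actual code; the pid and address come from the
command line): attach to a process and read one word of its memory with
PTRACE_PEEKDATA, the request whose kernel path goes through
access_process_vm():

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);
    unsigned long addr = strtoul(argv[2], NULL, 16);

    /* Attaching stops the target, as pstack does before reading stacks. */
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0) {
        perror("PTRACE_ATTACH");
        return 1;
    }
    waitpid(pid, NULL, 0);

    /* Each PEEKDATA reads one word of the target's memory; in the
     * kernel this goes through access_process_vm(), the path that
     * wedged in this bug. */
    errno = 0;
    long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
    if (word == -1 && errno)
        perror("PTRACE_PEEKDATA");
    else
        printf("%#lx: %#lx\n", addr, (unsigned long)word);

    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return 0;
}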

Comment 1 IBM Bug Proxy 2003-12-11 12:54:10 UTC
------ Additional Comments From ssant.com  2003-12-11 06:48 -------
After some more debugging I was able to get to the root of the problem.

The ptrace() call code flow finally lands in the get_user_pages() function
in mm/memory.c. Inside this function there is a while loop which runs on
the following condition:

 while (!(map = follow_page(mm, start, write))) {
         spin_unlock(&mm->page_table_lock);
         switch (handle_mm_fault(mm, vma, start, write)) {
         case 1:
                 tsk->min_flt++;
                 break;
         case 2:
                 tsk->maj_flt++;
                 break;
         case 0:
                 if (i) return i;
                 return -EFAULT;
         default:
                 if (i) return i;
                 return -ENOMEM;
         }
         spin_lock(&mm->page_table_lock);
 }

During the failure / hang, the while condition is satisfied (follow_page()
returns NULL). Once inside the while loop, case 1 is taken. When control
breaks out of case 1, the while condition is satisfied again, so it spins
inside the while loop forever and is never able to get out. After going
through the follow_page() function, I found that the problem is with the
following piece of code:

#ifdef __i386__
...
...
#endif

I have created a test patch which resolves the problem. The patch basically
comments out the above piece of code. I came to this conclusion after
comparing this code with the 2.6 and latest 2.4.23 code bases. I also
compared this code with RHAS 2.1 and found that this particular piece of
code is not there either.

Please apply this test patch using patch -p0 < bug5608.patch and let me know the 
results. Also let me know your comments about this patch.

Thanks
-Sachin 
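To make the failure mode concrete, here is a hedged user-space model of
the loop's control flow (simplified stand-ins, not kernel code): if the
fault handler keeps reporting a minor fault (case 1) while the page
lookup keeps failing, the loop never terminates:

#include <stdio.h>

static int faults_handled;

/* stand-in for follow_page(): never resolves the page, as in the bug */
static void *follow_page_model(void) { return NULL; }

/* stand-in for handle_mm_fault(): always claims a minor fault was fixed */
static int handle_mm_fault_model(void) { return 1; }

int main(void)
{
    void *map;

    while (!(map = follow_page_model())) {
        switch (handle_mm_fault_model()) {
        case 1:
            faults_handled++;   /* minor fault "resolved"... */
            break;              /* ...but the lookup fails again */
        case 2:
            break;              /* major fault */
        case 0:
            return 1;           /* -EFAULT in the kernel */
        default:
            return 2;           /* -ENOMEM in the kernel */
        }
        if (faults_handled >= 5) {  /* cut the demo short */
            printf("still spinning after %d faults (livelock)\n",
                   faults_handled);
            return 0;
        }
    }
    return 0;
}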

Comment 2 IBM Bug Proxy 2003-12-11 15:20:59 UTC
Created attachment 96467 [details]
bug5608.patch

Comment 3 IBM Bug Proxy 2003-12-11 15:21:03 UTC
------ Additional Comments From ssant.com  2003-12-11 06:50 -------
 
Test patch for kernel hang problem.

Steps to apply the patch:
cd /usr/src/linux/mm
patch -p0 < bug5608.patch 

Comment 7 Bob Johnson 2003-12-17 21:40:46 UTC
To IBM - this was a carryover from the pstack issue.
Was this run on the base GA RHEL 3 code or the Update 1 beta?
I suspect the GA; please retest with the beta code you now have access to.



Comment 8 IBM Bug Proxy 2003-12-19 22:19:23 UTC
------ Additional Comments From ssant.com  2003-12-18 23:28 -------
I checked the kernel source for RHEL 3 Update 1 beta [ 2.4.21-6.EL ] and
could see that there is no change as far as the code in question
[ mm/memory.c - follow_page() and get_user_pages() ] is concerned. I
believe the problem will still be there with this level of kernel.

Comment 9 Roland McGrath 2003-12-19 22:27:25 UTC
We need to hear about actual empirical test results, not what IBM's
developers suspect from reading sources.  The patch IBM sent was not a
correct fix, and that change is not what we did in the U1 kernels.
Just test the kernels delivered and report real results for good or ill.

Comment 10 IBM Bug Proxy 2003-12-22 16:03:20 UTC
------ Additional Comments From ssant.com  2003-12-22 03:56 -------
I installed the RHEL 3.0 Update 1 Beta on the test machine, started the
Domino server, and used pstack on the server threads. I was able to see
stack information for all the threads, and the machine did not hang.

The RHEL 3 Update 1 Beta release fixes the kernel hang due to pstack
operations.
Thanks 

Comment 11 IBM Bug Proxy 2004-02-20 16:44:30 UTC
----- Additional Comments From jgarvey.com  2004-02-20 11:42 -------
*** Bug 6071 has been marked as a duplicate of this bug. *** 

