The following has been reported by IBM LTC:
kernel hang due to pstack operations

This is a carryover of information from the initial bugzilla #4943 - RH 107305; I am closing the original bug as pstack is now seeing threads, which is what the original was written against.

Hardware Environment:
Software Environment:

Steps to Reproduce:
1. export LD_ASSUME_KERNEL=2.4.19
2. start a multi-threaded program
3. run pstack against the program's pid

Actual Results: only see stack for one thread of app
Expected Results: should see stacks for all threads of the app

Additional Information: gdb works fine in LinuxThreads mode, but pstack does not. No errors are reported; only one stack is shown.

------- Additional Comment #1 From Kenneth E. Brunsen 2003-10-14 22:42 -------
More Information:

Description of pStack: pstack is needed for collecting debug information (call stacks) when the Domino server fails at a customer site. This information is used to diagnose the root cause of the failure in order to provide a fix for the customer problem.

Platforms: We need a binary version of pStack for all platforms supported by Domino (currently Intel, zSeries).

Installation: We would like pStack to be installed by default.

We use pstack for first failure data collection. The Domino product is a collection of multi-threaded programs all running together, sharing resources via shared memory. When we have a failure, one program may fail even though another program caused the failure. In order to aid us in debugging these crashes, we have an automated test tool "nsd" which uses pstack on Linux to obtain the stacks of all threads in all programs. No other tool gives us this ability. Without this feature, remote debugging (i.e., at customer sites) would be next to impossible except in the simplest of failures, which we usually catch in-house. It is likely that other groups will be able to make use of pstack in the same way we do once it is operational.
------- Additional Comment #2 From Khoa D. Huynh 2003-10-16 01:20 -------
Glen/Greg - this is a pstack problem. Please submit this to Red Hat. Thanks.

------- Additional Comment #3 From Glen Johnson 2003-10-16 14:08 -------
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=107305

------- Additional Comment #4 From Glen Johnson 2003-11-10 10:07 -------
------ Additional Comments From roland 2003-11-08 19:39 -------
pstack needs a few updates to match linuxthreads changes. I will work on these.

------- Additional Comment #5 From Glen Johnson 2003-11-11 17:32 -------
------ Additional Comments From roland 2003-11-11 17:24 -------
*** Bug 106656 has been marked as a duplicate of this bug. ***

------- Additional Comment #6 From Kenneth E. Brunsen 2003-11-21 12:30 -------
Got the link and downloaded the test pstack from RH and it was not good. Running this pstack, as soon as it hits a java thread within a Domino process it hangs the entire system. Since it's a hang, there is no netdump, but everything is hung, including the console of the system.

------- Additional Comment #7 From Khoa D. Huynh 2003-11-24 23:16 -------
------- Additional Comment #9 From Roland McGrath on 2003-11-21 15:55 -------
If the system is hung that is by definition a kernel issue. We may be able to learn more if a) it's possible to use Alt-SysRq on the console when wedged, the t command will show the nature of the wedge, and the c command will induce a netdump. Also, you might be able to boot with nmi_watchdog=1 and get a netdump from the wedge that way (depending on the nature of the wedge).

------- Additional Comment #8 From Khoa D. Huynh 2003-12-03 18:29 -------
Kenbo - if we can get from you a multi-threaded program that you know will cause pstack to see only one thread, then the India team can help debug it.
I know the India team can write a multi-threaded program, but it would be great if you can provide us a program and instructions to make sure that we don't waste time trying to recreate it. Thanks.

------- Additional Comment #9 From Sachin P. Sant 2003-12-04 00:23 -------
Kenneth, I tried using a small multi-threaded program to recreate the problem. When I ran pstack against the pid of the running program, I was able to see the stacks for all the threads. Here is the test program I used:

----------------------- Start Test Program ------------------------------
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <assert.h>

static pthread_mutex_t gbl_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gbl_condv = PTHREAD_COND_INITIALIZER;

/* cancellation cleanup handler: release the mutex held by the waiter */
void waitThread_cleanup(void *arg)
{
    int rc;

    rc = pthread_mutex_unlock(&gbl_mutex);
    assert(rc == 0);
    return;
}

void *
waitThread(void *arg)
{
    int rc;

    pthread_cleanup_push(waitThread_cleanup, NULL);
    rc = pthread_mutex_lock(&gbl_mutex);
    assert(rc == 0);

    /* wait until this thread is canceled */
    while (1 == 1) {
        rc = pthread_cond_wait(&gbl_condv, &gbl_mutex);
        assert(rc == 0);
    }

    /* this routine never reaches this point */
    rc = pthread_mutex_unlock(&gbl_mutex);
    assert(rc == 0);
    pthread_cleanup_pop(0);
    return NULL;
}

int main (int argc, char *argv[])
{
    int i, rc;
    pthread_t wait_tid;

    /* repeatedly create a waiter thread, cancel it after 3 seconds, and join it */
    for (i = 0; i < 1000000; i++) {
        fprintf(stderr, "loop %d\n", i);
        rc = pthread_create(&wait_tid, NULL, waitThread, NULL);
        assert(rc == 0);
        sleep (3);
        rc = pthread_cancel(wait_tid);
        assert(rc == 0);
        rc = pthread_join(wait_tid, NULL);
        assert(rc == 0);
    }
    return 0;
}
----------------------- End Test Program ------------------------------

Here is the output of the pstack command I got. The ps -ef | grep mtcond command showed three threads running with pids 32311, 32312 and 32368. I used pstack with pid 32311.
#ps -ef | grep mtcond
root     32312 32311  0 10:06 tty3     00:00:00 ./mtcond
root     32311  4765  0 10:06 tty3     00:00:00 ./mtcond
root     32368 32312  0 10:08 tty3     00:00:00 ./mtcond
root     32370 32241  0 10:08 tty2     00:00:00 grep mtcond
#
#
#pstack 32311
32311: ./mtcond
----- Thread 32311 -----
0x400f35b1: __nanosleep + 0x11 (3, 0, 804876c, 0, 40016b4c, bfffe5f4) + 10
0x08048872: main + 0x7a (1, bfffe5f4, bfffe5fc, 804853e, 8048930, 0) + 20
0x40054657: __libc_start_main + 0x93 (80487f8, 1, bfffe5f4, 8048528, 8048930, 4000dcd4) + 40001a18
----- Thread 32312 -----
0x4011e3f7: __poll + 0x23 (804b9c4, 1, 7d0, 4003514c, 0, 3) + 150
0x40029920: __pthread_manager + 0x17c (3, 0, 4e1, 0, 0, 0) + f7fb44cc
----- Thread 32368 -----
0x40066bb5: __sigsuspend + 0x21 (409739cc, 20, 409739cc, 0, 0, 0) + 90
0x4002c1d9: __pthread_wait_for_restart_signal + 0x59 (40973be0, 80499c0, 40973aa4, 40028b46, 80499b8, 0) + 20
0x40028bdc: pthread_cond_wait + 0x118 (80499c0, 80499a8, 0, 0, 8048730, 0) + 20
0x080487d2: waitThread + 0x66 (0, 40973c84, 0, 40029bb1, 0, 0) + d0
0x40029c6f: pthread_start_thread + 0x16f (40973be0, 40973be0, 0, 0, 0, 0) + bf68c40c

Is this the expected output? Can you paste the output you got after running the pstack command? Also, when you run nm on the thread library, does it show __pthread_threads_debug as an exported symbol? I am using pstack-1.1-1 on RHAS 2.1. Maybe I will try this on a RHEL system also.

------- Additional Comment #10 From Kenneth E. Brunsen 2003-12-04 10:04 -------
In order to reproduce this bug, you need to do the following on a RHEL 3.0 system:
1. About 30 minutes from now (still uploading linux.tar), ftp to the ltc ftp server and download from /kenbo a) linux.tar (~715Mb), b) the pstack RPM we got from RH, and c) nsd.sh.pstk2
2. unpack linux.tar and use the contents (Pre-Release 6.51 daily non production build) to install/setup the Domino server
3. cd to notes exec directory (/opt/lotus/notes/latest/linux, for example), rename nsd.sh to nsd.sh.org and copy nsd.sh.pstk2 to nsd.sh (make sure execute bits are set)
4. install the pstack rpm.
5. in a window, start up Domino from the data directory
6. once server is up and running, in another window run /opt/lotus/bin/nsd.

During the run, when pstack hits a program which has the JVM running within it (such as http, or amgr if running java agents), the OS will hang. If it does not hang, let me know, because it may be that either 1) the JVM is not running within http or amgr at the time, or 2) the nsd is not firing correctly to use the local pstack.

------- Additional Comment #11 From Khoa D. Huynh 2003-12-08 10:26 -------
Sachin - is there any update from your team on this? Have you been able to recreate the problem on your end? Thanks.

------- Additional Comment #12 From Sachin P. Sant 2003-12-08 22:42 -------
Khoa / Kenneth, sorry for not updating the bug. I was able to download the testcase from the LTC ftp site. Thanks to the Lotus team for that. I was able to install the Domino server on a RHEL 3 system, but I had some problems getting the Domino server up and running. It turned out that the compat-libstdc++ rpm was not installed on the system, and Domino needs this rpm to be installed. Today I will try to recreate the problem and will update the bug report. Thanks

------- Additional Comment #13 From Sachin P. Sant 2003-12-09 00:55 -------
OK, finally the setup is up and running, and I was able to recreate the problem. When I used the default pstack which comes with RHEL 3 [ version 1.1-1 ] I could see the stack for only one thread, even though it was a multi-threaded program. Then I upgraded pstack to the one RH had supplied [ version 1.2-3.EL.1 ]. Using this pstack I could see stack information for all the threads, but with some threads it hangs the machine. I cannot use the keyboard or mouse, but I can ping the machine. So it seems that it is not a total hang.
After playing around with pstack I found that if I try to display information about processes like amgr, sched, ldap, calconn, replica, router and adminp, I get stack information for all the threads. The problem is only with the event and server processes. If I try to display stack info for these two processes, it hangs the machine. So it seems pstack has some problem with these processes. I will go through the source of pstack to see if I can find any clues. Thanks

------- Additional Comment #14 From Glen Johnson 2003-12-09 20:59 -------
------ Additional Comments From roland 2003-12-09 20:51 -------
There is no point in examining pstack to understand why the entire system hangs. That is by definition a kernel issue. Please concentrate on ascertaining exactly what system calls pstack makes that lead to the wedge. For example, use strace on pstack with the output coming to a terminal you can capture to get all possible info up to the time of the wedge. This problem must be filed in RH bugzilla against the kernel, though it may be a known issue if you are using the RHEL3-GA kernel.

After some more debugging this is what I found out. In my scenario the hang occurred while executing the ptrace function call at line 555 in pstack.c. So far this is as far as I could get, because every time the system hangs it becomes very difficult to even log the information. [ I tried running the strace command on pstack, but after a hard reboot there was nothing in the output file ] Since the same pstack code works on RHAS 2.1, I compared the kernel code for ptrace [ kernel/ptrace.c and arch/i386/kernel/ptrace.c ]. I found that there were a lot of changes in the function access_process_vm( ), which is called via the case PTRACE_PEEKDATA. I will concentrate on these two files and continue my debugging.

Glen/Greg - please submit this to Red Hat. By the way, since we already closed Bug 4943, you can now close RH Bug 107305 (if you have not done so already).
This should be treated as a new bug to address the kernel hang during pstack usage. Thanks. Sachin - please continue to debug this while we submit this to Red Hat. Thanks.
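For readers unfamiliar with the call discussed above, the following is a minimal sketch of how a stack-dumping tool might read one word from another process with PTRACE_PEEKDATA, the request that reaches access_process_vm( ) in the kernel. It is not the code from pstack.c, which is not shown in this thread; the names peek_word, peek_child_demo and demo_marker are hypothetical, and the sketch assumes a Linux system where a process is allowed to ptrace its own child.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: attach to pid, read one word at addr, detach.
   Returns 0 on success and -1 on any failure. */
int peek_word(pid_t pid, void *addr, long *out)
{
    int status;
    long word;

    assert(out != NULL);
    if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
        return -1;
    /* Attaching sends SIGSTOP; wait until the target has really stopped. */
    if (waitpid(pid, &status, 0) == -1 || !WIFSTOPPED(status)) {
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
        return -1;
    }
    /* PEEKDATA returns the word itself, so errno is the only error signal. */
    errno = 0;
    word = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
    if (word == -1 && errno != 0) {
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
        return -1;
    }
    *out = word;
    ptrace(PTRACE_DETACH, pid, NULL, NULL);
    return 0;
}

/* Hypothetical value the demo looks for in the child's memory. */
static long demo_marker = 0x1234abcdL;

/* Forks a child that just sleeps, then reads demo_marker out of the
   child's address space. Returns 0 if the value read matches. */
int peek_child_demo(void)
{
    long got = 0;
    int rc;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        for (;;)
            pause();
    }
    rc = peek_word(child, &demo_marker, &got);
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    if (rc != 0)
        return -1;
    return got == demo_marker ? 0 : -1;
}
```

Calling peek_child_demo() from a small main( ) exercises the same attach, peek, detach sequence on a harmless child process. On the kernels discussed in this bug, it is the peek step that would be expected to enter the code path under suspicion.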
------ Additional Comments From ssant.com 2003-12-11 06:48 -------
After some more debugging I was able to get to the problem. The ptrace( ) call code flow finally ends up in the get_user_pages( ) function in mm/memory.c. Inside this function there is a while loop which runs on the following condition:

while (!(map = follow_page(mm, start, write))) {
        spin_unlock(&mm->page_table_lock);
        switch (handle_mm_fault(mm, vma, start, write)) {
        case 1:
                tsk->min_flt++;
                break;
        case 2:
                tsk->maj_flt++;
                break;
        case 0:
                if (i) return i;
                return -EFAULT;
        default:
                if (i) return i;
                return -ENOMEM;
        }
        spin_lock(&mm->page_table_lock);
}

During the failure / hang the while condition is satisfied. Once in the while loop, case 1 is taken. When control breaks out of case 1, the while condition is satisfied again. As a result it spins inside the while loop forever and is never able to get out.

After going through the follow_page function I found that the problem is with the following piece of code:

#ifdef __i386__
...
...
#endif

I have created a test patch which resolves the problem. The patch basically comments out the above piece of code. I came to this conclusion after comparing this code with the 2.6 and latest 2.4.23 code bases. I also compared this code with RHAS 2.1 and found that this particular piece of code is not there either.

Please apply this test patch using patch -p0 < bug5608.patch and let me know the results. Also let me know your comments about this patch.

Thanks
-Sachin
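The spin described in the comment above can be illustrated with a small user-space model. This is only a sketch of the control flow, not kernel code: model_follow_page, model_handle_mm_fault and model_get_user_page are hypothetical stand-ins, the "broken" flag is an assumed way to represent the case where the page lookup keeps refusing the page even after the fault has been handled, and max_spins exists only so that the model terminates.

```c
#include <assert.h>

/* Hypothetical stand-in for follow_page(): reports the page as mapped
   once at least one fault has been handled, unless "broken" is set, in
   which case it keeps refusing the page no matter what. */
static int model_follow_page(int faults_handled, int broken)
{
    if (broken)
        return 0;
    return faults_handled > 0;
}

/* Hypothetical stand-in for handle_mm_fault(): always reports that a
   minor fault was handled (the "case 1" path in the quoted loop). */
static int model_handle_mm_fault(void)
{
    return 1;
}

/* Models the quoted while loop. Returns the number of minor faults taken
   before the page was found, or -1 if the loop was still spinning after
   max_spins iterations (the real loop has no such limit). */
int model_get_user_page(int broken, int max_spins)
{
    int min_flt = 0;

    assert(max_spins > 0);
    while (!model_follow_page(min_flt, broken)) {
        if (min_flt >= max_spins)
            return -1;          /* would spin forever in the kernel */
        switch (model_handle_mm_fault()) {
        case 1:
            min_flt++;          /* mirrors tsk->min_flt++ */
            break;
        default:
            return -2;          /* not reached in this model */
        }
    }
    return min_flt;
}
```

In the healthy case, model_get_user_page(0, 1000) leaves the loop after a single fault; with the lookup permanently failing, model_get_user_page(1, 1000) only stops because of the artificial limit, which matches the behaviour reported in the comment.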
Created attachment 96467 bug5608.patch
------ Additional Comments From ssant.com 2003-12-11 06:50 -------
Test patch for kernel hang problem. Steps to apply the patch:
cd /usr/src/linux/mm
patch -p0 < bug5608.patch
to IBM - this was a carry over from the pstack issue. Was this run on the base GA RHEL 3 code or the update 1 beta ? I suspect on the GA, please retest with the beta code you now have access to.
------ Additional Comments From ssant.com 2003-12-18 23:28 ------- I checked the kernel Source for RHEL 3 Update 1 beta [ 2.4.21-6.EL ] and could see that there is no change as far as the code in question [ mm/memory.c - follow_page() and get_user_pages() ]. I believe the problem will still be there with this level of kernel.
We need to hear about actual empirical test results, not what IBM's developers suspect from reading sources. The patch IBM sent was not a correct fix, and that change is not what we did in the U1 kernels. Just test the kernels delivered and report real results for good or ill.
------ Additional Comments From ssant.com 2003-12-22 03:56 ------- I installed the RHEL-3.0 Update 1 Beta on the test machine. Started the domino server and used pstack on the server threads. I was able to see stack information of all the threads and the machine did not hang. RHEL 3 Update 1 Beta release fixes the kernel hang due to pstack operations. Thanks
----- Additional Comments From jgarvey.com 2004-02-20 11:42 ------- *** Bug 6071 has been marked as a duplicate of this bug. ***