Description of problem:
While running a long regression test with the latest Oracle database release (11.1.0), we found that many processes hang in the T state when Oracle takes diagnostics using gdb on the RHEL 3 U8 beta kernel (2.4.21-43.ELhugemem). This results in higher memory usage, including swap, eventually leading to OOM killing of Oracle server processes. This did not happen in RHEL 3 U7 (2.4.21-40.ELhugemem).

Here is the kernel stack for a hung process:

gdb T 43300000 4 1334 1 1324 (NOTLB)
Call Trace:
[<02137f64>] get_signal_to_deliver [kernel] 0xd4 (0x432ffef0)
[<0210c6d0>] do_signal [kernel] 0x0 (0x432fff1c)
[<0210c734>] do_signal [kernel] 0x64 (0x432fff20)
[<02163b8c>] put_user_size [kernel] 0x3c (0x432fff80)
[<02123ac0>] schedule_tail [kernel] 0xa0 (0x432fff9c)

Version-Release number of selected component (if applicable):

How reproducible:
Every time an Oracle process is attached by gdb.

Steps to Reproduce:
1. Install Oracle 10g 10201 on the RHEL3 U8 beta kernel.
2. Using gdb, debug a process and take a backtrace.
3.

Actual results:
The gdb session hangs.

Expected results:
The gdb session should quit when pstack pid is run.

Additional info:
Test case: start up Oracle 11.1.0, then run pstack on any Oracle process (for example, the ora_q000_oastoltp process). The pstack command will hang as above. The workaround is to run 'kill -CONT pid' on all the processes that are stuck in the 'T' state.

Verified that this hang also occurs on the production version of 10.2.0.2 with the same U8 beta kernel (2.4.21-43.ELhugemem), so existing 10.2.0.2 customers may also run into this issue.

The problem still exists in the 2.4.21-44.ELhugemem (RHEL 3 U8 Beta 2) kernel.
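For anyone applying the 'kill -CONT pid' workaround by hand, the following is a minimal sketch of how it could be scripted. It is not taken from the report: it assumes a Python 3 interpreter and a Linux /proc layout where the state letter follows the parenthesised command name in /proc/<pid>/stat. The function names (parse_stat, stopped_processes, resume) are hypothetical helpers made up for this illustration. Note that 'T' is also the state of a process that was stopped on purpose, so the list should be reviewed before resuming anything.

```python
import os
import signal


def parse_stat(stat_text):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def stopped_processes(proc_root="/proc"):
    """List (pid, comm) for every process whose state letter is 'T'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the stat text was unreadable
        if state == "T":
            found.append((pid, comm))
    return sorted(found)


def resume(pids):
    """Send SIGCONT to each PID, the equivalent of 'kill -CONT pid'."""
    for pid in pids:
        try:
            os.kill(pid, signal.SIGCONT)
        except (ProcessLookupError, PermissionError):
            pass  # already gone, or owned by another user
```

A cautious way to use it is to print stopped_processes() first, filter the result by command name (for example gdb), and only then pass the chosen PIDs to resume().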
RHEL3 is now closed.
Guru, does the problem only occur while running under gdb or do you get random hangs while running the regression tests outside the debugger? Also, did this work in a previous update release?
We have not seen a hang outside the debugger yet. Also, the problem was *not* present in RHEL 3 U7 (2.4.21-40.ELhugemem). It started with the RHEL 3 U8 Beta 1 release (2.4.21-43.ELhugemem).
The problem is seen with any process that is being debugged. After gdb quits, the gdb process stays in the 'T' state, as if gdb itself were being traced. It is not a hang; rather, the process is stopped in the 'T' state. The problem is only seen with the debugger.
Guru -- I was able to reproduce this behavior in U8 and verified that it does not occur in U7. Specifically: when debugging a process with gdb, after quitting, the gdb process remains in the T state, whereas in U7 it terminates successfully. However, this had no effect on the process that was being debugged or on system performance.

Are the test systems being impacted, performance-wise, by this? Or is your intention with this issue only to report that gdb should not remain in the T state?

SEG has identified one of the changes that could be responsible for this behavior and has made Engineering aware of it. However, given that U8 is scheduled to be released in less than two weeks, it will not be possible for us to pursue a fix for U8. As you are probably aware, U8 is the last official update scheduled for RHEL3. Management has yet to decide how outstanding RHEL3 issues, like this one, will be addressed. As soon as a decision is made, we will share it with you.

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client
This event sent from IssueTracker by martinez issue 96876
Thank you. As reported earlier, Oracle RDBMS, in situations where it perceives a hang, initiates diagnostic dumps, which include stack dumps taken by invoking gdb. With gdb left in the 'T' state, its memory is not released, so over a period of a few hours to a day the box runs out of memory after using all the swap. We think this will have a tremendous impact on RAC customers, where such diagnostics are often taken when a process or node is seen not responding.
Note that this regression only occurs when "gdb" is used to attach to a previously existing process (with the "attach" command). Also, in both cases (of running the process under "gdb" or attaching to one), the RHEL3 U8 kernel causes another, probably related, regression:

    pesto 1# cat << EOF > xyz.c
    ? main()
    ? {
    ?     for (;;)
    ?         ;
    ?     exit(0);
    ? }
    ? ^D
    pesto 15# cc -o xyz xyz.c
    pesto 16# gdb xyz
    GNU gdb Red Hat Linux (6.3.0.0-1.62rh)
    [...]
    (gdb) run
    Starting program: /home/ernie/xyz
    warning: linux_test_for_tracefork: unexpected result from waitpid (4994, status 0x0)
    warning: linux_test_for_tracefork: failed to kill child
    [...]

I'm assigning this to PeterS, who worked on the U8 zap-threads patch.
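To check a given kernel for the attach case described above without a full Oracle install, a small harness along the following lines might help. This is only a sketch under stated assumptions: it needs Python 3 and a Linux /proc, it uses a busy-looping Python child as a stand-in for the xyz test program, and the names proc_state and check_attach are hypothetical. The command used to attach is supplied by the caller, so nothing here assumes particular gdb options.

```python
import subprocess
import sys


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if gone."""
    try:
        with open("/proc/%d/stat" % pid) as fh:
            text = fh.read()
    except OSError:
        return None
    # The state letter is the first field after the parenthesised name.
    return text[text.rindex(")") + 1:].split()[0]


def check_attach(attach_cmd, timeout=10.0):
    """Attach to a spinning child and report what is left behind.

    attach_cmd maps the target PID to an argv list for the attaching tool.
    Returns (debugger_state, target_state); debugger_state is None when
    the attaching tool exited normally within `timeout` seconds.
    """
    # Stand-in for the xyz program: a child that loops forever.
    target = subprocess.Popen([sys.executable, "-c", "while True: pass"])
    debugger = None
    try:
        debugger = subprocess.Popen(attach_cmd(target.pid),
                                    stdin=subprocess.DEVNULL)
        try:
            debugger.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            pass  # still present; its state is what we want to inspect
        return proc_state(debugger.pid), proc_state(target.pid)
    finally:
        # Clean up whatever is still running, stopped or not.
        for proc in (debugger, target):
            if proc is not None and proc.poll() is None:
                proc.kill()
                proc.wait()
```

With the pstack command mentioned in the original report, a call could look like check_attach(lambda pid: ["pstack", str(pid)]). A first element of 'T' in the result would match the behavior reported here, while None would mean the tool exited as expected.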
Patch posted for internal review on 5-Jul-2006.
A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-47.EL).
I have verified that this issue is fixed in the RHEL 3 U8 release (2.4.21-47.ELhugemem).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0437.html