196938 – [Beta RHEL3 U8 Regression] Processes hung while allocating stack using gdb

Summary: [Beta RHEL3 U8 Regression] Processes hung while allocating stack using gdb

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.8
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Ernie Petrides
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	RHEL3U8MustFix

Reported:	2006-06-27 18:35 UTC by Red Hat Bugzilla
Modified:	2009-06-18 15:35 UTC
CC List:	18 users
Fixed In Version:	RHSA-2006-0437
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-07-20 14:12:23 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2006:0437	0	normal	SHIPPED_LIVE	Important: Updated kernel packages for Red Hat Enterprise Linux 3 Update 8	2006-07-20 13:11:00 UTC

Description Red Hat Bugzilla 2006-06-27 18:35:50 UTC

Description of problem:
While running a long regression test with the latest Oracle database release
(11.1.0), we found that many processes hang in T state when Oracle takes
diagnostics using gdb on the RHEL 3 U8 beta kernel (2.4.21-43.ELhugemem).

This results in higher memory usage, including swap, eventually leading to OOM
killing of Oracle server processes. This did not happen in RHEL 3 U7
(2.4.21-40.ELhugemem).

Here is the kernel stack for a hung process:

gdb           T 43300000     4  1334      1                1324 (NOTLB)
Call Trace:   [<02137f64>] get_signal_to_deliver [kernel] 0xd4 (0x432ffef0)
[<0210c6d0>] do_signal [kernel] 0x0 (0x432fff1c)
[<0210c734>] do_signal [kernel] 0x64 (0x432fff20)
[<02163b8c>] put_user_size [kernel] 0x3c (0x432fff80)
[<02123ac0>] schedule_tail [kernel] 0xa0 (0x432fff9c) 

Version-Release number of selected component (if applicable):


How reproducible:
Every time gdb attaches to an Oracle process.

Steps to Reproduce:
1. Install Oracle 10g 10201 on the RHEL3 U8 beta kernel.
2. Using gdb, debug a process and take a backtrace.
  
Actual results:
The gdb session hangs.

Expected results:
The gdb session should exit after 'pstack pid' is run.

Additional info:

Test case:
1. Start up Oracle 11.1.0.
2. Run pstack on any Oracle process (for example, the ora_q000_oastoltp process).
3. The pstack command hangs as above.

The workaround is to run 'kill -CONT pid' on every process that is stuck in
'T' state.

We verified that this hang also occurs on the production version of 10.2.0.2
with the same U8 beta kernel (2.4.21-43.ELhugemem).

So existing 10.2.0.2 customers may also run into this issue.

The problem still exists in the 2.4.21-44.ELhugemem (RHEL 3 U8 Beta 2) kernel.
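
The 'kill -CONT' workaround mentioned above could be scripted roughly as
follows. This is only a sketch and is not taken from this report: the function
names (stat_state, stopped_pids, resume_pids) and the PROC_ROOT override are
hypothetical, and it assumes the usual /proc/PID/stat layout in which the state
letter follows the parenthesised command name. It matches every process in 'T'
state, including ones stopped on purpose (job control, live debug sessions), so
the list should be reviewed before anything is resumed.

```shell
#!/bin/sh
# Sketch only: helper names and PROC_ROOT are illustrative assumptions.

# Print the state letter (third field) from one line of /proc/PID/stat.
# The command name (second field) is wrapped in parentheses and may itself
# contain spaces or ')', so strip everything up to the LAST ") " first.
stat_state() {
    rest=${1##*') '}
    printf '%s\n' "${rest%% *}"
}

# Print the PID of every process whose state letter is 'T'.
# PROC_ROOT (hypothetical) lets the function run against a fake tree.
stopped_pids() {
    root=${PROC_ROOT:-/proc}
    for f in "$root"/[0-9]*/stat; do
        [ -r "$f" ] || continue
        # The process may exit between the glob and the read.
        line=$(cat "$f" 2>/dev/null) || continue
        [ "$(stat_state "$line")" = "T" ] || continue
        dir=${f%/stat}
        printf '%s\n' "${dir##*/}"
    done
}

# Read PIDs from stdin and send SIGCONT to each one,
# reporting the PIDs for which the signal was delivered.
resume_pids() {
    while read -r pid; do
        kill -CONT "$pid" 2>/dev/null && printf 'resumed %s\n' "$pid"
    done
}
```

Running 'stopped_pids' on its own first shows which processes would be
affected; 'stopped_pids | resume_pids' then applies the workaround.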

Comment 1 Red Hat Bugzilla 2006-06-27 22:00:55 UTC

RHEL3 is now closed.

Comment 3 Red Hat Bugzilla 2006-06-29 13:44:28 UTC

Guru, does the problem only occur while running under gdb or do you get random
hangs while running the regression tests outside the debugger? Also, did this
work in a previous update release?

Comment 4 Red Hat Bugzilla 2006-06-29 14:59:08 UTC

We have not seen a hang outside the debugger yet.

Also, the problem was *not* present in RHEL 3 U7 (2.4.21-40.ELhugemem).

This problem started with the RHEL 3 U8 Beta 1 release (2.4.21-43.ELhugemem).

Comment 7 Red Hat Bugzilla 2006-06-29 18:21:37 UTC

The problem is seen with any process that is being debugged. After gdb quits,
the gdb process stays in 'T' state, as if gdb itself were being traced. It is
not a hang; rather, the process is stopped in 'T' state. The problem is only
seen with the debugger.
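
One way to tell "stopped" apart from "being traced" is to look at
/proc/PID/status. The sketch below is an illustration under assumptions, not
something from this report: proc_state_and_tracer is a hypothetical helper, and
it assumes a kernel whose status file carries 'State:' and 'TracerPid:' lines
(a missing TracerPid line is printed as '-').

```shell
#!/bin/sh
# Sketch only: proc_state_and_tracer is an illustrative helper name.

# Print "STATE TRACERPID" for the status file given as $1,
# e.g. /proc/1334/status. TracerPid 0 means no tracer is attached.
proc_state_and_tracer() {
    awk '
        $1 == "State:"     { state = $2 }
        $1 == "TracerPid:" { tracer = $2 }
        END { print state, (tracer == "" ? "-" : tracer) }
    ' "$1"
}
```

For example, if 'proc_state_and_tracer /proc/1334/status' prints 'T 0', the
process is stopped with no tracer attached, which would match the description
above of a stopped process rather than a traced one.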

Comment 11 Red Hat Bugzilla 2006-06-30 19:44:22 UTC

Guru -- I was able to reproduce this behavior in U8 and verified that it
does not occur in U7. That is: when debugging a process with gdb, after
quitting, the gdb process remains in T state, while in U7 it terminates
successfully. However, this had no effect on the process that was being
debugged or on system performance. Are the test systems being impacted,
performance-wise, due to this? Or is your intention for this issue only to
report that gdb should not remain in T state?

SEG has identified one of the changes that could be responsible for this
behavior and has made Engineering aware of it. However, given that U8 is
scheduled to be released in less than two weeks, it will not be possible
for us to pursue a fix for U8. As you are probably aware, U8 is the last
official update scheduled for RHEL3. Management has yet to decide on how
outstanding RHEL3 issues, like this one, will be addressed. As soon as a
decision is made, we will be sharing it with you.

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by martinez 
 issue 96876

Comment 12 Red Hat Bugzilla 2006-06-30 20:14:24 UTC

Thank you. As reported earlier, Oracle RDBMS, in situations where it
perceives a hang, initiates diagnostic dumps, which include stack dumps taken
by invoking gdb. With gdb left in 'T' state its memory is not released, so
over a period of a few hours to a day the box runs out of memory after using
all the swap. We think this will have a tremendous impact on RAC customers,
where such diagnostics are often taken when a process or node is seen not
responding.

Comment 16 Red Hat Bugzilla 2006-07-01 00:22:01 UTC

Note that this regression only occurs when "gdb" is used to attach to
a previously existing process (with the "attach" command).  Also, in
both cases (of running the process under "gdb" or attaching to one),
the RHEL3 U8 kernel causes another probably related regression:

  pesto 1# cat << EOF > xyz.c
  ? main()
  ? {
  ?         for (;;)
  ?                 ;
  ?         exit(0);
  ? }
  ? ^D

  pesto 15# cc -o xyz xyz.c

  pesto 16# gdb xyz
  GNU gdb Red Hat Linux (6.3.0.0-1.62rh)
  [...]

  (gdb) run
  Starting program: /home/ernie/xyz 
  warning: linux_test_for_tracefork: unexpected result from waitpid (4994,
    status 0x0)
  warning: linux_test_for_tracefork: failed to kill child
  [...]


I'm assigning this to PeterS, who worked on the U8 zap-threads patch.

Comment 20 Red Hat Bugzilla 2006-07-05 22:28:07 UTC

Patch posted for internal review on 5-Jul-2006.

Comment 21 Red Hat Bugzilla 2006-07-06 00:22:53 UTC

A fix for this problem has just been committed to the RHEL3 U8
patch pool this evening (in kernel version 2.4.21-47.EL).

Comment 28 Red Hat Bugzilla 2006-07-19 06:57:25 UTC

I have verified that this issue is fixed in the RHEL 3 U8 release
(2.4.21-47.ELhugemem).

Comment 29 Red Hat Bugzilla 2006-07-20 14:12:23 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0437.html
