131137 – UNABLE TO RUN ps WHILE DUMPING LARGE CORE (DEADLOCKS)

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	2.1
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jason Baron
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-08-27 21:01 UTC by Greg Marsden
Modified:	2013-03-06 05:57 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-03-29 17:53:06 UTC
Target Upstream Version:
Embargoed:

Attachments
coredeadlock patch (468 bytes, patch) 2004-08-27 21:03 UTC, Greg Marsden

Description Greg Marsden 2004-08-27 21:01:28 UTC

Description of problem:
Any call to ps or top while another process is dumping core hangs in
rwsem_down_read_failed:

Jul 30 12:45:09 rgmldap30 kernel: ps            D 00000000    0  9539  9538          9540      (NOTLB)
Jul 30 12:45:09 rgmldap30 kernel: Call Trace: [rwsem_down_read_failed+325/368] rwsem_down_read_failed [kernel] 0x145 (0xe3febee8)
Jul 30 12:45:09 rgmldap30 kernel: Call Trace: [<c022d3c5>] rwsem_down_read_failed [kernel] 0x145 (0xe3febee8)
Jul 30 12:45:09 rgmldap30 kernel: [stext_lock+15326/33696] stext_lock [kernel] 0x3bde (0xe3febf14)
Jul 30 12:45:09 rgmldap30 kernel: [<c023cbde>] stext_lock [kernel] 0x3bde (0xe3febf14)
Jul 30 12:45:09 rgmldap30 kernel: [__alloc_pages+15/160] __alloc_pages [kernel] 0xf (0xe3febf40)
Jul 30 12:45:09 rgmldap30 kernel: [<c013ebbf>] __alloc_pages [kernel] 0xf (0xe3febf40)
Jul 30 12:45:09 rgmldap30 kernel: [proc_info_read+76/256] proc_info_read [kernel] 0x4c (0xe3febf58)
Jul 30 12:45:09 rgmldap30 kernel: [<c01692fc>] proc_info_read [kernel] 0x4c (0xe3febf58)
Jul 30 12:45:09 rgmldap30 kernel: [sys_read+150/288] sys_read [kernel] 0x96 (0xe3febf7c)
Jul 30 12:45:09 rgmldap30 kernel: [<c0146c66>] sys_read [kernel] 0x96 (0xe3febf7c)
Jul 30 12:45:09 rgmldap30 kernel: [sys_open+149/224] sys_open [kernel] 0x95 (0xe3febfa4)
Jul 30 12:45:10 rgmldap30 kernel: [<c0146675>] sys_open [kernel] 0x95 (0xe3febfa4)
Jul 30 12:45:10 rgmldap30 kernel: [system_call+51/56] system_call [kernel] 0x33 (0xe3febfc0)
Jul 30 12:45:10 rgmldap30 kernel: [<c01073e3>] system_call [kernel] 0x33 (0xe3febfc0)




How reproducible:
Always

Steps to Reproduce:
0. ulimit -c unlimited
1. Run a program that mmap()s a 2 GB file, in the background (so that $! is its PID)
2. kill -SIGSEGV $!
3. ps auxw
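
The test program used in step 1 is not attached to this bug. A minimal sketch of what such a program might look like follows. The function names map_whole_file and hold_mapping are hypothetical, and the choice of a writable private mapping is an assumption (made so that the mapping is likely to be written to the core file), not something stated in the report.

```c
#define _FILE_OFFSET_BITS 64   /* allow files of 2 GB and more on 32-bit */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map the whole file at `path` as a private,
 * writable mapping. Returns the mapping and stores its length in
 * *len_out, or returns NULL on any error. */
void *map_whole_file(const char *path, size_t *len_out)
{
    struct stat st;
    void *p;
    int fd;

    assert(len_out != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}

/* Hypothetical driver: map the file, then sleep until a signal arrives.
 * Returns 1 if the mapping could not be created. */
int hold_mapping(const char *path)
{
    size_t len;

    if (map_whole_file(path, &len) == NULL)
        return 1;
    pause();                   /* wait here for kill -SIGSEGV */
    return 0;
}
```

Built into a small program whose main() calls hold_mapping(argv[1]) and started in the background, this provides the process that step 2 signals.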
    

Additional info:
Workaround: set coresize to 0
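
From a shell this is typically ulimit -c 0. A process can also apply the same limit to itself. The sketch below uses the standard getrlimit/setrlimit calls; the function name disable_core_dumps is hypothetical, and whether a per-process limit is a suitable workaround for a given deployment is an assumption.

```c
#include <sys/resource.h>

/* Set the soft core-file size limit of the calling process to 0,
 * leaving the hard limit untouched.
 * Returns 0 on success, -1 on error. */
int disable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = 0;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Child processes inherit the limit, so calling this early in a daemon also covers anything it forks afterwards.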

Comment 1 Greg Marsden 2004-08-27 21:03:51 UTC

Created attachment 103188 [details]
coredeadlock patch

Changes the read semaphore to a write semaphore in the coredump path. Patch by bert.barbe.

Comment 2 Greg Marsden 2004-08-27 21:06:24 UTC

From: Arjan van de Ven

Comment 3 Greg Marsden 2004-08-27 21:07:52 UTC

From: Arjan van de Ven
read semaphores are *NOT* recursive. In the 2.4.9 era we used to have a
boatload of issues with this semaphore being taken recursively; I
thought we had all of them fixed but either one came back or one missed
the AS2.1 branch...

From: Stephen Tweedie
And worse, they break sporadically and unpredictably.  Unless another
thread comes in with a down_write() between the two recursive
down_read()s, everything _appears_ to be working fine.
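
The interleaving described above can be illustrated with a toy model. This is not the kernel's rwsem code: the struct and the toy_* functions are hypothetical, and the model assumes only that a new reader must queue whenever any task is already waiting (which is what makes a queued writer block later readers).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a fair reader/writer semaphore (not the kernel's code). */
struct toy_rwsem {
    int  active_readers;   /* readers currently holding the lock */
    bool writer_active;    /* true while a writer holds the lock */
    int  queued;           /* tasks sleeping on the wait queue */
};

/* Returns true if the read lock is granted at once, false if the
 * caller would have to sleep behind earlier waiters. */
bool toy_down_read(struct toy_rwsem *s)
{
    if (s->writer_active || s->queued > 0) {
        s->queued++;
        return false;
    }
    s->active_readers++;
    return true;
}

/* Returns true if the write lock is granted at once, false if the
 * caller would have to sleep. */
bool toy_down_write(struct toy_rwsem *s)
{
    if (s->writer_active || s->active_readers > 0 || s->queued > 0) {
        s->queued++;
        return false;
    }
    s->writer_active = true;
    return true;
}

/* Task A takes the read lock twice. Optionally, task B asks for the
 * write lock in between. Returns true if A's second down_read blocks. */
bool recursive_read_blocks(bool writer_in_between)
{
    struct toy_rwsem s = { 0, false, 0 };
    bool granted;

    granted = toy_down_read(&s);     /* A: first down_read, granted */
    assert(granted);
    (void)granted;
    if (writer_in_between)
        (void)toy_down_write(&s);    /* B: queues behind A's read hold */

    /* A: second down_read. If it queues, A sleeps behind B while B
     * waits for A to release its first read hold: a deadlock. */
    return !toy_down_read(&s);
}
```

In this model recursive_read_blocks(false) reports that the second down_read is granted, while recursive_read_blocks(true) reports that it blocks, which matches the "appears to work until a writer shows up" behaviour described in the comment.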

Comment 4 Greg Marsden 2004-08-27 21:08:32 UTC

Migrating discussion into bugzilla :)

Comment 5 Jason Baron 2004-08-27 21:32:20 UTC

I've tried the test program and ps does hang, but it eventually returns
once the core file is written. The 'rwsem_down_read_failed' message in
and of itself is not a problem; it just means that we didn't
immediately acquire the semaphore. Perhaps it is poorly named.

Comment 6 Bert Barbe 2004-08-27 23:40:08 UTC

Greg, when you tested, didn't you see an actual deadlock, in other
words, the coredump never finishing?

Comment 7 Jason Baron 2004-11-01 18:25:16 UTC

This has been open for more than two months. I'm closing it. If it's
really an issue, please re-open.

Comment 8 Greg Marsden 2004-11-02 01:15:44 UTC

Sorry, it took forever for these folks to get back to us and install
the kernel patch (I think it just happened). In any event, I did not
see the same behavior that you did with your testcase, either. Thus
far, they are very happy with the hacked-together coredeadlock patch
which is attached to this bugzilla...

Comment 9 Red Hat Bugzilla 2007-02-05 19:18:50 UTC

REOPENED status has been deprecated. ASSIGNED with keyword of Reopened is preferred.
