Bug 131137 - UNABLE TO RUN ps WHILE DUMPING LARGE CORE (DEADLOCKS)
Summary: UNABLE TO RUN ps WHILE DUMPING LARGE CORE (DEADLOCKS)
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: All Linux
medium
medium
Target Milestone: ---
Assignee: Jason Baron
QA Contact:
URL:
Whiteboard:
Keywords: Reopened
Depends On:
Blocks:
 
Reported: 2004-08-27 21:01 UTC by Greg Marsden
Modified: 2013-03-06 05:57 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-03-29 17:53:06 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
coredeadlock patch (468 bytes, patch)
2004-08-27 21:03 UTC, Greg Marsden

Description Greg Marsden 2004-08-27 21:01:28 UTC
Description of problem:
Any call to ps or top made while another process is dumping core
hangs in rwsem_down_read_failed:

Jul 30 12:45:09 rgmldap30 kernel: ps            D 00000000    0  9539  9538        9540      (NOTLB)
Jul 30 12:45:09 rgmldap30 kernel: Call Trace: [rwsem_down_read_failed+325/368] rwsem_down_read_failed [kernel] 0x145 (0xe3febee8)
Jul 30 12:45:09 rgmldap30 kernel: Call Trace: [<c022d3c5>] rwsem_down_read_failed [kernel] 0x145 (0xe3febee8)
Jul 30 12:45:09 rgmldap30 kernel: [stext_lock+15326/33696] stext_lock [kernel] 0x3bde (0xe3febf14)
Jul 30 12:45:09 rgmldap30 kernel: [<c023cbde>] stext_lock [kernel] 0x3bde (0xe3febf14)
Jul 30 12:45:09 rgmldap30 kernel: [__alloc_pages+15/160] __alloc_pages [kernel] 0xf (0xe3febf40)
Jul 30 12:45:09 rgmldap30 kernel: [<c013ebbf>] __alloc_pages [kernel] 0xf (0xe3febf40)
Jul 30 12:45:09 rgmldap30 kernel: [proc_info_read+76/256] proc_info_read [kernel] 0x4c (0xe3febf58)
Jul 30 12:45:09 rgmldap30 kernel: [<c01692fc>] proc_info_read [kernel] 0x4c (0xe3febf58)
Jul 30 12:45:09 rgmldap30 kernel: [sys_read+150/288] sys_read [kernel] 0x96 (0xe3febf7c)
Jul 30 12:45:09 rgmldap30 kernel: [<c0146c66>] sys_read [kernel] 0x96 (0xe3febf7c)
Jul 30 12:45:09 rgmldap30 kernel: [sys_open+149/224] sys_open [kernel] 0x95 (0xe3febfa4)
Jul 30 12:45:10 rgmldap30 kernel: [<c0146675>] sys_open [kernel] 0x95 (0xe3febfa4)
Jul 30 12:45:10 rgmldap30 kernel: [system_call+51/56] system_call [kernel] 0x33 (0xe3febfc0)
Jul 30 12:45:10 rgmldap30 kernel: [<c01073e3>] system_call [kernel] 0x33 (0xe3febfc0)




How reproducible:
Always

Steps to Reproduce:
0. ulimit -c unlimited
1. Run (in the background) a program which mmap()s a 2 GB file
2. kill -SIGSEGV $!
3. ps auxw
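
The test program used for step 1 is not attached to this bug. A minimal
sketch of the mapping part, in C, might look like the following. The
function name map_sparse_file, the choice of a sparse file, and the
MAP_SHARED/read-write flags are all assumptions made for illustration, not
the reporter's actual code, and whether such a mapping ends up in the core
file depends on the kernel's dump policy.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or resize) a sparse file of `len` bytes at
 * `path` and map it shared, read/write.
 * Returns the mapping, or NULL on any failure. */
void *map_sparse_file(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    /* Give the file its full size without writing any data blocks. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */

    return p == MAP_FAILED ? NULL : p;
}
```

A driver would call this with a length close to 2 GB and then pause() so
that step 2 can signal it. On a 32-bit build a length of 2 GB or more does
not fit in a 32-bit off_t, so large-file support would be needed there.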
    

Additional info:
Workaround: set coresize to 0
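
For a process that cannot rely on the shell's limit settings, the same
workaround can be applied from inside the program. This is a sketch only:
disable_core_dumps is a hypothetical helper name, and it lowers just the
soft limit, which is an assumption about what is wanted here.

```c
#include <sys/resource.h>

/* Hypothetical helper: set the soft core-file size limit of the calling
 * process to zero, leaving the hard limit as it was.
 * Returns 0 on success, -1 on error (errno set by the failing call). */
int disable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    rl.rlim_cur = 0; /* a zero limit means no core file is written */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

The limit is inherited by child processes, so calling this early in a
launcher has an effect similar to setting the core size to 0 in the shell
before starting the program.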

Comment 1 Greg Marsden 2004-08-27 21:03:51 UTC
Created attachment 103188
coredeadlock patch

Changes the read semaphore to a write semaphore in the coredump path. Patch by bert.barbe@oracle.com

Comment 2 Greg Marsden 2004-08-27 21:06:24 UTC
From: Arjan van de Ven

Comment 3 Greg Marsden 2004-08-27 21:07:52 UTC
From: Arjan van de Ven
read semaphores are *NOT* recursive. In the 2.4.9 era we used to have a
boatload of issues with this semaphore being taken recursively; I thought
we had all of them fixed but either one came back or one missed the as2.1
branch...
From: Stephen Tweedie
And worse, they break sporadically and unpredictably.  Unless another
thread comes in with a down_write() between the two recursive
down_read()s, everything _appears_ to be working fine.
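
The interleaving described above can be illustrated with a toy model in C.
This is not kernel code: struct toy_rwsem and the toy_* functions are
hypothetical names, and the model only captures the fairness rule that new
readers queue behind a waiting writer. Each function returns false at the
point where a real task would go to sleep.

```c
#include <stdbool.h>

/* Toy model of a fair reader/writer semaphore (illustration only). */
struct toy_rwsem {
    int  readers;        /* number of current read holders */
    bool writer_held;    /* a writer currently owns the lock */
    bool writer_waiting; /* a writer is queued behind the readers */
};

/* Returns true if the read lock is granted at once, false if the caller
 * would have to sleep. */
bool toy_down_read(struct toy_rwsem *s)
{
    if (s->writer_held || s->writer_waiting)
        return false;    /* new readers queue behind any writer */
    s->readers++;
    return true;
}

/* Returns true if the write lock is granted at once; otherwise records
 * that a writer is now waiting and returns false. */
bool toy_down_write(struct toy_rwsem *s)
{
    if (s->readers > 0 || s->writer_held) {
        s->writer_waiting = true;
        return false;
    }
    s->writer_held = true;
    return true;
}
```

In this model, two toy_down_read() calls in a row by the same task both
succeed, which is the "appears to be working fine" case. If a
toy_down_write() from another task lands between them, the second
toy_down_read() returns false: the reader waits for the writer, the writer
waits for the reader's first hold to be released, and neither can proceed.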


Comment 4 Greg Marsden 2004-08-27 21:08:32 UTC
Migrating discussion into bugzilla :)

Comment 5 Jason Baron 2004-08-27 21:32:20 UTC
I've tried the test program and ps does hang, but it eventually returns
when the core file is written. The 'rwsem_down_read_failed' message in
and of itself is not a problem; it just means that we didn't
immediately acquire the semaphore. Perhaps it is poorly named.

Comment 6 Bert Barbe 2004-08-27 23:40:08 UTC
Greg, when you tested, didn't you see an actual deadlock, in other
words the coredump never finishing?

Comment 7 Jason Baron 2004-11-01 18:25:16 UTC
This has been open for more than two months. I'm closing it. If it's
really an issue, please re-open.

Comment 8 Greg Marsden 2004-11-02 01:15:44 UTC
Sorry, it took forever for these folks to get back to us and install
the kernel patch (I think it just happened). In any event, I did not
see the same behavior that you did with your testcase, either. Thus
far, they are very happy with the hacked-together coredeadlock patch
which is attached to this bugzilla...

Comment 9 Red Hat Bugzilla 2007-02-05 19:18:50 UTC
REOPENED status has been deprecated. ASSIGNED with keyword of Reopened is preferred.

