Bug 131137 - UNABLE TO RUN ps WHILE DUMPING LARGE CORE (DEADLOCKS)
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: All Linux
Priority: medium Severity: medium
Assigned To: Jason Baron
Keywords: Reopened
Reported: 2004-08-27 17:01 EDT by Greg Marsden
Modified: 2013-03-06 00:57 EST (History)
Doc Type: Bug Fix
Last Closed: 2007-03-29 13:53:06 EDT

Attachments
coredeadlock patch (468 bytes, patch)
2004-08-27 17:03 EDT, Greg Marsden

Description Greg Marsden 2004-08-27 17:01:28 EDT
Description of problem:
Any call to ps or top while another process is dumping core will
hang in rwsem_down_read_failed:

Jul 30 12:45:09 rgmldap30 kernel: ps            D 00000000    0  9539  9538          9540      (NOTLB)
Jul 30 12:45:09 rgmldap30 kernel: Call Trace: [rwsem_down_read_failed+325/368] rwsem_down_read_failed [kernel] 0x145 (0xe3febee8)
Jul 30 12:45:09 rgmldap30 kernel: Call Trace: [<c022d3c5>] rwsem_down_read_failed [kernel] 0x145 (0xe3febee8)
Jul 30 12:45:09 rgmldap30 kernel: [stext_lock+15326/33696] stext_lock [kernel] 0x3bde (0xe3febf14)
Jul 30 12:45:09 rgmldap30 kernel: [<c023cbde>] stext_lock [kernel] 0x3bde (0xe3febf14)
Jul 30 12:45:09 rgmldap30 kernel: [__alloc_pages+15/160] __alloc_pages [kernel] 0xf (0xe3febf40)
Jul 30 12:45:09 rgmldap30 kernel: [<c013ebbf>] __alloc_pages [kernel] 0xf (0xe3febf40)
Jul 30 12:45:09 rgmldap30 kernel: [proc_info_read+76/256] proc_info_read [kernel] 0x4c (0xe3febf58)
Jul 30 12:45:09 rgmldap30 kernel: [<c01692fc>] proc_info_read [kernel] 0x4c (0xe3febf58)
Jul 30 12:45:09 rgmldap30 kernel: [sys_read+150/288] sys_read [kernel] 0x96 (0xe3febf7c)
Jul 30 12:45:09 rgmldap30 kernel: [<c0146c66>] sys_read [kernel] 0x96 (0xe3febf7c)
Jul 30 12:45:09 rgmldap30 kernel: [sys_open+149/224] sys_open [kernel] 0x95 (0xe3febfa4)
Jul 30 12:45:10 rgmldap30 kernel: [<c0146675>] sys_open [kernel] 0x95 (0xe3febfa4)
Jul 30 12:45:10 rgmldap30 kernel: [system_call+51/56] system_call [kernel] 0x33 (0xe3febfc0)
Jul 30 12:45:10 rgmldap30 kernel: [<c01073e3>] system_call [kernel] 0x33 (0xe3febfc0)

How reproducible:
Always

Steps to Reproduce:
0. ulimit -c unlimited
1. Run a program which mmap()s a 2 GB file, in the background so that $! holds its PID
2. kill -SIGSEGV $!
3. ps auxw
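
The test program used for step 1 is not attached to this bug, so the following is only a minimal sketch of what such a program might look like. The function name map_and_touch, the REPRO_MAIN guard, and the choice of a writable private mapping are all assumptions made for illustration. On a 32-bit system, a file of 2 GB or more may also need large-file support, which this sketch does not handle.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map the whole file at `path` privately and read one
 * byte per page so the pages are faulted in. Returns the mapping (and its
 * length through *len_out), or NULL on failure. */
void *map_and_touch(const char *path, size_t *len_out)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    size_t len = (size_t)st.st_size;
    /* Assumption: a writable private mapping, so that the region is a
     * plausible candidate for inclusion in the core file. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return NULL;

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    volatile const unsigned char *bytes = p;
    unsigned long sum = 0;
    for (size_t off = 0; off < len; off += (size_t)page)
        sum += bytes[off];          /* read-only touch, no copy-on-write */
    (void)sum;

    *len_out = len;
    return p;
}

#ifdef REPRO_MAIN
/* Build with -DREPRO_MAIN to get a standalone program. */
int main(int argc, char **argv)
{
    size_t len;
    if (argc != 2 || map_and_touch(argv[1], &len) == NULL)
        return EXIT_FAILURE;
    pause();                        /* sleep until the shell sends SIGSEGV */
    return EXIT_SUCCESS;
}
#endif
```

Built with -DREPRO_MAIN and started as "./repro bigfile &", the process would sit in pause() until step 2 delivers the signal.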
    

Additional info:
Workaround: set coresize to 0
Comment 1 Greg Marsden 2004-08-27 17:03:51 EDT
Created attachment 103188 [details]
coredeadlock patch

Changes the read semaphore to a write semaphore in the coredump path. Patch by bert.barbe@oracle.com
Comment 2 Greg Marsden 2004-08-27 17:06:24 EDT
From: Arjan van de Ven
Comment 3 Greg Marsden 2004-08-27 17:07:52 EDT
From: Arjan van de Ven
read semaphores are *NOT* recursive. In the 2.4.9 era we used to have a
boatload of issues with this semaphore being taken recursively; I
thought we had all of them fixed but either one came back or one missed
the as2.1 branch...
From: Stephen Tweedie
And worse, they break sporadically and unpredictably.  Unless another
thread comes in with a down_write() between the two recursive
down_read()s, everything _appears_ to be working fine.
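
To illustrate the point about recursion, here is a single-threaded toy model of a fair reader/writer semaphore. It is not the kernel's rwsem implementation; the struct and every toy_* name are hypothetical, and the only behaviour it assumes is the one described above: once a writer is queued, new readers must wait behind it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: tracks who holds the lock and who is queued. */
struct toy_rwsem {
    int  active_readers;    /* readers currently holding the lock */
    bool writer_active;     /* a writer currently holds the lock */
    int  waiting_writers;   /* writers queued behind the current holders */
};

/* Returns true if the read lock is granted at once, false if the caller
 * would have to sleep. Queued writers block new readers (fairness). */
bool toy_down_read(struct toy_rwsem *s)
{
    if (s->writer_active || s->waiting_writers > 0)
        return false;
    s->active_readers++;
    return true;
}

/* Returns true if the write lock is granted at once; otherwise the writer
 * is queued and false is returned. */
bool toy_down_write(struct toy_rwsem *s)
{
    if (!s->writer_active && s->active_readers == 0 &&
        s->waiting_writers == 0) {
        s->writer_active = true;
        return true;
    }
    s->waiting_writers++;
    return false;
}

/* Releasing the last read lock hands the semaphore to a queued writer. */
void toy_up_read(struct toy_rwsem *s)
{
    assert(s->active_readers > 0);
    if (--s->active_readers == 0 && s->waiting_writers > 0) {
        s->waiting_writers--;
        s->writer_active = true;
    }
}

/* One task takes the read lock twice. Returns true if the second attempt
 * would sleep. In that case the task still holds its first read lock and
 * never reaches toy_up_read(), so the queued writer never runs either:
 * a deadlock. */
bool recursive_read_deadlocks(bool writer_in_between)
{
    struct toy_rwsem s = { 0, false, 0 };

    bool first = toy_down_read(&s);
    assert(first);
    if (writer_in_between)
        (void)toy_down_write(&s);   /* another task queues for writing */
    return !toy_down_read(&s);      /* the recursive acquisition */
}
```

In this model, recursive_read_deadlocks(false) reports no problem, while recursive_read_deadlocks(true) reports that the second down_read would block, which matches the sporadic failure mode described above.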
Comment 4 Greg Marsden 2004-08-27 17:08:32 EDT
Migrating discussion into bugzilla :)
Comment 5 Jason Baron 2004-08-27 17:32:20 EDT
I've tried the test program and ps does hang, but it eventually returns
when the core file is written. The 'rwsem_down_read_failed' message in
and of itself is not a problem; it just means that we didn't
immediately acquire the semaphore. Perhaps it is poorly named.
Comment 6 Bert Barbe 2004-08-27 19:40:08 EDT
Greg, when you tested, didn't you see an actual deadlock, in other
words, the coredump never finishing?
Comment 7 Jason Baron 2004-11-01 13:25:16 EST
This has been open for more than two months. I'm closing it. If it's
really an issue, please re-open.
Comment 8 Greg Marsden 2004-11-01 20:15:44 EST
Sorry, it took forever for these folks to get back to us and install
the kernel patch (I think it just happened). In any event, I did not
see the same behavior that you did with your testcase, either. Thus
far, they are very happy with the hacked-together coredeadlock patch
which is attached to this bugzilla...
Comment 9 Red Hat Bugzilla 2007-02-05 14:18:50 EST
REOPENED status has been deprecated. ASSIGNED with keyword of Reopened is preferred.
