Description of problem:
There is a race condition, easily triggered via the mincore system
call, that will cause "ps -ef" (and likely other things that need to take
mmap_sem) to hang. Since this can be used as a DoS attack I am considering this
a security issue.
I can easily reproduce this on my ia64 hardware. I do not expect it to be
hardware specific however (but I will verify when I get access to other hardware).
The situation that causes this is:
1. thread A calls mincore with the *vec argument pointing to a page that is not
in core (i.e. a freshly mmapped page). In the mincore system call it calls
down_read on mmap_sem.
2. a second thread in the same process calls mmap to map a new page, this calls
down_write on mmap_sem but blocks due to the reader in the first thread.
3. the first thread then calls copy_to_user to copy the info regarding which
pages are in core to the address specified by *vec. Since this page is not yet
in core, a page fault is triggered. The page fault handler calls down_read on
mmap_sem even though this task already holds a read lock on it. Since there is a
waiting writer, this call to down_read blocks. Since this is the thread with the
original read lock on mmap_sem, we have a deadlock.
4. the DOS issue then comes from any user doing a "ps -ef" which needs to do
down_read on mmap_sem for each process. This blocks.
Version-Release number of selected component (if applicable):
Seen on the latest RHEL4 kernel (2.6.9-30) but is also broken up stream. I will
check RHEL3 soon.
How reproducible:
All the time (on ia64 SMP; however, I expect this is not arch dependent)
Steps to Reproduce:
1. compile the attached reproducer with "cc mincore_hang.c -lpthread"
2. run the resulting a.out
3. ps -ef
The a.out and ps -ef hang and are uninterruptible.
Created attachment 124452 [details]
reproducer for mincore race condition
You got a kernel patch to fix it? Or a suggestion how
to fix it?
I have done some thinking on how to fix this. Since we need to be holding the
read lock on mmap_sem while walking the vma list perhaps instead of copying the
info into user space while holding the lock we could copy into some kmalloc'ed
memory temporarily and the copy_to_user after we drop the read lock. I don't
really like this solution since it means allocating potentially large chunks via
kmalloc (i.e. some really big app on a 1TB system wants to check all of its
memory to see what is in core).
Perhaps a better solution would be to ensure the memory pointed to by *vec is in
core before we take the lock and mlock it temporarily.
As you surmise, it reproduces on a RHEL4 x86_64, and on
a UP ia64 RHEL3 box.
Amazing nobody's ever run into this.
> Perhaps a better solution would be to ensure the memory pointed to by *vec
> is in core before we take the lock and mlock it temporarily.
It would seem to make sense to pre-read the memory pointed
to by vec, as opposed to only doing access_ok() on it.
Pinning the pages from inside the kernel scares me though,
because nothing prevents the user from passing a ridiculously
large block of address space, even if it's a byte array.
Adding Larry to the cc: list in case he has any suggestions.
Carrying your suggestion a bit further, perhaps a fault-and-pin-page
operation could be done a page at a time, by enclosing the "while (vma)"
loop in sys_mincore() inside another while loop that does the operation
on a page-by-page basis (i.e., per page full of byte markers). Even
though *vec at a minimum points to a single page, on a large user virtual
address space a user could still seemingly cause some damage if all of
its *vec memory were locked down at once. Doing it a page at a time will make
sys_mincore() obfuscated, but wouldn't introduce a new problem to fix
the old one.
I had to think this through and stare at the code a bit, but it sounds like this
is probably the best solution. The first pass through the loop would be special
since *vec likely isn't page aligned, so we need to see how many bytes we have
before we hit the page boundary and only look at that many pages on the first
pass. Most of the time it would likely only be one pass, so I don't think
performance is impacted (not that mincore is probably a major performance
concern anyway, but I bet Oracle uses it).
Yep, I forgot about *vec not being on a page boundary...
But you're right -- it still means taking the user's passed-in
start, length, and its base pointer to *vec, and keeping local,
progressive versions of those values for each time through the loop.
Re: performance, my only concern was memory page locking, i.e.,
that it shouldn't do more than one page at a time.
QE ack for 4.5.
Ok -- I took a quick look at what data is available in IT #98736.
Aside from the fact that there are a number of processes blocked in ps
in IT #98736, can you explain what that has to do with the mincore() issue
that this bugzilla was filed against? In this bugzilla, Doug forced the
hang with a multi-threaded process whose threads are doing mincore()
and mmap() operations on their address space simultaneously, leading to
a block on their common mmap_sem semaphore if somebody else then did
a ps on that task.
In IT #98736, the straces show a bunch of processes that mostly are blocked
doing read()'s of various /proc/<pid>/cmdline files. Are there simultaneous
threads doing mincore() calls on the same pids, presuming that they are
multi-threaded?
Anyway, at a minimum, we need a forced crash dump to figure out what's
going on in the IT's case. A bunch of user-space "strace" outputs are pretty
much useless, especially those of the "ps" commands (as opposed to the pids
whose cmdline's are being read).
And if it has nothing to do with simultaneous mincore()/mmap() operations
going on in the same threads whose /proc/pid/cmdline files are being read,
then this should be opened as a separate bugzilla, since it makes no sense
to continue in the context of this one.
Looking at the files in IT #98736, there is zero evidence of
the multi-threaded/mincore/mmap interaction issue that is the
subject of this bugzilla.
The only relationship between the two situations is that something
is causing a task's mmap_sem to be deadlocked, and subsequent ps commands
hang when trying to take the semaphore.
In this bugzilla, it's Doug's program that forces simultaneous mincore()/mmap()
interaction between two threads of a process, and both threads end up blocking
on their own mmap_sem. In IT #98736, it's something completely different,
and that problem is what needs to be determined.
It makes no sense to tie that IT to this bugzilla; it needs its own bugzilla.
> The first comment in this bug doesn't mention an IT item. What about fixing
> that problem report and leaving any IT aside?
Doug Chapman (the reporter) was well on his way to fixing this with his
first patch. Perhaps he can resurrect his work and post an updated version.
Doug's final patch:
Has this patch been proposed for upstream inclusion?
(In reply to comment #21)
> Has this patch been proposed for upstream inclusion?
I have not proposed it upstream, as I wasn't sure what the procedure was for
sensitive patches. I sent it to rhkernel-list to get ACKs but received none
(no NAKs either, however).
This bug is 10 months old and I would propose we make it public now and propose
the patch upstream. This one is a good candidate for the next RHEL4 async errata,
and we should go with a patch that is acceptable upstream. Any objections?
So no objections to lifting the embargo. I will communicate this through
vendor-sec and then open this Bugzilla.
And it seems this is not only ia64. Is it 64-bit specific, or can it also happen on 32-bit?
Doug Chapman wrote:
> How reproducible:
> All the time (on ia64 SMP however I expect this is not arch dependent)
> I can easily reproduce this on my ia64 hardware. I do not expect it to be
> hardware specific however (but I will verify when I get access to other
> hardware).
Created attachment 143802 [details]
Proposed patch from Andrew Morton
Created attachment 143803 [details]
Additional patch on-top of the previous one
Just in case you can't see private comments, I've un-privatized the
last two from Andrew Morton. Can you apply his cleanups/fixes to
your patch, verify it, and re-post to RHKL?
looks like Linux re-worked this himself, please see:
(In reply to comment #30)
> looks like Linux re-worked this himself, please see:
Jason meant "Linus", who re-worked your patch himself.
Can you apply Linus' patch, test it on your ia64, and
re-post to RHKL?
patch posted to rhkernel-list for review.
committed in stream U5 build 42.37. A test kernel with this patch is available
I have verified this in 2.6.9-42.37.EL.
fix looks good in 42.0.6
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.