Hide Forgot
Description of problem: There is a race condition that can be easily triggered by the mincore system call that will cause "ps -ef" (and likley other things that need to lock mmap_sem) to hang. Since this can be used as a DOS attack I am considering this a security issue. I can easily reproduce this on my ia64 hardware. I do not expect it to be hardware specific however (but I will verify when I get access to other hardware). The situation that causes this is: 1. thread A calls mincore with the *vec argument pointing to a page that is not in core (i.e. a freshly mmapped page). In the mincore system call it calls down_read on mmap_sem. 2. a second thread in the same process calls mmap to map a new page, this calls down_write on mmap_sem but blocks due to the reader in the first thread. 3. the first thread then calls copy_to_user to copy the info regarding which pages are in core to the address specified by *vec. Since this page is not yet in core a page fault is triggered. The page fault handler calls down_read on mmap_sem even though this task already has a read lock on it. Since there is a waiting writer this call to down_read blocks. Since this is the thread with the original read_lock on mmap_sem we have deadlock. 4. the DOS issue then comes from any user doing a "ps -ef" which needs to do down_read on mmap_sem for each process. This blocks. Version-Release number of selected component (if applicable): Seen on the latest RHEL4 kernel (2.6.9-30) but is also broken up stream. I will check RHEL3 soon. How reproducible: All the time (on ia64 SMP however I expect this is not arch dependent) Steps to Reproduce: 1. compile the attached reproducer with "cc mincore_hang.c -lpthread" 2. ./a.out 3. ps -ef Actual results: The a.out and ps -ef hang and are uninterruptable Expected results: Additional info:
Created attachment 124452 [details] reproducer for mincore race condition
You got a kernel patch to fix it? Or a suggestion how to fix it?
I have done some thinking on how to fix this. Since we need to be holding the read lock on mmap_sem while walking the vma list perhaps instead of copying the info into user space while holding the lock we could copy into some kmalloc'ed memory temporarily and the copy_to_user after we drop the read lock. I don't really like this solution since it means allocating potentially large chunks via kmalloc (i.e. some really big app on a 1TB system wants to check all of its memory to see what is in core). Perhaps a better solution would be to ensure the memory pointed to by *vec is in core before we take the lock and mlock it temporarily.
As you surmise, it reproduces on a RHEL4 x86_64, and on a UP ia64 RHEL3 box. Amazing nobody's ever run into this. > Perhaps a better solution would be to ensure the memory pointed to by *vec > is in core before we take the lock and mlock it temporarily. It would seem to make sense to pre-read the memory pointed to by vec, as opposed to only doing access_ok() on it. Pinning the pages from inside the kernel of scares me though, because nothing prevents the user from passing a ridiculously large block of address space, even if it's a byte array. Adding Larry to the cc: list in case he has any suggestions.
Doug, Carrying your suggestion a bit further, perhaps a fault-and-pin-page operation could be done a page a time, by enclosing the "while (vma)" loop in sys_mincore() inside another while loop that does the operation on a page-by-page basis (i.e., per page full of byte markers). Even though *vec at a minimum points to a single page, on a large user virtual address space, a user could still seemingly cause some damage if all of its *vec memory were locked down at once. Doing it a page at time will make sys_mincore() obfuscated, but wouldn't introduce a new problem to fix the old one.
Dave, I had to think this through and stare at the code a bit but it sounds like this is probably the best solution. The first pass through the loop would be special since *vec likley isn't page aligned so we need to see how many bytes we have before we hit the page boundary and only look at that many pages on the first pass. Most of the time it would likely only be 1 pass so I don't think performance is impacted (not that mincore is probably a major performance concern anyway but I bet oracle uses it).
Yep, I forgot about *vec not being on page boundary... But you're right -- it still means taking the user's passed-in start, length, and its base pointer to *vec, and keeping local, progressive, versions of those values for each time through the loop. Re: performance, my only concern was memory page locking, i.e., that it shouldn't do more than one at a time.
QE ack for 4.5.
Ok -- I took a quick look at what data is available in IT #98736. Aside from the fact that there are a number of processes blocked in ps in IT #98736, can you explain what that has to do with the mincore() issue that this bugzilla was filed against? In this bugzilla, Doug forced the the hang with a multi-threaded process whose threads are doing mincore() and mmap() operations on their address space simultaneously, leading to a block on their common mmap_sem semaphore if somebody else then did a ps on that task. In IT #98736, the straces show a bunch of processes that mostly are blocked doing read()'s of various /proc/<pid>/cmdline files. Are there simultaneous threads doing mincore() calls on the same pids, presuming that they are multi-threaded processes? Anyway, at a minimum, we need a forced crash-dump to figure out what's going on in the IT's case. A bunch of user-space "strace" outputs are pretty much useless, especially those of the "ps" commands (as opposed to the pids who cmdline's are being read. And if it has nothing to do with simultaneous mincore()/mmap() operations going on in the same threads whose /proc/pid/cmdline files are being read, then this should be opened as a seperate bugzilla, since it makes no sense to continue in the context of this one.
Looking at the files in IT #98736, there is zero evidence of the multi-threaded/mincore/mmap interaction issue that is the subject of this bugzilla. The only relationship between the two situations is that something is causing a task's mmap_sem to be deadlocked, and subsequent ps commands hang when trying to take the semaphore. In this bugzilla, it's Doug's program that forces simultaneous mincore()/mmap() interaction between two threads of a process, and both threads end up blocking on their own mmap_sem. In #IT 98736, it's something completely different, and that problem is what needs to be determined. It makes no sense to tie that IT to this bugzilla; it needs its own bugzilla.
> The first comment in this bug don't talk about an IT item. What about fixing > that problem report and leaving any IT by side? Doug Chapman (the reporter) was well on his way to fixing this with his first patch. Perhaps he can resurrect his work and post an updated version?
Doug's final patch: http://post-office.corp.redhat.com/archives/rhkernel-list/2006-September/msg00540.html
Has this patch been proposed for upstream inclusion?
(In reply to comment #21) > Has this patch been proposed for upstream inclusion? > I have not proposed it upstream as I wasn't sure what the procedure was for sensitive patches. I sent it to rhkernel-list to get ack's but received none (no nacks either however).
This bug is 10 month old and I would propose we make it public now and propose the patch upstream. This one is a good candidate for the next RHEL4 async errata and we should go with a patch that is acceptable by upstream. Any objections?
So no objections to lifting the embargo. I will communicate this through vendor-sec and then open this Bugzilla.
And it seems this is not only ia64. Is 64-bit specific or can it also happen on 32-bit architectures?
Doug Chapman wrote: > How reproducible: > All the time (on ia64 SMP however I expect this is not arch dependent) > I can easily reproduce this on my ia64 hardware. I do not expect it to be > hardware specific however (but I will verify when I get access to other > hardware).
Created attachment 143802 [details] Proposed patch from Andrew Morton
Created attachment 143803 [details] Additional patch on-top of the previous one
Hi Doug, Just in case you can't see private comments, I've un-privatized the last two from Andrew Morton. Can you apply his cleanups/fixes to your patch, verify it, and re-post to RHKL? Thanks, Dave
looks like Linux re-worked this himself, please see: http://marc.theaimsgroup.com/?l=git-commits-head&m=116629555506940&w=2 http://marc.theaimsgroup.com/?l=git-commits-head&m=116638195203710&w=2 http://marc.theaimsgroup.com/?l=git-commits-head&m=116638195213406&w=2
(In reply to comment #30) > looks like Linux re-worked this himself, please see: > > http://marc.theaimsgroup.com/?l=git-commits-head&m=116629555506940&w=2 > http://marc.theaimsgroup.com/?l=git-commits-head&m=116638195203710&w=2 > http://marc.theaimsgroup.com/?l=git-commits-head&m=116638195213406&w=2 Hi Doug, Jason meant "Linus", who re-worked your patch himself. Can you apply Linus' patch, test it on your ia64, and re-post to RHKL? Thanks, Dave
patch posted to rhkernel-list for review.
committed in stream U5 build 42.37. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
I have verified this in 2.6.9-42.37.EL.
fix looks good in 42.0.6
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2007-0014.html