Description of problem: During executing do_mmap_pgoff and using ext3 Filesystem, a deadlock can occur. This deadlock requires 3 processes. Each of 3 processes wants resources which the others have gotten. (Each of 3 processes waits till it can get the resource which the other have gotten.) The following describes which each of 3 processes waits for. 1) A certain process acquired the write-lock of rw_semaphore (mm_struct.mmap_sem), and waits till "kjournald" begins to commit a transaction of journal. (and waits till commit_transaction->t_state turns into other value different from "T_LOCKED".) [backtrace example] sys_mmap2() -> do_mmap2() [down_write(¤t->mm->mmap_sem)] *1 -> do_mmap_pgoff() -> generic_file_mmap() -> update_atime() -> ... -> journal_start() -> start_this_handle() *3 -> schedule() 2) The other process incremented an update-counter (but doesn't decrement it), (The other process incremented commit_transaction->t_updates) and waits till it can acquire the read-lock of rw_semaphore acquired by 1)'s process. [backtrace example] sys_pwrite64() -> vfs_write() -> ... -> generic_file_buffered_write() [ext3_prepare_write() -> ... -> journal_start()] *2 -> filemap_copy_from_user() -> __copy_from_user() -> __direct_copy_from_user_inatomic() -> __copy_from_user_ll() = PAGE FAULT = -> do_page_fault() -> down_read(&mm->mmap_sem) *1 -> ... -> schedule() 3) "kjournald" set commit_transaction->t_state to "T_LOCKED" and waits till 2)'s process changes commit_transaction->t_updates into 0. [backtrace example] kjournald() -> journal_commit_transaction() [commit_transaction->t_state = T_LOCKED] *3 -> while (commit_transaction->tupdates) { *2 ... schedule() ... } Thus, each of 3 processes cannot run anymore. (deadlock)
Hi, I think the problem is that page fault in backtrace number (2) and checking upstream code we can see the difference in generic_file_buffered_write() which does: ... if (a_ops->write_begin) status = generic_perform_write(file, &i, pos); else status = generic_perform_write_2copy(file, &i, pos); and both functions above takes care to avoid page faults and so this deadlock too, doing as below: /* * Must not enter the pagefault handler here, because * we hold the page lock, so we might recursively * deadlock on the same lock, or get an ABBA deadlock * against a different lock, or against the mmap_sem * (which nests outside the page lock). So increment * preempt count, and use _atomic usercopies. * * The page is uptodate so we are OK to encounter a * short copy: if unmodified parts of the page are * marked dirty and written out to disk, it doesn't * really matter. */ pagefault_disable(); copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); pagefault_enable(); flush_dcache_page(page); The patch introducing this change above is discussed here: http://lkml.org/lkml/2007/2/4/26 It seems to me that mounting with 'noatime' -best option- or using ext2 will have the same end results and both are enough to avoid this deadlock specifically,though others deadlocks can still happen. Flavio
Created attachment 306017 [details] pagefault disable/enable stuff. Patch 1/2. Ok heres the first patch in a series in 2.
Created attachment 306018 [details] the mmap_sem deadlock prevention patch. Path 2/2 Here is the actual patch that fixes the problem. Please test with both of these patches applied and make sure the problem goes away.
Need testing on this to make sure it fixes the problem.
Hello, Is there a test plan for this somewhere?
Created attachment 308564 [details] add missing header Josef, It seems to be missing a chunk like below on every highmem.c: --- linux-2.6.9.orig/arch/i386/mm/highmem.c +++ linux-2.6.9/arch/i386/mm/highmem.c @@ -1,4 +1,5 @@ #include <linux/highmem.h> +#include <asm/uaccess.h> void *kmap(struct page *page) { See the attached patch. What you think? Flavio
it compiled, so I don't think so, but you can throw it in there if you like.
Created attachment 310833 [details] ENOMEM logs Josef, The test kernel (under cvs tag kernel-2_6_9-55_EL_it169325_2) has triggered the OOM killer in the first test round and -ENOMEM on the second round. from logs: Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB HighMem: 22*4kB 7*8kB 3*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 640kB Swap cache: add 66421941, delete 66005408, find 28502300/34508086, race 0+100 0 bounce buffer pages Free swap: 6575260kB 524288 pages of RAM 294912 pages of HIGHMEM 5546 reserved pages 102724 pages shared 416705 pages swap cached ENOMEM in journal_alloc_journal_head, retrying. bash: page allocation failure. order:0, mode:0x50 [<c0144570>] __alloc_pages+0x294/0x2a6 [<c014459a>] __get_free_pages+0x18/0x24 [<c01470c0>] kmem_getpages+0x1c/0xbb [<c0147c0e>]524288 pages of RAM Free pages: 13180kB (640kB HighMem) Active:386596 inactive:115205 dirty:8351 writeback:11410 unstable:0 free:3295 slab:5754 mapped:478668 pagetables:3347 DMA free:12540kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:54999 all_unreclaimable? yes protections[]: 0 0 0 Normal free:0kB min:928kB low:1856kB high:2784kB active:643060kB inactive:200476kB present:901120kB pages_scanned:495 all_unreclaimable? no protections[]: 0 0 0 HighMem free:640kB min:512kB low:1024kB high:1536kB active:903324kB inactive:260344kB present:1179648kB pages_scanned:462 all_unreclaimable? no protections[]: 0 0 0 DMA: 3*4kB 4*8kB 3*16kB 3*32kB 3*64kB 3*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 2*4096kB = 12540kB unix_stream_sendmsg+0x277/0x33a Well, the fact that you said it compiled for you made me think if we are really on the same kernel version.
Updating PM score.
Created attachment 328208 [details] A proposal patch from Fujitsu. Comments from IT: I have created new patch which fixes this problem only. (I create it from scratch and it isn't backported from upstream kernel.) It is small patch, and so it seems not to cause a severe performance degradation. I have confirmed that any problems didn't happen on the system on which this patch was applied while I was running this problem's reproducer for 5 days. Therefore I think this patch fixes this problem. Please confirm it and apply it on RHEL4.8.
Peter Zijlstra reviewed this patch as well, made the comment that you don't need the get_file/put_file around file_accessed, so you can remove that as well. Other than that and my comment about not dropping mmap_sem if we are already mounted noatime that should be good. I look forward to your updated patch. I assume this is being tested as well? I will do my own testing for sure, I'd just like to make sure you are satisfied with it as well.
I received a revised patch as per Josef's comments from Japan this morning, but I'm going to ask them to re-revise it as per Peter's comments. I'll put a revised patch here maybe tomorrow. Thanks for the comments/feedback!
Created attachment 328578 [details] A revised proposal patch from Fujitsu Fujitsu engineer agreed to the comments/feedback and revised the patch accordingly. They tested it and confirmed it worked fine. Please review it.
that looks good, I will post it today.
Created attachment 328771 [details] updated patch. got some feedback and reworked the patch. Please test this patch and let me know if it still works for you guys.
Created attachment 328891 [details] Log of testing on RH updated patch. Fujitsu ran their testset on the kernel with the patch applied. It immediately winded up with making a core file and the kernel panic (the log is attached). Fujitsu doesn't think the patch is good since do_mmap_pgoff() requires the caller to hold mmap_sem. The core and panic happened because mmap_sem was dropped at here. 987 error = file->f_op->mmap(file, vma); Therefore, Fujitsu's patch was made to drop mmap_sem at the end of do_mmap_pgoff() (and do update_atime).
Committed in 79.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
It's a bit confusing from the sequence of events in the bz, but the patch in c37 is the one we're going with, the patch in c39 was an attempt to make it pretty but obviously it didn't work, and it wasn't actually ever posted to rh-kernel list, just the one in c37 was. Fujitsu, would you mind testing the latest snapshot kernel to verify that you are still good with this fix? Thanks.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1024.html