Bug 439548 - A deadlock can occur between mmap/munmap and journaling(ext3).
A deadlock can occur between mmap/munmap and journaling(ext3).
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel (Show other bugs)
All Linux
high Severity high
: rc
: ---
Assigned To: Josef Bacik
Martin Jenner
Depends On:
Blocks: 391511 409961 461297
  Show dependency treegraph
Reported: 2008-03-28 23:02 EDT by Flavio Leitner
Modified: 2014-01-07 02:34 EST (History)
9 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2009-05-18 15:18:19 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
pagefault disable/enable stuff. Patch 1/2. (14.58 KB, patch)
2008-05-19 17:34 EDT, Josef Bacik
no flags Details | Diff
the mmap_sem deadlock prevention patch. Path 2/2 (6.61 KB, patch)
2008-05-19 17:35 EDT, Josef Bacik
no flags Details | Diff
add missing header (1.02 KB, patch)
2008-06-06 16:14 EDT, Flavio Leitner
no flags Details | Diff
ENOMEM logs (560.00 KB, application/octet-stream)
2008-07-02 13:43 EDT, Flavio Leitner
no flags Details
A proposal patch from Fujitsu. (3.45 KB, patch)
2009-01-05 11:05 EST, Keiichiro Tokunaga
no flags Details | Diff
A revised proposal patch from Fujitsu (3.54 KB, patch)
2009-01-09 13:02 EST, Keiichiro Tokunaga
no flags Details | Diff
updated patch. (2.54 KB, patch)
2009-01-12 14:16 EST, Josef Bacik
no flags Details | Diff
Log of testing on RH updated patch. (8.23 KB, text/plain)
2009-01-13 11:52 EST, Keiichiro Tokunaga
no flags Details

  None (edit)
Description Flavio Leitner 2008-03-28 23:02:38 EDT
Description of problem:

During executing do_mmap_pgoff and using ext3 Filesystem, a deadlock can occur.

This deadlock requires 3 processes.
Each of 3 processes wants resources which the others have gotten.
(Each of 3 processes waits till it can get the resource which the other have

The following describes which each of 3 processes waits for.

1) A certain process acquired the write-lock of rw_semaphore 
(mm_struct.mmap_sem), and waits till "kjournald" begins to commit a 
transaction of journal. (and waits till commit_transaction->t_state 
turns into other value different from "T_LOCKED".)

[backtrace example]
-> do_mmap2()
   [down_write(&current->mm->mmap_sem)] *1
   -> do_mmap_pgoff()
      -> generic_file_mmap()
         -> update_atime()
            -> ...
               -> journal_start()
                  -> start_this_handle() *3
                     -> schedule()

2) The other process incremented an update-counter (but doesn't decrement it),
(The other process incremented commit_transaction->t_updates) and waits till 
it can acquire the read-lock of rw_semaphore acquired by 1)'s process.

[backtrace example]
 -> vfs_write()
    -> ...
       -> generic_file_buffered_write()
          [ext3_prepare_write() -> ... -> journal_start()] *2
          -> filemap_copy_from_user()
             -> __copy_from_user()
                -> __direct_copy_from_user_inatomic()
                   -> __copy_from_user_ll()
                      = PAGE FAULT =
                      -> do_page_fault()
                         -> down_read(&mm->mmap_sem) *1
                            -> ... -> schedule()

3) "kjournald" set commit_transaction->t_state to "T_LOCKED"
 and waits till 2)'s process changes commit_transaction->t_updates into 0.

[backtrace example]
 -> journal_commit_transaction()
    [commit_transaction->t_state = T_LOCKED]  *3
    -> while (commit_transaction->tupdates) { *2
         ... }

Thus, each of 3 processes cannot run anymore. (deadlock)
Comment 6 Flavio Leitner 2008-03-30 22:27:43 EDT

I think the problem is that page fault in backtrace number (2) and checking
upstream code we can see the difference in generic_file_buffered_write()
which does:
        if (a_ops->write_begin)
                status = generic_perform_write(file, &i, pos);
                status = generic_perform_write_2copy(file, &i, pos);

and both functions above takes care to avoid page faults and so this 
deadlock too, doing as below:
	 * Must not enter the pagefault handler here, because
	 * we hold the page lock, so we might recursively
	 * deadlock on the same lock, or get an ABBA deadlock
	 * against a different lock, or against the mmap_sem
	 * (which nests outside the page lock).  So increment
	 * preempt count, and use _atomic usercopies.
	 * The page is uptodate so we are OK to encounter a
	 * short copy: if unmodified parts of the page are
	 * marked dirty and written out to disk, it doesn't
	 * really matter.

         copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);

The patch introducing this change above is discussed here:

It seems to me that mounting with 'noatime' -best option- or using ext2 
will have the same end results and both are enough to avoid this deadlock
specifically,though others deadlocks can still happen.

Comment 9 Josef Bacik 2008-05-19 17:34:05 EDT
Created attachment 306017 [details]
pagefault disable/enable stuff.  Patch 1/2.

Ok heres the first patch in a series in 2.
Comment 10 Josef Bacik 2008-05-19 17:35:11 EDT
Created attachment 306018 [details]
the mmap_sem deadlock prevention patch.  Path 2/2

Here is the actual patch that fixes the problem.  Please test with both of
these patches applied and make sure the problem goes away.
Comment 11 Josef Bacik 2008-05-27 10:09:23 EDT
Need testing on this to make sure it fixes the problem.
Comment 12 R.H. 2008-06-04 10:45:42 EDT
Is there a test plan for this somewhere?
Comment 14 Flavio Leitner 2008-06-06 16:14:52 EDT
Created attachment 308564 [details]
add missing header


It seems to be missing a chunk like below on every highmem.c:

--- linux-2.6.9.orig/arch/i386/mm/highmem.c
+++ linux-2.6.9/arch/i386/mm/highmem.c
@@ -1,4 +1,5 @@
 #include <linux/highmem.h>
+#include <asm/uaccess.h>

 void *kmap(struct page *page)

See the attached patch. 
What you think?

Comment 15 Josef Bacik 2008-06-11 14:02:30 EDT
it compiled, so I don't think so, but you can throw it in there if you like.
Comment 19 Flavio Leitner 2008-07-02 13:43:32 EDT
Created attachment 310833 [details]


The test kernel (under cvs tag kernel-2_6_9-55_EL_it169325_2) has triggered the

OOM killer in the first test round and -ENOMEM on the second round.

from logs:
Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 0kB
HighMem: 22*4kB 7*8kB 3*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 640kB
Swap cache: add 66421941, delete 66005408, find 28502300/34508086, race 0+100
0 bounce buffer pages
Free swap:	 6575260kB
524288 pages of RAM
294912 pages of HIGHMEM
5546 reserved pages
102724 pages shared
416705 pages swap cached
ENOMEM in journal_alloc_journal_head, retrying.
bash: page allocation failure. order:0, mode:0x50
 [<c0144570>] __alloc_pages+0x294/0x2a6
 [<c014459a>] __get_free_pages+0x18/0x24
 [<c01470c0>] kmem_getpages+0x1c/0xbb
 [<c0147c0e>]524288 pages of RAM

Free pages:	  13180kB (640kB HighMem)
Active:386596 inactive:115205 dirty:8351 writeback:11410 unstable:0 free:3295
slab:5754 mapped:478668 pagetables:3347
DMA free:12540kB min:16kB low:32kB high:48kB active:0kB inactive:0kB
present:16384kB pages_scanned:54999 all_unreclaimable? yes
protections[]: 0 0 0
Normal free:0kB min:928kB low:1856kB high:2784kB active:643060kB
inactive:200476kB present:901120kB pages_scanned:495 all_unreclaimable? no
protections[]: 0 0 0
HighMem free:640kB min:512kB low:1024kB high:1536kB active:903324kB
inactive:260344kB present:1179648kB pages_scanned:462 all_unreclaimable? no
protections[]: 0 0 0
DMA: 3*4kB 4*8kB 3*16kB 3*32kB 3*64kB 3*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB
2*4096kB = 12540kB

Well, the fact that you said it compiled for you made me think if we are 
really on the same kernel version.
Comment 28 RHEL Product and Program Management 2008-09-03 08:54:06 EDT
Updating PM score.
Comment 33 Keiichiro Tokunaga 2009-01-05 11:05:42 EST
Created attachment 328208 [details]
A proposal patch from Fujitsu.

Comments from IT:

I have created new patch which fixes this problem only.
(I create it from scratch and it isn't backported from upstream kernel.)
It is small patch, and so it seems not to cause a severe performance degradation.

I have confirmed that any problems didn't happen on the system on which this
patch was applied while I was running this problem's reproducer for 5 days.

Therefore I think this patch fixes this problem.
Please confirm it and apply it on RHEL4.8.
Comment 35 Josef Bacik 2009-01-08 13:36:55 EST
Peter Zijlstra reviewed this patch as well, made the comment that you don't need the get_file/put_file around file_accessed, so you can remove that as well.  Other than that and my comment about not dropping mmap_sem if we are already mounted noatime that should be good.  I look forward to your updated patch.  I assume this is being tested as well?  I will do my own testing for sure, I'd just like to make sure you are satisfied with it as well.
Comment 36 Keiichiro Tokunaga 2009-01-08 16:44:00 EST
I received a revised patch as per Josef's comments from Japan this morning, but I'm going to ask them to re-revise it as per Peter's comments.  I'll put a revised patch here maybe tomorrow.  Thanks for the comments/feedback!
Comment 37 Keiichiro Tokunaga 2009-01-09 13:02:37 EST
Created attachment 328578 [details]
A revised proposal patch from Fujitsu

Fujitsu engineer agreed to the comments/feedback and revised the patch accordingly.   They tested it and confirmed it worked fine.  Please review it.
Comment 38 Josef Bacik 2009-01-12 09:43:50 EST
that looks good, I will post it today.
Comment 39 Josef Bacik 2009-01-12 14:16:04 EST
Created attachment 328771 [details]
updated patch.

got some feedback and reworked the patch.  Please test this patch and let me know if it still works for you guys.
Comment 40 Keiichiro Tokunaga 2009-01-13 11:52:52 EST
Created attachment 328891 [details]
Log of testing on RH updated patch.

Fujitsu ran their testset on the kernel with the patch applied.  It immediately winded up with making a core file and the kernel panic (the log is attached).

Fujitsu doesn't think the patch is good since do_mmap_pgoff() requires the caller to hold mmap_sem.  The core and panic happened because mmap_sem was dropped at here.

  987      error = file->f_op->mmap(file, vma);

Therefore, Fujitsu's patch was made to drop mmap_sem at the end of do_mmap_pgoff() (and do update_atime).
Comment 42 Vivek Goyal 2009-01-19 14:57:59 EST
Committed in 79.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
Comment 44 Josef Bacik 2009-04-17 12:01:46 EDT
It's a bit confusing from the sequence of events in the bz, but the patch in c37 is the one we're going with, the patch in c39 was an attempt to make it pretty but obviously it didn't work, and it wasn't actually ever posted to rh-kernel list, just the one in c37 was.  Fujitsu, would you mind testing the latest snapshot kernel to verify that you are still good with this fix?  Thanks.
Comment 48 errata-xmlrpc 2009-05-18 15:18:19 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.


Note You need to log in before you can comment on or make changes to this bug.