Bug 471613 - Corruption on ext3/xfs with O_DIRECT and unaligned user buffers

Status: CLOSED CANTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Andrea Arcangeli
QA Contact: Red Hat Kernel QE team
Keywords: ZStream
Duplicates: 512100
Depends On:
Blocks: 477770 483701 485920 486921 506684
Reported: 2008-11-14 12:10 EST by Tim LaBerge
Modified: 2014-01-17 07:27 EST
CC: 25 users
Fixed In Version:
Doc Type: Bug Fix
Clone Of:
Cloned To: 506684
Last Closed: 2010-01-06 10:20:30 EST

Attachments
- Test program (6.43 KB, text/plain), 2008-11-14 12:10 EST, Tim LaBerge
- patch to test (4.09 KB, patch), 2008-12-18 07:43 EST, Andrea Arcangeli
- same as before but take page lock around pageswapcache test (4.43 KB, patch), 2008-12-18 09:03 EST, Andrea Arcangeli
- update (4.42 KB, patch), 2008-12-18 10:16 EST, Andrea Arcangeli
- Potential patch to close a suspected race in the fork path (4.22 KB, patch), 2008-12-22 15:40 EST, Jeffrey Moyer
- Alternative patch that fixes this problem (1.55 KB, patch), 2009-01-06 17:23 EST, Larry Woodman
- same as before but block any thread that can modify the page while fork copies the page (5.12 KB, patch), 2009-01-13 10:54 EST, Andrea Arcangeli
- updated to address memleak (5.15 KB, patch), 2009-01-13 13:36 EST, Andrea Arcangeli
- The patch used to create the latest kernel (2.24 KB, patch), 2009-01-22 09:54 EST, Larry Woodman
- New patch to try (1.59 KB, patch), 2009-01-22 17:38 EST, Larry Woodman
- Another reproducer program (5.52 KB, text/plain), 2009-01-28 10:45 EST, Jeffrey Moyer
- diolock2.patch (13.56 KB, patch), 2009-01-29 05:27 EST, Moritoshi Oshiro
- Another variant of the pagesem patch (10.04 KB, patch), 2009-01-30 16:34 EST, Jeffrey Moyer
- final patch that addresses the pagevec false positives (9.00 KB, patch), 2009-02-02 00:41 EST, Andrea Arcangeli
- remove FOLL_WRITE unreliable check (9.51 KB, patch), 2009-02-03 10:11 EST, Andrea Arcangeli
- fix hugetlb (14.06 KB, patch), 2009-02-06 12:40 EST, Andrea Arcangeli
- flush the tlb after marking pte wrprotect and before copying the page in fork (19.44 KB, patch), 2009-02-09 10:44 EST, Andrea Arcangeli
- patch to close the race to be as safe as 2.4 with PG_pinned (or 2.6.7 with count vs mapcount check in vmscan loop to keep ptes pinned if pointing to pages under O_DIRECT) (2.26 KB, patch), 2009-02-11 20:23 EST, Andrea Arcangeli
- same as before but removes the lockdep false positive and corrects minor issue in -ENOMEM path which would never trigger (20.03 KB, patch), 2009-02-17 14:16 EST, Andrea Arcangeli
- this should work well on 32bit with highmem (20.87 KB, patch), 2009-02-18 14:18 EST, Andrea Arcangeli
- patch is reworked to handle fast-gup changes (48.02 KB, patch), 2009-05-14 10:17 EDT, Don Zickus
- fix tiny swapping race in fork-vs-gup patch (1.85 KB, patch), 2009-06-04 14:58 EDT, Andrea Arcangeli

Description Tim LaBerge 2008-11-14 12:10:15 EST
Created attachment 323604 [details]
Test program

Description of problem:
The man page for open(2) states the following:

 O_DIRECT (Since Linux 2.6.10)
              Try to minimize cache effects of the I/O to and from this  file.
              In  general  this  will degrade performance, but it is useful in
              special situations, such  as  when  applications  do  their  own
              caching.   File I/O is done directly to/from user space buffers.
              The I/O is synchronous, that is, at the completion of a  read(2)
              or write(2), data is guaranteed to have been transferred.  Under
              Linux 2.4 transfer sizes, and the alignment of user  buffer  and
              file  offset  must all be multiples of the logical block size of
              the file system.  Under Linux 2.6 alignment to  512-byte  bound-
              aries suffices.

However, it appears that data corruption may occur when a multithreaded
process reads into a non-page size aligned user buffer. A test program which
reliably reproduces the problem on ext3 and xfs is attached.

The program creates, patterns, reads, and verifies a series of files.

In the read phase, a file is opened with O_DIRECT n times, where n is the
number of cpu's. A single buffer large enough to contain the file is allocated
and patterned with data not found in any of the files. The alignment of the
buffer is controlled by a command line option.

Each file is read in parallel by n threads, where n is the number of cpu's.
Thread 0 reads the first page of data from the file into the first page
of the buffer, thread 1 reads the second page of data into the second page of
the buffer, and so on.  Thread n - 1 reads the remainder of the file into the
remainder of the buffer.

After a thread reads data into the buffer, it immediately verifies that the
contents of the buffer are correct. If the buffer contains corrupt data, the
thread dumps the data surrounding the corruption and calls abort(). Otherwise,
the thread exits.

Crucially, before the reader threads are dispatched, another thread is started
which calls fork()/msleep() in a loop until all reads are completed. The child
created by fork() does nothing but call exit(0).

A command line option controls whether the buffer is aligned.  In the case where
the buffer is aligned on a page boundary, all is well. In the case where the
buffer is aligned on a page + 512 byte offset, corruption is seen frequently.

I believe that what is happening is that in the direct IO path, because the
user's buffer is not aligned, some user pages are being mapped twice. When a
fork() happens in between the calls to map the page, the page will be marked as
COW. When the second map happens (via get_user_pages()), a new physical page
will be allocated and copied.

Thus, there is a race between the completion of the first read from disk (and
write to the user page) and get_user_pages() mapping the page for the second
time. If the write does not complete before the page is copied, the user will
see stale data in the first 512 bytes of this page of their buffer. Indeed,
this is the corruption most frequently seen. (It's also possible for the race
to be lost the other way, so that the last 3584 bytes of the page are stale.)

The attached program dma_thread.c (which is a heavily modified version of a
program provided by a customer seeing this problem) reliably reproduces the
problem on any multicore linux machine on both ext3 and xfs, although any
filesystem using the generic blockdev_direct_IO() routine is probably
vulnerable.

I've sent mail about this problem to linux-mm and linux-fsdevel.


Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:
1. Compile the attached program
2. Run it with no args on a multi-core intel box
3. Run it with -a 512
  
Actual results:

Abort with memory corruption detection message.


Expected results:

No abort.

Additional info:
Comment 1 Phil Knirsch 2008-11-19 08:35:08 EST
Reassigning to kernel component.
Comment 2 Bryn M. Reeves 2008-12-17 14:18:09 EST
Discussion on linux-fsdevel:

http://kerneltrap.org/mailarchive/linux-fsdevel/2008/11/14/4099714
Comment 7 Andrea Arcangeli 2008-12-18 07:43:55 EST
Created attachment 327319 [details]
patch to test
Comment 9 Andrea Arcangeli 2008-12-18 09:03:47 EST
Created attachment 327325 [details]
same as before but take page lock around pageswapcache test

I think in the previous patch I forgot to take the page lock around the PageSwapCache test. The implications aren't clear; I guess in the long term there should be a smp_wmb() between page_cache_get and SetPageSwapCache in __add_to_swap_cache, and an smp_rmb between the PageSwapCache check and the page_count(), to remove this trylock op.
Comment 10 Andrea Arcangeli 2008-12-18 10:16:10 EST
Created attachment 327331 [details]
update

Testing revealed a minor glitch in the patch, so I have to attach another update. This version works and fixes the bug here according to the reproducer program.
Comment 11 Andrea Arcangeli 2008-12-18 10:40:19 EST
Posted the last version to the linux-mm/fsdevel thread, as this should be applied to mainline too. The mainline problem is that it can't be fixed until some core change is done to get_user_pages_fast too. RHEL doesn't have gup_fast, so this is enough for RHEL.
Comment 13 Issue Tracker 2008-12-18 20:29:05 EST
Forwarding FJ's comment to bz:
---

> So a question to Fujitsu, does the customer have a performance requirement
> they need to meet?

Yes. The customer has extremely strict performance requirements.

Please let me know the following:
 - What functions are changed?
 - How much performance decreases by the patch?

I want to review the patch, so please send me the current
patch ASAP.

Regards,
Yasuaki Ishimatsu
---


This event sent from IssueTracker by moshiro@redhat.com 
 issue 249537
Comment 14 Jeffrey Moyer 2008-12-22 15:40:25 EST
Created attachment 327689 [details]
Potential patch to close a suspected race in the fork path.

I tested this patch to libc with a patched kernel, and I am still experiencing the hangs in the test program.  Now, it is entirely possible that the patch is not correct, so I hope others will review and verify it.

A build of glibc with this patch is located in brew:

/mnt/redhat/brewroot/scratch/jmoyer/task_1624575
Comment 15 Jeffrey Moyer 2008-12-22 15:43:52 EST
Upstream discussion can be found here:

  http://marc.info/?l=linux-fsdevel&m=122668502608816&w=2
Comment 16 Jeffrey Moyer 2008-12-22 16:27:07 EST
This new hang is a bug in my libc patch, as it occurs both on kernels with and without Andrea's patch.  Here is a backtrace of the hung process.  It is the initial thread doing a join on one of the workers:

#0  0x00002b556d416e6e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002b556d3ae79a in _L_lock_13544 () from /lib64/libc.so.6
#2  0x00002b556d3acdf6 in free_atfork (mem=0x2b556d6899c0, 
    caller=<value optimized out>) at arena.c:220
#3  0x00002b556d3ad8a7 in *__GI___libc_free (mem=0xdad33b0) at malloc.c:3555
#4  0x00002b556cf0f854 in *__GI__dl_deallocate_tls (tcb=0x734e6940, 
    dealloc_tcb=false) at dl-tls.c:490
#5  0x00002b556d123279 in __deallocate_stack (pd=0xdad33c0)
    at allocatestack.c:239
#6  0x00002b556d1246ca in pthread_join (threadid=1965988160, thread_return=0x0)
    at pthread_join.c:110
#7  0x0000000000401509 in main (argc=5, argv=0x7fff3dba9588)
    at dma_thread.c:279

I'll keep digging.
Comment 17 Jakub Jelinek 2008-12-22 16:33:24 EST
The #c14 patch is wrong: 1) we don't want to work around kernel bugs in glibc;
2) fork can be called from async signal handlers, so taking a lock does not fly: what if the async signal interrupted something that held the lock, or worse, was in the middle of acquiring it?
Comment 18 Andrea Arcangeli 2008-12-22 17:11:30 EST
w.r.t. #c17, a per-thread recursive lock takes care of any async signal handler lock recursion.

Now the real question is: how can it be safe to call __reclaim_stacks in the fork-child without having __reclaim_stacks start with an lll_unlock? If __reclaim_stacks doesn't start with an lll_unlock, it means fork snapshotted the contents of the lists in an incoherent state, leading to unpredictable behavior of __reclaim_stacks while looping over the lists.

Jakub can you answer that? Then we'll figure out how to fix it with a recursive lock or something similar.
Comment 19 Ulrich Drepper 2008-12-22 18:06:39 EST
There is absolutely no satisfactory explanation for why anyone wants to change glibc.  As Jakub said, there definitely will be no work-arounds of kernel bugs.  If there is a problem provide a test program and, most importantly, open a new bug.  This sloppy attitude of people recently to arbitrarily add unrelated comments to bugs has to stop.  I'm removing myself from this bug, I'll not look at kernel bugs.  You know what to do to get my attention the right way.
Comment 20 Andrea Arcangeli 2008-12-22 18:12:34 EST
If I'm right, __reclaim_stacks can't possibly be safe without starting with an lll_unlock; the same deadlock that is reproduced with the attached bugfix is also reproducible with the official 2.6.28 kernel and any other 2.4/2.6 kernel out there. I thought this was obvious but perhaps not.

So while I can't guarantee this isn't a bug introduced by the patch, at the moment we cannot exclude that this is an incredibly longstanding bug in nptl that cannot be fixed in any way by changing the kernel (besides binary patching glibc at runtime, at least).

It would help if you could explain why __reclaim_stacks is safe to start without an lll_unlock, so we understand why, and we can return to debugging whatever kernel problem we may have introduced. At the moment this looks like a glibc bug that was exposed by a slight slowdown in fork execution speed during O_DIRECT over anon memory.
Comment 21 Andrea Arcangeli 2008-12-22 18:21:13 EST
To be more specific, I tend to think that to reproduce the bug without my patch, you just have to have the thread running fork execute the ptep_set_wrprotect in the middle of a list_del operation running in another thread.

The patch replaces the ptep_set_wrprotect with a memcpy of the page, so it increases the window for the race to trigger; still, it is a glibc bug with or without my patch, and it needs fixing either way as far as I can tell.

Now I can try to write a program that may manage to reproduce this on 2.6 mainline, but I thought I could ask for help in understanding how __reclaim_stacks is guaranteed to get a coherent copy of the two lists before doing that, especially if it's true there is no glibc bug whatsoever to fix here.
Comment 22 Andrea Arcangeli 2008-12-22 18:33:13 EST
For your convenience I created bug #477705 if you don't like to discuss any technical detail of __reclaim_stacks here. Thanks.
Comment 23 Andrea Arcangeli 2008-12-22 18:50:48 EST
Jeffrey, I would also suggest removing the stack_cache_lock = LLL_LOCK_INITIALIZER, as it's not necessary with lll_unlock (or you can remove the lll_unlock in the first place). The fix is to lll_lock/unlock in the fork-parent around the fork syscall; the child doesn't actually need to literally lll_unlock as long as it later does stack_cache_lock = LLL_LOCK_INITIALIZER. So my question of why __reclaim_stacks doesn't start with lll_unlock was really a simple way to explain what the problem is (i.e. the lock wasn't held during fork execution), but in practice it's the only bit that's not actually going to make a difference ;). I think it's cleaner to lll_unlock instead of stack_cache_lock = LLL_LOCK_INITIALIZER, but that's just my preference (so if there's a debug trap checking the double-unlock condition, it will work there too).

Why it deadlocks I've no idea at the moment, but I wonder what the LLL_PRIVATE is about; I don't see it in my tree, so I wonder if the deadlock is related to the LLL_PRIVATE addition.
Comment 25 Jeffrey Moyer 2009-01-06 10:38:23 EST
Jakub, could you please give your opinion on comment #21?  It appears to me that Andrea is right, that there is a small window for list corruption.  And, if we are to fix the kernel problem in the way suggested, then we open that window.  Really, direction from those familiar with the libc portion of this code would go a long way to helping us better understand the problem.

Thanks in advance,
Jeff
Comment 26 Larry Woodman 2009-01-06 17:23:42 EST
Created attachment 328324 [details]
Alternative patch that fixes this problem.


This patch fixes this problem by locking the page over the DIO and locking the page in do_wp_page() over the COW fault.  This might be unacceptable though because you will not be able to issue multiple DIO operations within the same page at the same time.  This is a performance degradation in some cases...
Comment 27 Andrea Arcangeli 2009-01-07 13:26:56 EST
To fix the race with the page lock, it must be taken by gup before returning. gup is required to run under mmap_sem (read or write); fork runs with write mmap_sem. Otherwise fork may share the page (with page lock) even though the O_DIRECT path has already taken the page in gup but not started the bio I/O yet. Then when bio I/O is started, the shared page contents change.
Comment 28 Andrea Arcangeli 2009-01-08 12:30:48 EST
One problem (maybe not the only one, as this is tricky) is that atomic_write_barrier isn't enough. That only covers the compiler, not the CPU's out-of-order stores. However, the problem is that the reader is actually the kernel, and it's a memcpy, so this clearly shows the current kernel patch isn't OK.

For this to be safe, the kernel patch must be changed too. Notably, what we have to change is to mark the pte readonly (which is needed anyway for the trick to fix gup-fast lockless), flush the SMP TLBs and _then_ memcpy, and finally the key: mark the pte of the parent writeable again before dropping the PT lock. That's the fundamental part of the fix: otherwise the race triggers again, as the other threads may have tried to write to a different piece of the same page, but because the PT lock was held those page faults will be aborted because the pte is found writeable or because pte_same fails.

The only way the page will be allowed to change while fork is doing the memcpy in do_wp_page is if there's in-flight O_DIRECT, but that's irrelevant in terms of memory ordering, and it won't be allowed to overwrite glibc-owned memory areas or it'd be broken anyway.

So I'll soon post an updated kernel patch for RHEL and mainline (mainline will fix gup-fast too). Perhaps the glibc bug won't be visible anymore if we halt any further update on the parent while the memcpy is in progress, but it could also be that we need both fixes and that the hang wasn't visible before because the memory corruption triggered first because of the kernel bug.

In short, my next patch will make it enough for glibc to do atomic_write_barrier after setting in_flight_stack, without worrying about SMP out-of-order execution (because the parent thread won't execute on the other cpu anymore while the memcpy runs on the fork cpu).
Comment 33 Andrea Arcangeli 2009-01-13 10:54:57 EST
Created attachment 328875 [details]
same as before but block any thread that can modify the page while fork copies the page

This seems to work OK for me. I didn't yet address any performance issue in this patch; in fact the pre_cow_page allocation in the fast path will make it even slower than the previous patch. But even with this patch I still get the same hang in glibc __reclaim_stacks, and this is with an unpatched glibc. I'll worry about performance after the glibc hang is gone. So the next step is to test this new kernel patch on top of the glibc changes. If it runs stable, it means the workload triggered both the kernel bug and the glibc bug at the same time (except the kernel bug was easy to trigger, so the glibc bug became visible only after the kernel was fixed).

Glibc is now guaranteed that each page is frozen and copied atomically from parent to child, so the list_add/del will be halted and a compiler memory barrier is enough.

However note: if the prev pointer is in one page and the next pointer is in another page, no atomicity or ordering guarantee is provided at all in terms of what will be visible in the child if a thread is modifying list.next and another thread is modifying list.prev at the same time that fork runs (which should never happen because of the lock that glibc takes before trying to modify any page, so I doubt this is a problem).
Comment 34 Jeffrey Moyer 2009-01-13 12:14:37 EST
Andrea,

Your patch from comment #33 still hangs with the patched glibc (2.9.90-2).
Comment 35 Andrea Arcangeli 2009-01-13 13:26:15 EST
Now I wonder if this is a glibc or a kernel issue. I would tend to think the kernel should do things right now, but maybe not yet. I wonder if leaving it running in a loop long enough with a 4096 alignment would also trigger it (4096 alignment doesn't invoke any of the slow-path code, as there's no page sharing between O_DIRECT data and glibc/program data structures).

Btw, after running it for a while on my system I noticed there's a minor memleak in the code: just add a page_cache_release(old_page) after cow_user_page(pre_cow_page, old_page, address); in fork_pre_cow and it'll fix it. This is totally unrelated to our issue and it definitely can't explain our problem, but you'll hit it if you leave it running long enough.
Comment 36 Andrea Arcangeli 2009-01-13 13:36:17 EST
Created attachment 328903 [details]
updated to address memleak

This update can't make the __reclaim_stacks hang go away, but it will help to run it in a loop ;)
Comment 39 Jeffrey Moyer 2009-01-14 13:02:12 EST
(In reply to comment #26)
> Created an attachment (id=328324) [details]
> Alternative patch that fixes this problem.
> 
> 
> This patch fixes this problem by locking the page over the DIO and locking the
> page in do_wp_page() over the COW fault.  This might be unacceptable though
> because you will not be able to issue multiple DIO operations within the same
> page at the same time.  This is a performance degradation in some cases...

Can we get some performance testing from Fujitsu on this patch?

Thanks!
Comment 46 Larry Woodman 2009-01-21 07:42:45 EST
Can you test the kernel in:
>>>http://people.redhat.com/~lwoodman/RHEL5/

This kernel breaks the kABI by adding a semaphore to the page structure, but should allow multiple DIO operations to the same page at the same time.

Larry Woodman
Comment 48 Larry Woodman 2009-01-22 09:54:31 EST
Created attachment 329712 [details]
The patch used to create the latest kernel.


I'll build an IA64 kernel, but if you need one faster than our build system can create one, this patch was used to build that kernel.

Larry
Comment 49 Andrea Arcangeli 2009-01-22 12:06:27 EST
Hmm, I don't think this will fix it; did you test it? Any lock should be taken at the latest between the get_user_pages return and the release of the mmap_sem in dio_refill_pages. Besides, growing the size of the page struct by 4 bytes is likely not OK; PG_lock and preventing simultaneous I/O to the page is probably better.

Still, even if this seems to not fix the race but only reduce the window for the race to trigger, if it practically makes the race not reproducible anymore, it'll be interesting to run this to see if we still get a hang in glibc __reclaim_stacks or not. If we still get a hang in __reclaim_stacks, then it's more likely the glibc fix wasn't OK yet.
Comment 50 Larry Woodman 2009-01-22 14:33:12 EST
I did test this and it doesn't hang.  However, I know this is not a real solution, especially for RHEL5, but a tool to determine whether the ~25% performance degradation that was reported with the patch that used lock_page() was caused by 1.) not allowing multiple callers doing direct IO through dio_bio_submit() at the same time, or 2.) not allowing 2 or more callers through dio_bio_submit() and do_wp_page() at the same time.  If blocking a caller in do_wp_page() is causing the performance degradation, then we have to revisit our (my) understanding of what they are most concerned about: 1.) multiple DIO calls to the same page, or 2.) forking while doing DIO.

Can someone test the IA64 kernel in:
>>>http://people.redhat.com/~lwoodman/RHEL5/

Larry
Comment 51 Keiichiro Tokunaga 2009-01-22 15:05:37 EST
Larry,

I built a ia64 kernel with your patch (from #48) applied and tested it on the
PRIMEQUEST in the lab.  I got the following message from the reproducer program,
so I think the problem has not been fixed with the patch yet.

  # ./pthread-read-and-fork /dev/sdb2
  Bad data
  rd_res = 512
  pattern = 7
  buf = 7777777777
Comment 52 Keiichiro Tokunaga 2009-01-22 17:37:09 EST
BTW, the messages were output after about a two-hour run of the reproducer.  I'm
trying to reproduce it again, but no luck yet after one hour.
Comment 53 Larry Woodman 2009-01-22 17:38:13 EST
Created attachment 329747 [details]
New patch to try.


There might have been a problem with the previous patch unlocking the page before all the IO was complete; can someone try this patch ASAP???

Larry
Comment 54 Andrea Arcangeli 2009-01-22 17:48:59 EST
Until you move the lock between gup and up_read like mentioned in comment #49 it won't be fixed.

Write side of the semaphore should be moved in fork too.
Comment 55 Keiichiro Tokunaga 2009-01-22 18:23:48 EST
With the kernel with the patch from #53 applied, I got the error messages many times immediately after running the reproducer.
Comment 56 Andrea Arcangeli 2009-01-26 11:33:46 EST
I tracked down the performance regression. I think it is caused by the pagevec code that queues up anon pages in the per-cpu queues with an increased page_count that is released only later, when the page is added to the lru and the PageLRU bitflag is set.

I tried to add to the mapcount too, not just the page count, when the page is added to the pagevec and is waiting to be moved from the pagevec to the lru. However, that generates a locking nightmare. page_count vs mapcount was fragile enough without further pagevec references that are increased outside the PT lock, and it'd make things even more messy for gup-fast than they already are.

So I think to fix this we can re-add the PG_pinned of 2.4 for RHEL, and in mainline we need an atomic_t gup_count in the struct page (smaller than a semaphore, and it works well for gup-fast too) and a release_user_pages to free pages that are returned by get_user_pages. So it'll break both ABI and API to fix this well in mainline. That we can't do in RHEL, so PG_pinned is the only way. PG_pinned has the disadvantage that it'll only be used by O_DIRECT, but it'll cover the most important user. KVM would have the same issues in theory, but kvm uses MADV_DONTNEED, and because this bug isn't exploitable, it doesn't matter if the parent screws up if somebody changes the qemu binary in a broken way, even if /dev/kvm is world available. This bug can only make userland lose data; it won't crash or corrupt the kernel. So a band-aid that makes O_DIRECT safe against fork should be enough, and I doubt I can achieve real accuracy with the mess of page_count, page_mapcount, plus the pagevec get_page happening anytime and remaining for a long time on idle cpus.
Comment 57 Jeffrey Moyer 2009-01-26 16:20:41 EST
(In reply to comment #54)
> Until you move the lock between gup and up_read like mentioned in comment #49
> it won't be fixed.
> 
> Write side of the semaphore should be moved in fork too.

I believe that is immaterial to the current reproducer program since the child never examines the contents of the shared page.  Andrea, would you agree?
Comment 58 Andrea Arcangeli 2009-01-27 07:38:52 EST
The data loss in the parent isn't related to the child, right (the child, however, in __reclaim_stacks might actually even write data to anon pages; glibc can own anonymous memory too, and glibc runs in the fork child).

The problem is the parent: in do_wp_page we hold only mmap_sem in read mode. So if you want to keep the lock in do_wp_page, you'd need to take the write semaphore in do_wp_page before taking the PT lock, and release it after the pte is pointing to the new page with write permission. Secondly, the read side of the rwsem has to be taken _before_ gup; it's not enough anymore to take it after gup returns and before up_read(&mmap_sem). fork takes mmap_sem in write mode, so the per-page rwsem would extend the critical section opened by mmap_sem. But moving the per-page rwsem down_write into do_wp_page makes mmap_sem irrelevant for the race, so the per-page rwsem critical section has to grow and cover what was previously covered by mmap_sem.

I'd prefer to have fork penalized with forced cows (non-blocking) when the page is under I/O, rather than waiting for I/O in do_wp_page. But if you move the semaphore around (not just around the page copy; that's surely not enough), do_wp_page locking should fix it too!
Comment 59 Jeffrey Moyer 2009-01-27 14:44:34 EST
(In reply to comment #55)
> With the kernel with the patch from #53 applied, I got the error messages many
> times immediately after running the reproducer.

Keiichiro,

Did you download the kernel from Larry's people page or did you compile it yourself?  I ask because I downloaded the kernel rpm from Larry's people page and I cannot reproduce the problem.  The pthread-read-and-fork program has been running for 4 hours now without incident.
Comment 60 Andrea Arcangeli 2009-01-27 15:16:30 EST
Just a side note: on my hardware, to reproduce the glibc hang I need -w 40. I recommend testing with a larger number of threads, otherwise certain races may not trigger. Params:

-w 40 -b 512
Comment 61 Keiichiro Tokunaga 2009-01-27 15:52:07 EST
(In reply to comment #59)
> (In reply to comment #55)
> > With the kernel with the patch from #53 applied, I got the error messages many
> > times immediately after running the reproducer.
> 
> Keiichiro,
> 
> Did you download the kernel from Larry's people page or did you compile it
> yourself?  I ask because I downloaded the kernel rpm from Larry's people page
> and I cannot reproduce the problem.  The pthread-read-and-fork program has been
> running for 4 hours now without incident.

I tried two patches, one from #48 and the other from #53, and in both cases I compiled the kernels myself.  The reproducer program I used is the one attached to the associated IT (IT#249537).

As for the #48 patch, it took two hours to recreate the problem.  As for the #53 patch, it took a second.  By 'recreate', I mean I got the following messages from the program.

  Bad data
  rd_res = 512
  pattern = 7
  buf = 7777777777

Based on my experience so far, I feel it's possible for you not to see the problem even after four hours of running.  Actually, I've been running the program on the kernel with the #48 patch applied again since yesterday, and although it's been 24 hours since then, I've only hit the problem once (the error messages were spewed just once).

I'll start running the program on the kernel Larry built just in case, and I'll update here with the results.  I'm assuming the kernel (kernel-2.6.18-128.pagesem.el5.ia64.rpm) on Larry's page has the #48 patch applied.  Is this correct?

Kei
Comment 62 Jeffrey Moyer 2009-01-27 16:10:38 EST
> I tried two patches, one is from #48 and the other is from #53, and for both
> cases I compiled kernels myself.  The reproducer program I used is the one
> attached to the associating IT (IT#249537).
> 
> As for #48 patch, it took two hours to recreate the program.  As for #53 patch,
> it took a second.  By 'recreate', I mean I got the following messages from the
> program.
> 
>   Bad data
>   rd_res = 512
>   pattern = 7
>   buf = 7777777777
> 
> Based on my experience so far I feel it's possible for you not to see the
> program even after four hours running.  Acutally, I've been running the program
> on the kernel with #48 patch applied again since yesterday and although it's
> been 24 hours since then, I've only hit the problem once (the error messages
> were spewed just once).
> 
> I'll start running the program on the kernel Larry built just in case and
> update  here with the results.  I'm assuming the kernel
> (kernel-2.6.18-128.pagesem.el5.ia64.rpm) on Larry's page has #48 patch applied.
>  Is this correct?

Yes, that's correct.

Could you, just for the sake of sanity, issue the following command:

# grep bio_lock_unlock /proc/kallsyms

on the kernels you built yourself?  That should output a line like:

a0000001001cec60 t bio_lock_unlock

That would at least ensure you have a patched kernel.  ;)  I'm now running with the patch from comment #53, and I am still not hitting the problem.
Comment 63 Keiichiro Tokunaga 2009-01-27 16:52:05 EST
(In reply to comment #62)
> > I'm assuming the kernel
> > (kernel-2.6.18-128.pagesem.el5.ia64.rpm) on Larry's page has #48 patch applied.
> >  Is this correct?
> 
> Yes, that's correct.
> 
> Could you, just for the sake of sanity, issue the following command:
> 
> # grep bio_lock_unlock /proc/kallsyms
> 
> on the kernels you built yourself?  That should output a line like:
> 
> a0000001001cec60 t bio_lock_unlock
> 
> That would at least ensure you have a patched kernel.  ;)  I'm now running with
> the patch from comment #53, and I still am not hitting the problem.

Here are the results.  The first one is from Larry's kernel, the second is from my kernel with #48 patch applied, and the third is from my kernel with #53 applied.

[root@dhcp47-139 ~]# uname -a
Linux dhcp47-139.lab.bos.redhat.com 2.6.18-128.pagesem.el5 #1 SMP Thu Jan 22 11:34:25 EST 2009 ia64 ia64 ia64 GNU/Linux
[root@dhcp47-139 ~]# grep bio_lock_unlock /proc/kallsyms 
a0000001001cdd60 t bio_lock_unlock


[root@dhcp47-139 ~]# uname -a
Linux dhcp47-139.lab.bos.redhat.com 2.6.18-128-bz471613 #1 SMP Thu Jan 22 12:45:48 EST 2009 ia64 ia64 ia64 GNU/Linux
[root@dhcp47-139 ~]# grep bio_lock_unlock /proc/kallsyms 
a0000001001cdd60 t bio_lock_unlock


[root@dhcp47-139 ~]# uname -a
Linux dhcp47-139.lab.bos.redhat.com 2.6.18-128-bz471613-v2 #2 SMP Thu Jan 22 18:01:33 EST 2009 ia64 ia64 ia64 GNU/Linux
[root@dhcp47-139 ~]# grep bio_lock_unlock /proc/kallsyms 
a0000001001cec40 t bio_lock_unlock

Kei
Comment 64 Issue Tracker 2009-01-27 21:57:18 EST
From FJ:
---
Just a question.

IIUC, this issue can happen on HugeTLB memory vmas, too.

Right?  If yes, do you have patches?

kamezawa.hiroyu@jp.fujitsu.com 
---


This event sent from IssueTracker by moshiro@redhat.com 
 issue 249537
Comment 65 Jeffrey Moyer 2009-01-28 10:45:13 EST
Created attachment 330237 [details]
Another reproducer program.

The attached reproducer orchestrates the problem exactly, though it
requires a slow enough read in order to trigger it.  The program also assumes that a file named "data" exists in the current working directory, and that it is filled with something other than 0x2a.  It should be at least 3 x page_size in size.

I've reproduced this by simply starting a dd from the root device to
/dev/null before running the test program.  The order of events is as
follows:

Thread 1	Thread 2	Thread 3
--------	--------	--------
io_submit()
		fork()
				read()

The threads share a common buffer which is 3 x page size.  The
io_submit submits a page-sized I/O that is not aligned on a page
boundary.  So, for x86, it submits 4k starting at a 512 byte offset
into the page.  The read() is done on the second page, again starting
at offset 512 bytes, and spanning into the 3rd page by 512 bytes.

    I/O 1     I/O 2
 |<------->|<------->|
[ page 1 ][ page 2 ][ page 3 ]


The io_submit sets up a direct I/O into pages 1 and 2.  Then, fork()
is called, which does the following:

dup_mmap
  down_write(&oldmm->mmap_sem);
  down_write(&mm->mmap_sem); // child mm
  copy_page_range
    copy_pud_range
      copy_pmd_range
        copy_pte_range
          spin_lock(dst_ptl);  // child page table lock
          spin_lock(src_ptl);  // parent page table lock
          copy_one_pte
          ptep_set_wrprotect(src_mm, addr, src_pte);  <===========
          spin_unlock(src_ptl);

The arrow indicates where we write protect the page.

Then, Thread 3 issues another I/O to page 2:

do_direct_IO
  dio_get_page
    dio_refill_pages
      down_read(&mm->mmap_sem);
      get_user_pages
        __handle_mm_fault
          handle_pte_fault
            spin_lock(ptl); // parent page table lock
            if (!pte_write(entry))                   <============
              do_wp_page
                cow_user_page
      up_read(&mm->mmap_sem);

Here, the arrow indicates where we detect the write-protected page and
trigger a cow fault.  In this case, we allocate a new page and map it
into the parent.  Unfortunately, the first I/O we scheduled will now
go to the page in the child process.

The reproducer program shows exactly this:

$ ~/forkscrew 
worker 0 failed check
0 - 3583: 0
3584 - 4095: 42
child dumping buffer:
0 - 4095: 0
4096 - 8191: 42
parent dumping full buffer
0 - 3583: 0
3584 - 4095: 42
4096 - 8191: 0

The offsets printed are from the beginning of the respective read
buffers, not from the beginning of the page.  So, we see that worker 0
contained the right data (zeroes) for the first page, but corrupt data
in the beginning of the second page.  The full contents from the
parent are dumped last, and they show that only those 512 bytes that
were scheduled before the fork are corrupt.  Examining the child's
memory, we can see that those 512 bytes are correct, there, verifying
our suspicions.  Any I/O that was scheduled before the fork() landed
in the child process just fine.
Comment 66 Larry Woodman 2009-01-28 11:21:29 EST
Jeff, did you verify that the lock_page() and/or the page_semaphore patch fixes the corruption your reproducer uncovered?

Larry
Comment 67 Jeffrey Moyer 2009-01-28 11:34:33 EST
Sorry, I left that part out:

Using this program, I can reliably see the corruption on unpatched
kernels.  When running kernels with either the patch from comment #48
or comment #53, I cannot reproduce the problem.

However, we can't ignore Kei's results.
Comment 68 Jeffrey Moyer 2009-01-28 13:01:40 EST
Data corruption issues aside, can we please get some performance numbers from the kernel posted here?
  http://people.redhat.com/~lwoodman/RHEL5

That kernel has the page semaphore patch applied.  We'd like to know if that addresses the performance problems seen with the lock_page approach.

Thanks!
Comment 69 Jeffrey Moyer 2009-01-28 13:44:10 EST
(In reply to comment #64)
> From FJ:
> ---
> Just a question.
> 
> IIUC, this issue can happen on HugeTLB memory vmas, too.
> 
> Right ? If yes, do you have patches ?

That looks right to me.  Larry, do we need to protect the hugetlb_cow call in hugetlb_fault?
Comment 70 Larry Woodman 2009-01-28 15:05:26 EST
I'd say yes.  Although hugetlbfs is typically used for "shared memory", which is not subject to COW faults, there is a hugetlb_cow() routine that copies from an old page to a new page and then breaks the COW...

This is what needs to be done there:

--- linux-2.6.18.noarch/mm/hugetlb.c.orig	2009-01-28 15:02:22.000000000 -0500
+++ linux-2.6.18.noarch/mm/hugetlb.c	2009-01-28 15:02:25.000000000 -0500
@@ -471,9 +471,11 @@ static int hugetlb_cow(struct mm_struct 
 		return VM_FAULT_OOM;
 	}
 
+	lock_page(new_page);
 	spin_unlock(&mm->page_table_lock);
 	copy_huge_page(new_page, old_page, address);
 	spin_lock(&mm->page_table_lock);
+	unlock_page(new_page);
 
 	ptep = huge_pte_offset(mm, address & HPAGE_MASK);
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
Comment 71 Jeffrey Moyer 2009-01-28 15:25:06 EST
(In reply to comment #70)
> I'd say yes, although hugetlbfs is typically used for "shared memory" whichjh
> is not subject to COW faults, there is a hugetlb_cow() routine that copies from
> an old page to a new page then breaks the COW...
> 
> This is what needs to be done there:
> 
> --- linux-2.6.18.noarch/mm/hugetlb.c.orig 2009-01-28 15:02:22.000000000 -0500
> +++ linux-2.6.18.noarch/mm/hugetlb.c 2009-01-28 15:02:25.000000000 -0500
> @@ -471,9 +471,11 @@ static int hugetlb_cow(struct mm_struct 
>    return VM_FAULT_OOM;
>   }
> 
> + lock_page(new_page);
>   spin_unlock(&mm->page_table_lock);
>   copy_huge_page(new_page, old_page, address);
>   spin_lock(&mm->page_table_lock);
> + unlock_page(new_page);
> 
>   ptep = huge_pte_offset(mm, address & HPAGE_MASK);
>   if (likely(pte_same(huge_ptep_get(ptep), pte))) {

Don't you want to lock the old_page?
Comment 72 Larry Woodman 2009-01-28 15:56:06 EST

Yes, my mistake!  You should be able to verify this with your test program while using hugepages.


--- linux-2.6.18.noarch/mm/hugetlb.c.orig	2009-01-28 15:02:22.000000000 -0500
+++ linux-2.6.18.noarch/mm/hugetlb.c	2009-01-28 15:57:05.000000000 -0500
@@ -471,9 +471,11 @@ static int hugetlb_cow(struct mm_struct 
 		return VM_FAULT_OOM;
 	}
 
+	lock_page(old_page);
 	spin_unlock(&mm->page_table_lock);
 	copy_huge_page(new_page, old_page, address);
 	spin_lock(&mm->page_table_lock);
+	unlock_page(old_page);
 
 	ptep = huge_pte_offset(mm, address & HPAGE_MASK);
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
Comment 73 Keiichiro Tokunaga 2009-01-28 15:58:58 EST
(I forgot to post this...)
I had been running FJ's reproducer on Larry's kernel since yesterday.  I saw the error messages three times this time.  Now I've started running Jeff's reproducer on Larry's kernel.  It's been running for about four hours and I haven't seen any error messages ("worker 0 failed check") yet.  I'll keep it running until tomorrow and see what happens.

BTW, I confirmed the error messages were displayed on the stock -128 kernel.  Following Jeff's instructions, it was easy to reproduce the problem on the stock kernel.
Comment 74 Jeffrey Moyer 2009-01-28 16:06:56 EST
Hi, Kei,

Thanks for the update!  I'm wondering if there is another interaction at play here.  The test program I wrote is *very* specific in what it tests, so if there is another bug in the code, it's not likely to catch it (meaning I doubt it will turn up any errors in your overnight run).

I am still unable to reproduce the corruption issue using the lockpage patch from Larry and the pthread-read-and-fork program.  I've had it running for a day, now.  I wonder what could be different about our environments.
Comment 75 Keiichiro Tokunaga 2009-01-28 16:22:53 EST
Yes, I've also started thinking there's a good chance FJ's reproducer is hitting another issue/bug.  Anyway, I'll keep running your reproducer for a day just in case, while learning more about FJ's reproducer and talking to the FJ team.  Thanks for the reproducer.
Comment 77 Issue Tracker 2009-01-29 05:25:45 EST
From FJ:
---
A comment about redhat's lock_page() patch at comment #48:

IMHO, calling lock_page() after the DIO submitter finds the page is not
correct.
What happens when COW finds the page and its lock_page() is issued
before DIO's lock_page()?  The pte will be overwritten by the "new page"
anyway.

Thanks,
-Kame 
---

I will upload diolock2.patch shortly.


This event sent from IssueTracker by moshiro@redhat.com 
 issue 249537
Comment 78 Moritoshi Oshiro 2009-01-29 05:27:33 EST
Created attachment 330340 [details]
diolock2.patch

diolock2.patch from Fujitsu. Thanks.
Comment 79 Jeffrey Moyer 2009-01-29 11:15:12 EST
(In reply to comment #77)
> From FJ:
> ---
> A comment about redhat's lock_page() patch at comment #48:
> 
> IMHO, calling lock_page() after DIO submitter find the page is not
> correct.
> What happens when COW finds the page and lock_page() before DIO's
> lock_page()
> is issued. pte will be overwritten by "new page", anyway.

I think this was Andrea's concern in comment #27.  If we moved the lock_page call into get_user_pages, do you think that would address the problem?
Comment 80 Andrea Arcangeli 2009-01-29 11:48:04 EST
Correct; I didn't repost the problem for the Nth time ;).

I'm trying to make a patch too, but unfortunately it crashes... moving the lock up isn't totally trivial.  I'm also more convinced the mapcount vs count check isn't accurate enough: count is boosted while a page sits in the pagevec queues waiting for a drain before being moved onto the LRU, and it is unpinned only after PageLRU is set.  Secondly, the check looks inherently racy, because even if we run it under the PT lock (and even with mmap_sem held for write), count and mapcount are still free to move up and down randomly depending on activity in the other mm's that share the page :(.  Hence my current conclusion that for RHEL, using PG_locked (which also interfaces 100% safely with the ksm.ko logic that bails out on locked pages) is a good idea.  For mainline I'm afraid a new separate counter is needed; then we're sure that if, under the PT lock (or with a lockless trick I think I've figured out for gup-fast), the counter is zero, we're safe, and anything else is just a page count/mapcount aberration caused by them randomly going up and down in other mm's.
Comment 84 Jeffrey Moyer 2009-01-30 16:34:30 EST
Created attachment 330508 [details]
Another variant of the pagesem patch

Here is another patch that uses a per-page reader-writer semaphore to allow multiple direct I/Os to the same page while preventing races with the fault path.  We take the mmap semaphore for write in the DIO path before calling get_user_pages.  The newly added page_sem is taken for write in the fault path, and read in the DIO path.  We moved the acquisition of the page_sem inside the mmap_sem to close the race we had in the previous patch.  This patch still breaks kabi (though we can fix that with a bit of work).

This is an RFC, but we wouldn't mind seeing performance numbers for it, either.
Comment 85 Issue Tracker 2009-02-01 20:55:15 EST
From FJ: 

BTW, this is a comment for Comment #84/Bugzilla
==
1. Increasing the size of struct page is nonsense.

2. Can't it deadlock as follows?

  Assume 3 threads. Does dio on page "A".

      Thread 1(DIO)             Thread(2)           thread(3)

     down_write(mmap_sem)
     read lock on page A.
     up_write(mmap_sem)
     (submit I/O is delayd)
                                fork()
                                                   COW, page fault
                                                   down_read(mmap_sem)
                                                   write lock on page A
     Check next iov[] segment
     down_write(mmap_sem)

I'm sorry if I misunderstand.
==
 


This event sent from IssueTracker by moshiro@redhat.com 
 issue 249537
Comment 86 Andrea Arcangeli 2009-02-02 00:41:37 EST
Created attachment 330584 [details]
final patch that addresses the pagevec false positives

Ok, this should be the optimal solution for RHEL.  The major problem introduced in fork by my previous patch was the temporary pins taken by the pagevec code, which generated tons of false positives and unnecessary COWs.  A page pin taken by the pagevec queues was indistinguishable from a pin taken by get_user_pages.  I solved this with a PG_gup bit, which still avoids having to take the page lock.

I tried taking the page lock at the right place, but it opened a can of worms, and worst of all it can only solve O_DIRECT and not the other gup(write=1) users.

In addition, the worry that mapcount vs count was inherently racy was unwarranted.  The reason is that the pte is guaranteed to be wrprotected if the page is already shared, and other mm's can increase or decrease mapcount and count at any time outside of our PT lock.  So the PT lock taken by fork (and ksm) when they decide whether they can wrprotect an anon pte should be enough to be sure that at least the page count cannot increase from under us.

The previous patch was already correct, but too slow in fork for no good reason because of pagevec, and the additional copies happening in fork on the glibc pages because of the pagevec false positives altered the timings in a way that could reproduce the deadlock in the glibc fork implementation.  By fixing the pagevec-triggered slowdown, the deadlock in glibc seems no longer reproducible.

The glibc fork bug still needs fixing even if it's no longer reproducible (and I'd prefer a solution that can be understood by mere mortals, using a smarter kind of lock, instead of second-guessing which writes went through during any of the interrupted critical sections).  Details are in the other bug that covers the nptl troubles in this workload.
Comment 87 Jeffrey Moyer 2009-02-02 13:11:54 EST
(In reply to comment #85)
> From FJ: 
> 
> BTW, this is a comment for Comment #84/Bugzilla
> ==
> 1. increaseing size of struct page in nonsense.

The patch is posted as an RFC to see if it performs well.  If it does, and we agree it is the right way to fix the problem, then we'll deal with moving the page_sem outside of the page struct.

Specifically, this is why we want to test this patch (see comment #50):

>> However, I know this is not a real
>> solution, especially for RHEL5 but a tool to determine whether the ~25%
>> performance degradation that was reported in the patch that used lock_page()
>> was caused by 1.) not allowing multiple callers doing direct IO through
>> dio_bio_submit() at the same time or 2.) not allowing 2 or more callers through
>> dio_bio_submit() and do_wp_page() at the same time.


> 2. Can't it be dead-lock as follwing ?
> 
>   Assume 3 threads. Does dio on page "A".
> 
>       Thread 1(DIO)             Thread(2)           thread(3)
> 
>      down_write(mmap_sem)
>      read lock on page A.
>      up_write(mmap_sem)
>      (submit I/O is delayd)
>                                 fork()
>                                                    COW, page fault
>                                                    down_read(mmap_sem)
>                                                    write lock on page A
>      Check next iov[] segment
>      down_write(mmap_sem)

The direct I/O path calls get_user_pages, asking for up to 64 pages at a time.  It is called for the very first dio_get_page, and then called after running out of pages.  If the I/O is smaller than 64 pages (and get_user_pages returns the number of pages requested), then we only call this once per direct I/O submission.

So, let's assume that you have an I/O that is >64 pages in size.  What you're suggesting is that a direct I/O and a buffered I/O are being scheduled for exactly the same page, and that the page in question is page number 64 (the last page returned by the call to get_user_pages).

1) get_user_pages returns the read-locked pages
2) fork marks the pages read-only
3) another thread faults on a page that will be used for direct I/O, takes
   the mmap semaphore for read, and tries to get a writelock on the last page 
   returned by get_user_pages in thread 1.
4) thread 1 tries to allocate more pages via get_user_pages, and hangs trying
   to take the mmap semaphore for write.

Given the above circumstances, you are right, this can deadlock.  This can be fixed easily, should we choose this approach.  The fix involves adding code in submit_page_section to simply issue the I/O for the last block in the last page in the dio->pages array.

So, I'll ask again for performance numbers.  I'd also like to see the performance numbers for the other 2 patches we are currently discussing if at all possible.
Comment 88 Jeffrey Moyer 2009-02-02 13:34:37 EST
(In reply to comment #87)
> Given the above circumstances, you are right, this can deadlock.  This can be
> fixed easily, should we choose this approach.  The fix involves adding code in
> submit_page_section to simply issue the I/O for the last block in the last page
> in the dio->pages array.

Actually, if we were to fix the problem in this way, then we could incur a significant performance drop for I/O loads that use larger I/O sizes.  I'll keep thinking about this.
Comment 91 Jeffrey Moyer 2009-02-02 17:02:50 EST
(In reply to comment #86)
> Created an attachment (id=330584) [details]
> final patch that addresses the pagevec false positives

Andrea,

I get a failure when running the dma_thread reproducer on ia64:

Bad data at 0x000e00: 0x2000000000334000, 
Data dump starting at 0x000df8:
Expect 0x5a followed by 0xfa:
5a 5a 5a 5a 5a 5a 5a 5a 
fa fa fa fa fa fa fa fa 
Aborted

This is reproducible for me.

Oh, and your patch does not apply to the RHEL 5 kernel tree, so you may want to update your tree before posting another version.

Thanks!
Comment 92 Andrea Arcangeli 2009-02-02 17:14:08 EST
I waited several hours and then posted it to lkml a couple of minutes before comment #91 ;)

Anyway, it must be something ia64-related; can you test it on x86?  On x86 the bugs are 100% reproducible and they go away fully for me (and the glibc bug isn't exposed either after removing the spurious COWs generated by the pagevec pins).

I'll look into the ia64 parts to see if some arch handler is missing; x86 tends to be the simpler arch in that regard.  I didn't test on ia64.
Comment 93 Jeffrey Moyer 2009-02-02 17:33:55 EST
It is *not* ia64 related, as I also reproduced it on x86_64.  The key is to start an I/O load on the system.  Actually, the best thing I've found is to run the pthread-read-and-fork test as well as dma_thread.  Then I can get dma_thread to show incorrect data, and once (on x86_64) it even segfaulted.
Comment 94 Andrea Arcangeli 2009-02-02 17:43:42 EST
I ran both testcases, the second in parallel with cp /dev/sda /dev/null, to be sure.  The first testcase ran with -a 512 -w 40, which allowed me to reproduce the glibc hang too, and I left it running for 12 hours without any problem.

I'll try adding more load to the system.

Also note the patch is against RHEL 5.2; I thought that was the preferred target for this bug, as that is the kernel running in production environments right now.  Is 5.3 the target now?
Comment 95 Andrea Arcangeli 2009-02-02 17:49:42 EST
I'm running 'cp /dev/sda /dev/null' in parallel with dma_thread and I can't reproduce troubles here. forkscrew works fine as well in parallel with dma_thread, neither one reports a problem here.
Comment 96 Andrea Arcangeli 2009-02-02 18:02:22 EST
Ok next thing is to test the fix on 5.3 latest, if in the meantime you have time to test it on a 5.2 kernel we could cross check our results later. Thanks!
Comment 97 Issue Tracker 2009-02-03 01:56:18 EST
From FJ:


Reply to
Comment #87 From  Jeffrey Moyer
==
Given the above circumstances, you are right, this can deadlock.  This can
be
fixed easily, should we choose this approach.  The fix involves adding
code in
submit_page_section to simply issue the I/O for the last block in the last
page
in the dio->pages array.
==
seems not good ;)


I've already replied in lkml here is a summarized copy.
==
1. In the following get_user_pages() path, FOLL_WRITE can disappear.
   This needs to be handled in your patch, I think.
   ==
1118                         while (!(page = follow_page(vma, start, foll_flags))) {
1119                                 int ret;
1120                                 ret = __handle_mm_fault(mm, vma, start,
1121                                                 foll_flags & FOLL_WRITE);
1122                                 /*
1123                                  * The VM_FAULT_WRITE bit tells us that do_wp_page has
1124                                  * broken COW when necessary, even if maybe_mkwrite
1125                                  * decided not to set pte_write. We can thus safely do
1126                                  * subsequent page lookups as if they were reads.
1127                                  */
1128                                 if (ret & VM_FAULT_WRITE)
1129                                         foll_flags &= ~FOLL_WRITE;
   ==
   I think we can remove the above path...

2. I'm not sure why HugeTLB is safe.  Please explain again.

3. In the following path, a page seems to be put() before get().  Is this safe?
I'm sorry if I misunderstand.
==
666 static int
667 submit_page_section(struct dio *dio, struct page *page,
668                 unsigned offset, unsigned len, sector_t blocknr)
669 {
670         int ret = 0;
<snip>
696         if (dio->cur_page) {
697                 ret = dio_send_cur_page(dio);
698                 page_cache_release(dio->cur_page);
699                 dio->cur_page = NULL;
700                 if (ret)
701                         goto out;
702         }
703
704         page_cache_get(page);           /* It is in dio */



This event sent from IssueTracker by moshiro@redhat.com 
 issue 249537
Comment 98 Andrea Arcangeli 2009-02-03 10:11:06 EST
Created attachment 330739 [details]
remove FOLL_WRITE unreliable check

FOLL_WRITE could be cleared if handle_mm_fault triggered a COW.  This was an implementation detail.  The logic remains the same, except we'll set PG_gup even for gup(write=0) and not only for gup(write=1), as FOLL_WRITE sometimes going away prevents us from being more fine-grained.  But it's ok; it was only a minor optimization.

I need to look at the hugetlb part now. I think Kame's right we need to patch into hugetlbpages fork path too if they also can be taken by gup and marked wrprotected by fork.

This patch is now against 2.6.18-130.el5; if it rejects in page-flags.h, you just have to add PG_gup to that file by hand.
Comment 99 Jeffrey Moyer 2009-02-03 12:44:38 EST
I've been running the patch from comment #98 for an hour so far, and it hasn't failed.
Comment 100 Jeffrey Moyer 2009-02-03 14:03:49 EST
(In reply to comment #97)

> I've already replied in lkml here is a summarized copy.

OK, so the rest of the comments are for Andrea's patch, right?

> 3. In following path,a  page seems to be put() before get(). is this safe
> ?
> I'm sorry if I misunderstand.
> ==
> 666 static int
> 667 submit_page_section(struct dio *dio, struct page *page,
> 668                 unsigned offset, unsigned len, sector_t blocknr)
> 669 {
> 670         int ret = 0;
> <snip>
> 696         if (dio->cur_page) {
> 697                 ret = dio_send_cur_page(dio);
> 698                 page_cache_release(dio->cur_page);
> 699                 dio->cur_page = NULL;
> 700                 if (ret)
> 701                         goto out;
> 702         }
> 703
> 704         page_cache_get(page);           /* It is in dio */

You snipped out the critical part:

        /*
         * Can we just grow the current page's presence in the dio?
         */
        if (    (dio->cur_page == page) &&
                (dio->cur_page_offset + dio->cur_page_len == offset) &&
                (dio->cur_page_block +
                        (dio->cur_page_len >> dio->blkbits) == blocknr)) {
                dio->cur_page_len += len;

                /*
                 * If dio->boundary then we want to schedule the IO now to
                 * avoid metadata seeks.
                 */
                if (dio->boundary) {
                        ret = dio_send_cur_page(dio);
                        page_cache_release(dio->cur_page);
                        dio->cur_page = NULL;
                }
                goto out;
        }

You will not run into a case where the page passed in is the same as dio->cur_page AND the offset/len/blocknr is not contiguous with the last page section.  The reason is that a single io vector describes a single I/O segment, and each of these is processed separately by direct_io_worker.
Comment 101 Jeffrey Moyer 2009-02-03 14:58:38 EST
Let me correct that previous comment.

For each iovec, get_user_pages is called with the number of pages needed to fill out that segment.  If two iovecs fall on the same page, g_u_p will simply return the same page (and bump its reference count) the second time.  Thus, it looks safe to me.
Comment 102 Issue Tracker 2009-02-06 00:30:40 EST
From FJ:
---
I confirmed that the patch, which is from comment #98 of Bug #471613,
improved fork() cost compared to the previous patch (comment #36).
The result of fork() cost is as follows:

                    |          fork() cost (ms)         |
size of anon memory | RHEL5.3RC (1) | RHEL5.3RC + patch | ratio(%)
                    |               |   from #98 (2)    | ((2)/(1)) * 100
--------------------+---------------+-------------------+----------------
                1M  |     0.075     |       0.078       |     104.00
               10M  |     0.168     |       0.173       |     102.97
              100M  |     1.117     |       1.167       |     104.47
             1000M  |    10.543     |      11.130       |     105.56

And I think this patch fixed the problem, because the reproducer has
run with no problem for a day.  Now our performance team is running
another performance test to confirm whether our customer can accept
this patch.  The test will complete next Monday.

Can you release the async errata by Feb. 13 JST if we accept this patch
next Monday?
---



This event sent from IssueTracker by moshiro@redhat.com 
 issue 249537
Comment 103 Andrea Arcangeli 2009-02-06 12:40:54 EST
Created attachment 331150 [details]
fix hugetlb

This takes care of fixing hugetlb too.

To reproduce the same race condition with hugetlb pages, it was enough to run forkscrew and dma_thread with:

LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./dma_thread -a 512 -w 40
LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./forkscrew 

I verified hugetlb pages were allocated (via grep Huge /proc/meminfo) as well as through a printk in the hugetlb 'forcecow' path, to be sure it ran and fixed the race.

I also exercised the -EAGAIN path and verified it rolls back and restarts on the previous pte just fine; it only happens during very heavy paging activity.  We can unlock/relock around the page copy to be more low-latency schedule friendly.

I think this bugzilla is solved now, and I'll be working on addressing gup-fast for mainline.
Comment 104 Andrea Arcangeli 2009-02-06 13:01:04 EST
Even though there was no measurable regression in the fork fast path, with the last patch I moved the page allocation out of the fast path and dropped the spinlocks while copying the page, like the rest of the VM does (e.g. in do_wp_page).  That's more (semi-RT) schedule friendly too.  The downside is the additional complexity of the -EAGAIN path I had to introduce, but I hit it 3 times (verified with printk) during heavy swapping and it ran just fine, with no failure from dma_thread (it was dma_thread that ran the -EAGAIN path), so I don't expect trouble.
Comment 105 Andrea Arcangeli 2009-02-09 10:44:30 EST
Created attachment 331326 [details]
flush the tlb after marking pte wrprotect and before copying the page in fork

This further update adds a TLB flush just before doing the memcpy.  That explains why, before PG_gup, there was still a regression reproducible in glibc.  After adding PG_gup, I think having the TLB flush or not won't make a difference, but I tend to think it's more correct to have it anyway.

This would be a two-liner if I hadn't wanted to play it safe and have both src_vma and dst_vma.  The TLB flush requires src_vma; the pre-COW requires dst_vma.
Comment 106 Jeffrey Moyer 2009-02-11 13:26:34 EST
(In reply to comment #105)
> Created an attachment (id=331326) [details]
> flush the tlb after marking pte wrprotect and before copying the page in fork
> 
> This further update adds a tlb flush just before doing memcpy. That explains
> why before PG_gup there was still regression reproducible in glibc. After
> adding PG_gup I think having tlb flush or not won't make a difference, but I
> tend to think it's more correct to have still.
> 
> This would be a two liner difference if it wasn't because I wanted to go safe
> and have both src_vma and dst_vma. Tlb flush requires src_vma, pre-cow requires
> dst_vma.

I've had this patch running overnight, and so far there have been no problems.  I will keep it running until we have confidence that it is stable.
Comment 107 Andrea Arcangeli 2009-02-11 20:20:51 EST
Great.  The last patch should be good for solving the problem in production, and I'll shortly submit a port to mainline that will fix gup-fast without necessarily sending IPIs for every page being wrprotected.

With regard to bug #121733, that was a different problem, which I fixed around 2.6.7 with a page count vs mapcount check in the vmscan loop, keeping the pte pinned for pages under O_DIRECT (same as PG_pinned in 2.4, but without requiring all gup users to be altered to unpin pages before releasing the page count).  Later, Hugh removed the page count vs mapcount check and replaced it with a more accurate version of the 'takeover' logic in do_wp_page (i.e. do_wp_page got smarter at taking over readonly pages instead of triggering COW false positives).
However, there was a bug in the replacement fix when the page triggering copy-on-write happens to be locked down (for swapping or anything else).  That can lead to losing O_DIRECT reads during swapin, I think, though probably with a much smaller window than in the past.  So anyway, I'll attach the fix that went into mainline, which should close the race (or at least fix the PG_locked case of it), still without introducing page count vs mapcount checks in the vmscan loop.  (I think my original fix was safer and more obvious, but they mentioned some trouble with page migration taking temporary pins; I don't see a big issue with that, but as long as there is a supposedly safe solution without the count vs mapcount check, I'm fine.)
Comment 108 Andrea Arcangeli 2009-02-11 20:23:07 EST
Created attachment 331638 [details]
patch to close the race to be as safe as 2.4 with PG_pinned (or 2.6.7 with count vs mapcount check in vmscan loop to keep ptes pinned if pointing to pages under O_DIRECT)

I'd suggest applying this too, in addition to attachment #331326 [details]
Comment 109 Andrea Arcangeli 2009-02-11 20:28:54 EST
Credit for the investigation, and for the discussion of whether the PG_pinned issue is fully solved in 2.6 despite my count vs mapcount check being removed after 2.6.7, goes to Hugh Dickins, who helped recollect the past bits from git.
Comment 111 Jeffrey Moyer 2009-02-14 10:14:41 EST
A test kernel, based on the RHEL 5.3 z-stream kernel, that includes the proposed patches from this bugzilla is available here:
  http://people.redhat.com/jmoyer/bz471613/

Please download and test this kernel thoroughly.

Thanks!
Comment 113 Larry Troan 2009-02-16 09:52:54 EST
NEEDINFO was satisfied in comment #112.
> We tested kernel-2.6.18-128.1.1.el5.bz471613.1.ia64.rpm.
> And, We confirmed that this kernel fixed the problem. The
> fork() cost of this kernel increased to 103% of EL5.3 GA
> kernel, when fork() is called during direct I/O.

> Partner requests if this is OK with other RHEL consumers.
I believe a 3% performance penalty under the above conditions is the best we can do.
Comment 114 RHEL Product and Program Management 2009-02-16 10:10:46 EST
Updating PM score.
Comment 116 RHEL Product and Program Management 2009-02-16 14:16:30 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 118 Jeffrey Moyer 2009-02-17 09:53:24 EST
Running this kernel through our regression tests turned up the following:

BUG: sleeping function called from invalid context at mm/page_alloc.c:945
in_atomic():1, irqs_disabled():0
 [<c0460a4e>] __alloc_pages+0x31/0x2a6
 [<c0468fd0>] copy_page_range+0x442/0x653
 [<c046db7e>] anon_vma_link+0x19/0x96
 [<c042308f>] copy_process+0xcf9/0x13bf
 [<c04239a3>] do_fork+0x41/0x168
 [<c044f8ef>] audit_syscall_entry+0x14b/0x17d
 [<c040318b>] sys_clone+0x28/0x2d
 [<c0404f93>] syscall_call+0x7/0xb
 =======================

=============================================
[ INFO: possible recursive locking detected ]
2.6.18-128.1.1.el5.bz471613.1debug #1
---------------------------------------------
kdump-propagate/4927 is trying to acquire lock:
 (&mm->page_table_lock){--..}, at: [<c046907c>] copy_page_range+0x4ee/0x653

but task is already holding lock:
 (&mm->page_table_lock){--..}, at: [<c0469073>] copy_page_range+0x4e5/0x653

other info that might help us debug this:
3 locks held by kdump-propagate/4927:
 #0:  (&mm->mmap_sem){----}, at: [<c0422f0f>] copy_process+0xb79/0x13bf
 #1:  (&mm->mmap_sem/1){--..}, at: [<c0422f1c>] copy_process+0xb86/0x13bf
 #2:  (&mm->page_table_lock){--..}, at: [<c0469073>] copy_page_range+0x4e5/0x653

stack backtrace:
 [<c043a8f5>] __lock_acquire+0x75b/0x972
 [<c043b21b>] lock_acquire+0x5d/0x78
 [<c046907c>] copy_page_range+0x4ee/0x653
 [<c061d8f1>] _spin_lock+0x1a/0x43
 [<c046907c>] copy_page_range+0x4ee/0x653
 [<c046907c>] copy_page_range+0x4ee/0x653
 [<c046db7e>] anon_vma_link+0x19/0x96
 [<c042308f>] copy_process+0xcf9/0x13bf
 [<c04239a3>] do_fork+0x41/0x168
 [<c044f8ef>] audit_syscall_entry+0x14b/0x17d
 [<c040318b>] sys_clone+0x28/0x2d
 [<c0404f93>] syscall_call+0x7/0xb
 =======================
Comment 119 Jeffrey Moyer 2009-02-17 11:49:09 EST
(In reply to comment #118)
> Running this kernel through our regression tests turned up the following:
> 
> BUG: sleeping function called from invalid context at mm/page_alloc.c:945

I think we can fix this by passing GFP_ATOMIC into the alloc_page_vma function, since the code already handles allocation failures.

> =============================================
> [ INFO: possible recursive locking detected ]
> 2.6.18-128.1.1.el5.bz471613.1debug #1
> ---------------------------------------------
> kdump-propagate/4927 is trying to acquire lock:
>  (&mm->page_table_lock){--..}, at: [<c046907c>] copy_page_range+0x4ee/0x653

/usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/mm/memory.c:1739
c046907c:       8b 44 24 54             mov    0x54(%esp),%eax
c0469080:       8b 54 24 48             mov    0x48(%esp),%edx
c0469084:       39 10                   cmp    %edx,(%eax)
c0469086:       75 6c                   jne    c04690f4 <copy_page_range+0x566>

1739:        if (!pte_same(*src_pte, _src_pte))
1740:                goto eagain;

This makes no sense to me.

> but task is already holding lock:
>  (&mm->page_table_lock){--..}, at: [<c0469073>] copy_page_range+0x4e5/0x653

/usr/src/debug/kernel-2.6.18/linux-2.6.18.i686/mm/memory.c:1732
c0469073:       8b 44 24 44             mov    0x44(%esp),%eax
c0469077:       e8 5b 48 1b 00          call   c061d8d7 <_spin_lock>

1731:        spin_lock(dst_ptl);
1732:        spin_lock(src_ptl);

OK, that at least lines up.

If we look at copy_pte_range, we see that the src_ptl is locked using:
        spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);

In fork_pre_cow, we drop that lock:

        spin_unlock(src_ptl);

and then reacquire it:

        spin_lock(src_ptl);

It seems to me that the lock nesting rules still apply here, so we should be using spin_lock_nested as is done in copy_pte_range.  I'll put together a patch which implements the above suggestions and re-test.
Comment 120 Andrea Arcangeli 2009-02-17 12:03:01 EST
GFP_ATOMIC would break fork so we can't use it.

At the moment I'm unsure whether this is a false positive of the lockdep nested-deadlock code or a real bug that needs fixing. Adding the _nested annotation will certainly teach lockdep not to trigger false positives for perfectly safe code.

There is one minor bug in the -ENOMEM path, though, but it's not the one that generated the above (hopefully the above is not a bug at all).
Comment 121 Andrea Arcangeli 2009-02-17 14:16:10 EST
Created attachment 332278 [details]
same as before but removes the lockdep false positive and corrects minor issue in -ENOMEM path which would never trigger
Comment 122 Jeffrey Moyer 2009-02-18 10:43:31 EST
OK, I built with that updated patch, and the tests are still hitting the following:

  BUG: sleeping function called from invalid context at mm/page_alloc.c:945

The lockdep warning is gone, as expected.
Comment 123 Andrea Arcangeli 2009-02-18 12:53:44 EST
OK, this is a highmem issue affecting 32-bit x86 only. I'll take care of it; 64-bit kernels were just fine.
Comment 124 Andrea Arcangeli 2009-02-18 14:18:51 EST
Created attachment 332435 [details]
this should work well on 32bit with highmem
Comment 125 Andrea Arcangeli 2009-02-18 14:20:44 EST
Great regression test suite, I have to say; really great QA! Thanks.
Comment 126 Jeffrey Moyer 2009-02-20 10:11:55 EST
The last patch passes our internal tests.
Comment 128 Don Zickus 2009-03-04 15:00:04 EST
in kernel-2.6.18-133.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 131 Don Zickus 2009-05-13 17:29:30 EDT
Moving back to POST to pickup the new changes.

The old change is being reverted and replaced with a new change that handles the fast-gup patch, which moved the original problem into a new path the old patch did not cover.

In other words, applying bz 474913 re-opens the original problem. The new patch will re-fix the problem, incorporating the new changes.
Comment 134 Don Zickus 2009-05-14 10:17:19 EDT
Created attachment 343973 [details]
patch is reworked to handle fast-gup changes

The fast-gup patch described in an earlier comment changed the code paths for O_DIRECT, so the original patch for this problem no longer works. The attached patch moves the fix to the new code paths.
Comment 136 Don Zickus 2009-05-14 15:34:30 EDT
in kernel-2.6.18-148.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 138 Don Zickus 2009-05-14 16:54:09 EDT
Moving back to POST because this patch was originally committed, found to have issues on ia64 xen boxes and reverted.
Comment 141 Don Zickus 2009-05-19 09:48:56 EDT
Apologies for any confusion the status of this bugzilla has caused.  148.el5 does _not_ have the fix FJ needs, it was reverted due to issues caused with xen and ia64.  This is why the bugzilla state is in POST.

To summarize:
146.el5 has the original fix, which should go into EUS
147.el5 had a new fix to cover the fast-gup feature, but broke ia64 xen
148.el5 had the new fix reverted (which exposes the original problem again)
149.el5 has a 'new' new fix which will hopefully have successful test results so I can release it today.

We understand the critical nature of this patch and are working hard to make sure this fix is in 5.4 beta.  Unfortunately due to the fast gup feature, the fix has had a long journey through 5.4.

Feel free to respond with more questions and concerns about this patch if something doesn't make sense.

Cheers,
Don
Comment 143 Don Zickus 2009-05-19 15:45:58 EDT
in kernel-2.6.18-149.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 146 Don Zickus 2009-06-04 10:11:48 EDT
Thanks for the detailed report.  So it seems like the original fix and the new fix with fast gup both exhibit a problem.

I'll let Andrea comment on this as he is the one who put the patch together.
Comment 147 Andrea Arcangeli 2009-06-04 14:43:51 EDT
So the thing is, my theory is that a set_pte_at was executed after temporarily dropping the src_ptl lock, but without checking whether the pte was still present. So we were setting the _PAGE_RW bit 0x2 on the swap entry, which leads to the unused swap entry 00001d82 instead of the correct 00001d80 (bit 0x2 set in the swap entry is no proof of my theory, but it doesn't invalidate it either; there's at least a 50% probability the theory is right, and more than that considering it's unlikely two bugs in the same code would lead to the same side effect). I already fixed it; it was an incredibly small race condition, and it triggers about as rarely as the -EAGAIN path (which I managed to exercise only once during testing, with a printk on that path).

It can't trigger unless there is a heavy reproducer of the fork vs gup race, so the 99% of systems out there that aren't heavily swapping or heavily reproducing the race can't be affected. Besides, if they are heavy reproducers of the race, chances are they're already better off with the fix than without ;).

I'll upload a one-liner fix in a minute, after it passes my more basic tests. If the pte is not present, we simply don't need to mark it writable again; marking the pte writable after fork_pre_cow was only an optimization to prevent a no-op do_wp_page in the src_mm after fork returns.
Comment 148 Andrea Arcangeli 2009-06-04 14:58:09 EDT
Created attachment 346577 [details]
fix tiny swapping race in fork-vs-gup patch

incremental with 343973 (didn't check the pre-gup-fast status)
Comment 169 Jeffrey Moyer 2009-07-17 09:36:06 EDT
*** Bug 512100 has been marked as a duplicate of this bug. ***
Comment 170 Don Zickus 2009-07-22 11:55:20 EDT
This patch has been reverted from 5.4 (kernel-2.6.18-159.el5).  Moving back to ASSIGNED to further work on it during 5.5
Comment 173 Jeffrey Moyer 2009-08-05 16:03:36 EDT
Andrea, since you've been doing all of the heavy lifting on this bug, I'm reassigning it to you.
