435291 – LTC41974-Pages of a memory mapped NFS file get corrupted.

Bug 435291 - LTC41974-Pages of a memory mapped NFS file get corrupted.

Summary: LTC41974-Pages of a memory mapped NFS file get corrupted.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.1
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Peter Staubach
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	391501 KernelPrio5.3 450335
TreeView+	depends on / blocked

Reported:	2008-02-28 15:11 UTC by Jeff Layton
Modified:	2018-10-20 02:41 UTC (History)
CC List:	16 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-01-20 19:40:50 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
proposed patch 1 (5.10 KB, patch) 2008-02-28 16:25 UTC, Jeff Layton	no flags	Details \| Diff
proposed patch 2 (6.02 KB, patch) 2008-02-28 16:27 UTC, Jeff Layton	no flags	Details \| Diff
experimental patch 1 -- don't clean pages as aggressively in invalidate_inode_pages2_range (2.19 KB, patch) 2008-03-11 17:42 UTC, Jeff Layton	no flags	Details \| Diff
experimental patch 2 -- have nfs_revalidate_mapping deal with errors from invalidate_inode_pages2 (1.32 KB, patch) 2008-03-11 17:42 UTC, Jeff Layton	no flags	Details \| Diff
latest linked-list reproducer (5.37 KB, text/plain) 2008-03-18 15:41 UTC, Jeff Layton	no flags	Details
Proposed patch (4.75 KB, patch) 2008-05-20 17:38 UTC, Peter Staubach	no flags	Details \| Diff
Proposed patch (4.56 KB, patch) 2008-06-12 19:08 UTC, Peter Staubach	no flags	Details \| Diff
Show Obsolete (5) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
IBM Linux Technology Center	42949	0	None	None	None	2019-01-08 15:45:37 UTC
Red Hat Product Errata	RHSA-2009:0225	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update	2009-01-20 16:06:24 UTC

Description Jeff Layton 2008-02-28 15:11:56 UTC

+++ This bug was initially created as a clone of Bug #432974 +++

[ some comments removed for brevity ]

-- Additional comment from jlayton on 2008-02-20 14:14 EST --
Created an attachment (id=295442)
reproducer 6

Current reproducer that does work, based off the original one from IBM. I'm
running this in a shell script that runs it in a loop. RHEL4 routinely fails
after just a couple of runs. RHEL5 usually fails within 4 or 5. I haven't seen
any rawhide kernels fail with it.

-----------------[snip]--------------
#!/bin/sh

RESULT=0
RUN=1
while [ $RESULT -eq 0 ]; do
	echo "Run $RUN"
	./mutex_lt3 $1
	RESULT=$?
	RUN=`expr $RUN + 1`
done

exit $RESULT


-- Additional comment from tao on 2008-02-20 16:35 EST --
----- Additional Comments From ffilz.com  2008-02-20 16:33 EDT
-------
Jeff's full comment:

Current reproducer that does work, based off the original one from IBM.
I'm
running this in a shell script that runs it in a loop. RHEL4 routinely
fails
after just a couple of runs. RHEL5 usually fails within 4 or 5. I haven't
seen
any rawhide kernels fail with it.

-----------------[snip]--------------
#!/bin/sh

RESULT=0
RUN=1
while [ $RESULT -eq 0 ]; do
	echo "Run $RUN"
	./mutex_lt3 $1
	RESULT=$?
	RUN=`expr $RUN + 1`
done

exit $RESULT 


This event sent from IssueTracker by Glen Johnson 
 issue 163042

-- Additional comment from tao on 2008-02-20 16:40 EST --
file attachment from LTC:reproducer6.c


This event sent from IssueTracker by Glen Johnson 
 issue 163042
it_file 120446

-- Additional comment from tao on 2008-02-20 16:41 EST --
----- Additional Comments From ffilz.com  2008-02-20 16:34 EDT
-------
 
reproducer 6 


This event sent from IssueTracker by Glen Johnson 
 issue 163042

-- Additional comment from jlayton on 2008-02-21 12:20 EST --
Still trying to get my arms around this...

Reducing the number of threads seems to make this race harder to hit (which
isn't too surprising). I've reduced it to 10 and still been able to hit it
pretty consistently however.

Removing the multiplication busy loop seems to make the problem go away
altogether. I ran 20 concurrent threads in a loop and it went over 100
iterations without failing. I'm going to try changing that busy loop slightly
and see if it makes this easier or harder to hit...




-- Additional comment from jlayton on 2008-02-21 12:43 EST --
Reducing the number of forked processes to 5 still allows the problem to occur,
but I have yet to see it occur when I reduce the number to 3. Still working on
this on RHEL4 at the moment, hopefully once we figure out the problem there we
can see whether RHEL5's problem is the same.


-- Additional comment from jlayton on 2008-02-21 15:47 EST --
Added a new debug flag and changed the dfprintk() in nfs_revalidate_mapping to
key off of it and added the pid of the current task. Here was the run:

# ./mutex_lt3_loop.sh /mnt/rhel5/mtxfile
Run 1
Run 2
Run 3
<17935> can't find me to logout - expected at slot 2
<17936> can't find me to logout - expected at slot 3

...here's the output from the run:

NFS: 17921 (0:13/1755723) data cache invalidated
NFS: 17925 (0:13/1755723) data cache invalidated
NFS: 17927 (0:13/1755723) data cache invalidated
NFS: 17933 (0:13/1755723) data cache invalidated
NFS: 17937 (0:13/1755723) data cache invalidated

...so we know that the data cache was never invalidated in the context of the
threads that saw the mapping in this case.


-- Additional comment from jlayton on 2008-02-21 16:06 EST --
Added a new printk to show on every entry to nfs_revalidate_mapping too. Here's
output from this run:

# ./mutex_lt3_loop.sh /mnt/rhel5/mtxfile
Run 1
Run 2
Run 3
<18255> can't find me to logout - expected at slot 2
<18256> can't find me to logout - expected at slot 3

So I think this means that 18255 and 18256 had their writes get lost...

Here, the mapping was invalidated by 18252 and then revalidated by 5 and 6 (as
well as by others) just before the data cache was invalidated by 3.

NFS: 18255 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410
NFS: 18256 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410
NFS: 18253 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410
NFS: 18253 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410
NFS: 18253 (0:13/1755723) data cache invalidated for mapping 0000010011fda410
NFS: 18254 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410
NFS: 18255 (0:13/1755723) nfs_revalidate_mapping on 0000010011fda410

...the address at the end is the address_space object address. That's the same
for everyone. So I don't think that the problem is that we have inconsistencies
between different processes for the mmap, rather I think that we're most likely
dealing with a race that's causing writes to occasionally get lost.

I could be wrong on this, but I think it makes the most sense...


-- Additional comment from jlayton on 2008-02-22 11:20 EST --
Created an attachment (id=295644)
reproducer 7

Some more cleanup of the reproducer:

This should make each "slot" in the array be page-aligned (by extending the
garbage array to 1008 entries).

Change MP to be based on the number of spawned processes -- there should be no
need to go past that and that cuts down the run time since the main thread
doesn't need to write out the huge file.

Change a couple of places that check for excessive loop iterations to error out
when we go past MP iterations (signifying that we traversed the list more than
once).

Reduce the number of busy loop multiplication ops -- this "feels" like it makes
things reproduce faster but it might just be my imagination.

Add a routine to dump the state of the array when an error is detected.

-- Additional comment from jlayton on 2008-02-22 11:25 EST --
That should read...

(by extending the garbage array to 1022 entries)

Here's some interesting output from the latest reproducer:

<26595> can't find me to logout - expected at slot 4
(26595) 0: val=-1 next=2
(26595) 1: val=26593 next=5
(26595) 2: val=26594 next=1
(26595) 3: val=-1 next=-1
(26595) 4: val=26595 next=-1
(26595) 5: val=-1 next=-1
<26596> can't find me to logout - expected at slot 5
(26596) 0: val=-1 next=2
(26596) 1: val=26593 next=5
(26596) 2: val=26594 next=1
(26596) 3: val=-1 next=-1
(26596) 4: val=26595 next=-1
(26596) 5: val=-1 next=-1

...so 26595 couldn't find itself in the array. But, since the loop ran, the
mapping has evidently been updated and now it *is* in the right spot.


-- Additional comment from jlayton on 2008-02-25 09:12 EST --
Another run from this morning. Here's the output from the program:

# ./mutex_lt3_loop.sh /mnt/rhel5/mtxfile
Run 1
<2289> can't find me to logout - expected at slot 3
(2289) 0: val=-1 next=4
(2289) 1: val=2287 next=3
(2289) 2: val=-1 next=-1
(2289) 3: val=2289 next=-1
(2289) 4: val=-1 next=-1
<2287> can't find me to logout - expected at slot 1
(2287) 0: val=-1 next=4
(2287) 1: val=2287 next=3
(2287) 2: val=-1 next=-1
(2287) 3: val=2289 next=-1
(2287) 4: val=-1 next=-1
<2288> can't find me to logout - expected at slot 4
(2288) 0: val=-1 next=4
(2288) 1: val=2287 next=3
(2288) 2: val=-1 next=-1
(2288) 3: val=2289 next=-1
(2288) 4: val=-1 next=-1

...hexdump of the final contents. It pretty much matches what we see above.
Presuming that the reproducer locking semantics are correct (and I think that
they are), it looks like 2288's list insertion got lost.

# hexdump /export/mtxfile
0000000 ffff ffff 0004 0000 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
*
0001000 08ef 0000 0003 0000 0000 0000 0000 0000
0001010 0000 0000 0000 0000 0000 0000 0000 0000
*
0002000 ffff ffff ffff ffff 0000 0000 0000 0000
0002010 0000 0000 0000 0000 0000 0000 0000 0000
*
0003000 08f1 0000 ffff ffff 0000 0000 0000 0000
0003010 0000 0000 0000 0000 0000 0000 0000 0000
*
0004000 ffff ffff ffff ffff 0000 0000 0000 0000
0004010 0000 0000 0000 0000 0000 0000 0000 0000
*
0005000

...the good news is that turning up nfs_debug seems to make this race even
easier to hit. I have the debug output from the above run and will attach it soon.


-- Additional comment from jlayton on 2008-02-25 09:13 EST --
Created an attachment (id=295797)
debug messages from run in comment #47

nfs_debug output from run in comment #47.


-- Additional comment from jlayton on 2008-02-25 09:47 EST --
Since the slots are now page aligned, we should be able to determine where they
are in the debug output. Slot 4 corresponds to 0x4000 above, so it should start
at byte 16384. There are other races involved since the other processes
obviously didn't find their entries, but the missing write from 2288 is probably
as good a place to start as any.

Just after the first mmap is done, we do a big read and load the cache for the
page starting at 16384:

NFS: read done (0:13/1755723 4096@0)
NFS: read done (0:13/1755723 4096@4096)
NFS: read done (0:13/1755723 4096@8192)
NFS: read done (0:13/1755723 4096@12288)
NFS: read done (0:13/1755723 4096@16384)

...then later, the data cache is invalidated (though it's not clear to me why):

NFS: nfs_update_inode(0:13/1755723 ct=1 info=0x6)
NFS: (0:13/1755723) data cache invalidated
NFS: nfs3_forget_cached_acls(0:13/1755723)
NFS: (0:13/1755723) revalidation complete


...at that point, we do an nfs_readpage call, but for only 4k at page offset 4:

NFS: nfs_readpage (0000010000b64450 4096@4)
NFS:   47 initiated read call (req 0:13/1755723, 4096 bytes @ offset 16384)

...are the other pages already flagged uptodate? Should they be if the cache is
invalidated? A little while later:

NFS:   52 initiated write call (req 0:13/1755723, 4096 bytes @ offset 16384)
NFS:   52 nfs_writeback_done (status 4096)
NFS: write (0:13/1755723 4096@16384) marked for commit
NFS:   53 initiated commit call
nfs: fsync(0:13/1755723)
NFS: nfs_update_inode(0:13/1755723 ct=2 info=0x6)
NFS:   53 nfs_commit_done (status 0)


...and that's the first time we see a write go out on the wire for that page. I
think the fsync represents a msync call, so at that point, one of the processes
had detected and error and started exiting.

The hard part here is that we don't have any real visibility into when the
changes to the mmap happened here. That's all done via direct access to the
memory so we probably won't be able to tell that.

What might be interesting is to dump the first few bytes of each page just
before a read call is done so that we can see if potential changes are being
clobbered before they ever go out on the wire. Possibly the same for write calls.



-- Additional comment from tao on 2008-02-25 10:15 EST --
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |OPEN




------- Additional Comments From ffilz.com  2008-02-25 10:10 EDT
-------
> ...at that point, we do an nfs_readpage call, but for only 4k at page
> offset 4:
>
> NFS: nfs_readpage (0000010000b64450 4096@4)
> NFS:   47 initiated read call (req 0:13/1755723, 4096 bytes @ offset
> 16384)

Is that supposed to be byte offset 4 or page offset 4?

Printing the first few bytes of each page during read and write would be
good. 
Also pringing more details when the cache is invalidated would be good.

Moving to open since development/Red Hat is investigationg, we are not
waiting 
on submitter for anything. 


This event sent from IssueTracker by Glen Johnson 
 issue 163042

-- Additional comment from jlayton on 2008-02-25 10:44 EST --
> NFS: nfs_readpage (0000010000b64450 4096@4)

... 4096 bytes at page index 4

> NFS:   47 initiated read call (req 0:13/1755723, 4096 bytes @ offset 16384)

...which becomes a read call for 4k at byte offset 16384. I may do that (add
some debugging code that adds info about what data is currently in the page),
but I want to see first if I can rewrite this reproducer such that it doesn't
use this linked list. I think if I have different processes updating different
offsets in different pages then I might be able to reproduce the same thing, and
it might be a little clearer about who's updates are getting clobbered.


-- Additional comment from tao on 2008-02-25 11:05 EST --
----- Additional Comments From ffilz.com  2008-02-25 11:01 EDT
-------
Ok, figured that was page index 4.

Yes, having each process have it's own part of the page to use would
help. One 
thought would be for each process to take a snapshot of the other
processes 
page areas and put that in it's own section, so the layout of each page
might 
be something like a 2 dimensional array. It might be interesting to see
what 
the other processes views are when a page update appears to be lost. 


This event sent from IssueTracker by Glen Johnson 
 issue 163042

-- Additional comment from jlayton on 2008-02-25 11:39 EST --
Created an attachment (id=295812)
simpler reproducer 1

Here's a simpler reproducer program. It just declares the file to be a large
array of pages, and gives each process a slot in every page. The process:

gets the lock
checks that its slot is zero in every page
writes its pid to its slot in every page
releases the lock
does a busy loop
gets the lock again
checks that its pid is in its slot in every page
writes 0 to its slot in every page

...it does this for 400 iterations. Usually, it fails after just a few. I need
to clean it up a little more and have it dump the state of the problem page
when  it detects a problem.


w

-- Additional comment from jlayton on 2008-02-25 11:41 EST --
From what I've seen so far, I don't think the processes have an inconsistent
view of the pages. The address_space struct seems to be the same one, so the
pages attached to it ought to also be the same.

Rather, I think that the problem is that updates to the pages are occasionally
getting clobbered. The locking should be sufficient to make that not occur, but
it still seems to be happening.

I could be wrong here, however...


-- Additional comment from jlayton on 2008-02-25 12:07 EST --
I've been able to reduce NRTHREADS to 10 and still reproduce it but it takes
longer. Sample error from the above program:

7981: error p=1 s=7 v=0

...the interesting thing is that every time I've run this this message always
has p=1 (problem in the first page index). Still need to add a bit of code to
dump the state of the problem page when this occurs though...


-- Additional comment from jlayton on 2008-02-25 12:59 EST --
Running with nfs_debug cranked up, I got a message (that I lost), but:

p=0 s=0 v=0

...here's first mmap call:

nfs: mmap(//mtxfile)

...then we see this:

NFS: revalidating (0:13/1755725)
NFS call  getattr
NFS: 15394 initiated write call (req 0:13/1755725, 32768 bytes @ offset 4096)
NFS: 15395 initiated write call (req 0:13/1755725, 4096 bytes @ offset 36864)
NFS: nfs_update_inode(0:13/1755725 ct=2 info=0x6)
NFS: mtime change on server for file 0:13/1755725
NFS: 15394 nfs_writeback_done (status 32768)

...what happened to the stuff at offset 0? Later in the file, I never see that
offset written out, but I do see this a little later on:

NFS: nfs_readpages (0:13/1755725 10)
NFS: 15398 initiated read call (req 0:13/1755725, 32768 bytes @ offset 0)
NFS: 15399 initiated read call (req 0:13/1755725, 8192 bytes @ offset 32768)

...so here the kernel decided to go ahead and reload page 0, even though it
hadn't been written out. We also know from the design of the program that it
should definitely have been written out.

Looking now to see if I can determine how the decision to write out pages was
made here...


-- Additional comment from tao on 2008-02-25 13:50 EST --
----- Additional Comments From nevdull.com (prefers email at
ricklind.com)  2008-02-25 13:47 EDT -------
jlayton:  in our early testing we noticed a similar strange coincidence --
when
a failure was detected it was always the first time through the loop for
the
victim.  Made us wonder if our initial view of the world was mistaken when
we
chose a free slot, but couldn't spot anything obvious to back up that
thinking. 


This event sent from IssueTracker by Glen Johnson 
 issue 163042

-- Additional comment from jlayton on 2008-02-25 14:04 EST --
Ok, bug in my program :-)

I had this:

	for (i=1; i<NRTHREADS; i++) {

...in both of my insert and delete loops. i needs to start at 0. So that
explains why the first page wasn't being written out. It was never getting changed.

After making this change, the problem seems to be more difficult to hit for some
reason. Probably timing-related somehow, but I'm not sure yet...


-- Additional comment from tao on 2008-02-26 09:46 EST --
Nokia is running their IBM Rational ClearCase SCM environment mainly on top
of RHEL platform. Recently there have been many issues especially with the
newer ClearCase releases.  IBM has given Nokia a list of issues on RHEL
which affect their Clear Case implementation. 

This issue is part of that list and has been reported by Nokia to be a
problem issue which they would like to see fixed.



This event sent from IssueTracker by sprabhu 
 issue 163042

-- Additional comment from jlayton on 2008-02-26 10:35 EST --
Created an attachment (id=295935)
simpler reproducer 2

Another reproducer. This one dumps the state of the problem page once we hit a
failure. I've also tried to make it so that as soon as a problem is detected,
the other threads stop what they're doing. I don't think that's working right 
though since the erroring thread is probably exiting before that...

The fact that this reproducer works I think lends creedence to the idea that
the problem is a read/write race and not an issue of inconsistency in the mmap
between processes. Each process in this reproducer is writing to its own area
of each page, so inconsistencies shouldn't really matter...

I've also started running this against a server that uses xfs as the underlying
filesystem, since xfs has sub-second timestamps. I think that the problem is
most likely related to cache invalidation events, and I'm hoping that will make
them occur more frequently.

In any case, current run:

# ./mmap-repro-loop.sh /mnt/xfs/mtxfile
Run 1
Run 2
16852: error p=0 s=8 v=0
(16852) 0: 0 0 0 0 0 0 0 0 0 16853
<16853> lock errsem (11:Resource temporarily unavailable)

...this means that process 16852 was expecting to see its pid in page 0, slot
8, but there's a 0 in there instead. I've also added a bit of debug code to
dump the first 10 bytes of the page prior to and after a readpage call:

NFS: nfs_readpage (0000010000bc88d8 4096@0)
NFS: readpage_async page=0000010000bc88d8 len=4096
NFS: page contents: 0 0 0 0 0 0 0 0 16852 0

^^^ page contents before the readpage was done. The pid seems to be present.

NFS: 25468 initiated read call (req 0:13/131, 4096 bytes @ offset 0)
NFS: nfs_update_inode(0:13/131 ct=1 info=0x6)
NFS: 25468 nfs_readpage_result, (status 4096)
NFS: page contents: 0 0 0 0 0 0 0 0 0 0

^^^ page contents after -- it's no longer there.

NFS: read done (0:13/131 4096@0)

...this suggests that the page contents are being overwritten by a read from
the server prior to them being flushed out in a write. I need to instrument the
corresponding writepages calls too however to make sure I'm understanding this
correctly.


-- Additional comment from jlayton on 2008-02-26 10:43 EST --
If my conclusion in comment #60 is correct, this is probably indicative of a
couple of different problems:

1) we may be too aggressive about invalidating the cache. This could be causing
performance issues if so. That said, prior to 4.5 we were definitely not
aggressive enough about invalidating the cache and that lead to cache
consistency problems as well. That change in behavior is probably what makes the
next problem more evident. There may not be much that we can do about this problem.

2) we're occasionally allowing a read call to overwrite a dirty page. The
mechanism for this isn't clear to me yet, but I think this this the real bug here.

I'm going to continue trying to track this down...


-- Additional comment from tao on 2008-02-26 11:25 EST --
----- Additional Comments From ffilz.com  2008-02-26 11:23 EDT
-------
Reposting Jeff's truncated comment:

Another reproducer. This one dumps the state of the problem page once we
hit a
failure. I've also tried to make it so that as soon as a problem is
detected,
the other threads stop what they're doing. I don't think that's working
right 
though since the erroring thread is probably exiting before that...

The fact that this reproducer works I think lends creedence to the idea
that
the problem is a read/write race and not an issue of inconsistency in the
mmap
between processes. Each process in this reproducer is writing to its own
area
of each page, so inconsistencies shouldn't really matter...

I've also started running this against a server that uses xfs as the
underlying
filesystem, since xfs has sub-second timestamps. I think that the problem
is
most likely related to cache invalidation events, and I'm hoping that
will make
them occur more frequently.

In any case, current run:

# ./mmap-repro-loop.sh /mnt/xfs/mtxfile
Run 1
Run 2
16852: error p=0 s=8 v=0
(16852) 0: 0 0 0 0 0 0 0 0 0 16853
<16853> lock errsem (11:Resource temporarily unavailable)

...this means that process 16852 was expecting to see its pid in page 0,
slot
8, but there's a 0 in there instead. I've also added a bit of debug code
to
dump the first 10 bytes of the page prior to and after a readpage call:

NFS: nfs_readpage (0000010000bc88d8 4096@0)
NFS: readpage_async page=0000010000bc88d8 len=4096
NFS: page contents: 0 0 0 0 0 0 0 0 16852 0

^^^ page contents before the readpage was done. The pid seems to be
present.

NFS: 25468 initiated read call (req 0:13/131, 4096 bytes @ offset 0)
NFS: nfs_update_inode(0:13/131 ct=1 info=0x6)
NFS: 25468 nfs_readpage_result, (status 4096)
NFS: page contents: 0 0 0 0 0 0 0 0 0 0

^^^ page contents after -- it's no longer there.

NFS: read done (0:13/131 4096@0)

...this suggests that the page contents are being overwritten by a read
from
the server prior to them being flushed out in a write. I need to
instrument the
corresponding writepages calls too however to make sure I'm understanding
this
correctly. 


This event sent from IssueTracker by Glen Johnson 
 issue 163042

-- Additional comment from tao on 2008-02-26 11:30 EST --
file attachment from LTC:simple-reproducer2.c


This event sent from IssueTracker by Glen Johnson 
 issue 163042
it_file 121097

-- Additional comment from tao on 2008-02-26 11:30 EST --
----- Additional Comments From ffilz.com  2008-02-26 11:24 EDT
-------
 
Simple reproducer 2 


This event sent from IssueTracker by Glen Johnson 
 issue 163042

-- Additional comment from jlayton on 2008-02-26 13:35 EST --
Another run:

[root@dhcp231-236 432974]# ./mmap-repro-loop.sh /mnt/xfs/mtxfile 
Run 1
Run 2
26374: error p=0 s=0 v=0
(26374) 0: 0 0 0 26377 26378 0 0 0 0 0

Slightly updated debug patch (I'll post that in a bit) that shows the page flags:

NFS: nfs_readpage (00000100008612d0 4096@0)
NFS: readpage_async page=00000100008612d0 len=4096
NFS: dump_page flags=20000065 contents= 26374 0 0 0 0 0 0 0 0 0
NFS: 46999 initiated read call (req 0:13/131, 4096 bytes @ offset 0)
NFS reply access, status = 0
NFS: permission(0:13/131), mask=0x6, res=0
NFS: nfs_update_inode(0:13/131 ct=1 info=0x6)
NFS: 46999 nfs_readpage_result, (status 4096)
NFS: dump_page flags=2000006d contents= 0 0 0 0 0 0 0 0 0 0

...the main difference is that the PG_uptodate flag is set after the read
completes (which is not unexpected). What I need to understand now is why we're
allowing the read to happen when the page has been modified by userspace...


-- Additional comment from tao on 2008-02-26 13:50 EST --
----- Additional Comments From ffilz.com  2008-02-26 13:48 EDT
-------
Hmm, any idea why PG_dirty is not set? 


This event sent from IssueTracker by Glen Johnson 
 issue 163042

-- Additional comment from jlayton on 2008-02-26 15:38 EST --
Created an attachment (id=295990)
simpler reproducer 3

Modified reproducer to divorce the number of pages from the number of threads.
I can now reproduce this when dealing with a single page and that seems to make
this occur a bit more easily, I suppose because access to the page is more
hotly contended.


-- Additional comment from jlayton on 2008-02-26 15:43 EST --
Not sure on the PG_dirty bit yet, but I think the real problem is that we're
doing this:

NFS: revalidating (0:13/131)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/131 ct=1 info=0x6)
NFS: (0:13/131) data cache invalidated

...when we see "data cache invalidated" then we know that
invalidate_inode_pages2() has been called and the PG_uptodate bit has been cleared. 

In the example in comment #65, had pid 26374 already written to the page when
invalidate_inode_pages2() was called here? If so, why wasn't the page written
back to the server. If not, then why wasn't the PG_uptodate bit set when the
page was modified?

Now that we have the reproducer down to using a single page, it makes this a
little easier to focus in on...


-- Additional comment from jlayton on 2008-02-27 09:04 EST --
Created an attachment (id=296065)
kernel debug patch 1

Patch I'm currently using for debugging. It just adds some extra functions to
dump page contents and flags at different points along the way.


-- Additional comment from jlayton on 2008-02-27 09:38 EST --
Run with above debug patch applied and nfs_debug turned up to 32767:

Run 11
5140: error p=0 s=8 v=0
(5140) 0: 0 0 0 0 0 0 0 5139 0 0
<5139> lock errsem (11:Resource temporarily unavailable)
<5137> lock errsem (11:Resource temporarily unavailable)

...looking in the logs:

nfs: mmap(//mtxfile) pid=5140
NFS: nfs_update_inode(0:13/131 ct=2 info=0x6)
NFS: mtime change on server for file 0:13/131
NFS:  832 nfs_writeback_done (status 4096)
NFS: dump_page page=0000010000a67228 flags=2000202c contents= 0 0 0 0 0 0 0 0 0 0

^^^ this is from the nfs_dump_mapping_pages call. The following messages
indicate that the page was written back to the server. Notice that the pid has
not been written to the page.

2000202C = 100000000000000010000000101100 =
PG_mappedtodisk | PG_lru | PG_uptodate | PG_referenced | PG_writeback

NFS: write (0:13/131 4096@0) marked for commit
NFS: nfs_wait_on_requests returns 1
NFS:  833 initiated commit call
nfs: fsync(0:13/131)
NFS: nfs_update_inode(0:13/131 ct=2 info=0x6)
NFS:  833 nfs_commit_done (status 0)
NFS: commit (0:13/131 4096@0) OK
NFS: nfs_wait_on_requests returns 1
NFS: nfs_wait_on_requests returns 0
nfs: flush(0:13/131)
NFS: nfs_wait_on_requests returns 0

...write, commit, etc...

NFS: revalidating (0:13/131)
NFS call  getattr
NFS reply getattr
NFS: nfs_update_inode(0:13/131 ct=1 info=0x6)

...we revalidate the mapping again...

NFS: dump_page page=0000010000a67228 flags=2000002c contents= 0 0 0 0 0 0 0 0 5140 0
NFS: nfs_wait_on_requests returns 0
NFS: nfs_wait_on_requests returns 0
NFS: (0:13/131) data cache invalidated

...the dump_page output is from the nfs_dump_mapping_page call. It shows that
the pid has been written to memory, but nfs_revalidate_mapping did not make it
write back the page. Page flags at that time were:

2000002C = 100000000000000000000000101100 =
PG_mappedtodisk | PG_lru | PG_uptodate | PG_referenced

I think the main difference is that in the first case, the page was already
under writeback when nfs_revalidate_mapping() was called.



-- Additional comment from jlayton on 2008-02-27 09:41 EST --
This comment above nfs_writeback_done_full() is interesting:

/*
 * Handle a write reply that flushes a whole page.
 *
 * FIXME: There is an inherent race with invalidate_inode_pages and
 *        writebacks since the page->count is kept > 1 for as long
 *        as the page has a write request pending.
 */

...though the nature of the race is a little vague and I'm not sure that it's
the problem here.


-- Additional comment from jlayton on 2008-02-27 09:47 EST --
To summarize:

The problem appears to be that the client is occasionally noticing that the file
has changed on the server and is invalidating pages that have been modified via
the shared mmap without initiating writeback on them. My main question at this
point is:

We're definitely calling nfs_writepages() on the mapping. Why is it not picking
up that the page needs to be written back?

I've never seen the PG_dirty bit set, so I'm not sure that nfs_writepages uses
that to decide whether to write back the page. If not, what criteria *does* it use?


-- Additional comment from jlayton on 2008-02-27 14:26 EST --
Actually, I take that back. I generally do see PG_dirty set on the page, but
only when the file is initially written out in the reproducer.

In my (possiby wrong) understanding, PG_dirty should be getting set by the MMU
when the page is updated via the mmap from userspace, and should only be getting
cleared after the page is written out. The writeback should all be done with the
page locked, so that should prevent userspace from modifying the page while
writeback is occurring, correct?

The question is -- why do we never see the PG_dirty bit set in the debug output?
Pages are clearly being dirtied and no writeback is occurring so that seems to
suggest that PG_dirty is never getting set for some reason.


-- Additional comment from staubach on 2008-02-27 16:52 EST --
It could be that PG_DIRTY is not getting set, but it could also be
being cleared in invalidate_inode_pages2() before it can be detected.

-- Additional comment from jlayton on 2008-02-27 18:44 EST --
IRC notes from Peter:

16:26 < ps> Sorry to bother -- but check out invalidate_complete_page2 which is 

            called from invalidate_inode_pages2 in the upstream kernel...

16:26 < ps> It does not invalidate the page if the page is dirty.

16:26  * ps thinks that steved is a funny guy and clearly does not have enough 

          to do, so needs this bz?

16:29 < ps> I think that that might address this, if the problem is that the 

            page is redirtied after being cleaned but before 

            invalidate_inode_pages2 is done.

16:30 < steved> ps: it would not be the first time we found problem in that 

                invalidate_inode_pagesX path... 

16:30 < steved> and no... I do not need any more bz! :_

16:30 < ps> Yes, I am not surprised.  Invalidating pages is a trciky business.

16:31 < steved> especially for all the different filesystems... 

16:32 < ps> At least, we could instrument invalidate_inode_pages2 to do 

            something interesting when it detects a dirty page...

16:32 < steved> yes... 



...I'll plan to try and confirm whether Peter's suspicion is correct tomorrow
and have a look at what he's mentioning here in the upstream kernel if so.


-- Additional comment from jlayton on 2008-02-28 08:22 EST --
Good call, Peter! Sprinkled some nfs_dump_page() calls in that area of
nfs_revalidate_inode. Snippet from that patch:

----------[snip]-----------
        if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
                if (S_ISREG(inode->i_mode)) {
+                       nfs_dump_mapping_page(mapping, 10);
                        if (filemap_fdatawrite(mapping) == 0)
                                filemap_fdatawait(mapping);
                        nfs_wb_all(inode);
+                       nfs_dump_mapping_page(mapping, 10);
                }
                invalidate_inode_pages2(mapping);
+               nfs_dump_mapping_page(mapping, 10);
----------[snip]-----------
Output from the run:

Run 19
13247: error p=0 s=4 v=0
(13247) 0: 0 0 0 0 0 0 0 0 0 0

...so we were expecting to see the pid 13247 in there, but it wasn't. Debug output:

NFS: nfs_update_inode(0:13/131 ct=4 info=0x6)
NFS: mtime change on server for file 0:13/131
NFS: dump_page page=0000010000616aa8 flags=20000021 contents= 0 0 0 0 0 0 0 0 0 0

I think this one is from just before we call filemap_fdatawrite() ^^^

NFS: nfs_wait_on_requests returns 0
NFS: nfs_wait_on_requests returns 0
NFS: dump_page page=0000010000616aa8 flags=20000021 contents= 0 0 0 0 0 0 0 0 0 0

^^^ just after nfs_wb_all()

NFS: 2255 nfs_readpage_result, (status 4096)
NFS: dump_page page=0000010000616aa8 flags=20000029 contents= 0 0 0 0 0 0 0 0 0 0
NFS: read done (0:13/131 4096@0)

^^^ read that raced in there somehow

NFS: dump_page page=0000010000616aa8 flags=20000024 contents= 0 0 0 0 13247 0 0
0 0 0

^^^ just after invalidate_inode_pages2 was called

NFS: (0:13/131) data cache invalidated
NFS: nfs3_forget_cached_acls(0:13/131)
NFS: (0:13/131) revalidation complete

So it looks the change to the page contents is racing in between nfs_wb_all()
and invalidate_inode_pages2(). I'll have a look over the
invalidate_inode_pages2() codepath today and the upstream patch that Peter
mentioned on IRC.


-- Additional comment from jlayton on 2008-02-28 09:36 EST --
Created an attachment (id=296212)
possible upstream patch

Actually, it looks like RHEL4 already does something similar in
invalidate_complete_page(). If the page is dirty it too returns 0 and shouldn't
invalidate anything. That function is not called, however if the page is mapped
(and I think it would be in this case)...

This patch from the old-2.6-bkcvs tree might be where we need to start, but it
looks like there were some bugs with it that were fixed later. See also:

commit 1bdeba19a4885374a0031b25e6f2326d486bb65e
Author: zach.brown <zach.brown>
Date:	Sat Mar 5 17:26:38 2005 +0000

    [PATCH] invalidate range of pages after direct IO write

...and...


commit 2eff44802548850761e79c1257db48b9a19d91dd
Author: sct <sct>
Date:	Thu Mar 10 16:39:55 2005 +0000

    [PATCH] invalidate_inode_pages2_range livelock fix

...also from old-2.6-bkcvs tree. Still, poking around in this area at this late
stage of RHEL4's lifecycle makes me a little nervous. This could break more
than it solves...

Comment 1 Jeff Layton 2008-02-28 15:46:11 UTC

I've decided to continue working on this in RHEL5 for now. It seems to be easier
to reproduce this problem on RHEL5 and it's implementation of
invalidate_inode_pages2 is more up to date. I backported a couple of NFS patches
that seemed like they might help this there (presuming that the race is the same
as the one we see in RHEL4). I'll attach them in a bit -- they seem to help
some. The test would usually fail on RHEL5 very quickly but this patch seems to
make it take a bit longer.

I think I'm going to have to roll a rhel5 version of this debug patch to verify
that the problem is the same there...

Comment 2 Jeff Layton 2008-02-28 16:25:44 UTC

Created attachment 296231 [details]
proposed patch 1

Subject: [PATCH] NFS: Deal with failure of invalidate_inode_pages2()

If invalidate_inode_pages2() fails, then it should in principle just be
because the current process was signalled.  In that case, we just want to
ensure that the inode's page cache remains marked as invalid.

Also add a helper to allow the O_DIRECT code to simply mark the page cache as
invalid once it is finished writing, instead of calling
invalidate_inode_pages2() itself.

Signed-off-by: Trond Myklebust <Trond.Myklebust>
Signed-off-by: Andrew Morton <akpm>
Signed-off-by: Linus Torvalds <torvalds>

...

Proposed patch #1. I don't agree with Trond's statement above though. I think
I've see invalidate_inode_pages2 return error even when the process wasn't
signaled. Though maybe there were differences between RHEL5's iip2() and the
upstream one at the time that this patch went in.

Comment 3 Jeff Layton 2008-02-28 16:27:28 UTC

Created attachment 296232 [details]
proposed patch 2

Second patch backported from upstream to deal with races in this codepath.

...might need steved and dhowells to look over it and make sure I've gotten the
fscache stuff right...

Comment 4 Jeff Layton 2008-02-28 16:31:43 UTC

These two patches seem to help things on RHEL5 -- the reproducer lasts quite a
bit longer with them in place. They need more careful review though, I rather
hastily backported them...

There still seems to be a lingering race somewhere though. It looks like the
page is getting dirtied while we're inside nfs_sync_mapping(), but it's returning 0.

Comment 5 Jeff Layton 2008-02-28 18:51:25 UTC

Ok, so the page does seem to be getting dirtied while in nfs_sync_mapping(). So
we enter invalidate_inode_pages2() with a dirty page in the mapping. It then
unmaps the page, and does this near the bottom of the loop in
invalidate_inode_pages2_range:

                        was_dirty = test_clear_page_dirty(page);
                        if (!invalidate_complete_page2(mapping, page)) {
                                if (was_dirty)
                                        set_page_dirty(page);
                                ret = -EIO;
                        }
                        unlock_page(page);

...which seems goofy to me. test_clear_page_dirty(page) is going to clear the
dirty flag. We'll then invalidate the page (since it's no longer marked dirty)
in invalidate_complete_page2(). Only if that fails for some other reason will we
reset the dirty flag and return error.

That seems wrong to me (at least in this situation). If the page is dirty once
we get to that point we shouldn't just be discarding it like this...

Comment 6 Jeff Layton 2008-02-28 19:04:05 UTC

Did a similar debug patch to the one I had for rhel4 and instrumented before and
after nfs_sync_mapping() and before invalidate_inode_pages2():

NFS: dump_page page=ffff81000184fa60 flags=18080000000021 contents=0 0 0 0 0 0 0
0 0 0 
NFS: after nfs_sync_mapping (ret=0):
NFS: dump_page page=ffff81000184fa60 flags=1808000000087c contents=0 0 0 0 0 0
14101 0 0 0 
NFS: before invalidate_inode_pages2:
NFS: dump_page page=ffff81000184fa60 flags=1808000000087c contents=0 0 0 0 0 0
14101 0 0 0 
NFS: after invalidate_inode_pages2 (ret=0)

The problem was detected by pid 14101 -- it was expecting to see its pid but
instead saw a 0 in its slot.

Comment 7 Jeff Layton 2008-02-28 19:51:29 UTC

I think I see some of the patchset that went in to fix this final race. This is
the main one I think:

From e3db7691e9f3dff3289f64e3d98583e28afe03db Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust>
Date: Wed, 10 Jan 2007 23:15:39 -0800
Subject: [PATCH] [PATCH] NFS: Fix race in nfs_release_page()

    NFS: Fix race in nfs_release_page()

    invalidate_inode_pages2() may find the dirty bit has been set on a page
    owing to the fact that the page may still be mapped after it was locked.
    Only after the call to unmap_mapping_range() are we sure that the page
    can no longer be dirtied.
    In order to fix this, NFS has hooked the releasepage() method and tries
    to write the page out between the call to unmap_mapping_range() and the
    call to remove_mapping(). This, however leads to deadlocks in the page
    reclaim code, where the page may be locked without holding a reference
    to the inode or dentry.

    Fix is to add a new address_space_operation, launder_page(), which will
    attempt to write out a dirty page without releasing the page lock.

Signed-off-by: Trond Myklebust <Trond.Myklebust>

    Also, the bare SetPageDirty() can skew all sort of accounting leading to
    other nasties.

[akpm: cleanup]
Signed-off-by: Peter Zijlstra <a.p.zijlstra>
Cc: Trond Myklebust <Trond.Myklebust>
Signed-off-by: Andrew Morton <akpm>
Signed-off-by: Linus Torvalds <torvalds>

...the problem here, of course is that adding a new address_space_op is probably
out of the question since that will probably blow kabi out of the water (though
maybe we could do it by adding a new address_space flag to signify those that
actually have the field.

Still, I may go ahead and just backport this patch and a couple of others as a
proof-of-concept and see if I can eliminate the race with it. We can then look
at how best to fix up kabi afterward...

Comment 8 Jeff Layton 2008-02-29 21:29:49 UTC

Backported 2 more patches to this set to attempt to fix this:

Author: Trond Myklebust <Trond.Myklebust>
Date:   Tue Dec 5 00:35:40 2006 -0500

    BZ#435291: NFS: Add nfs_set_page_dirty()
    
    Upstream commit 1a54533ec8d92a5edae97ec6ae10023ee71c4b46

...and...

Author: Trond Myklebust <Trond.Myklebust>
Date:   Tue Dec 5 00:35:42 2006 -0500

    BZ#435291: NFS: Fix nfs_release_page
    
    Upstream commit fa8d8c5b77a2dc467b5365a5651710161b84f16d

...though I can still reproduce this even with this set. As a final stab at
this, I may try to go ahead and backport the patch that adds the launder_page
call, but if the other set doesn't fix it then I have a feeling that that won't
either.

My guess is that we're hitting a problem in the core invalidate_inode_pages2
implementation and we'll need to either patch it or add a parallel call that
handles this correctly. I'll plan to take a harder look next week and see if I
can pinpoint where exactly the race is coming into play with these other patches
in place.

Comment 9 Jeff Layton 2008-03-03 20:18:22 UTC

I've found that running against a netapp is a bit more reliable way to reproduce
this...

In addition to the patches that I've backported to try and fix this, I've also
added some more debugging info. Unfortunately, adding the debug code seems to
change the timing in such a way that it makes the problem go away. When I turn
off the debug info, it returns quickly. I'll need to ponder some less invasive
method to track this down...

Comment 10 Jeff Layton 2008-03-04 15:02:43 UTC

Started with a kernel without all of these patches to make sure I understand the
problem in RHEL5 correctly. I added a bit of debug code:

So here pid 10979 is checking the page validity finds that it's wrong, dumps the
(clean) page from the mapping and does a read, and then updates the page in memory:

NFS:(10979) dump_page page=ffff81000102abe0 flags=808000000086c contents=0 0 0 0
0 0 0 0 0 0 
nfs_revalidate_mapping:(10979) invalidate_inode_pages2 returns 0
NFS:(10979) dump_page page=ffff81000102abe0 flags=8080000000004 contents=0 0 0 0
0 0 0 0 0 0 
NFS(10979) (0:17/2845690) data cache invalidated
NFS: nfs_readpages (0:17/2845690 1)
NFS:    0 initiated read call (req 0:17/2845690, 4096 bytes @ offset 0)
nfs_readpage_result: 7918, (status 4096)
NFS: read done (0:17/2845690 4096@0)


...next 10978 calls invalidate_inode_pages2

nfs_revalidate_mapping:(10978) calling invalidate_inode_pages2

Page is clearly dirty at this point:

NFS:(10978) dump_page page=ffff8100012ad420 flags=808000000003c contents=0 0 0 0
0 0 0 0 10979 0 

...invalidate inode pages2 returns 0, and the page is now clean, _but_ there
doesn't seem to have been any writeback activity in here. Why not?

nfs_revalidate_mapping:(10978) invalidate_inode_pages2 returns 0
NFS:(10978) dump_page page=ffff8100012ad420 flags=8080000000004 contents=0 0 0 0
0 0 0 0 10979 0 
NFS(10978) (0:17/2845690) data cache invalidated

data cache is invalidate and the page is now clean and unattached to the
mapping. We do another read into a new page that gets added to the inode mapping:

NFS: nfs_readpages (0:17/2845690 1)
NFS:    0 initiated read call (req 0:17/2845690, 4096 bytes @ offset 0)
nfs_revalidate_mapping:(10980) calling invalidate_inode_pages2
NFS:(10980) dump_page page=ffff810000dd5240 flags=80000000021 contents=0 0 0 0 0
0 0 0 0 0 


...so the problem in this situation was that invalidate_inode_pages2 cleaned a
dirty page without writing it back. The question is why...

Comment 11 Jeff Layton 2008-03-04 15:42:59 UTC

Ok, it looks like upstream had a patch that went in a while back that changed
some of the invalidate_inode_pages2 semantics:

From fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds.org>
Date: Wed, 20 Dec 2006 13:46:42 -0800
Subject: [PATCH] VM: Remove "clear_page_dirty()" and "test_clear_page_dirty()"
functions

...in particular, it has this delta for iip2 which doesn't seem to depend on the
new functions being added. This makes sense to me, I think. The main way that
invalidate_complete_page2 returns error is when there is PagePrivate is set and
we fail to release the page. In this case, it was *not* set, so the page was
happily cleaned. This seems wrong to me -- we shouldn't be trying to clean any
dirty pages without attempting to write them back...

-                       was_dirty = test_clear_page_dirty(page);
-                       if (!invalidate_complete_page2(mapping, page)) {
-                               if (was_dirty)
-                                       set_page_dirty(page);
+                       if (!invalidate_complete_page2(mapping, page))
                                ret = -EIO;
-                       }

...I plan to play around with modifying this and seeing if it makes any difference.

Comment 12 Jeff Layton 2008-03-04 18:42:00 UTC

After fixing the above delta and applying the patch to add the launder_page
operation, I've been able to run the reproducer for over 5000 iterations without
it hitting the inconsistency.

The downside is:

1) this patchset is large and is almost certainly a kabi breaker. It also mucks
with the invalidate_inode_pages2() function, which is always a bit scary...

2) with this patchset I get a lot of failed mmap() operations.

I think what's happening is that nfs_file_mmap calls nfs_revalidate_mapping.
This causes invalidate_inode_pages2 to get called which then returns -EIO
because the page is being dirtied by the time it calls
invalidate_complete_page2(). This makes the nfs_file_mmap operation fail.

I see some similar failures on rawhide kernels, but the frequency on this
patched RHEL5 kernel is much higher. Now that I *think* see the problem with
invalidate_inode_pages2 I may try to shoot for a smaller more targeted patch and
see if we can't fix this a little less invasively.

Comment 13 Jeff Layton 2008-03-06 16:24:29 UTC

Looking back at the info in comment #10...


NFS:(10978) dump_page page=ffff8100012ad420 flags=808000000003c contents=0 0 0 0
0 0 0 0 10979 0 

...this page was dirty before we enter invalidate_inode_pages2, but PG_private
seems to be unset. So even if we stop clearing the dirty bit before calling
invalidate_complete_page2, we still won't call try_to_release_page and it won't
be written back. That mainly gets set during writeback apparently, so since the
other process raced in and updated the page after the writeback completed, it's
not set here.

I think the right approach might be to turn invalidate_inode_pages2_range into a
wrapper around a new function that does close to the same thing but also takes a
"launder_page" function pointer. In most cases that will be called with a NULL
pointer and it will do what it does now, but when NFS calls this directly we'll
take extra care to try and write back the page after unmapping it. This might
take some careful adjustment of the nfs_releasepage function as well since we
may be interfering with it.

Comment 14 Peter Staubach 2008-03-06 16:34:32 UTC

I am not sure that I see the value in this above just remapping the page
when it is discovered that it is dirty?  Writing out the contents to the
server makes it a valid page, does it not?

Comment 15 Jeff Layton 2008-03-06 18:22:40 UTC

Yes, but is there a potential race there? Suppose we have 2 tasks with the same
page mapped:

Task1 calls invalidate_inode_pages2_range which calls unmap_mapping_range...

Task2 incurs a page fault and does a readpage to attach a new page where the old
one was. Makes a change to that page and dirties it.

Task1 checks the original page, finds it dirty and then remaps the original page
back into the mapping.

What happens at that point? (we're getting outside of my meager VM knowledge
here :-) Can we even undo unmap_mapping_range like that? If so, how?

Presuming we can just put the mapping back, there may be a way to do what you
suggest and prevent this race I suppose -- maybe by holding the i_mutex for the
inode?

Comment 16 Peter Staubach 2008-03-06 18:36:39 UTC

Right.  The trick is prevent the new page from being created until we are
done with the old page.  The page fault code needs to be blocked until the
process which wants to invalidate all of the pages is completed.

Off hand, I don't know how to do this yet, but I will take a look.

Comment 17 Jeff Layton 2008-03-06 18:44:17 UTC

At this point, I think we could probably use Larry's expertise here. Let me see
if I can summarize the problem:

When the cache revalidation logic decides to invalidate the pagecache for a file
on NFS, nfs_revalidate_inode will be called. It in turn will call
nfs_sync_mapping() to writeback any dirty pages and then call
invalidate_inode_pages2() to unmap them.

The problem here is that there are several races possible here. The page can be
redirtied after nfs_sync_mapping completes but before invalidate_inode_pages2
(iip2) can call unmap_mapping_range.

It looks like iip2 tries to handle this situation correctly, but it does a
pretty poor job. From invalidate_inode_pages2_range:

-------------[snip]--------------
was_dirty = test_clear_page_dirty(page);
if (!invalidate_complete_page2(mapping, page)) {
        if (was_dirty)
                set_page_dirty(page);
        ret = -EIO;
}
-------------[snip]--------------

...so if the page is dirty, we clean it and then iff invalidate_complete_page2()
returns error do we set the page dirty again and return error. There is some
logic in nfs_releasepage to attempt a page writeback, but try_to_release_page()
only gets called if PagePrivate is set. That only seems to be so when the page
already has an outstanding write request...

So my question is:

When the page is redirtied just before the page is unmapped in
invalidate_inode_pages2, should it just be clearing the dirty bit on the page
like this without attempting to write it back? I'm presuming the answer there is
no...

If that does occur, I presume that we should expect to get -EIO back from
invalidate_inode_pages2. We'll need to change the NFS code that calls that
function to expect this possibility and deal with it accordingly.

Comment 18 Jeff Layton 2008-03-06 18:48:43 UTC

> Right.  The trick is prevent the new page from being created until we are
> done with the old page.  The page fault code needs to be blocked until the
> process which wants to invalidate all of the pages is completed.

I did a bit of reading in Documentation/filesystems/vfs.txt:

"Adding and removing pages to/from an address_space is protected by the
inode's i_mutex."

It's not clear to me where it's taken in that codepath and whether we can call
invalidate_inode_pages2 with that held without deadlocking...

Comment 19 Jeff Layton 2008-03-11 17:42:14 UTC

Created attachment 297648 [details]
experimental patch 1 -- don't clean pages as aggressively in invalidate_inode_pages2_range

First patch in a smaller, more targeted patchset I've been working with.

Comment 20 Jeff Layton 2008-03-11 17:42:54 UTC

Created attachment 297649 [details]
experimental patch 2 -- have nfs_revalidate_mapping deal with errors from invalidate_inode_pages2

Second patch

Comment 21 Jeff Layton 2008-03-11 17:50:37 UTC

I did some testing on a kernel with the two patches in comment #19 and comment
#20. I was able to run the reproducer for over an hour with no corruption. I
ended up killing the reproducer and it never failed. The downside is that this
does cause nfs_file_mmap to occasionally fail when it races with pages being
dirtied:

Run 24939
Run 24940
<10910> mmap failed (5:Input/output error)
<10911> mmap failed (5:Input/output error)
<10912> mmap failed (5:Input/output error)
<10913> mmap failed (5:Input/output error)
Run 24941
<10925> mmap failed (5:Input/output error)
Run 24942

...this seems to happen fairly frequently. I think this is probably an
improvement -- it's better to have the mmap fail rather than corrupt the data,
but it's not ideal.

We may be able to fix these though with a mutex to serialize mmap and
nfs_revalidate_mapping. I want to do a bit more testing with these two patches
and then I'll see if we can do that.

This is not what Peter was suggesting. I'm still open to his idea, but this does
seem to more closely follow upstream behavior. Upstream seems to behave
similarly -- we get the occasional failed mmap there, but the frequency seems to
be a lot lower than on RHEL5 with these patches.

One of these patches touches invalidate_inode_pages2_range(). I'd like for Larry
to look at that one and comment when he gets some time. I think it's a sane
change, but that's definitely an area where I could use some guidance.

Comment 22 Jeff Layton 2008-03-12 19:10:21 UTC

I don't think my serialization idea in comment #21 will work. The pages are
being dirtied via userspace mmap and that's completely disconnected from any
kernel activity. I think all we can really do is try to unmap the pages, and
then to make an attempt to write them back if they're dirty after being unmapped.

I have a patch that adds some launder_page functionality, but it doesn't seem to
work correctly at the moment. The nfs_wb_page() function seems to work
completely differently upstream than it does in RHEL5, so I think we'd need to
have nfs_launder_page do something different. The RHEL5 version doesn't seem to
actually clean the page. PG_dirty seems to be consistently set on exit from the
function.

I've experimented a bit with different ways of attempting to write the page back
in nfs_launder_page, but they've all been buggy. It's a bit tricky since we're
calling this from invalidate_inode_pages2_range and the page is already locked,
etc...

Comment 23 Peter Staubach 2008-03-12 19:20:53 UTC

If the page is locked and unmapped and the mapping is locked, then
would it be possible to just map it back in instead of attempting
to clean it and destroy it?

Comment 24 Jeff Layton 2008-03-12 19:44:51 UTC

Good question...

Presuming that that's possible and that it's acceptible to do this, what would
be the proper return value in this case? -EAGAIN?

Then again, is it even necessary to remap the pages like that? Presuming that we
don't remove the page from the pagecache, then it should still be lingering
around. The next pagefault from userspace should make the page get mapped back
into the process' pagetables.

That may actually be a source of one of the races I've been seeing now that I
think about it. Even though we call unmap_mapping_range, is there anything that
prevents the page from being mapped back and dirtied before we get a chance to
write it out?

Comment 25 Peter Staubach 2008-03-12 20:09:34 UTC

It seems to me that, at this point, we don't want to have to worry
about writing out the page.  We tried earlier and if it is dirty
again, then we know that someone is accessing it and we should just
leave it alone.

However, we do have to ensure that it is not accessible to userspace
while we are checking to see whether it is dirty or not and also
that another page can not be created with this same identity in the
meantime.

Comment 26 Jeff Layton 2008-03-12 20:25:37 UTC

I don't think we can create another page with the same identity until this one
has been remove from the pagecache (as in invalidate_complete_page2()). With
this patchset, that should not be happening now when the page is dirty since
invalidate_complete_page2 should generally just fail in that case...

That said, I decided to test the original reproducer for this problem and hit an
issue after 253 runs:

Run 253
<5353> can't find me to logout - expected at slot 2
(5353) 0: val=-1 next=1
(5353) 1: val=5351 next=-1
(5353) 2: val=5353 next=-1
(5353) 3: val=-1 next=-1
(5353) 4: val=-1 next=-1
<5352> mmap failed (5:Input/output error)

...I've not done any detailed looking at why it failed yet.

Comment 27 Jeff Layton 2008-03-13 13:51:40 UTC

Modified the linked-list reproducer a bit so that it prints out the pids for
each run that are running concurrently:

25859:  25860 25861 25862 25863 25864 25865 25866 25867 25868 25869
<25869> can't find me to logout - expected at slot 1
(25869) 0: val=-1 next=-1
(25869) 1: val=25869 next=1
(25869) 2: val=-1 next=-1
(25869) 3: val=-1 next=-1
(25869) 4: val=-1 next=-1
(25869) 5: val=-1 next=-1
(25869) 6: val=-1 next=-1
(25869) 7: val=-1 next=-1
(25869) 8: val=-1 next=-1
(25869) 9: val=-1 next=-1
(25869) 10: val=-1 next=-1

...I also added some debug code to tell me when invalidate_inode_pages2 returns
error, and changed the dfprintk for the "data cache invalidated" message to be
printed with my private NFSDBG_MAPPING bit.

NFS: (25862) (0:17/2489340) data cache invalidated
NFS: (25861) (0:17/2489340) data cache invalidated
NFS: (25863) (0:17/2489340) data cache invalidated
NFS: (25869) (0:17/2489340) data cache invalidated
NFS: (25866) (0:17/2489340) data cache invalidated

...so when the above run failed, invalidate_inode_pages2 did not return error to
nfs_revalidate_mapping. It did, however, invalidate the cache.

One thing that might help this is figuring out why we're doing so many cache
invalidations with this reproducer. This host is the only writer of the file, so
it shouldn't be invalidating the cache so often.

While that wouldn't really "fix" this problem, I think that it would make it
much more difficult to hit and would probably help performance some here and
possibly in other situations.

Comment 28 Jeff Layton 2008-03-13 18:17:17 UTC

With the 2 patches I have so far, I'm no longer able to hit the race with my own
reproducer, but the mutex_lt3 reproducer can still hit it. I'm not sure of why
this is, but it may have something to do with the fact that that reproducer
deals with multiple pages...

On a hunch, I decided to try the following patch:

--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -704,13 +704,16 @@ int nfs_revalidate_mapping(struct inode *inode, struct add
                        || nfs_attribute_timeout(inode))
                ret = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
 
+       mutex_lock(&inode->i_mutex);
        if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
                nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
                if (S_ISREG(inode->i_mode))
                        nfs_sync_mapping(mapping);
                err = invalidate_inode_pages2(mapping);
-               if (err != 0)
-                       return err;
+               if (err != 0) {
+                       ret = err;
+                       goto out;
+               }
 
                spin_lock(&inode->i_lock);
                nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
@@ -724,6 +727,8 @@ int nfs_revalidate_mapping(struct inode *inode, struct addre
                                inode->i_sb->s_id,
                                (long long)NFS_FILEID(inode));
        }
+out:
+       mutex_unlock(&inode->i_mutex);
        return ret;
 }

-------------------------------

There is some similar serialization code upstream, but it's a bit more complex.
It's prob. not safe to take the i_mutex here in every case, but for this test it
seems to work well enough.

This actually seems to make the race harder to hit. I'm still able to hit it
with this patch in place, so it's not a complete fix, but it does seem to help
for reasons I don't completely understand.

I suspect that we might have multiple threads ending up in the
nfs_revalidate_mapping logic at once, and that accounts for at least some of the
races. Since this doesn't fix it completely though, I have to assume that there
are other races we're hitting too...

Comment 29 Jeff Layton 2008-03-14 13:38:23 UTC

I spent some time looking to see if we can try to reduce the number of cache
invalidations to work around this problem. I've identified some things we can do
to try and help that, but haven't been able to eliminate the cache invalidations
in this situation. The problem that I've seen now is that there are some races
between write calls on the wire and the getattr that is done when the file is
opened. When there are writes outstanding and a getattr is done, the mtime can
seemingly reflect any intermediate time on the file.

One thing we could consider is making sure that we wait for any outstanding
writes to complete when doing a getattr, but that could significantly slow
things down.

Comment 30 Peter Staubach 2008-03-14 14:22:38 UTC

This is the price to pay if there is a need to validate the attribute
cache while writing to a file.  The file has to be quiesced, ie. not
changing, while the validation is done.

Do you know where the GETATTR is coming from?  The WRITE responses
contain attributes which should be used to update the attributes.
There shouldn't be a need to issue the GETATTR for attribute cache
validation purposes because the attribute cache is continually
getting updated from the WRITE responses.

All this said, I don't think that this problem can be solved by
addressing the issue of cache invalidations done as a result of
processing WRITE responses.  There has to be a safe way implemented
which allows the pages to be safely unmapped, cleaned if necessary,
and then destroyed.  How does the system do it under low memory
situations when memory must be reclaimed?

Comment 31 Jeff Layton 2008-03-14 14:34:08 UTC

The GETATTR call is coming from another process opening the file. It's doing the
GETATTR as the open part of a CTO check. It just happens to be that the open
sometimes races with writes from other threads.

> All this said, I don't think that this problem can be solved by 
> addressing the issue of cache invalidations done as a result of
> processing WRITE responses.

Agreed, while this might help us from getting into this situation as often,
we'll still be vulnerable to it in some cases (particularly if there are other
hosts occasionally writing to the file).

Comment 32 IBM Bug Proxy 2008-03-14 14:56:59 UTC

------- Comment From ffilz.com 2008-03-14 10:52 EDT-------
> The problem that I've seen now is that there are some races
> between write calls on the wire and the getattr that is done when the file is
> opened. When there are writes outstanding and a getattr is done, themtime can
> seemingly reflect any intermediate time on the file.

Ok, so we have an open getattr racing with a write. Hmm, can we skip the
getattr for the open if a write is in progress? But what do we do if we want to
write and a getattr is in progress? With UDP, we can't guarantee ordering, can
we with TCP? (If we can, I'd be comfortable requiring TCP for this type of
situation).

> Agreed, while this might help us from getting into this situation as often,
> we'll still be vulnerable to it in some cases (particularly if thereare other
> hosts occasionally writing to the file).

Hmm, can we hope to support two clients mmapping the same file (for write)?

Comment 33 Jeff Layton 2008-03-14 15:28:06 UTC

> Hmm, can we skip the getattr for the open if a write is in progress?

Perhaps. That would mean that we'd probably need to reintroduce the data_updates
counter that the patch in bug 321111 removes. I played around a bit with
reimplementing it today and still hit some races that caused the cache to be
invalidated.

I've been using TCP for all of this testing. Packet loss and network delays
aren't really the cause of most of this. The problem seems to stem from the fact
that the server doesn't necessarily respond to calls in the order that they were
received.

I suppose this basically comes down to that we need to serialize getattrs vs.
writes somehow. Allow writes to occur however the client wants but we have to
make sure they're all complete before we do a getattr and not allow any new ones
until the getattr completes. Maybe we could do this with some sort of rq
semaphore? I'd need to think on it a bit.

> Hmm, can we hope to support two clients mmapping the same file (for write)?

Good question. With proper locking it seems like that should work. You'd have to
make sure that when the client gets the lock and the file has changed since you
last had it that the cache is invalidated and reloaded before allowing a process
to make changes to the mmap.

Regardless though, it might be best to consider this in the context of a
different BZ. Preventing false cache invalidations is really a separate issue
from what happens once the invalidation occurs.

Comment 34 Peter Staubach 2008-03-14 15:41:15 UTC

I don't think that we want to muck with the open-to-close semantics at
this point.  If we need to seriliaze writes and opens, then sobeit.

This problem can be solved with a safe way to unmap pages which does not
leave a window where data can be lost.

The other issues of reducing cache invalidations can be solved separately.

Comment 35 Jeff Layton 2008-03-17 13:10:50 UTC

Yep...

I decided to step back a bit and make sure I understand what's supposed to
happen when the page is mapped and dirtied. I've written a simple program to
mmap a file, read from it via the mmap and write back to it via the mmap.
Starting out playing with this on a file in /tmp and will look at NFS after I
see what happens here:

I start the program, which opens the file and mmaps it:

# ./mmap-test /tmp/testfile
mmap at address 0x2aaaaaaac000

crash> vtop 0x2aaaaaaac000
VIRTUAL     PHYSICAL        
2aaaaaaac000  (not mapped)

...so there is no page attached at this point. Now, read from the mapping (incur
a page fault):

      PTE         PHYSICAL  FLAGS
80000000158cc025  158cc000  (PRESENT|USER|ACCESSED|NX)

      VMA           START       END     FLAGS FILE
ffff8100153f6218 2aaaaaaac000 2aaaaaaad000     fb /tmp/testfile

      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff810001515c80 158cc000 ffff81001eabfbc8        0  3 1008000000086c

...now, write to it:

      PTE         PHYSICAL  FLAGS
80000000158cc067  158cc000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff810001515c80 158cc000 ffff81001eabfbc8        0  3 1008000000087c

...now msync:

      PTE         PHYSICAL  FLAGS
80000000158cc025  158cc000  (PRESENT|USER|ACCESSED|NX)

      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffff810001515c80 158cc000 ffff81001eabfbc8        0  3 1008000000086c


...nothing terribly surprising here.

I suspect that what's happening is that the page is being allowed to be dirtied
somehow after unmap_mapping_range() is complete and after the PageDirty check is
done in invalidate_complete_page2().

Could it be something like:

thread1: unmap_mapping_range completes (no more mmaps)
thread2: page fault occurs, page is remapped back into process address space
thread1: invalidate_complete_page2 is called and gets past the PageDirty check
thread2: dirties the page
thread1: __remove_from_page_cache

...there's probably some way to test for this, but I need to figure out to do it
in such a way that nothing can slip by. It seems like mapping a page into the
process' address space and dirtying it is pretty much lockless...

Comment 36 Jeff Layton 2008-03-17 15:23:39 UTC

Testing with my current test kernels (which contain the patches I have so far
for this). Here's the output from the run:

9216:  9217 9218 9219 9220 9221 9222 9223 9224 9225 9226
<9222> can't find me to logout - expected at slot 1
(9222) 0: val=-1 next=-1
(9222) 1: val=9222 next=1
(9222) 2: val=-1 next=-1
(9222) 3: val=-1 next=-1
(9222) 4: val=-1 next=-1
(9222) 5: val=-1 next=-1
(9222) 6: val=-1 next=-1
(9222) 7: val=-1 next=-1
(9222) 8: val=-1 next=-1
(9222) 9: val=-1 next=-1
(9222) 10: val=-1 next=-1
<9226> looped 2
(9226) 0: val=-1 next=1
(9226) 1: val=9222 next=1
(9226) 2: val=-1 next=-1
(9226) 3: val=-1 next=-1
(9226) 4: val=-1 next=-1
(9226) 5: val=-1 next=-1
(9226) 6: val=-1 next=-1
(9226) 7: val=-1 next=-1
(9226) 8: val=-1 next=-1
(9226) 9: val=-1 next=-1
(9226) 10: val=-1 next=-1

I ran this with the following systemtap script:

probe kernel.function("__remove_from_page_cache")
{
        if ($page->_mapcount->counter >= 0) {
                printf("__remove_from_page_cache(%u): _mapcount=%d
flags=0x%x\n", pid(), $page->_mapcount->counter, $page->flags);
                print_backtrace();
        }
}

...here's the output:

__remove_from_page_cache(9226): _mapcount=0 flags=0x8000000002d
 0xffffffff800ca61f : __remove_from_page_cache+0x1/0x3d
 0xffffffff800ce245 : invalidate_inode_pages2_range+0x158/0x209
 0xffffffff8842075a : nfs_revalidate_mapping+0xac/0x1af [nfs]
 0xffffffff8841e378 : nfs_file_mmap+0x21/0x4c [nfs]
 0xffffffff8000e584 : do_mmap_pgoff+0x4f1/0x7da
 0xffffffff80024cff : sys_mmap+0x89/0xc1
 0xffffffff800602a6 : tracesys+0xd5/0xdf

...I think this may be related to the problem here. The page should have been
unmapped when we called __remove_from_page_cache, but I'm guessing that sometime
after unmap_mapping_range was called, a process incurred a page fault and mapped
the page back into its page tables. It wasn't dirty at this point, but it easily
could have been dirtied after this and before trashing the page.

It seems like we need something that prevents the page from getting mapped back
into anything's pagetables before we toss it out.

I'm going to see if I can get a bit more milage out of systemtap here...

Comment 37 Jeff Layton 2008-03-17 15:52:12 UTC

Some of this may be incidental to the actual problem though. While testing this,
I've seen mapping go >= 0 when no problem occurred, and I've seen the problem
occur with no message pop.

The problem still could be the same, just with the race happening a bit later. I
may see if I can do this same check after invalidate_complete_page2 returns, but
that means rebuilding the kernel since that function is static.

Comment 38 Jeff Layton 2008-03-18 15:41:00 UTC

Created attachment 298407 [details]
latest linked-list reproducer

This is the latest linked list reproducer that I'm using to test this. It still
fails reliably even with the two experimental patches above. To build it:

# gcc -Wall -g -o ./mutex_lt3 ./mutex_lt3-7.c -lpthread

...I then run it under a shell script that runs it in a loop:


-------[snip]---------
#!/bin/sh

RESULT=0
RUN=1
while [ $RESULT -eq 0 ]; do
	./mutex_lt3 $1
	RESULT=$?
	RUN=`expr $RUN + 1`
done

exit $RESULT
-------[snip]---------

You run the shell script with a file on a NFS share as an argument. I've been
using a netapp server. The fact that it has sub-second mtimes I think makes for
more cache invalidations and helps to trigger the problem.

The code in this reproducer is very hard to follow, but basically it has a
bunch of threads adding and removing entries from a linked list that's on a
mmapped file. It uses a semaphore for serialization.

When run, one of the threads will eventually notice that the list is not what
is expected and will exit with an error.

Comment 39 Jeff Layton 2008-03-18 15:42:45 UTC

To summarize:

The issue with this case seems to be that when a file is mmapped on NFS
and processes are updating it, it occasionally ends up having different
contents vs. what's expected.

I *think* the problem is related to the client noticing mtime changes
on the file and invalidating its contents based on that. I suspect that
the problem is either:

1) we're occasionally invalidating a page that has dirty data

...and/or...

2) we're occasionally reading new data onto a dirty page

My initial thought was that the problem was a race where a page would
be dirtied after writeback but before being invalidated. The two experimental
patches attached to this case try to fix that. While they fix a simpler
reproducer that I have, they do not seem to fix the reproducer in comment #38.

Comment 40 Jeff Layton 2008-03-24 13:56:59 UTC

Actually....

I'm not convinced that this problem is actually fixed in rawhide. It may be, but
it may be that this problem is being sidestepped by reducing cache invalidations.

When doing a mutex_lt3_loop test on rawhide (kernel 2.6.25-0.121.rc5.git4.fc9),
I see virtually no read calls occurring. This tells me that, we're almost never
invalidating the cache and so we virtually never end up clobbering the data in a
dirty page...

Comment 42 IBM Bug Proxy 2008-03-31 16:17:10 UTC

------- Comment From sglass.com 2008-03-31 12:15 EDT-------
Red Hat, could I get a quick update on where we are with this bug  please.  IBM
really needs to update the customer on this bug.

Thanks

Stephanie Glass
SWG Program Manager

Comment 43 Peter Staubach 2008-03-31 17:29:39 UTC

I am working on a instrumenting a kernel so that I can attempt to
see what is happening when the corruption is lost.  Some of the
instrumentation needs to go into the page fault handling code, so
this must be done very carefully.  Also, due to the high level of
concurrency, this must be done very carefully in order to not
disturb the ordering by serializing all of the processes on a
single lock which is used for logging.  I have almost completed
the design for this instrumentation and will be implementing it.
Once I get it in place, then I would hope that we can find out
the updates to some pages are being lost.

Comment 44 IBM Bug Proxy 2008-04-07 13:57:05 UTC

------- Comment From ffilz.com 2008-04-07 09:51 EDT-------
Any updates on this?

Comment 45 Peter Staubach 2008-04-07 14:20:31 UTC

Not yet.  I needed to spend some time on another high priority bug for a
few days.  I have come to a stopping point on that other bug and will be
getting back to work on this situation this week.

Comment 46 IBM Bug Proxy 2008-05-01 16:09:01 UTC

------- Comment From chavez.com 2008-05-01 12:05 EDT-------
Given that Red Hat has requested a RHIT issue to assist in getting a fix into a
Z stream, I am going to have to break the mirroring link for a short time while
I create a RHIT issue to link the LTC bug to. After this, Red Hat will need to
establish a link from the RHIT issue to RH BZ 435291 with their Tao system.
Please be patient and refrain from adding comments until this is complete.
Thanks for your patience.

Comment 49 Issue Tracker 2008-05-06 12:54:31 UTC

Hello ENG,
I have attached the patch IBM tested and said was good.

Thank You
Joe Kachuck


This event sent from IssueTracker by jkachuck 
 issue 178083
it_file 131821

Comment 52 Peter Staubach 2008-05-20 17:38:29 UTC

Created attachment 306161 [details]
Proposed patch

Here is a proposed patch to address the situation.  I have not completed
testing using the patch, but the current results look promising.  A kernel
with this patch runs the reproducer in this bugzilla and the more complex
reproducer from bz435291 without failure on my test systems.

Comment 53 RHEL Program Management 2008-05-20 20:27:18 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 54 Issue Tracker 2008-05-21 19:20:13 UTC

----- Additional Comments From sglass.com  2008-05-21 14:17 EDT
-------
From Red Hat bug 435291 which did not get mirrored:

Comment #52 From Peter Staubach (staubach)  	 on 2008-05-20
13:38
EST  	[reply]  	 

  [edit]
Proposed patch

Here is a proposed patch to address the situation.  I have not completed
testing using the patch, but the current results look promising.  A
kernel
with this patch runs the reproducer in this bugzilla and the more complex
reproducer from bz435291 without failure on my test systems.

I will post the patch separately. 


This event sent from IssueTracker by jkachuck 
 issue 178083

Comment 55 Issue Tracker 2008-05-21 19:20:15 UTC

File uploaded: 435291.rhel-5

This event sent from IssueTracker by jkachuck 
 issue 178083
it_file 134389

Comment 56 Issue Tracker 2008-05-21 19:20:16 UTC

file attachment from LTC:435291.rhel-5


This event sent from IssueTracker by jkachuck 
 issue 178083

Comment 57 Issue Tracker 2008-05-21 19:20:18 UTC

----- Additional Comments From sglass.com  2008-05-21 14:26 EDT
-------
 
New Patch

Please test on both RHEL 4.5 and RHEL 5.1 


This event sent from IssueTracker by jkachuck 
 issue 178083

Comment 58 Peter Staubach 2008-05-21 19:45:49 UTC

Please be aware that the patch for RHEL-4 is different than the patch
for RHEL-5.  You will need the patch from bz432974 for the RHEL-4
version.

Comment 59 Issue Tracker 2008-05-28 19:13:10 UTC

changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|FIXEDAWAITINGTEST           |TESTED




------- Additional Comments From sglass.com  2008-05-28 14:50 EDT
-------
IBM has tested this latest patch and it works great!  Thanks for all the
hardwork. 


This event sent from IssueTracker by jkachuck 
 issue 178083

Comment 60 Peter Staubach 2008-05-28 20:54:53 UTC

Great news!  Thank you for the update.

Comment 69 Jeff Layton 2008-06-10 13:03:06 UTC

I added this patch to the test kernels that I have here:

http://people.redhat.com/jlayton/

Today, I was doing some testing for an upstream server-side
patch I'm working on and I noticed that iozone had hung on the rhel5
client I was using. Here's the stack there:

PID: 23154  TASK: ffff810003980c00  CPU: 0   COMMAND: "iozone"
 #0 [ffff810013f23cf8] schedule at ffffffff80065f83
 #1 [ffff810013f23dd0] inode_wait at ffffffff800eebb1
 #2 [ffff810013f23de0] __wait_on_bit at ffffffff80066a75
 #3 [ffff810013f23e20] __writeback_single_inode at ffffffff80031779
 #4 [ffff810013f23eb0] sync_sb_inodes at ffffffff80022109
 #5 [ffff810013f23ef0] sync_inodes_sb at ffffffff800f54ec
 #6 [ffff810013f23f40] __sync_inodes at ffffffff800f555d
 #7 [ffff810013f23f60] do_sync at ffffffff800e44e8
 #8 [ffff810013f23f70] sys_sync at ffffffff800e451a
 #9 [ffff810013f23f80] tracesys at ffffffff800602a6 (via system_call)

...looking around, I noticed that rhn_check seemed to also be hung (and
it had been running since yesterday):

PID: 20997  TASK: ffff81000333ec80  CPU: 0   COMMAND: "rhn_check"
 #0 [ffff810015fcd998] schedule at ffffffff80065f83
 #1 [ffff810015fcda70] io_schedule at ffffffff80066849
 #2 [ffff810015fcda90] sync_page at ffffffff80029f58
 #3 [ffff810015fcdaa0] __wait_on_bit_lock at ffffffff8006698d
 #4 [ffff810015fcdad0] __lock_page at ffffffff80041cfa
 #5 [ffff810015fcdb20] mpage_writepages at ffffffff8001d9f1
 #6 [ffff810015fcdc30] do_writepages at ffffffff8005d5c4
 #7 [ffff810015fcdc40] __writeback_single_inode at ffffffff800317ed
 #8 [ffff810015fcdcd0] sync_sb_inodes at ffffffff80022109
 #9 [ffff810015fcdd10] writeback_inodes at ffffffff800532ef
#10 [ffff810015fcdd30] balance_dirty_pages_ratelimited_nr at
ffffffff800cda2c #11 [ffff810015fcddd0] __handle_mm_fault at
ffffffff80008f1a #12 [ffff810015fcde60] do_page_fault at
ffffffff8006a7db #13 [ffff810015fcdf50] error_exit at ffffffff80060e9d

...it looks like it's hung trying to lock the page here while in the
page fault handler. The page seems to be part of a file on ext3. I'm
not that familiar with this code, but perhaps we're trying to lock a
page that we already have locked?

Comment 70 Peter Staubach 2008-06-12 19:06:59 UTC

Yup, trying to lock to a page that is already locked.  Deadlock.

Comment 71 Peter Staubach 2008-06-12 19:08:21 UTC

Created attachment 309133 [details]
Proposed patch

Comment 72 Peter Staubach 2008-06-12 19:09:45 UTC

Here is a new patch which should address the deadlock described in
Comment #69.  It passes the reproducers and is in use on a test
system.

Comment 73 John Jarvis 2008-06-20 13:50:31 UTC

Stephanie, can you please get the appropriate IBM resources testing this new
patch?

Comment 74 Issue Tracker 2008-06-23 12:55:37 UTC

----- Additional Comments From cguild.com  2008-06-20 12:05 EDT
-------
I'll be more than happy to test the patch:  If you can tell me the kernel
that 
the patch was based off of, I have several images that have been in use
for 
months.  The only one I do not have is RHAS4 U7, and I will see about 
downloading the latest ISOs....   Or, you can point me to the kernel
sources 
which would be better.

Thanks! 


This event sent from IssueTracker by jkachuck 
 issue 178083

Comment 75 Steven Keene 2008-07-14 19:43:42 UTC

If there is a Hotfix for this in RHAS4 U5 can some point us to where we can get
that hotfix or is required to go to update 6 to have this fixed period?

Comment 76 Peter Staubach 2008-07-15 11:18:48 UTC

This is the RHEL-5 bugzilla, by the way.  The RHEL-4 bugzilla is 432974.

But anyway, there isn't either one, that I know of.  The patch for this
has been proposed for RHEL-4 U8.  I believe that there are rpms for
testing which contain the patch, but others would know better which
ones and where they are located.

The changes to address this situation are in very commonly used
areas and are considered to be fairly high risk.  Until we can get
some more experience with them, then backporting them to older RHEL-4
releases is not a good idea.

Comment 77 Don Zickus 2008-07-23 18:54:50 UTC

in kernel-2.6.18-99.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 80 C. Michael Pilato 2008-09-11 18:20:22 UTC

I've seen serious issues with RHEL 5.0 when accessing BerkeleyDB-backed Subversion repositories housed on a NetApp device.  Relatively straightforward operations hang indefinitely with only two concurrent accessing processes in play.  An upgrade to RHEL 5.2 didn't help.  But Don Zickus' kernel at http://people.redhat.com/dzickus/el5/107.el5 (applied as an upgrade to an RHEL 5.0 installation) seems to have remedied the problem!

Can anyone speak about the trajection of this issue, specifically regarding the possibility of the fix showing up in a blessed Red Hat release?

Comment 81 C. Michael Pilato 2008-09-11 18:33:19 UTC

Sorry about previous comment.  It might be helpful to know that I and my co-workers stumbed upon this issue while testing RHEL 5.0, Subversion 1.5, and Berkeley DB 4.7.  We were able to rule out Subversion and Berkeley DB as likely causes, and eventually found that the one key change since our last successful configuration was an upgrade from RHEL-3 to RHEL-5.

There are no shortage of NFS-related bugs in this tracker, but this one caught my eye because it referred to memory mapped files.  In some of my tests I didn't see an application hang but I *did* see results that seemed to indicate ... "wonkiness" (for lack of a better word) in the data returned from Berkeley DB: Subversion transaction records suddenly disappearing, errors about "unpinned pages", etc.  We have many applications reading/writing to the NetApp, but Berkeley DB makes fairly heavy use of file-backed shared memory regions.  *That's* what drew my attention here, and prompted us to try out Don's test kernel.

Comment 82 Peter Staubach 2008-09-11 19:01:10 UTC

Sorry, you are asking when a kernel, containing this patch, might
be released?

This patch is targetted for release in RHEL-5.3.

Comment 84 Chris Ward 2008-10-21 13:01:33 UTC

Attention Partners! 

RHEL 5.3 public Beta will be released soon. This URGENT priority/severity bug should have a fix in place in the recently released Partner Alpha drop, available at ftp://partners.redhat.com. If you haven't had a chance yet to test this bug, please do so at your earliest convenience, to ensure the highest possible quality bits in the upcoming Beta drop.

Thanks, more information about Beta testing to come.

 - Red Hat QE Partner Management

Comment 90 C. Michael Pilato 2008-11-03 20:50:24 UTC

We (not me personally this time, but peers of mine) at CollabNet have tested the RHEL 5.3 alpha drop and found that it did *not* fix our problems.  Strangely, Don Zickus' kernel at http://people.redhat.com/dzickus/el5/107.el5 *did*.  I'm left wondering if perhaps there were additional patches in Don's kernel that are not present in the RHEL 5.3 alpha kernel.  Did the RHEL 5.3 alpha kernel remedy the problems seen by the original reporter of this issue?

Comment 91 Peter Staubach 2008-11-03 21:14:05 UTC

Which version of the RHEL-5 kernel was included in the alpha described
in Comment #90?  The patch was included in 2.6.18-99.el5.

Comment 92 C. Michael Pilato 2008-11-04 02:03:32 UTC

The version is as stated on RH announcement here:
https://www.redhat.com/archives/rhelv5-announce/2008-October/msg00000.html

(And note that I was saying "alpha" when apparently this was actually a "beta"?)

$ uname -a
Linux *****.collab.net 2.6.18-120.el5PAE #1 SMP Fri Oct 17 18:17:11 EDT
2008 i686 i686 i386 GNU/Linux
$

Comment 93 Peter Staubach 2008-11-04 14:56:52 UTC

The patch as attached to this bugzilla does indeed address the problems
as reproduced by the reproducer also attached to this bugzilla.  The
patch was included in RHEL-5 kernels as of 2.6.18-99.

If it does not address some other situation, then I will need some way
to reproduce that other situation so that I can see it and do some
debugging and analysis.

Comment 94 C. Michael Pilato 2008-11-05 16:12:31 UTC

Hrm.  We received the following update from RedHat this morning via other channels:

   ---------------------------------------
   |  Case Information  |
   ---------------------------------------
   Case Title       : LTC41974-Pages of a memory mapped NFS file get corrupted
   Case Number      : 1857478
   Case Open Date   : 11-SEP-2008
   Problem Type     : 
   Last Update Comment as of 05-NOV-2008 09:10:41   :  
   Hello,
   
   I have checked and the current beta kernel which is included in RHEL5.3 beta 
   *does not* include the patch described in the bugzilla. This is because the 
   bugzilla and patch has still to get a qa acknowledgment from the quality 
   management.
   
   Thank you,
   Shijoe George.

I'm trying to reconcile the various bits of information at my disposal, but that's proving difficult.  :-)

Comment 95 Peter Staubach 2008-11-05 16:37:24 UTC

Let me assure you -- the patch associated with this bugzilla is in
every RHEL-5 kernel after 2.6.18-99.  If someone disagrees, please
have them contact me.  Consider this to be authoritative.

Comment 97 Joseph Kachuck 2008-11-07 15:43:36 UTC

Update from IBM:
This has tested well on RHEL 5.3 for IBM and its customer.  I have a request that if there is another problem that a new bug is open on that issue rather then confusing this issue with other problems. I like to not delay getting this fix out to our cusomers.

Thanks

Stephanie LTC Program Manager for SWG

Comment 100 C. Michael Pilato 2008-12-01 16:21:19 UTC

Okay.  We've retested this beta kernel, and things *do* seem to work.  (There arose some question about the validity of our previous round of tests which encouraged us to try again.)  Embarrassing, yes, but in the long run all we'll care about is that stuff works.  :-)

Comment 102 errata-xmlrpc 2009-01-20 19:40:50 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Note You need to log in before you can comment on or make changes to this bug.