Bug 428751 - GFS2 is not cluster coherent
Status: CLOSED DUPLICATE of bug 432057
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: All Linux
Priority: high, Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Don Zickus
QA Contact: GFS Bugs
Depends On: 432057 437893
Blocks: 432826
Reported: 2008-01-14 17:34 EST by Nate Straz
Modified: 2008-07-21 11:43 EDT (History)

Doc Type: Bug Fix
Last Closed: 2008-07-21 11:43:33 EDT


Attachments
Steve's possible fix (14.38 KB, patch)
2008-01-16 13:22 EST, Robert Peterson
A new version of the patch (14.33 KB, patch)
2008-01-17 06:32 EST, Steve Whitehouse
A new version of the patch (RHEL5 over perf. patch) (34.98 KB, patch)
2008-01-17 10:39 EST, Robert Peterson
script to generate a herd file to test all write/read combinations (1.25 KB, text/plain)
2008-02-08 10:54 EST, Nate Straz
parse test log showing operations between trunc and failed read (6.08 KB, text/plain)
2008-02-21 18:45 EST, Nate Straz
Patch to invalidate the page cache before calling the lock manager (6.62 KB, patch)
2008-03-14 18:08 EDT, Ben Marzinski

Description Nate Straz 2008-01-14 17:34:26 EST
Description of problem:

While running revolver on recent GFS2 kmods I have sometimes hit a data
comparison error (from our test case) on a read during revolver.

Version-Release number of selected component (if applicable):
kmod-gfs2-1.65-1.6
kernel-2.6.18-65.el5

How reproducible:
Unknown, will attempt to reproduce.

Steps to Reproduce:
1. run revolver
  
Actual results:

Senario iteration 2.1 started at Mon Jan 14 15:38:47 CST 2008
Sleeping 5 minute(s) to let the I/O get its lock count up...
Senario: DLM kill one node

Those picked to face the revolver... tank-01 
Feeling lucky tank-01? Well do ya? Go'head make my day...

Verify that tank-01 has been removed from cluster on remaining nodes
Verifying that the dueler(s) are alive
still not all alive, sleeping another 10 seconds
still not all alive, sleeping another 10 seconds
<fail name="iogen_736" pid="15918" time="Mon Jan 14 15:45:17 2008" type="cmd"
duration="3246" ec="1" />
ALL STOP!
<stop name="iogen_612" pid="15916" time="Mon Jan 14 15:45:17 2008" type="cmd"
duration="3246" ec="0" />
<stop name="iogen_889" pid="15920" time="Mon Jan 14 15:45:17 2008" type="cmd"
duration="3246" ec="0" />
<killed name="tank-03_0" pid="17890" time="Mon Jan 14 15:45:17 2008" type="cmd"
duration="409" signal="2" />
<stop name="tank-02_1" pid="17902" time="Mon Jan 14 15:45:17 2008" type="cmd"
duration="405" ec="1" />

tag iogen_736 had output:

No (-R) resource file specified -- d_iogen will allow any client to connect.
d_iogen starting up with the following:
Start Time:                 Mon Jan 14 14:51:11 2008
Session id:                 736
Resource file:              None
Internal Region Lock Type:  clm
Iterations:                 Infinite
Seed:                       15919
Offset-mode:                sequential
Overlap Flag:               off
Mintrans:                   512
Maxtrans:                   4096
Requests:                   read,write
Syscalls:                   read,readv,pread,mmread,write,writev,pwrite,mmwrite
IO type:                    buffered

Test Files:

Path                                                      Size
                                                        (bytes)
---------------------------------------------------------------
doiofile                                        10000000
send(11, 0x8cbc938, 315, 0) returned -1
Failed send in d_iogen
d_doio ior status != expected status

======== msg ========
type: 2 (verify)
status: 0 (nack)
expected status: 1 (ack)
srchost: tank-02
srcpid: 2990
desthost: try
destpid: 0
ior: 
----- xior ----
magic: 0xfeed10
type: 4 (read)
path: doiofile
syscall: read
oflags: 2 (O_RDWR)
offset: 9689023
count: 2548
pattern: L:2973:tank-03:mmwrite*
chksum: 0xf7c42126

=====================

tag tank-02_1 had output:

*** CHECKSUM ERROR doiofile ***
Expected checksum of 0xf7c42126, got 0xcca41171
*** DATA COMPARISON ERROR doiofile ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 9689023
    1st 32 expected bytes:  L:2973:tank-03:mmwrite*L:2973:ta
    1st 32 actual bytes:    e*R:2990:tank-02:pwrite*R:2990:t


----- xior ----
magic: 0xfeed10
type: 4 (read)
path: doiofile
syscall: read
oflags: 2 (O_RDWR)
offset: 9689023
count: 2548
pattern: L:2973:tank-03:mmwrite*
chksum: 0xf7c42126

The expected pattern shows that this read was verifying a mmap write from
tank-03.  The verify was being run on tank-02.
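
The byte dumps above can be decoded mechanically. The following is a minimal sketch, assuming the pattern tags have the layout letter:pid:host:syscall followed by '*' (inferred from the log output, not taken from the d_doio source); the helper name writers_in is hypothetical.

```python
import re

# Assumed tag layout, inferred from the log above: <letter>:<pid>:<host>:<syscall>*
TAG = re.compile(r"([A-Z]):(\d+):([\w-]+):(\w+)\*")


def writers_in(buf):
    """Return (pid, host, syscall) for each complete tag found in buf."""
    return [(int(pid), host, syscall) for _, pid, host, syscall in TAG.findall(buf)]


expected = "L:2973:tank-03:mmwrite*L:2973:ta"
actual = "e*R:2990:tank-02:pwrite*R:2990:t"

print(writers_in(expected))  # the writer the verify step was looking for
print(writers_in(actual))    # the writer whose data was actually read
```

Under that assumption, the bytes actually read carry pid 2990 on tank-02, which matches the srcpid and srchost of the verifying process in the message above. In other words, the verifier appears to have read back data from an earlier pwrite of its own rather than the later mmwrite from tank-03.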

Expected results:


Additional info:

While collecting data for the bug, all nodes were able to see the correct
pattern in the file.
Comment 1 Robert Peterson 2008-01-14 18:07:59 EST
The latest changes made to this level of code were the "i_alloc" patch
and the "permission denied executing command" patch.  Thankfully, this
level of code does not have my performance changes in it.  I wonder:
(1) if this problem can be reproduced
(2) if this problem can be reproduced on older versions of gfs2, and
(3) if so, what level of code introduced the problem.
Comment 2 Nate Straz 2008-01-15 09:17:42 EST
I was able to hit another data comparison error.  This time I was verifying a
read, so I only have the checksum.  I modified d_doio so that on future checksum
errors I should see what the node read, and I can compare that to what ends up
on disk.

This checksum error happened after 5.1 iterations of revolver, after recovery
had completed.

*** CHECKSUM ERROR doiofile ***
Expected checksum of 0xb4dcf7b9, got 0x9651f785

----- xior ----
magic: 0xfeed10
type: 4 (read)
path: doiofile
syscall: read
oflags: 0 (O_RDONLY)
offset: 5695194
count: 2507
pattern: 
chksum: 0xb4dcf7b9
Comment 3 Nate Straz 2008-01-15 10:10:14 EST
I was able to hit it again with the new d_doio code.

*** CHECKSUM ERROR doiofile ***
Expected checksum of 0x2f1cec6b, got 0x27c5e8ed
First 32 bytes of bad region: 03:mmwrite*A:4899:tank-9N:3134:ta

----- xior ----
magic: 0xfeed10
type: 4 (read)
path: doiofile
syscall: pread
oflags: 0 (O_RDONLY)
offset: 5535913
count: 754
pattern: 
chksum: 0x2f1cec6b

It appears that an mmwrite wasn't on disk yet when the read that this pread is
verifying took place.
Comment 4 Steve Whitehouse 2008-01-16 09:49:24 EST
Just to be clear what we are looking at here, my understanding is that there are
three nodes involved here: One has written something via mmap into a file, a
second node then is shot by revolver and removed from the cluster, and the third
node doesn't see the write from the first node?

Does the node that's shot have the file open and is it doing any I/O to the file
when it's being shot? Does the I/O show up correctly on the first node? I'm just
trying to get my head around what the fundamental problem is in this case.
Comment 5 Nate Straz 2008-01-16 10:12:36 EST
(In reply to comment #4)
> Just to be clear what we are looking at here, my understanding is that there are
> three nodes involved here: One has written something via mmap into a file, a
> second node then is shot by revolver and removed from the cluster, and the third
> node doesn't see the write from the first node?

Something like that, yes.

> Does the node that's shot have the file open and is it doing any I/O to the file
> when it's being shot? 

Yes, all nodes have the same file open and are doing I/O with the file.

> Does the I/O show up correctly on the first node?

After the failure I can verify that the I/O made it to disk and all nodes can
see it.
Comment 6 Steve Whitehouse 2008-01-16 10:58:35 EST
Ok, so it looks like the second node sees stale data for a short period of time.
Most likely due to not invalidating something correctly so that the old data is
still visible. If the page gets pushed out, or invalidated later on, then it
will be reread from the correct on-disk data. I'll have a look and see if I can
spot anything in that code path.
Comment 7 Robert Peterson 2008-01-16 13:22:20 EST
Created attachment 291873 [details]
Steve's possible fix

This is a RHEL port of an upstream patch that may fix the problem.
This is designed to be applied after my latest performance patch for
bug #253990.
Comment 8 Nate Straz 2008-01-16 15:04:28 EST
I tried the above patch which was built into gfs2-kmod-1.68-1.5 and my d_doio
processes are now dying with SIGBUS.
Comment 9 Nate Straz 2008-01-16 16:38:58 EST
I've found a workload which hits this pretty easily.

On driver node run: d_iogen -s mmwrite -o -F 1m:doiofile -I 751
On cluster nodes run: d_doio -I 751 -P <driver> -w /mnt/gfs2

This will usually hit the data inconsistency before the recovery.
Comment 10 Steve Whitehouse 2008-01-17 06:32:03 EST
Created attachment 291980 [details]
A new version of the patch

I'm guessing a bit since comment #8 is a bit short on detail, but I think this
was probably down to a one-liner. I also need to check upstream for this same
problem which is complicated by upstream using ->fault() and RHEL using
->nopage().

It would be useful to know whether the SIGBUS was received on read faults or
write faults or both. I suspect write faults only.
Comment 11 Nate Straz 2008-01-17 08:25:11 EST
From what I could gather in GDB, it looks like write faults.  In the debugger
the buffer I was going to write gets completely corrupted.  This worked fine on
abhi's previous build and the test case in comment #9 works on GFS.  Please run
the test case on your development box.
Comment 12 Steve Whitehouse 2008-01-17 08:43:35 EST
Perhaps then I could send you on a mission to discover why four of my five boxes
in MN are turned off? They are not on an APC, so there is nothing I can do from
here.
Comment 13 Robert Peterson 2008-01-17 10:39:41 EST
Created attachment 292006 [details]
A new version of the patch (RHEL5 over perf. patch)

This is the revised "new version" made to be applied over top of
my patch for 253990 for RHEL5.  I ran Nate's scenario on the roth
cluster and verified the problem is fixed by it.
Comment 14 Robert Peterson 2008-01-17 20:26:30 EST
Correction: The patch from comment #13 fixes the "bus errors" Nate was
reporting with the previous patch.  The original data comparison problem
apparently still exists, even with the revised patch.
Comment 15 Robert Peterson 2008-01-21 10:43:51 EST
So the page_mkwrite fix from upstream does not solve the problem.
It's still a valid patch, but perhaps best saved for later.  For now
we have to go back to the beginning and debug the real problem.
Comment 16 Robert Peterson 2008-01-21 11:02:55 EST
Requesting flags because it sounds like a blocker to me.
Comment 18 Nate Straz 2008-01-23 17:18:23 EST
Updating the summary since I can hit the inconsistency without recovery.
Comment 19 Steve Whitehouse 2008-01-24 04:28:25 EST
I thought that this was working ok at the time that the patch for min hold time
went in. I've just been back through all the patches for glops.c (i.e. the
invalidation/unmap code) and there have been no changes aside from one which is
only upstream. That would normally point at the page fault side of things,
however we've already tested that and found it not guilty, and there have been
no changes there either. So I'm wondering why we are seeing this now and not before.
Comment 20 Nate Straz 2008-01-24 09:45:22 EST
I'd say the reason we are hitting it easier now is that we found a test case
which hits it easier.  We changed the workload for recovery testing which
exposed the bug.  We refined the test case to make it easier to hit and now we
found that we don't need recovery to hit it with the new test case.  The new
test case is now available under dd_io and is called bz428751, although we are
hitting it with d_mmap3 also.
Comment 21 Dean Jansa 2008-02-04 11:59:13 EST
FWIW -- I hit this again while running 2.6.18-76.el5 on ia64.

Nate's bz428751 test case in dd_io:
d_doio ior status != expected status

======== msg ========
type: 2 (verify)
status: 0 (nack)
expected status: 1 (ack)
srchost: link-16
srcpid: 10669
desthost: fore
destpid: 0
ior: 
----- xior ----
magic: 0xfeed10
type: 4 (read)
path: bz428751
syscall: read
oflags: 2 (O_RDWR)
offset: 343507
count: 1920
pattern: T:9238:link-15:mmwrite*
chksum: 0x5d125fa4

=====================


*** CHECKSUM ERROR bz428751 ***
Expected checksum of 0x5d125fa4, got 0x4a004f3b
*** DATA COMPARISON ERROR bz428751 ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 343507
    1st 32 expected bytes:  T:9238:link-15:mmwrite*T:9238:li
    1st 32 actual bytes:    mmwrite*J:10669:link-16:mmwrite*


----- xior ----
magic: 0xfeed10
type: 4 (read)
path: bz428751
syscall: read
oflags: 2 (O_RDWR)
offset: 343507
count: 1920
pattern: T:9238:link-15:mmwrite*
chksum: 0x5d125fa4

Comment 22 Nate Straz 2008-02-05 15:26:57 EST
I was able to hit this again on x86 w/ data=writeback.  Here are the messages
relevant to the corruption in a more readable format:

* The pattern found is written and verified on the same node.
7638 I <- tank-02(9293) AA <bz428751>  555738+3103 0x5e9cd4d4 T:9293:tank-02:mmwrite*
7667 V -> tank-02(9296) AR <bz428751>  555738+3103 0x5e9cd4d4 T:9293:tank-02:mmwrite*
7667 V <- tank-02(9296) AA <bz428751>  555738+3103 0x5e9cd4d4 T:9293:tank-02:mmwrite*
* The new pattern is written and verified on two other nodes.
8602 I <- tank-01(11344) AA <bz428751>  555994+544 0xaa8da6e4 O:11344:tank-01:mmwrite*
8914 V -> tank-04(9929) AR <bz428751>  555994+544 0xaa8da6e4 O:11344:tank-01:mmwrite*
8914 V <- tank-04(9929) AN <bz428751>  555994+544 0xaa8da6e4 O:11344:tank-01:mmwrite*
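
The offset+length fields in these log lines can be compared directly. This is a small sketch of that arithmetic, assuming the field is a byte offset followed by a byte count (which is how the xior dumps elsewhere in this bug read); span and contains are hypothetical helper names.

```python
def span(offset, length):
    """Half-open byte range [start, end) covered by one I/O."""
    return (offset, offset + length)


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


old_write = span(555738, 3103)  # message 7638, tank-02, pattern T
new_write = span(555994, 544)   # message 8602, tank-01, pattern O

print(old_write, new_write, contains(old_write, new_write))
```

The later write from tank-01 falls entirely inside the range written earlier by tank-02, so a node that still sees the older contents of that range would find the T pattern where the O pattern is expected. The final line for message 8914 ends in AN, which looks like a NACK going by the ACK/NACK wording used elsewhere in this bug.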
Comment 23 Steve Whitehouse 2008-02-06 06:23:46 EST
Some ideas to consider... firstly I spotted this upstream patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0ed361dec36945f3116ee1338638ada9a8920905

also I spotted an interesting comment in this patch:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=61d5048f149572434daee0cce5e1374a8a7cf3e8

which made me wonder whether we ought to be calling unmap_shared_mapping()
twice, once each side of the truncation.
Comment 24 Nate Straz 2008-02-06 12:01:12 EST
We've added a new option to d_iogen so we can specify the syscall used when we
verify operations.  I started running a new set of tests which verifies mmap
writes with one of mmap read, read, readv, or pread.  The mmap read case seems
to work just fine.  The other three cases don't work correctly.
Comment 25 Nate Straz 2008-02-06 17:38:14 EST
FWIW, all four cases pass on GFS using the same kernel.
Comment 26 Steve Whitehouse 2008-02-06 17:49:41 EST
I hadn't appreciated that the verify step wasn't also via mmap (comment #24) but
knowing that mmap and read behave differently is a very useful clue in tracking
this down.
Comment 27 Nate Straz 2008-02-08 10:54:49 EST
Created attachment 294369 [details]
script to generate a herd file to test all write/read combinations

I started running all combinations of writes and reads on GFS2 and found that
with enough time, most combinations can fail.  Here is a 20 minute per test
case run on GFS2 w/ data=writeback:

Testcase          Result
--------          ------
write-read        PASS
write-readv       FAIL
write-pread       PASS
write-mmread      PASS
writev-read       PASS
writev-readv      PASS
writev-pread      PASS
writev-mmread     PASS
pwrite-read       PASS
pwrite-readv      FAIL
pwrite-pread      FAIL
pwrite-mmread     PASS
mmwrite-read      PASS
mmwrite-readv     PASS
mmwrite-pread     FAIL
mmwrite-mmread    PASS
=================================================
Total Tests Run: 16
Total PASS:      12
Total FAIL:      4

Attached is the script I used to generate the herd file to run this series of
tests.	

To run:
1. Edit the script to include your list of nodes, mount point, etc.
2. sh coherency.sh > coherency.h2
3. add $STS_LROOT/bin to your path
4. collie -f coherency.h2 -i 1 -O . | tee coherency.log
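
The attached script is not reproduced in this bug. As a rough illustration only, the 16 test case names in the table can be enumerated as the cross product of the four write syscalls and the four read syscalls. This sketch does not generate a herd file and does not use any d_iogen options; it only shows how the matrix is formed.

```python
from itertools import product

# Syscall names as they appear in the test case names in the table above.
WRITES = ["write", "writev", "pwrite", "mmwrite"]
READS = ["read", "readv", "pread", "mmread"]

# One test case per (write syscall, verify syscall) pair.
testcases = [f"{w}-{r}" for w, r in product(WRITES, READS)]

for name in testcases:
    print(name)
```

The order produced here matches the order of the rows in the results table.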
Comment 28 Steve Whitehouse 2008-02-08 11:11:18 EST
So that seems to indicate that the problem is wider than just mmap at least. It
looks like some pages are not being invalidated correctly.
Comment 29 Nate Straz 2008-02-08 16:40:28 EST
Another data point:

I tried running the tests with larger I/O sizes (10k to 100k) for 20 minutes
each and none of the test cases found incoherency.  The default I/O sizes are
512 to 4096 bytes.  
Comment 30 Nate Straz 2008-02-18 11:09:07 EST
Since we've hit this with most write/read combinations, removing mmap from summary.
Comment 31 Dean Jansa 2008-02-21 16:45:41 EST
Another data point:

Adding truncates into the mix shows a similar issue.  A node truncs a file,
another writes to extend and another attempts to read that new data.  The read()
returns 0, yet the data is there if I look in the file.

I was running the following test case:

#driver node
d_iogen -I 12345 -i 120s -s write,trunc -F 10000b:file1,10000b:file2,10000b:file0

#cluster nodes
d_doio -I 12345 -P <drivernode> -w /mnt/link_ia640
Comment 32 Dean Jansa 2008-02-21 17:08:48 EST
FWIW -- I ran the test case in comment #31 for 10 minutes without issue on a
GFS1 fs.
Comment 33 Nate Straz 2008-02-21 18:45:47 EST
Created attachment 295573 [details]
parse test log showing operations between trunc and failed read

I ran a similar test case to Dean and with some more work on our log analyzer,
I have the series of operations between the trunc and a read that returned 0
bytes when it should have returned 2713 bytes.

The log shows:
 1. the file was truncated to 2085249 bytes
 2. the file was extended by several writes
 3. a read to verify one of those writes fails.
Comment 34 Dean Jansa 2008-02-22 15:32:56 EST
Another log of the write/read/trunc issue:

 
    2 I -> link-13( 7519) AR <file2>    creat  5120000 mode 666
    2 I <- link-13( 7519) AA <file2>    creat  5120000 mode 666
    3 I -> link-16( 8955) AR <file2>    write        0+  562        0x0 
    4 V -> link-13( 7519) AR <file2>     stat size == 5120000 
    4 V <- link-13( 7519) AA <file2>     stat size == 5120000 
    5 I -> link-13( 7519) AR <file2>    trunc  1649100
    5 I <- link-13( 7519) AA <file2>    trunc  1649100
    6 V -> link-16( 8955) AR <file2>     stat size == 1649100 
    3 I <- link-16( 8955) AA <file2>    write        0+  562 0xb76eabbc
N:8955:link-16:write*
   10 V -> link-14( 7517) AR <file2>     read        0+  562 0xb76eabbc
N:8955:link-16:write*
    6 V <- link-16( 8955) AA <file2>     stat size == 1649100 
   13 I -> link-16( 8955) AR <file2>    trunc   407397
   10 V <- link-14( 7517) AA <file2>     read        0+  562 0xb76eabbc
N:8955:link-16:write*
   13 I <- link-16( 8955) AA <file2>    trunc   407397
   44 V -> link-16( 8955) AR <file2>     stat size == 407397 
   44 V <- link-16( 8955) AA <file2>     stat size == 407397 
   54 I -> link-13( 7519) AR <file2>    trunc   320285
   54 I <- link-13( 7519) AA <file2>    trunc   320285
   79 V -> link-14( 7517) AR <file2>     stat size == 320285 
   79 V <- link-14( 7517) AA <file2>     stat size == 320285 
  155 I -> link-13( 7519) AR <file2>    trunc   130474
  155 I <- link-13( 7519) AA <file2>    trunc   130474
  199 V -> link-16( 8955) AR <file2>     stat size == 130474 
  199 V <- link-16( 8955) AA <file2>     stat size == 130474 
  207 I -> link-16( 8955) AR <file2>    write   130475+ 1826        0x0 
  210 I -> link-16( 8955) AR <file2>    write   132302+ 2624        0x0 
  207 I <- link-16( 8955) AA <file2>    write   130475+ 1826 0x9e5f2f8f
O:8955:link-16:write*
  220 I -> link-16( 8955) AR <file2>    write   134927+  987        0x0 
  210 I <- link-16( 8955) AA <file2>    write   132302+ 2624 0xd0d82580
R:8955:link-16:write*
  220 I <- link-16( 8955) AA <file2>    write   134927+  987 0x16d42f1d
S:8955:link-16:write*
  228 I -> link-16( 8955) AR <file2>    write   135915+ 1110        0x0 
  233 V -> link-14( 7517) AR <file2>     read   130475+ 1826 0x9e5f2f8f
O:8955:link-16:write*
  234 I -> link-14( 7517) AR <file2>    write   137026+ 3355        0x0 
  233 V <- link-14( 7517) AA <file2>     read   130475+ 1826 0x9e5f2f8f
O:8955:link-16:write*
  238 I -> link-14( 7517) AR <file2>    write   143488+ 2901        0x0 
  240 I -> link-16( 8955) AR <file2>    write   140382+ 3105        0x0 
  234 I <- link-14( 7517) AA <file2>    write   137026+ 3355 0xde58f5c2
B:7517:link-14:write*
  248 V -> link-13( 7519) AR <file2>     read   132302+ 2624 0xd0d82580
R:8955:link-16:write*
  249 V -> link-13( 7519) AR <file2>     read   134927+  987 0x16d42f1d
S:8955:link-16:write*
  248 V <- link-13( 7519) AN <file2>     read   132302+ 2624 0xd0d82580
R:8955:link-16:write*
  228 I <- link-16( 8955) AA <file2>    write   135915+ 1110  0x3955147
B:8955:link-16:write*
  255 C -> link-16( 8955) AR <file2>    write   135915+ 1110  0x3955147
B:8955:link-16:write*
  240 I <- link-16( 8955) AA <file2>    write   140382+ 3105 0x27c5afa5
B:8955:link-16:write*
  238 I <- link-14( 7517) AA <file2>    write   143488+ 2901 0x834a6e4e
E:7517:link-14:write*



------------------------------


The interesting sequence of events:


Message 155, link-13 is requested to trunc to 130474 bytes
             Responds with an ACK

Message 199, link-16 is requested to verify that trunc
             Responds with an ACK

Message 210, link-16 is requested to write 2624 bytes at offset 132302.
             Responds with an ACK and updated pattern  R:8955:link-16:write*

Message 248, link-13 is requested to verify the write (from msg 210),
             Responds with a NACK (The error output shows that the
             read() returned 0, so it seems link-13 didn't know the
             file had been extended in this case, rather than reading
             stale data as in the other cases in this bz)
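
The zero-byte read is what one would expect if link-13 were still using the file size from its own truncate. A minimal sketch of that arithmetic, assuming ordinary POSIX read semantics (a read starting at or beyond end of file returns 0); read_len is a hypothetical helper, not part of the test suite.

```python
def read_len(size_seen, offset, count):
    """Bytes a read returns, given the file size the reading node believes in."""
    return max(0, min(count, size_seen - offset))


offset, count = 132302, 2624          # the write from message 210

stale_size = 130474                   # size after the trunc in message 155
current_size = offset + count         # the file is at least this long after msg 210

print(read_len(stale_size, offset, count))
print(read_len(current_size, offset, count))
```

With the stale size the read starts past end of file and returns nothing; with any size that includes the write from message 210 it returns the full 2624 bytes.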

Comment 35 Ben Marzinski 2008-02-25 11:23:27 EST
I'm not totally certain, because this just seems far too improbable, but it
really looks like the problem here is that we just don't flush the cache at all
before we give up glocks. Looking at inode_go_xmote_th and inode_go_drop_th, we
call gfs2_pte_inval() which might mean the mmap'ed reads are safe, but we never
clear the page cache.  We clear it in the bottom half of these functions, but
that isn't called until after we have demoted the lock, and possibly another
node has acquired it and written to the file.
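
To make the ordering concern concrete, here is a deliberately simplified toy model. It is not GFS2 code: it has one shared "disk" block, a per-node cache, and no real locking, and every name in it is hypothetical. It only contrasts invalidating the cache before the lock is given up with invalidating it afterwards.

```python
class Node:
    """Toy node with a one-block page cache in front of a shared disk."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = None  # cached copy of the block, or None if not cached

    def read(self):
        if self.cache is None:
            self.cache = self.disk["block"]  # cache miss: go to disk
        return self.cache

    def write(self, data):
        self.disk["block"] = data
        self.cache = data

    def invalidate(self):
        self.cache = None


def demote_then_remote_write(invalidate_before_release):
    disk = {"block": "old"}
    a, b = Node(disk), Node(disk)
    a.read()                      # node A caches "old" while it holds the lock
    if invalidate_before_release:
        a.invalidate()            # cache dropped before the lock is given up
    # The lock changes hands here: node B acquires it and writes.
    b.write("new")
    seen = a.read()               # what node A sees in the window after release
    if not invalidate_before_release:
        a.invalidate()            # cleanup arrives too late for the read above
    return seen


print(demote_then_remote_write(invalidate_before_release=False))
print(demote_then_remote_write(invalidate_before_release=True))
```

In the model, the late invalidation leaves a window in which node A still serves the old block from its cache, while the early invalidation forces the next read to go to disk. Real readers take a glock before reading, which the model ignores, so this is an illustration of the ordering only.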

The easy solution is to flush the cache in the top half of these functions.
However this will mean that where there are a lot of processes on the same node
contending for a lock, we will be unnecessarily invalidating the cache a lot.  A
better solution might be to have a new lock manager callback, so that we can
actually invalidate the cache only when we really need to.

Like I said, this seems sort of improbable, and this is the first time I've ever
dug into gfs2's glock code this deeply.  If I'm totally off base, let me know.
Comment 36 Ben Marzinski 2008-02-25 11:36:42 EST
I talked to Dave, and there's another alternative to fix this, which may not be
quite so painful.  gfs2 can just not set the flag that allows the dlm to drop
our locks on promotions from SHARED mode to EXCLUSIVE mode, which is where this
is happening. In this case, we would get EDEADLOCK back, at which point we could
drop the lock ourselves and flush the page cache, and then reacquire the lock in
EXCLUSIVE mode.
Comment 37 Ben Marzinski 2008-03-14 18:08:00 EDT
Created attachment 298089 [details]
Patch to invalidate the page cache before calling the lock manager

This patch does two things to make sure that we always invalidate the page
cache before dropping a lock. 

1. It moves the go_inval to before the lock manager request when we are
dropping locks.

2. It modifies lock_dlm, so that we don't have to use conversion mode deadlock
avoidance.  Instead, lock_dlm just returns a failure, and GFS2 can manually
drop the lock and then reacquire it.
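
The control flow described in the two points above might look roughly like the sketch below. This is a toy under stated assumptions, not the patch: ToyLockManager and acquire_exclusive are hypothetical names, the class is not the DLM or lock_dlm API, and a real lock request would block until the lock can be granted.

```python
class ToyLockManager:
    """Hypothetical stand-in for a lock manager. This is not the DLM API."""

    def __init__(self, other_shared_holders):
        self.other_shared_holders = other_shared_holders
        self.mode = "SHARED"

    def convert(self, new_mode):
        # Refuse an in-place SHARED -> EXCLUSIVE conversion while other
        # nodes still hold the lock, rather than dropping it behind our back.
        if new_mode == "EXCLUSIVE" and self.other_shared_holders:
            return False
        self.mode = new_mode
        return True

    def unlock(self):
        self.mode = None

    def lock(self, mode):
        # Toy behaviour: pretend the other holders have gone away by now.
        self.other_shared_holders = 0
        self.mode = mode


def acquire_exclusive(lm, page_cache, steps):
    if lm.convert("EXCLUSIVE"):
        steps.append("converted")
        return
    steps.append("convert failed")
    page_cache.clear()            # invalidate cached pages first
    steps.append("invalidate")
    lm.unlock()                   # only then give up the lock
    steps.append("unlock")
    lm.lock("EXCLUSIVE")          # and request it again from scratch
    steps.append("lock EXCLUSIVE")


steps = []
cache = {0: "old page"}
lm = ToyLockManager(other_shared_holders=1)
acquire_exclusive(lm, cache, steps)
print(steps)
```

The point of the ordering is that the cache is already empty by the time any other node could be granted the lock, so nothing cached under the old lock can be read afterwards.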
Comment 38 Steve Whitehouse 2008-03-18 11:32:24 EDT
We've hit an issue during testing. Will resubmit when we've fixed it.
Comment 43 Robert Peterson 2008-05-21 18:43:28 EDT
I posted the patch to rhkernel-list for this and bug #432057.
Reassigning to Don Z and changing status to POST.
Comment 44 Steve Whitehouse 2008-07-21 11:43:33 EDT
Two bzs, one patch. Closing this as a dup of the other bz.

*** This bug has been marked as a duplicate of 432057 ***
