Bug 428751
Summary: | GFS2 is not cluster coherent | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Nate Straz <nstraz>
Component: | kernel | Assignee: | Don Zickus <dzickus>
Status: | CLOSED DUPLICATE | QA Contact: | GFS Bugs <gfs-bugs>
Severity: | medium | Priority: | high
Version: | 5.1 | CC: | adas, bmarzins, lwang, rpeterso, swhiteho
Target Milestone: | rc | Target Release: | ---
Hardware: | All | OS: | Linux
Doc Type: | Bug Fix | Last Closed: | 2008-07-21 15:43:33 UTC
Bug Depends On: | 432057, 437893 | Bug Blocks: | 432826
Description
Nate Straz
2008-01-14 22:34:26 UTC
The latest changes made to this level of code were the "i_alloc" patch and the "permission denied executing command" patch. Thankfully, this level of code does not have my performance changes in it. I wonder: (1) if this problem can be reproduced, (2) if this problem can be reproduced on older versions of gfs2, and (3) if so, what level of code introduced the problem.

I was able to hit another data comparison error; this time I was verifying a read, so I only have the checksum. I modified d_doio so that on future checksum errors I will see what the node read and can compare that to what ends up on disk. This checksum error happened after 5.1 iterations of revolver, after recovery had completed.

*** CHECKSUM ERROR doiofile ***
Expected checksum of 0xb4dcf7b9, got 0x9651f785
----- xior ----
magic: 0xfeed10
type: 4 (read)
path: doiofile
syscall: read
oflags: 0 (O_RDONLY)
offset: 5695194
count: 2507
pattern:
chksum: 0xb4dcf7b9

I was able to hit it again with the new d_doio code.

*** CHECKSUM ERROR doiofile ***
Expected checksum of 0x2f1cec6b, got 0x27c5e8ed
First 32 bytes of bad region: 03:mmwrite*A:4899:tank-9N:3134:ta
----- xior ----
magic: 0xfeed10
type: 4 (read)
path: doiofile
syscall: pread
oflags: 0 (O_RDONLY)
offset: 5535913
count: 754
pattern:
chksum: 0x2f1cec6b

It appears that an mmwrite wasn't on disk when the read which this pread is verifying happened.

Just to be clear what we are looking at here, my understanding is that there are three nodes involved: one has written something via mmap into a file, a second node is then shot by revolver and removed from the cluster, and the third node doesn't see the write from the first node? Does the node that's shot have the file open, and is it doing any I/O to the file when it's being shot? Does the I/O show up correctly on the first node? I'm just trying to get my head around what the fundamental problem is in this case.

(In reply to comment #4)
> Just to be clear what we are looking at here, my understanding is that there are
> three nodes involved here: One has written something via mmap into a file, a
> second node then is shot by revolver and removed from the cluster, and the third
> node doesn't see the write from the first node?

Something like that, yes.

> Does the node that's shot have the file open and is it doing any I/O to the file
> when it's being shot?

Yes, all nodes have the same file open and are doing I/O with the file.

> Does the I/O show up correctly on the first node?

After the failure I can verify that the I/O made it to disk and all nodes can see it.

Ok, so it looks like the second node sees stale data for a short period of time, most likely due to not invalidating something correctly, so that the old data is still visible. If the page gets pushed out, or invalidated later on, then it will be reread from the correct on-disk data. I'll have a look and see if I can spot anything in that code path.

Created attachment 291873 [details]
Steve's possible fix

This is a RHEL port of an upstream patch that may fix the problem. This is designed to be applied after my latest performance patch for bug #253990.

I tried the above patch, which was built into gfs2-kmod-1.68-1.5, and my d_doio processes are now dying with SIGBUS.

I've found a workload which hits this pretty easily.

On the driver node run:
d_iogen -s mmwrite -o -F 1m:doiofile -I 751

On the cluster nodes run:
d_doio -I 751 -P <driver> -w /mnt/gfs2

This will usually hit the data inconsistency before the recovery.
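To make the failing operation pair concrete: the workload above has one process storing a pattern into the file through a shared writable mmap() ("mmwrite") while another process verifies the same byte range with pread(). Below is a minimal, single-machine sketch of that pair. It only illustrates the shape of the test: the file name comes from the workload above, but the offset and pattern are arbitrary, the byte compare stands in for d_doio's checksum logic, the file is assumed to already exist at a sufficient size, and in the real failure the writer and verifier run on different cluster nodes.

```c
/* Illustration only: an mmap() write followed by a pread() verify of the
 * same range, the operation pair the failing test cases exercise. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	const char pattern[] = "A:4899:tank-9:mmwrite*";  /* arbitrary example pattern */
	const off_t offset = 5695194;                     /* arbitrary example offset */
	const size_t count = sizeof(pattern) - 1;
	char buf[sizeof(pattern)];

	int fd = open("doiofile", O_RDWR);
	if (fd < 0)
		return 1;

	/* "mmwrite": store the pattern through a shared writable mapping. */
	long pagesize = sysconf(_SC_PAGESIZE);
	off_t pg = offset & ~((off_t)pagesize - 1);       /* mmap offset must be page aligned */
	size_t maplen = (size_t)(offset - pg) + count;
	char *map = mmap(NULL, maplen, PROT_READ | PROT_WRITE, MAP_SHARED, fd, pg);
	if (map == MAP_FAILED)
		return 1;
	memcpy(map + (offset - pg), pattern, count);
	munmap(map, maplen);

	/* "pread verify": read the same range back through the syscall path
	 * and compare, the way the verify step does on another node. */
	if (pread(fd, buf, count, offset) != (ssize_t)count ||
	    memcmp(buf, pattern, count) != 0)
		fprintf(stderr, "*** DATA COMPARISON ERROR doiofile ***\n");

	close(fd);
	return 0;
}
```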
Created attachment 291980 [details]
A new version of the patch

I'm guessing a bit since comment #8 is a bit short on detail, but I think this was probably down to a one-liner. I also need to check upstream for this same problem, which is complicated by upstream using ->fault() and RHEL using ->nopage(). It would be useful to know whether the SIGBUS was received on read faults, write faults, or both. I suspect write faults only.

From what I could gather in GDB, it looks like write faults. In the debugger the buffer I was going to write gets completely corrupted. This worked fine on abhi's previous build, and the test case in comment #9 works on GFS. Please run the test case on your development box.

Perhaps then I could send you on a mission to discover why four of my five boxes in MN are turned off? They are not on an APC, so there is nothing I can do from here.

Created attachment 292006 [details]
A new version of the patch (RHEL5 over perf. patch)
This is the revised "new version", made to be applied on top of my patch for bug #253990 for RHEL5. I ran Nate's scenario on the roth cluster and verified that it fixes the problem.
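Since much of the discussion turns on whether the failures happen on read or write faults, a rough sketch of the write-fault hook involved may help. This is not GFS2's actual implementation: it only shows the general shape of a ->page_mkwrite() handler in the 2.6.18-era API (a vma/page pair; later kernels switched to a struct vm_fault argument), and example_reserve_blocks() is a made-up placeholder. In this era's fault path, a negative return from page_mkwrite is what the VM turns into the SIGBUS reported above.

```c
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Hypothetical placeholder for the filesystem's own lock/allocation work. */
static int example_reserve_blocks(struct inode *inode, loff_t pos, unsigned len)
{
	return 0;	/* pretend the blocks are already there */
}

/*
 * Sketch of a 2.6.18-era ->page_mkwrite() handler.  The VM calls this when
 * a page mapped shared and writable takes its first write fault, giving the
 * filesystem a chance to allocate backing blocks (and, for a cluster
 * filesystem, take the necessary locks) before the page is made writable.
 * A negative return fails the fault and the faulting process gets SIGBUS.
 */
static int example_page_mkwrite(struct vm_area_struct *vma, struct page *page)
{
	struct inode *inode = vma->vm_file->f_dentry->d_inode;
	loff_t pos = (loff_t)page->index << PAGE_CACHE_SHIFT;

	if (example_reserve_blocks(inode, pos, PAGE_CACHE_SIZE))
		return -ENOMEM;	/* fault fails -> SIGBUS for the writer */

	return 0;
}
```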
Correction: the patch from comment #13 fixes the "bus errors" Nate was reporting with the previous patch. The original data comparison problem apparently still exists, even with the revised patch. So the page_mkwrite fix from upstream does not solve the problem. It's still a valid patch, but perhaps best saved for later. For now we have to go back to the beginning and debug the real problem.

Requesting flags because it sounds like a blocker to me.

Updating the summary since I can hit the inconsistency without recovery.

I thought that this was working ok at the time that the patch for min hold time went in. I've just been back through all the patches for glops.c (i.e. the invalidation/unmap code) and there have been no changes aside from one which is only upstream. That would normally point at the page fault side of things; however, we've already tested that and found it not guilty, and there have been no changes there either. So I'm wondering why we are seeing this now and not before.

I'd say the reason we are hitting it more easily now is that we found a test case which hits it more easily. We changed the workload for recovery testing, which exposed the bug. We refined the test case to make it easier to hit, and now we've found that we don't need recovery to hit it with the new test case. The new test case is now available under dd_io and is called bz428751, although we are hitting it with d_mmap3 also.

FWIW -- I hit this again while running 2.6.18-76.el5 on ia64, using Nate's bz428751 test case in dd_io:

d_doio ior status != expected status
======== msg ========
type: 2 (verify)
status: 0 (nack)
expected status: 1 (ack)
srchost: link-16
srcpid: 10669
desthost: fore
destpid: 0
ior:
----- xior ----
magic: 0xfeed10
type: 4 (read)
path: bz428751
syscall: read
oflags: 2 (O_RDWR)
offset: 343507
count: 1920
pattern: T:9238:link-15:mmwrite*
chksum: 0x5d125fa4
=====================

*** CHECKSUM ERROR bz428751 ***
Expected checksum of 0x5d125fa4, got 0x4a004f3b
*** DATA COMPARISON ERROR bz428751 ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 343507
1st 32 expected bytes: T:9238:link-15:mmwrite*T:9238:li
1st 32 actual bytes: mmwrite*J:10669:link-16:mmwrite*
----- xior ----
magic: 0xfeed10
type: 4 (read)
path: bz428751
syscall: read
oflags: 2 (O_RDWR)
offset: 343507
count: 1920
pattern: T:9238:link-15:mmwrite*
chksum: 0x5d125fa4

I was able to hit this again on x86 w/ data=writeback. Here are the messages relevant to the corruption in a more readable format.

* The pattern found is written and verified on the same node.
7638 I <- tank-02(9293) AA <bz428751> 555738+3103 0x5e9cd4d4 T:9293:tank-02:mmwrite*
7667 V -> tank-02(9296) AR <bz428751> 555738+3103 0x5e9cd4d4 T:9293:tank-02:mmwrite*
7667 V <- tank-02(9296) AA <bz428751> 555738+3103 0x5e9cd4d4 T:9293:tank-02:mmwrite*

* The new pattern is written and verified on two other nodes.
8602 I <- tank-01(11344) AA <bz428751> 555994+544 0xaa8da6e4 O:11344:tank-01:mmwrite*
8914 V -> tank-04(9929) AR <bz428751> 555994+544 0xaa8da6e4 O:11344:tank-01:mmwrite*
8914 V <- tank-04(9929) AN <bz428751> 555994+544 0xaa8da6e4 O:11344:tank-01:mmwrite*

Some ideas to consider...
Firstly, I spotted this upstream patch:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0ed361dec36945f3116ee1338638ada9a8920905

I also spotted an interesting comment in this patch:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=61d5048f149572434daee0cce5e1374a8a7cf3e8

which made me wonder whether we ought to be calling unmap_shared_mapping() twice, once on each side of the truncation.

We've added a new option to d_iogen so we can specify the syscall used when we verify operations. I started running a new set of tests which verifies mmap writes with one of mmap read, read, readv, or pread. The mmap read case seems to work just fine. The other three cases don't work correctly. FWIW, all four cases pass on GFS using the same kernel.

I hadn't appreciated that the verify step wasn't also via mmap (comment #24), but knowing that mmap and read behave differently is a very useful clue in tracking this down.

Created attachment 294369 [details]
script to generate a herd file to test all write/read combinations
I started running all combinations of writes and reads on GFS2 and found that
with enough time, most combinations can fail. Here is a 20 minute per test
case run on GFS2 w/ data=writeback:
Testcase Result
-------- ------
write-read PASS
write-readv FAIL
write-pread PASS
write-mmread PASS
writev-read PASS
writev-readv PASS
writev-pread PASS
writev-mmread PASS
pwrite-read PASS
pwrite-readv FAIL
pwrite-pread FAIL
pwrite-mmread PASS
mmwrite-read PASS
mmwrite-readv PASS
mmwrite-pread FAIL
mmwrite-mmread PASS
=================================================
Total Tests Run: 16
Total PASS: 12
Total FAIL: 4
Attached is the script I used to generate the herd file to run this series of
tests.
To run:
1. Edit the script to include your list of nodes, mount point, etc.
2. sh coherency.sh > coherency.h2
3. add $STS_LROOT/bin to your path
4. collie -f coherency.h2 -i 1 -O . | tee coherency.log
So that seems to indicate that the problem is wider than just mmap, at least. It looks like some pages are not being invalidated correctly.

Another data point: I tried running the tests with larger I/O sizes (10k to 100k) for 20 minutes each and none of the test cases found incoherency. The default I/O sizes are 512 to 4096 bytes.

Since we've hit this with most write/read combinations, removing mmap from the summary.

Another data point: adding truncates into the mix shows a similar issue. A node truncs a file, another writes to extend it, and another attempts to read that new data. The read() returns 0, yet the data is there if I look in the file. I was running the following test case:

# driver node
d_iogen -I 12345 -i 120s -s write,trunc -F 10000b:file1,10000b:file2,10000b:file0
# cluster nodes
d_doio -I 12345 -P <drivernode> -w /mnt/link_ia640

FWIW -- I ran the test case in comment #31 for 10 minutes without issue on a GFS1 fs.

Created attachment 295573 [details]
parse test log showing operations between trunc and failed read
I ran a test case similar to Dean's, and with some more work on our log analyzer I have the series of operations between the trunc and a read that returned 0 bytes when it should have returned 2713 bytes.
The log shows:
1. the file was truncated to 2085249 bytes
2. the file was extended by several writes
3. a read to verify one of those writes fails.
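This trunc/extend/read sequence is also where Steve's earlier question about calling the unmap once on each side of the truncation becomes relevant. For reference, a rough sketch of that "unmap twice" pattern follows. It is only an illustration of the idea, not the GFS2 or RHEL5 truncate path; the function name is invented, and it is written against the generic VM helpers (unmap_mapping_range() and truncate_inode_pages()).

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Illustrative helper (name invented) showing the "unmap twice" pattern:
 * shared mappings of the truncated range are torn down both before and
 * after the page cache is truncated, so a fault that races with the
 * truncation cannot leave a stale page mapped on this node.
 */
static void example_truncate_pagecache(struct inode *inode, loff_t newsize)
{
	struct address_space *mapping = inode->i_mapping;
	/* First page boundary at or above the new size. */
	loff_t holebegin = (newsize + PAGE_CACHE_SIZE - 1) &
			   ~((loff_t)PAGE_CACHE_SIZE - 1);

	unmap_mapping_range(mapping, holebegin, 0, 1);	/* unmap before...      */
	truncate_inode_pages(mapping, newsize);		/* ...drop the pages... */
	unmap_mapping_range(mapping, holebegin, 0, 1);	/* ...and unmap again.  */
}
```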
Another log of the write/read/trunc issue:

2 I -> link-13( 7519) AR <file2> creat 5120000 mode 666
2 I <- link-13( 7519) AA <file2> creat 5120000 mode 666
3 I -> link-16( 8955) AR <file2> write 0+ 562 0x0
4 V -> link-13( 7519) AR <file2> stat size == 5120000
4 V <- link-13( 7519) AA <file2> stat size == 5120000
5 I -> link-13( 7519) AR <file2> trunc 1649100
5 I <- link-13( 7519) AA <file2> trunc 1649100
6 V -> link-16( 8955) AR <file2> stat size == 1649100
3 I <- link-16( 8955) AA <file2> write 0+ 562 0xb76eabbc N:8955:link-16:write*
10 V -> link-14( 7517) AR <file2> read 0+ 562 0xb76eabbc N:8955:link-16:write*
6 V <- link-16( 8955) AA <file2> stat size == 1649100
13 I -> link-16( 8955) AR <file2> trunc 407397
10 V <- link-14( 7517) AA <file2> read 0+ 562 0xb76eabbc N:8955:link-16:write*
13 I <- link-16( 8955) AA <file2> trunc 407397
44 V -> link-16( 8955) AR <file2> stat size == 407397
44 V <- link-16( 8955) AA <file2> stat size == 407397
54 I -> link-13( 7519) AR <file2> trunc 320285
54 I <- link-13( 7519) AA <file2> trunc 320285
79 V -> link-14( 7517) AR <file2> stat size == 320285
79 V <- link-14( 7517) AA <file2> stat size == 320285
155 I -> link-13( 7519) AR <file2> trunc 130474
155 I <- link-13( 7519) AA <file2> trunc 130474
199 V -> link-16( 8955) AR <file2> stat size == 130474
199 V <- link-16( 8955) AA <file2> stat size == 130474
207 I -> link-16( 8955) AR <file2> write 130475+ 1826 0x0
210 I -> link-16( 8955) AR <file2> write 132302+ 2624 0x0
207 I <- link-16( 8955) AA <file2> write 130475+ 1826 0x9e5f2f8f O:8955:link-16:write*
220 I -> link-16( 8955) AR <file2> write 134927+ 987 0x0
210 I <- link-16( 8955) AA <file2> write 132302+ 2624 0xd0d82580 R:8955:link-16:write*
220 I <- link-16( 8955) AA <file2> write 134927+ 987 0x16d42f1d S:8955:link-16:write*
228 I -> link-16( 8955) AR <file2> write 135915+ 1110 0x0
233 V -> link-14( 7517) AR <file2> read 130475+ 1826 0x9e5f2f8f O:8955:link-16:write*
234 I -> link-14( 7517) AR <file2> write 137026+ 3355 0x0
233 V <- link-14( 7517) AA <file2> read 130475+ 1826 0x9e5f2f8f O:8955:link-16:write*
238 I -> link-14( 7517) AR <file2> write 143488+ 2901 0x0
240 I -> link-16( 8955) AR <file2> write 140382+ 3105 0x0
234 I <- link-14( 7517) AA <file2> write 137026+ 3355 0xde58f5c2 B:7517:link-14:write*
248 V -> link-13( 7519) AR <file2> read 132302+ 2624 0xd0d82580 R:8955:link-16:write*
249 V -> link-13( 7519) AR <file2> read 134927+ 987 0x16d42f1d S:8955:link-16:write*
248 V <- link-13( 7519) AN <file2> read 132302+ 2624 0xd0d82580 R:8955:link-16:write*
228 I <- link-16( 8955) AA <file2> write 135915+ 1110 0x3955147 B:8955:link-16:write*
255 C -> link-16( 8955) AR <file2> write 135915+ 1110 0x3955147 B:8955:link-16:write*
240 I <- link-16( 8955) AA <file2> write 140382+ 3105 0x27c5afa5 B:8955:link-16:write*
238 I <- link-14( 7517) AA <file2> write 143488+ 2901 0x834a6e4e E:7517:link-14:write*
------------------------------

The interesting sequence of events:

Message 155: link-13 is requested to trunc to 130474 bytes; responds with an ACK.
Message 199: link-16 is requested to verify that trunc; responds with an ACK.
Message 210: link-16 is requested to write 2624 bytes at offset 132302; responds with an ACK and the updated pattern R:8955:link-16:write*.
Message 248: link-13 is requested to verify the write (from msg 210); responds with a NACK.

(The error output shows that the read() returned 0, so it seems link-13 didn't know the file had been extended in this case, rather than reading stale data as in the other cases in this bz.)

I'm not totally certain, because this just seems far too improbable, but it really looks like the problem here is that we just don't flush the cache at all before we give up glocks. Looking at inode_go_xmote_th and inode_go_drop_th, we call gfs2_pte_inval(), which might mean the mmap'ed reads are safe, but we never clear the page cache. We clear it in the bottom half of these functions, but that isn't called until after we have demoted the lock, and possibly another node has acquired it and written to the file. The easy solution is to flush the cache in the top half of these functions. However, this will mean that where there are a lot of processes on the same node contending for a lock, we will be unnecessarily invalidating the cache a lot. A better solution might be to have a new lock manager callback, so that we can actually invalidate the cache only when we really need to. Like I said, this seems sort of improbable, and this is the first time I've ever dug into gfs2's glock code this deeply. If I'm totally off base, let me know.

I talked to Dave, and there's another alternative to fix this which may not be quite so painful. gfs2 can just not set the flag that allows the dlm to drop our locks on promotions from SHARED mode to EXCLUSIVE mode, which is where this is happening. In that case we would get EDEADLOCK back, at which point we could drop the lock ourselves, flush the page cache, and then reacquire the lock in EXCLUSIVE mode.

Created attachment 298089 [details]
Patch to invalidate the page cache before calling the lock manager
This patch does two things to make sure that we always invalidate the page
cache before dropping a lock.
1. It moves the go_inval call to before the lock manager request when we are
dropping locks (sketched below).
2. It modifies lock_dlm, so that we don't have to use conversion mode deadlock
avoidance. Instead, lock_dlm just returns a failure, and GFS2 can manually
drop the lock and then reacquire it.
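A rough sketch of the ordering change described in point 1 follows. It is an illustration only, not the posted patch: the struct and the example_* helpers are invented stand-ins for the real glock, gfs2_pte_inval() and the lock_dlm request, and only the generic calls (filemap_fdatawrite()/filemap_fdatawait(), truncate_inode_pages()) are real kernel APIs.

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

struct example_glock {			/* stand-in for struct gfs2_glock */
	struct inode *gl_inode;
};

/* Invented placeholders for the PTE invalidation and the DLM request. */
static void example_unmap_ptes(struct inode *inode) { /* ... */ }
static void example_lock_manager_demote(struct example_glock *gl) { /* ... */ }

/*
 * Illustration of the ordering fix: everything cached under the glock
 * (mapped PTEs, dirty data, clean page cache) is flushed and invalidated
 * before the lock manager is asked to demote the lock.  Once the DLM
 * request has been issued, another node may already hold the lock and be
 * writing, so invalidating afterwards (the old "bottom half" behaviour)
 * leaves a window where stale pages can be read.
 */
static void example_demote_inode_glock(struct example_glock *gl)
{
	struct address_space *mapping = gl->gl_inode->i_mapping;

	example_unmap_ptes(gl->gl_inode);	/* like gfs2_pte_inval()  */
	filemap_fdatawrite(mapping);		/* push out dirty pages   */
	filemap_fdatawait(mapping);
	truncate_inode_pages(mapping, 0);	/* drop cached pages      */

	example_lock_manager_demote(gl);	/* only now give up the lock */
}
```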
We've hit an issue during testing. Will resubmit when we've fixed it.

I posted the patch to rhkernel-list for this and bug #432057. Reassigning to Don Z and changing status to POST. Two bzs, one patch.

Closing this as a dup of the other bz.

*** This bug has been marked as a duplicate of 432057 ***