Bug 1028899 - [NFSV3] read randomly return EOF when using O_EXCL against Windows NFS Server
[NFSV3] read randomly return EOF when using O_EXCL against Windows NFS Server
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel (Show other bugs)
6.5
All Linux
unspecified Severity medium
: rc
: ---
Assigned To: Jeff Layton
Filesystem QE
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2013-11-11 02:33 EST by Chao Ye
Modified: 2013-11-20 10:26 EST (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-20 10:26:26 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
tshark cap (260.00 KB, application/vnd.tcpdump.pcap)
2013-11-12 00:14 EST, Chao Ye
no flags Details

  None (edit)
Description Chao Ye 2013-11-11 02:33:54 EST
Description of problem:
Only observe this on Windows Server with RHEL NFSV3 client when doing IO using O_EXCL, NFS4.1 don't have such issue:
=================================================================
# grep FAIL /var/tmp/beakerlib-6lNWAG9/journal.txt | sort | uniq 
:: [   FAIL   ] :: RESULT: ClientNFSV3/Clientcombo
:: [   FAIL   ] :: RESULT: /kernel/filesystems/nfs/multihost/combo
:: [   FAIL   ] :: Running: dt bs=1024 flags=excl oflags=append (Expected 0, got 254)
:: [   FAIL   ] :: Running: dt bs=1536 flags=excl oflags=append (Expected 0, got 254)
:: [   FAIL   ] :: Running: dt bs=1536 flags=excl oflags=dsync (Expected 0, got 254)
:: [   FAIL   ] :: Running: dt bs=1536 flags=excl oflags=trunc (Expected 0, got 254)
:: [   FAIL   ] :: Running: dt bs=3072 flags=excl oflags=append (Expected 0, got 254)
:: [   FAIL   ] :: Running: dt bs=3072 flags=excl oflags=dsync (Expected 0, got 254)
:: [   FAIL   ] :: Running: dt bs=3072 flags=excl oflags=trunc (Expected 0, got 254)
:: [   FAIL   ] :: Running: dt bs=3584 flags=excl oflags=append (Expected 0, got 254)
:: [   FAIL   ] :: Running: dt bs=3584 flags=excl oflags=dsync (Expected 0, got 254)
:: [   FAIL   ] :: Running: dt bs=3584 flags=excl oflags=trunc (Expected 0, got 254)
:: [   FAIL   ] :: Running: dt bs=512 flags=excl oflags=append (Expected 0, got 254)

254 is End of file code.

I tested different bs, 512/1024/1536/2048/2560/3072/3584/4096, but only 512/1024/1536/3072/3584 have such issue.

Version-Release number of selected component (if applicable):
2.6.32-428.el6.x86_64
nfs-utils-1.2.3-39.el6.x86_64

How reproducible:
About 30%

Steps to Reproduce:
1. mount -o vers=3 10.66.86.144:/NFS /mnt/testarea/client
2. wget http://porkchop.devel.redhat.com/qa/rhts/lookaside/dt.d.tar.gz
3. tar xxf dt.d.tar.gz
4. cd dt.d
5. make 
6. strace -f ./dt bs=3072 limit=64k of=/mnt/testarea/client/test.img flags=excl oflags=append dispose=delete enable=fsync,debug

Actual results:
read() return 0

Expected results:
No such issue.

Additional info:
Strace output:
=================================================================
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
write(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1536) = 1536
......
write(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1536) = 1536
write(2, "dt: Record #43, Writing a partia"..., 58) = 58
write(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1024) = 1024
write(2, "dt: Flushing data to file '/mnt/"..., 76) = 76
fsync(3)                                = 0
<===============================================fsync pass.

Failed strace output:
-----------------------------------------------------------------------------------------------
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1536) = 1536
......
read(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1536) = 1536
write(2, "dt: Record #43, Reading a partia"..., 58) = 58
read(3, "", 1024)                       = 0
<===============================================get 0 here.
write(2, "dt: End of file detected, count "..., 69) = 69

Pass strace output:
-----------------------------------------------------------------------------------------------
read(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1536) = 1536
write(2, "dt: Record #43, Reading a partia"..., 58) = 58
read(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1024) = 1024
<===============================================get 1024 here.


Both pass/fail log are using bs=1536.
Comment 2 Chao Ye 2013-11-11 21:37:06 EST
(In reply to Jeff Layton from comment #1)
> Do you have any captures from this? Does this test fail vs. other servers or
> just windows?

No captures yet, will do. Only fail on windows, test on RHEL/NetApp server pass, and also only windows nfsv3, v4.1 is cool.

Chao
Comment 3 Chao Ye 2013-11-12 00:14:28 EST
Created attachment 822760 [details]
tshark cap
Comment 4 Jeff Layton 2013-11-12 09:22:18 EST
It looks like this is a server bug.

There's this WRITE in frames 450 and 453:

450	2.650309000	10.66.86.24	10.66.86.144	NFS	V3 WRITE Call (Reply In 453), FH: 0x6f616b6d Offset: 64512 Len: 1024 FILE_SYNC

453	2.656404000	10.66.86.144	10.66.86.24	NFS	V3 WRITE Reply (Call In 450) Len: 1024 FILE_SYNC

...the wcc data in the reply says that the file increased in size from 64512 to 65536 at this point.

However just after that, in frames 456/457 there is a GETATTR call and reply, that report the size of the file to be 64512. All of the later attributes that I can see from the server report the size as 64512.

So either something raced in and truncated that file back to 64512 bytes after the write, or there's a server bug of some sort. I'd assume that you're running this on files that only this client would have access to, so my suspicion would be a server bug.

Any chance we could make this bug public? The Microsoft folks would probably like to see this...
Comment 8 Jeff Layton 2013-11-15 08:32:52 EST
I directed this to Tom Talpey at MS and he asked:

"What's the version of the Windows server in this? I see 4.1 mentioned, so I am guessing Windows 8 or later? You can run "winver.exe" to get the full build string."

Chao, can you run that and let us know?
Comment 9 Chao Ye 2013-11-17 21:21:23 EST
(In reply to Jeff Layton from comment #8)
> I directed this to Tom Talpey at MS and he asked:
> 
> "What's the version of the Windows server in this? I see 4.1 mentioned, so I
> am guessing Windows 8 or later? You can run "winver.exe" to get the full
> build string."
> 
> Chao, can you run that and let us know?

It's WindowsServer 2012 R2.

Chao
Comment 10 Jeff Layton 2013-11-20 10:26:26 EST
We're now in communication with the MS devs who are responsible for this code. Closing this as NOTABUG since it's a server side bug.

Note You need to log in before you can comment on or make changes to this bug.