Description of problem:
Only observed on a Windows Server with a RHEL NFSv3 client when doing I/O using O_EXCL; NFSv4.1 does not have this issue:
=================================================================
# grep FAIL /var/tmp/beakerlib-6lNWAG9/journal.txt | sort | uniq
:: [ FAIL ] :: RESULT: ClientNFSV3/Clientcombo
:: [ FAIL ] :: RESULT: /kernel/filesystems/nfs/multihost/combo
:: [ FAIL ] :: Running: dt bs=1024 flags=excl oflags=append (Expected 0, got 254)
:: [ FAIL ] :: Running: dt bs=1536 flags=excl oflags=append (Expected 0, got 254)
:: [ FAIL ] :: Running: dt bs=1536 flags=excl oflags=dsync (Expected 0, got 254)
:: [ FAIL ] :: Running: dt bs=1536 flags=excl oflags=trunc (Expected 0, got 254)
:: [ FAIL ] :: Running: dt bs=3072 flags=excl oflags=append (Expected 0, got 254)
:: [ FAIL ] :: Running: dt bs=3072 flags=excl oflags=dsync (Expected 0, got 254)
:: [ FAIL ] :: Running: dt bs=3072 flags=excl oflags=trunc (Expected 0, got 254)
:: [ FAIL ] :: Running: dt bs=3584 flags=excl oflags=append (Expected 0, got 254)
:: [ FAIL ] :: Running: dt bs=3584 flags=excl oflags=dsync (Expected 0, got 254)
:: [ FAIL ] :: Running: dt bs=3584 flags=excl oflags=trunc (Expected 0, got 254)
:: [ FAIL ] :: Running: dt bs=512 flags=excl oflags=append (Expected 0, got 254)

254 is dt's end-of-file exit code. I tested block sizes 512/1024/1536/2048/2560/3072/3584/4096, but only 512/1024/1536/3072/3584 show this issue.

Version-Release number of selected component (if applicable):
2.6.32-428.el6.x86_64
nfs-utils-1.2.3-39.el6.x86_64

How reproducible:
About 30%

Steps to Reproduce:
1. mount -o vers=3 10.66.86.144:/NFS /mnt/testarea/client
2. wget http://porkchop.devel.redhat.com/qa/rhts/lookaside/dt.d.tar.gz
3. tar xzf dt.d.tar.gz
4. cd dt.d
5. make
6. strace -f ./dt bs=3072 limit=64k of=/mnt/testarea/client/test.img flags=excl oflags=append dispose=delete enable=fsync,debug

Actual results:
read() returns 0

Expected results:
No such issue.
Additional info:

Strace output:
=================================================================
lseek(3, 0, SEEK_CUR) = 0
lseek(3, 0, SEEK_CUR) = 0
write(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1536) = 1536
......
write(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1536) = 1536
write(2, "dt: Record #43, Writing a partia"..., 58) = 58
write(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1024) = 1024
write(2, "dt: Flushing data to file '/mnt/"..., 76) = 76
fsync(3) = 0 <=============================================== fsync passes.

Failed strace output:
-----------------------------------------------------------------
lseek(3, 0, SEEK_CUR) = 0
lseek(3, 0, SEEK_CUR) = 0
read(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1536) = 1536
......
read(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1536) = 1536
write(2, "dt: Record #43, Reading a partia"..., 58) = 58
read(3, "", 1024) = 0 <=============================================== get 0 here.
write(2, "dt: End of file detected, count "..., 69) = 69

Pass strace output:
-----------------------------------------------------------------
read(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1536) = 1536
write(2, "dt: Record #43, Reading a partia"..., 58) = 58
read(3, "9\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\30399\234\3039"..., 1024) = 1024 <=============================================== get 1024 here.

Both the pass and fail logs use bs=1536.
(In reply to Jeff Layton from comment #1)
> Do you have any captures from this? Does this test fail vs. other servers or
> just windows?

No captures yet, will do. It only fails on Windows; tests against RHEL/NetApp servers pass. And it's only Windows NFSv3; v4.1 is fine.

Chao
Created attachment 822760 [details] tshark cap
It looks like this is a server bug. There's this WRITE in frames 450 and 453:

450  2.650309000  10.66.86.24   10.66.86.144  NFS  V3 WRITE Call (Reply In 453), FH: 0x6f616b6d Offset: 64512 Len: 1024 FILE_SYNC
453  2.656404000  10.66.86.144  10.66.86.24   NFS  V3 WRITE Reply (Call In 450) Len: 1024 FILE_SYNC

...the wcc data in the reply says that the file increased in size from 64512 to 65536 at this point. However, just after that, in frames 456/457, there is a GETATTR call and reply that report the size of the file to be 64512. All of the later attributes that I can see from the server report the size as 64512.

So either something raced in and truncated that file back to 64512 bytes after the write, or there's a server bug of some sort. I'd assume that you're running this on files that only this client has access to, so my suspicion is a server bug.

Any chance we could make this bug public? The Microsoft folks would probably like to see this...
I directed this to Tom Talpey at MS and he asked: "What's the version of the Windows server in this? I see 4.1 mentioned, so I am guessing Windows 8 or later? You can run "winver.exe" to get the full build string." Chao, can you run that and let us know?
(In reply to Jeff Layton from comment #8)
> I directed this to Tom Talpey at MS and he asked:
>
> "What's the version of the Windows server in this? I see 4.1 mentioned, so I
> am guessing Windows 8 or later? You can run "winver.exe" to get the full
> build string."
>
> Chao, can you run that and let us know?

It's Windows Server 2012 R2.

Chao
We're now in communication with the MS devs who are responsible for this code. Closing this as NOTABUG since it's a server-side bug.