Bug 858978

Summary: NFS version 2 stress test fails on i686: 2.6.18-238.el5 and 2.6.18-308.13.1.el5 fail, but 2.6.18-164.9.1.el5 succeeds.
Product: Red Hat Enterprise Linux 5 Reporter: Mitz Amano <mitz.amano>
Component: kernel Assignee: nfs-maint
Status: CLOSED WONTFIX QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 5.6 CC: eguan, jlayton, mitz.amano, nfs-maint, rwheeler, steved
Target Milestone: rc   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2014-06-02 13:20:28 UTC Type: Bug
Attachments:
Description Flags
The first-level subdirectories are the kernel versions I tested; each subdirectory contains netpan.log and netpan.out
none
Log of the kernel-mode work and the user-mode test (reading and truncating alone can reproduce this issue)
none
Dump log data supporting the analysis of Root cause 1.
none
Dump log data supporting the analysis of Root cause 2.
none

Description Mitz Amano 2012-09-20 09:34:51 UTC
Created attachment 614800 [details]
The first-level subdirectories are the kernel versions I tested; each subdirectory contains netpan.log and netpan.out

Description of problem:

The NFS client and the NFS server are separate machines.


Version-Release number of selected component (if applicable):

Both machines are i686 (4 CPUs, 2 GB RAM, ext3 with 10 GB free disk space), installed from the RHEL 5.6 (Tikanga) ISO.

The kernel version is the same on the client and the server in each test run.

kernel-2.6.18-308.13.1.el5 and kernel-2.6.18-238.el5 both fail.

kernel-2.6.18-164.9.1.el5 succeeds.

The x86_64 version succeeds.

The upstream kernel (kernel-3.6.0-rc5) succeeds.


How reproducible:

I use the kernel test tools (ltp-full-20120903.bz2).

The nfsx-linux test is used to test NFS.

The client mounts the server with '-o proto=udp,vers=2'.

Please see the attachment for more details.

Steps to Reproduce:
1. tar -jxvf ltp-full-20120903.bz2
2. cd ltp-full-20120903
3. ./configure && make && make install
4. Configure rsh: the client must be able to log in to the server by hostname without a password.
     Edit /etc/securetty and append rsh and rlogin.
     ...

5. vi /opt/ltp/testscript/networktests.sh
     Assign the name to HOST=
     Assign the password to PASSWD=
     Add '-o /tmp/netpan.log' to the ltp-pan command (search for it first).

6. vi /opt/ltp/runtest/nfs
     Comment out all lines except the one that contains "nfsx".

7. cd /opt/ltp/testscript
8. ./networktests.sh -N
9. The results are written to /tmp/netpan.out and /tmp/netpan.log


Actual results:

The test does not pass; it reports:
"Size error: expected 0x30f86 stat 0x2e952 seek 0x2e952"

Please see the attachment for more details.
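
For readers unfamiliar with this message, here is a minimal sketch of the kind of check that prints it. It assumes the test compares the size it expects against both fstat() and lseek(SEEK_END). check_file_size() is a hypothetical helper written for illustration; it is not copied from the LTP source.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the LTP source): compare the size the test
 * expects with what fstat() and lseek(SEEK_END) report for the file.
 * Returns 0 when all three agree, -1 on a mismatch or a syscall failure.
 */
static int check_file_size(int fd, off_t expected)
{
    struct stat st;
    off_t by_seek;

    assert(fd >= 0);    /* caller must pass an open descriptor */

    if (fstat(fd, &st) != 0)
        return -1;
    by_seek = lseek(fd, 0, SEEK_END);
    if (by_seek == (off_t)-1)
        return -1;

    if (expected != st.st_size || expected != by_seek) {
        fprintf(stderr,
                "Size error: expected 0x%llx stat 0x%llx seek 0x%llx\n",
                (unsigned long long)expected,
                (unsigned long long)st.st_size,
                (unsigned long long)by_seek);
        return -1;
    }
    return 0;
}
```

On an NFS mount both fstat() and lseek(SEEK_END) are answered from the client's cached inode size, so a wrong cached i_size would make both disagree with the expected value in the same way, as in the message above.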


Expected results:

The test passes (the output file reports success).

Additional info:

I will continue looking for the root cause.

At this point I think it should not be difficult for us.

Comment 1 Mitz Amano 2012-09-27 02:46:46 UTC
Current status:

1) The related patch is Red Hat's patch25348:

   A) Before this patch the test passes; with the patch applied it fails.
      After checking the details, however, I do not think the patch itself is at fault.

   B) There are additional related fix patches that RHEL5 has not picked up.
      These patches fix many bugs (including, I think, our issue),
      but I think none of them was written specifically to fix our issue.

   C) Continuing to find and read the related patches is still valuable for us.
      At the least, they are a good reference for reading the related source code.


2) Read and truncate operations alone reproduce this issue every time

   A) Environment
      Only one client and one server; both are i686.
      I think focusing on the client side is enough.
      In the kernel, the reading task (thread) and the truncating task (thread) are different.
      In user mode there is only one thread.

   B) Running flow
      After the truncate down (making inode->i_size smaller) finishes (task 1),
      nfs_read_done runs (task 2).
      It still holds the original fattr, whose size is larger than inode->i_size.
      nfs_read_done calls nfs_refresh_inode with the original fattr.
      nfs_refresh_inode calls nfs_refresh_inode_locked with the original fattr.
      nfs_refresh_inode_locked calls nfs_inode_attrs_need_update.
      nfs_inode_attrs_need_update calls nfs_size_need_update (patch25348).
      inode->i_size is smaller than fattr->size (original fattr), so it returns 1.
      nfs_update_inode is then called and resets inode->i_size to the old, larger size!

   C) Attachment
      The attachment is logdmp.tar.bz2.
      dmp.log is the printk output from the kernel flow (search for "hit" in its contents).
      netpan.out is the output of the user-mode test tools.
      *.c are my modifications for the analysis.


3) It would be helpful if Red Hat could give suggestions or fill in anything I have missed

   A) I hope the contents above are useful to Red Hat, if they are still working on this.

   B) I will continue the analysis until I find the true root cause and fix it.

   C) Any suggestions or additions are welcome.
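
The running flow in 2) B) can be replayed in a small user-space model. This is only a sketch under assumptions: struct model_inode, struct model_fattr and the model_* functions are hypothetical stand-ins, not the RHEL5 kernel code, and the sizes are arbitrary. The only thing it keeps from the real flow is the size-only test ("the reply is newer if its size is larger than the cached size").

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_inode { uint64_t i_size; };
struct model_fattr { uint64_t size; };

/* Size-only test: treat the reply as newer when its size is larger. */
static int model_size_need_update(const struct model_inode *inode,
                                  const struct model_fattr *fattr)
{
    return fattr->size > inode->i_size;
}

/* Task 1: a local truncate shrinks the cached size. */
static void model_truncate(struct model_inode *inode, uint64_t new_size)
{
    inode->i_size = new_size;
}

/*
 * Task 2: the READ completion applies attributes that were captured
 * before the truncate. Returns 1 if the cached size was overwritten.
 */
static int model_read_done(struct model_inode *inode,
                           const struct model_fattr *fattr)
{
    if (!model_size_need_update(inode, fattr))
        return 0;
    inode->i_size = fattr->size;
    return 1;
}

/* Replay the interleaving and return the final cached size. */
static uint64_t model_replay_race(uint64_t old_size, uint64_t truncated_size)
{
    struct model_inode inode = { .i_size = old_size };
    struct model_fattr stale = { .size = old_size };  /* captured by READ */

    assert(truncated_size < old_size);  /* the flow is about truncate down */
    model_truncate(&inode, truncated_size);
    model_read_done(&inode, &stale);
    return inode.i_size;
}
```

In this model, model_replay_race(8192, 4096) ends with a cached size of 8192: the truncate is undone on the client because the stale reply is larger than the new i_size.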


thanks.

Comment 2 Mitz Amano 2012-09-27 02:49:15 UTC
Created attachment 617861 [details]
Log of the kernel-mode work and the user-mode test (reading and truncating alone can reproduce this issue)

Comment 3 Mitz Amano 2012-10-22 08:30:06 UTC
Root cause 1:

nfs_update_inode in fs/nfs/inode.c does not have enough information to decide whether it should change nfsi->attr_gencount (NFS_INO_INVALID_ATTR should be set to mark the attributes invalid).

So when a truncate occurs, the file size is set outside of nfs_update_inode; when nfs_update_inode is then called, NFS_INO_INVALID_ATTR is not set and nfsi->attr_gencount keeps its old value.

Then the async read_done arrives. Its fattr->gencount is later than nfsi->attr_gencount, so nfs_update_inode is called (which should not happen in this situation).

The upstream kernel does not have this issue (fs/nfs/inode.c there has enough information to decide whether nfsi->attr_gencount should change).
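
The generation comparison described in Root cause 1 can be sketched in user space as follows. Everything here is an assumption made for illustration: the model_* names, the single global counter and the sizes are hypothetical and are not the RHEL5 or upstream code. The sketch only shows why a reply stamped after the inode's last generation is accepted when the truncate does not advance the inode's generation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global attribute generation counter. */
static unsigned long model_generation;

static unsigned long model_next_generation(void)
{
    return ++model_generation;
}

struct model_inode { uint64_t i_size; unsigned long attr_gencount; };
struct model_fattr { uint64_t size; unsigned long gencount; };

/* Stamp the attributes with a generation when the request is issued. */
static void model_fattr_init(struct model_fattr *fattr, uint64_t size)
{
    assert(fattr != NULL);
    fattr->size = size;
    fattr->gencount = model_next_generation();
}

/* A reply counts as newer when its generation is after the inode's. */
static int model_attrs_newer(const struct model_inode *inode,
                             const struct model_fattr *fattr)
{
    return (long)(fattr->gencount - inode->attr_gencount) > 0;
}

/*
 * Replay: the READ is issued, the truncate changes i_size WITHOUT
 * advancing attr_gencount, then the READ reply is processed.
 * Returns the final cached size.
 */
static uint64_t model_replay_stale_generation(void)
{
    struct model_inode inode;
    struct model_fattr reply;

    inode.i_size = 8192;
    inode.attr_gencount = model_next_generation();

    model_fattr_init(&reply, 8192);  /* READ issued before the truncate */
    inode.i_size = 4096;             /* truncate: generation left stale */

    if (model_attrs_newer(&inode, &reply)) {
        inode.i_size = reply.size;   /* stale size is written back */
        inode.attr_gencount = model_next_generation();
    }
    return inode.i_size;
}
```

Under these assumptions model_replay_stale_generation() returns 8192 rather than 4096, because the reply's generation is still later than the one recorded in the inode.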


Jeff Layton, please check this root cause.

The logs are in the related attachment below.

thanks.

Comment 4 Mitz Amano 2012-10-22 08:32:10 UTC
Created attachment 631320 [details]
Dump log data supporting the analysis of Root cause 1.

Please see time_err.tar.bz2, which contains the dump log data for the analysis of Root cause 1.

Comment 5 Mitz Amano 2012-10-22 08:36:16 UTC
Root cause 2:

nfs_size_need_update in fs/nfs/inode.c does not handle an async read_done that follows a truncate to a smaller size.

The upstream kernel also has this issue.

So I will continue to discuss it with the relevant people on the upstream kernel mailing list.

The logs are in the related attachment below.

thanks.

Comment 6 Mitz Amano 2012-10-22 08:37:45 UTC
Created attachment 631324 [details]
Dump log data supporting the analysis of Root cause 2.

The file size_err.tar.bz2 is the supporting log for the analysis of Root cause 2.

Comment 7 Mitz Amano 2012-10-22 08:45:56 UTC
After "fixing" the 2 bugs, the fsx-linux test finally passes.


Fix for root cause 1:

  When the truncate operation calls nfs_refresh_inode (which calls nfs_update_inode), pass additional information (for example an extra parameter) so that nfs_update_inode knows that NFS_INO_INVALID_ATTR is in effect and that nfsi->attr_gencount should be updated.


Fix for root cause 2:

  Do not call nfs_size_need_update. This delays attribute synchronization between client and server, but for our test case (no write operations, and only one client and one server) that is acceptable.
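
As a rough illustration of why both changes were applied together, the sketch below models the two checks with two switches. It is a user-space model under assumptions, not a kernel patch: FIX_BUMP_GENERATION, FIX_SKIP_SIZE_CHECK and the model_* names are hypothetical, and the sizes are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical switches standing in for the two changes described above. */
#define FIX_BUMP_GENERATION 0x1u  /* root cause 1: truncate advances attr_gencount */
#define FIX_SKIP_SIZE_CHECK 0x2u  /* root cause 2: drop the "larger size" test */

struct model_inode { uint64_t i_size; unsigned long attr_gencount; };
struct model_fattr { uint64_t size; unsigned long gencount; };

/* Accept the reply if its generation is later or (unless skipped) its size is larger. */
static int model_need_update(const struct model_inode *inode,
                             const struct model_fattr *fattr,
                             unsigned int fixes)
{
    if ((long)(fattr->gencount - inode->attr_gencount) > 0)
        return 1;
    if (!(fixes & FIX_SKIP_SIZE_CHECK) && fattr->size > inode->i_size)
        return 1;
    return 0;
}

/* Replay truncate-down followed by a stale READ reply; return the final size. */
static uint64_t model_replay(unsigned int fixes)
{
    unsigned long generation = 0;
    struct model_inode inode;
    struct model_fattr reply;

    /* Reject switches this model does not know about. */
    assert((fixes & ~(FIX_BUMP_GENERATION | FIX_SKIP_SIZE_CHECK)) == 0);

    inode.i_size = 8192;
    inode.attr_gencount = ++generation;

    reply.size = 8192;                 /* READ issued before the truncate */
    reply.gencount = ++generation;

    inode.i_size = 4096;               /* task 1: truncate down */
    if (fixes & FIX_BUMP_GENERATION)
        inode.attr_gencount = ++generation;

    if (model_need_update(&inode, &reply, fixes))
        inode.i_size = reply.size;     /* task 2: stale reply applied */
    return inode.i_size;
}
```

In this model the cached size stays at 4096 only when both switches are set; with either one alone the stale reply is still accepted through the other check. Whether the same holds for the real kernel paths depends on details the model leaves out.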


In the end, we have 2 root causes.

Root cause 1 is not complex to fix.

For root cause 2, I am currently discussing it with upstream on the kernel mailing list.

Comment 8 Mitz Amano 2012-10-23 01:12:06 UTC
Please confirm whether what I said above is correct. Thanks.


If you cannot confirm it, I will send the related information to kernel-mgr; I believe kernel-mgr can confirm Red Hat's own issues.


thanks.

Comment 9 RHEL Program Management 2014-03-07 13:45:58 UTC
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.

Comment 10 RHEL Program Management 2014-06-02 13:20:28 UTC
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).