Red Hat Bugzilla – Bug 858978
NFS version 2 stress test fails on i686: 2.6.18-238.el5 and 2.6.18-308.13.1.el5 fail, but 2.6.18-164.9.1.el5 succeeds.
Last modified: 2014-09-15 21:50:32 EDT
Created attachment 614800 [details]
Each first-level subdirectory is named for a kernel version I tested; each one contains netpan.log and netpan.out.
Description of problem:
The NFS client and the NFS server are separate machines.
Version-Release number of selected component (if applicable):
Both machines are i686 (4 CPUs, 2 GB RAM, ext3, 10 GB free disk space), installed from the RHEL 5.6 (Tikanga) ISO.
The client and the server run the same kernel version in any given test run.
kernel-2.6.18-308.13.1.el5 and kernel-2.6.18-238.el5: both fail.
kernel-2.6.18-164.9.1.el5: succeeds.
x86_64 version: succeeds.
upstream kernel (kernel-3.6.0-rc5): succeeds.
I use the LTP kernel test tools (ltp-full-20120903.bz2),
with nfsx-linux as the NFS test.
The client mounts the server with '-o proto=udp,vers=2'.
Please see the attachment for more details.
Steps to Reproduce:
1. tar -jxvf ltp-full-20120903.bz2
2. cd ltp-full-20120903
3. ./configure && make && make install
4. configure rsh so that the client can log in to the server by hostname without a password
in /etc/securetty, append rsh and rlogin
5. vi /opt/ltp/testscript/networktests.sh
assign the host name to HOST=
assign the password to PASSWD=
add '-o /tmp/netpan.log' to the ltp-pan command (search for it first).
6. vi /opt/ltp/runtest/nfs
comment out every line except the one that contains "nfsx".
7. cd /opt/ltp/testscript
8. ./networktests.sh -N
9. the results are written to /tmp/netpan.out and /tmp/netpan.log
Actual results:
The test fails and reports:
"Size error: expected 0x30f86 stat 0x2e952 seek 0x2e952"
Please see the attachment for more details.
Expected results:
The test passes (the output file reports success).
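For readers unfamiliar with the message, the "Size error" line has the shape of a consistency check that compares the size the test expects with what fstat() and lseek(SEEK_END) report. The following is only a minimal userspace sketch of such a check, not the LTP source; the name model_check_size is hypothetical.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from LTP): compare the size the test
 * expects with the size reported by fstat() and by lseek(SEEK_END).
 * Returns 0 if all three agree, 1 on a mismatch, -1 on a system call error.
 * Note that lseek() moves the file offset to the end of the file.
 */
static int model_check_size(int fd, off_t expected)
{
    struct stat st;
    off_t end;

    if (fstat(fd, &st) != 0)
        return -1;
    end = lseek(fd, 0, SEEK_END);
    if (end == (off_t)-1)
        return -1;

    if (st.st_size != expected || end != expected) {
        fprintf(stderr, "Size error: expected 0x%llx stat 0x%llx seek 0x%llx\n",
                (unsigned long long)expected,
                (unsigned long long)st.st_size,
                (unsigned long long)end);
        return 1;
    }
    return 0;
}
```

In the failing run quoted above, stat and seek agree with each other (0x2e952) but not with the expected size (0x30f86), so the client's view of the file size is what differs from the test's bookkeeping.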
I will continue looking for the root cause.
At this point I do not think it will be difficult to find.
1) The related patch is Red Hat's patch25348:
A) Without this patch the test passes; with it applied the test fails.
After checking the details, though, I do not think the patch itself is at fault.
B) There are additional related fix patches that RHEL5 has not picked up.
They fix many bugs (I think including our issue),
but I do not think any of them was written specifically to fix our issue.
C) It is still worthwhile to keep finding and reading the related patches;
at the least, they are a good reference when reading the related source code.
2) Read and truncate operations alone reproduce this issue every time.
A) Only one client and one server; both are i686.
I think it is enough to focus on the client side.
In the kernel, the reading task (thread) and the truncating task (thread) are different;
in user mode there is only one thread.
B) Running flow:
The truncate down (which makes inode->i_size smaller) finishes. (task1)
Then nfs_read_done runs. (task2)
It still carries the original fattr, whose size is larger than inode->i_size.
nfs_read_done calls nfs_refresh_inode with the original fattr.
nfs_refresh_inode calls nfs_refresh_inode_locked with the original fattr.
nfs_refresh_inode_locked calls nfs_inode_attrs_need_update.
nfs_inode_attrs_need_update calls nfs_size_need_update (patch25348).
inode->i_size is smaller than fattr->size (the original fattr), so it returns 1.
nfs_update_inode is then called and sets inode->i_size back to the old, larger value!
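The flow above can be replayed in a few lines of userspace C. This is a simplified model under stated assumptions, not kernel code: the model_* structs and functions are hypothetical stand-ins for the inode, the fattr and the size check as described in this comment, and the sizes used with it are arbitrary.

```c
#include <stdint.h>

/* Simplified stand-ins; NOT the kernel's struct inode / struct nfs_fattr. */
struct model_inode { uint64_t i_size; };
struct model_fattr { uint64_t size; };

/* Models the size check as described above: a larger size in the reply wins. */
static int model_size_need_update(const struct model_inode *inode,
                                  const struct model_fattr *fattr)
{
    return fattr->size > inode->i_size;
}

/* Models the refresh path: apply the reply's size if the check asks for it.
 * Returns 1 if i_size was overwritten, 0 otherwise. */
static int model_refresh_inode(struct model_inode *inode,
                               const struct model_fattr *fattr)
{
    if (!model_size_need_update(inode, fattr))
        return 0;
    inode->i_size = fattr->size;
    return 1;
}

/* Replays the sequence: the read reply's attributes are captured at the old
 * size, the truncate finishes (task1), then the read reply is processed
 * (task2). Returns the i_size the client ends up with. */
static uint64_t model_replay_race(uint64_t old_size, uint64_t truncated_size)
{
    struct model_inode inode = { old_size };
    struct model_fattr stale = { old_size };  /* captured before the truncate */

    inode.i_size = truncated_size;            /* task1: truncate completes */
    model_refresh_inode(&inode, &stale);      /* task2: stale reply arrives */
    return inode.i_size;
}
```

With this model, model_replay_race(8192, 4096) returns 8192: the truncate down is undone by the stale reply. A truncate up is not affected, because the stale size is then smaller than i_size.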
The attachment is logdmp.tar.bz2:
dmp.log is the printk output from the kernel flow (search for "hit" in its contents)
netpan.out is the output of the user-mode test tools
*.c are my modifications made for the analysis
3) It would be better if Red Hat could give some suggestions or additions.
A) I hope the contents above are useful to Red Hat, if they are still working on this.
B) I will continue the analysis until I find the true root cause and fix it.
C) Any suggestions or additions are welcome.
Created attachment 617861 [details]
log of the kernel-mode work and the user-mode test (reading and truncating reproduce this issue)
Root cause 1:
nfs_update_inode in fs/nfs/inode.c does not have enough information to decide whether it should change nfsi->attr_gencount (NFS_INO_INVALID_ATTR should be set when the attributes are invalid).
When a truncate occurs, the file size is set outside of nfs_update_inode; when nfs_update_inode is then called, NFS_INO_INVALID_ATTR is not set, so nfsi->attr_gencount keeps its old value.
When the async read_done arrives afterwards, its fattr->gencount is later than nfsi->attr_gencount, so nfs_update_inode is called (which should not happen in this situation).
The upstream kernel does not have this issue (it has enough information in fs/nfs/inode.c to decide whether nfsi->attr_gencount should change).
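The generation-count ordering behind Root cause 1 can be sketched as a small userspace model. It assumes that each fattr is stamped from a global counter when its request is set up and that a reply counts as newer when its stamp is later than the inode's; all model_* names are hypothetical and this is not the RHEL5 or upstream source.

```c
/* Hypothetical stand-in for the per-inode NFS state. */
struct model_nfsi { long attr_gencount; };

/* Global generation counter (assumption of this model). */
static long model_counter;

static long model_next_gencount(void)
{
    return ++model_counter;
}

/* A reply is treated as newer if its stamp is later than the inode's. */
static int model_gencount_is_newer(long fattr_gen, long inode_gen)
{
    return fattr_gen - inode_gen > 0;
}

/* Returns 1 if a read reply stamped before the truncate would still be
 * treated as newer than the inode's attributes once the truncate is done. */
static int model_stale_reply_accepted(int truncate_bumps_gencount)
{
    struct model_nfsi nfsi;
    long read_gen;

    model_counter = 0;
    nfsi.attr_gencount = model_next_gencount();  /* last attribute refresh */
    read_gen = model_next_gencount();            /* read's fattr is stamped */

    /* The truncate completes here and sets the size directly. */
    if (truncate_bumps_gencount)
        nfsi.attr_gencount = model_next_gencount();

    return model_gencount_is_newer(read_gen, nfsi.attr_gencount);
}
```

In this model, model_stale_reply_accepted(0) returns 1 (the inode's generation count stays old, so the stale reply looks newer), while model_stale_reply_accepted(1) returns 0.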
Jeff Layton, please check this root cause.
The logs are in the related attachment below.
Created attachment 631320 [details]
Dump log data that supports the analysis for Root cause 1.
Please see time_err.tar.bz2, which contains the dump log data for analysing Root cause 1.
Root cause 2:
nfs_size_need_update in fs/nfs/inode.c does not consider the case of an async read_done that follows a truncate to a smaller size.
The upstream kernel also has this issue,
so I will continue to discuss it with the relevant people on the upstream kernel mailing list.
The logs are in the related attachment below.
Created attachment 631324 [details]
Dump log data that supports the analysis for Root cause 2.
The file size_err.tar.bz2 is the supporting log for analysing Root cause 2.
After "fixing" the 2 bugs, the fsx-linux test finally passes.
fix root cause 1:
When the truncate operation calls nfs_refresh_inode (which calls nfs_update_inode), pass additional information (for example an additional parameter) so that nfs_update_inode knows NFS_INO_INVALID_ATTR is in effect and that nfsi->attr_gencount should be updated.
fix root cause 2:
Do not call nfs_size_need_update. This delays the synchronizing of attributes between client and server, but for our test case (no write operations, and only one client and one server) that is acceptable.
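Why both fixes are needed together can be illustrated with a combined userspace model. This is a sketch under assumptions, not the actual patch: the model_* names, the two flags and the sizes are hypothetical, and the "needs update" decision is reduced to the generation-count check and the size check discussed in this bug.

```c
#include <stdint.h>

/* Simplified stand-ins; NOT the kernel structures. */
struct model_inode { uint64_t i_size; long attr_gencount; };
struct model_fattr { uint64_t size; long gencount; };

/* Accept the reply if it looks newer by generation count, or (optionally)
 * if it reports a larger size than the inode currently has. */
static int model_attrs_need_update(const struct model_inode *inode,
                                   const struct model_fattr *fattr,
                                   int use_size_check)
{
    if (fattr->gencount - inode->attr_gencount > 0)
        return 1;
    if (use_size_check && fattr->size > inode->i_size)
        return 1;
    return 0;
}

/* Replays read-then-truncate-down with each fix switched on or off and
 * returns the i_size the client ends up with (4096 is the correct one). */
static uint64_t model_size_after_race(int fix1_bump_gencount,
                                      int fix2_skip_size_check)
{
    long counter = 0;
    struct model_inode inode;
    struct model_fattr stale;

    inode.i_size = 8192;
    inode.attr_gencount = ++counter;      /* last attribute refresh */
    stale.size = 8192;
    stale.gencount = ++counter;           /* read's fattr, pre-truncate size */

    inode.i_size = 4096;                  /* truncate down completes */
    if (fix1_bump_gencount)
        inode.attr_gencount = ++counter;  /* fix 1: inode is now newer */

    if (model_attrs_need_update(&inode, &stale, !fix2_skip_size_check))
        inode.i_size = stale.size;        /* stale reply overwrites i_size */
    return inode.i_size;
}
```

In this model only model_size_after_race(1, 1) keeps the truncated size of 4096; with either fix alone, the other check still lets the stale reply through and the size goes back to 8192.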
In summary, we have 2 root causes.
Root cause 1 is not complex to fix.
For root cause 2, I am discussing it with upstream on the kernel mailing list.
Please confirm whether what I said above is correct, thanks.
If you cannot confirm, I will send the related information to firstname.lastname@example.org; I believe email@example.com can confirm Red Hat's own issues.
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).