Description of problem:
A NetApp (running OnTAP 7.2.1.1) exports a volume and has NFSv4 enabled. One box mounts the volume R/W and places data on it. Two fully-patched RHEL4.4 boxes running Apache mount the volume over NFSv4 with these options:

    ro,bg,hard,nointr,timeo=600,rsize=32768,wsize=32768,actimeo=60

Load kicks in, and we run a soak test on Apache. During the run, we begin to see this in /var/log/messages:

    kernel: nfs4_map_errors could not handle NFSv4 error 10025

We notice the problem when Apache stops serving certain static content. Running "file /path/to/some/file" returns:

    /path/to/some/file: : ERROR: cannot read `/path/to/some/file' (Input/output error)

This happens on different files on different servers during the run, so it's session-related, not a problem with the server. We've been unable to replicate the condition with Solaris 10 boxes.

Version-Release number of selected component (if applicable):
2.6.9-42.0.10.ELsmp

How reproducible:
Flakes into existence after a long soak test, but the trigger is not known.

Steps to Reproduce:
1. NetApp exports a volume as NFSv4.
2. Fully-updated RHEL4.4 box mounts the volume R/O.
3. Add load generators whamming Apache, which references the volume.
4. Wait for it.

Actual results:
Attempts to access files return I/O errors, and 'nfs4_map_errors could not handle NFSv4 error 10025' appears in syslog.

Expected results:
Continued perfect filesystem access.

Additional info:
A umount/mount can resolve it, but that's going to be bad in production.
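For anyone trying to catch the failure while a soak test is running, the manual "file" check above could be automated along these lines. This is only a sketch: the function names find_unreadable and is_io_error are made up for illustration, and it assumes that attempting to read the first byte of a file is enough to surface the same Input/output error that "file" reports.

```python
import errno


def find_unreadable(paths):
    """Return (path, errno name) pairs for files whose first byte cannot be read."""
    failures = []
    for path in paths:
        try:
            with open(path, "rb") as handle:
                handle.read(1)
        except OSError as exc:
            # Translate the numeric errno into its symbolic name, e.g. 5 -> "EIO".
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            failures.append((path, name))
    return failures


def is_io_error(failure):
    """True when a failure tuple carries EIO (Input/output error)."""
    return failure[1] == "EIO"
```

Feeding it the list of static files Apache serves from the NFS mount and filtering the result with is_io_error would show which files are affected on which client at a given moment.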
As a followup: with a deadline looming, we had to give up and work around it, so I've lost my testing platform. FC4 had the fewest RPM changes to make to get a more recent kernel into RHEL4, so we pulled

    kernel-smp-2.6.17-1.2142_FC4.i686
    mkinitrd-4.2.15-1
    module-init-tools-3.2-0.pre9.0.FC4.4
    udev-071-0.FC4.3

and slapped those into the boxes. We have been unable to duplicate the 10025 error since then.
I've proposed a couple of patches for 4.6 that will alleviate problems due to error 10024 (NFS4ERR_OLD_STATEID), and eliminate the printks you're getting:

    kernel: nfs4_map_errors could not handle NFSv4 error 10025

10025 is NFS4ERR_BAD_STATEID, which basically means that the client is somehow sending along stateids that the server is not aware of. This could be a client or server bug -- it's hard to tell which.

If you're willing to do so, a good first step would be to test on the kernels that I have on my people page:

    http://people.redhat.com/jlayton

They have a number of nfs and nfsv4 related patches that may make a difference here.
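To get a rough idea of how often each stateid error shows up on a client, the syslog messages can be tallied with something like the following. It is a sketch rather than a supported tool: the function name count_unhandled_errors is hypothetical, the table only covers the two codes mentioned in this report, and the regular expression assumes the message text is exactly as quoted above.

```python
import re
from collections import Counter

# Only the two codes discussed in this report; other NFSv4 status codes
# are reported as "unknown (<code>)".
NFS4_STATEID_ERRORS = {
    10024: "NFS4ERR_OLD_STATEID",
    10025: "NFS4ERR_BAD_STATEID",
}

# Assumes the kernel message text matches the line quoted in this report.
PATTERN = re.compile(r"nfs4_map_errors could not handle NFSv4 error (\d+)")


def count_unhandled_errors(lines):
    """Count nfs4_map_errors messages per NFSv4 error name."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            code = int(match.group(1))
            name = NFS4_STATEID_ERRORS.get(code, "unknown (%d)" % code)
            counts[name] += 1
    return counts
```

Since an open file iterates line by line, passing the file object for /var/log/messages directly to count_unhandled_errors should work on the affected boxes.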
No response from reporter in over a month. Closing this case. Please reopen if you have more info.