Bug 238093 - Kernel NFSv4 client vs NetApp server hits error 10025
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.4
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Assigned To: Jeff Layton
QA Contact: Martin Jenner
Reported: 2007-04-26 23:17 EDT by ratness
Modified: 2007-11-16 20:14 EST
CC List: 2 users

Doc Type: Bug Fix
Last Closed: 2007-06-18 14:28:23 EDT

Attachments: None

Description ratness 2007-04-26 23:17:10 EDT
Description of problem:
NetApp (running ONTAP 7.2.1.1) exports a volume and has NFSv4 enabled. One box
mounts the volume R/W and writes data to it. Two fully-patched RHEL4.4 boxes
running Apache mount the volume over NFSv4
(ro,bg,hard,nointr,timeo=600,rsize=32768,wsize=32768,actimeo=60). We then apply
load and run a soak test against Apache. During the run, we begin to see
  kernel: nfs4_map_errors could not handle NFSv4 error 10025
in /var/log/messages.
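
For reference, the mount options above can be read programmatically. The following is a minimal C sketch that assumes only the option string quoted in this report; nfs_opt_value() and timeo_to_seconds() are hypothetical helper names. Per nfs(5), timeo is expressed in tenths of a second, so timeo=600 means the client waits 60 seconds before retrying a request, while actimeo=60 sets all attribute-cache timeouts to 60 seconds.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the integer value of "key=value" in a
 * comma-separated NFS mount option string, or -1 if the key is absent.
 * Flags without '=' (ro, bg, hard, ...) never match. */
static long nfs_opt_value(const char *opts, const char *key)
{
    assert(opts != NULL && key != NULL);
    size_t klen = strlen(key);
    const char *p = opts;
    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        /* Match only at the start of a token, so "timeo" does not
         * match inside "actimeo". */
        if (len > klen && strncmp(p, key, klen) == 0 && p[klen] == '=')
            return strtol(p + klen + 1, NULL, 10);
        p += len;
        if (*p == ',')
            p++;
    }
    return -1;
}

/* NFS "timeo" is given in deciseconds; convert to whole seconds. */
static long timeo_to_seconds(long timeo_ds)
{
    return timeo_ds / 10;
}
```

With the options from this report, nfs_opt_value(opts, "timeo") yields 600, which timeo_to_seconds() turns into 60 seconds.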

We notice the problem when Apache doesn't serve certain static content.  We do 
  file /path/to/some/file
and get back
/path/to/some/file: : ERROR: cannot read `/path/to/some/file' (Input/output error)

This happens on different files on different servers during the run, so it
appears to be session-related rather than a problem with the server. We have
been unable to reproduce the condition with Solaris 10 boxes.

Version-Release number of selected component (if applicable):
2.6.9-42.0.10.ELsmp

How reproducible:
Appears intermittently after a long soak test; the trigger is unknown.

Steps to Reproduce:
1. NetApp exports a volume as NFSv4.
2. A fully-updated RHEL4.4 box mounts the volume R/O.
3. Point load generators at Apache, which serves content from the volume.
4. Wait for it.
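
The checking side of this soak test can be sketched in C as a loop that repeatedly reads each file served from the mount and records failures. This is a minimal sketch, not the tooling used in this report; read_whole_file() and soak_check() are hypothetical names, and the 32 KB buffer simply mirrors rsize=32768.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: read a file to EOF. Returns 0 on success or the
 * errno of the first failure (the report observed EIO). */
static int read_whole_file(const char *path)
{
    char buf[32768]; /* mirrors rsize=32768 from the report */
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n == 0)
            break; /* end of file */
        if (n < 0) {
            int err = errno;
            if (err == EINTR)
                continue; /* interrupted by a signal: retry */
            close(fd);
            return err;
        }
    }
    close(fd);
    return 0;
}

/* Hypothetical soak loop: read every path `rounds` times, log each
 * failure to stderr, and return the number of EIO failures seen. */
static int soak_check(const char *const *paths, size_t npaths, int rounds)
{
    assert(paths != NULL);
    int eio_count = 0;
    for (int r = 0; r < rounds; r++) {
        for (size_t i = 0; i < npaths; i++) {
            int err = read_whole_file(paths[i]);
            if (err != 0) {
                fprintf(stderr, "round %d: %s: %s\n", r, paths[i], strerror(err));
                if (err == EIO)
                    eio_count++;
            }
        }
    }
    return eio_count;
}
```

Run against a list of static files under the NFSv4 mount while the load generators are active, a nonzero return would correspond to the "Input/output error" that file reports.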
  
Actual results:
Attempts to access files return IO errors, and 'nfs4_map_errors could not handle
NFSv4 error 10025' in syslog.
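
The syslog message and the Input/output error are consistent with a client that translates NFSv4 status codes it recognizes into errno values and falls back to EIO for anything else. The C sketch below illustrates that pattern only; map_nfs4_status() is a hypothetical name and this is not the RHEL4 nfs4_map_errors() source. The status values are taken from RFC 3530.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* A subset of NFSv4 status codes (nfsstat4) from RFC 3530. */
enum nfs4_status {
    NFS4_OK               = 0,
    NFS4ERR_PERM          = 1,
    NFS4ERR_NOENT         = 2,
    NFS4ERR_IO            = 5,
    NFS4ERR_ACCESS        = 13,
    NFS4ERR_STALE_STATEID = 10023,
    NFS4ERR_OLD_STATEID   = 10024,
    NFS4ERR_BAD_STATEID   = 10025,
};

/* Hypothetical mapping: known codes become a negative errno; anything
 * the client cannot translate is logged and surfaced as -EIO, which is
 * what userspace then sees as "Input/output error". */
static int map_nfs4_status(int status)
{
    assert(status >= 0); /* nfsstat4 values are non-negative */
    switch (status) {
    case NFS4_OK:        return 0;
    case NFS4ERR_PERM:   return -EPERM;
    case NFS4ERR_NOENT:  return -ENOENT;
    case NFS4ERR_IO:     return -EIO;
    case NFS4ERR_ACCESS: return -EACCES;
    default:
        fprintf(stderr, "could not handle NFSv4 error %d\n", status);
        return -EIO;
    }
}
```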

Expected results:
Continued perfect filesystem access.

Additional info:
Unmounting and remounting the volume resolves it, but that is not acceptable in
production.
Comment 1 ratness 2007-05-04 19:13:26 EDT
As a follow-up: with a deadline looming, we had to give up and work around it,
so I've lost my testing platform.

FC4 had the fewest RPM changes to make to get a more recent kernel into RHEL4,
so we pulled
 kernel-smp-2.6.17-1.2142_FC4.i686
 mkinitrd-4.2.15-1
 module-init-tools-3.2-0.pre9.0.FC4.4
 udev-071-0.FC4.3

and slapped those into the boxes.  We have been unable to duplicate the 10025
error since then.
Comment 2 Jeff Layton 2007-05-10 12:54:40 EDT
I've proposed a couple of patches for 4.6 that will alleviate problems due to
error 10024 (NFS4ERR_OLD_STATEID) and eliminate the printks you're getting:

  kernel: nfs4_map_errors could not handle NFSv4 error 10025

10025 is NFS4ERR_BAD_STATEID, which basically means that the client is somehow
sending stateids that the server does not recognize. This could be a client
or server bug; it's hard to tell which.
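
To illustrate what NFS4ERR_BAD_STATEID means on the server side, here is a simplified C sketch of stateid validation, loosely following the rules in RFC 3530: a stateid whose opaque "other" field matches no state the server has issued is bad (10025), one carrying an older seqid is old (10024), and one carrying a seqid newer than the server has issued is also bad. struct server_state and check_stateid() are hypothetical; real servers also handle stale (pre-reboot) and special stateids, which are omitted here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NFS4ERR_OLD_STATEID 10024
#define NFS4ERR_BAD_STATEID 10025

/* NFSv4.0 stateid layout (RFC 3530): a sequence number plus
 * 12 opaque bytes chosen by the server. */
struct stateid4 {
    uint32_t seqid;
    unsigned char other[12];
};

/* Hypothetical, simplified table of stateids the server has issued. */
struct server_state {
    const struct stateid4 *current;
    size_t count;
};

/* Returns 0 (NFS4_OK), NFS4ERR_OLD_STATEID, or NFS4ERR_BAD_STATEID. */
static int check_stateid(const struct server_state *st, const struct stateid4 *sid)
{
    assert(st != NULL && sid != NULL);
    for (size_t i = 0; i < st->count; i++) {
        const struct stateid4 *cur = &st->current[i];
        if (memcmp(cur->other, sid->other, sizeof cur->other) != 0)
            continue; /* refers to some other piece of state */
        if (sid->seqid < cur->seqid)
            return NFS4ERR_OLD_STATEID; /* superseded by a newer seqid */
        if (sid->seqid > cur->seqid)
            return NFS4ERR_BAD_STATEID; /* never issued by this server */
        return 0;
    }
    return NFS4ERR_BAD_STATEID; /* server has no record of this state */
}
```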

If you're willing to do so, a good first step would be to test on the kernels
that I have on my people page:

http://people.redhat.com/jlayton

They have a number of nfs and nfsv4 related patches that may make a difference here.
Comment 3 Jeff Layton 2007-06-18 14:28:23 EDT
No response from reporter in over a month. Closing this case. Please reopen if
you have more info.
