238093 – Kernel NFSv4 client vs NetApp server hits error 10025

Summary: Kernel NFSv4 client vs NetApp server hits error 10025

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.4
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Jeff Layton
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-04-27 03:17 UTC by ratness
Modified:	2007-11-17 01:14 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-06-18 18:28:23 UTC
Target Upstream Version:
Embargoed:

Description ratness 2007-04-27 03:17:10 UTC

Description of problem:
NetApp (running OnTAP 7.2.1.1) exports a volume, and has NFSv4 on.  One box
mounts the volume R/W and places data on it.  Two fully-patched RHEL4.4 boxes
running Apache mount the volume over NFSv4
(ro,bg,hard,nointr,timeo=600,rsize=32768,wsize=32768,actimeo=60).  Load kicks
in, and we run a soak test on Apache.  During the run, we begin to see
  kernel: nfs4_map_errors could not handle NFSv4 error 10025
in /var/log/messages.
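
To get a rough idea of how often the message appears during a soak run, the log can be scanned for it. The following is a minimal sketch, not part of the original report: the function name `count_unhandled_nfs4_errors` is hypothetical, and the pattern assumes the message text is exactly as quoted above.

```python
import re
from collections import Counter

# Matches the kernel message quoted in this report and captures the error number.
# Assumption: the wording is identical on every line of interest.
PATTERN = re.compile(r"nfs4_map_errors could not handle NFSv4 error (\d+)")


def count_unhandled_nfs4_errors(lines):
    """Return a Counter mapping NFSv4 error number -> number of log lines seen."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

One way to use it would be `with open("/var/log/messages", errors="replace") as log: print(count_unhandled_nfs4_errors(log))`, run as a user who can read the log.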

We notice the problem when Apache doesn't serve certain static content.  We do 
  file /path/to/some/file
and get back
/path/to/some/file: : ERROR: cannot read `/path/to/some/file' (Input/output error)
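
The manual `file` check can be approximated in bulk with a small probe that tries to read each file and records the ones that fail. This is only a sketch under assumptions: `probe_files` and `files_under` are hypothetical helper names, and a read served from the client's cache might not show the error even when the file is affected.

```python
import errno
import os


def probe_files(paths, nbytes=1):
    """Try to read nbytes from each path; return [(path, errno_name)] for failures."""
    failures = []
    for path in paths:
        try:
            with open(path, "rb") as handle:
                handle.read(nbytes)
        except OSError as exc:
            # An "Input/output error" like the one above would show up here as EIO.
            failures.append((path, errno.errorcode.get(exc.errno, str(exc.errno))))
    return failures


def files_under(root):
    """Yield the path of every file found below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)
```

For example, `probe_files(files_under("/path/to/some"))` would list the files under that directory that cannot currently be read, together with the error name.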

This happens on different files on different servers during the run, so it
appears to be session-related rather than a problem with one server, and we've
been unable to replicate the condition with Solaris 10 boxes.

Version-Release number of selected component (if applicable):
2.6.9-42.0.10.ELsmp

How reproducible:
Flakes into existence after a long soak test, but the trigger is not known.

Steps to Reproduce:
1. NetApp exports a volume as NFSv4.
2. Fully-updated RHEL4.4 box mounts the volume R/O.
3. Add load generators hammering Apache, which serves content from the volume.
4. Wait for it.
  
Actual results:
Attempts to access files return I/O errors, and 'nfs4_map_errors could not handle
NFSv4 error 10025' appears in syslog.

Expected results:
Continued perfect filesystem access.

Additional info:
A umount/mount cycle can resolve it, but that's going to be bad in production.

Comment 1 ratness 2007-05-04 23:13:26 UTC

As a followup: with a deadline looming, we had to give up and work around it,
so I've lost my testing platform.

FC4 had the fewest RPM changes to make to get a more recent kernel into RHEL4,
so we pulled
 kernel-smp-2.6.17-1.2142_FC4.i686
 mkinitrd-4.2.15-1
 module-init-tools-3.2-0.pre9.0.FC4.4
 udev-071-0.FC4.3

and slapped those into the boxes.  We have been unable to duplicate the 10025
error since then.

Comment 2 Jeff Layton 2007-05-10 16:54:40 UTC

I've proposed a couple of patches for 4.6 that will alleviate problems due to
error 10024 (NFS4ERR_OLD_STATEID), and eliminate the printk's you're getting:

  kernel: nfs4_map_errors could not handle NFSv4 error 10025

10025 is NFS4ERR_BAD_STATEID, which basically means that the client is somehow
sending along stateids that the server is not aware of. This could be a client
or server bug -- it's hard to tell which.
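
For reference, the numeric codes mentioned in this comment can be turned into their protocol names with a small lookup. This is an illustrative sketch only: the table lists just a few stateid-related status codes from the NFSv4 specification (RFC 3530), and `nfs4_error_name` is a hypothetical helper, not a kernel function.

```python
# A small subset of NFSv4 status codes (values as defined in RFC 3530).
# Only stateid-related codes are listed; this is not the full table.
NFS4_STATEID_ERRORS = {
    10023: "NFS4ERR_STALE_STATEID",
    10024: "NFS4ERR_OLD_STATEID",
    10025: "NFS4ERR_BAD_STATEID",
}


def nfs4_error_name(code):
    """Return the symbolic name for a known code, or a placeholder for others."""
    return NFS4_STATEID_ERRORS.get(code, f"unknown ({code})")
```

Calling `nfs4_error_name(10025)` gives the name for the error reported in this bug; codes outside the table come back as `unknown (...)`.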

If you're willing to do so, a good first step would be to test on the kernels
that I have on my people page:

http://people.redhat.com/jlayton

They have a number of nfs and nfsv4 related patches that may make a difference here.

Comment 3 Jeff Layton 2007-06-18 18:28:23 UTC

No response from reporter in over a month. Closing this case. Please reopen if
you have more info.
