432881 – kernel: NFS: v4 server returned a bad sequence-id error!

Summary: kernel: NFS: v4 server returned a bad sequence-id error!

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.6
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Jeff Layton
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-02-14 22:05 UTC by John Caruso
Modified:	2014-06-18 07:37 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-05-18 19:36:36 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:1024	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update	2009-05-18 14:57:26 UTC

Description John Caruso 2008-02-14 22:05:24 UTC

Description of problem:
"kernel: NFS: v4 server returned a bad sequence-id error!" is
regularly/repeatedly logged when using NFSv4 on RHEL 4.6.

Version-Release number of selected component (if applicable):
nfs-utils-1.0.6-84.EL4
kernel-smp-2.6.9-67.EL

(I'm not sure if nfs-utils was the appropriate component to choose for this
problem, but it seems like it will at least get it routed to the right person.)

How reproducible:
Just touch a new file on an NFSv4-mounted directory; the sequence-id error
occurs nearly every time.

Steps to Reproduce:
1. mount -t nfs4 netappfiler:/vol/test /var/tmp/test
2. touch /var/tmp/test/testfile
  
Actual results:
"kernel: NFS: v4 server returned a bad sequence-id error!"

Expected results:
No error.

Additional info:
I previously opened Red Hat service request 180277 for this NFSv4 bug (and one
other), and that case has syslog and tcpdump output for the mount issue along
with a sysreport for the system in question.  I've verified this on two separate
RHEL4 systems (one i686 and one x86_64).  The volume in question is being
mounted from a NetApp filer.

Comment 1 John Caruso 2008-02-14 22:11:28 UTC

Oops--for "the mount issue" above, read "the sequence-id issue".

Comment 3 Michael Kearey 2008-08-08 08:01:59 UTC

Hi John, I found this BZ while looking into a _similar_ issue. I see that you have worked through the problem in https://bugzilla.redhat.com/show_bug.cgi?id=432861.


So is it ok to close this one?

Regards,
Michael

Comment 4 John Caruso 2008-08-08 15:28:26 UTC

I don't know if there's any relationship between bug 432861 and this bug, so no, this bug shouldn't be closed.

Comment 5 Jeff Layton 2008-08-29 17:45:00 UTC

Hi John,
   I just gave this a try using kernel 2.6.9-78.6.EL.jtltest.47smp from my RH people page and haven't seen this error with simple file creations:

http://people.redhat.com/jlayton/

Would you be able to test these somewhere non-critical and see if the problem might already be fixed? What sort of server are you testing against here?

Comment 6 Jeff Layton 2008-09-03 19:44:09 UTC

Actually, it would be even better to test this with a jtltest.50 kernel or greater. I think there was a problem in some earlier kernels that could cause this.

Please let me know if those kernels seem to cure this problem for you.

Comment 7 John Caruso 2008-09-04 23:53:22 UTC

I verified that the bug does still occur on the currently-released RHEL4 kernel (78.0.1).  I also just installed kernel-smp-2.6.9-78.7.EL.jtltest.50.x86_64 off of your page and tested with that, and it gets the error as well.  This is my test sequence in its entirety:

   mount -t nfs4 netappfiler:/vol/test /var/tmp/test
   touch /var/tmp/test/testfile
   umount /var/tmp/test

Executing this series of commands reliably (90+% of the time) results in the "NFS: v4 server returned a bad sequence-id error!" message.  One thing, though: it's usually (though not always) the case that this error is only logged to the netdump server defined in SYSLOGADDR, rather than being directly logged by the kernel via syslog--that may be why you haven't been seeing the message.  So you may need to enable netdump logging on your test machine and check for the message on the netdump server.

Also, the Netapp filer in our case is running Data ONTAP 7.2.3.
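
For anyone who wants to put a number on the "90+% of the time" figure, here is a minimal Python sketch of the test sequence above plus a helper that counts the message in collected log lines. The function names and default arguments are hypothetical, `run_trial` needs root and a real NFSv4 export, and where the log lines come from (local syslog or the netdump server) is left to the tester.

```python
import subprocess

# Message text as quoted in this report.
MESSAGE = "NFS: v4 server returned a bad sequence-id error!"


def count_bad_seqid(log_lines):
    """Count the log lines that contain the bad sequence-id message."""
    return sum(1 for line in log_lines if MESSAGE in line)


def run_trial(export="netappfiler:/vol/test", mountpoint="/var/tmp/test"):
    """Run the mount/touch/umount sequence once (needs root and a real export)."""
    subprocess.run(["mount", "-t", "nfs4", export, mountpoint], check=True)
    try:
        subprocess.run(["touch", f"{mountpoint}/testfile"], check=True)
    finally:
        # Always unmount, even if the touch failed.
        subprocess.run(["umount", mountpoint], check=True)
```

Calling `run_trial()` in a loop and then passing the newly logged lines to `count_bad_seqid()` gives a rough hit rate per kernel.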

Comment 8 Jeff Layton 2008-09-05 19:04:46 UTC

I'm able to see this too against NetApp servers, and somewhat less frequently against other servers. What I'm seeing is that we issue an OPEN call and the server returns NFS4ERR_BAD_SEQID. After we get this error, we issue almost the exact same OPEN call, and that one does not return the error. The main difference between the working and non-working calls appears to be the data in the open_owner4.

Still checking the RFC to see if I can tell what's causing this.

Comment 9 Jeff Layton 2008-09-05 20:24:05 UTC

Ok, the problem is not that we're reusing open_owner IDs, but rather that we're sending identical strings to the server on a SETCLIENTID call. This means that the server sends us back the same clientid on each mount. We then start with open_owner ID 0 and have to roll through the entire list of used open_owners until we find one that hasn't been used yet. This slows things down, and each attempt gives us one of these printk's.

I have a patch that I think will fix it. I'll plan to add this into my test kernels so that it can be easily tested.
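
The mechanism described in this comment can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions, not the kernel code or real protocol handling: `ToyServer`, `mount_and_touch`, and the seqid bookkeeping are all hypothetical, and using a unique SETCLIENTID string per mount is just one way to show the contrast, not necessarily what the patch does.

```python
from itertools import count


class ToyServer:
    """Hypothetical stand-in for an NFSv4 server (greatly simplified)."""

    def __init__(self):
        self._clientids = {}            # SETCLIENTID string -> clientid
        self._next_clientid = count(1)
        self._owners = {}               # (clientid, open_owner ID) -> expected seqid

    def setclientid(self, id_string):
        # An identical string gets the same clientid back.
        if id_string not in self._clientids:
            self._clientids[id_string] = next(self._next_clientid)
        return self._clientids[id_string]

    def open(self, clientid, owner_id, seqid):
        key = (clientid, owner_id)
        # An open_owner the server has not seen yet may start at any seqid.
        expected = self._owners.get(key, seqid)
        if seqid != expected:
            return "BAD_SEQID"
        self._owners[key] = seqid + 1
        return "OK"


def mount_and_touch(server, id_string):
    """Simulate one mount plus one file creation; return the BAD_SEQID count."""
    clientid = server.setclientid(id_string)
    bad = 0
    for owner_id in count(0):           # a fresh mount starts again at open_owner ID 0
        if server.open(clientid, owner_id, seqid=0) == "OK":
            return bad
        bad += 1                        # each failed attempt corresponds to one printk
```

In this model, repeated mounts that send the same string collect one more BAD_SEQID reply each time, because every previously used open_owner has to be skipped first. Mounts that send distinct strings get a fresh clientid and see no errors.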

Comment 10 Jeff Layton 2008-09-05 23:10:47 UTC

John, I've built some kernels with a patch that I think will fix this and put them on my people page:

http://people.redhat.com/jlayton/

Could you test them and let me know if they help the problem you're seeing?

Comment 11 John Caruso 2008-09-05 23:17:29 UTC

Which kernel (for RHEL4 x86_64)?

Comment 12 Jeff Layton 2008-09-05 23:43:36 UTC

I'd use whatever kernel variant you're using now ("normal", smp or whatever). For instance, if you're using the uniprocessor kernel, then you'll probably want this:

kernel-2.6.9-78.8.EL.jtltest.51.x86_64.rpm

The RHEL4 kernels on that page are all built from the same sources, just with different configs.

Comment 13 John Caruso 2008-09-05 23:53:29 UTC

Sorry, I'd mistaken the RHEL5 kernels for alternate RHEL4 kernels.
Looks like that kernel fixes it--I couldn't reproduce the message anymore in 20 tests (or so).

Comment 14 Jeff Layton 2008-09-06 00:08:41 UTC

Thanks for testing it. I'll add this to the proposed list for 4.8.

Comment 15 RHEL Program Management 2008-09-07 01:47:42 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 17 Vivek Goyal 2008-09-25 13:17:34 UTC

Committed in 78.11.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 20 errata-xmlrpc 2009-05-18 19:36:36 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html
