Bug 432881 - kernel: NFS: v4 server returned a bad sequence-id error!
Summary: kernel: NFS: v4 server returned a bad sequence-id error!
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.6
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jeff Layton
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-02-14 22:05 UTC by John Caruso
Modified: 2014-06-18 07:37 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-05-18 19:36:36 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:1024 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update 2009-05-18 14:57:26 UTC

Description John Caruso 2008-02-14 22:05:24 UTC
Description of problem:
"kernel: NFS: v4 server returned a bad sequence-id error!" is
regularly/repeatedly logged when using NFSv4 on RHEL 4.6.

Version-Release number of selected component (if applicable):
nfs-utils-1.0.6-84.EL4
kernel-smp-2.6.9-67.EL

(I'm not sure if nfs-utils was the appropriate component to choose for this
problem, but it seems like it will at least get it routed to the right person.)

How reproducible:
Just touch a new file on an NFSv4-mounted directory; the sequence-id error
occurs nearly every time.

Steps to Reproduce:
1. mount -t nfs4 netappfiler:/vol/test /var/tmp/test
2. touch /var/tmp/test/testfile
  
Actual results:
"kernel: NFS: v4 server returned a bad sequence-id error!"

Expected results:
No error.

Additional info:
I previously opened Red Hat service request 180277 for this NFSv4 bug (and one
other), and that case has syslog and tcpdump output for the mount issue along
with a sysreport for the system in question.  I've verified this on two separate
RHEL4 systems (one i686 and one x86_64).  The volume in question is being
mounted from a Netapp filer.

Comment 1 John Caruso 2008-02-14 22:11:28 UTC
Oops--for "the mount issue" above, read "the sequence-id issue".

Comment 3 Michael Kearey 2008-08-08 08:01:59 UTC
Hi John, I found this BZ while looking into a _similar_ issue. I see that you have worked through the problem in https://bugzilla.redhat.com/show_bug.cgi?id=432861.

So is it OK to close this one?

Regards,
Michael

Comment 4 John Caruso 2008-08-08 15:28:26 UTC
I don't know if there's any relationship between bug 432861 and this bug, so no, this bug shouldn't be closed.

Comment 5 Jeff Layton 2008-08-29 17:45:00 UTC
Hi John,
   I just gave this a try using kernel 2.6.9-78.6.EL.jtltest.47smp on my RH people page and haven't seen this error with simple file creations:

http://people.redhat.com/jlayton/

Would you be able to test these somewhere non-critical and see if the problem might already be fixed? What sort of server are you testing against here?

Comment 6 Jeff Layton 2008-09-03 19:44:09 UTC
Actually, it would be even better to test this with a jtltest.50 kernel or greater. I think there was a problem in some earlier kernels that could cause this.

Please let me know if those kernels seem to cure this problem for you.

Comment 7 John Caruso 2008-09-04 23:53:22 UTC
I verified that the bug does still occur on the currently-released RHEL4 kernel (78.0.1).  I also just installed kernel-smp-2.6.9-78.7.EL.jtltest.50.x86_64 off of your page and tested with that, and it gets the error as well.  This is my test sequence in its entirety:

   mount -t nfs4 netappfiler:/vol/test /var/tmp/test
   touch /var/tmp/test/testfile
   umount /var/tmp/test

Executing this series of commands reliably (90+% of the time) results in the "NFS: v4 server returned a bad sequence-id error!" message.  One thing, though: it's usually (though not always) the case that this error is only logged to the netdump server defined in SYSLOGADDR, rather than being directly logged by the kernel via syslog--that may be why you haven't been seeing the message.  So you may need to enable netdump logging on your test machine and check for the message on the netdump server.

Also, the Netapp filer in our case is running Data ONTAP 7.2.3.
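
To put a number on the "90+% of the time" figure above, one could count how often the message appears in the log text collected on the syslog or netdump server. The following is only a minimal sketch: the function name count_bad_seqid is hypothetical, and the sample lines are made up for illustration (only the message text itself comes from this report).

```python
# The kernel message exactly as quoted in this bug report.
MESSAGE = "NFS: v4 server returned a bad sequence-id error!"


def count_bad_seqid(log_lines):
    """Return how many of the given log lines contain the message."""
    return sum(1 for line in log_lines if MESSAGE in line)


# Made-up sample lines, for illustration only (not real log output).
sample_log = [
    "client kernel: NFS: v4 server returned a bad sequence-id error!",
    "client kernel: some unrelated message",
    "client kernel: NFS: v4 server returned a bad sequence-id error!",
]

print(count_bad_seqid(sample_log))  # prints 2
```

Dividing the count by the number of mount/touch/umount runs gives a rough reproduction rate, assuming the log covers exactly those runs.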

Comment 8 Jeff Layton 2008-09-05 19:04:46 UTC
I'm able to see this too against netapp servers, and somewhat less frequently against other servers. What I'm seeing is that we're issuing an open call and the server is returning NFS4ERR_BAD_SEQID. After we get this error, we issue almost the exact same open call and that does not return the error. The main difference between the working and non-working calls appears to be the data in the open_owner4.

Still checking the RFC to see if I can tell what's causing this.

Comment 9 Jeff Layton 2008-09-05 20:24:05 UTC
Ok, the problem is not that we're reusing open_owner IDs, but rather that we're sending identical strings to the server on a SETCLIENTID call. This means that the server sends us back the same clientid on each mount. We then start with open_owner ID 0 and have to roll through the entire list of used open_owners until we find one that hasn't been used yet. This slows things down, and each attempt gives us one of these printks.

I have a patch that I think will fix it. I'll plan to add this into my test kernels so that it can be easily tested.
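
The mechanism described in this comment can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel code and not how any particular server is implemented: the class ToyServer, the function mount_and_touch, and the identifier strings are all hypothetical, and the server is assumed to remember the next sequence id it expects for each (clientid, open_owner) pair across mounts. The last part uses a different identifier string per mount purely to show the contrast; it is not a description of the actual patch.

```python
class ToyServer:
    """Toy stand-in for an NFSv4 server's client and open_owner state."""

    def __init__(self):
        self.clientids = {}       # identifier string -> clientid
        self.expected_seqid = {}  # (clientid, owner_id) -> next expected seqid

    def setclientid(self, ident):
        # Assumption: an identical identifier string yields the same clientid.
        return self.clientids.setdefault(ident, len(self.clientids) + 1)

    def open(self, clientid, owner_id, seqid):
        key = (clientid, owner_id)
        # An open_owner the server has never seen is accepted as-is.
        expected = self.expected_seqid.get(key, seqid)
        if seqid != expected:
            return "BAD_SEQID"
        self.expected_seqid[key] = seqid + 1
        return "OK"


def mount_and_touch(server, ident):
    """Model one mount plus one file creation; return the error count."""
    clientid = server.setclientid(ident)
    owner_id = 0  # every mount starts again at open_owner ID 0
    errors = 0
    # Roll through open_owner IDs until the server accepts one.
    while server.open(clientid, owner_id, seqid=0) == "BAD_SEQID":
        errors += 1  # each failed attempt corresponds to one logged message
        owner_id += 1
    return errors


# Identical identifier string on every mount: errors grow with each mount.
server = ToyServer()
same = [mount_and_touch(server, "client-string") for _ in range(3)]
print(same)  # prints [0, 1, 2]

# Hypothetical contrast: a different string per mount gives a fresh clientid.
server = ToyServer()
unique = [mount_and_touch(server, "client-string-%d" % i) for i in range(3)]
print(unique)  # prints [0, 0, 0]
```

In this model the number of rejected opens equals the number of open_owner IDs already used under the reused clientid, which matches the "roll through the entire list" behaviour described above.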

Comment 10 Jeff Layton 2008-09-05 23:10:47 UTC
John, I've built some kernels with a patch that I think will fix this and put them on my people page:

http://people.redhat.com/jlayton/

Could you test them and let me know if they help the problem you're seeing?

Comment 11 John Caruso 2008-09-05 23:17:29 UTC
Which kernel (for RHEL4 x86_64)?

Comment 12 Jeff Layton 2008-09-05 23:43:36 UTC
I'd use whatever kernel variant you're using now ("normal", smp or whatever). For instance, if you're using the uniprocessor kernel, then you'll probably want this:

kernel-2.6.9-78.8.EL.jtltest.51.x86_64.rpm

The RHEL4 kernels on that page are all built from the same sources, just with different configs.

Comment 13 John Caruso 2008-09-05 23:53:29 UTC
Sorry, I'd confused the RHEL5 kernels for alternate RHEL4 kernels.  
Looks like that kernel fixes it--I couldn't reproduce the message anymore in 20 tests (or so).

Comment 14 Jeff Layton 2008-09-06 00:08:41 UTC
Thanks for testing it. I'll add this to the proposed list for 4.8.

Comment 15 RHEL Program Management 2008-09-07 01:47:42 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 17 Vivek Goyal 2008-09-25 13:17:34 UTC
Committed in 78.11.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 20 errata-xmlrpc 2009-05-18 19:36:36 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html

