Description of problem:
"kernel: NFS: v4 server returned a bad sequence-id error!" is
regularly/repeatedly logged when using NFSv4 on RHEL 4.6.
Version-Release number of selected component (if applicable):
(I'm not sure if nfs-utils was the appropriate component to choose for this
problem, but it seems like it will at least get it routed to the right person.)
Just touch a new file on an NFSv4-mounted directory; the sequence-id error
occurs nearly every time.
Steps to Reproduce:
1. mount -t nfs4 netappfiler:/vol/test /var/tmp/test
2. touch /var/tmp/test/testfile
"kernel: NFS: v4 server returned a bad sequence-id error!"
I previously opened Red Hat service request 180277 for this NFSv4 bug (and one
other), and that case has syslog and tcpdump output for the mount issue along
with a sysreport for the system in question. I've verified this on two separate
RHEL4 systems (one i686 and one x86_64). The volume in question is being
mounted from a NetApp filer.
Oops--for "the mount issue" above, read "the sequence-id issue".
John Caruso, hi. I found this BZ while looking into a _similar_ issue. I see that you have worked through the problem in bug 432861 (https://bugzilla.redhat.com/show_bug.cgi?id=432861).
So is it OK to close this one?
I don't know if there's any relationship between bug 432861 and this bug, so no, this bug shouldn't be closed.
I just gave this a try using kernel 2.6.9-78.6.EL.jtltest.47smp on my RH people page and haven't seen this error with simple file creations:
Would you be able to test these somewhere non-critical and see if the problem might already be fixed? What sort of server are you testing against here?
Actually, it would be even better to test this with a jtltest.50 kernel or greater. I think there was a problem in some earlier kernels that could cause this.
Please let me know if those kernels seem to cure this problem for you.
I verified that the bug does still occur on the currently-released RHEL4 kernel (78.0.1). I also just installed kernel-smp-2.6.9-78.7.EL.jtltest.50.x86_64 off of your page and tested with that, and it gets the error as well. This is my test sequence in its entirety:
mount -t nfs4 netappfiler:/vol/test /var/tmp/test
Executing this series of commands reliably (90+% of the time) results in the "NFS: v4 server returned a bad sequence-id error!" message. One thing, though: it's usually (though not always) the case that this error is only logged to the netdump server defined in SYSLOGADDR, rather than being directly logged by the kernel via syslog--that may be why you haven't been seeing the message. So you may need to enable netdump logging on your test machine and check for the message on the netdump server.
Also, the NetApp filer in our case is running Data ONTAP 7.2.3.
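Since the message may land on the netdump server rather than in the local syslog, it can help to count hits in whatever log file was actually captured. A minimal sketch in Python, assuming the log has already been read into a list of lines; the function name and the sample lines are hypothetical, and only the message text is taken from this report:

```python
# Message text as quoted in this bug report.
MESSAGE = "NFS: v4 server returned a bad sequence-id error!"


def count_bad_seqid(lines):
    """Return how many of the given log lines contain the bad sequence-id message."""
    return sum(1 for line in lines if MESSAGE in line)


# Hypothetical sample input; real lines would come from the syslog or netdump log.
sample = [
    "kernel: NFS: v4 server returned a bad sequence-id error!",
    "kernel: some unrelated message",
    "kernel: NFS: v4 server returned a bad sequence-id error!",
]
hits = count_bad_seqid(sample)
```

Dividing the count by the number of test runs gives a rough reproduction rate to compare between kernels.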
I'm able to see this too against NetApp servers, and somewhat less frequently against other servers. What I'm seeing is that we're issuing an open call and the server is returning NFS4ERR_BAD_SEQID. After we get this error, we issue almost the exact same open call and that does not return the error. The main difference between the working and non-working calls appears to be the data in the open_owner4.
Still checking the RFC to see if I can tell what's causing this.
Ok, the problem is not that we're reusing open_owner IDs, but rather that we're sending identical strings to the server on a SETCLIENTID call. This means that the server sends us back the same clientid on each mount. We then start with open_owner ID 0 and have to roll through the entire list of used open_owners until we find one that hasn't been used yet. This slows things down, and each attempt gives us one of these printks.
I have a patch that I think will fix it. I'll plan to add this into my test kernels so that it can be easily tested.
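To make the mechanism described above easier to follow, here is a toy model in Python. It is only a sketch under simplifying assumptions, not the kernel or server code: the ToyServer class, the mount_and_open helper, and the rule that every open uses a fresh open_owner starting at seqid 0 are all hypothetical and exist only to illustrate why an identical SETCLIENTID string leads to repeated bad sequence-id replies after a remount.

```python
class ToyServer:
    """Rough stand-in for a server's clientid/open_owner bookkeeping (hypothetical)."""

    def __init__(self):
        self.clientids = {}   # identity string -> clientid
        self.next_seqid = {}  # (clientid, owner_id) -> seqid the server expects next

    def setclientid(self, identity):
        # Identical identity strings get the same clientid back.
        return self.clientids.setdefault(identity, len(self.clientids) + 1)

    def open(self, clientid, owner_id, seqid):
        key = (clientid, owner_id)
        expected = self.next_seqid.get(key)
        # A known open_owner must present the expected seqid.
        if expected is not None and seqid != expected:
            return "BAD_SEQID"
        self.next_seqid[key] = seqid + 1
        return "OK"


def mount_and_open(server, identity, opens):
    """Simulate one mount doing `opens` opens; return the number of BAD_SEQID replies."""
    clientid = server.setclientid(identity)
    owner_id = 0  # assumption: numbering restarts at 0 on every mount
    errors = 0
    for _ in range(opens):
        # Roll forward through already-used open_owner IDs until one is accepted.
        while server.open(clientid, owner_id, 0) == "BAD_SEQID":
            errors += 1
            owner_id += 1
        owner_id += 1
    return errors


same = ToyServer()
first = mount_and_open(same, "client-string", 3)   # fresh server state
second = mount_and_open(same, "client-string", 1)  # remount, identical string

distinct = ToyServer()
mount_and_open(distinct, "client-string-1", 3)
third = mount_and_open(distinct, "client-string-2", 1)  # remount, different string
```

In this model the remount with an identical string has to step past every open_owner left over from the previous mount, collecting one error per step, while a remount with a different string gets a new clientid and sees none.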
John, I've built some kernels with a patch that I think will fix this and put them on my people page.
Could you test them and let me know if they help with the problem you're seeing?
Which kernel (for RHEL4 x86_64)?
I'd use whatever kernel variant you're using now ("normal", smp or whatever). For instance, if you're using the uniprocessor kernel, then you'll probably want this:
The RHEL4 kernels on that page are all built from the same sources, just with different configs.
Sorry, I'd mistaken the RHEL5 kernels for alternate RHEL4 kernels.
Looks like that kernel fixes it--I couldn't reproduce the message anymore in 20 tests (or so).
Thanks for testing it. I'll add this to the proposed list for 4.8.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
Committed in 78.11.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.