Description of problem:
I am using NFSv4 from Fedora 11 against a CentOS 5 server. The share is mounted over NFSv4 with Kerberos. The relevant part of fstab looks as follows:

nest.pheasant:/home1 /home/home1 nfs4 sec=krb5 0 0

The problem has appeared since the Fedora 10 kernels (see BZ 476084).

Version-Release number of selected component (if applicable):
kernel 2.6.18-162.el5xen (dzikus), but the same happens with stock kernels.

How reproducible:
I have not found a way to reproduce the problem. It appears most often with Evolution.

Actual results:
Fedora dmesg shows many lines like:

NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8801afdd4e28!
NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8801aedac628!
NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8801aedac628!
NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8801aedac228!
NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8801afe72828!
NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8801afe72e28!

and the CentOS dmesg says:

NFSD: preprocess_seqid_op: bad seqid (expected 5764, got 5765)
NFSD: preprocess_seqid_op: bad seqid (expected 5764, got 5765)
NFSD: preprocess_seqid_op: bad seqid (expected 5764, got 5765)
NFSD: preprocess_seqid_op: bad seqid (expected 5764, got 5765)

Evolution hangs and needs a --force-shutdown to recover.

Expected results:
No errors.

Additional info:
On Fedora 9 the error also showed up, but the system recovered. For unrelated reasons I have long been running a self-compiled 2.6.28 kernel on the CentOS box; that kernel did not exhibit the problem. The issue seems to be rare: when I first switched to the dzikus kernel a week ago, the problem seemed to have vanished. I have now seen it happen twice within a day (under light personal use).
Looks like this error is a dprintk on recent upstream kernels. Backporting 2fdada03b3876ab9f84ede160f187e888cafefb4 should silence it. Still, it would be nice to know what triggers this so we can determine whether we have a client or server bug someplace that needs to be fixed.
Hello Jeff,

- The problem does not appear when my server runs the 2.6.28 kernel.
- The error message itself does not worry me; the hang on the client does. It happens primarily when Evolution does a lot of disk IO, something I fail to understand as it uses dovecot, so I assume (no proof) that Evo updates its indexes etc., probably doing a lot of locking.
- As said above, with Fedora 9 I saw the error messages, but no hangs (maybe Evo handled the situation more gracefully?). The systems DO recover, but more often than not I need to force an Evo shutdown; everything else continues to work.
- I use a somewhat complex setup: NFSv4 + Kerberos.

The first point above makes me believe this is a RHEL kernel bug, as 2.6.28 does not have the issue. Could the issue be related to Kerberos?
Recent stock kernels (2.6.31.4 and 2.6.31.5) show the problem as well. The hangs are still there, but due to the demotion of the errors to dprintk() I no longer see the error messages. I have no clue how to debug this...
I've faced the same errors using a very similar setup. The server is CentOS 5.4 using NFS4 + Kerberos and the clients are Fedora 9 up to 11 and Debian Lenny (2.6.26). We have our home directories on NFS using automount. There are very noticeable, serious slowdowns when using Firefox.
Just a note to let you know that I'm experiencing a similar problem.

Client: Linux blaise.nelson.org.au 2.6.29.6-1.rt23.4.fc11.ccrma.x86_64.rt #1 SMP PREEMPT RT Thu Sep 3 18:59:28 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
Server: kernel 2.6.18-28 or so.

I'm using LDAP instead of Kerberos, though. It prevents me from saving documents in OpenOffice. Admittedly, one of my documents is over 60,000 words. Thanks to whoever pointed out that the systems recover eventually; I just need to practise my patience :). You'll note that my Fedora/client kernel is a ccrma realtime one, but unless others are doing that, that's probably not a relevant factor either.
I am experiencing this as well, after upgrading to 2.6.18-200.el5, which I deployed trying to fix https://bugzilla.redhat.com/show_bug.cgi?id=551028. It is followed by (admittedly one hour later):

Jun 4 09:18:07 fe01 kernel: NFS: v4 server returned a bad sequence-id error!
Jun 4 10:18:54 fe01 kernel: INFO: task java:28498 blocked for more than 120 seconds.
Jun 4 10:18:54 fe01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 4 10:18:54 fe01 kernel: java D 000036AF 1892 28498 1 28499 28497 (NOTLB)
Jun 4 10:18:54 fe01 kernel: d66e2e84 00000086 4f174382 000036af f0ac4ac0 f8c4e734 00001000 00000008
Jun 4 10:18:54 fe01 kernel: d58ddaa0 4f4207d1 000036af 002ac44f 00000000 d58ddbac c1c09900 f66d9200
Jun 4 10:18:54 fe01 kernel: d66e2e90 00000000 0000000f d66e2e90 00aafb06 c042c9d5 f65b784c ffffffff
Jun 4 10:18:54 fe01 kernel: Call Trace:
Jun 4 10:18:54 fe01 kernel: [<f8c4e734>] nfs_pagein_one+0x224/0x243 [nfs]
Jun 4 10:18:54 fe01 kernel: [<c042c9d5>] getnstimeofday+0x30/0xb6
Jun 4 10:18:54 fe01 kernel: [<c061d9d6>] io_schedule+0x36/0x59
Jun 4 10:18:54 fe01 kernel: [<c0459568>] sync_page+0x0/0x3b
Jun 4 10:18:54 fe01 kernel: [<c04595a0>] sync_page+0x38/0x3b
Jun 4 10:18:54 fe01 kernel: [<c061dae8>] __wait_on_bit_lock+0x2a/0x52
Jun 4 10:18:54 fe01 kernel: [<c04594e3>] __lock_page+0x52/0x59
Jun 4 10:18:54 fe01 kernel: [<c04362b8>] wake_bit_function+0x0/0x3c
Jun 4 10:18:54 fe01 kernel: [<c045bf27>] filemap_nopage+0x265/0x34a
Jun 4 10:18:54 fe01 kernel: [<c046527f>] __handle_mm_fault+0x178/0xa25
Jun 4 10:18:54 fe01 kernel: [<c047efbf>] sys_stat64+0x1e/0x23
Jun 4 10:18:54 fe01 kernel: [<c061fba6>] do_page_fault+0x23a/0x52d
Jun 4 10:18:54 fe01 kernel: [<c061f96c>] do_page_fault+0x0/0x52d
Jun 4 10:18:54 fe01 kernel: [<c0405a89>] error_code+0x39/0x40
Jun 4 10:18:54 fe01 kernel: =======================

The java process is a Tomcat, and this is a dev server, so not much heavy traffic.
I get this from Fedora 13: 2.6.33.5-112.fc13.x86_64 (but it's been a problem ever since I moved to NFSv4). It is reliably triggered by Evolution. Server is likewise CentOS 5: 2.6.18-194.3.1.el5 Is there anything I can usefully provide to assist in finding this?
Also occurring on Fedora 13 2.6.33.6-147.fc13.x86_64. Should I file a separate bug for that?
We're seeing this too, on a 2.6.18-194.11.4.el5.x86_64 file server with about 200 up-to-date (64-bit) F13 clients. It only happens under heavy load, like at the beginning of the academic year. We're not using kerberos. I don't think it's locking-related since the errors seem to persist after trying "echo 0 > /proc/sys/fs/leases-enable" on the server. (This apparently disables locking and delegation.)
What would probably be a good start is a wire capture of this. Preferably, start the capture, mount up the filesystem and do whatever is needed to make the problem happen and then kill it off. If the capture is large or you're not comfortable attaching it to the bug, send me an email and I'll let you know a location where you can FTP it.
Oh, and if you use tcpdump to do the capture, please make sure you're using a snaplen that's large enough to capture most of the compound RPCs. '-s512' should be sufficient.
I have had very similar problems (F13 to a CentOS 5.5 server, NFS4, Evolution, iptables enabled on the client; all fully updated).

Has everyone opened up the callback port on the client? Just a thought - it fixed it for me.

John

Force the client to tell the server which port to use for callbacks:

ja@milos ~ 1$ cat /etc/modprobe.d/nfs.conf
options nfs callback_tcpport=801

Open up the port on the client:

[root@milos ~]# cat /etc/sysconfig/iptables
...
-A INPUT -m state --state NEW -m tcp -p tcp --dport 801 -s "server_ip_addr" -j ACCEPT
...

Reboot the client.
Thanks, that's an interesting point. I wouldn't think that blocking the callback port would cause this. It should just make it so that the server won't hand you delegations. It may be that opening the callback port simply papered over the bug by allowing delegations to work and thereby reducing the number of state-morphing operations. I'd still like to see a capture of this in action if anyone can provide one. Note that I'll need a binary one (i.e.: tcpdump -w foo.pcap), as I'll likely need to sift through it with wireshark.
I should have added some more detail, but I wasn't convinced my problem was relevant. My symptoms were:

Following a clean boot of the client, and only after login of a user with a home directory on the NFS4 server, the server throws:

maui kernel: nfs4_cb: server "client_ip_addr" not responding, timed out

No apparent problem until the user uses Evolution. Then the directories "can" become unreadable and Evolution is unusable. A totally repeatable method of causing Evo to fail was to start creating a new email (no entries in the header required), start typing into the text window, and when RETURN is hit the "bad seqid" error is given by the server. The server shows errors of the form:

Oct 13 11:14:55 maui kernel: NFSD: preprocess_seqid_op: bad seqid (expected 2787, got 2788)
Oct 13 11:14:55 maui kernel: NFSD: preprocess_seqid_op: bad seqid (expected 2, got 3)
Oct 13 11:15:45 maui kernel: NFSD: preprocess_seqid_op: bad seqid (expected 2787, got 2788)

Disabling iptables solved the problem ... Having fixed the callback port (which made it easy to monitor traffic), wireshark shows the server talking to the client just after an NFS4 login. Is it this transaction that is of interest?
(In reply to comment #14) > Disabling iptables solved the problem ... > Having fixed the callback port (which made it easy to monitor traffic) > wireshark shows the server talking to the client just after an NFS4 login > Is it this transaction that is of interest? It sounds sort of like the server or client may be incrementing or not incrementing a seqid when it should. It would be ideal to have a capture of the entire problem taking place, preferably from mount time until it manifests itself. That would include all traffic between client and server on the nfs port (2049) and the client's callback port. What we'd probably need you to do is re-enable iptables so that the callback port is blocked, start up a capture prior to mounting and then reproduce the problem. Once the problem manifests, kill the capture. Compress the resulting capture file and either attach it here if it's small enough or email me and I'll give you a location to FTP it if it's large.
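For reference, a capture along those lines could look like the following. This is only a sketch: the interface name and file name are placeholders, and port 801 assumes the callback port was pinned with the modprobe option shown earlier in this bug; adjust both to your setup.

```shell
# Start this BEFORE mounting the filesystem. Capture NFS traffic
# (port 2049) plus the client's pinned callback port (801 here),
# with a snaplen large enough for whole compound RPCs, written in
# binary form so it can be opened in wireshark later.
tcpdump -i eth0 -s 512 -w nfs-badseqid.pcap 'port 2049 or port 801'

# After reproducing the hang, stop tcpdump (Ctrl-C) and compress
# the capture before attaching or uploading it:
gzip nfs-badseqid.pcap
```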
I have captured the transactions during a failure:

http://www.vigla.eclipse.co.uk/wireshark_capture_files.tar.gz

Hopefully they are useful (and this is the same problem as the bugzilla!). I will remove the file as soon as you have downloaded it.
(In reply to comment #16) > I will remove the file as soon as you have down loaded it Thanks, got them. Let me look over them and I'll see what I can find.
I had a look and I'm afraid I don't see any BAD_SEQID errors in that capture. It looks like it may have been stopped too quickly? The last packet in the port 2049 capture is at 10:12:44, but the first bad-seqid printk popped at 10:13:04. If you can provide a more complete capture, that may be helpful.
Second attempt!

http://www.vigla.eclipse.co.uk/wireshark_mk2.tar.gz

Hopefully it's OK now. I started the capture before the client booted. The file contains all the data but for the ypserv traffic. I do not see anything in wireshark that says "BAD_SEQID", but there are quite a few retransmissions, lost segments, RPC dups ...
Thanks, that helps... Ok, so I see 2 BAD_SEQID errors in the capture, in frames 16391 and 16405. Here's the whole sequence from open to close (which is where the error comes in):

...file is opened read-only:

16386 84.500290 148.197.29.246 -> 148.197.29.5 NFS V4 COMP Call <EMPTY> PUTFH;SAVEFH;OPEN;GETFH;GETATTR;RESTOREFH;GETATTR
16387 84.500382 148.197.29.5 -> 148.197.29.246 NFS V4 COMP Reply (Call In 16386) <EMPTY> PUTFH;SAVEFH;OPEN;GETFH;GETATTR;RESTOREFH;GETATTR

...attempt to get a write lock on the file, which fails with NFS4ERR_OPENMODE (seems like correct behavior here):

16388 84.500513 148.197.29.246 -> 148.197.29.5 NFS V4 COMP Call <EMPTY> PUTFH;LOCK
16389 84.500580 148.197.29.5 -> 148.197.29.246 NFS V4 COMP Reply (Call In 16388) <EMPTY> PUTFH;LOCK(10038)

...then we try to close the file and get an NFS4ERR_BAD_SEQID:

16390 84.500700 148.197.29.246 -> 148.197.29.5 NFS V4 COMP Call <EMPTY> PUTFH;CLOSE;GETATTR
16391 84.500778 148.197.29.5 -> 148.197.29.246 NFS V4 COMP Reply (Call In 16390) <EMPTY> PUTFH;CLOSE(10026)

...it seems somewhat likely that the problem is related to bad lock state handling here, but I'll need to look over it more closely to understand whether it's a client or server problem.
From RFC 3530:

   The client MUST monotonically increment the sequence number for the
   CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE
   operations.  This is true even in the event that the previous
   operation that used the sequence number received an error.  The only
   exception to this rule is if the previous operation received one of
   the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID,
   NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR,
   NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE.

The seqids from the client look correct:

OPEN:  0x000003be
LOCK:  0x000003bf
CLOSE: 0x000003c0

...so this does look like a server bug. I'll have to look over the cause, but I suspect that the LOCK failure didn't cause the seqid on the server to get incremented like it should have been.
Looks like we might need this patch:

commit 39325bd03fc16d903f1e0f51104436d939899c8c
Author: J. Bruce Fields <bfields.edu>
Date:   Mon Nov 26 17:06:39 2007 -0500

    nfsd4: fix bad seqid on lock request incompatible with open mode

    The failure to return a stateowner from nfs4_preprocess_seqid_op()
    in the case where a lock request is of a type incompatible with an
    open (due to, e.g., an application attempting a write lock on a
    file open for read) means that fs/nfsd/nfs4xdr.c:ENCODE_SEQID_OP_TAIL()
    never bumps the seqid as it should. The client, attempting to close
    the file afterwards, then gets an (incorrect) bad sequence id error.
    Worse, this prevents the open file from ever being closed, so we
    leak state.

    Thanks to Benny Halevy and Trond Myklebust for analysis, and to
    Steven Wilton for the report and extensive data-gathering.

    Cc: Benny Halevy <bhalevy>
    Cc: Steven Wilton <steven.wilton.com.au>
    Cc: Trond Myklebust <trond.myklebust.no>
    Signed-off-by: J. Bruce Fields <bfields.edu>

I'll plan to add this to my test kernels in the near future.
Ok, the RHEL5 kernels on this page (jtltest.123) have this patch: http://people.redhat.com/jlayton/ ...if anyone here is able to test these on their NFS server, please report the results.
Server maui:

ja@maui ~ 8$ uname -a
Linux maui.jaa.org.uk 2.6.18-230.el5.jtltest.123 #1 SMP Fri Oct 29 ...

I have run with the above over the weekend and have not seen any "NFSD: preprocess_seqid_op: bad seqid" errors (with the client callback port closed), nor any other ill effects!

Looking very good - but I am a very light, home user. How long will it be before the fix is included in a "released" kernel? (I'm lazy and I am using things such as nvidia-graphics260.19.12-kmdl-2.6.18-194.17.4.el5-260.19.12-129.el5.x86_64 from ATrpms.net on the server, and currently have no "X".)

Many thanks
John
(In reply to comment #24) > Server maui > ja@maui ~ 8$ uname -a > Linux maui.jaa.org.uk 2.6.18-230.el5.jtltest.123 #1 SMP Fri Oct 29 ... > > I have run with above over the weekend and have not seen any > NFSD: preprocess_seqid_op: bad seqid errors(with client call back port closed) > nor any other ill effects ! > > Looking very good - but I am a very light, home user. > Thanks for testing it. You'll probably want to open that port back up. Otherwise you don't get one of the main benefits of NFSv4 (delegations). > How long will it be before the fix is included in a "released" kernel? 5.7 unless someone with a support contract wants to open a support case and agitate for earlier inclusion.
This bug hits us very hard too. We have a large fileserver with NFSv4, currently running kernel-xen-2.6.18-194.26.1.el5.x86_64, and more than 100 clients. Daily there are at least some clients which need a reboot after "NFSD: preprocess_seqid_op: bad seqid" errors. So a backport to 5.5 would be really nice.

Sincerely,
Klaus Steinberger
(In reply to comment #26) > This bug hits us too very hard, we have a large filerserver with NFSv4 and > currently kernel-xen-2.6.18-194.26.1.el5.x86_64 and more than 100 clients. > > There daily at least some clients which need a reboot after NFSD: > preprocess_seqid_op: bad seqid errors > > So a backport to 5.5 would be really nice. > The patch should apply to 5.5-ish kernels without too much trouble. Currently this is on track for inclusion in 5.7. If you want to see it earlier than that, then you'll need to open a support case, reference this bug and make a case for its inclusion into z-stream. It's a fairly small, targeted patch so it seems reasonable for inclusion there.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
hi Jeff, could the customer help to test it, since he holds the environment, or could you give me some advice on how to verify the bug? Thanks.
(In reply to comment #33) > hi Jeff, > could customer help to test it since he hold the environment or give me some > advices for how to verify the bug? thanks. You certainly can ask them to help verify it (see comment #24 where he already did). I suspect that we can also write a reproducer for this that opens a file read-only on the mount, tries to get a write lock (which should then fail) and then closes the file.
in kernel-2.6.18-245.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Source is checked:

[root@beaker rhel5]# git log --oneline 2.6.18-244.el5...2.6.18-245.el5 | grep "nfsd4: fix"
33b9e69 [fs] nfsd4: fix seqid on lock req incompat w/open mode

This test uses the tip from comment 35; flock() instead of fcntl() reproduces the bug. Details are listed below, reproduced on 2.6.18-126. This bug is caused by the server.

[[client]]
** mount:
10.66.86.226:/ on /mnt/test type nfs4 (rw,addr=10.66.86.226)
** steps:
cd /mnt/test
touch test
./flock test

[[server]]
** /etc/exports:
/test *(rw,no_root_squash,fsid=0)
** dmesg:
NFSD: preprocess_seqid_op: bad seqid (expected 3, got 4)
** Another message appears on the console that is not in dmesg (maybe in syslog).

flock.c used in the test:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		printf("usage: ./flock file\n");
		return 0;
	}

	/* open read-only... */
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 0;
	}

	/* ...then request an exclusive lock, which the NFSv4 client
	 * sends as a write LOCK incompatible with the open mode */
	if (flock(fd, LOCK_EX) == -1) {
		perror("flock");
		close(fd);
		return 0;
	}

	printf("test finish\n");
	close(fd);
	return 0;
}

In 2.6.18-262, this message doesn't appear in dmesg. With comment 24, this bug is verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html