Bug 732748

Summary: nfs4_reclaim_open_state: Lock reclaim failed!
Product: [Fedora] Fedora
Component: kernel
Version: 16
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Reporter: Michael J. Chudobiak <mjc>
Assignee: Steve Dickson <steved>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: aquini, bfields, gansalmon, itamar, jgarzik, jiali, jonathan, kernel-maint, madhu.chinakonda, thomas.jarosch
Fixed In Version: kernel-2.6.43.5-2.fc15
Doc Type: Bug Fix
Last Closed: 2012-05-13 01:50:27 UTC

Description Michael J. Chudobiak 2011-08-23 13:29:47 UTC
kernel-2.6.40-4.fc15.x86_64

When I have more than one F15 system running Firefox on gnome-shell, I experience desktop freezes across various systems. For example, when I run one F15/gnome-shell desktop, everything is OK. If a second worker starts their computer, my desktop may freeze when they start using gnome-shell + Firefox.

All home folders are nfsv4 mounted. Authentication is via nis.

I don't see much useful log output, but I did catch the snippet below, which seems to confirm my theory that Something Bad is happening to the nfs system.

It is reminiscent of bug 517629, but this kernel seems to have the patch that closed that bug.

- Mike

Aug 22 15:16:45 xena kernel: [ 9294.039202] nfs4_reclaim_open_state: Lock reclaim failed!
Aug 22 15:16:55 xena kernel: [ 9303.364075] nfs4_reclaim_open_state: Lock reclaim failed!
Aug 22 15:16:55 xena kernel: [ 9303.394724] nfs4_reclaim_open_state: Lock reclaim failed!
Aug 22 15:16:55 xena kernel: [ 9303.579710] nfs4_reclaim_open_state: Lock reclaim failed!
Aug 22 15:16:55 xena kernel: [ 9303.608663] nfs4_reclaim_open_state: Lock reclaim failed!
Aug 22 15:17:05 xena kernel: [ 9313.465224] nfs4_reclaim_open_state: Lock reclaim failed!
Aug 22 15:17:05 xena kernel: [ 9313.466475] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8801539f1420!
Aug 22 15:17:05 xena kernel: [ 9313.467178] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
Aug 22 15:17:05 xena kernel: [ 9313.467800] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8801129c3e20!
Aug 22 15:17:05 xena kernel: [ 9313.468429] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
Aug 22 15:17:05 xena kernel: [ 9313.469037] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8801129c2420!
Aug 22 15:17:05 xena kernel: [ 9313.469663] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
Aug 22 15:17:05 xena kernel: [ 9313.470814] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
Aug 22 15:17:05 xena kernel: [ 9313.471429] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
Aug 22 15:17:05 xena kernel: [ 9313.472077] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
Aug 22 15:17:05 xena kernel: [ 9313.472706] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
Aug 22 15:17:05 xena kernel: [ 9313.473337] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
Aug 22 15:17:05 xena kernel: [ 9313.473961] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
.... last line repeated many times ...
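
For readers triaging a similar log, here is a minimal sketch that tallies the two message types above and decodes the numeric status. The mapping of 10026 to NFS4ERR_BAD_SEQID comes from the NFSv4 protocol's status codes (RFC 3530), which matches the "bad sequence-id" text in the log. The function name `summarize` and the shortened `SAMPLE_LOG` are illustrative assumptions, not part of the original report.

```python
import re
from collections import Counter

# Small subset of NFSv4 status codes (RFC 3530); extend as needed.
NFS4_ERRORS = {
    10025: "NFS4ERR_BAD_STATEID",
    10026: "NFS4ERR_BAD_SEQID",
}

# Three lines copied from the report above, used as sample input.
SAMPLE_LOG = """\
Aug 22 15:16:45 xena kernel: [ 9294.039202] nfs4_reclaim_open_state: Lock reclaim failed!
Aug 22 15:17:05 xena kernel: [ 9313.466475] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ffff8801539f1420!
Aug 22 15:17:05 xena kernel: [ 9313.467178] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
"""

UNHANDLED = re.compile(r"unhandled error -(\d+)")


def summarize(log_text):
    """Count lock-reclaim failures and unhandled errors by symbolic name."""
    counts = Counter()
    for line in log_text.splitlines():
        if "Lock reclaim failed" in line:
            counts["lock reclaim failed"] += 1
        match = UNHANDLED.search(line)
        if match:
            code = int(match.group(1))
            counts[NFS4_ERRORS.get(code, f"unknown error {code}")] += 1
    return counts


summary = summarize(SAMPLE_LOG)
for name, count in summary.items():
    print(f"{count:6d}  {name}")
```

To run it against a real log, pass the contents of the log file to `summarize` instead of `SAMPLE_LOG`.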

Comment 1 Michael J. Chudobiak 2011-08-24 19:37:48 UTC
This was so problematic for me that I switched to glusterfs.

Everything works fine now. Better than fine, really. glusterfs rocks.

Something is badly broken in F15 + gnome-shell + firefox + nfsv4, but I'm no longer able to help debug it.

- Mike

Comment 2 Jeff Garzik 2012-05-05 01:47:00 UTC
Bumping to Fedora 16...  running into the same problem.  My setup:

NFSv4 server:
     Fedora 16, kernel 3.3.2-6.fc16.x86_64
     Filesystem ext4 on top of md (RAID 1)
     /etc/exports:
          /g  10.10.20.0/24(rw,no_root_squash,fsid=0)

NFSv4 client:
     Fedora 16.  Home directories are mounted via NFSv4.

     Known working kernel: kernel-3.2.9-2.fc16.x86_64

     Known broken kernels:
          kernel-3.3.2-6.fc16.x86_64
          kernel-3.3.4-1.fc16.x86_64

     Mount entry:
10.10.20.1:/ on /g type nfs4 (rw,relatime,vers=4,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=0.0.0.0,minorversion=0,local_lock=none,addr=10.10.20.1)

Symptoms:

     On first access of a home dir, the following message is logged
     more than 1200 times per second!

          May  4 14:50:26 bd kernel: [  761.391786] nfs4_reclaim_open_state: Lock reclaim failed!
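
A rough way to measure a rate like this is to bucket the messages by their syslog timestamp, which has one-second resolution. The sketch below assumes the syslog line format shown above. Only the first sample line is from the report; the other two are made-up lines added so the example has more than one bucket, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches "May  4 14:50:26 bd kernel: ... Lock reclaim failed!" and
# captures the syslog timestamp (month, day, time).
LINE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:.*Lock reclaim failed!"
)


def reclaim_failures_per_second(lines):
    """Return a Counter mapping each syslog second to its message count."""
    per_second = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            # Collapse the padded day ("May  4") to a single space.
            per_second[" ".join(match.group(1).split())] += 1
    return per_second


sample = [
    # From the report:
    "May  4 14:50:26 bd kernel: [  761.391786] nfs4_reclaim_open_state: Lock reclaim failed!",
    # Illustrative lines, NOT from the report:
    "May  4 14:50:26 bd kernel: [  761.392500] nfs4_reclaim_open_state: Lock reclaim failed!",
    "May  4 14:50:27 bd kernel: [  762.000100] nfs4_reclaim_open_state: Lock reclaim failed!",
]

counts = reclaim_failures_per_second(sample)
peak = max(counts.values())
print(f"peak rate: {peak} messages in one second")
```

Feeding it the lines of a real log file (for example, by iterating over an open file object) gives the per-second counts for an actual incident.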

Comment 3 Jeff Garzik 2012-05-05 01:52:11 UTC
Here is a related and probably useful upstream commit:

commit 96dcadc2fdd111dca90d559f189a30c65394451a
Author: William Dauchy <wdauchy>
Date:   Wed Mar 14 12:32:04 2012 +0100

    NFSv4: Rate limit the state manager for lock reclaim warning messages
    
    Adding rate limit on `Lock reclaim failed` messages since it could fill
    up system logs
    Signed-off-by: William Dauchy <wdauchy>
    Signed-off-by: Trond Myklebust <Trond.Myklebust>
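
To illustrate what rate limiting a log message means, here is a user-space sketch of an interval/burst limiter. It is not the kernel code from the commit; the class and its names are hypothetical. The defaults of 10 messages per 5 seconds are chosen to resemble the kernel's default printk rate limit, as far as I know.

```python
class RateLimiter:
    """Allow at most `burst` events per `interval` seconds; count the rest."""

    def __init__(self, interval=5.0, burst=10):
        self.interval = interval
        self.burst = burst
        self.window_start = None   # start time of the current window
        self.emitted = 0           # events allowed in the current window
        self.suppressed = 0        # total events dropped so far

    def allow(self, now):
        """Return True if an event at time `now` (seconds) may be logged."""
        if self.window_start is None or now - self.window_start >= self.interval:
            # Start a fresh window and reset the per-window budget.
            self.window_start = now
            self.emitted = 0
        if self.emitted < self.burst:
            self.emitted += 1
            return True
        self.suppressed += 1
        return False


# Simulate 1200 messages arriving within one second.
limiter = RateLimiter()
allowed = sum(limiter.allow(i / 1200.0) for i in range(1200))
print(f"logged {allowed}, suppressed {limiter.suppressed}")

# Once the 5-second window has passed, messages are let through again.
allowed_later = limiter.allow(5.0)
```

A limiter like this only keeps the log readable. It does not stop the client from retrying the failed reclaim, so the underlying fault still needs its own fix.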

Comment 4 Thomas Jarosch 2012-05-07 13:53:57 UTC
Same issue on Fedora 15 after kernel upgrade via "updates":

"Broken" kernel: 2.6.43.2-6.fc15.x86_64
Working kernel: 2.6.42.12-1.fc15.x86_64

Server runs: 2.6.32-220.4.2.el6.centos.plus.x86_64

My home directory is on an NFSv4 share and kmail stalls on creation of new messages. Booting the old kernel works instantly.

I'll try to reboot the NFS server in the evening; maybe it's a "protocol" upgrade incompatibility.

Comment 5 J. Bruce Fields 2012-05-07 14:33:32 UTC
From discussion around

  http://mid.gmane.org/1334770614-10653-1-git-send-email-Trond.Myklebust@netapp.com

it looks like we probably need 55725513b5e "NFSv4: Ensure that we check lock exclusive/shared type against open modes" and 05ffe24f529 "NFSv4: Ensure that the LOCK code sets exception->inode".
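
Going only by the title of the first commit, the rule being enforced appears to be the usual POSIX one: a shared (read) lock needs a file opened for reading, and an exclusive (write) lock needs a file opened for writing. The sketch below models that rule in isolation. All names are hypothetical and it is not derived from the kernel source.

```python
from enum import Enum, Flag, auto


class OpenMode(Flag):
    READ = auto()
    WRITE = auto()


class LockType(Enum):
    SHARED = "shared"        # read lock
    EXCLUSIVE = "exclusive"  # write lock


def lock_allowed(lock_type, open_mode):
    """Return True if `lock_type` is compatible with how the file was opened."""
    if lock_type is LockType.SHARED:
        return bool(open_mode & OpenMode.READ)
    if lock_type is LockType.EXCLUSIVE:
        return bool(open_mode & OpenMode.WRITE)
    raise ValueError(f"unknown lock type: {lock_type!r}")


# An exclusive lock on a file opened read-only should be refused up front
# instead of being sent to the server.
print(lock_allowed(LockType.EXCLUSIVE, OpenMode.READ))
print(lock_allowed(LockType.EXCLUSIVE, OpenMode.READ | OpenMode.WRITE))
```

Rejecting an incompatible request on the client would mean such a lock never reaches the server, where it could otherwise fail later during reclaim.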

Comment 6 Josh Boyer 2012-05-07 14:57:52 UTC
(In reply to comment #5)
> From discussion around
> 
> http://mid.gmane.org/1334770614-10653-1-git-send-email-Trond.Myklebust@netapp.com
> 
> it looks like we probably need 55725513b5e "NFSv4: Ensure that we check lock
> exclusive/shared type against open modes" and 05ffe24f529 "NFSv4: Ensure that
> the LOCK code sets exception->inode".

Those are already applied on F16 and F17.  They're part of the 3.3.5 stable queue that should be released today.  F15 will pick them up at that point.

We'll grab the rate limit patch today too, though it looks to be mostly a paper-over "fix" if anything.

Comment 7 Josh Boyer 2012-05-07 18:00:32 UTC
Patch has been applied.

Comment 8 Fedora Update System 2012-05-08 20:57:19 UTC
kernel-2.6.43.5-2.fc15 has been submitted as an update for Fedora 15.
https://admin.fedoraproject.org/updates/kernel-2.6.43.5-2.fc15

Comment 9 Fedora Update System 2012-05-08 21:09:19 UTC
kernel-3.3.5-2.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.3.5-2.fc16

Comment 10 Fedora Update System 2012-05-10 14:30:20 UTC
Package kernel-3.3.5-2.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.3.5-2.fc16'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-7538/kernel-3.3.5-2.fc16
then log in and leave karma (feedback).

Comment 11 Fedora Update System 2012-05-13 01:50:27 UTC
kernel-3.3.5-2.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 12 Fedora Update System 2012-05-15 23:23:35 UTC
kernel-2.6.43.5-2.fc15 has been pushed to the Fedora 15 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 13 Thomas Jarosch 2012-05-29 13:12:22 UTC
Thanks for the fast resolution of this issue! It's highly appreciated.
Everything back to normal with the latest updates.