Bug 634632 - nfs4_reclaim_open_state: unhandled error -5. Zeroing state
Summary: nfs4_reclaim_open_state: unhandled error -5. Zeroing state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.9
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Jeff Layton
QA Contact: yanfu,wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-09-16 15:16 UTC by PaulB
Modified: 2018-11-14 14:33 UTC (History)
7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-02-16 16:04:57 UTC
Target Upstream Version:
Embargoed:


Attachments
proposed patch (1.03 KB, patch)
2010-09-24 18:24 UTC, Jeff Layton


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0263 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 4.9 kernel security and bug fix update 2011-02-16 15:14:55 UTC

Description PaulB 2010-09-16 15:16:09 UTC
Description of problem:
During review of the 2.6.9-89.35.EL kernel workflow,
the following issue was seen...

x86_64 - kernel 2.6.9-89.35.EL smp 
	Recipe-448611
	 Test - /kernel/filesystems/nfs/connectathon/ -  PANIC

  [System Name: see next Comment for system hostname]
   As the system is still in this state and yet to WATCHDOG,
   I am able to see the following connecting via console:
    ====================================================
    lockd: couldn't shutdown host module!
    lockd: couldn't shutdown host module!
    [-- MARK -- Tue Sep 14 16:00:00 2010]
    [-- MARK -- Tue Sep 14 16:05:00 2010]
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    [-- MARK -- Tue Sep 14 16:10:00 2010]
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, timed out
    nfs: server netapp-nfs not responding, still trying
    nfs: server netapp-nfs not responding, timed out
    nfs4_reclaim_open_state: unhandled error -5. Zeroing state
    nfs: server netapp-nfs OK
    NFS: v4 raced in function nfs4_proc_file_open
    general protection fault: 0000 [1] SMP
    CPU 3
    Modules linked in: nfs lockd nfs_acl lp md5 ipv6 parport_pc parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core   
    cpufreq_powersave loop button battery ac uhci_hcd ehci_hcd snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc   
    snd_hwdep snd soundcore tg3 floppy sr_mod dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ahci libata sd_mod scsi_mod
    Pid: 29994, comm: test1 Not tainted 2.6.9-89.35.ELsmp
    RIP: 0010:[<ffffffffa02ce1c0>] <ffffffffa02ce1c0>{:nfs:put_nfs_open_context+66}
    RSP: 0018:0000010123e09df8  EFLAGS: 00010246
    RAX: dead4ead00000001 RBX: 00000100b6e269b4 RCX: ffffffff803f7668
    RDX: 000001012766df40 RSI: 0000000000000246 RDI: 00000100b6e269b4
    RBP: 000001012766df00 R08: ffffffff803f7668 R09: 00000101341b0d80
    R10: 0000000100000000 R11: 0000ffff80412b20 R12: 000001012766df38
    R13: 00000100b6e26908 R14: 0000000000000000 R15: 000001005533c048
    FS:  0000002a95562de0(0000) GS:ffffffff80505900(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00000000005b731c CR3: 00000000bef44000 CR4: 00000000000006e0
    Process test1 (pid: 29994, threadinfo 0000010123e08000, task 00000100057a37f0)
    Stack: 00000101341b0d80 000001012766df00 00000100a903fec0 ffffffffa02de52b
       0000000000000000 0000000000000000 00000100b6e26908 00000100a903fec0
       0000010139cd5b40 ffffffffa02cc83c
    Call Trace:<ffffffffa02de52b>{:nfs:nfs4_proc_file_open+238} <ffffffffa02cc83c>{:nfs:nfs_file_open+138}
       <ffffffff8017b86d>{__dentry_open+208} <ffffffff8017ba46>{filp_open+95}
       <ffffffff801f2af5>{strncpy_from_user+74} <ffffffff80158eb1>{audit_getname+133}
       <ffffffff8017bc35>{sys_open+57} <ffffffff80110442>{tracesys+209}


    Code: 48 89 50 08 48 89 02 49 c7 44 24 08 00 02 20 00 48 c7 45 38
    RIP <ffffffffa02ce1c0>{:nfs:put_nfs_open_context+66} RSP <0000010123e09df8>
    <0>Kernel panic - not syncing: Oops
    [-- MARK -- Tue Sep 14 16:15:00 2010]
    [-- MARK -- Tue Sep 14 16:20:00 2010] 
    ====================================================
 

Version-Release number of selected component (if applicable):


  
Actual results:
 See above trace.

Expected results:
 This test should pass.

Additional info:
 See next comment.

-pbunyan

Comment 2 Jeff Layton 2010-09-16 15:23:56 UTC
From the code, trace and the log messages, it looks like the problem may be that alloc_nfs_open_context is returning nfs contexts that cannot safely be passed to put_nfs_open_context -- doing so oopses.

Comment 3 Jeff Layton 2010-09-24 16:49:05 UTC
I think I was able to reproduce this with a fault injection patch that poisons the nfs_open_context struct after kmalloc'ing it and then pretends that the state wasn't found on the list...

ctx->list is definitely not being initialized, AFAICT (not even upstream -- yipes!). Testing a fix for that now...

Comment 4 Jeff Layton 2010-09-24 18:24:04 UTC
Created attachment 449480 [details]
proposed patch

This patch seems to fix my artificial reproducer for this. It looks like this is an upstream bug too, but the effects may be mitigated there, as I don't see anywhere that the upstream code passes a newly allocated ctx to put_nfs_open_context.

Still, it's worth fixing there too so once I test this out on rawhide I'll send the patch there too.

Comment 5 Jeff Layton 2010-09-24 19:18:45 UTC
Patch sent upstream:

    http://marc.info/?l=linux-nfs&m=128535568718186&w=2

...awaiting comment, but I don't expect it to be especially controversial. I'll queue up something similar for RHEL4, and will check RHEL5 and 6 too to make sure they're not vulnerable to this issue.

Comment 8 RHEL Program Management 2010-09-27 20:28:51 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Vivek Goyal 2010-10-13 16:14:47 UTC
Committed in 89.42.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 13 PaulB 2010-11-01 14:11:53 UTC
All,

 Retesting /kernel/networking/ndnc and /kernel/filesystems/nfs/connectathon on hp-z400-02.lab.bos.redhat.com with kernel 2.6.9-90.EL:
 http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=177612

 Kernel 2.6.9-90.EL was installed, and /kernel/networking/ndnc and /kernel/filesystems/nfs/connectathon were run on hp-z400-02.lab.bos.redhat.com five times without issue. The results look good.

Best,
-pbunyan

Comment 15 errata-xmlrpc 2011-02-16 16:04:57 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0263.html

