Description of problem:

During the review of 2.6.9-89.35.EL kernel workflow, the following issue was seen...

x86_64 - kernel 2.6.9-89.35.EL smp
Recipe-448611
Test - /kernel/filesystems/nfs/connectathon/ - PANIC
[System Name: see next Comment for system hostname]

As the system is still in this state and yet to WATCHDOG, I am able to see the following connecting via console:

====================================================
lockd: couldn't shutdown host module!
lockd: couldn't shutdown host module!
[-- MARK -- Tue Sep 14 16:00:00 2010]
[-- MARK -- Tue Sep 14 16:05:00 2010]
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
[-- MARK -- Tue Sep 14 16:10:00 2010]
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, timed out
nfs: server netapp-nfs not responding, still trying
nfs: server netapp-nfs not responding, timed out
nfs4_reclaim_open_state: unhandled error -5. Zeroing state
nfs: server netapp-nfs OK
NFS: v4 raced in function nfs4_proc_file_open
general protection fault: 0000 [1] SMP
CPU 3
Modules linked in: nfs lockd nfs_acl lp md5 ipv6 parport_pc parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core cpufreq_powersave loop button battery ac uhci_hcd ehci_hcd snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd soundcore tg3 floppy sr_mod dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ahci libata sd_mod scsi_mod
Pid: 29994, comm: test1 Not tainted 2.6.9-89.35.ELsmp
RIP: 0010:[<ffffffffa02ce1c0>] <ffffffffa02ce1c0>{:nfs:put_nfs_open_context+66}
RSP: 0018:0000010123e09df8 EFLAGS: 00010246
RAX: dead4ead00000001 RBX: 00000100b6e269b4 RCX: ffffffff803f7668
RDX: 000001012766df40 RSI: 0000000000000246 RDI: 00000100b6e269b4
RBP: 000001012766df00 R08: ffffffff803f7668 R09: 00000101341b0d80
R10: 0000000100000000 R11: 0000ffff80412b20 R12: 000001012766df38
R13: 00000100b6e26908 R14: 0000000000000000 R15: 000001005533c048
FS: 0000002a95562de0(0000) GS:ffffffff80505900(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000005b731c CR3: 00000000bef44000 CR4: 00000000000006e0
Process test1 (pid: 29994, threadinfo 0000010123e08000, task 00000100057a37f0)
Stack: 00000101341b0d80 000001012766df00 00000100a903fec0 ffffffffa02de52b
       0000000000000000 0000000000000000 00000100b6e26908 00000100a903fec0
       0000010139cd5b40 ffffffffa02cc83c
Call Trace:<ffffffffa02de52b>{:nfs:nfs4_proc_file_open+238}
       <ffffffffa02cc83c>{:nfs:nfs_file_open+138}
       <ffffffff8017b86d>{__dentry_open+208}
       <ffffffff8017ba46>{filp_open+95}
       <ffffffff801f2af5>{strncpy_from_user+74}
       <ffffffff80158eb1>{audit_getname+133}
       <ffffffff8017bc35>{sys_open+57}
       <ffffffff80110442>{tracesys+209}
Code: 48 89 50 08 48 89 02 49 c7 44 24 08 00 02 20 00 48 c7 45 38
RIP <ffffffffa02ce1c0>{:nfs:put_nfs_open_context+66} RSP <0000010123e09df8>
<0>Kernel panic - not syncing: Oops
[-- MARK -- Tue Sep 14 16:15:00 2010]
[-- MARK -- Tue Sep 14 16:20:00 2010]
====================================================

Version-Release number of selected component (if applicable):

Actual results: See above trace.

Expected results: This test should pass.

Additional info: See next comment.

-pbunyan
From the code, trace and the log messages, it looks like the problem may be that alloc_nfs_open_context isn't returning nfs contexts that can be passed to put_nfs_open_context without oopsing.
I think I was able to reproduce this with a fault injection patch that poisons nfs_open_context struct after kmalloc'ing it and then pretending that the state wasn't found on the list... ctx->list is definitely not being initialized, AFAICT (not even upstream -- yipes!) Testing a fix for that now...
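To illustrate why an uninitialized ctx->list matters, here is a minimal userspace sketch in C. It is not the attached patch and not the kernel source: struct open_ctx, alloc_ctx and put_ctx are hypothetical stand-ins for nfs_open_context, alloc_nfs_open_context and put_nfs_open_context, and the list helpers are simplified copies of the kernel idiom. It assumes, as the trace and the discussion above suggest, that the put path unlinks the context from a list when the last reference is dropped. The 0x6b memset mimics the poisoning used in the fault injection test.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified userspace model of the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

/* Make the entry point at itself, i.e. an empty, self-contained list. */
static void INIT_LIST_HEAD(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Unlink an entry. This dereferences e->next and e->prev, so both
 * must be valid pointers or the unlink writes to a garbage address. */
static void list_del(struct list_head *e)
{
    e->next->prev = e->prev;
    e->prev->next = e->next;
}

/* Hypothetical stand-in for struct nfs_open_context. */
struct open_ctx {
    int refcount;
    struct list_head list;
};

/* Hypothetical stand-in for alloc_nfs_open_context(). */
static struct open_ctx *alloc_ctx(void)
{
    struct open_ctx *ctx = malloc(sizeof(*ctx));

    if (ctx == NULL)
        return NULL;
    /* Poison the allocation, as the fault injection test did, so that
     * any field left uninitialized holds an obviously bad value. */
    memset(ctx, 0x6b, sizeof(*ctx));
    ctx->refcount = 1;
    /* Without this line, list.next and list.prev would still be
     * 0x6b6b... and the list_del() in put_ctx() would fault. */
    INIT_LIST_HEAD(&ctx->list);
    return ctx;
}

/* Hypothetical stand-in for put_nfs_open_context().
 * Returns 1 if the context was freed, 0 if references remain. */
static int put_ctx(struct open_ctx *ctx)
{
    if (--ctx->refcount > 0)
        return 0;
    list_del(&ctx->list);   /* safe even if ctx was never added to a list */
    free(ctx);
    return 1;
}
```

With the list initialized at allocation time, a context that was never added to any list can still be handed to put_ctx(): the unlink just rewrites the entry's own pointers. Removing the INIT_LIST_HEAD() call in this model makes put_ctx() dereference the poison value instead.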
Created attachment 449480: proposed patch. This patch seems to fix my artificial reproducer for this. It looks like this is an upstream bug too, but the effects may be mitigated there, as I don't see where the upstream code passes a newly allocated ctx to put_nfs_open_context. Still, it's worth fixing there as well, so once I test this out on rawhide I'll send the patch upstream.
Patch sent upstream: http://marc.info/?l=linux-nfs&m=128535568718186&w=2 ..awaiting comment, but I don't expect it to be especially controversial. I'll queue up something similar for RHEL4, and will check out RHEL5 and 6 too to make sure they're not vulnerable to this issue.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 89.42.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
All, I retested /kernel/networking/ndnc and /kernel/filesystems/nfs/connectathon on hp-z400-02.lab.bos.redhat.com with kernel 2.6.9-90.EL: http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=177612 With kernel 2.6.9-90.EL installed, both tests ran five times on that host without issue. The results look good. Best, -pbunyan
Verified on kernel-2.6.9-96.EL:
https://beaker.engineering.redhat.com/jobs/45983
https://beaker.engineering.redhat.com/jobs/45985
https://beaker.engineering.redhat.com/jobs/45986
https://beaker.engineering.redhat.com/jobs/45988
https://beaker.engineering.redhat.com/jobs/45989
https://beaker.engineering.redhat.com/jobs/45990
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0263.html