Bug 746861

Summary: umount of RHEL 6.2 2.6.32-209.el6.x86_64 beta pNFS share can hang or cause Oops
Product: Red Hat Enterprise Linux 6
Reporter: Andy Adamson <andros>
Component: kernel
Assignee: Steve Dickson <steved>
Status: CLOSED ERRATA
QA Contact: Filesystem QE <fs-qe>
Severity: high
Priority: unspecified
Version: 6.2
CC: cward, eguan, kzhang, pbenas, rwheeler, steved
Target Milestone: rc
Keywords: OtherQA
Hardware: Unspecified
OS: Linux
Fixed In Version: kernel-2.6.32-214.el6
Doc Type: Bug Fix
Last Closed: 2011-12-06 14:18:03 UTC
Bug Blocks: 750914

Description Andy Adamson 2011-10-18 02:19:45 UTC
Description of problem:

The bug is in nfs4_deviceid_purge_client and is fixed by upstream commit 9e3bd4e24, which went into 3.0-rc5.

Pid: 2731, comm: umount.nfs Not tainted 2.6.32-209.el6.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffffa053bab8>]  [<ffffffffa053bab8>] nfs4_deviceid_purge_client+0xe8/0x170 [nfs]
RSP: 0018:ffff88006a243dc8  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88006a243e08 RCX: 0000000000000050
RDX: ffff880066584a50 RSI: ffffffffa00f0c70 RDI: 0000000000000282
RBP: ffffffff8100bc0e R08: ffff88006a243d10 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88006a243d78
R13: ffff88006a243d58 R14: 0000000000000282 R15: dead000000200200
FS:  00007fbc5093d700(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f1d300949d0 CR3: 0000000053bca000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process umount.nfs (pid: 2731, threadinfo ffff88006a242000, task ffff88006a37f500)
Stack:
ffff880066584a50 ffffffff81c00140 ffff88006a152400 ffff880069e7e000
<0> ffff880069e7e000 ffffffff81c00140 ffff88006a152400 ffff8800378ab9c0
<0> ffff88006a243e28 ffffffffa04fda3a ffffffff81c00140 ffff880069e7e000
Call Trace:
[<ffffffffa04fda3a>] ? nfs_free_client+0x9a/0x120 [nfs]
[<ffffffffa04fe04b>] ? nfs_put_client+0x7b/0xb0 [nfs]
[<ffffffffa04fe143>] ? nfs_free_server+0xc3/0x130 [nfs]
[<ffffffffa050b3a9>] ? nfs4_kill_super+0x49/0x90 [nfs]
[<ffffffff81179650>] ? deactivate_super+0x70/0x90
[<ffffffff811955cf>] ? mntput_no_expire+0xbf/0x110
[<ffffffff8119606b>] ? sys_umount+0x7b/0x3a0
[<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
Code: 00 00 00 4c 89 f8 c7 00 00 00 00 00 48 83 7d c0 00 74 70 e8 9b 18 b5 e0 49 8d 5c 24 48 eb 0e 0f 1f 40 00 49 8b 44 24 18 48 85 c0 <75> 26 48 83 7d c0 00 74 4f f0 ff 0b 0f 94 c0 84 c0 74 e5 49 8b
Call Trace:
[<ffffffffa053bad6>] ? nfs4_deviceid_purge_client+0x106/0x170 [nfs]
[<ffffffffa04fda3a>] ? nfs_free_client+0x9a/0x120 [nfs]
[<ffffffffa04fe04b>] ? nfs_put_client+0x7b/0xb0 [nfs]
[<ffffffffa04fe143>] ? nfs_free_server+0xc3/0x130 [nfs]
[<ffffffffa050b3a9>] ? nfs4_kill_super+0x49/0x90 [nfs]
[<ffffffff81179650>] ? deactivate_super+0x70/0x90
[<ffffffff811955cf>] ? mntput_no_expire+0xbf/0x110
[<ffffffff8119606b>] ? sys_umount+0x7b/0x3a0
[<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
IP: [<ffffffffa053bad3>] nfs4_deviceid_purge_client+0x103/0x170 [nfs]
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 1
Modules linked in: nfs_layout_nfsv41_files nfs lockd fscache nfs_acl auth_rpcgss nls_utf8 fuse ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc autofs4 sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun uinput ppdev parport_pc parport snd_ens1371 snd_rawmidi snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 microcode vmware_balloon sg i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom mptspi mptscsih mptbase scsi_transport_spi pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Got error -10052 from the server on DESTROY_SESSION. Session has been destroyed regardless...
Pid: 2731, comm: umount.nfs Not tainted 2.6.32-209.el6.x86_64 #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffffa053bad3>]  [<ffffffffa053bad3>] nfs4_deviceid_purge_client+0x103/0x170 [nfs]
RSP: 0018:ffff88006a243dc8  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880066584e08 RCX: 0000000000000050
RDX: ffff880066584a50 RSI: ffffffffa00f0c70 RDI: ffff880066584dc0
RBP: ffff88006a243e08 R08: ffff88006a243d10 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880066584dc0
R13: ffff880069e7e000 R14: ffffffffa054e5e0 R15: ffffffffa054e5c0
FS:  00007fbc5093d700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc6d59e7000 CR3: 0000000053bca000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process umount.nfs (pid: 2731, threadinfo ffff88006a242000, task ffff88006a37f500)
Stack:
ffff880066584a50 ffffffff81c00140 ffff88006a152400 ffff880069e7e000
<0> ffff880069e7e000 ffffffff81c00140 ffff88006a152400 ffff8800378ab9c0
<0> ffff88006a243e28 ffffffffa04fda3a ffffffff81c00140 ffff880069e7e000
Call Trace:
[<ffffffffa04fda3a>] nfs_free_client+0x9a/0x120 [nfs]
[<ffffffffa04fe04b>] nfs_put_client+0x7b/0xb0 [nfs]
[<ffffffffa04fe143>] nfs_free_server+0xc3/0x130 [nfs]
[<ffffffffa050b3a9>] nfs4_kill_super+0x49/0x90 [nfs]
[<ffffffff81179650>] deactivate_super+0x70/0x90
[<ffffffff811955cf>] mntput_no_expire+0xbf/0x110
[<ffffffff8119606b>] sys_umount+0x7b/0x3a0
[<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Code: 24 48 eb 0e 0f 1f 40 00 49 8b 44 24 18 48 85 c0 75 26 48 83 7d c0 00 74 4f f0 ff 0b 0f 94 c0 84 c0 74 e5 49 8b 44 24 20 4c 89 e7 <ff> 50 68 49 8b 44 24 18 48 85 c0 74 da 49 8b 54 24 10 48 85 d2
RIP  [<ffffffffa053bad3>] nfs4_deviceid_purge_client+0x103/0x170 [nfs]
RSP <ffff88006a243dc8>
CR2: 0000000000000068
---[ end trace 7afe685c8e44198a ]---
Kernel panic - not syncing: Fatal exception
Pid: 2731, comm: umount.nfs Tainted: G      D    ----------------   2.6.32-209.el6.x86_64 #1
Call Trace:
[<ffffffff814ebd7b>] ? panic+0x78/0x143
[<ffffffff814eff14>] ? oops_end+0xe4/0x100
[<ffffffff810422eb>] ? no_context+0xfb/0x260
[<ffffffff81042575>] ? __bad_area_nosemaphore+0x125/0x1e0
[<ffffffff8104269e>] ? bad_area+0x4e/0x60
[<ffffffff81042da3>] ? __do_page_fault+0x3c3/0x480
[<ffffffff814ed305>] ? schedule_timeout+0x215/0x2e0
[<ffffffff814eef5b>] ? _spin_unlock_bh+0x1b/0x20
[<ffffffff814f1ece>] ? do_page_fault+0x3e/0xa0
[<ffffffff814ef285>] ? page_fault+0x25/0x30
[<ffffffffa053bad3>] ? nfs4_deviceid_purge_client+0x103/0x170 [nfs]
[<ffffffffa053bad6>] ? nfs4_deviceid_purge_client+0x106/0x170 [nfs]
[<ffffffffa04fda3a>] ? nfs_free_client+0x9a/0x120 [nfs]
[<ffffffffa04fe04b>] ? nfs_put_client+0x7b/0xb0 [nfs]
[<ffffffffa04fe143>] ? nfs_free_server+0xc3/0x130 [nfs]
[<ffffffffa050b3a9>] ? nfs4_kill_super+0x49/0x90 [nfs]
[<ffffffff81179650>] ? deactivate_super+0x70/0x90
[<ffffffff811955cf>] ? mntput_no_expire+0xbf/0x110
[<ffffffff8119606b>] ? sys_umount+0x7b/0x3a0
[<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b



How reproducible:

Very.


Steps to Reproduce:
1. Run connectathon Special tests on a pNFS mount
2. umount
  
Actual results:

umount hangs or Oops


Expected results:

umount succeeds


Additional info:

Here is the broken code:

static void
_deviceid_purge_client(const struct nfs_client *clp, long hash)
{
  .......

       while (!hlist_empty(&tmp)) {
               if (atomic_dec_and_test(&d->ref))
                       d->ld->free_deviceid_node(d);
               hlist_del_init(&d->tmpnode);
       }
}


Here is the fixed code.

static void
_deviceid_purge_client(const struct nfs_client *clp, long hash)
{
       
........

       while (!hlist_empty(&tmp)) {
               d = hlist_entry(tmp.first, struct nfs4_deviceid_node, tmpnode);
               hlist_del(&d->tmpnode);
               if (atomic_dec_and_test(&d->ref))
                       d->ld->free_deviceid_node(d);
       }
}
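
The difference between the two loops can be modelled in userspace. In the broken loop, d is never reassigned inside the while, and the node is unlinked only after the callback that may free it. The fixed loop re-reads the list head on every pass and unlinks the node before dropping the reference. The sketch below is not the kernel code: struct node, struct list, list_push, free_node, drain and freed_count are hypothetical stand-ins for the hlist and atomic_t machinery, and a plain int replaces the atomic reference count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel types; all names are hypothetical. */
struct node {
    struct node *next;   /* plays the role of tmpnode */
    int ref;             /* plays the role of the atomic_t ref */
};

struct list {
    struct node *first;  /* plays the role of hlist_head.first */
};

static int freed_count;  /* number of times the "free" callback ran */

static void free_node(struct node *n)
{
    freed_count++;
    free(n);
}

/* Add a node with the given reference count at the head of the list. */
static void list_push(struct list *l, int ref)
{
    struct node *n = malloc(sizeof(*n));
    assert(n != NULL);
    n->ref = ref;
    n->next = l->first;
    l->first = n;
}

/*
 * Drain the list in the same order as the fixed loop:
 *   1. re-read the head on every iteration,
 *   2. unlink the node while it is still valid,
 *   3. only then drop the reference, which may free the node.
 * Returns the number of iterations, which equals the number of nodes.
 */
static int drain(struct list *l)
{
    int iterations = 0;

    while (l->first != NULL) {
        struct node *d = l->first;   /* like hlist_entry(tmp.first, ...) */
        l->first = d->next;          /* like hlist_del(&d->tmpnode) */
        if (--d->ref == 0)           /* like atomic_dec_and_test(&d->ref) */
            free_node(d);
        iterations++;
    }
    return iterations;
}
```

Because every pass removes exactly one node from the head, the loop is guaranteed to terminate, and no node is touched after its last reference has been dropped.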

Comment 2 RHEL Program Management 2011-10-18 18:10:55 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 4 Steve Dickson 2011-10-19 14:26:29 UTC
Posted patch:

From: Andy Adamson <andros>
Date: Wed, 19 Oct 2011 10:47:43 -0400
Subject: [RHEL6.2 PATCH 1/1] pNFS can hang or oops on umounts.

This fix is part of the upstream commit 9e3bd4e24 that
went into 3.0-rc5. The patch fixes an oops that can occur
after the connectathon special tests are run on a
pNFS mount and then a umount is done.

Signed-off-by: Steve Dickson <steved>
BZ: https://bugzilla.redhat.com/show_bug.cgi?id=746861
---
 fs/nfs/pnfs_dev.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/pnfs_dev.c b/fs/nfs/pnfs_dev.c
index bee94a3..005e82d 100644
--- a/fs/nfs/pnfs_dev.c
+++ b/fs/nfs/pnfs_dev.c
@@ -239,9 +239,10 @@ _deviceid_purge_client(const struct nfs_client *clp, long hash)
 
 	synchronize_rcu();
 	while (!hlist_empty(&tmp)) {
+		d = hlist_entry(tmp.first, struct nfs4_deviceid_node, tmpnode);
+		hlist_del(&d->tmpnode);
 		if (atomic_dec_and_test(&d->ref))
 			d->ld->free_deviceid_node(d);
-		hlist_del_init(&d->tmpnode);
 	}
 }

Comment 6 Eryu Guan 2011-10-24 08:30:54 UTC
Hi Andy,

Will NetApp verify the fix once a test kernel is available?

Thanks!
Eryu Guan

Comment 8 Steve Dickson 2011-10-24 15:53:54 UTC
(In reply to comment #6)
> Hi Andy,
> 
> Will NetApp verify the fix once a test kernel is available?

I just talked to Andy and he said this patch was verified
at this year's Bakathon (which happened last week).

Comment 9 Aristeu Rozanski 2011-10-26 19:46:25 UTC
Patch(es) available on kernel-2.6.32-214.el6

Comment 13 errata-xmlrpc 2011-12-06 14:18:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html