Bug 595825

Summary: tmpfs: panic on 2.6.33.4-rt20
Product: Red Hat Enterprise MRG
Component: realtime-kernel
Version: beta
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: high
Reporter: Luis Claudio R. Goncalves <lgoncalv>
Assignee: Red Hat Real Time Maintenance <rt-maint>
QA Contact: David Sommerseth <davids>
CC: bhu, davids, jkacur, johnstul, lgoncalv, ovasik, tglx, williams
Doc Type: Bug Fix
Last Closed: 2010-09-15 10:13:54 UTC
Attachments:
  Really rough shot at fixing the select_parent locking
  Cleaned up version of the select_parent locking rework

Description Luis Claudio R. Goncalves 2010-05-25 17:42:04 UTC
Running the following test script, I have been able to trigger a kernel panic in VFS/tmpfs:

# cat ./stress_tmpfs.sh 
while true
do
	mount -t tmpfs tmpfs /mnt/tmp -o size=400m
	cp -a /home/lclaudio/WORK/sandbox/dbench /mnt/tmp/
	cd /mnt/tmp/dbench
	rm -rf NBSIMULD CLIENTS
	./dbench 4
	rm -rf NBSIMULD CLIENTS
	./dbench 10
	rm -rf NBSIMULD CLIENTS
	./dbench 10
	cd /root
	umount /mnt/tmp
done

Note that I used dbench, but I believe any other fs stress test, or even a kernel build, would do the trick. My first attempt triggered the panic in less than five minutes. The second test is taking longer...

Here is the panic message from the first test run:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff811050cc>] __dentry_lru_del_init+0x1b/0x42
PGD 79e41067 PUD 79496067 PMD 0 
Oops: 0002 [#1] PREEMPT SMP 
last sysfs file: /sys/class/misc/autofs/dev
CPU 0 
Pid: 7384, comm: dbench Not tainted 2.6.33.4-rt20.18.el5rt #1 LENOVO/8810A56
RIP: 0010:[<ffffffff811050cc>]  [<ffffffff811050cc>] __dentry_lru_del_init+0x1b/0x42
RSP: 0018:ffff88007b3c5dc8  EFLAGS: 00010246
RAX: ffff88007ceff2b0 RBX: ffff88007ceff230 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff81357091 RDI: ffff88007ceff230
RBP: ffff88007b3c5dc8 R08: ffff88007b3c4000 R09: ffff88007b3c5de8
R10: 0000000000000046 R11: ffff880077cc3ab8 R12: ffff88007ae22070
R13: ffff88006e23f058 R14: ffff88007ceff238 R15: 000000000002cb88
FS:  00007fb70dd2c6e0(0000) GS:ffff880001a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000007b8c8000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dbench (pid: 7384, threadinfo ffff88007b3c4000, task ffff880079db8480)
Stack:
 ffff88007b3c5de8 ffffffff81105390 ffff88007b3c5de8 ffff88007ceff230
<0> ffff88007b3c5e48 ffffffff81105c08 ffff88007adcc450 ffff88007beae800
<0> 000000017b3c5e18 ffffffff8116af38 ffff88007b3c5e38 ffff88007adcc450
Call Trace:
 [<ffffffff81105390>] dentry_lru_del_init+0x35/0x46
 [<ffffffff81105c08>] shrink_dcache_parent+0xc5/0x210
 [<ffffffff8116af38>] ? security_inode_permission+0x21/0x23
 [<ffffffff810fe718>] dentry_unhash+0x20/0x44
 [<ffffffff810feb60>] vfs_rmdir+0x64/0xdd
 [<ffffffff811012e5>] do_rmdir+0xcf/0x123
 [<ffffffff810fd8ac>] ? path_put+0x22/0x26
 [<ffffffff8108e447>] ? audit_syscall_entry+0x103/0x12f
 [<ffffffff8110137a>] sys_rmdir+0x16/0x18
 [<ffffffff81002d1b>] system_call_fastpath+0x16/0x1b
Code: 3e 64 00 48 c1 e0 06 48 03 05 a1 3e 64 00 c9 c3 55 48 89 e5 0f 1f 44 00 00 48 8d 87 80 00 00 00 48 8b 8f  
RIP  [<ffffffff811050cc>] __dentry_lru_del_init+0x1b/0x42
 RSP <ffff88007b3c5dc8>
CR2: 0000000000000008
---[ end trace 7769b161bc05abd5 ]---
Kernel panic - not syncing: Fatal exception
Pid: 7384, comm: dbench Tainted: G      D     2.6.33.4-rt20.18.el5rt #1
Call Trace:
 [<ffffffff810424ec>] panic+0xb4/0x170
 [<ffffffff81357030>] ? rt_spin_lock_slowunlock+0x0/0x61
 [<ffffffff81357d06>] ? retint_kernel+0x26/0x30
 [<ffffffff81357613>] ? rt_spin_unlock+0x15/0x17
 [<ffffffff81043513>] ? kmsg_dump+0x10f/0x11b
 [<ffffffff813588d8>] oops_end+0xaf/0xbf
 [<ffffffff81025b7f>] no_context+0x1fa/0x209
 [<ffffffff810327d4>] ? rcu_read_unlock+0xe/0x10
 [<ffffffff81025dd1>] __bad_area_nosemaphore+0x184/0x1a7
 [<ffffffff81025e38>] __bad_area+0x44/0x4d
 [<ffffffff81025e69>] bad_area+0x13/0x15
 [<ffffffff8135a36e>] do_page_fault+0x1a0/0x25d
 [<ffffffff81357e0f>] page_fault+0x1f/0x30
 [<ffffffff81357091>] ? rt_spin_lock_slowlock+0x0/0x184
 [<ffffffff811050cc>] ? __dentry_lru_del_init+0x1b/0x42
 [<ffffffff81357641>] ? rt_spin_lock+0x15/0x17
 [<ffffffff81105390>] dentry_lru_del_init+0x35/0x46
 [<ffffffff81105c08>] shrink_dcache_parent+0xc5/0x210
 [<ffffffff8116af38>] ? security_inode_permission+0x21/0x23
 [<ffffffff810fe718>] dentry_unhash+0x20/0x44
 [<ffffffff810feb60>] vfs_rmdir+0x64/0xdd
 [<ffffffff811012e5>] do_rmdir+0xcf/0x123
 [<ffffffff810fd8ac>] ? path_put+0x22/0x26
 [<ffffffff8108e447>] ? audit_syscall_entry+0x103/0x12f
 [<ffffffff8110137a>] sys_rmdir+0x16/0x18
 [<ffffffff81002d1b>] system_call_fastpath+0x16/0x1b

Comment 1 Luis Claudio R. Goncalves 2010-05-26 23:55:52 UTC
Just got a second panic, pretty similar:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff811050cc>] __dentry_lru_del_init+0x1b/0x42
PGD 7838c067 PUD 7806f067 PMD 0 
Oops: 0002 [#1] PREEMPT SMP 
last sysfs file: /sys/class/misc/autofs/dev
CPU 0 
Pid: 15108, comm: dbench Not tainted 2.6.33.4-rt20.18.el5rt #1 LENOVO/8810A56
RIP: 0010:[<ffffffff811050cc>]  [<ffffffff811050cc>] __dentry_lru_del_init+0x1b/0x42
RSP: 0018:ffff88007b243dc8  EFLAGS: 00010246
RAX: ffff88006bd67270 RBX: ffff88006bd671f0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff81357091 RDI: ffff88006bd671f0
RBP: ffff88007b243dc8 R08: 0000000000000000 R09: ffff88007b243de8
R10: ffff88007b243d38 R11: ffff88007b243e98 R12: ffff88006e57e738
R13: ffff88005cfd5c50 R14: ffff88006bd671f8 R15: 0000000000cce83e
FS:  00007f8d826e26e0(0000) GS:ffff880001a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000007c00f000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dbench (pid: 15108, threadinfo ffff88007b242000, task ffff880079ef2740)
Stack:
 ffff88007b243de8 ffffffff81105390 ffff88007b243de8 ffff88006bd671f0
<0> ffff88007b243e48 ffffffff81105c08 ffff88005b45bb18 ffff88007b860c00
<0> 000000017b243e18 ffffffff8116af38 ffff88007b243e38 ffff88005b45bb18
Call Trace:
 [<ffffffff81105390>] dentry_lru_del_init+0x35/0x46
 [<ffffffff81105c08>] shrink_dcache_parent+0xc5/0x210
 [<ffffffff8116af38>] ? security_inode_permission+0x21/0x23
 [<ffffffff810fe718>] dentry_unhash+0x20/0x44
 [<ffffffff810feb60>] vfs_rmdir+0x64/0xdd
 [<ffffffff811012e5>] do_rmdir+0xcf/0x123
 [<ffffffff810fd8ac>] ? path_put+0x22/0x26
 [<ffffffff8108e447>] ? audit_syscall_entry+0x103/0x12f
 [<ffffffff8110137a>] sys_rmdir+0x16/0x18
 [<ffffffff81002d1b>] system_call_fastpath+0x16/0x1b
Code: 3e 64 00 48 c1 e0 06 48 03 05 a1 3e 64 00 c9 c3 55 48 89 e5 0f 1f 44 00 00 48 8d 87 80 00 00 00 48 8b 8f  
RIP  [<ffffffff811050cc>] __dentry_lru_del_init+0x1b/0x42
 RSP <ffff88007b243dc8>
CR2: 0000000000000008
---[ end trace c4c7dd4f0fc3fe87 ]---
Kernel panic - not syncing: Fatal exception
Pid: 15108, comm: dbench Tainted: G      D     2.6.33.4-rt20.18.el5rt #1
Call Trace:
 [<ffffffff810424ec>] panic+0xb4/0x170
 [<ffffffff81357030>] ? rt_spin_lock_slowunlock+0x0/0x61
 [<ffffffff81357d06>] ? retint_kernel+0x26/0x30
 [<ffffffff81357613>] ? rt_spin_unlock+0x15/0x17
 [<ffffffff81043513>] ? kmsg_dump+0x10f/0x11b
 [<ffffffff813588d8>] oops_end+0xaf/0xbf
 [<ffffffff81025b7f>] no_context+0x1fa/0x209
 [<ffffffff81025dd1>] __bad_area_nosemaphore+0x184/0x1a7
 [<ffffffff81032831>] ? cpuacct_charge+0x5b/0x64
 [<ffffffff81025e38>] __bad_area+0x44/0x4d
 [<ffffffff81025e69>] bad_area+0x13/0x15
 [<ffffffff8135a36e>] do_page_fault+0x1a0/0x25d
 [<ffffffff81357e0f>] page_fault+0x1f/0x30
 [<ffffffff81357091>] ? rt_spin_lock_slowlock+0x0/0x184
 [<ffffffff811050cc>] ? __dentry_lru_del_init+0x1b/0x42
 [<ffffffff81357641>] ? rt_spin_lock+0x15/0x17
 [<ffffffff81105390>] dentry_lru_del_init+0x35/0x46
 [<ffffffff81105c08>] shrink_dcache_parent+0xc5/0x210
 [<ffffffff8116af38>] ? security_inode_permission+0x21/0x23
 [<ffffffff810fe718>] dentry_unhash+0x20/0x44
 [<ffffffff810feb60>] vfs_rmdir+0x64/0xdd
 [<ffffffff811012e5>] do_rmdir+0xcf/0x123
 [<ffffffff810fd8ac>] ? path_put+0x22/0x26
 [<ffffffff8108e447>] ? audit_syscall_entry+0x103/0x12f
 [<ffffffff8110137a>] sys_rmdir+0x16/0x18
 [<ffffffff81002d1b>] system_call_fastpath+0x16/0x1b

Comment 2 john stultz 2010-05-28 02:28:08 UTC
After only a few days, I finally managed to reproduce this!

I had to boot with mem=512M maxcpus=1 and use the old dbench-1.2 tarball.

Comment 3 Steven Rostedt 2010-05-28 13:41:11 UTC
I believe the bug is due to this patch:

commit 9e0b2cfd72583db7ad50e3e73958926fb65e0557
Author: John Stultz <johnstul.com>
Date:   Mon Feb 22 19:58:20 2010 -0800

    Revert d_count back to an atomic_t
    
    This patch reverts the portion of Nick's vfs scalability patch that
    converts the dentry d_count from an atomic_t to an int protected by
    the d_lock.
    
    This greatly improves vfs scalability with the -rt kernel, as
    the extra lock contention on the d_lock hurts very badly when
    CONFIG_PREEMPT_RT is enabled and the spinlocks become rtmutexes.
    
    Signed-off-by: John Stultz <johnstul.com>
    Signed-off-by: Thomas Gleixner <tglx>


Particularly this change:

 static inline struct dentry *__dget_dlock(struct dentry *dentry)
 {
-       dentry->d_count++;
+       atomic_inc(&dentry->d_count);
        return dentry;
 }

 static inline struct dentry *__dget(struct dentry *dentry)
 {
-       spin_lock(&dentry->d_lock);
        __dget_dlock(dentry);
-       spin_unlock(&dentry->d_lock);
        return dentry;
 }

This just removed all the protection from the readers of d_count.

Any reader that does:

  spin_lock(d_lock);
  if (dentry->d_count == X)
          do something
  spin_unlock(d_lock);

will lose the atomic nature between the if and the do something that the d_lock provided. If the __dget() is called between the if and do something, the d_count would have changed and the do something will now be doing something that it should not be doing!
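
Concretely, the race looks something like this (an illustrative interleaving, not actual kernel code; dentry_lru_del_init() stands in for the "do something"):

  CPU 0 (pruning path)                     CPU 1
  -------------------------------------    ------------------------------
  spin_lock(&dentry->d_lock);
  if (!atomic_read(&dentry->d_count)) {
                                           __dget(dentry);
                                           /* atomic_inc(&d_count) with
                                              no d_lock taken anymore */
          /* d_count is now 1, but we
             still treat the dentry as
             unused and prune it */
          dentry_lru_del_init(dentry);
  }
  spin_unlock(&dentry->d_lock);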

Comment 4 john stultz 2010-05-28 18:44:42 UTC
(In reply to comment #3)
> I believe the bug is due to this patch: 
>     Revert d_count back to an atomic_t
[snip]
> This just removed all the protection from the readers of d_count.
> 
> Any reader that does:
> 
>   spin_lock(d_lock);
>   if (dentry->d_count == X)
>           do something
>   spin_unlock(d_lock);
> 
> will lose the atomic nature between the if and the do something that the d_lock
> provided. If the __dget() is called between the if and do something, the
> d_count would have changed and the do something will now be doing something
> that it should not be doing!

Interesting. Hrmm... However, I'm not totally sure I yet see where that might cause problems here. I'll try to review things further, but unless I missed something subtle, the patch should be moving the logic back to how mainline manages d_count.

Comment 5 john stultz 2010-05-29 01:22:29 UTC
Even though I was a bit skeptical, I added the locking back around the dget increments, but still hit the same issue.

I do suspect the reference counting is the issue, and in this case I suspect we have a situation where we're freeing a dentry while it's still in use (which would explain the variance in the oops/BUG/GPF location).

There are a few spots where we do an atomic_dec without checking whether the count is already zero (shrink_dcache_for_umount_subtree, shrink_dcache_for_umount, d_genocide). I added BUG_ONs there to catch the dentry count already being zero when we decrement it, but none of them triggered.
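
The check was essentially this (a minimal sketch, not the exact debug hunk):

  /* trap a decrement of a d_count that is already zero */
  BUG_ON(atomic_read(&dentry->d_count) == 0);
  atomic_dec(&dentry->d_count);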

Next I'm going to audit the code to make sure we aren't forgetting to increment d_count somewhere.

Comment 6 john stultz 2010-05-29 01:33:30 UTC
Actually... __d_lookup_rcu() is suspicious. I don't see anything incrementing d_count on the dentry it returns.

Comment 7 john stultz 2010-06-02 00:45:20 UTC
Despite reports to the contrary, I was able to get the issue to reproduce with ftrace enabled. I'm currently working with some trace_printk()s to try to narrow down further what's happening.

Comment 8 john stultz 2010-06-03 22:53:37 UTC
So I think I have the issue figured out, but I'm still not quite sure what the right fix is.

From the traces, I can clearly see that the problem is select_parent() racing against dput()/d_kill(). So we're traversing a directory tree, and while doing so, someone else comes through and cleans up the dentry that we have a pointer to.

Now, the point of select_parent() is to walk a directory tree, find unused (i.e. d_count == 0) dentries, and move them to the unused list, so we shouldn't be surprised that dentries disappear, since they are unused.

However, in order for dput to kill the dentry, it needs to hold the lock on both the dentry being killed and its parent.

Since select_parent holds a lock on the parent dentry, this protects children from disappearing under us. 


But... The problem is when we descend and ascend the directory tree. When we descend, we drop the parent lock (since we're going down, the child becomes the new parent); then, in order to ascend, we have to grab the locks in the right order (parent -> child), so we drop the current dentry's lock and try to grab its parent's.

It's at this point that we're usually preempted, and dentries get killed.

Once we are running again and we grab the parent's lock (assuming the parent is still there; so far we've never hit the case where it isn't, but I can't see why it couldn't be killed before we get the lock), we use a stale reference to the now-child's .d_u.d_child.next pointer as the spot where we want to continue scanning the parent's d_subdirs list. And of course, this next pointer has been killed under us, so we'll soon trip over a NULL pointer.
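
In code, the problematic ascend step looks roughly like this (a simplified sketch, not the exact -rt source):

  /* ascend from "this_parent" back up to its parent */
  struct dentry *child = this_parent;

  this_parent = child->d_parent;
  spin_unlock(&child->d_lock);      /* child is now unprotected... */
  spin_lock(&this_parent->d_lock);  /* ...and can be dput()/d_kill()ed
                                       while we block on this lock */

  /* stale if child was freed while we waited for the parent's lock */
  next = child->d_u.d_child.next;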

So, the complications here:
1) This ascending lock-dropping dance is done under rcu_read_lock(), and I'm not totally sure of the reasons. It might be trying to avoid this very issue, but if so we are missing a synchronize_rcu() call (although I'm not sure that would work here).

2) dget/dput on the child won't quite work, because we end up holding the parent lock when we'd want to dput, which could deadlock, since dput will want to try to take the parent lock (see the sketch after this list).
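
To make complication 2 concrete (again a sketch, not real code), the deadlock would look like:

  spin_lock(&parent->d_lock);   /* held while scanning d_subdirs */
  ...
  dput(child);                  /* if this drops the last reference,
                                   the kill path takes parent->d_lock
                                   again -> deadlock */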


Right now, my thought is to simply not drop the parent locks when we descend into the tree, but that greatly complicates the exit path (we have to make sure we unlock all the way back up) and might lead to worse contention.

I also sent mail to Nick to see if he had any thoughts on this hole.

Comment 9 john stultz 2010-06-04 02:04:39 UTC
Created attachment 421091 [details]
Really rough shot at fixing the select_parent locking

Here's a rough first shot at fixing the select_parent locking to avoid this issue.

No promises that it's right, but it has run for an hour or so and I've not triggered the issue yet (it usually triggers inside of 30 minutes). I'll let it run overnight, and wanted to provide it here in case anyone else wants to try testing it as well.

Comment 10 john stultz 2010-06-04 19:00:32 UTC
So the testing ran overnight successfully. I'll run some different tests and then send a cleaned-up version of the patch to Thomas shortly.

Comment 11 john stultz 2010-06-04 19:43:20 UTC
Created attachment 421337 [details]
Cleaned up version of the select_parent locking rework

Here's the cleaned-up version of the patch that ran overnight without issues.

I'm still doing some further stress testing on it, but wanted to share it here so others could start testing with it.

Comment 12 Clark Williams 2010-06-04 20:12:39 UTC
*** Bug 592989 has been marked as a duplicate of this bug. ***

Comment 13 Luis Claudio R. Goncalves 2010-06-17 13:14:13 UTC
Even though the original bug has been fixed, running 2.6.33.5-rt23 with lockdep enabled we saw the following during boot:

=============================================
[ INFO: possible recursive locking detected ]
2.6.33.5-rt23.22.el5rtdebug #1
---------------------------------------------
init/1 is trying to acquire lock:
 (&dentry->d_lock/1){+.+...}, at: [<c054015b>] shrink_dcache_parent+0xe8/0x271

but task is already holding lock:
 (&dentry->d_lock/1){+.+...}, at: [<c054015b>] shrink_dcache_parent+0xe8/0x271

other info that might help us debug this:
2 locks held by init/1:
 #0:  (&dentry->d_lock){+.+...}, at: [<c0540120>] shrink_dcache_parent+0xad/0x271
 #1:  (&dentry->d_lock/1){+.+...}, at: [<c054015b>] shrink_dcache_parent+0xe8/0x271

stack backtrace:
Pid: 1, comm: init Not tainted 2.6.33.5-rt23.22.el5rtdebug #1
Call Trace:
 [<c048b68d>] __lock_acquire+0xb64/0xb70
 [<c048b759>] lock_acquire+0xc0/0xd4
 [<c054015b>] ? shrink_dcache_parent+0xe8/0x271
 [<c08184b1>] rt_spin_lock_nested+0x47/0x4e
 [<c054015b>] ? shrink_dcache_parent+0xe8/0x271
 [<c054015b>] shrink_dcache_parent+0xe8/0x271
 [<c05853a8>] proc_flush_task+0xa0/0x1e9
 [<c0453d45>] release_task+0x26/0x4a0
 [<c04548be>] wait_consider_task+0x6ff/0xc7e
 [<c08183a6>] ? __rt_spin_lock+0x12/0x14
 [<c0454f2c>] do_wait+0xef/0x2be
 [<c04551c1>] sys_wait4+0xc6/0xd9
 [<c0453664>] ? child_wait_callback+0x0/0x7c
 [<c0455200>] sys_waitpid+0x2c/0x2e
 [<c08193ac>] syscall_call+0x7/0xb
 [<c0810000>] ? threshold_create_device+0x41/0x304

Comment 14 John Kacur 2010-06-17 13:42:49 UTC
I have a similar lockdep splat:

=============================================
[ INFO: possible recursive locking detected ]
2.6.33.5-rt23-tip-debug #2
---------------------------------------------
init/1 is trying to acquire lock:
 (&dentry->d_lock/1){+.+...}, at: [<ffffffff811981be>] shrink_dcache_parent+0x10f/0x2eb

but task is already holding lock:
 (&dentry->d_lock/1){+.+...}, at: [<ffffffff811981be>] shrink_dcache_parent+0x10f/0x2eb

other info that might help us debug this:
2 locks held by init/1:
 #0:  (&dentry->d_lock){+.+...}, at: [<ffffffff8119818e>] shrink_dcache_parent+0xdf/0x2eb
 #1:  (&dentry->d_lock/1){+.+...}, at: [<ffffffff811981be>] shrink_dcache_parent+0x10f/0x2eb

stack backtrace:
Pid: 1, comm: init Not tainted 2.6.33.5-rt23-tip-debug #2
Call Trace:
 [<ffffffff810a68e2>] __lock_acquire+0xcb9/0xd35
 [<ffffffff811981be>] ? shrink_dcache_parent+0x10f/0x2eb
 [<ffffffff810a701c>] lock_acquire+0xd4/0xf1
 [<ffffffff811981be>] ? shrink_dcache_parent+0x10f/0x2eb
 [<ffffffff81521006>] rt_spin_lock_nested+0x3d/0x44
 [<ffffffff811981be>] ? shrink_dcache_parent+0x10f/0x2eb
 [<ffffffff81193456>] ? dentry_lru_del_init+0x3e/0xa8
 [<ffffffff8119818e>] ? shrink_dcache_parent+0xdf/0x2eb
 [<ffffffff811981be>] shrink_dcache_parent+0x10f/0x2eb
 [<ffffffff811f59a4>] proc_flush_task+0xd4/0x26f
 [<ffffffff81065ea7>] release_task+0x47/0x650
 [<ffffffff81066e6a>] wait_consider_task+0x9ba/0x10b4
 [<ffffffff810676e1>] do_wait+0x17d/0x38b
 [<ffffffff810679e5>] sys_wait4+0xf6/0x119
 [<ffffffff810037cc>] ? sysret_check+0x27/0x62
 [<ffffffff810652a4>] ? child_wait_callback+0x0/0xa5
 [<ffffffff8100379b>] system_call_fastpath+0x16/0x1b

Comment 15 john stultz 2010-06-30 00:46:02 UTC
I'm now testing an updated change suggested by Nick that should hopefully resolve the lockdep issue.

So far testing is going well; I'm going to leave it running overnight. Tomorrow, if it's OK, I'll submit the revert of the select_parent locking change, along with the new fix, to Thomas for testing.

Comment 16 John Kacur 2010-09-06 14:19:27 UTC
The lockdep splats are separate issues. In any case, as long as we cannot reproduce the kernel oops anymore, I suggest we close this, since the code base has changed considerably since this was reported (as a result of a lot of the problem code being reverted).