Bug 595825 - tmpfs: panic on 2.6.33.4-rt20

Product:      Red Hat Enterprise MRG
Component:    realtime-kernel
Version:      beta
Hardware:     All
OS:           Linux
Status:       CLOSED CURRENTRELEASE
Severity:     medium
Priority:     high
Reporter:     Luis Claudio R. Goncalves <lgoncalv>
Assignee:     Red Hat Real Time Maintenance <rt-maint>
QA Contact:   David Sommerseth <davids>
CC:           bhu, davids, jkacur, johnstul, lgoncalv, ovasik, tglx, williams
Doc Type:     Bug Fix
Last Closed:  2010-09-15 10:13:54 UTC
Description (Luis Claudio R. Goncalves, 2010-05-25 17:42:04 UTC)
Just got a second panic, pretty similar:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff811050cc>] __dentry_lru_del_init+0x1b/0x42
PGD 7838c067 PUD 7806f067 PMD 0
Oops: 0002 [#1] PREEMPT SMP
last sysfs file: /sys/class/misc/autofs/dev
CPU 0
Pid: 15108, comm: dbench Not tainted 2.6.33.4-rt20.18.el5rt #1 LENOVO/8810A56
RIP: 0010:[<ffffffff811050cc>]  [<ffffffff811050cc>] __dentry_lru_del_init+0x1b/0x42
RSP: 0018:ffff88007b243dc8  EFLAGS: 00010246
RAX: ffff88006bd67270 RBX: ffff88006bd671f0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff81357091 RDI: ffff88006bd671f0
RBP: ffff88007b243dc8 R08: 0000000000000000 R09: ffff88007b243de8
R10: ffff88007b243d38 R11: ffff88007b243e98 R12: ffff88006e57e738
R13: ffff88005cfd5c50 R14: ffff88006bd671f8 R15: 0000000000cce83e
FS:  00007f8d826e26e0(0000) GS:ffff880001a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000007c00f000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dbench (pid: 15108, threadinfo ffff88007b242000, task ffff880079ef2740)
Stack:
 ffff88007b243de8 ffffffff81105390 ffff88007b243de8 ffff88006bd671f0
<0> ffff88007b243e48 ffffffff81105c08 ffff88005b45bb18 ffff88007b860c00
<0> 000000017b243e18 ffffffff8116af38 ffff88007b243e38 ffff88005b45bb18
Call Trace:
 [<ffffffff81105390>] dentry_lru_del_init+0x35/0x46
 [<ffffffff81105c08>] shrink_dcache_parent+0xc5/0x210
 [<ffffffff8116af38>] ? security_inode_permission+0x21/0x23
 [<ffffffff810fe718>] dentry_unhash+0x20/0x44
 [<ffffffff810feb60>] vfs_rmdir+0x64/0xdd
 [<ffffffff811012e5>] do_rmdir+0xcf/0x123
 [<ffffffff810fd8ac>] ? path_put+0x22/0x26
 [<ffffffff8108e447>] ? audit_syscall_entry+0x103/0x12f
 [<ffffffff8110137a>] sys_rmdir+0x16/0x18
 [<ffffffff81002d1b>] system_call_fastpath+0x16/0x1b
Code: 3e 64 00 48 c1 e0 06 48 03 05 a1 3e 64 00 c9 c3 55 48 89 e5 0f 1f 44 00 00 48 8d 87 80 00 00 00 48 8b 8f
RIP  [<ffffffff811050cc>] __dentry_lru_del_init+0x1b/0x42
 RSP <ffff88007b243dc8>
CR2: 0000000000000008
---[ end trace c4c7dd4f0fc3fe87 ]---
Kernel panic - not syncing: Fatal exception
Pid: 15108, comm: dbench Tainted: G      D    2.6.33.4-rt20.18.el5rt #1
Call Trace:
 [<ffffffff810424ec>] panic+0xb4/0x170
 [<ffffffff81357030>] ? rt_spin_lock_slowunlock+0x0/0x61
 [<ffffffff81357d06>] ? retint_kernel+0x26/0x30
 [<ffffffff81357613>] ? rt_spin_unlock+0x15/0x17
 [<ffffffff81043513>] ? kmsg_dump+0x10f/0x11b
 [<ffffffff813588d8>] oops_end+0xaf/0xbf
 [<ffffffff81025b7f>] no_context+0x1fa/0x209
 [<ffffffff81025dd1>] __bad_area_nosemaphore+0x184/0x1a7
 [<ffffffff81032831>] ? cpuacct_charge+0x5b/0x64
 [<ffffffff81025e38>] __bad_area+0x44/0x4d
 [<ffffffff81025e69>] bad_area+0x13/0x15
 [<ffffffff8135a36e>] do_page_fault+0x1a0/0x25d
 [<ffffffff81357e0f>] page_fault+0x1f/0x30
 [<ffffffff81357091>] ? rt_spin_lock_slowlock+0x0/0x184
 [<ffffffff811050cc>] ? __dentry_lru_del_init+0x1b/0x42
 [<ffffffff81357641>] ? rt_spin_lock+0x15/0x17
 [<ffffffff81105390>] dentry_lru_del_init+0x35/0x46
 [<ffffffff81105c08>] shrink_dcache_parent+0xc5/0x210
 [<ffffffff8116af38>] ? security_inode_permission+0x21/0x23
 [<ffffffff810fe718>] dentry_unhash+0x20/0x44
 [<ffffffff810feb60>] vfs_rmdir+0x64/0xdd
 [<ffffffff811012e5>] do_rmdir+0xcf/0x123
 [<ffffffff810fd8ac>] ? path_put+0x22/0x26
 [<ffffffff8108e447>] ? audit_syscall_entry+0x103/0x12f
 [<ffffffff8110137a>] sys_rmdir+0x16/0x18
 [<ffffffff81002d1b>] system_call_fastpath+0x16/0x1b

After only a few days, I finally managed to reproduce this! Had to boot with mem=512M maxcpus=1 and use the old dbench-1.2 tarball.
I believe the bug is due to this patch:

commit 9e0b2cfd72583db7ad50e3e73958926fb65e0557
Author: John Stultz <johnstul.com>
Date:   Mon Feb 22 19:58:20 2010 -0800

    Revert d_count back to an atomic_t

    This patch reverts the portion of Nick's vfs scalability patch that
    converts the dentry d_count from an atomic_t to an int protected by
    the d_lock. This greatly improves vfs scalability with the -rt
    kernel, as the extra lock contention on the d_lock hurts very badly
    when CONFIG_PREEMPT_RT is enabled and the spinlocks become rtmutexes.

    Signed-off-by: John Stultz <johnstul.com>
    Signed-off-by: Thomas Gleixner <tglx>

Particularly this change:

 static inline struct dentry *__dget_dlock(struct dentry *dentry)
 {
-	dentry->d_count++;
+	atomic_inc(&dentry->d_count);
 	return dentry;
 }

 static inline struct dentry *__dget(struct dentry *dentry)
 {
-	spin_lock(&dentry->d_lock);
 	__dget_dlock(dentry);
-	spin_unlock(&dentry->d_lock);
 	return dentry;
 }

This just removed all the protection from the readers of d_count. Any reader that does:

	spin_lock(d_lock);
	if (dentry->d_count == X)
		do something
	spin_unlock(d_lock);

will lose the atomic nature between the "if" and the "do something" that the d_lock provided. If __dget() is called between the "if" and the "do something", d_count will have changed, and the "do something" will now be doing something that it should not be doing!

(In reply to comment #3)
> I believe the bug is due to this patch:
> Revert d_count back to an atomic_t
[snip]
> This just removed all the protection from the readers of d_count.
>
> Any reader that does:
>
>   spin_lock(d_lock);
>   if (dentry->d_count == X)
>     do something
>   spin_unlock(d_lock);
>
> will lose the atomic nature between the if and the do something that the
> d_lock provided. If the __dget() is called between the if and do something,
> the d_count would have changed and the do something will now be doing
> something that it should not be doing!

Interesting. Hrmm..
However, I'm not totally sure I see yet where that might cause problems here. I'll try to review things further, but unless I missed something subtle, the patch should be moving the logic back to how mainline manages d_count.

Even though I was a bit skeptical, I added the locking back around the dget increments, but still hit the same issue. I do suspect the reference counting is the issue, and in this case I suspect we're freeing a dentry while it's still in use (which causes the variance in the oops/bug/gpf location). There are a few spots where we do atomic_dec without checking whether it's zero (shrink_dcache_for_umount_subtree, shrink_dcache_for_umount, d_genocide), and I added BUG_ONs there to fire if the dentry count was already zero when we decrement it, but I did not see any trigger. Next I'm going to audit to make sure we aren't forgetting to increment d_count somewhere.

Actually... __d_lookup_rcu() is suspicious. I don't see anything incrementing the d_count on the dentry it returns.

Despite reports to the contrary, I was able to get the issue to reproduce with ftrace enabled. Currently working with some trace_printks to try to further narrow down what's happening.

So I think I have the issue figured out, but I'm still not quite sure what the right fix is. From the traces, I can clearly see that the problem is select_parent() racing against dput()/d_kill(). We're traversing a directory tree, and while doing so, someone else comes through and cleans up a dentry that we have a pointer to. Now, the point of select_parent() is to go through a directory tree, find unused (i.e. d_count == 0) dentries, and move them to the unused list, so we shouldn't be surprised that the dentries disappear, since they are unused. However, in order for dput() to kill a dentry, it needs to hold the lock on both the dentry being killed and its parent. Since select_parent() holds a lock on the parent dentry, this protects children from disappearing under us. But...

The problem is when we descend and ascend the directory tree. When we descend, we drop the parent lock (since we're going down, the child becomes the new parent). Then, in order to ascend, we have to grab the locks in the right order (parent -> child), so we drop the current dentry's lock and try to grab its parent's. It's at this point that we're usually preempted and dentries get killed. Once we are running again and have grabbed the parent's lock (if the parent is still there; so far we've never hit the case where it isn't, but I can't see why it couldn't be killed before we get the lock), we use a stale reference to the now-child's .d_u.d_child.next pointer as the spot where we want to continue scanning the parent's d_subdirs list. And of course, that next pointer has been killed under us, so we'll soon trip over a NULL pointer.

So, the complications here:

1) This ascending lock-dropping dance is done under rcu_read_lock(), and I'm not totally sure of the reasons. It might be trying to avoid this very issue, but we are missing a sync_rcu call (although I'm not sure that would work here).

2) dget()/dput() on the child won't quite work, because we end up holding the parent lock when we'd want to dput(), which could deadlock (since dput() will try to grab the parent lock).

Right now, my thought is to simply not drop the parent locks when we descend into the tree, but that greatly complicates the exit path (we have to make sure we unlock all the way up) and might lead to worse contention. I also sent mail to Nick to see if he had any thoughts on this hole.

Created attachment 421091 [details]
Really rough shot at fixing the select_parent locking
Here's a rough first shot at fixing the select_parent locking to avoid this issue.
No promises it's right, but it has run for an hour or so and I've not triggered the issue yet (it usually triggered inside of 30 minutes). I'll be letting it run overnight, and wanted to provide it here in case anyone else wants to try testing it as well.
So the testing ran overnight successfully. I'll be running some different tests and then sending a cleaned-up version of the patch to Thomas shortly.

Created attachment 421337 [details]
Cleaned up version of the select_parent locking rework
Here's the cleaned up version of the patch that ran overnight without issues.
I'm still doing some further stress testing on it, but wanted to share it here so others could start testing with it.
*** Bug 592989 has been marked as a duplicate of this bug. ***

Even though the original bug has been fixed, running 2.6.33.5-rt23 with lockdep enabled we saw the following during boot:

=============================================
[ INFO: possible recursive locking detected ]
2.6.33.5-rt23.22.el5rtdebug #1
---------------------------------------------
init/1 is trying to acquire lock:
 (&dentry->d_lock/1){+.+...}, at: [<c054015b>] shrink_dcache_parent+0xe8/0x271
but task is already holding lock:
 (&dentry->d_lock/1){+.+...}, at: [<c054015b>] shrink_dcache_parent+0xe8/0x271

other info that might help us debug this:
2 locks held by init/1:
 #0:  (&dentry->d_lock){+.+...}, at: [<c0540120>] shrink_dcache_parent+0xad/0x271
 #1:  (&dentry->d_lock/1){+.+...}, at: [<c054015b>] shrink_dcache_parent+0xe8/0x271

stack backtrace:
Pid: 1, comm: init Not tainted 2.6.33.5-rt23.22.el5rtdebug #1
Call Trace:
 [<c048b68d>] __lock_acquire+0xb64/0xb70
 [<c048b759>] lock_acquire+0xc0/0xd4
 [<c054015b>] ? shrink_dcache_parent+0xe8/0x271
 [<c08184b1>] rt_spin_lock_nested+0x47/0x4e
 [<c054015b>] ? shrink_dcache_parent+0xe8/0x271
 [<c054015b>] shrink_dcache_parent+0xe8/0x271
 [<c05853a8>] proc_flush_task+0xa0/0x1e9
 [<c0453d45>] release_task+0x26/0x4a0
 [<c04548be>] wait_consider_task+0x6ff/0xc7e
 [<c08183a6>] ? __rt_spin_lock+0x12/0x14
 [<c0454f2c>] do_wait+0xef/0x2be
 [<c04551c1>] sys_wait4+0xc6/0xd9
 [<c0453664>] ? child_wait_callback+0x0/0x7c
 [<c0455200>] sys_waitpid+0x2c/0x2e
 [<c08193ac>] syscall_call+0x7/0xb
 [<c0810000>] ? threshold_create_device+0x41/0x304

I have a similar lockdep splat:

=============================================
[ INFO: possible recursive locking detected ]
2.6.33.5-rt23-tip-debug #2
---------------------------------------------
init/1 is trying to acquire lock:
 (&dentry->d_lock/1){+.+...}, at: [<ffffffff811981be>] shrink_dcache_parent+0x10f/0x2eb
but task is already holding lock:
 (&dentry->d_lock/1){+.+...}, at: [<ffffffff811981be>] shrink_dcache_parent+0x10f/0x2eb

other info that might help us debug this:
2 locks held by init/1:
 #0:  (&dentry->d_lock){+.+...}, at: [<ffffffff8119818e>] shrink_dcache_parent+0xdf/0x2eb
 #1:  (&dentry->d_lock/1){+.+...}, at: [<ffffffff811981be>] shrink_dcache_parent+0x10f/0x2eb

stack backtrace:
Pid: 1, comm: init Not tainted 2.6.33.5-rt23-tip-debug #2
Call Trace:
 [<ffffffff810a68e2>] __lock_acquire+0xcb9/0xd35
 [<ffffffff811981be>] ? shrink_dcache_parent+0x10f/0x2eb
 [<ffffffff810a701c>] lock_acquire+0xd4/0xf1
 [<ffffffff811981be>] ? shrink_dcache_parent+0x10f/0x2eb
 [<ffffffff81521006>] rt_spin_lock_nested+0x3d/0x44
 [<ffffffff811981be>] ? shrink_dcache_parent+0x10f/0x2eb
 [<ffffffff81193456>] ? dentry_lru_del_init+0x3e/0xa8
 [<ffffffff8119818e>] ? shrink_dcache_parent+0xdf/0x2eb
 [<ffffffff811981be>] shrink_dcache_parent+0x10f/0x2eb
 [<ffffffff811f59a4>] proc_flush_task+0xd4/0x26f
 [<ffffffff81065ea7>] release_task+0x47/0x650
 [<ffffffff81066e6a>] wait_consider_task+0x9ba/0x10b4
 [<ffffffff810676e1>] do_wait+0x17d/0x38b
 [<ffffffff810679e5>] sys_wait4+0xf6/0x119
 [<ffffffff810037cc>] ? sysret_check+0x27/0x62
 [<ffffffff810652a4>] ? child_wait_callback+0x0/0xa5
 [<ffffffff8100379b>] system_call_fastpath+0x16/0x1b

I'm now testing an updated change suggested by Nick that should hopefully resolve the lockdep issue. So far testing is going well; I'm going to leave it running overnight. Tomorrow, if it's OK, I'll submit the revert of the select_parent locking change and the new fix to Thomas for testing.

The lockdep splats are separate issues.
In any case, as long as we cannot reproduce the kernel oops anymore, I suggest we close this, since the code base has changed considerably since this was reported (as a result of a lot of the problem code being reverted).