Bug 856243 - kernel-rt-debug potential deadlock

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | John Kacur <jkacur> |
| Component: | realtime-kernel | Assignee: | John Kacur <jkacur> |
| Status: | CLOSED ERRATA | QA Contact: | David Sommerseth <davids> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | 2.2 | CC: | bhu, jkastner, lgoncalv, ovasik, williams |
| Target Milestone: | 2.2.2 | Target Release: | --- |
| Hardware: | Unspecified | OS: | Linux |
| Doc Type: | Bug Fix | Type: | Bug |
| Story Points: | --- | Regression: | --- |
| Last Closed: | 2012-12-04 19:59:07 UTC | | |

Doc Text:
- Cause: nesting of the local and remote CPU locks.
- Consequence: a lockdep warning about potentially recursive locking dependencies.
- Fix: acquire only the remote CPU lock, which is sufficient protection.
- Result: the lockdep warning no longer occurs.
Description (John Kacur, 2012-09-11 14:47:54 UTC)
This was found on kernel-rt-debug, based on 3.2.23-rt37.
The issue is reported in the console log:
[ 0.708568] =============================================
[ 0.708568] [ INFO: possible recursive locking detected ]
[ 0.708568] 3.2.23-rt37.56.el6rt.x86_64.debug #1
[ 0.708568] ---------------------------------------------
[ 0.708568] swapper/0/1 is trying to acquire lock:
[ 0.708568] (&per_cpu(slab_lock, __cpu).lock){+.+...}, at: [<ffffffff81169bc0>] lock_slab_on+0xb0/0x190
[ 0.708568]
[ 0.708568] but task is already holding lock:
[ 0.708568] (&per_cpu(slab_lock, __cpu).lock){+.+...}, at: [<ffffffff81169b70>] lock_slab_on+0x60/0x190
[ 0.708568]
[ 0.708568] other info that might help us debug this:
[ 0.708568] Possible unsafe locking scenario:
[ 0.708568]
[ 0.708568] CPU0
[ 0.708568] ----
[ 0.708568] lock(&per_cpu(slab_lock, __cpu).lock);
[ 0.708568] lock(&per_cpu(slab_lock, __cpu).lock);
[ 0.708568]
[ 0.708568] *** DEADLOCK ***
[ 0.708568]
[ 0.708568] May be due to missing lock nesting notation
[ 0.708568]
[ 0.708568] 2 locks held by swapper/0/1:
[ 0.708568] #0: (cache_chain_mutex){+.+.+.}, at: [<ffffffff8116d5c1>] kmem_cache_create+0x521/0x5e0
[ 0.708568] #1: (&per_cpu(slab_lock, __cpu).lock){+.+...}, at: [<ffffffff81169b70>] lock_slab_on+0x60/0x190
[ 0.708568]
[ 0.708568] stack backtrace:
[ 0.708568] Pid: 1, comm: swapper/0 Not tainted 3.2.23-rt37.56.el6rt.x86_64.debug #1
[ 0.708568] Call Trace:
[ 0.708568] [<ffffffff81056bbd>] ? vprintk+0x32d/0x5f0
[ 0.708568] [<ffffffff810963f3>] print_deadlock_bug+0xf3/0x100
[ 0.708568] [<ffffffff81097fc9>] validate_chain+0x549/0x7e0
[ 0.708568] [<ffffffff81092b09>] ? trace_hardirqs_off_caller+0x29/0xc0
[ 0.708568] [<ffffffff81098556>] __lock_acquire+0x2f6/0x4e0
[ 0.708568] [<ffffffff81169bc0>] ? lock_slab_on+0xb0/0x190
[ 0.708568] [<ffffffff81098e07>] lock_acquire+0x97/0x140
[ 0.708568] [<ffffffff81169bc0>] ? lock_slab_on+0xb0/0x190
[ 0.708568] [<ffffffff81538349>] rt_spin_lock+0x39/0x40
[ 0.708568] [<ffffffff81169bc0>] ? lock_slab_on+0xb0/0x190
[ 0.708568] [<ffffffff8104cec5>] ? migrate_disable+0x65/0xd0
[ 0.708568] [<ffffffff81169bc0>] lock_slab_on+0xb0/0x190
[ 0.708568] [<ffffffff81169ca0>] ? lock_slab_on+0x190/0x190
[ 0.708568] [<ffffffff81169cc3>] do_ccupdate_local+0x23/0x60
[ 0.708568] [<ffffffff811672e8>] slab_on_each_cpu+0x38/0x70
[ 0.708568] [<ffffffff8116c953>] do_tune_cpucache+0xf3/0x2c0
[ 0.708568] [<ffffffff8116cfe8>] enable_cpucache+0x38/0xf0
[ 0.708568] [<ffffffff81521a5e>] setup_cpu_cache+0x1ae/0x200
[ 0.708568] [<ffffffff8116d4c1>] kmem_cache_create+0x421/0x5e0
[ 0.708568] [<ffffffff81136730>] ? shmem_mount+0x20/0x20
[ 0.708568] [<ffffffff81d15b63>] shmem_init+0x3c/0xd7
[ 0.708568] [<ffffffff81cf462e>] kernel_init+0xc7/0x168
[ 0.708568] [<ffffffff81542ab4>] kernel_thread_helper+0x4/0x10
[ 0.708568] [<ffffffff810460dc>] ? finish_task_switch+0x8c/0x110
[ 0.708568] [<ffffffff815391fb>] ? _raw_spin_unlock_irq+0x3b/0x60
[ 0.708568] [<ffffffff81539521>] ? retint_restore_args+0xe/0xe
[ 0.708568] [<ffffffff81cf4567>] ? parse_early_options+0x20/0x20
[ 0.708568] [<ffffffff81542ab0>] ? gs_change+0xb/0xb
This is a rather consistent backtrace, even across multiple boxes, and it happens early during boot. It does not show up every time, but from experience it appears at least on every third or fourth boot, and it is not hardware related at all: I have seen this issue on several boxes.
I tried to reproduce this on 3.0.36-rt57.66 without any luck: booting that version 10 times showed no indication of the problem. In contrast, 3.2.23-rt37.56 triggered it in 2 out of 4 boots after the attempts on the 3.0 kernel.
Note: during a call, David reported that the box does not lock up even when the backtrace is triggered. Thomas Gleixner has a fix in 3.6.0-rt1 that Steven Rostedt backported to 3.2.31-rt47 (posted below).
I believe this message should go away in our latest kernel (3.2.32-rt48.59).
Steven Rostedt via vger.kernel.org, Oct 12, to linux-kernel, linux-rt-users, Thomas, Carsten, John
From: Thomas Gleixner <tglx>
=============================================
[ INFO: possible recursive locking detected ]
3.6.0-rt1+ #49 Not tainted
---------------------------------------------
swapper/0/1 is trying to acquire lock:
lock_slab_on+0x72/0x77
but task is already holding lock:
__local_lock_irq+0x24/0x77
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&per_cpu(slab_lock, __cpu).lock);
lock(&per_cpu(slab_lock, __cpu).lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by swapper/0/1:
kmem_cache_create+0x33/0x89
__local_lock_irq+0x24/0x77
stack backtrace:
Pid: 1, comm: swapper/0 Not tainted 3.6.0-rt1+ #49
Call Trace:
__lock_acquire+0x9a4/0xdc4
? __local_lock_irq+0x24/0x77
? lock_slab_on+0x72/0x77
lock_acquire+0xc4/0x108
? lock_slab_on+0x72/0x77
? unlock_slab_on+0x5b/0x5b
rt_spin_lock+0x36/0x3d
? lock_slab_on+0x72/0x77
? migrate_disable+0x85/0x93
lock_slab_on+0x72/0x77
do_ccupdate_local+0x19/0x44
slab_on_each_cpu+0x36/0x5a
do_tune_cpucache+0xc1/0x305
enable_cpucache+0x8c/0xb5
setup_cpu_cache+0x28/0x182
__kmem_cache_create+0x34b/0x380
? shmem_mount+0x1a/0x1a
kmem_cache_create+0x4a/0x89
? shmem_mount+0x1a/0x1a
shmem_init+0x3e/0xd4
kernel_init+0x11c/0x214
kernel_thread_helper+0x4/0x10
? retint_restore_args+0x13/0x13
? start_kernel+0x3bc/0x3bc
? gs_change+0x13/0x13
It's not a missing annotation. It's simply wrong code and needs to be
fixed. Instead of nesting the local and the remote cpu lock simply
acquire only the remote cpu lock, which is sufficient protection for
this procedure.
Signed-off-by: Thomas Gleixner <tglx>
Cc: stable-rt.org
Signed-off-by: Steven Rostedt <rostedt>
---
include/linux/locallock.h | 8 ++++++++
mm/slab.c | 10 ++--------
2 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/include/linux/locallock.h b/include/linux/locallock.h
index 8fbc393..0161fbb 100644
--- a/include/linux/locallock.h
+++ b/include/linux/locallock.h
@@ -96,6 +96,9 @@ static inline void __local_lock_irq(struct local_irq_lock *lv)
 #define local_lock_irq(lvar)						\
 	do { __local_lock_irq(&get_local_var(lvar)); } while (0)
 
+#define local_lock_irq_on(lvar, cpu)					\
+	do { __local_lock_irq(&per_cpu(lvar, cpu)); } while (0)
+
 static inline void __local_unlock_irq(struct local_irq_lock *lv)
 {
 	LL_WARN(!lv->nestcnt);
@@ -111,6 +114,11 @@ static inline void __local_unlock_irq(struct local_irq_lock *lv)
 		put_local_var(lvar);					\
 	} while (0)
 
+#define local_unlock_irq_on(lvar, cpu)					\
+	do {								\
+		__local_unlock_irq(&per_cpu(lvar, cpu));		\
+	} while (0)
+
 static inline int __local_lock_irqsave(struct local_irq_lock *lv)
 {
 	if (lv->owner != current) {
diff --git a/mm/slab.c b/mm/slab.c
index 411c545..6a8fd1b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -747,18 +747,12 @@ slab_on_each_cpu(void (*func)(void *arg, int this_cpu), void *arg)
 
 static void lock_slab_on(unsigned int cpu)
 {
-	if (cpu == smp_processor_id())
-		local_lock_irq(slab_lock);
-	else
-		local_spin_lock_irq(slab_lock, &per_cpu(slab_lock, cpu).lock);
+	local_lock_irq_on(slab_lock, cpu);
 }
 
 static void unlock_slab_on(unsigned int cpu)
 {
-	if (cpu == smp_processor_id())
-		local_unlock_irq(slab_lock);
-	else
-		local_spin_unlock_irq(slab_lock, &per_cpu(slab_lock, cpu).lock);
+	local_unlock_irq_on(slab_lock, cpu);
 }
 
 #endif
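To make the changelog concrete: all per-CPU instances of slab_lock share a single lockdep class, so the old path that took the local CPU's lock and then a remote CPU's lock (via local_spin_lock_irq) looks to lockdep like recursive locking, and between two CPUs it is an ABBA ordering hazard. Below is a minimal userspace sketch of the two schemes, with pthread mutexes standing in for the per-CPU locallocks; the names are hypothetical and this illustrates the pattern only, it is not the kernel code.

```c
/* Userspace analogue of the locking pattern lockdep flagged.
 * Hypothetical names; pthread mutexes stand in for locallocks. */
#include <pthread.h>
#include <stdio.h>

#define NCPU 2
static pthread_mutex_t slab_lock[NCPU] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Old, hazard-prone scheme: nest the local and the remote lock.
 * Thread 0 holding lock 0 and waiting for lock 1, while thread 1
 * holds lock 1 and waits for lock 0, is the ABBA case lockdep's
 * single lock class cannot rule out. */
static void lock_slab_on_old(int this_cpu, int cpu)
{
	pthread_mutex_lock(&slab_lock[this_cpu]);    /* local lock  */
	if (cpu != this_cpu)
		pthread_mutex_lock(&slab_lock[cpu]); /* remote lock */
}

/* Fixed scheme: the remote lock alone already serializes the update. */
static void lock_slab_on_new(int cpu)
{
	pthread_mutex_lock(&slab_lock[cpu]);
}

static void unlock_slab_on_new(int cpu)
{
	pthread_mutex_unlock(&slab_lock[cpu]);
}

static void *updater(void *arg)
{
	int me = (int)(long)arg;

	/* Each thread updates the "other CPU's" cache under exactly
	 * one lock, so no lock-ordering problem can arise. */
	lock_slab_on_new(1 - me);
	printf("thread %d updated cpu %d's cache\n", me, 1 - me);
	unlock_slab_on_new(1 - me);
	return NULL;
}

int main(void)
{
	pthread_t t[NCPU];

	(void)lock_slab_on_old; /* kept only for comparison */
	for (long i = 0; i < NCPU; i++)
		pthread_create(&t[i], NULL, updater, (void *)i);
	for (int i = 0; i < NCPU; i++)
		pthread_join(t[i], NULL);
	return 0;
}
```

Built with `gcc -pthread`, the fixed scheme always terminates because each thread takes exactly one lock; swapping in lock_slab_on_old() reintroduces the ordering hazard the lockdep report describes. The patch takes the same approach: rather than annotating the nesting, it removes it.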
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1491.html