kernel-rt-debug: potential deadlock

More info needed. We had console output before, which is unfortunately lost, but we will try to reproduce.
This is found on kernel-rt-debug based on 3.2.23-rt37. The issue is reported to the console log:

[    0.708568] =============================================
[    0.708568] [ INFO: possible recursive locking detected ]
[    0.708568] 3.2.23-rt37.56.el6rt.x86_64.debug #1
[    0.708568] ---------------------------------------------
[    0.708568] swapper/0/1 is trying to acquire lock:
[    0.708568]  (&per_cpu(slab_lock, __cpu).lock){+.+...}, at: [<ffffffff81169bc0>] lock_slab_on+0xb0/0x190
[    0.708568]
[    0.708568] but task is already holding lock:
[    0.708568]  (&per_cpu(slab_lock, __cpu).lock){+.+...}, at: [<ffffffff81169b70>] lock_slab_on+0x60/0x190
[    0.708568]
[    0.708568] other info that might help us debug this:
[    0.708568]  Possible unsafe locking scenario:
[    0.708568]
[    0.708568]        CPU0
[    0.708568]        ----
[    0.708568]   lock(&per_cpu(slab_lock, __cpu).lock);
[    0.708568]   lock(&per_cpu(slab_lock, __cpu).lock);
[    0.708568]
[    0.708568]  *** DEADLOCK ***
[    0.708568]
[    0.708568]  May be due to missing lock nesting notation
[    0.708568]
[    0.708568] 2 locks held by swapper/0/1:
[    0.708568]  #0:  (cache_chain_mutex){+.+.+.}, at: [<ffffffff8116d5c1>] kmem_cache_create+0x521/0x5e0
[    0.708568]  #1:  (&per_cpu(slab_lock, __cpu).lock){+.+...}, at: [<ffffffff81169b70>] lock_slab_on+0x60/0x190
[    0.708568]
[    0.708568] stack backtrace:
[    0.708568] Pid: 1, comm: swapper/0 Not tainted 3.2.23-rt37.56.el6rt.x86_64.debug #1
[    0.708568] Call Trace:
[    0.708568]  [<ffffffff81056bbd>] ? vprintk+0x32d/0x5f0
[    0.708568]  [<ffffffff810963f3>] print_deadlock_bug+0xf3/0x100
[    0.708568]  [<ffffffff81097fc9>] validate_chain+0x549/0x7e0
[    0.708568]  [<ffffffff81092b09>] ? trace_hardirqs_off_caller+0x29/0xc0
[    0.708568]  [<ffffffff81098556>] __lock_acquire+0x2f6/0x4e0
[    0.708568]  [<ffffffff81169bc0>] ? lock_slab_on+0xb0/0x190
[    0.708568]  [<ffffffff81098e07>] lock_acquire+0x97/0x140
[    0.708568]  [<ffffffff81169bc0>] ? lock_slab_on+0xb0/0x190
[    0.708568]  [<ffffffff81538349>] rt_spin_lock+0x39/0x40
[    0.708568]  [<ffffffff81169bc0>] ? lock_slab_on+0xb0/0x190
[    0.708568]  [<ffffffff8104cec5>] ? migrate_disable+0x65/0xd0
[    0.708568]  [<ffffffff81169bc0>] lock_slab_on+0xb0/0x190
[    0.708568]  [<ffffffff81169ca0>] ? lock_slab_on+0x190/0x190
[    0.708568]  [<ffffffff81169cc3>] do_ccupdate_local+0x23/0x60
[    0.708568]  [<ffffffff811672e8>] slab_on_each_cpu+0x38/0x70
[    0.708568]  [<ffffffff8116c953>] do_tune_cpucache+0xf3/0x2c0
[    0.708568]  [<ffffffff8116cfe8>] enable_cpucache+0x38/0xf0
[    0.708568]  [<ffffffff81521a5e>] setup_cpu_cache+0x1ae/0x200
[    0.708568]  [<ffffffff8116d4c1>] kmem_cache_create+0x421/0x5e0
[    0.708568]  [<ffffffff81136730>] ? shmem_mount+0x20/0x20
[    0.708568]  [<ffffffff81d15b63>] shmem_init+0x3c/0xd7
[    0.708568]  [<ffffffff81cf462e>] kernel_init+0xc7/0x168
[    0.708568]  [<ffffffff81542ab4>] kernel_thread_helper+0x4/0x10
[    0.708568]  [<ffffffff810460dc>] ? finish_task_switch+0x8c/0x110
[    0.708568]  [<ffffffff815391fb>] ? _raw_spin_unlock_irq+0x3b/0x60
[    0.708568]  [<ffffffff81539521>] ? retint_restore_args+0xe/0xe
[    0.708568]  [<ffffffff81cf4567>] ? parse_early_options+0x20/0x20
[    0.708568]  [<ffffffff81542ab0>] ? gs_change+0xb/0xb

This is a rather consistent backtrace, even on multiple boxes. It happens early during boot. It does not show on every boot, but from experience it triggers at least on every third or fourth boot, and it is not hardware related at all: I've seen this issue on several boxes. I've tried to reproduce this on 3.0.36-rt57.66 without any luck; I booted that version 10 times without seeing any indication. 3.2.23-rt37.56 triggered in 2 out of 4 boots after I had tried to trigger it on the 3.0 kernel.
Note: during a call, David reported that the box does not lock up even when the backtrace is triggered.
Thomas Gleixner has a fix in 3.6.0-rt1 that Steven Rostedt backported to 3.2.31-rt47 (posted below). I believe this message should go away in our latest kernel (3.2.32-rt48.59).

Steven Rostedt via vger.kernel.org
Oct 12
to linux-kernel, linux-rt-users, Thomas, Carsten, John

From: Thomas Gleixner <tglx>

=============================================
[ INFO: possible recursive locking detected ]
3.6.0-rt1+ #49 Not tainted
---------------------------------------------
swapper/0/1 is trying to acquire lock:
 lock_slab_on+0x72/0x77

but task is already holding lock:
 __local_lock_irq+0x24/0x77

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&per_cpu(slab_lock, __cpu).lock);
  lock(&per_cpu(slab_lock, __cpu).lock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

2 locks held by swapper/0/1:
 kmem_cache_create+0x33/0x89
 __local_lock_irq+0x24/0x77

stack backtrace:
Pid: 1, comm: swapper/0 Not tainted 3.6.0-rt1+ #49
Call Trace:
 __lock_acquire+0x9a4/0xdc4
 ? __local_lock_irq+0x24/0x77
 ? lock_slab_on+0x72/0x77
 lock_acquire+0xc4/0x108
 ? lock_slab_on+0x72/0x77
 ? unlock_slab_on+0x5b/0x5b
 rt_spin_lock+0x36/0x3d
 ? lock_slab_on+0x72/0x77
 ? migrate_disable+0x85/0x93
 lock_slab_on+0x72/0x77
 do_ccupdate_local+0x19/0x44
 slab_on_each_cpu+0x36/0x5a
 do_tune_cpucache+0xc1/0x305
 enable_cpucache+0x8c/0xb5
 setup_cpu_cache+0x28/0x182
 __kmem_cache_create+0x34b/0x380
 ? shmem_mount+0x1a/0x1a
 kmem_cache_create+0x4a/0x89
 ? shmem_mount+0x1a/0x1a
 shmem_init+0x3e/0xd4
 kernel_init+0x11c/0x214
 kernel_thread_helper+0x4/0x10
 ? retint_restore_args+0x13/0x13
 ? start_kernel+0x3bc/0x3bc
 ? gs_change+0x13/0x13

It's not a missing annotation. It's simply wrong code and needs to be fixed. Instead of nesting the local and the remote CPU lock, simply acquire only the remote CPU lock, which is sufficient protection for this procedure.
Signed-off-by: Thomas Gleixner <tglx>
Cc: stable-rt.org
Signed-off-by: Steven Rostedt <rostedt>
---
 include/linux/locallock.h |    8 ++++++++
 mm/slab.c                 |   10 ++--------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/locallock.h b/include/linux/locallock.h
index 8fbc393..0161fbb 100644
--- a/include/linux/locallock.h
+++ b/include/linux/locallock.h
@@ -96,6 +96,9 @@ static inline void __local_lock_irq(struct local_irq_lock *lv)
 #define local_lock_irq(lvar)					\
 	do { __local_lock_irq(&get_local_var(lvar)); } while (0)
 
+#define local_lock_irq_on(lvar, cpu)				\
+	do { __local_lock_irq(&per_cpu(lvar, cpu)); } while (0)
+
 static inline void __local_unlock_irq(struct local_irq_lock *lv)
 {
 	LL_WARN(!lv->nestcnt);
@@ -111,6 +114,11 @@ static inline void __local_unlock_irq(struct local_irq_lock *lv)
 		put_local_var(lvar);				\
 	} while (0)
 
+#define local_unlock_irq_on(lvar, cpu)				\
+	do {							\
+		__local_unlock_irq(&per_cpu(lvar, cpu));	\
+	} while (0)
+
 static inline int __local_lock_irqsave(struct local_irq_lock *lv)
 {
 	if (lv->owner != current) {
diff --git a/mm/slab.c b/mm/slab.c
index 411c545..6a8fd1b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -747,18 +747,12 @@ slab_on_each_cpu(void (*func)(void *arg, int this_cpu), void *arg)
 
 static void lock_slab_on(unsigned int cpu)
 {
-	if (cpu == smp_processor_id())
-		local_lock_irq(slab_lock);
-	else
-		local_spin_lock_irq(slab_lock, &per_cpu(slab_lock, cpu).lock);
+	local_lock_irq_on(slab_lock, cpu);
 }
 
 static void unlock_slab_on(unsigned int cpu)
 {
-	if (cpu == smp_processor_id())
-		local_unlock_irq(slab_lock);
-	else
-		local_spin_unlock_irq(slab_lock, &per_cpu(slab_lock, cpu).lock);
+	local_unlock_irq_on(slab_lock, cpu);
 }
 #endif
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-1491.html