Bug 1121001 - Seems we have deadlock situation over wait_queue spin_lock
Summary: Seems we have deadlock situation over wait_queue spin_lock
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Dave Anderson
QA Contact: Li Wang
URL:
Whiteboard:
Depends On:
Blocks: 1075802 1159933
 
Reported: 2014-07-18 06:18 UTC by gopal krishna tiwari
Modified: 2019-03-22 07:15 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-04-16 19:17:24 UTC



Description gopal krishna tiwari 2014-07-18 06:18:38 UTC
Description of problem:

We have a hang scenario with the following traces in the logs:

  [353686.119720] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
      [353686.134318] Pid: 25261, comm: java Tainted: P           ---------------    2.6.32-358.14.1.el6.x86_64 #1
      [353686.153413] Call Trace:
      [353686.158480]  <NMI>  [<ffffffff8150d668>] ? panic+0xa7/0x16f
      [353686.169833]  [<ffffffff810e114d>] ? watchdog_overflow_callback+0xcd/0xd0
      [353686.183389]  [<ffffffff81116f70>] ? __perf_event_overflow+0xb0/0x2a0
      [353686.196252]  [<ffffffff8101a89d>] ? x86_perf_event_update+0x5d/0xb0
      [353686.208941]  [<ffffffff8101b82d>] ? x86_perf_event_set_period+0xdd/0x170
      [353686.222496]  [<ffffffff81117594>] ? perf_event_overflow+0x14/0x20
      [353686.234839]  [<ffffffff810208c2>] ? intel_pmu_handle_irq+0x192/0x300
      [353686.247768]  [<ffffffff81513246>] ? kprobe_exceptions_notify+0x16/0x430
      [353686.261220]  [<ffffffff81511db9>] ? perf_event_nmi_handler+0x39/0xb0
      [353686.274148]  [<ffffffff81513875>] ? notifier_call_chain+0x55/0x80
      [353686.286558]  [<ffffffff815138da>] ? atomic_notifier_call_chain+0x1a/0x20
      [353686.300181]  [<ffffffff8109cc3e>] ? notify_die+0x2e/0x30
      [353686.311031]  [<ffffffff8151153b>] ? do_nmi+0x1bb/0x340
      [353686.321534]  [<ffffffff81510e00>] ? nmi+0x20/0x30
      [353686.331175]  [<ffffffff81055d5d>] ? task_rq_lock+0x4d/0xa0
      [353686.342371]  <<EOE>>  <IRQ>  [<ffffffff81062f8c>] ? try_to_wake_up+0x3c/0x3e0
      [353686.357015]  [<ffffffff811676ee>] ? fallback_alloc+0x14e/0x270
      [353686.368905]  [<ffffffff81063342>] ? default_wake_function+0x12/0x20
      [353686.381660]  [<ffffffff81197436>] ? pollwake+0x56/0x60
      [353686.392161]  [<ffffffff81063330>] ? default_wake_function+0x0/0x20
      [353686.404745]  [<ffffffff81051439>] ? __wake_up_common+0x59/0x90
      [353686.416634]  [<ffffffff81055ac8>] ? __wake_up+0x48/0x70
      [353686.427314]  [<ffffffffa0492545>] ? ib_uverbs_comp_handler+0xc5/0x100 [ib_uverbs]
      [353686.442564]  [<ffffffffa03794b7>] ? mlx4_ib_cq_comp+0x17/0x20 [mlx4_ib]
      [353686.456019]  [<ffffffffa031c902>] ? mlx4_cq_completion+0x42/0x90 [mlx4_core]
      [353686.470336]  [<ffffffffa031d633>] ? mlx4_eq_int+0x563/0xca0 [mlx4_core]
      [353686.483784]  [<ffffffff8105ad5e>] ? scheduler_tick+0x11e/0x260
      [353686.495679]  [<ffffffffa02958c0>] ? bnx2x_timer+0x0/0x190 [bnx2x]
      [353686.508105]  [<ffffffffa031dd84>] ? mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
      [353686.522831]  [<ffffffff810e1690>] ? handle_IRQ_event+0x60/0x170
      [353686.534898]  [<ffffffffa02958c0>] ? bnx2x_timer+0x0/0x190 [bnx2x]
      [353686.547309]  [<ffffffff810e3dde>] ? handle_edge_irq+0xde/0x180
      [353686.559200]  [<ffffffff8100de89>] ? handle_irq+0x49/0xa0
      [353686.570051]  [<ffffffff8151732c>] ? do_IRQ+0x6c/0xf0
      [353686.580209]  [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
      [353686.591407]  [<ffffffffa0290497>] ? bnx2x_drv_pulse+0x67/0x80 [bnx2x]
      [353686.604512]  [<ffffffffa0295986>] ? bnx2x_timer+0xc6/0x190 [bnx2x]
      [353686.617096]  [<ffffffff81081857>] ? run_timer_softirq+0x197/0x340
      [353686.629507]  [<ffffffff81076fd1>] ? __do_softirq+0xc1/0x1e0
      [353686.640876]  [<ffffffff8109b79b>] ? hrtimer_interrupt+0x14b/0x260
      [353686.653285]  [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
      [353686.664481]  [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
      [353686.675330]  [<ffffffff81076db5>] ? irq_exit+0x85/0x90
      [353686.685834]  [<ffffffff81517420>] ? smp_apic_timer_interrupt+0x70/0x9b
      [353686.699109]  [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
      [353686.711689]  <EOI> 


Version-Release number of selected component (if applicable):
kernel-2.6.32-358.14.1.el6.x86_64

Comment 1 gopal krishna tiwari 2014-07-18 06:23:15 UTC
Vmcore analysis is as follows:

 #5 [ffff880645427da0] perf_event_overflow at ffffffff81117594
 #6 [ffff880645427db0] intel_pmu_handle_irq at ffffffff810208c2
 #7 [ffff880645427e90] perf_event_nmi_handler at ffffffff81511db9
 #8 [ffff880645427ea0] notifier_call_chain at ffffffff81513875
 #9 [ffff880645427ee0] atomic_notifier_call_chain at ffffffff815138da
#10 [ffff880645427ef0] notify_die at ffffffff8109cc3e
#11 [ffff880645427f20] do_nmi at ffffffff8151153b
#12 [ffff880645427f50] nmi at ffffffff81510e00
    [exception RIP: task_rq_lock+0x4d]
    RIP: ffffffff81055d5d  RSP: ffff8806454239d0  RFLAGS: 00000082
    RAX: 0000000000000003  RBX: 0000000000016700  RCX: 0000000000000000
    RDX: 0000000000000082  RSI: ffff880645423a28  RDI: ffff88060eaad500
    RBP: ffff8806454239f0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff88060eaad500
    R13: ffff880645423a28  R14: 0000000000016700  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
#13 [ffff8806454239d0] task_rq_lock at ffffffff81055d5d
#14 [ffff8806454239f8] try_to_wake_up at ffffffff81062f8c
#15 [ffff880645423a68] default_wake_function at ffffffff81063342
#16 [ffff880645423a78] pollwake at ffffffff81197436
#17 [ffff880645423ab8] __wake_up_common at ffffffff81051439
#18 [ffff880645423b08] __wake_up at ffffffff81055ac8


struct thread_info {
  task = 0xffff880c16b7c040, 
  exec_domain = 0xffffffff81a9a8a0 <default_exec_domain>, 
  flags = 0x20088, 
  status = 0x0, 
  cpu = 0x2, 
  preempt_count = 0x4020100, <<----------- Not able to preempt

<snip>

 1194 static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
 1195         __acquires(rq->lock)
 1196 {
 1197         struct rq *rq;
 1198 
 1199         for (;;) {
 1200                 local_irq_save(*flags);
 1201                 rq = task_rq(p);
 1202                 spin_lock(&rq->lock);

</snip>


When tasks are woken up in try_to_wake_up(), the kernel uses check_preempt_curr() to see whether the new task can preempt the currently running one. Here preempt_count is non-zero, which does not allow the scheduler to switch away from the current task; the current task has acquired important resources, which may lead to further issues.

CPU-2 runq lock status 

crash> rq ffff880645436700
struct rq {
  lock = {
    raw_lock = {
      slock = 0x9c719c71 <<-------------
    }
  }, 


 #6 [ffff880645427db0] intel_pmu_handle_irq at ffffffff810208c2
 #7 [ffff880645427e90] perf_event_nmi_handler at ffffffff81511db9
 #8 [ffff880645427ea0] notifier_call_chain at ffffffff81513875
 #9 [ffff880645427ee0] atomic_notifier_call_chain at ffffffff815138da
#10 [ffff880645427ef0] notify_die at ffffffff8109cc3e
#11 [ffff880645427f20] do_nmi at ffffffff8151153b
#12 [ffff880645427f50] nmi at ffffffff81510e00
    [exception RIP: task_rq_lock+0x4d]
    RIP: ffffffff81055d5d  RSP: ffff8806454239d0  RFLAGS: 00000082
    RAX: 0000000000000003  RBX: 0000000000016700  RCX: 0000000000000000
    RDX: 0000000000000082  RSI: ffff880645423a28  RDI: ffff88060eaad500
    RBP: ffff8806454239f0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff88060eaad500
    R13: ffff880645423a28  R14: 0000000000016700  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
#13 [ffff8806454239d0] task_rq_lock at ffffffff81055d5d
#14 [ffff8806454239f8] try_to_wake_up at ffffffff81062f8c
#15 [ffff880645423a68] default_wake_function at ffffffff81063342
#16 [ffff880645423a78] pollwake at ffffffff81197436
#17 [ffff880645423ab8] __wake_up_common at ffffffff81051439
#18 [ffff880645423b08] __wake_up at ffffffff81055ac8


Now, if I understood it correctly, here we are trying to wake up the following task:

static int try_to_wake_up(struct task_struct *p, unsigned int state,
                          int wake_flags)
{
        int cpu, orig_cpu, this_cpu, success = 0;
        unsigned long flags;
        unsigned long en_flags = ENQUEUE_WAKEUP;
        struct rq *rq;

        this_cpu = get_cpu();

        smp_wmb();
        rq = task_rq_lock(p, &flags); <<----------

task -> RDI: ffff88060eaad500


PID: 28035  TASK: ffff88060eaad500  CPU: 3   COMMAND: "java"
struct task_struct {
  state = 0x0,  <<--------- Seems runnable 
  stack = 0xffff88059f940000, 
  usage = {
    counter = 0x2
  }, 
  flags = 0x402040, 
  ptrace = 0x0, 
  lock_depth = 0xffffffff, 
  prio = 0x78, 
  static_prio = 0x78, 
  normal_prio = 0x78, 
  rt_priority = 0x0, 
  sched_class = 0xffffffff8160b6a0 <fair_sched_class>, 
  se = {


struct thread_info {
  task = 0xffff88060eaad500, 
  exec_domain = 0xffffffff81a9a8a0 <default_exec_domain>, 
  flags = 0x80, 
  status = 0x0, 
  cpu = 0x3,  <<-------------------
  preempt_count = 0x4010000, <<-----------

The task already seems to be running on CPU-3:

PID: 28035  TASK: ffff88060eaad500  CPU: 3   COMMAND: "java"
 #0 [ffff880028227e90] crash_nmi_callback at ffffffff8102d3a6
 #1 [ffff880028227ea0] notifier_call_chain at ffffffff81513875
 #2 [ffff880028227ee0] atomic_notifier_call_chain at ffffffff815138da
 #3 [ffff880028227ef0] notify_die at ffffffff8109cc3e
 #4 [ffff880028227f20] do_nmi at ffffffff8151153b
 #5 [ffff880028227f50] nmi at ffffffff81510e00
    [exception RIP: _spin_lock_irqsave+0x2f] <<---------------------------------
    RIP: ffffffff8151055f  RSP: ffff88059f941948  RFLAGS: 00000097
    RAX: 000000000000f418  RBX: ffff880c16462b58  RCX: 000000000000f417
    RDX: 0000000000000282  RSI: ffff88059f941b78  RDI: ffff880c16462b58
    RBP: ffff88059f941948   R8: ffff88059f941b50   R9: 00000000ffffffff
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff88059f941b78
    R13: ffff88059f941ab8  R14: ffff8806243622c0  R15: 0000000000000100
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88059f941948] _spin_lock_irqsave at ffffffff8151055f
 #7 [ffff88059f941950] remove_wait_queue at ffffffff81097064
 #8 [ffff88059f941980] free_poll_entry at ffffffff8119723e
 #9 [ffff88059f9419a0] poll_freewait at ffffffff8119728d
#10 [ffff88059f9419d0] do_select at ffffffff81198335
#11 [ffff88059f941d70] core_sys_select at ffffffff8119858a


crash> spinlock_t ffff880c16462b58
struct spinlock_t {
  raw_lock = {
    slock = 0xf419f417
  }
}

The task seems to be waiting for this spinlock to be freed, which suggests some kind of resource (lock) contention is happening.

Now let's check what is happening over CPU-0

PID: 2021   TASK: ffff880c24522080  CPU: 0   COMMAND: "flush-253:1"
 #0 [ffff880645407e90] crash_nmi_callback at ffffffff8102d3a6
 #1 [ffff880645407ea0] notifier_call_chain at ffffffff81513875
 #2 [ffff880645407ee0] atomic_notifier_call_chain at ffffffff815138da
 #3 [ffff880645407ef0] notify_die at ffffffff8109cc3e
 #4 [ffff880645407f20] do_nmi at ffffffff8151153b
 #5 [ffff880645407f50] nmi at ffffffff81510e00
    [exception RIP: __bitmap_empty+0x73]
    RIP: ffffffff81284c43  RSP: ffff880c17cf58e0  RFLAGS: 00000202
    RAX: 00000000ffffffff  RBX: 0000000000000000  RCX: 0000000000000020
    RDX: 0000000000000000  RSI: 0000000000000020  RDI: ffffffff81e22898
    RBP: ffff880c17cf58e0   R8: 0000000000000000   R9: 0000000000000000
    R10: ffffea0022cc5400  R11: 0000000000000004  R12: ffffffff81e22898
    R13: 0000000000000000  R14: ffffffff81e22890  R15: ffff880c16bb0a88
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff880c17cf58e0] __bitmap_empty at ffffffff81284c43
 #7 [ffff880c17cf58e8] flush_tlb_others_ipi at ffffffff8104c678
 #8 [ffff880c17cf5938] native_flush_tlb_others at ffffffff8104c6f6
 #9 [ffff880c17cf5968] flush_tlb_page at ffffffff8104c81e
#10 [ffff880c17cf5998] page_mkclean at ffffffff8114ef33 <<------------------------- From here 
#11 [ffff880c17cf5a68] clear_page_dirty_for_io at ffffffff8112cbf0
#12 [ffff880c17cf5a88] write_cache_pages at ffffffff8112ddf8
#13 [ffff880c17cf5bb8] generic_writepages at ffffffff8112e134
#14 [ffff880c17cf5bc8] do_writepages at ffffffff8112e175
#15 [ffff880c17cf5bd8] writeback_single_inode at ffffffff811ac80d
#16 [ffff880c17cf5c18] writeback_sb_inodes at ffffffff811acc1e
#17 [ffff880c17cf5c78] writeback_inodes_wb at ffffffff811acd7b
#18 [ffff880c17cf5cd8] wb_writeback at ffffffff811ad11b
#19 [ffff880c17cf5dd8] wb_do_writeback at ffffffff811ad409


 744 static int page_mkclean_file(struct address_space *mapping, struct page *page)
 745 {
 746         pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 747         struct vm_area_struct *vma;
 748         struct prio_tree_iter iter;
 749         int ret = 0;
 750 
 751         BUG_ON(PageAnon(page));
 752 
 753         spin_lock(&mapping->i_mmap_lock);   <<------------------- lock acquired (1)
 754         vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
 755                 if (vma->vm_flags & VM_SHARED) {
 756                         unsigned long address = vma_address(page, vma);



266 void flush_tlb_page(struct vm_area_struct *vma, unsigned long va)
267 {
268         struct mm_struct *mm = vma->vm_mm;
269 
270         preempt_disable(); <<<-------------- (2)


175 static void flush_tlb_others_ipi(const struct cpumask *cpumask,
176                                  struct mm_struct *mm, unsigned long va)
177 {
178         unsigned int sender;
179         union smp_flush_state *f;
180 
181         /* Caller has disabled preemption */
182         sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
183         f = &flush_state[sender];
184 
185         /*
186          * Could avoid this lock when
187          * num_online_cpus() <= NUM_INVALIDATE_TLB_VECTORS, but it is
188          * probably not worth checking this for a cache-hot lock.
189          */
190         spin_lock(&f->tlbstate_lock); ----- lock acquired (3)
191 
...
194         if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask, cpumask_of(smp_processor_id()))) {
195                 /*
196                  * We have to send the IPI only to
197                  * CPUs affected.
198                  */
199                 apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
200                               INVALIDATE_TLB_VECTOR_START + sender);
201 
202                 while (!cpumask_empty(to_cpumask(f->flush_cpumask))) <<---------- We are waiting here 
203                         cpu_relax();
204         }
205 


The flush_state contents explain why:

crash> px flush_state[0]
$6 = {
  {
    flush_mm = 0xffff880c16bb07c0, 
    flush_va = 0xf6b09000, 
    tlbstate_lock = {
      raw_lock = {
        slock = 0xccbbccba
      }
    }, 
    flush_cpumask = {0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
  }, 
  pad = "\300\a\273\026\f\210\377\377\000\220\260\366\000\000\000\000\272̻\314\000\000\000\000\004\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
}

Because the flush_cpumask is not empty, CPU-0 appears to be stuck waiting in the while loop above for CPU-2.

Comment 2 gopal krishna tiwari 2014-07-18 06:24:37 UTC
Continuing from my previous update:

Let's dig further into CPU-3.

PID: 28035  TASK: ffff88060eaad500  CPU: 3   COMMAND: "java"
 #0 [ffff880028227e90] crash_nmi_callback at ffffffff8102d3a6
 #1 [ffff880028227ea0] notifier_call_chain at ffffffff81513875
 #2 [ffff880028227ee0] atomic_notifier_call_chain at ffffffff815138da
 #3 [ffff880028227ef0] notify_die at ffffffff8109cc3e
 #4 [ffff880028227f20] do_nmi at ffffffff8151153b
 #5 [ffff880028227f50] nmi at ffffffff81510e00
    [exception RIP: _spin_lock_irqsave+0x2f]
    RIP: ffffffff8151055f  RSP: ffff88059f941948  RFLAGS: 00000097
    RAX: 000000000000f418  RBX: ffff880c16462b58  RCX: 000000000000f417
    RDX: 0000000000000282  RSI: ffff88059f941b78  RDI: ffff880c16462b58
    RBP: ffff88059f941948   R8: ffff88059f941b50   R9: 00000000ffffffff
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff88059f941b78
    R13: ffff88059f941ab8  R14: ffff8806243622c0  R15: 0000000000000100
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88059f941948] _spin_lock_irqsave at ffffffff8151055f
 #7 [ffff88059f941950] remove_wait_queue at ffffffff81097064
 #8 [ffff88059f941980] free_poll_entry at ffffffff8119723e
 #9 [ffff88059f9419a0] poll_freewait at ffffffff8119728d
#10 [ffff88059f9419d0] do_select at ffffffff81198335
#11 [ffff88059f941d70] core_sys_select at ffffffff8119858a
#12 [ffff88059f941f10] sys_select at ffffffff81198917
#13 [ffff88059f941f80] system_call_fastpath at ffffffff8100b072
    RIP: 0000003bdd0e14f3  RSP: 00007fbce3c99a00  RFLAGS: 00000206
    RAX: 0000000000000017  RBX: ffffffff8100b072  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: 00007fbce3c99cf0  RDI: 00000000000000c8

crash> struct poll_wqueues ffff88059f941b68
struct poll_wqueues {
  pt = {
    qproc = 0xffff880c24bc6440, 
    key = 0xdb
  }, 
  table = 0xffff880500000000, 
  polling_task = 0xffff88059f941ab8, 
  triggered = 0x811973e0, 
  error = 0xffffffff, 
  inline_index = 0x16462b60, 
  inline_entries = {{
      filp = 0xffff880c16462b60, 
      key = 0xffff880c16462b58, 
      wait = {
        flags = 0x243f4bc0, 
        private = 0xdb, 
        func = 0xffff880500000000, 
        task_list = {
          next = 0xffff88059f941ab8, 
          prev = 0xffffffff811973e0 <pollwake>
        }
      }, 
      wait_address = 0xffff880624ca45e0
    }, {
      filp = 0xffff880624ca45e0, 
      key = 0xffff880624ca45d8, 
      wait = {
        flags = 0x243622c0, 
        private = 0xdb, 
        func = 0x0, 
        task_list = {
          next = 0xffff88059f941ab8, 
          prev = 0xffffffff811973e0 <pollwake>
        }
      }, 
      wait_address = 0xffff880624c877e0
   
crash> struct poll_table_entry ffff88059f941b68
struct poll_table_entry {
  filp = 0xffff880c24bc6440, 
  key = 0xdb, 
  wait = {
    flags = 0x0, 
    private = 0xffff88059f941ab8, 
    func = 0xffffffff811973e0 <pollwake>, 
    task_list = {
      next = 0xffff880c16462b60, 
      prev = 0xffff880c16462b60
    }
  }, 
  wait_address = 0xffff880c16462b58

crash>  wait_queue_head_t -o
typedef struct __wait_queue_head {
   [0x0] spinlock_t lock;
   [0x8] struct list_head task_list;
} wait_queue_head_t;
SIZE: 0x18

crash> wait_queue_head_t ffff880c16462b58 <<------------------------ (1)
struct wait_queue_head_t {
  lock = {
    raw_lock = {
      slock = 0xf419f417
    }
  }, 
  task_list = {
    next = 0xffff88059f941b90, 
    prev = 0xffff88059f941b90
  }
}
..

...

This reminds me of the wait queue on CPU-2, I think. Let's go back to the CPU-2 trace now.

#11 [ffff880645427f20] do_nmi at ffffffff8151153b
#12 [ffff880645427f50] nmi at ffffffff81510e00
    [exception RIP: task_rq_lock+0x4d]
    RIP: ffffffff81055d5d  RSP: ffff8806454239d0  RFLAGS: 00000082
    RAX: 0000000000000003  RBX: 0000000000016700  RCX: 0000000000000000
    RDX: 0000000000000082  RSI: ffff880645423a28  RDI: ffff88060eaad500
    RBP: ffff8806454239f0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff88060eaad500
    R13: ffff880645423a28  R14: 0000000000016700  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
--- <NMI exception stack> ---
#13 [ffff8806454239d0] task_rq_lock at ffffffff81055d5d
#14 [ffff8806454239f8] try_to_wake_up at ffffffff81062f8c
#15 [ffff880645423a68] default_wake_function at ffffffff81063342
#16 [ffff880645423a78] pollwake at ffffffff81197436
#17 [ffff880645423ab8] __wake_up_common at ffffffff81051439
#18 [ffff880645423b08] __wake_up at ffffffff81055ac8

"Because wait queues can also be modified in interrupts, a spinlock named lock must be acquired before
the queue is manipulated" -- it seems to me that this lock could be the culprit here as well; let's check that.

This is visible in the code of the __wake_up() call:

void __wake_up(wait_queue_head_t *q, unsigned int mode,
                        int nr_exclusive, void *key)
{
        unsigned long flags;

        spin_lock_irqsave(&q->lock, flags); <<---------------------------------
        __wake_up_common(q, mode, nr_exclusive, 0, key);
        spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(__wake_up);

Now let's check the wait_queue_head_t here 

0xffffffff81055ab5 <__wake_up+0x35>:    mov    %rax,%r12
/usr/src/debug/kernel-2.6.32-358.14.1.el6/linux-2.6.32-358.14.1.el6.x86_64/kernel/sched.c: 6256
0xffffffff81055ab8 <__wake_up+0x38>:    mov    %r14d,%edx
0xffffffff81055abb <__wake_up+0x3b>:    mov    %r13d,%esi
0xffffffff81055abe <__wake_up+0x3e>:    mov    %rbx,%rdi <<--------------- wait_queue_head_t 
0xffffffff81055ac1 <__wake_up+0x41>:    xor    %ecx,%ecx
0xffffffff81055ac3 <__wake_up+0x43>:    callq  0xffffffff810513e0 <__wake_up_common>
/usr/src/debug/kernel-2.6.32-358.14.1.el6/linux-2.6.32-358.14.1.el6.x86_64/kernel/sched.c: 6257
0xffffffff81055ac8 <__wake_up+0x48>:    mov    %r12,%rsi

crash> dis -lr ffffffff81051439
/usr/src/debug/kernel-2.6.32-358.14.1.el6/linux-2.6.32-358.14.1.el6.x86_64/kernel/sched.c: 6228
0xffffffff810513e0 <__wake_up_common>:  push   %rbp
0xffffffff810513e1 <__wake_up_common+0x1>:      mov    %rsp,%rbp
0xffffffff810513e4 <__wake_up_common+0x4>:      push   %r15
0xffffffff810513e6 <__wake_up_common+0x6>:      push   %r14
0xffffffff810513e8 <__wake_up_common+0x8>:      push   %r13
0xffffffff810513ea <__wake_up_common+0xa>:      push   %r12
0xffffffff810513ec <__wake_up_common+0xc>:      push   %rbx  <<------------ 6th Push
0xffffffff810513ed <__wake_up_common+0xd>:      sub    $0x18,%rsp
0xffffffff810513f1 <__wake_up_common+0x11>:     nopl   0x0(%rax,%rax,1)



    ffff880645423a60: ffff880645423a70 ffffffff81063342 
#15 [ffff880645423a68] default_wake_function at ffffffff81063342
    ffff880645423a70: ffff880645423ab0 ffffffff81197436 
#16 [ffff880645423a78] pollwake at ffffffff81197436
    ffff880645423a80: 0000000000000000 ffff88060eaad500 
    ffff880645423a90: ffffffff81063330 0000000000000000 
    ffff880645423aa0: 0000000000000000 00041220ac3cd1d0 
    ffff880645423ab0: ffff880645423b00 ffffffff81051439 
#17 [ffff880645423ab8] __wake_up_common at ffffffff81051439
    ffff880645423ac0: ffff880c13bc6580 0000000100000001 
    ffff880645423ad0: 0000000000000020 ffff880c16462b58 
                                        ^^^^^^^^^^^^^^--------------------- wait_queue_head_t 6th push 
    ffff880645423ae0: 0000000000000082 0000000000000001 
    ffff880645423af0: 0000000000000001 0000000000000000 
    ffff880645423b00: ffff880645423b40 ffffffff81055ac8 
#18 [ffff880645423b08] __wake_up at ffffffff81055ac8
    ffff880645423b10: ffff880645423c50 ffff880c16462b40 
    ffff880645423b20: ffff880c16462b50 0000000000000082 
    ffff880645423b30: ffff880be9ea8380 ffff88059fbe4f80

crash> wait_queue_head_t ffff880c16462b58 <<-------------- (2)
struct wait_queue_head_t {
  lock = {
    raw_lock = {
      slock = 0xf419f417
    }
  }, 
  task_list = {
    next = 0xffff88059f941b90, 
    prev = 0xffff88059f941b90
  }
}

Yeah, it seems the assumption holds here.

From (1) and (2) 

It seems to me that CPU-2 has acquired the wait-queue spinlock and is waiting on the task running on CPU-3, which in turn is waiting on the same spinlock acquired by CPU-2. This creates a circular wait, which is a deadlock condition.

Thanks 
Gopal

Comment 8 Doug Ledford 2014-11-12 15:29:59 UTC
This doesn't look like an InfiniBand or mlx4 related issue to me.  On cpu2, the mlx4 driver called into the core InfiniBand code, which then called __wake_up(), which took the appropriate wait queue spin lock (A); then we get to where we intend to wake a task up and attempt to take the task lock (B) for the task on the wait queue.

Meanwhile, on cpu3 the java command (the task we are attempting to wake) is actually up and running, evidently its own lock (B) is already held, and it is trying to remove itself from the same wait queue we are trying to wake it from, so it needs to grab lock A.  Classic A<->B deadlock, but the problem is in the waitqueue subsystem, not the InfiniBand subsystem.

Comment 9 Doug Ledford 2014-12-03 21:25:01 UTC
As the problem here appears to be with waitqueue management, and that is most closely related to the scheduler, I changed the kernel subcomponent so this will be brought to the right people's attention.

Comment 10 Larry Woodman 2014-12-04 17:22:54 UTC
I don't think this is a classic A<->B deadlock.  CPU2 took the wait queue lock passed into __wake_up() and wants the task runqueue lock in try_to_wake_up().

CPU3 wants the same wait queue lock in remove_wait_queue() that CPU2 already has, but doesn't own the task runqueue lock that CPU2 wants.  Not sure who owns the task runqueue lock that CPU2 wants...


Larry

Comment 11 Rafael Aquini 2014-12-04 18:30:30 UTC
(In reply to Larry Woodman from comment #10)
> I dont think this is a classic A<->B deadlock.  CPU2 took the wait queue
> lock passed into __wake_up() and wants the task runqueue lock in
> try_to_wake_up().
>

Agreed
 
> CPU3 wants the same wait queue lock in remove_wait_queue() that CPU2 already
> has but doesnt own the task runqueue lock that CPU2 wants.  Not sure who
> owns the task runqueue lock that CPU2 wants...
> 

Theoretically CPU2 would be waiting on CPU 3's rq->lock, which seems to be free:
crash> runq | grep "CPU\ 3"
CPU 3 RUNQUEUE: ffff880028236700
crash> px ((struct rq *)0xffff880028236700)->lock
$5 = {
  raw_lock = {
    slock = 0xda9fda9f
  }
}

I wonder if we could get stuck looping at task_rq_lock() if the task CPU2 is trying to wake up keeps migrating between lines 1201 and 1203 -- I'm not sure how that could happen, though:

 1199         for (;;) {
 1200                 local_irq_save(*flags);
 1201                 rq = task_rq(p);
 1202                 spin_lock(&rq->lock);
 1203                 if (likely(rq == task_rq(p)))
 1204                         return rq;
 1205                 spin_unlock_irqrestore(&rq->lock, *flags);
 1206         }

Comment 12 Larry Woodman 2014-12-04 19:21:51 UTC
I *think* CPU2 wouldn't wait on CPU 3's rq->lock if it is free (unlocked); it would just take the lock and go.  Right?

As for looping in task_rq_lock(), it seems like it couldn't happen forever.  In order for it to loop, the task would have to move from one rq to another across that spin_lock(&rq->lock).  Right?


Larry

Comment 13 Dave Anderson 2014-12-04 19:47:04 UTC
> Theoretically CPU2 would be waiting on CPU's 3 rq->lock, which seems to be free:

Right, and if it were spinning, the exception would be taken in _spin_lock
and not in task_rq_lock() itself.  The RIP is ffffffff81055d5d:

crash> dis task_rq_lock
0xffffffff81055d10 <task_rq_lock>:      push   %rbp
0xffffffff81055d11 <task_rq_lock+0x1>:  mov    %rsp,%rbp
0xffffffff81055d14 <task_rq_lock+0x4>:  sub    $0x20,%rsp
0xffffffff81055d18 <task_rq_lock+0x8>:  mov    %rbx,(%rsp)
0xffffffff81055d1c <task_rq_lock+0xc>:  mov    %r12,0x8(%rsp)
0xffffffff81055d21 <task_rq_lock+0x11>: mov    %r13,0x10(%rsp)
0xffffffff81055d26 <task_rq_lock+0x16>: mov    %r14,0x18(%rsp)
0xffffffff81055d2b <task_rq_lock+0x1b>: nopl   0x0(%rax,%rax,1)
0xffffffff81055d30 <task_rq_lock+0x20>: mov    $0x16700,%rbx
0xffffffff81055d37 <task_rq_lock+0x27>: mov    %rdi,%r12
0xffffffff81055d3a <task_rq_lock+0x2a>: mov    %rsi,%r13
0xffffffff81055d3d <task_rq_lock+0x2d>: pushfq 
0xffffffff81055d3e <task_rq_lock+0x2e>: pop    %rax
0xffffffff81055d3f <task_rq_lock+0x2f>: nopl   0x0(%rax,%rax,1)
0xffffffff81055d44 <task_rq_lock+0x34>: mov    %rax,%rdx
0xffffffff81055d47 <task_rq_lock+0x37>: cli    
0xffffffff81055d48 <task_rq_lock+0x38>: nopw   0x0(%rax,%rax,1)
0xffffffff81055d4e <task_rq_lock+0x3e>: mov    %rdx,0x0(%r13)
0xffffffff81055d52 <task_rq_lock+0x42>: mov    0x8(%r12),%rax
0xffffffff81055d57 <task_rq_lock+0x47>: mov    %rbx,%r14
0xffffffff81055d5a <task_rq_lock+0x4a>: mov    0x18(%rax),%eax
0xffffffff81055d5d <task_rq_lock+0x4d>: add    -0x7e402200(,%rax,8),%r14  <==
0xffffffff81055d65 <task_rq_lock+0x55>: mov    %r14,%rdi
0xffffffff81055d68 <task_rq_lock+0x58>: callq  0xffffffff81510650 <_spin_lock>

Comment 14 Rafael Aquini 2014-12-04 19:53:31 UTC
Dave just beat me here, but I'll update what I was writing anyways :)

(In reply to Larry Woodman from comment #12)
> I *think* CPU2 wouldnt wait on CPU's 3 rq->lock if it is free(unlocked), it
> would just take the lock & go.  Right?
> 
> As for looping in task_rq_lock(), seems like it couldnt happen forever.  In
> order for it to loop the task would have to move from one rq to another over
> that spin_lock(&rq->lock). Right?
> 
Right on both counts, and I know it sounds crazy, but what made me think about a loop around those lines is the fact that we're not (totally) within spin_lock(&rq->lock) yet:

crash> dis task_rq_lock
0xffffffff81055d10 <task_rq_lock>:	push   %rbp
0xffffffff81055d11 <task_rq_lock+0x1>:	mov    %rsp,%rbp
0xffffffff81055d14 <task_rq_lock+0x4>:	sub    $0x20,%rsp
0xffffffff81055d18 <task_rq_lock+0x8>:	mov    %rbx,(%rsp)
0xffffffff81055d1c <task_rq_lock+0xc>:	mov    %r12,0x8(%rsp)
0xffffffff81055d21 <task_rq_lock+0x11>:	mov    %r13,0x10(%rsp)
0xffffffff81055d26 <task_rq_lock+0x16>:	mov    %r14,0x18(%rsp)
0xffffffff81055d2b <task_rq_lock+0x1b>:	nopl   0x0(%rax,%rax,1)
0xffffffff81055d30 <task_rq_lock+0x20>:	mov    $0x16700,%rbx
0xffffffff81055d37 <task_rq_lock+0x27>:	mov    %rdi,%r12
0xffffffff81055d3a <task_rq_lock+0x2a>:	mov    %rsi,%r13
0xffffffff81055d3d <task_rq_lock+0x2d>:	pushfq 
0xffffffff81055d3e <task_rq_lock+0x2e>:	pop    %rax
0xffffffff81055d3f <task_rq_lock+0x2f>:	nopl   0x0(%rax,%rax,1)
0xffffffff81055d44 <task_rq_lock+0x34>:	mov    %rax,%rdx
0xffffffff81055d47 <task_rq_lock+0x37>:	cli    
0xffffffff81055d48 <task_rq_lock+0x38>:	nopw   0x0(%rax,%rax,1)
0xffffffff81055d4e <task_rq_lock+0x3e>:	mov    %rdx,0x0(%r13)
0xffffffff81055d52 <task_rq_lock+0x42>:	mov    0x8(%r12),%rax
0xffffffff81055d57 <task_rq_lock+0x47>:	mov    %rbx,%r14
0xffffffff81055d5a <task_rq_lock+0x4a>:	mov    0x18(%rax),%eax
0xffffffff81055d5d <task_rq_lock+0x4d>:	add    -0x7e402200(,%rax,8),%r14
0xffffffff81055d65 <task_rq_lock+0x55>:	mov    %r14,%rdi
0xffffffff81055d68 <task_rq_lock+0x58>:	callq  0xffffffff81510650 <_spin_lock>


As the exception RIP suggests, we're about to execute task_rq_lock+0x4d, which will give us the rq->lock address -- it's the last instruction of rq = task_rq(p);

From the instruction at task_rq_lock+0x42, which is the point where we read *p->stack, we can infer that struct task_struct *p is in R12, so:
crash> pd ((struct thread_info *)((struct task_struct *)0xffff88060eaad500)->stack)->cpu
$9 = 3

We really were just about to take CPU#3 rq->lock when the watchdog kicked the NMI.


I'll keep looking into this crashdump to see if I can figure something else...

Comment 15 Larry Woodman 2014-12-04 20:05:54 UTC
gopal krishna tiwari, do you know if this is reproducible?  As you can see, we are having a hard time understanding exactly what state the system is in when this hang occurs.

Larry

Comment 16 Rafael Aquini 2014-12-06 01:31:53 UTC
Here goes an attempt to summarize my scavenging session in the crashdump at hand: (Sorry, this is going to be a long post)

- Evidence collected by Gopal & Doug:

a) CPU0 is looping @ flush_tlb_others_ipi() while it waits for CPU2 to acknowledge the issued IPI and clear the flush_cpumask;

b) CPU2 is servicing an IRQ (mlx4_msi_x_interrupt), so it's not handling other hardware maskable interrupts;

c) The soft-lockup watchdog warning issued for CPU0 (a) suggests CPU2 has spent more than 60 seconds at (b);

d) Because of (c), the NMI watchdog panicked the system (hardlockup_panic is set);

e) Task on CPU3 waits for a spinlock that CPU2 is currently holding;


As Larry pointed out in comment #10, the theory of a classic AB-BA deadlock is refutable because, despite the fact that CPU2 is holding the wait_queue lock wanted by CPU3, CPU3 does not hold the lock CPU2 is about to take. Dave and I collected the final evidence proving that CPU2 is about to take CPU3's runqueue lock, which is free -- comment #13 & comment #14. There is also another piece of evidence about the task on CPU3 that reinforces there is no deadlock between CPUs 2 & 3:

f) task at CPU3 appears to have 'just' arrived to run:
crash> bt 0xffff88060eaad500 | head -n1
PID: 28035  TASK: ffff88060eaad500  CPU: 3   COMMAND: "java"

crash> runq | grep "CPU 3"
CPU 3 RUNQUEUE: ffff880028236700

crash> px ((struct rq *)0xffff880028236700)->clock
$12 = 0x141acf9a4aa06
 
crash> px ((struct task_struct *)0xffff88060eaad500)->sched_info.last_arrival
$13 = 0x141acf9a4aa06


Larry also questioned (comment #12) the other theory, which suggested we could be looping around task_rq_lock() (comment #11) and which finds some support in the evidence above (CPU2 does not yet hold, nor spin on, CPU3's runqueue lock). I have to agree with Larry on that one too and refute the second theory: although the evidence is consistent with it, it is a very unlikely scenario, and even if it did happen it would be a transitory condition, as Larry pointed out. Yet we have a real hard lockup on CPU2.

Another possible explanation for the occurrence is that it is a (bad) side effect of a wake-up storm caused by some sort of workload spike, plus the unfortunate coincidence of finding (a), (b) and (e) at the crime scene in the vmcore. With some simple crash + awk sorcery, I could figure out the following:

g) workload is constrained, basically, into 2 cpusets:
$ awk -f ~/helper.awk task-list-cpumask 
TASKS    CPUMASK          CPUS_ALLOWED
------------------------------------------------------------+
645      0xffa   (11, 10, 9, 8, 7, 6, 5, 4, 3, 1)           | user workload
298      0x5     (2, 0)                                     | user workload + irqbalance
------------------------------------------------------------+
24       0x100	 (8)                                        |
1        0x555	 (10, 8, 6, 4, 2, 0) *kswapd0               |
24       0x200	 (9)                                        |
1        0xaaa	 (11, 9, 7, 5, 3, 1) *kswapd1               |
24       0x400	 (10)                                       |
24       0x8	 (3)                                        | kthreads
88       0xfff	 (11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)     | and a few system tasks
24       0x10	 (4)                                        |
24       0x20	 (5)                                        |
24       0x800	 (11)                                       |
24       0x40	 (6)                                        |
27       0x1	 (0)                                        |
24       0x2	 (1)                                        |
24       0x80	 (7)                                        |
24       0x4	 (2)                                        |
------------------------------------------------------------+


h) from Gopal's findings (comment #2), we know that CPU3 was polling file pointer 0xffff880c24bc6440: 
crash> set -c3
    PID: 28035
COMMAND: "java"
   TASK: ffff88060eaad500  [THREAD_INFO: ffff88059f940000]
    CPU: 3
  STATE: TASK_RUNNING (ACTIVE)
crash> files | grep ffff880c24bc6440
191 ffff880c24bc6440 ffff8806bd569380 ffff880c261c7838 REG  anon_inode:/[uverbs-event]


i) and there are 109 tasks sharing the same file pointer across the system:
$ awk -f ~/helper-2.awk taks-list-all-filepointers | awk '/ffff880c24bc6440/ {print}'
109	 ffff880c24bc6440 REG anon_inode:/[uverbs-event]


So, as we can see, it seems that CPU2, while servicing that interrupt, was kept busy for a really long time waking up a big bunch of tasks (i). Although 100+ tasks might not seem like a huge amount of work, I'm inclined to think the system was coping with more of those thundering herds of wake-ups, considering the recorded load average:

crash> sys
      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/2.6.32-358.14.1.el6.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/243587822/crash/vmcore  [PARTIAL DUMP]
        CPUS: 12
        DATE: Thu Jul 10 08:42:16 2014
      UPTIME: 4 days, 02:16:11
LOAD AVERAGE: 97.12, 43.00, 19.05
       TASKS: 1324


as well as the fact that there are several other big groups of shared polling-event inodes across the system:

l) top-20 polling-event shared inodes:
$ awk -f ~/helper-2.awk taks-list-all-filepointers | sort -k1 -r -n | awk '/anon_inode/ {print}' | head -n20
132	 ffff880c23e826c0 REG anon_inode:/[eventpoll]
132	 ffff880c142b4cc0 REG anon_inode:/[eventpoll]
132	 ffff880c047f3380 REG anon_inode:/[eventpoll]
132	 ffff880bf361dbc0 REG anon_inode:/[eventpoll]
132	 ffff880625bf7a40 REG anon_inode:/[uverbs-event]
132	 ffff880625926740 REG anon_inode:/[uverbs-event]
132	 ffff8806240fbb40 REG anon_inode:/[eventpoll]
132	 ffff88061dae2b40 REG anon_inode:/[uverbs-event]
132	 ffff88061dae2900 REG anon_inode:/[uverbs-event]
132	 ffff88060e880b40 REG anon_inode:/[uverbs-event]
132	 ffff8805e8150f00 REG anon_inode:/[uverbs-event]
132	 ffff8805e8150cc0 REG anon_inode:/[uverbs-event]
109	 ffff880c24bc6440 REG anon_inode:/[uverbs-event]
109	 ffff880c1519c300 REG anon_inode:/[eventpoll]
109	 ffff880bf34a4c00 REG anon_inode:/[uverbs-event]
109	 ffff880bf34a43c0 REG anon_inode:/[uverbs-event]
109	 ffff880626ea69c0 REG anon_inode:/[eventpoll]
109	 ffff8806243f4bc0 REG anon_inode:/[uverbs-event]
109	 ffff880624362bc0 REG anon_inode:/[uverbs-event]
109	 ffff880624362800 REG anon_inode:/[uverbs-event]


Considering all of the above, I'd say this is probably not a bug, but a workload issue that might be addressed via sizing/tuning. However, as Larry also noted, we need to know whether this is reproducible or was an isolated event.


-- Rafael

Comment 17 Li Wang 2014-12-10 05:51:46 UTC
(In reply to Larry Woodman from comment #15)
> gopal krishna tiwari, do you know if this is reproducible?  As you can see
> we are having a haed time understanding exaclty what state the system is in
> when this hang occurs.

I have the same question. If this is a reproducible issue, how can it be reproduced?

