Description of problem:

Kernel hanging on hotplug operations. The task turning off a CPU hangs in the following stack:

crash> bt 5654
PID: 5654   TASK: ffff880620cb98c0  CPU: 4   COMMAND: "S98bounce-nohz"
 #0 [ffff88032431b9b8] __schedule at ffffffff815f1a7c
 #1 [ffff88032431ba50] schedule at ffffffff815f2194
 #2 [ffff88032431ba70] schedule_timeout at ffffffff815f0975
 #3 [ffff88032431bae0] wait_for_completion at ffffffff815f258f
 #4 [ffff88032431bb40] wait_rcu_gp at ffffffff81084b59
 #5 [ffff88032431bb90] synchronize_rcu at ffffffff81108739
 #6 [ffff88032431bba0] perf_pmu_migrate_context at ffffffff8113b895
 #7 [ffff88032431bc30] uncore_change_context at ffffffff81023b7a
 #8 [ffff88032431bc80] uncore_cpu_notifier at ffffffff81023f39
 #9 [ffff88032431bcd0] notifier_call_chain at ffffffff815f8525
#10 [ffff88032431bd10] __raw_notifier_call_chain at ffffffff8108f0fe
#11 [ffff88032431bd20] __cpu_notify at ffffffff81060bf0
#12 [ffff88032431bd30] _cpu_down at ffffffff815ea386
#13 [ffff88032431be00] cpu_down at ffffffff815ea728
#14 [ffff88032431be20] cpu_subsys_offline at ffffffff813c3ed4
#15 [ffff88032431be30] device_offline at ffffffff813bd15d
#16 [ffff88032431be60] store_online at ffffffff813be448
#17 [ffff88032431be90] dev_attr_store at ffffffff813bc0d0
#18 [ffff88032431bea0] sysfs_write_file at ffffffff8122adc5
#19 [ffff88032431bef0] vfs_write at ffffffff811abbdf
#20 [ffff88032431bf20] sys_write at ffffffff811ac690
#21 [ffff88032431bf80] tracesys at ffffffff815fc568 (via system_call)
    RIP: 00007fb8f477e6d0  RSP: 00007ffd00b9f4e8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff815fc568  RCX: ffffffffffffffff
    RDX: 0000000000000002  RSI: 00007fb8f5072000  RDI: 0000000000000001
    RBP: 00007fb8f5072000   R8: 000000000000000a   R9: 00007fb8f5064700
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000001
    R13: 00007ffd00b9f59c  R14: 0000000000000002  R15: 00007fb8f4a32040
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

Version-Release number of selected component (if applicable):
3.10.0-327.rt56.183.el6rt.x86_64

How reproducible:
Often

Steps to Reproduce:
1. Disable some CPUs at boot, re-enabling them afterwards. That is a hack to move all timers away from the isolated CPUs.

Actual results:
The task that is disabling the CPU hangs.

Expected results:
No problems on hotplug operations.

Additional info:
I will post some findings from a crash dump analysis.
The oldest UN processes:

[1 10:40:50.067] [UN]  PID: 607    TASK: ffff880623df0000  CPU: 4  COMMAND: "kworker/4:1"
[1 10:40:50.398] [UN]  PID: 5709   TASK: ffff880622b718c0  CPU: 3  COMMAND: "sync_unplug/3"
[1 10:40:50.398] [UN]  PID: 5654   TASK: ffff880620cb98c0  CPU: 4  COMMAND: "S98bounce-nohz"

The kworker is waiting on an rt_mutex at:

crash> bt -f 607
PID: 607    TASK: ffff880623df0000  CPU: 4   COMMAND: "kworker/4:1"
 #0 [ffff88062383fb98] __schedule at ffffffff815f1a7c
    ffff88062383fba0: 0000000000000002 ffff88062383fc48
    ffff88062383fbb0: ffff88062383c000 ffff88062383c010
    ffff88062383fbc0: ffff88062383c000 ffff88062383ffd8
    ffff88062383fbd0: ffff88062383c010 ffff88062383c000
    ffff88062383fbe0: ffff88062383c000 ffff8806242618c0
    ffff88062383fbf0: ffff880623df0000 ffff88062383fd28
    ffff88062383fc00: ffff880623df0000 0000000000000000
    ffff88062383fc10: ffff88062383c000 ffff88062383c010
    ffff88062383fc20: ffff880623df0000 ffff88062383fc48
    ffff88062383fc30: ffffffff815f2194
 #1 [ffff88062383fc30] schedule at ffffffff815f2194
    ffff88062383fc38: ffff88062383c010 -----> ffffffff8191e588 <----- rt mutex
    ffff88062383fc48: ffff88062383fcc8 ffffffff815f2e85
 #2 [ffff88062383fc50] __rt_mutex_slowlock at ffffffff815f2e85
    ffff88062383fc58: 0000000000000000 ffffffff8191e5b0
    ffff88062383fc68: 0000000000000002 ffff88062383fce8
    ffff88062383fc78: 0000000200000246 ffff880623df0000
    ffff88062383fc88: 0000000000000292 0000000000000000
    ffff88062383fc98: ffff88062383fca8 ffffffff8191e588
    ffff88062383fca8: 0000000000000000 0000000000000000
    ffff88062383fcb8: ffff88062383fce8 0000000000000002
    ffff88062383fcc8: ffff88062383fd88 ffffffff815f356e
 #3 [ffff88062383fcd0] rt_mutex_slowlock at ffffffff815f356e

The owner of the mutex is:

crash> px ((struct rt_mutex *)0xffffffff8191e588)->owner
$1 = (struct task_struct *) 0xffff880620cb98c1

Note that the low bit of the owner field is the rt_mutex "has waiters" flag; masking it off gives the owner task at 0xffff880620cb98c0:

crash> ps 0xffff880620cb98c0
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
   5654   4304   4  ffff880620cb98c0  UN   0.0  108496   1900  S98bounce-nohz

The next task is sync_unplug/3:

crash> bt 5709
PID: 5709   TASK: ffff880622b718c0  CPU: 3   COMMAND: "sync_unplug/3"
 #0 [ffff8806235cfcf8] __schedule at ffffffff815f1a7c
 #1 [ffff8806235cfd90] schedule at ffffffff815f2194
 #2 [ffff8806235cfdb0] schedule_timeout at ffffffff815f0975
 #3 [ffff8806235cfe20] wait_for_completion at ffffffff815f258f
 #4 [ffff8806235cfe80] sync_unplug_thread at ffffffff8106150e <--- waiting here
 #5 [ffff8806235cfec0] kthread at ffffffff810886fe
 #6 [ffff8806235cff50] ret_from_fork at ffffffff815fc2c8

That translates to the following asm:

crash> dis -l ffffffff8106150e
/usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/arch/x86/include/asm/thread_info.h: 219
0xffffffff8106150e <sync_unplug_thread+30>:     mov %gs:0xb908,%rax

crash> dis sync_unplug_thread
0xffffffff810614f0 <sync_unplug_thread>:        push %rbp
0xffffffff810614f1 <sync_unplug_thread+1>:      mov %rsp,%rbp
0xffffffff810614f4 <sync_unplug_thread+4>:      push %r13
0xffffffff810614f6 <sync_unplug_thread+6>:      push %r12
0xffffffff810614f8 <sync_unplug_thread+8>:      push %rbx
0xffffffff810614f9 <sync_unplug_thread+9>:      sub $0x18,%rsp
0xffffffff810614fd <sync_unplug_thread+13>:     data32 data32 data32 xchg %ax,%ax
0xffffffff81061502 <sync_unplug_thread+18>:     mov %rdi,%rbx
0xffffffff81061505 <sync_unplug_thread+21>:     lea 0x38(%rdi),%rdi
0xffffffff81061509 <sync_unplug_thread+25>:     callq 0xffffffff815f24e0 <wait_for_completion>  <---- Here.
0xffffffff8106150e <sync_unplug_thread+30>:     mov %gs:0xb908,%rax  <---- The saved RIP points here because it is the return address of the call.

In the source, that corresponds to:

--------------- %< ---------------------
static int sync_unplug_thread(void *data)
{
	struct hotplug_pcp *hp = data;

	wait_for_completion(&hp->unplug_wait);  <---- here
	preempt_disable();
	hp->unplug = current;
	wait_for_pinned_cpus(hp);

	/*
	 * This thread will synchronize the cpu_down() with threads
	 * that have pinned the CPU. When the pinned CPU count reaches
	 * zero, we inform the cpu_down code to continue to the next step.
	 */
	set_current_state(TASK_UNINTERRUPTIBLE);
	preempt_enable();
	complete(&hp->synced);
--------------- >% ---------------------

The task is waiting to be awakened by the following function:

--------------- %< ---------------------
static void __cpu_unplug_wait(unsigned int cpu)
{
	struct hotplug_pcp *hp = &per_cpu(hotplug_pcp, cpu);

	complete(&hp->unplug_wait);
	wait_for_completion(&hp->synced);
}
--------------- >% ---------------------

This function will be called by S98bounce-nohz:

--------------- %< ---------------------
static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
{
	int mycpu, err, nr_calls = 0;
	void *hcpu = (void *)(long)cpu;
	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
	struct take_cpu_down_param tcd_param = {
		.mod = mod,
		.hcpu = hcpu,
	};
	cpumask_var_t cpumask;
	cpumask_var_t cpumask_org;

	if (num_online_cpus() == 1)
		return -EBUSY;

	if (!cpu_online(cpu))
		return -EINVAL;

	/* Move the downtaker off the unplug cpu */
	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
		return -ENOMEM;
	if (!alloc_cpumask_var(&cpumask_org, GFP_KERNEL)) {
		free_cpumask_var(cpumask);
		return -ENOMEM;
	}

	cpumask_copy(cpumask_org, tsk_cpus_allowed(current));
	cpumask_andnot(cpumask, cpu_online_mask, cpumask_of(cpu));
	set_cpus_allowed_ptr(current, cpumask);
	free_cpumask_var(cpumask);
	migrate_disable();
	mycpu = smp_processor_id();
	if (mycpu == cpu) {
		printk(KERN_ERR "Yuck! Still on unplug CPU\n!");
		migrate_enable();
		err = -EBUSY;
		goto restore_cpus;
	}
	migrate_enable();

	cpu_hotplug_begin();
	err = cpu_unplug_begin(cpu);
	if (err) {
		printk("cpu_unplug_begin(%d) failed\n", cpu);
		goto out_cancel;
	}

	err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);  <---- S98bounce-nohz is here
	if (err) {
		nr_calls--;
		__cpu_notify(CPU_DOWN_FAILED | mod, hcpu, nr_calls, NULL);
		printk("%s: attempt to take down CPU %u failed\n",
		       __func__, cpu);
		goto out_release;
	}

	/*
	 * By now we've cleared cpu_active_mask, wait for all preempt-disabled
	 * and RCU users of this state to go away such that all new such users
	 * will observe it.
	 *
	 * For CONFIG_PREEMPT we have preemptible RCU and its sync_rcu() might
	 * not imply sync_sched(), so explicitly call both.
	 *
	 * Do sync before park smpboot threads to take care the rcu boost case.
	 */
#ifdef CONFIG_PREEMPT
	synchronize_sched();
#endif
	synchronize_rcu();

	__cpu_unplug_wait(cpu);  <------------------- wake up sync_unplug/3
--------------- >% ---------------------

So sync_unplug/3 is also waiting for S98bounce-nohz to make progress.
Here is the S98bounce-nohz stack trace:

crash> bt 5654
PID: 5654   TASK: ffff880620cb98c0  CPU: 4   COMMAND: "S98bounce-nohz"
 #0 [ffff88032431b9b8] __schedule at ffffffff815f1a7c
 #1 [ffff88032431ba50] schedule at ffffffff815f2194
 #2 [ffff88032431ba70] schedule_timeout at ffffffff815f0975
 #3 [ffff88032431bae0] wait_for_completion at ffffffff815f258f
 #4 [ffff88032431bb40] wait_rcu_gp at ffffffff81084b59
 #5 [ffff88032431bb90] synchronize_rcu at ffffffff81108739
 #6 [ffff88032431bba0] perf_pmu_migrate_context at ffffffff8113b895
 #7 [ffff88032431bc30] uncore_change_context at ffffffff81023b7a
 #8 [ffff88032431bc80] uncore_cpu_notifier at ffffffff81023f39
 #9 [ffff88032431bcd0] notifier_call_chain at ffffffff815f8525
#10 [ffff88032431bd10] __raw_notifier_call_chain at ffffffff8108f0fe
#11 [ffff88032431bd20] __cpu_notify at ffffffff81060bf0
#12 [ffff88032431bd30] _cpu_down at ffffffff815ea386  <------ The function above
#13 [ffff88032431be00] cpu_down at ffffffff815ea728
#14 [ffff88032431be20] cpu_subsys_offline at ffffffff813c3ed4
#15 [ffff88032431be30] device_offline at ffffffff813bd15d
#16 [ffff88032431be60] store_online at ffffffff813be448
#17 [ffff88032431be90] dev_attr_store at ffffffff813bc0d0
#18 [ffff88032431bea0] sysfs_write_file at ffffffff8122adc5
#19 [ffff88032431bef0] vfs_write at ffffffff811abbdf
#20 [ffff88032431bf20] sys_write at ffffffff811ac690
#21 [ffff88032431bf80] tracesys at ffffffff815fc568 (via system_call)
    RIP: 00007fb8f477e6d0  RSP: 00007ffd00b9f4e8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff815fc568  RCX: ffffffffffffffff
    RDX: 0000000000000002  RSI: 00007fb8f5072000  RDI: 0000000000000001
    RBP: 00007fb8f5072000   R8: 000000000000000a   R9: 00007fb8f5064700
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000001
    R13: 00007ffd00b9f59c  R14: 0000000000000002  R15: 00007fb8f4a32040
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

S98bounce-nohz is stuck in this function:

/**
 * synchronize_rcu - wait until a grace period has elapsed.
 *
 * Control will return to the caller some time after a full grace
 * period has elapsed, in other words after all currently executing RCU
 * read-side critical sections have completed. Note, however, that
 * upon return from synchronize_rcu(), the caller might well be executing
 * concurrently with new RCU read-side critical sections that began while
 * synchronize_rcu() was waiting. RCU read-side critical sections are
 * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
 *
 * See the description of synchronize_sched() for more detailed information
 * on memory ordering guarantees.
 */
void synchronize_rcu(void)
{
	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
			   !lock_is_held(&rcu_lock_map) &&
			   !lock_is_held(&rcu_sched_lock_map),
			   "Illegal synchronize_rcu() in RCU read-side critical section");
	if (!rcu_scheduler_active)
		return;
	if (rcu_expedited)
		synchronize_rcu_expedited();
	else
		wait_rcu_gp(call_rcu);  <----- HERE
}

This is implemented with the following logic:

void wait_rcu_gp(call_rcu_func_t crf)
{
	struct rcu_synchronize rcu;           <--- the RCU callback definition + completion

	init_rcu_head_on_stack(&rcu.head);    <--- initialize the callback
	init_completion(&rcu.completion);     <--- initialize the completion
	/* Will wake me after RCU finished. */
	crf(&rcu.head, wakeme_after_rcu);     <--- schedule an RCU callback that calls wakeme_after_rcu()
	/* Wait for it. */
	wait_for_completion(&rcu.completion); <--- wait for the completion
	destroy_rcu_head_on_stack(&rcu.head);
}

wakeme_after_rcu() is invoked as an RCU callback. It just fires the completion to wake up the waiter:

/*
 * Awaken the corresponding synchronize_rcu() instance now that a
 * grace period has elapsed.
 */
static void wakeme_after_rcu(struct rcu_head *head)
{
	struct rcu_synchronize *rcu;

	rcu = container_of(head, struct rcu_synchronize, head);
	complete(&rcu->completion);
}

As RCU callbacks run after the grace period (GP), the way to wait for a GP is to schedule an RCU callback and wait for it to run.
So S98bounce-nohz is waiting for an RCU callback to run, and the problem is that the callback is not running. The first guess would be an RCU stall taking place, delaying the RCU callback. But there are no RCU stall messages! The next step is to determine why the RCU callback did not run - that is what is holding up the system.
Created attachment 1204264 [details]
RPM to install the boot script that causes the hotplug bug
Created attachment 1212181 [details]
Boot script that causes the problem (extracted from the RPM)
Looking at a core from a kernel dump of this bug, it is from the debug kernel. Does this happen on a non-debug kernel, and if so, can we get a core file from that?
Created attachment 1214621 [details]
Boot script that causes the problem (extracted from the RPM) - the correct one
The customer reported that the problem was fixed in the latest kernel. -- Daniel
Bhu,

We can close this BZ.

-- Daniel
Thanks, Daniel. Closing as CurrentRelease.