Description of problem:

Kernel hanging on hotplug operations. The task turning off a CPU hangs in the following stack:

crash> bt 5654
PID: 5654   TASK: ffff880620cb98c0  CPU: 4   COMMAND: "S98bounce-nohz"
 #0 [ffff88032431b9b8] __schedule at ffffffff815f1a7c
 #1 [ffff88032431ba50] schedule at ffffffff815f2194
 #2 [ffff88032431ba70] schedule_timeout at ffffffff815f0975
 #3 [ffff88032431bae0] wait_for_completion at ffffffff815f258f
 #4 [ffff88032431bb40] wait_rcu_gp at ffffffff81084b59
 #5 [ffff88032431bb90] synchronize_rcu at ffffffff81108739
 #6 [ffff88032431bba0] perf_pmu_migrate_context at ffffffff8113b895
 #7 [ffff88032431bc30] uncore_change_context at ffffffff81023b7a
 #8 [ffff88032431bc80] uncore_cpu_notifier at ffffffff81023f39
 #9 [ffff88032431bcd0] notifier_call_chain at ffffffff815f8525
#10 [ffff88032431bd10] __raw_notifier_call_chain at ffffffff8108f0fe
#11 [ffff88032431bd20] __cpu_notify at ffffffff81060bf0
#12 [ffff88032431bd30] _cpu_down at ffffffff815ea386
#13 [ffff88032431be00] cpu_down at ffffffff815ea728
#14 [ffff88032431be20] cpu_subsys_offline at ffffffff813c3ed4
#15 [ffff88032431be30] device_offline at ffffffff813bd15d
#16 [ffff88032431be60] store_online at ffffffff813be448
#17 [ffff88032431be90] dev_attr_store at ffffffff813bc0d0
#18 [ffff88032431bea0] sysfs_write_file at ffffffff8122adc5
#19 [ffff88032431bef0] vfs_write at ffffffff811abbdf
#20 [ffff88032431bf20] sys_write at ffffffff811ac690
#21 [ffff88032431bf80] tracesys at ffffffff815fc568 (via system_call)
    RIP: 00007fb8f477e6d0  RSP: 00007ffd00b9f4e8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff815fc568  RCX: ffffffffffffffff
    RDX: 0000000000000002  RSI: 00007fb8f5072000  RDI: 0000000000000001
    RBP: 00007fb8f5072000   R8: 000000000000000a   R9: 00007fb8f5064700
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000001
    R13: 00007ffd00b9f59c  R14: 0000000000000002  R15: 00007fb8f4a32040
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

Version-Release number of selected component (if applicable):
3.10.0-327.rt56.183.el6rt.x86_64

How reproducible:
Often

Steps to Reproduce:
1. Disable some CPUs at boot, re-enabling them afterwards. That is a hack to move all timers away from the isolated CPUs.

Actual results:
The task that is disabling the CPU hangs.

Expected results:
No problems on hotplug operations.

Additional info:
I will post some findings from a crash dump analysis.
The oldest UN processes:

[1 10:40:50.067] [UN]  PID: 607    TASK: ffff880623df0000  CPU: 4  COMMAND: "kworker/4:1"
[1 10:40:50.398] [UN]  PID: 5709   TASK: ffff880622b718c0  CPU: 3  COMMAND: "sync_unplug/3"
[1 10:40:50.398] [UN]  PID: 5654   TASK: ffff880620cb98c0  CPU: 4  COMMAND: "S98bounce-nohz"

The kworker is waiting on an rt_mutex at:

crash> bt -f 607
PID: 607    TASK: ffff880623df0000  CPU: 4   COMMAND: "kworker/4:1"
 #0 [ffff88062383fb98] __schedule at ffffffff815f1a7c
    ffff88062383fba0: 0000000000000002 ffff88062383fc48
    ffff88062383fbb0: ffff88062383c000 ffff88062383c010
    ffff88062383fbc0: ffff88062383c000 ffff88062383ffd8
    ffff88062383fbd0: ffff88062383c010 ffff88062383c000
    ffff88062383fbe0: ffff88062383c000 ffff8806242618c0
    ffff88062383fbf0: ffff880623df0000 ffff88062383fd28
    ffff88062383fc00: ffff880623df0000 0000000000000000
    ffff88062383fc10: ffff88062383c000 ffff88062383c010
    ffff88062383fc20: ffff880623df0000 ffff88062383fc48
    ffff88062383fc30: ffffffff815f2194
 #1 [ffff88062383fc30] schedule at ffffffff815f2194
    ffff88062383fc38: ffff88062383c010 -----> ffffffff8191e588 <----- rt mutex
    ffff88062383fc48: ffff88062383fcc8 ffffffff815f2e85
 #2 [ffff88062383fc50] __rt_mutex_slowlock at ffffffff815f2e85
    ffff88062383fc58: 0000000000000000 ffffffff8191e5b0
    ffff88062383fc68: 0000000000000002 ffff88062383fce8
    ffff88062383fc78: 0000000200000246 ffff880623df0000
    ffff88062383fc88: 0000000000000292 0000000000000000
    ffff88062383fc98: ffff88062383fca8 ffffffff8191e588
    ffff88062383fca8: 0000000000000000 0000000000000000
    ffff88062383fcb8: ffff88062383fce8 0000000000000002
    ffff88062383fcc8: ffff88062383fd88 ffffffff815f356e
 #3 [ffff88062383fcd0] rt_mutex_slowlock at ffffffff815f356e

The owner of the mutex is:

crash> px ((struct rt_mutex *)0xffffffff8191e588)->owner
$1 = (struct task_struct *) 0xffff880620cb98c1

Note that the low bit of the owner field is the rt_mutex "has waiters" flag; masking it off gives the owner task at 0xffff880620cb98c0:

crash> ps 0xffff880620cb98c0
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
   5654   4304   4  ffff880620cb98c0  UN   0.0  108496   1900  S98bounce-nohz

The next task is sync_unplug/3:

crash> bt 5709
PID: 5709   TASK: ffff880622b718c0  CPU: 3   COMMAND: "sync_unplug/3"
 #0 [ffff8806235cfcf8] __schedule at ffffffff815f1a7c
 #1 [ffff8806235cfd90] schedule at ffffffff815f2194
 #2 [ffff8806235cfdb0] schedule_timeout at ffffffff815f0975
 #3 [ffff8806235cfe20] wait_for_completion at ffffffff815f258f
 #4 [ffff8806235cfe80] sync_unplug_thread at ffffffff8106150e <--- waiting here
 #5 [ffff8806235cfec0] kthread at ffffffff810886fe
 #6 [ffff8806235cff50] ret_from_fork at ffffffff815fc2c8

That translates to the following asm:

crash> dis -l ffffffff8106150e
/usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/arch/x86/include/asm/thread_info.h: 219
0xffffffff8106150e <sync_unplug_thread+30>:     mov %gs:0xb908,%rax

crash> dis sync_unplug_thread
0xffffffff810614f0 <sync_unplug_thread>:        push %rbp
0xffffffff810614f1 <sync_unplug_thread+1>:      mov %rsp,%rbp
0xffffffff810614f4 <sync_unplug_thread+4>:      push %r13
0xffffffff810614f6 <sync_unplug_thread+6>:      push %r12
0xffffffff810614f8 <sync_unplug_thread+8>:      push %rbx
0xffffffff810614f9 <sync_unplug_thread+9>:      sub $0x18,%rsp
0xffffffff810614fd <sync_unplug_thread+13>:     data32 data32 data32 xchg %ax,%ax
0xffffffff81061502 <sync_unplug_thread+18>:     mov %rdi,%rbx
0xffffffff81061505 <sync_unplug_thread+21>:     lea 0x38(%rdi),%rdi
0xffffffff81061509 <sync_unplug_thread+25>:     callq 0xffffffff815f24e0 <wait_for_completion>  <---- Here.
0xffffffff8106150e <sync_unplug_thread+30>:     mov %gs:0xb908,%rax  <---- The saved RIP points here because it is the return address of the call.

In the source, that corresponds to:

--------------- %< ---------------------
static int sync_unplug_thread(void *data)
{
	struct hotplug_pcp *hp = data;

	wait_for_completion(&hp->unplug_wait);  <---- here
	preempt_disable();
	hp->unplug = current;
	wait_for_pinned_cpus(hp);

	/*
	 * This thread will synchronize the cpu_down() with threads
	 * that have pinned the CPU. When the pinned CPU count reaches
	 * zero, we inform the cpu_down code to continue to the next step.
	 */
	set_current_state(TASK_UNINTERRUPTIBLE);
	preempt_enable();
	complete(&hp->synced);
--------------- >% ---------------------

The task is waiting to be awakened by the following function:

--------------- %< ---------------------
static void __cpu_unplug_wait(unsigned int cpu)
{
	struct hotplug_pcp *hp = &per_cpu(hotplug_pcp, cpu);

	complete(&hp->unplug_wait);
	wait_for_completion(&hp->synced);
}
--------------- >% ---------------------

This function will be called by S98bounce-nohz:

--------------- %< ---------------------
static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
{
	int mycpu, err, nr_calls = 0;
	void *hcpu = (void *)(long)cpu;
	unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
	struct take_cpu_down_param tcd_param = {
		.mod = mod,
		.hcpu = hcpu,
	};
	cpumask_var_t cpumask;
	cpumask_var_t cpumask_org;

	if (num_online_cpus() == 1)
		return -EBUSY;

	if (!cpu_online(cpu))
		return -EINVAL;

	/* Move the downtaker off the unplug cpu */
	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
		return -ENOMEM;
	if (!alloc_cpumask_var(&cpumask_org, GFP_KERNEL)) {
		free_cpumask_var(cpumask);
		return -ENOMEM;
	}

	cpumask_copy(cpumask_org, tsk_cpus_allowed(current));
	cpumask_andnot(cpumask, cpu_online_mask, cpumask_of(cpu));
	set_cpus_allowed_ptr(current, cpumask);
	free_cpumask_var(cpumask);
	migrate_disable();
	mycpu = smp_processor_id();
	if (mycpu == cpu) {
		printk(KERN_ERR "Yuck! Still on unplug CPU\n!");
		migrate_enable();
		err = -EBUSY;
		goto restore_cpus;
	}
	migrate_enable();

	cpu_hotplug_begin();
	err = cpu_unplug_begin(cpu);
	if (err) {
		printk("cpu_unplug_begin(%d) failed\n", cpu);
		goto out_cancel;
	}

	err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);  <---- S98bounce-nohz is here
	if (err) {
		nr_calls--;
		__cpu_notify(CPU_DOWN_FAILED | mod, hcpu, nr_calls, NULL);
		printk("%s: attempt to take down CPU %u failed\n",
		       __func__, cpu);
		goto out_release;
	}

	/*
	 * By now we've cleared cpu_active_mask, wait for all preempt-disabled
	 * and RCU users of this state to go away such that all new such users
	 * will observe it.
	 *
	 * For CONFIG_PREEMPT we have preemptible RCU and its sync_rcu() might
	 * not imply sync_sched(), so explicitly call both.
	 *
	 * Do sync before park smpboot threads to take care the rcu boost case.
	 */
#ifdef CONFIG_PREEMPT
	synchronize_sched();
#endif
	synchronize_rcu();

	__cpu_unplug_wait(cpu);  <------------------- wake up sync_unplug/3
--------------- >% ---------------------

So sync_unplug/3 is also waiting for S98bounce-nohz to make progress.
Here is the S98bounce-nohz stack trace:

crash> bt 5654
PID: 5654   TASK: ffff880620cb98c0  CPU: 4   COMMAND: "S98bounce-nohz"
 #0 [ffff88032431b9b8] __schedule at ffffffff815f1a7c
 #1 [ffff88032431ba50] schedule at ffffffff815f2194
 #2 [ffff88032431ba70] schedule_timeout at ffffffff815f0975
 #3 [ffff88032431bae0] wait_for_completion at ffffffff815f258f
 #4 [ffff88032431bb40] wait_rcu_gp at ffffffff81084b59
 #5 [ffff88032431bb90] synchronize_rcu at ffffffff81108739
 #6 [ffff88032431bba0] perf_pmu_migrate_context at ffffffff8113b895
 #7 [ffff88032431bc30] uncore_change_context at ffffffff81023b7a
 #8 [ffff88032431bc80] uncore_cpu_notifier at ffffffff81023f39
 #9 [ffff88032431bcd0] notifier_call_chain at ffffffff815f8525
#10 [ffff88032431bd10] __raw_notifier_call_chain at ffffffff8108f0fe
#11 [ffff88032431bd20] __cpu_notify at ffffffff81060bf0
#12 [ffff88032431bd30] _cpu_down at ffffffff815ea386  <------ The function above
#13 [ffff88032431be00] cpu_down at ffffffff815ea728
#14 [ffff88032431be20] cpu_subsys_offline at ffffffff813c3ed4
#15 [ffff88032431be30] device_offline at ffffffff813bd15d
#16 [ffff88032431be60] store_online at ffffffff813be448
#17 [ffff88032431be90] dev_attr_store at ffffffff813bc0d0
#18 [ffff88032431bea0] sysfs_write_file at ffffffff8122adc5
#19 [ffff88032431bef0] vfs_write at ffffffff811abbdf
#20 [ffff88032431bf20] sys_write at ffffffff811ac690
#21 [ffff88032431bf80] tracesys at ffffffff815fc568 (via system_call)
    RIP: 00007fb8f477e6d0  RSP: 00007ffd00b9f4e8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff815fc568  RCX: ffffffffffffffff
    RDX: 0000000000000002  RSI: 00007fb8f5072000  RDI: 0000000000000001
    RBP: 00007fb8f5072000   R8: 000000000000000a   R9: 00007fb8f5064700
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000001
    R13: 00007ffd00b9f59c  R14: 0000000000000002  R15: 00007fb8f4a32040
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

S98bounce-nohz is stuck in this function:

/**
 * synchronize_rcu - wait until a grace period has elapsed.
 *
 * Control will return to the caller some time after a full grace
 * period has elapsed, in other words after all currently executing RCU
 * read-side critical sections have completed. Note, however, that
 * upon return from synchronize_rcu(), the caller might well be executing
 * concurrently with new RCU read-side critical sections that began while
 * synchronize_rcu() was waiting. RCU read-side critical sections are
 * delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
 *
 * See the description of synchronize_sched() for more detailed information
 * on memory ordering guarantees.
 */
void synchronize_rcu(void)
{
	rcu_lockdep_assert(!lock_is_held(&rcu_bh_lock_map) &&
			   !lock_is_held(&rcu_lock_map) &&
			   !lock_is_held(&rcu_sched_lock_map),
			   "Illegal synchronize_rcu() in RCU read-side critical section");
	if (!rcu_scheduler_active)
		return;
	if (rcu_expedited)
		synchronize_rcu_expedited();
	else
		wait_rcu_gp(call_rcu);  <----- HERE
}

This is implemented with the following logic:

void wait_rcu_gp(call_rcu_func_t crf)
{
	struct rcu_synchronize rcu;           <--- the RCU callback definition + completion

	init_rcu_head_on_stack(&rcu.head);    <--- initialize the callback
	init_completion(&rcu.completion);     <--- initialize the completion
	/* Will wake me after RCU finished. */
	crf(&rcu.head, wakeme_after_rcu);     <--- schedule an RCU callback that calls wakeme_after_rcu()
	/* Wait for it. */
	wait_for_completion(&rcu.completion); <--- wait for the completion
	destroy_rcu_head_on_stack(&rcu.head);
}

wakeme_after_rcu() is invoked as an RCU callback. It just fires the completion to wake up the waiter:

/*
 * Awaken the corresponding synchronize_rcu() instance now that a
 * grace period has elapsed.
 */
static void wakeme_after_rcu(struct rcu_head *head)
{
	struct rcu_synchronize *rcu;

	rcu = container_of(head, struct rcu_synchronize, head);
	complete(&rcu->completion);
}

As RCU callbacks run after the grace period (GP), the way to wait for a GP is to schedule an RCU callback and wait for it to run.
So S98bounce-nohz is waiting for an RCU callback to run, and the problem is that the callback is not running. The first guess would be an RCU stall taking place, delaying the RCU callback. But there are no RCU stall messages! The next step is to determine why the RCU callback did not run - that is what is holding up the system.
Created attachment 1204264 [details]
RPM to install the boot script that causes the hotplug bug
Created attachment 1212181 [details]
Boot script that causes the problem (extracted from the RPM)
Looking at a core from a kernel dump of this bug, it is from the debug kernel. Does this happen on a non-debug kernel, and if so, can we get a core file from that?
Created attachment 1214621 [details]
Boot script that causes the problem (extracted from the RPM) - the correct one
The customer reported that the problem was fixed in the latest kernel. -- Daniel
Bhu,

We can close this BZ.

-- Daniel
Thanks, Daniel. Closing as CurrentRelease.