Bug 1303733 - backport of: "softirq: split timer softirqs out of ksoftirqd"
backport of: "softirq: split timer softirqs out of ksoftirqd"
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-rt (Show other bugs)
x86_64 Linux
high Severity medium
: rc
: 7.3
Assigned To: Clark Williams
Jiri Kastner
: ZStream
Depends On:
Blocks: 1203710 1274397 1306962 1313435
  Show dependency treegraph
Reported: 2016-02-01 14:45 EST by Daniel Bristot de Oliveira
Modified: 2016-11-03 15:41 EDT (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1306962 1313435 (view as bug list)
Last Closed: 2016-11-03 15:41:42 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)

  None (edit)
Description Daniel Bristot de Oliveira 2016-02-01 14:45:29 EST
Description of problem:

Backport of the patch:
"softirq: split timer softirqs out of ksoftirqd"

"The softirqd runs in -RT with SCHED_FIFO (prio 1) and deals mostly with
timer wakeup which can not happen in hardirq context. The prio has been
risen from the normal SCHED_OTHER so the timer wakeup does not happen
too late.
With enough networking load it is possible that the system never goes
idle and schedules ksoftirqd and everything else with a higher priority.
One of the tasks left behind is one of RCU's threads and so we see stalls
and eventually run out of memory.
This patch moves the TIMER and HRTIMER softirqs out of the `ksoftirqd`
thread into its own `ktimersoftd`. The former can now run SCHED_OTHER
(same as mainline) and the latter at SCHED_FIFO due to the wakeups.

From networking point of view: The NAPI callback runs after the network
interrupt thread completes. If its run time takes too long the NAPI code
itself schedules the `ksoftirqd`. Here in the thread it can run at
SCHED_OTHER priority and it won't defer RCU anymore."

Another motivation for this backport comes from customers using
spinning rt tasks. I will post an RCA analysis of a rcu stall
that can be avoided if the timer softirq can receive a higher
priority without cause the interference of all softirqs.
Comment 2 Daniel Bristot de Oliveira 2016-02-01 14:59:04 EST
A customer faced a RCU stall because a the timer softirq was preempted
by a spinning RT thread. As the ksoftirq currently runs network load,
it is not a good idea to run it on a high RT priority. On the other
hand, the delay on delivering a timer, or a long preemption of the
timer softirq while it holds a lock can cause damage on system's
operation. Being able to increase the priority of the timer
softirq can be the best solution for these potential problems.

Here is the RCA of a RCU stall caused by the preemption of the
timer softirq, which can be avoided by increasing timer's
softirq thread priority.

==== RCA of a customer's support case ====
The system complained about a RCU Stall, and an hung task - the rcu_preempt.

The rcu_preempt task is sleeping on UN state for 00:02:26.544:

crash> ps -m 14
[0 00:02:26.544] [UN]  PID: 14     TASK: ffff880c151a98b0  CPU: 6   COMMAND: "rcu_preempt"

rcu_preempt's stack is:

crash> bt
PID: 14     TASK: ffff880c151a98b0  CPU: 6   COMMAND: "rcu_preempt"
 #0 [ffff880c151b3c78] __schedule at ffffffff815bf324
 #1 [ffff880c151b3d20] schedule at ffffffff815bf9c4
 #2 [ffff880c151b3d40] del_timer_sync at ffffffff8106b5f5
 #3 [ffff880c151b3dc0] schedule_timeout at ffffffff815be334
 #4 [ffff880c151b3e30] rcu_gp_kthread at ffffffff810fe55f
 #5 [ffff880c151b3ec0] kthread at ffffffff8108389e
 #6 [ffff880c151b3f50] ret_from_fork at ffffffff815c966c

The task is sleeping on del_timer_sync() function...

del_timer_sync() definition:
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT_FULL)
 * del_timer_sync - deactivate a timer and wait for the handler to finish.
 * @timer: the timer to be deactivated
 * This function only differs from del_timer() on SMP: besides deactivating
 * the timer it also makes sure the handler has finished executing on other
 * CPUs.
 * Synchronization rules: Callers must prevent restarting of the timer,
 * otherwise this function is meaningless. It must not be called from
 * interrupt contexts unless the timer is an irqsafe one. The caller must
 * not hold locks which would prevent completion of the timer's
 * handler. The timer's handler must not call add_timer_on(). Upon exit the
 * timer is not queued and the handler is not running on any CPU.
 * Note: For !irqsafe timers, you must not hold locks that are held in
 *   interrupt context while calling this function. Even if the lock has
 *   nothing to do with the timer in question.  Here's why:
 *    CPU0                             CPU1
 *    ----                             ----
 *                                   <SOFTIRQ>
 *                                   call_timer_fn();
 *                                     base->running_timer = mytimer;
 *  spin_lock_irq(somelock);
 *                                     <IRQ>
 *                                        spin_lock(somelock);
 *  del_timer_sync(mytimer);
 *   while (base->running_timer == mytimer);
 * Now del_timer_sync() will never return and never release somelock.
 * The interrupt on the other CPU is waiting to grab somelock but
 * it has interrupted the softirq that CPU0 is waiting to finish.
 * The function returns whether it has deactivated a pending timer or not.
int del_timer_sync(struct timer_list *timer)
        unsigned long flags;

         * If lockdep gives a backtrace here, please reference
         * the synchronization rules above.
         * don't use it in hardirq context, because it
         * could lead to deadlock.
        WARN_ON(in_irq() && !tbase_get_irqsafe(timer->base));
        for (;;) {
                int ret = try_to_del_timer_sync(timer);
                if (ret >= 0)
                        return ret;
                wait_for_running_timer(timer);          <--- It is waiting for a running timer.

The wait_for_running_timer() waits for an active timer softirq to finish its
work, it sleeps on a wait queue.

So, there should be a ksoftirqd thread unable to run. Thus I checked runqueues
and I found a ksoftirqd not able to run due to a high priority rt user-space
thread on CPU 8:

CPU 8 RUNQUEUE: ffff88187f8541c0
  CURRENT: PID: 19306  TASK: ffff880bfefb18b0  COMMAND: "SPINNING_RT"
  RT PRIO_ARRAY: ffff88187f8542a0
     [ 97] PID: 19306  TASK: ffff880bfefb18b0  COMMAND: "SPINNING_RT"
     [ 98] PID: 255    TASK: ffff8818156198b0  COMMAND: "ksoftirqd/8"
  CFS RB_ROOT: ffff88187f854260
     [no tasks queued]

These threads are able to run for more than 2 minutes:

crash> ps -m 19306
[0 00:02:26.546] [RU]  PID: 19306  TASK: ffff880bfefb18b0  CPU: 8   COMMAND: "SPINNING_RT"
crash> ps -m 255
[0 00:02:26.546] [RU]  PID: 255    TASK: ffff8818156198b0  CPU: 8   COMMAND: "ksoftirqd/8"

But as SPINNING_RT has a higher priority, it is running since it was awakened.

The ksoftirq was preempted by SPINNING_RT while running the timer softirq:

crash> bt 255
PID: 255    TASK: ffff8818156198b0  CPU: 8   COMMAND: "ksoftirqd/8"
 #0 [ffff881815623b08] __schedule at ffffffff815bf324
 #1 [ffff881815623bb0] preempt_schedule at ffffffff815bf77f
 #2 [ffff881815623bd0] _raw_spin_unlock_irqrestore at ffffffff815c1c3e
 #3 [ffff881815623c40] wake_up_process at ffffffff81097127
 #4 [ffff881815623c60] process_timeout at ffffffff8106abae
 #5 [ffff881815623c70] call_timer_fn at ffffffff8106b8e9
 #6 [ffff881815623cb0] run_timer_softirq at ffffffff8106bbb9
 #7 [ffff881815623d50] handle_softirq at ffffffff81062ca5
 #8 [ffff881815623da0] do_current_softirqs at ffffffff81062f51
 #9 [ffff881815623e20] run_ksoftirqd at ffffffff81063064
#10 [ffff881815623e40] smpboot_thread_fn at ffffffff8108ccb9
#11 [ffff881815623ec0] kthread at ffffffff8108389e
#12 [ffff881815623f50] ret_from_fork at ffffffff815c966c

The timer softirq was on the way to wake up a thread, but it was preempted
before be able to finish this operation, checking the stack to figure out
which thread was about to be awakened:

 #1 [ffff881815623bb0] preempt_schedule at ffffffff815bf77f
    ffff881815623bb8: 0000000000000296 > ffff880c151a98b0 
    ffff881815623bc8: ffff881815623bd8 ffffffff815c1c3e       
 #2 [ffff881815623bd0] _raw_spin_unlock_irqrestore at ffffffff815c1c3e
    ffff881815623bd8: ffff881815623c38 ffffffff81096f79 
    ffff881815623be8: ffff881815620010 00000000000141c0 
    ffff881815623bf8: ffff881815623d18 ffffffff00000008 
    ffff881815623c08: ffff881815620010 ffff880c151a98b0 
    ffff881815623c18: ffffffff8106aba0 ffff880c151b3dc8 
    ffff881815623c28: ffff881815620000 ---> ffff880c151a98b0 -> 3: that was pushed here.
    ffff881815623c38: ffff881815623c58 ffffffff81097127      -> 2: it was stored at rbx
 #3 [ffff881815623c40] wake_up_process at ffffffff81097127   -> 1: task_struct is the first argument
    ffff881815623c48: 0000000000000000 0000000000000000 
    ffff881815623c58: ffff881815623c68 ffffffff8106abae 
 #4 [ffff881815623c60] process_timeout at ffffffff8106abae

The process that was about to wake up was:

crash> ps -m ffff880c151a98b0
[0 00:02:26.544] [UN]  PID: 14     TASK: ffff880c151a98b0  CPU: 6   COMMAND: "rcu_preempt"

The rcu_preempt task!

Conclusion: the high priority SPINNING_RT thread was running for a long time, 
starving ksoftirqd thread, that was holding the rcu_preempt thread, that end up
causing a hung task/rcu stall.

For sake of completeness, the SPINNING_RT was not blocked by any other thread, as
it was running on user-space. So, it is not the case of a kernel problem:

crash> set 19306
    PID: 19306
   TASK: ffff880bfefb18b0  [THREAD_INFO: ffff880c1356c000]
    CPU: 8
crash> bt
PID: 19306  TASK: ffff880bfefb18b0  CPU: 8   COMMAND: "SPINNING_RT"
 #0 [ffff88187f847e30] crash_nmi_callback at ffffffff81030dd7
 #1 [ffff88187f847e40] nmi_handle at ffffffff815c30ba
 #2 [ffff88187f847ea0] default_do_nmi at ffffffff815c3208
 #3 [ffff88187f847ed0] do_nmi at ffffffff815c34c0
 #4 [ffff88187f847ef0] end_repeat_nmi at ffffffff815c25b7
    RIP: 0000000001179715  RSP: 00007f0d309694b0  RFLAGS: 00000293
    RAX: 0000000000000007  RBX: 00000000034a4200  RCX: 0000000000000006
    RDX: 0000000000000001  RSI: 000000000000000a  RDI: 0000000000000002
    RBP: 0000000000000000   R8: 0000000000000008   R9: 0000000000000003
    R10: 0000000000000004  R11: 0000000000000005  R12: 00007f0d309694e0
    R13: 0000000000000000  R14: 0000000000000001  R15: 0000000000000005
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
--- <NMI exception stack> ---
Comment 10 errata-xmlrpc 2016-11-03 15:41:42 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.