Bug 1608672

Summary: RT system hang due to wrong value of rq's nr_running
Product: Red Hat Enterprise Linux 7 Reporter: tianyongjiang <tian.yongjiang>
Component: kernel-rt Assignee: Crystal Wood <crwood>
kernel-rt sub component: Memory Management QA Contact: Jiri Kastner <jkastner>
Status: CLOSED ERRATA Docs Contact: Marie Hornickova <mdolezel>
Severity: urgent    
Priority: high CC: bhu, crwood, dhoward, jiang.biao2, jkastner, lgoncalv, mkolaja, pezhang, stalexan
Version: 7.4 Keywords: ZStream
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: kernel-rt-3.10.0-931.rt56.881.el7 Doc Type: Bug Fix
Doc Text:
A race condition that prevented tasks from being scheduled properly has been fixed. Previously, preemption was enabled too early after a context switch. If a task was migrated to another CPU after a context switch, a mismatch between CPU and runqueue during load balancing sometimes occurred. Consequently, a runnable task on an idle CPU failed to run, and the operating system became unresponsive. This update disables preemption in the schedule_tail() function. As a result, CPU migration during post-schedule processing no longer occurs, which prevents this mismatch. The operating system no longer hangs due to this bug.
Story Points: ---
Clone Of:
: 1617941 1618466 Environment:
Last Closed: 2018-10-30 09:43:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1175461, 1532680, 1541534, 1617941, 1618466    
Attachments:
Description Flags
output of crash tool none

Description tianyongjiang 2018-07-26 06:36:53 UTC
Created attachment 1470606
output of crash tool

Description of problem:
  My RT system hung, so I triggered a panic via sysrq and analyzed the vmcore.

  The RT process (PID 4, state RU) has been woken up to take the boot_tvec_bases lock, but nr_running of the rq for CPU 0 is wrong (one RT process is queued, but nr_running is 0), so the pick_next_task() function always chooses the idle process.
  Many other processes are also waiting to take the boot_tvec_bases lock.
  Additionally, nr_running of the rq for CPU 24 is wrong (one RT process is queued, but nr_running is 2).

  crash> runq
  CPU 0 RUNQUEUE: ffff881ffca19080
    CURRENT: PID: 0      TASK: ffffffff81a02480  COMMAND: "swapper/0"
    RT PRIO_ARRAY: ffff881ffca19208
        [ 98] PID: 4      TASK: ffff88022bb030f0  COMMAND: "ktimersoftd/0"
    CFS RB_ROOT: ffff881ffca19120
        [no tasks queued]
  ...
  CPU 24 RUNQUEUE: ffff881ffcd19080
    CURRENT: PID: 450    TASK: ffff881ffe02d190  COMMAND: "irq/46-xhci_hcd"
    RT PRIO_ARRAY: ffff881ffcd19208
        [ 49] PID: 450    TASK: ffff881ffe02d190  COMMAND: "irq/46-xhci_hcd"
    CFS RB_ROOT: ffff881ffcd19120
        [no tasks queued]
  ...

  crash> ps
    PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
    ...
      4      2   0  ffff88022bb030f0  RU   0.0       0      0  [ktimersoftd/0]
    ...

  crash> task 4
    ...
    exec_start = 129364170626886,
    ...
    last_arrival = 129364170620770,
    ...

  crash> struct rq ffff881ffca19080
    ...
    nr_running = 0,
    ...
    idle_stamp = 269573437263498,
    ...
    rt = {
      ...
      rt_nr_running = 1,
      ...
    }
    ...

  crash> struct rq ffff881ffcd19080
    ...
    nr_running = 2,
    ...
    rt = {
      ...
      rt_nr_running = 1,
      ...
    }
    ...
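
The mismatch in the dumps above can be restated as a simple check. The following is a minimal userspace sketch, not kernel code: struct rq_snapshot and rq_count_skew() are hypothetical names, the values are copied from the crash output, and the CFS count of 0 is an assumption based on the "[no tasks queued]" lines. It also assumes that RT group scheduling is not changing what rt_nr_running counts.

```c
#include <assert.h>

/* Hypothetical snapshot of fields read from the vmcore; not a kernel type. */
struct rq_snapshot {
    int cpu;
    unsigned int nr_running;      /* rq->nr_running */
    unsigned int rt_nr_running;   /* rq->rt.rt_nr_running */
    unsigned int cfs_nr_running;  /* assumed 0: runq shows no CFS tasks queued */
};

/*
 * Difference between the runqueue-wide counter and the per-class totals.
 * 0 means the counters agree; anything else means the accounting is off.
 */
int rq_count_skew(const struct rq_snapshot *s)
{
    assert(s);
    return (int)s->nr_running - (int)(s->rt_nr_running + s->cfs_nr_running);
}

/* Values taken from "struct rq" and "runq" in the crash output above. */
const struct rq_snapshot cpu0 = {
    .cpu = 0, .nr_running = 0, .rt_nr_running = 1, .cfs_nr_running = 0,
};
const struct rq_snapshot cpu24 = {
    .cpu = 24, .nr_running = 2, .rt_nr_running = 1, .cfs_nr_running = 0,
};
```

With these values rq_count_skew() returns -1 for CPU 0 and +1 for CPU 24. The two cancel out, which is consistent with (though it does not prove) a single accounting update landing on the wrong runqueue.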


Version-Release number of selected component (if applicable):
  kernel-rt-3.10.0-693.11.1.rt56.632.el7

How reproducible:
  No obvious regular pattern. 

Additional info:
  Please refer to attachments.

Comment 6 tianyongjiang 2018-08-09 08:44:29 UTC
Hi swood:
  The preempt_enable() in the schedule_tail() function has no effect, because only ia64 and mips define __ARCH_WANT_UNLOCKED_CTXSW (I searched the kernel code), and my kernel-rt is x86_64.
  So preemption is not enabled; I am confused.

  How did you fix it? Can you describe the fix in more detail, or paste the code?

asmlinkage void schedule_tail(struct task_struct *prev)
        __releases(rq->lock)
{
        struct rq *rq = this_rq();

        finish_task_switch(rq, prev);

        /*
         * FIXME: do we need to worry about rq being invalidated by the
         * task_switch?
         */
        post_schedule(rq);

#ifdef __ARCH_WANT_UNLOCKED_CTXSW
        /* In this case, finish_task_switch does not reenable preemption */
        preempt_enable();
#endif
        if (current->set_child_tid)
                put_user(task_pid_vnr(current), current->set_child_tid);
}


  In addition, preemption is also enabled in the latest kernel code:
 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/core.c?h=v4.18-rc8

Comment 7 tianyongjiang 2018-08-09 10:49:59 UTC
Hi swood:
 If you need the vmcore, I can provide it.

Comment 10 Crystal Wood 2018-08-09 17:00:54 UTC
(In reply to tianyongjiang from comment #6)
> Hi swood :
>   The preempt_enable() in  schedule_tail() function is not effective,because
> only ia64 and mips have __ARCH_WANT_UNLOCKED_CTXSW(I seach kernel code), my
> kernel-rt is x86_64.
>   So Preemption is not enabled, I am confused.
> 
>   How did you fix? Can you describe in more detail,or paste the code?

That is not where preemption is getting enabled.  finish_task_switch() calls finish_lock_switch() which releases rq->lock, and releasing a raw spinlock contains a preempt_enable().

The fix is upstream commit 1a43a14a5bd9c32 ("sched: Fix schedule_tail() to disable preemption").
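
The effect described above can be traced with a toy preempt counter. This is a userspace sketch under stated assumptions, not the kernel implementation: all *_model names are hypothetical, the starting count of 1 is the 3.10 value for a newly forked thread, and the fix is assumed to bracket finish_task_switch() and post_schedule() with preempt_disable()/preempt_enable().

```c
#include <assert.h>

/* Toy stand-in for the current task's preempt count. */
int preempt_count;

/* Preempt count observed while post_schedule() runs. */
static int count_in_post_schedule;

static void preempt_disable(void) { preempt_count++; }

static void preempt_enable(void)
{
    assert(preempt_count > 0);  /* an unbalanced enable would be a bug */
    preempt_count--;
}

/* Releasing a raw spinlock ends with preempt_enable(). */
static void raw_spin_unlock_model(void) { preempt_enable(); }

/* finish_task_switch() -> finish_lock_switch() releases rq->lock. */
static void finish_task_switch_model(void) { raw_spin_unlock_model(); }

/* Record whether the post-schedule work could be preempted. */
static void post_schedule_model(void) { count_in_post_schedule = preempt_count; }

/*
 * Model of schedule_tail() for a newly forked thread on 3.10.
 * Returns the preempt count that post_schedule() ran with:
 * 0 means preemption (and therefore migration) was possible.
 */
int schedule_tail_model(int with_fix)
{
    preempt_count = 1;          /* 3.10: forked thread starts at 1 */
    if (with_fix)
        preempt_disable();      /* assumed placement of the fix */
    finish_task_switch_model(); /* drops rq->lock, count goes down by 1 */
    post_schedule_model();
    if (with_fix)
        preempt_enable();
    return count_in_post_schedule;
}
```

In this model schedule_tail_model(0) returns 0, so post_schedule() runs preemptibly, while schedule_tail_model(1) returns 1, so preemption stays off until the post-schedule work is done.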

>   In addition, the preemption is also enabled in latest kernel code: 
>  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/
> kernel/sched/core.c?h=v4.18-rc8

Yes, but not until after balance_callback() -- and upstream has made the preempt count always be 2 on context switch, whereas in 3.10 a newly forked thread has a preempt count of 1.
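
The difference between the two starting counts can be sketched the same way. This is a toy calculation rather than kernel code; count_during_callbacks() and preemptible() are hypothetical helpers, and the only inputs are the counts of 1 (3.10) and 2 (upstream) mentioned above.

```c
#include <assert.h>

/*
 * Returns the preempt count in effect while post_schedule() (3.10) or
 * balance_callback() (upstream) runs, given the count a newly forked
 * thread starts with. Releasing rq->lock inside finish_task_switch()
 * lowers the count by one before those callbacks run.
 */
int count_during_callbacks(int fork_preempt_count)
{
    int count = fork_preempt_count;

    assert(count > 0);
    count--;        /* rq->lock released: implicit preempt_enable() */
    return count;   /* the callbacks run with this count */
}

/* Preemption is possible only when the count has dropped to zero. */
int preemptible(int count)
{
    return count == 0;
}
```

Starting from 1, the callbacks run with a count of 0 and can be preempted; starting from 2, they run with a count of 1, and only the final preempt_enable() in schedule_tail() makes the task preemptible.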

Comment 15 Beth Uptagrafft 2018-09-17 16:04:31 UTC
*** Bug 1590222 has been marked as a duplicate of this bug. ***

Comment 23 errata-xmlrpc 2018-10-30 09:43:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3096