Bug 1609717 - [unwinder] CPU spins indefinitely in __save_stack_trace() call chain
Summary: [unwinder] CPU spins indefinitely in __save_stack_trace() call chain
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: medium
Target Milestone: rc
Assignee: Josh Poimboeuf
QA Contact: Chunyu Hu
URL:
Whiteboard:
Depends On:
Blocks: 1625537
 
Reported: 2018-07-30 08:54 UTC by Oleksandr Natalenko
Modified: 2018-10-30 09:28 UTC
CC List: 7 users

Fixed In Version: kernel-3.10.0-943.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1625537
Environment:
Last Closed: 2018-10-30 09:27:14 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System                             ID              Private  Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)  3547841         0        None     None    None     2018-07-31 06:25:43 UTC
Red Hat Product Errata             RHSA-2018:3083  0        None     None    None     2018-10-30 09:28:37 UTC

Comment 5 Oleksandr Natalenko 2018-07-30 09:58:15 UTC
 === In Red Hat Customer Portal Case 02137847 ===
--- Comment by Natalenko, Oleksandr on 7/30/2018 11:58 AM ---

Hello.

Thanks for getting back.

Based on all three vmcores you've provided, we've opened BZ 1609717 for the developers to check whether this behaviour of the unwinder makes any sense. Once we have more info on the issue, we'll update you.

Thanks for your patience.

An overview of the 3rd vmcore, taken without the "onload" module, is given below.

---

Another host, same symptoms:

        CPUS: 28
        DATE: Wed Jul 25 19:27:43 2018
      UPTIME: 00:33:31
LOAD AVERAGE: 45.94, 35.61, 19.27
       TASKS: 1116
    NODENAME: nickel05.psc.des.co
     RELEASE: 3.10.0-862.9.1.el7.x86_64
     VERSION: #1 SMP Wed Jun 27 04:30:39 EDT 2018
     MACHINE: x86_64  (2599 Mhz)
      MEMORY: 255.2 GB
       PANIC: "SysRq : Trigger a crash"

Going step by step from the oldest UN task:

crash> ps -S
  RU: 42
  UN: 32
  IN: 1018
  ZO: 24

crash> foreach UN ps -m | tail -n 3
[0 00:08:29.107] [UN]  PID: 44142  TASK: ffff9fcd71b1dee0  CPU: 24  COMMAND: "cat"
[0 00:08:29.107] [UN]  PID: 44137  TASK: ffff9fcdd87e0000  CPU: 4   COMMAND: "cat"
[0 00:08:29.110] [UN]  PID: 44095  TASK: ffff9fcd7fb60fd0  CPU: 25  COMMAND: "cat"

crash> bt 44095
PID: 44095  TASK: ffff9fcd7fb60fd0  CPU: 25  COMMAND: "cat"
…
 #2 [ffff9fee56693d48] __mutex_lock_killable_slowpath at ffffffffb37125b9
…
    ffff9fee56693d70: ffff9fce610a3118 ffff9fce7b31bf40
    ffff9fee56693d80: ffff9fee50273140 ffff9fcda178a700
    ffff9fee56693d90: ffff9fce7b31bf40 ffff9fee56693db0
    ffff9fee56693da0: ffffffffb371268a
 #3 [ffff9fee56693da0] mutex_lock_killable at ffffffffb371268a
 #4 [ffff9fee56693db8] lock_trace at ffffffffb3292562
 #5 [ffff9fee56693dd8] proc_pid_stack at ffffffffb329260e
 #6 [ffff9fee56693e30] proc_single_show at ffffffffb3293712
 #7 [ffff9fee56693e68] seq_read at ffffffffb3241b80
 #8 [ffff9fee56693ed8] vfs_read at ffffffffb321b2ff
 #9 [ffff9fee56693f08] sys_read at ffffffffb321c1cf
…

It is blocked on the following mutex:

crash> dis -r ffffffffb371268a
…
0xffffffffb371265a <mutex_lock_killable+0xa>:   mov    %rdi,%rbx
…
0xffffffffb3712682 <mutex_lock_killable+0x32>:  mov    %rbx,%rdi
0xffffffffb3712685 <mutex_lock_killable+0x35>:  callq  0xffffffffb3712487 <__mutex_lock_killable_slowpath>
…

crash> dis -r ffffffffb37125b9 | grep push | grep -n rbx
6:0xffffffffb37124a1 <__mutex_lock_killable_slowpath+0x1a>:     push   %rbx

Which has lots of waiters:

crash> struct mutex.count,wait_list,owner ffff9fce610a3118
  count = {
    counter = 0xfffffff1
  }
  wait_list = {
    next = 0xffff9fee3166fd50,
    prev = 0xffff9fc7ff473d70
  }
  owner = 0xffff9fcdd3addee0

crash> list 0xffff9fee3166fd50 | wc -l
13
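
For context, lock_trace() in fs/proc/base.c (frame #4 of the backtrace above) takes the target task's cred_guard_mutex, so presumably that is the mutex at ffff9fce610a3118 and all of these "cat" readers are serialized on it. Roughly paraphrased from the upstream code of this era (not the exact RHEL 7 source):

static int lock_trace(struct task_struct *task)
{
        int err = mutex_lock_killable(&task->signal->cred_guard_mutex);

        if (err)
                return err;

        if (!ptrace_may_access(task, PTRACE_MODE_ATTACH)) {
                mutex_unlock(&task->signal->cred_guard_mutex);
                return -EPERM;
        }

        return 0;
}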

And the owner is stuck unwinding the stack:

crash> bt 0xffff9fcdd3addee0
PID: 44085  TASK: ffff9fcdd3addee0  CPU: 7   COMMAND: "cat"
…
 #3 [ffff9fce801c8ef0] end_repeat_nmi at ffffffffb3717d69
    [exception RIP: unwind_next_frame+0x5b]
    RIP: ffffffffb306aa9b  RSP: ffff9fcdb5b9fd08  RFLAGS: 00000297
    RAX: ffffffffb3afd2d8  RBX: ffff9fcdb5b9fd50  RCX: ffff9fcdb4243b68
    RDX: ffff9fcdb4243b60  RSI: 00000000ffffffff  RDI: ffff9fcdb5b9fd50
    RBP: ffff9fcdb5b9fd18   R8: 0000000000000000   R9: ffff9fafffc03600
    R10: ffffffffb32925e1  R11: ffffffffb31b3aad  R12: ffff9fcdb4243b70
    R13: ffff9fcdb4243b68  R14: ffff9fce7b31bf40  R15: ffff9fce7b31bf40
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff9fcdb5b9fd08] unwind_next_frame at ffffffffb306aa9b
    ffff9fcdb5b9fd10: ffff9fcdb4243b70 ffff9fcdb5b9fd40
    ffff9fcdb5b9fd20: ffffffffb306abf8
 #5 [ffff9fcdb5b9fd20] __unwind_start at ffffffffb306abf8
    ffff9fcdb5b9fd28: ffff9fcdb5b9fde0 0000000000000001
    ffff9fcdb5b9fd38: 0000000000000000 ffff9fcdb5b9fdc0
    ffff9fcdb5b9fd48: ffffffffb303ab5e
 #6 [ffff9fcdb5b9fd48] __save_stack_trace at ffffffffb303ab5e
…
    ffff9fcdb5b9fdc0: ffff9fcdb5b9fdd0 ffffffffb303ac5c
 #7 [ffff9fcdb5b9fdc8] save_stack_trace_tsk at ffffffffb303ac5c
 #8 [ffff9fcdb5b9fdd8] proc_pid_stack at ffffffffb329264e
…
    ffff9fcdb5b9fe20: ffffffffb3c4fde0 ffff9fcdb5b9fe60
    ffff9fcdb5b9fe30: ffffffffb3293712
 #9 [ffff9fcdb5b9fe30] proc_single_show at ffffffffb3293712
#10 [ffff9fcdb5b9fe68] seq_read at ffffffffb3241b80
#11 [ffff9fcdb5b9fed8] vfs_read at ffffffffb321b2ff
#12 [ffff9fcdb5b9ff08] sys_read at ffffffffb321c1cf
…

The CPU is blocked for a long time too:

crash> runq -T | tail -n 2
  CPU 27: 0.33 secs
   CPU 7: 508.51 secs

Trying to get some meaningful info.

crash> whatis save_stack_trace_tsk
void save_stack_trace_tsk(struct task_struct *, struct stack_trace *);

crash> dis -r ffffffffb329264e
…
0xffffffffb3292640 <proc_pid_stack+0xa0>:       lea    -0x48(%rbp),%rsi
0xffffffffb3292644 <proc_pid_stack+0xa4>:       mov    %r15,%rdi
…
0xffffffffb3292649 <proc_pid_stack+0xa9>:       callq  0xffffffffb303ac30 <save_stack_trace_tsk>
…

So the tracee (which, unlike in the previous vmcores, still exists) is:

crash> ps -m ffff9fce7b31bf40
[0 00:08:29.107] [IN]  PID: 32561  TASK: ffff9fce7b31bf40  CPU: 19  COMMAND: "repro-kernel-bu"

Note: 0 00:08:29 is 8 × 60 + 29 == 509 s, so PID 32561 (the tracee) has been sleeping for as long as PID 44085 (the tracer) has been spinning.

Nothing special on its active stack:

crash> bt 32561
PID: 32561  TASK: ffff9fce7b31bf40  CPU: 19  COMMAND: "repro-kernel-bu"
…
 #2 [ffff9fcdb4243e58] do_wait at ffffffffb30978b6
 #3 [ffff9fcdb4243ec0] sys_wait4 at ffffffffb30989d0
…

And the struct stack_trace looks unpopulated:

crash> px 0xffff9fcdb5b9fe28-0x48
$3 = 0xffff9fcdb5b9fde0

crash> stack_trace 0xffff9fcdb5b9fde0
struct stack_trace {
  nr_entries = 0x0,
  max_entries = 0x40,
  entries = 0xffff9fce28e4ae00,
  skip = 0x0
}
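
For reference, the upstream __save_stack_trace() of the vintage this unwinder was backported from looks roughly like this (a paraphrase for orientation, not the exact RHEL 7 source). nr_entries == 0x0 therefore means the walk has not recorded a single return address yet, i.e. the task is still stuck in the initial unwind_start()/unwind_next_frame() sequence:

static void __save_stack_trace(struct stack_trace *trace,
                               struct task_struct *task, struct pt_regs *regs,
                               bool nosched)
{
        struct unwind_state state;
        unsigned long addr;

        if (regs)
                save_stack_address(trace, regs->ip, nosched);

        /* Walk the stack frame by frame, recording each return address. */
        for (unwind_start(&state, task, regs, NULL); !unwind_done(&state);
             unwind_next_frame(&state)) {
                addr = unwind_get_return_address(&state);
                if (!addr || save_stack_address(trace, addr, nosched))
                        break;
        }

        if (trace->nr_entries < trace->max_entries)
                trace->entries[trace->nr_entries++] = ULONG_MAX;
}

(In that code base unwind_start() is an inline wrapper that, given a NULL first_frame, substitutes the stack pointer taken from regs or from the task before calling __unwind_start(); that matches the regs->sp load seen in the disassembly further below.)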

So, going up:

crash> whatis __save_stack_trace
void __save_stack_trace(struct stack_trace *, struct task_struct *, struct pt_regs *, bool);

crash> whatis __unwind_start
void __unwind_start(struct unwind_state *, struct task_struct *, struct pt_regs *, unsigned long *);

crash> dis -r ffffffffb303ab5e
…
0xffffffffb303ab10 <__save_stack_trace+0x10>:   mov    %rdx,%r13
…
0xffffffffb303ab48 <__save_stack_trace+0x48>:   mov    0x98(%r13),%rcx
0xffffffffb303ab4f <__save_stack_trace+0x4f>:   lea    -0x70(%rbp),%rdi
…
0xffffffffb303ab59 <__save_stack_trace+0x59>:   callq  0xffffffffb306ab30 <__unwind_start>
…

The current unwinder state is:

crash> px 0xffff9fcdb5b9fdc0-0x70
$2 = 0xffff9fcdb5b9fd50

crash> unwind_state 0xffff9fcdb5b9fd50
struct unwind_state {
  stack_info = {
    type = STACK_TYPE_TASK,
    begin = 0xffff9fcdb4240000,
    end = 0xffff9fcdb4244000,
    next_sp = 0x0
  },
  stack_mask = 0x2,
  task = 0xffff9fce7b31bf40,
  graph_idx = 0x0,
  error = 0x0,
  bp = 0xffff9fcdb4243b60,
  ip = 0xffff9fcdb4243b60
}

One interesting thing: in the subsequent frame, the pt_regs pointer is pushed onto the stack as %r13:

crash> dis __unwind_start | grep push | grep -n r13
2:0xffffffffb306ab3f <__unwind_start+0xf>:      push   %r13

But the pushed value is zero (the 2nd push lands at ffff9fcdb5b9fd38)!

crash> rd -x ffff9fcdb5b9fd38
ffff9fcdb5b9fd38:  0000000000000000

Why is that? The pointer was nevertheless dereferenced successfully: its .sp field was loaded into %rcx, and that value was subsequently re-saved into %r13:

crash> pt_regs -o | grep 0x98
  [0x98] unsigned long sp;

So, .sp here == %r13 == %rcx == ffff9fcdb4243b68.

Looking at the tracee, this .sp value seems to point nowhere meaningful:

crash> task_struct.stack ffff9fce7b31bf40
  stack = 0xffff9fcdb4240000

crash> rd -SS 0xffff9fcdb4240000 2048
…
ffff9fcdb4243b30:  0000000300000000 ffffffffffffffff
ffff9fcdb4243b40:  [ffff9fce7b31bf40:task_struct] [ffff9fce7b31bf40:task_struct]
ffff9fcdb4243b50:  0000000000000000 ffff9fee7f355c40
ffff9fcdb4243b60:  ffff9fcdb4243b60 ffff9fcdb4243b60
                                    ^^^^^^^^^^^^^^^^
ffff9fcdb4243b70:  ffff9fcdb4243b70 ffff9fcdb4243b70
ffff9fcdb4243b80:  00000000ff87db21 000000000000000e
ffff9fcdb4243b90:  fffff99c77d94b00 ffff9fee7f355bc0
ffff9fcdb4243ba0:  ffff9feefffd9000 ffff9feefffd9000
ffff9fcdb4243bb0:  ffff9fcdb4243c00 pagevec_lru_move_fn+0x10c
…

Or, rather, to some frame that was active in the past.

An interesting leftover: this was likely the active part of the stack before the task went to sleep:

crash> rd -SS 0xffff9fcdb4240000 2048
…
ffff9fcdb4243d30:  ffff9fcdb4243d80 dequeue_task_fair+0x41e
ffff9fcdb4243d40:  ffff9fcdb4243d60 sched_clock_cpu+0x85
ffff9fcdb4243d50:  ffff9fee7f358b40 ffff9fee7f358b40
ffff9fcdb4243d60:  ffff9fcdb4243db8 ffff9fcdb4243d78
ffff9fcdb4243d70:  tracing_is_on+0x15 ffff9fcdb4243d90
ffff9fcdb4243d80:  tracing_record_cmdline+0x1d [ffff9fcf53981fa0:task_struct]
ffff9fcdb4243d90:  ffff9fcdb4243db8 probe_sched_switch+0x2d
ffff9fcdb4243da0:  ffff9fee7f358b40 [ffff9fee72ab6c20:kmalloc-64]
ffff9fcdb4243db0:  [ffff9fce7b31bf40:task_struct] ffff9fcdb4243e40
ffff9fcdb4243dc0:  0000000000000086 [ffff9fce7b31bf40:task_struct]
ffff9fcdb4243dd0:  ffff9fcdb4243fd8 ffff9fcdb4243fd8
ffff9fcdb4243de0:  ffff9fcdb4243fd8 [ffff9fce7b31bf40:task_struct]
ffff9fcdb4243df0:  ffff9fae80000460 ffff9fcdb4243e28
ffff9fcdb4243e00:  0000000000000080 00000000ff87db21
ffff9fcdb4243e10:  [ffff9fce610a2d20:signal_cache] ffff9fcdb4243ec8
ffff9fcdb4243e20:  [ffff9fce7b31c430:task_struct] [ffff9fce7b31bf40:task_struct]
ffff9fcdb4243e30:  [ffff9fce7b31bf40:task_struct] [ffff9fce7b31bf30:task_struct]
ffff9fcdb4243e40:  ffff9fcdb4243e50 schedule+0x29
ffff9fcdb4243e50:  ffff9fcdb4243eb8 do_wait+0x1f6
…

Back to the code: this is most likely where the tracer loops:

117 void __unwind_start(struct unwind_state *state, struct task_struct *task,
118             struct pt_regs *regs, unsigned long *first_frame)
119 {
…
142     while (!unwind_done(state) &&
143            (!on_stack(&state->stack_info, first_frame, sizeof(long)) ||
144             state->bp < first_frame))
145         unwind_next_frame(state);
146 }

Apparently, state->bp < first_frame == true, so unwind_next_frame() was called:

crash> dis __unwind_start
…
0xffffffffb306ab41 <__unwind_start+0x11>:       mov    %rcx,%r13
…

  state->bp == 0xffff9fcdb4243b60
first_frame == 0xffff9fcdb4243b68

So, maybe, .bp is not updated by update_stack_state() here:

 18 static bool update_stack_state(struct unwind_state *state,
 19                    unsigned long *next_bp)
 20 {
…
 31     while (!on_stack(info, next_bp, FRAME_HEADER_SIZE))
 32         if (get_stack_info(info->next_sp, state->task, info,
 33                    &state->stack_mask))
 34             return false;
…
 43     return true;
 44 }

?

on_stack() should return true, since the address is within bounds:

 35 static inline bool on_stack(struct stack_info *info, void *addr, size_t len)
 36 {
 37     void *begin = info->begin;
 38     void *end   = info->end;
 39
 40     return (info->type != STACK_TYPE_UNKNOWN &&
 41         addr >= begin && addr < end &&
 42         addr + len > begin && addr + len <= end);
 43 }

Then, maybe, it *was* out of bounds.
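
If the saved value at ffff9fcdb4243b60, which points back to itself (highlighted in the rd output above), is what unwind_next_frame() keeps following, the walk can never advance: on_stack() keeps succeeding, state->bp stays below first_frame, and the while loop in __unwind_start() spins forever. Here is a minimal user-space sketch of that livelock using the observed values; the unwind_next_frame() stand-in is an illustration of this hypothesis, not the RHEL implementation, and FRAME_HEADER_SIZE is assumed to be two longs:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Values observed in the vmcore (see the unwind_state and rd output above). */
#define STACK_BEGIN  0xffff9fcdb4240000UL    /* stack_info.begin */
#define STACK_END    0xffff9fcdb4244000UL    /* stack_info.end   */
#define BP           0xffff9fcdb4243b60UL    /* state->bp; the word stored there is also ...b60 */
#define FIRST_FRAME  0xffff9fcdb4243b68UL    /* regs->sp, passed as first_frame */
#define FRAME_HEADER_SIZE (2 * sizeof(long)) /* assumed: saved %rbp + return address */

struct state {
        uintptr_t bp;
        bool done;
};

/* Same bounds check as the kernel's on_stack() quoted above. */
static bool on_stack(uintptr_t addr, size_t len)
{
        return addr >= STACK_BEGIN && addr < STACK_END &&
               addr + len > STACK_BEGIN && addr + len <= STACK_END;
}

/*
 * Stand-in for unwind_next_frame(): follow the saved frame pointer.  In the
 * vmcore, the word at 0xffff9fcdb4243b60 is 0xffff9fcdb4243b60, so the "next"
 * frame is the current frame and bp never changes.
 */
static void unwind_next_frame(struct state *state)
{
        uintptr_t next_bp = BP;                      /* what *state->bp contains */

        if (!on_stack(next_bp, FRAME_HEADER_SIZE)) { /* on_stack() is true here */
                state->done = true;
                return;
        }
        state->bp = next_bp;                         /* no progress */
}

int main(void)
{
        struct state state = { .bp = BP, .done = false };
        unsigned long i = 0;

        /* The loop from __unwind_start() quoted above, with the observed values. */
        while (!state.done &&
               (!on_stack(FIRST_FRAME, sizeof(long)) || state.bp < FIRST_FRAME)) {
                unwind_next_frame(&state);
                if (++i == 10)                       /* the real loop has no such cap */
                        break;
        }

        printf("after %lu iterations: bp=%#lx, first_frame=%#lx, still spinning\n",
               i, (unsigned long)state.bp, (unsigned long)FIRST_FRAME);
        return 0;
}

Without the 10-iteration cap this spins exactly like CPU 7 in the vmcore: bp never reaches first_frame, so __save_stack_trace() never records its first frame and nr_entries stays 0.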

Comment 25 Bruno Meneguele 2018-09-04 13:01:44 UTC
Patch(es) committed to the kernel repository; an interim kernel build is undergoing testing.

Comment 27 Bruno Meneguele 2018-09-04 13:05:16 UTC
Patch(es) available in kernel-3.10.0-943.el7

Comment 32 errata-xmlrpc 2018-10-30 09:27:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3083

