Bug 712214 - bt: cannot transition from exception stack to process stack
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: crash
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Dave Anderson
QA Contact: Kernel Dump QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-06-09 19:42 UTC by Dave Anderson
Modified: 2011-12-06 16:30 UTC

Fixed In Version: crash-5.1.7-1.el6
Doc Type: Bug Fix
Doc Text:
In a rare scenario, a non-crashing CPU received a shutdown NMI (non-maskable interrupt) immediately after receiving an interrupt from another source. Because the IRQ entry-point symbols "IRQ0x00_interrupt" through "IRQ0x##_interrupt" no longer existed, the bt command terminated with the "bt: cannot transition from exception stack to current process stack" error message on AMD64 and Intel 64 architectures. This bug has been fixed, and backtrace now properly transitions from the NMI stack back to the interrupted process stack.
Clone Of:
Environment:
Last Closed: 2011-12-06 16:30:07 UTC
Target Upstream Version:


Links
System: Red Hat Product Errata
ID: RHBA-2011:1648
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: crash bug fix and enhancement update
Last Updated: 2011-12-06 00:50:30 UTC

Description Dave Anderson 2011-06-09 19:42:03 UTC
Description of problem:

  kdump testing yielded a vmcore where the following backtrace error
  occurred when backtracing the active tasks:

PID: 0      TASK: ffff88012cd74b00  CPU: 3   COMMAND: "swapper"
 #0 [ffff880028267e90] crash_nmi_callback at ffffffff81028a96
 #1 [ffff880028267ea0] notifier_call_chain at ffffffff814e13e5
 #2 [ffff880028267ee0] atomic_notifier_call_chain at ffffffff814e144a
 #3 [ffff880028267ef0] notify_die at ffffffff810942fe
 #4 [ffff880028267f20] do_nmi at ffffffff814df033
 #5 [ffff880028267f50] nmi at ffffffff814de940
    [exception RIP: irq_entries_start+296]
    RIP: ffffffff8100b728  RSP: ffff88012cd79e38  RFLAGS: 00000006
    RAX: 0000000000000000  RBX: 0000000000000004  RCX: 0000000000000000
    RDX: 00000000000000eb  RSI: 0000000000000000  RDI: 00000000000399dd
    RBP: ffff88012cd79ed8   R8: 0000000000000000   R9: 0000000000000320
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 00000000000000eb  R14: 0000000000000002  R15: 0000000000000003
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88012cd79e38] irq_entries_start at ffffffff8100b728
bt: cannot transition from exception stack to current process stack:
    exception stack pointer: ffff880028267e90
      process stack pointer: ffff88012cd7a048
         current stack base: ffff88012cd78000
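
  The numbers in the error message can be checked by hand. The sketch
  below is only an arithmetic illustration, not crash's own logic: it
  assumes an 8 KiB kernel stack (a value not stated in this report), and
  the helper names are hypothetical. Under that assumption, the reported
  process stack pointer lies just past the top of the task's stack, while
  the RSP saved in the NMI exception frame lies inside it.

```python
# Values copied from the bt output above.
EXCEPTION_STACK_POINTER = 0xffff880028267e90
PROCESS_STACK_POINTER = 0xffff88012cd7a048
CURRENT_STACK_BASE = 0xffff88012cd78000
EXCEPTION_FRAME_RSP = 0xffff88012cd79e38  # RSP shown in the exception frame

# Assumption: 8 KiB kernel stacks. This size is not given in the report.
ASSUMED_STACK_SIZE = 8192


def offset_in_stack(addr, base=CURRENT_STACK_BASE, size=ASSUMED_STACK_SIZE):
    """Return addr - base if base <= addr < base + size, otherwise None."""
    off = addr - base
    return off if 0 <= off < size else None


def bytes_past_top(addr, base=CURRENT_STACK_BASE, size=ASSUMED_STACK_SIZE):
    """Return how far addr is beyond base + size (negative if below the top)."""
    return addr - (base + size)


if __name__ == "__main__":
    for name, addr in [
        ("exception stack pointer", EXCEPTION_STACK_POINTER),
        ("process stack pointer", PROCESS_STACK_POINTER),
        ("exception frame RSP", EXCEPTION_FRAME_RSP),
    ]:
        off = offset_in_stack(addr)
        where = "outside the stack" if off is None else "offset %#x" % off
        print("%-24s %016x  %s" % (name, addr, where))
    print("process stack pointer is %#x bytes past the assumed stack top"
          % bytes_past_top(PROCESS_STACK_POINTER))
```

  If the 8 KiB assumption holds, the saved RSP (ffff88012cd79e38) is a
  usable address on the interrupted task's stack, whereas the "process
  stack pointer" that bt reports is not.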


Version-Release number of selected component (if applicable):

  crash-5.1.1-2.el6
  kernel-2.6.32-156.el6.x86_64

How reproducible:

  Very difficult -- NMI issued to non-crashing cpu must be received in
  a small window of opportunity.

Steps to Reproduce:
1. 
2.
3.
  
Actual results:

  As shown above.

Expected results:

  Backtrace should properly transition from the NMI stack back to
  the interrupted process stack.

Additional info:

  Reported by Paul Bunyan while kdump testing on 
  intel-piketon-tpm-01.lab.bos.redhat.com

https://beaker.engineering.redhat.com/jobs/95032

http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/06/950/95032/193998/2097628/9796461//test_log--kernel-kdump-analyse-crash.log

  I have a copy of the vmlinux/vmcore pair.

Comment 3 Dave Anderson 2011-06-10 18:14:33 UTC
The shutdown NMI has to be received by a non-crashing cpu
within a couple of instructions after it has received an
interrupt from another source.  So it's highly unlikely
that this can be reproduced.

I have a fix for it -- the backtrace looks like this:

PID: 0      TASK: ffff88012cd74b00  CPU: 3   COMMAND: "swapper"
 #0 [ffff880028267e90] crash_nmi_callback at ffffffff81028a96
 #1 [ffff880028267ea0] notifier_call_chain at ffffffff814e13e5
 #2 [ffff880028267ee0] atomic_notifier_call_chain at ffffffff814e144a
 #3 [ffff880028267ef0] notify_die at ffffffff810942fe
 #4 [ffff880028267f20] do_nmi at ffffffff814df033
 #5 [ffff880028267f50] nmi at ffffffff814de940
    [exception RIP: irq_entries_start+296]
    RIP: ffffffff8100b728  RSP: ffff88012cd79e38  RFLAGS: 00000006
    RAX: 0000000000000000  RBX: 0000000000000004  RCX: 0000000000000000
    RDX: 00000000000000eb  RSI: 0000000000000000  RDI: 00000000000399dd
    RBP: ffff88012cd79ed8   R8: 0000000000000000   R9: 0000000000000320
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 00000000000000eb  R14: 0000000000000002  R15: 0000000000000003
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88012cd79e38] irq_entries_start at ffffffff8100b728
 #7 [ffff88012cd79e60] intel_idle at ffffffff812bc2a1
 #8 [ffff88012cd79ee0] cpuidle_idle_call at ffffffff813ed4b7
 #9 [ffff88012cd79f00] cpu_idle at ffffffff81009de6
  
The non-crashing cpu was sitting idle, received an interrupt from
some source, but then immediately received a shutdown NMI from the
crashing cpu.
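
The stack transition in the fixed backtrace can be seen from the frame
addresses alone. The following sketch parses the frame lines quoted above
and splits them by whether the bracketed stack address falls inside the
task's stack. It is an illustration only, not how crash decides this: the
stack base is taken from the earlier error output, the 8 KiB stack size
is an assumption not stated in this report, and the function names are
hypothetical.

```python
import re

# Frame lines copied from the fixed backtrace above.
BT_OUTPUT = """\
 #0 [ffff880028267e90] crash_nmi_callback at ffffffff81028a96
 #1 [ffff880028267ea0] notifier_call_chain at ffffffff814e13e5
 #2 [ffff880028267ee0] atomic_notifier_call_chain at ffffffff814e144a
 #3 [ffff880028267ef0] notify_die at ffffffff810942fe
 #4 [ffff880028267f20] do_nmi at ffffffff814df033
 #5 [ffff880028267f50] nmi at ffffffff814de940
 #6 [ffff88012cd79e38] irq_entries_start at ffffffff8100b728
 #7 [ffff88012cd79e60] intel_idle at ffffffff812bc2a1
 #8 [ffff88012cd79ee0] cpuidle_idle_call at ffffffff813ed4b7
 #9 [ffff88012cd79f00] cpu_idle at ffffffff81009de6
"""

FRAME_RE = re.compile(
    r"^\s*#(\d+)\s+\[([0-9a-f]{16})\]\s+(\S+)\s+at\s+([0-9a-f]{16})\s*$"
)

STACK_BASE = 0xffff88012cd78000   # "current stack base" from the error output
ASSUMED_STACK_SIZE = 8192         # assumption: 8 KiB kernel stacks


def parse_frames(text):
    """Return (frame number, stack address, symbol, text address) tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), int(m.group(2), 16),
                           m.group(3), int(m.group(4), 16)))
    return frames


def on_process_stack(sp):
    """True if sp lies within the assumed bounds of the task's stack."""
    return STACK_BASE <= sp < STACK_BASE + ASSUMED_STACK_SIZE


def split_by_stack(frames):
    """Split frames into (not on process stack, on process stack)."""
    other = [f for f in frames if not on_process_stack(f[1])]
    proc = [f for f in frames if on_process_stack(f[1])]
    return other, proc


if __name__ == "__main__":
    other, proc = split_by_stack(parse_frames(BT_OUTPUT))
    print("other stack:  ", [f[2] for f in other])
    print("process stack:", [f[2] for f in proc])
```

Frames #0 through #5 fall outside the task's stack, which matches the
NMI exception stack marker in the output, and frames #6 through #9 fall
inside it.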

Comment 7 Tomas Capek 2011-10-18 15:02:07 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
In a rare scenario, a non-crashing CPU received a shutdown NMI (non-maskable interrupt) immediately after receiving an interrupt from another source. Because the IRQ entry-point symbols "IRQ0x00_interrupt" through "IRQ0x##_interrupt" no longer existed, the bt command terminated with the "bt: cannot transition from exception stack to current process stack" error message on AMD64 and Intel 64 architectures. This bug has been fixed, and backtrace now properly transitions from the NMI stack back to the interrupted process stack.

Comment 8 errata-xmlrpc 2011-12-06 16:30:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1648.html

