Bug 1179480

Summary: crash "bt" command mislabels exception stacks on x86_64 kernels which have no STACKFAULT stack
Product: Red Hat Enterprise Linux 6 Reporter: Dave Anderson <anderson>
Component: crash Assignee: Dave Anderson <anderson>
Status: CLOSED ERRATA QA Contact: Qiao Zhao <qzhao>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.7 CC: jherrman
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: crash-7.1.0-1.el6 Doc Type: Bug Fix
Doc Text:
A prior update of the AMD64 and Intel 64 kernels removed the STACKFAULT exception stack. As a consequence, using the "bt" command with the updated kernels previously displayed an incorrect exception stack name if the backtrace originated in an exception stack other than STACKFAULT. In addition, the "mach" command displayed incorrect names for exception stacks other than STACKFAULT. This update ensures that stack names are generated properly in the described circumstances, and both "bt" and "mach" now display correct information.
Last Closed: 2015-07-22 06:27:11 UTC Type: Bug

Description Dave Anderson 2015-01-06 20:58:40 UTC
Description of problem:

RHEL6 kernels patched for CVE-2014-9322 have removed their STACKFAULT
exception stacks.  This causes the "bt" command to display an invalid
exception stack location in backtraces that transition from an exception
stack to a process kernel stack.

Version-Release number of selected component (if applicable):

crash-6.1.0-5.el6
kernel-2.6.32-519.el6

How reproducible:

Always

Steps to Reproduce:

1. enter sysrq-c to force the kernel to panic and create a kdump
2. run the "bt -a" command
3. as an example, note the backtraces of the non-panicking active
   tasks, which will all transition from the NMI stack back to the
   active task's kernel stack. 

Actual results:

crash> bt -a
PID: 0      TASK: ffffffff81911440  CPU: 0   COMMAND: "swapper/0"
 #0 [ffff88007fa05e70] crash_nmi_callback at ffffffff810406c2
 #1 [ffff88007fa05e80] nmi_handle at ffffffff8160d819
 #2 [ffff88007fa05ec8] do_nmi at ffffffff8160d930
 #3 [ffff88007fa05ef0] end_repeat_nmi at ffffffff8160cc71
    [exception RIP: native_safe_halt+6]
    RIP: ffffffff81052dd6  RSP: ffffffff818ffe98  RFLAGS: 00000286
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000286
    RDX: ffffffff818ffe98  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff81052dd6   R8: ffffffff81052dd6   R9: 0000000000000018
    R10: ffffffff818ffe98  R11: 0000000000000286  R12: ffffffffffffffff
    R13: 0000000000000046  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0010  SS: 0018
--- <DOUBLEFAULT exception stack> ---
 #4 [ffffffff818ffe98] native_safe_halt at ffffffff81052dd6
 #5 [ffffffff818ffea0] default_idle at ffffffff8101c93f
 #6 [ffffffff818ffec0] arch_cpu_idle at ffffffff8101d236
 #7 [ffffffff818ffed0] cpu_startup_entry at ffffffff810c6925
 #8 [ffffffff818fff30] rest_init at ffffffff815f3357
 #9 [ffffffff818fff40] start_kernel at ffffffff81a45057
#10 [ffffffff818fff88] x86_64_start_reservations at ffffffff81a445ee
#11 [ffffffff818fff98] x86_64_start_kernel at ffffffff81a44742

...

Note that the stack-transition point is mislabeled as coming from
the "DOUBLEFAULT exception stack". 
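
The mislabeling is consistent with a name table that is off by one
slot. The sketch below is only an illustration under stated
assumptions: the hard-coded ordering in HARDCODED_NAMES and the
post-patch ordering in ACTUAL_STACKS are assumed here, and the helper
names are hypothetical, not taken from the crash sources.

```python
# Hypothetical hard-coded table, in exception-stack slot order, as a tool
# written for older kernels might carry it (assumed ordering).
HARDCODED_NAMES = ["STACKFAULT", "DOUBLEFAULT", "NMI", "DEBUG", "MCE"]

# Assumed layout on a kernel that no longer has a STACKFAULT stack:
# each remaining stack moves down one slot.
ACTUAL_STACKS = ["DOUBLEFAULT", "NMI", "DEBUG", "MCE"]


def shown_label(slot):
    """Name that a hard-coded lookup would print for a 0-based slot."""
    return HARDCODED_NAMES[slot]


def mislabels():
    """Return (actual, shown) pairs for every slot whose label is wrong."""
    return [(actual, shown_label(slot))
            for slot, actual in enumerate(ACTUAL_STACKS)
            if shown_label(slot) != actual]


for actual, shown in mislabels():
    print(f"{actual} stack is shown as {shown}")
```

Under these assumptions the NMI stack sits in the slot that the old
table calls DOUBLEFAULT, which matches the backtrace above.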


Expected results:

The backtrace should show it coming from the "NMI exception stack":

crash> bt -a
PID: 0      TASK: ffffffff81911440  CPU: 0   COMMAND: "swapper/0"
 #0 [ffff88007fa05e70] crash_nmi_callback at ffffffff810406c2
 #1 [ffff88007fa05e80] nmi_handle at ffffffff8160d819
 #2 [ffff88007fa05ec8] do_nmi at ffffffff8160d930
 #3 [ffff88007fa05ef0] end_repeat_nmi at ffffffff8160cc71
    [exception RIP: native_safe_halt+6]
    RIP: ffffffff81052dd6  RSP: ffffffff818ffe98  RFLAGS: 00000286
    RAX: 00000000ffffffed  RBX: ffffffff818fffd8  RCX: 0100000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000046
    RBP: ffffffff818ffe98   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
    R13: ffffffff818fffd8  R14: ffff88007fce65c0  R15: ffffffff818fffd8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffffffff818ffe98] native_safe_halt at ffffffff81052dd6
 #5 [ffffffff818ffea0] default_idle at ffffffff8101c93f
 #6 [ffffffff818ffec0] arch_cpu_idle at ffffffff8101d236
 #7 [ffffffff818ffed0] cpu_startup_entry at ffffffff810c6925
 #8 [ffffffff818fff30] rest_init at ffffffff815f3357
 #9 [ffffffff818fff40] start_kernel at ffffffff81a45057
#10 [ffffffff818fff88] x86_64_start_reservations at ffffffff81a445ee
#11 [ffffffff818fff98] x86_64_start_kernel at ffffffff81a44742

...


Additional info:

This issue has been addressed upstream:

https://github.com/crash-utility/crash/commit/e4cc9e7faf4517d049d2c997603015d08fe2c9e4
  
Fix for the X86_64 "bt" and "mach" commands when running against
kernels that have the following Linux 3.18 commit, which removes the
special per-cpu exception stack for handling stack segment faults:

  commit 6f442be2fb22be02cafa606f1769fa1e6f894441
  x86_64, traps: Stop using IST for #SS

Without this patch, backtraces that originate on any of the other 4
per-cpu exception stacks will be mislabeled at the transition point
back to the previous stack.  For example, backtraces that originate
on the NMI stack will indicate that they are coming from the
"DOUBLEFAULT" stack.  The patch examines all idt_table entries
during initialization, looking for gate descriptors that have
non-zero IST index values, and when one is found, pulls out the
handler function address; from that information, the exception stack
name string array is properly initialized rather than being
hard-coded.
This fix also properly labels the exception stack names on x86_64
CONFIG_PREEMPT_RT realtime kernels, which utilize only 3 exception
stacks instead of the traditional 5 (now 4 with this kernel commit);
previously they were all just shown as "RT".  Also, without the patch,
the "mach" command will mislabel the stack names when it displays the
base addresses of each per-cpu exception stack.
(anderson)
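
The idt_table scan described above can be sketched roughly as follows.
This is a minimal illustration, not the code from the upstream commit:
it assumes the standard 16-byte x86_64 gate descriptor layout, and the
helper names (parse_gate, make_gate, build_stack_names), the LABELS
mapping, the symbol table, and the handler addresses are all
hypothetical.

```python
import struct

# 16-byte x86_64 gate: offset[15:0], selector, IST byte, type/attr,
# offset[31:16], offset[63:32], reserved.
GATE = struct.Struct("<HHBBHII")

# Hypothetical mapping from handler symbol name to stack label.
LABELS = {"double_fault": "DOUBLEFAULT", "nmi": "NMI", "debug": "DEBUG",
          "int3": "DEBUG", "machine_check": "MCE",
          "stack_segment": "STACKFAULT"}


def parse_gate(raw):
    """Return (ist_index, handler_address) for one raw gate descriptor."""
    lo, _sel, ist, _attr, mid, hi, _rsvd = GATE.unpack(raw)
    return ist & 0x7, lo | (mid << 16) | (hi << 32)


def make_gate(handler, ist):
    """Build a raw gate descriptor (used here only to fake an IDT)."""
    return GATE.pack(handler & 0xFFFF, 0x10, ist, 0x8E,
                     (handler >> 16) & 0xFFFF, handler >> 32, 0)


def build_stack_names(idt, symbols, max_stacks=7):
    """Fill a name array indexed by (IST index - 1) from the IDT contents."""
    names = [None] * max_stacks
    for off in range(0, len(idt), GATE.size):
        ist, handler = parse_gate(idt[off:off + GATE.size])
        if ist == 0:
            continue  # gate does not switch to an exception stack
        label = LABELS.get(symbols.get(handler))
        if label is not None:
            names[ist - 1] = label
    return names


# Made-up handler addresses for a kernel without a STACKFAULT stack.
symbols = {0xFFFFFFFF8160C100: "debug", 0xFFFFFFFF8160C200: "nmi",
           0xFFFFFFFF8160C300: "double_fault",
           0xFFFFFFFF8160C400: "machine_check",
           0xFFFFFFFF8160C500: "page_fault"}
idt = b"".join([make_gate(0xFFFFFFFF8160C100, 3),   # debug
                make_gate(0xFFFFFFFF8160C200, 2),   # nmi
                make_gate(0xFFFFFFFF8160C300, 1),   # double_fault
                make_gate(0xFFFFFFFF8160C400, 4),   # machine_check
                make_gate(0xFFFFFFFF8160C500, 0)])  # no IST stack
print(build_stack_names(idt, symbols)[:4])
```

Because the names are derived from whichever handlers actually use an
IST stack, a layout with 3, 4 or 5 stacks is labeled without any
hard-coded ordering.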

Comment 1 Dave Anderson 2015-02-10 20:25:51 UTC
This issue is fixed in crash-7.1.0-1.el6, which was built today for the
rebase approved for the rhel6.7 crash utility errata.

Comment 7 errata-xmlrpc 2015-07-22 06:27:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1309.html