Bug 608171 - [RHEL5] possibly bogus exception frame -- NMI while running on thread user stack
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: crash
Version: 5.5
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Assignee: Dave Anderson
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 608173
 
Reported: 2010-06-25 21:12 UTC by Guy Streeter
Modified: 2016-02-10 01:33 UTC (History)

Fixed In Version: crash-4.1.2-6.el5
Doc Type: Bug Fix
Doc Text:
Prior to this update, the "bt" command failed to make the transition from the NMI exception stack to the process stack when a task had just entered the kernel but had not yet switched its stack pointer from the user-space per-thread stack to the relevant kernel stack. This has been fixed, and the transition is now made as expected.
Clone Of:
: 608173 (view as bug list)
Environment:
Last Closed: 2011-01-13 22:50:47 UTC
Target Upstream Version:
Embargoed:


Links
System ID                              Private  Priority  Status        Summary               Last Updated
Red Hat Product Errata RHBA-2011:0059  0        normal    SHIPPED_LIVE  crash bug fix update  2011-01-12 17:15:15 UTC

Description Guy Streeter 2010-06-25 21:12:37 UTC
crash> bt 992
PID: 992    TASK: ffff810eb00cd7e0  CPU: 27  COMMAND: "java"
 #0 [ffff810e37150f20] crash_nmi_callback at ffffffff8007af27
 #1 [ffff810e37150f40] do_nmi at ffffffff8006588a
 #2 [ffff810e37150f50] nmi at ffffffff80064eef
    [exception RIP: system_call]
    RIP: ffffffff8005d098  RSP: 00000000431dfc58  RFLAGS: 00000003
    RAX: 0000000000000018  RBX: 0000000040dd3b60  RCX: 00000032440baa27
    RDX: 00002aaad007dc88  RSI: 00002aaad007a130  RDI: 0000000000000000
    RBP: 00000000431dfc60   R8: 0000000000000020   R9: 00000000000006af
    R10: 00000000000003e0  R11: 0000000000000203  R12: 000000000000015e
    R13: 0000000040dd3b70  R14: 00000000431dfca8  R15: 00000000431dfcb4
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
bt: WARNING: possibly bogus exception frame
--- <NMI exception stack> ---
 #3 [431dfc58] system_call at ffffffff8005d098
bt: cannot transition from exception stack to current process stack:
    exception stack pointer: ffff810e37150f20
      process stack pointer: 431dfc58
         current stack base: ffff810ecd8ac000
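
The failure message above comes down to a range test: the saved RSP is not inside the task's kernel stack. The following is a minimal sketch of that test, not the crash utility's own code. The helper names and the 8 KB stack size are assumptions made for illustration (8 KB is the usual kernel stack size on x86_64 2.6.18 kernels).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 8 KB kernel stacks, as is usual for x86_64 2.6.18 kernels. */
#define KSTACK_SIZE 0x2000ULL

/* True if sp lies inside [base, base + size). */
bool in_stack_range(uint64_t sp, uint64_t base, uint64_t size)
{
    return sp >= base && sp - base < size;
}

/* Hypothetical helper: is the saved RSP on the task's kernel stack? */
bool rsp_on_kernel_stack(uint64_t rsp, uint64_t stack_base)
{
    return in_stack_range(rsp, stack_base, KSTACK_SIZE);
}
```

With the values from the message, `rsp_on_kernel_stack(0x431dfc58, 0xffff810ecd8ac000)` is false, because 431dfc58 is a user-space address well below the kernel stack base.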


Analysis from Dave Anderson:

That's a new crash bug.  The exception frame shows that the cpu was in
the system_call routine, but had not yet had a chance to switch the RSP
from the user-space stack address to the kernel stack.  I had recently
put a fix into 5.0.5 for just that scenario, but it depended upon this
function returning TRUE:
  
  int
  in_user_stack(ulong task, ulong vaddr)
  {
          ulong vma, vm_flags;
          char *vma_buf;
  
          /* Look up the task's VMA that contains the user address. */
          if ((vma = vm_area_dump(task, UVADDR|VERIFY_ADDR, vaddr, 0))) {
                  vma_buf = fill_vma_cache(vma);
                  /* vm_flags is a short or a long, depending on the kernel. */
                  vm_flags = SIZE(vm_area_struct_vm_flags) == sizeof(short) ?
                          USHORT(vma_buf + OFFSET(vm_area_struct_vm_flags)) :
                          ULONG(vma_buf + OFFSET(vm_area_struct_vm_flags));
  
                  /* Only a VMA marked as growable is accepted as a stack. */
                  if (vm_flags & (VM_GROWSUP|VM_GROWSDOWN))
                          return TRUE;
          }
          return FALSE;
  }
  
But that java task is apparently running on a per-thread stack instead of
the "normal" process stack at the top of memory:

  crash> vm 992
  PID: 992    TASK: ffff810eb00cd7e0  CPU: 27  COMMAND: "java"
         MM               PGD          RSS    TOTAL_VM
  ffff810ec6e51c40  ffff810ece3cc000  1130452k  6829724k
        VMA           START       END     FLAGS FILE
  ...
  ffff810ecb9df088   430e1000   431e1000 100077
  ...
  ffff810ecbe73768 7fffec321000 7fffec51e000 100177 
  crash>
  
And annoyingly enough, those per-thread stacks don't set the
GROWSUP/GROWSDOWN flags in their vm_area_struct's vm_flags:

  crash> vm -f 100077
  100077: (READ|WRITE|EXEC|MAYREAD|MAYWRITE|MAYEXEC|WRITECOMBINED|ACCOUNT)
  crash> vm -f 100177
  100177: (READ|WRITE|EXEC|MAYREAD|MAYWRITE|MAYEXEC|GROWSDOWN|WRITECOMBINED|ACCOUNT)
  crash>
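
The flag test from in_user_stack() can be reproduced in isolation against the two vm_flags values shown above. This is only an illustrative sketch: the helper name is made up, and the bit values are the VM_GROWSDOWN and VM_GROWSUP definitions from the Linux kernel's include/linux/mm.h.

```c
#include <stdbool.h>

/* vm_flags bits as defined in the Linux kernel's include/linux/mm.h. */
#define VM_GROWSDOWN 0x00000100UL
#define VM_GROWSUP   0x00000200UL

/* Hypothetical helper: the same bit test that in_user_stack() applies. */
bool has_stack_growth_flag(unsigned long vm_flags)
{
    return (vm_flags & (VM_GROWSUP | VM_GROWSDOWN)) != 0;
}
```

Here `has_stack_growth_flag(0x100077)` is false for the per-thread stack mapping, while `has_stack_growth_flag(0x100177)` is true for the main process stack, which is why the earlier fix did not cover this task.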

A vmcore exhibiting the problem is on megatron.gsslab.rdu.redhat.com in
/cores/20100614175236/work
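
One way such a check could be relaxed is to also accept any readable and writable mapping that contains the address. The sketch below illustrates that idea only. It is an assumption about a possible approach, not the change shipped in crash-4.1.2-6.el5, and the struct, table, and function names are hypothetical. The two table entries are the mappings quoted from "vm 992".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* vm_flags bits as defined in the Linux kernel's include/linux/mm.h. */
#define VM_READ      0x00000001UL
#define VM_WRITE     0x00000002UL
#define VM_GROWSDOWN 0x00000100UL
#define VM_GROWSUP   0x00000200UL

/* Hypothetical, simplified stand-in for a vm_area_struct. */
struct vma_range {
    uint64_t start;       /* first address in the mapping */
    uint64_t end;         /* one past the last address */
    unsigned long flags;  /* vm_flags */
};

/* The two mappings quoted from "vm 992". */
const struct vma_range pid992_vmas[] = {
    { 0x430e1000ULL,     0x431e1000ULL,     0x100077UL },
    { 0x7fffec321000ULL, 0x7fffec51e000ULL, 0x100177UL },
};
const size_t pid992_vma_count = sizeof(pid992_vmas) / sizeof(pid992_vmas[0]);

/* Return the mapping that contains addr, or NULL if there is none. */
const struct vma_range *find_vma(const struct vma_range *vmas, size_t n,
                                 uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= vmas[i].start && addr < vmas[i].end)
            return &vmas[i];
    return NULL;
}

/* Assumed relaxation: growable mappings, or any read+write mapping. */
bool plausible_user_stack(const struct vma_range *vmas, size_t n,
                          uint64_t addr)
{
    const struct vma_range *vma = find_vma(vmas, n, addr);

    if (vma == NULL)
        return false;
    if (vma->flags & (VM_GROWSUP | VM_GROWSDOWN))
        return true;
    return (vma->flags & (VM_READ | VM_WRITE)) == (VM_READ | VM_WRITE);
}
```

The saved RSP 431dfc58 falls inside the 430e1000-431e1000 mapping, so this relaxed test accepts it even though GROWSDOWN is not set. The trade-off is that any writable data mapping would pass as well, so a check like this only makes sense once the exception frame already indicates a user-space stack pointer.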

Comment 2 Dave Anderson 2010-07-06 18:52:17 UTC
QA assist:

Using the supplied vmlinux/anritsu-2023578-vmcore pair, crash version
4.1.2-5.el5 fails to make the transition from the NMI exception stack
to the process stack if the interrupted, active, multi-threaded task had 
just entered the kernel but had not yet switched its stack pointer from
the user-space per-thread stack to the process's kernel stack:
  
  # crash vmlinux anritsu-2023578-vmcore
  
  crash 4.1.2-5.el5
  Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009  Red Hat, Inc.
  Copyright (C) 2004, 2005, 2006  IBM Corporation
  Copyright (C) 1999-2006  Hewlett-Packard Co
  Copyright (C) 2005, 2006  Fujitsu Limited
  Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
  Copyright (C) 2005  NEC Corporation
  Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
  Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
  This program is free software, covered by the GNU General Public License,
  and you are welcome to change it and/or distribute copies of it under
  certain conditions.  Enter "help copying" to see the conditions.
  This program has absolutely no warranty.  Enter "help warranty" for details.
   
  GNU gdb 6.1
  Copyright 2004 Free Software Foundation, Inc.
  GDB is free software, covered by the GNU General Public License, and you are
  welcome to change it and/or distribute copies of it under certain conditions.
  Type "show copying" to see the conditions.
  There is absolutely no warranty for GDB.  Type "show warranty" for details.
  This GDB was configured as "x86_64-unknown-linux-gnu"...
  
  please wait... (determining panic task)         
  bt: cannot transition from exception stack to current process stack:
      exception stack pointer: ffff810e37150f20
        process stack pointer: 431dfc58
           current stack base: ffff810ecd8ac000
  
        KERNEL: vmlinux                           
      DUMPFILE: anritsu-2023578-vmcore  [PARTIAL DUMP]
          CPUS: 32
          DATE: Mon Jun 14 16:25:59 2010
        UPTIME: 2 days, 01:45:24
  LOAD AVERAGE: 6.28, 3.21, 1.97
         TASKS: 1497
      NODENAME: qad-prod.us.anritsu.com
       RELEASE: 2.6.18-194.3.1.el5
       VERSION: #1 SMP Sun May 2 04:17:42 EDT 2010
       MACHINE: x86_64  (2400 Mhz)
        MEMORY: 63.1 GB
         PANIC: "Kernel panic - not syncing: softlockup: hung tasks"
           PID: 28942
       COMMAND: "java"
          TASK: ffff810bbd90d7a0  [THREAD_INFO: ffff810be0888000]
           CPU: 6
         STATE: TASK_RUNNING (PANIC)
  
  crash> bt 992
  PID: 992    TASK: ffff810eb00cd7e0  CPU: 27  COMMAND: "java"
   #0 [ffff810e37150f20] crash_nmi_callback at ffffffff8007af27
   #1 [ffff810e37150f40] do_nmi at ffffffff8006588a
   #2 [ffff810e37150f50] nmi at ffffffff80064eef
      [exception RIP: system_call]
      RIP: ffffffff8005d098  RSP: 00000000431dfc58  RFLAGS: 00000003
      RAX: 0000000000000018  RBX: 0000000040dd3b60  RCX: 00000032440baa27
      RDX: 00002aaad007dc88  RSI: 00002aaad007a130  RDI: 0000000000000000
      RBP: 00000000431dfc60   R8: 0000000000000020   R9: 00000000000006af
      R10: 00000000000003e0  R11: 0000000000000203  R12: 000000000000015e
      R13: 0000000040dd3b70  R14: 00000000431dfca8  R15: 00000000431dfcb4
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  bt: WARNING: possibly bogus exception frame
  --- <NMI exception stack> ---
   #3 [431dfc58] system_call at ffffffff8005d098
  bt: cannot transition from exception stack to current process stack:
      exception stack pointer: ffff810e37150f20
        process stack pointer: 431dfc58
           current stack base: ffff810ecd8ac000
  crash>
  
Crash version 4.1.2-6.el5 makes the stack transition correctly:
  
  # crash vmlinux anritsu-2023578-vmcore
  
  crash 4.1.2-6.el5
  Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009  Red Hat, Inc.
  Copyright (C) 2004, 2005, 2006  IBM Corporation
  Copyright (C) 1999-2006  Hewlett-Packard Co
  Copyright (C) 2005, 2006  Fujitsu Limited
  Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
  Copyright (C) 2005  NEC Corporation
  Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
  Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
  This program is free software, covered by the GNU General Public License,
  and you are welcome to change it and/or distribute copies of it under
  certain conditions.  Enter "help copying" to see the conditions.
  This program has absolutely no warranty.  Enter "help warranty" for details.
   
  GNU gdb 6.1
  Copyright 2004 Free Software Foundation, Inc.
  GDB is free software, covered by the GNU General Public License, and you are
  welcome to change it and/or distribute copies of it under certain conditions.
  Type "show copying" to see the conditions.
  There is absolutely no warranty for GDB.  Type "show warranty" for details.
  This GDB was configured as "x86_64-unknown-linux-gnu"...
  
        KERNEL: vmlinux
      DUMPFILE: anritsu-2023578-vmcore  [PARTIAL DUMP]
          CPUS: 32
          DATE: Mon Jun 14 16:25:59 2010
        UPTIME: 2 days, 01:45:24
  LOAD AVERAGE: 6.28, 3.21, 1.97
         TASKS: 1497
      NODENAME: qad-prod.us.anritsu.com
       RELEASE: 2.6.18-194.3.1.el5
       VERSION: #1 SMP Sun May 2 04:17:42 EDT 2010
       MACHINE: x86_64  (2400 Mhz)
        MEMORY: 63.1 GB
         PANIC: "Kernel panic - not syncing: softlockup: hung tasks"
           PID: 28942
       COMMAND: "java"
          TASK: ffff810bbd90d7a0  [THREAD_INFO: ffff810be0888000]
           CPU: 6
         STATE: TASK_RUNNING (PANIC)
  
  crash> bt 992
  PID: 992    TASK: ffff810eb00cd7e0  CPU: 27  COMMAND: "java"
   #0 [ffff810e37150f20] crash_nmi_callback at ffffffff8007af27
   #1 [ffff810e37150f40] do_nmi at ffffffff8006588a
   #2 [ffff810e37150f50] nmi at ffffffff80064eef
      [exception RIP: system_call]
      RIP: ffffffff8005d098  RSP: 00000000431dfc58  RFLAGS: 00000003
      RAX: 0000000000000018  RBX: 0000000040dd3b60  RCX: 00000032440baa27
      RDX: 00002aaad007dc88  RSI: 00002aaad007a130  RDI: 0000000000000000
      RBP: 00000000431dfc60   R8: 0000000000000020   R9: 00000000000006af
      R10: 00000000000003e0  R11: 0000000000000203  R12: 000000000000015e
      R13: 0000000040dd3b70  R14: 00000000431dfca8  R15: 00000000431dfcb4
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   #3 [431dfc58] system_call at ffffffff8005d098
      RIP: 00000032440baa27  RSP: 00000000431dfc58  RFLAGS: 00000203
      RAX: 0000000000000000  RBX: 0000000040dd3b60  RCX: ffffffffffffffff
      RDX: 00002aaad007dc88  RSI: 00002aaad007a130  RDI: 0000000000000000
      RBP: 00000000431dfc60   R8: 0000000000000020   R9: 00000000000006af
      R10: 00000000000003e0  R11: 0000000000000203  R12: 000000000000015d
      R13: 0000000040dd3b70  R14: 00000000431dfca8  R15: 00000000431dfcb4
      ORIG_RAX: 0000000000000018  CS: 0033  SS: 002b
  crash>
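
The two frames in the fixed output can be told apart by their CS values, and the user-mode RIP and RFLAGS match what the SYSCALL instruction leaves in RCX and R11. The sketch below illustrates those two x86_64 facts only. The type and function names are hypothetical, and it is not a description of how crash builds the second frame.

```c
#include <stdbool.h>
#include <stdint.h>

/* The low two bits of a segment selector hold the privilege level. */
#define SELECTOR_RPL_MASK 0x3u

/* True when the saved CS selector belongs to user mode (ring 3). */
bool cs_is_user_mode(uint16_t cs)
{
    return (cs & SELECTOR_RPL_MASK) == 3;
}

/* Hypothetical container for the state SYSCALL leaves in registers. */
struct syscall_entry_state {
    uint64_t user_rip;     /* return address, copied into RCX by SYSCALL */
    uint64_t user_rflags;  /* RFLAGS value, copied into R11 by SYSCALL */
};

/*
 * Valid only if RCX and R11 have not been modified since SYSCALL, as is
 * the case when the NMI arrives at the first instruction of system_call.
 */
struct syscall_entry_state syscall_entry_state_from(uint64_t rcx, uint64_t r11)
{
    struct syscall_entry_state s = { rcx, r11 };
    return s;
}
```

For the NMI frame, CS 0010 is a kernel selector, and its RCX (00000032440baa27) and R11 (0000000000000203) equal the RIP and RFLAGS shown in the user-mode frame, whose CS 0033 is a ring 3 selector.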

Comment 5 Jaromir Hradilek 2010-12-01 19:05:57 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Prior to this update, the "bt" command failed to make the transition from the NMI exception stack to the process stack when a task had just entered the kernel but had not yet switched its stack pointer from the user-space per-thread stack to the relevant kernel stack. This has been fixed, and the transition is now made as expected.

Comment 7 errata-xmlrpc 2011-01-13 22:50:47 UTC
An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0059.html

