Bug 466724 - [5.3][xen] bt: invalid structure size: task_struct
Summary: [5.3][xen] bt: invalid structure size: task_struct
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: crash
Version: 5.2
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Dave Anderson
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-10-13 08:44 UTC by Qian Cai
Modified: 2009-09-02 09:40 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When the "bt" command was run on a Xen hypervisor dump in which the backtrace led to either "process_softirqs" or "page_fault", it would report: "bt: cannot resolve stack trace". The recovery code would then terminate the command with the nonsensical error message: "bt: invalid structure size: task_struct". The command now properly terminates the backtrace.
Clone Of:
Environment:
Last Closed: 2009-09-02 09:40:19 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2009:1283 0 normal SHIPPED_LIVE crash bug fix update 2009-09-01 09:50:41 UTC

Description Qian Cai 2008-10-13 08:44:05 UTC
Description of problem:
If the Xen Domain 0 kernel or the hypervisor crashes while CPUs are handling IRQs, the generated vmcore cannot be analysed with the bt -a command in Xen hypervisor mode.

crash 4.0-7.2.3
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
NOTE: stdin: not a tty

GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...

   KERNEL: /boot/xen-syms-2.6.18-118.el5
DEBUGINFO: /usr/lib/debug/boot/xen-syms-2.6.18-118.el5.debug
 DUMPFILE: /var/crash/127.0.0.1-2008-10-13-04:05:45/vmcore
     CPUS: 8
  DOMAINS: 4
   UPTIME: 00:10:00
  MACHINE: Intel(R) Xeon(TM) CPU 3.73GHz  (3724 Mhz)
   MEMORY: 4 GB
  PCPU-ID: 6
     PCPU: ffbeffb4
  VCPU-ID: 6
     VCPU: ffbc1080  (VCPU_RUNNING)
DOMAIN-ID: 0
   DOMAIN: ffbd8080  (DOMAIN_RUNNING)
    STATE: CRASH

crash> bt -a
PCPU:  0  VCPU: ffbc7080
bt: cannot resolve stack trace:
 #0 [ff1d3ebc] elf_core_save_regs at ff10a810
 #1 [ff1d3ec4] common_interrupt at ff1222ed
 #2 [ff1d3ed0] do_nmi at ff1335bb
 #3 [ff1d3ef0] handle_nmi_mce at ff17442e
 #4 [ff1d3f24] csched_tick at ff110aa7
 #5 [ff1d3f80] timer_softirq_action at ff1155d2
 #6 [ff1d3fa0] do_softirq at ff1143fe
 #7 [ff1d3fb0] process_softirqs at ff173f61
bt: text symbols on stack:

bt: invalid structure size: task_struct
    FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()

    [ff1d3ebc] disable_local_APIC at ff11db75
    [ff1d3ec0] crash_nmi_callback at ff13cc96
    [ff1d3ec4] common_interrupt at ff1222f2
    [ff1d3ed0] do_nmi at ff1335c1
    [ff1d3ef0] handle_nmi_mce at ff174435
    [ff1d3f18] csched_tick at ff110aa7
    [ff1d3f80] timer_softirq_action at ff1155d4
    [ff1d3fa0] do_softirq at ff114405
    [ff1d3fb0] process_softirqs at ff173f66
[/usr/bin/crash] error trace: 81637af => 816450b => 810c544 => 813eebc

  813eebc: SIZE_verify+126
  810c544: (undetermined)
  816450b: (undetermined)
  81637af: lkcd_x86_back_trace+2370

bt: invalid structure size: task_struct
    FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()


Version-Release number of selected component (if applicable):
crash-4.0-7.2.3
kernel-xen-2.6.18-118.el5
kernel-PAE-2.6.18-118.el5

How reproducible:
always

Steps to Reproduce:
1. Configure kdump on Xen with crashkernel=128M@32M.
2. Use a jprobe to trigger BUG() in __do_IRQ().
3. crash xen-syms vmcore
4. bt -a
  
Actual results:
See errors.

Expected results:
No error.

Comment 2 Qian Cai 2008-10-13 08:52:09 UTC
Here is another failure:

   KERNEL: /boot/xen-syms-2.6.18-118.el5
DEBUGINFO: /usr/lib/debug/boot/xen-syms-2.6.18-118.el5.debug
 DUMPFILE: /var/crash/127.0.0.1-2008-10-10-07:08:55/vmcore
     CPUS: 8
  DOMAINS: 4
   UPTIME: 00:02:08
  MACHINE: Dual-Core AMD Opteron(tm) Processor 8216  (2411 Mhz)
   MEMORY: 14 GB
  PCPU-ID: 0
     PCPU: ff1d3fb4
  VCPU-ID: 0
     VCPU: ff2ab080  (VCPU_RUNNING)
DOMAIN-ID: 0
   DOMAIN: ff2bc080  (DOMAIN_RUNNING)
    STATE: CRASH

crash> bt -a
PCPU:  0  VCPU: ffbfc080
 #0 [ff1d3f40] elf_core_save_regs at ff10a810
 #1 [ff1d3f44] crash_nmi_callback at ff13cc91
 #2 [ff1d3f54] do_nmi at ff1335bb
 #3 [ff1d3f74] handle_nmi_mce at ff17442e
 #4 [ff1d3fa8] idle_loop at ff11f975

PCPU:  1  VCPU: ff1b6080
 #0 [ff1bff40] elf_core_save_regs at ff10a810
 #1 [ff1bff44] crash_nmi_callback at ff13cc91
 #2 [ff1bff54] do_nmi at ff1335bb
 #3 [ff1bff74] handle_nmi_mce at ff17442e
 #4 [ff1bffa8] idle_loop at ff11f975

PCPU:  2  VCPU: ff1cc080
 #0 [ff1c7f60] elf_core_save_regs at ff10a810
 #1 [ff1c7f64] kexec_crash at ff10abe0
 #2 [ff1c7f74] do_crashdump_trigger at ff10b388
 #3 [ff1c7f84] keypress_softirq at ff10a024
 #4 [ff1c7f94] do_softirq at ff1143fe
 #5 [ff1c7fa4] idle_loop at ff11f975

PCPU:  3  VCPU: ff1b9080
 #0 [ff1c3f40] elf_core_save_regs at ff10a810
 #1 [ff1c3f44] crash_nmi_callback at ff13cc91
 #2 [ff1c3f54] do_nmi at ff1335bb
 #3 [ff1c3f74] handle_nmi_mce at ff17442e
 #4 [ff1c3fa8] idle_loop at ff11f975

PCPU:  4  VCPU: ff23f080
 #0 [ff23bf40] elf_core_save_regs at ff10a810
 #1 [ff23bf44] crash_nmi_callback at ff13cc91
 #2 [ff23bf54] do_nmi at ff1335bb
 #3 [ff23bf74] handle_nmi_mce at ff17442e
 #4 [ff23bfa8] idle_loop at ff11f975

PCPU:  5  VCPU: ff23d080
 #0 [ff237f40] elf_core_save_regs at ff10a810
 #1 [ff237f44] crash_nmi_callback at ff13cc91
 #2 [ff237f54] do_nmi at ff1335bb
 #3 [ff237f74] handle_nmi_mce at ff17442e
 #4 [ff237fa8] idle_loop at ff11f975

PCPU:  6  VCPU: ff233080
 #0 [ffbeff40] elf_core_save_regs at ff10a810
 #1 [ffbeff44] crash_nmi_callback at ff13cc91
 #2 [ffbeff54] do_nmi at ff1335bb
 #3 [ffbeff74] handle_nmi_mce at ff17442e
 #4 [ffbeffa8] idle_loop at ff11f975

PCPU:  7  VCPU: ff231080
bt: cannot resolve stack trace:
 #0 [ffbebebc] elf_core_save_regs at ff10a810
 #1 [ffbebec0] crash_nmi_callback at ff13cc91
 #2 [ffbebed0] do_nmi at ff1335bb
 #3 [ffbebef0] handle_nmi_mce at ff17442e
 #4 [ffbebf24] ns_read_reg at ff11cb0a
 #5 [ffbebf24] ns16550_interrupt at ff11cc79
 #6 [ffbebf44] do_IRQ at ff1262c1
 #7 [ffbebf74] common_interrupt at ff1222ed
bt: text symbols on stack:

bt: invalid structure size: task_struct
    FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()

    [ffbebebc] disable_local_APIC at ff11db75
    [ffbebec0] crash_nmi_callback at ff13cc96
    [ffbebed0] do_nmi at ff1335c1
    [ffbebef0] handle_nmi_mce at ff174435
    [ffbebf18] ns_read_reg at ff11cb0a
    [ffbebf24] ns16550_interrupt at ff11cc7e
    [ffbebf44] do_IRQ at ff1262c3
    [ffbebf74] common_interrupt at ff1222f2
    [ffbebf9c] idle_loop at ff11f975
[/usr/bin/crash] error trace: 81637af => 816450b => 810c544 => 813eebc

  813eebc: SIZE_verify+126
  810c544: (undetermined)
  816450b: (undetermined)
  81637af: lkcd_x86_back_trace+2370

bt: invalid structure size: task_struct
    FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()

Comment 3 Qian Cai 2008-10-13 10:44:28 UTC
I don't know if this is related, but I have also seen crash fail to analyse a Xen Domain 0 kernel:

      KERNEL: /usr/lib/debug/lib/modules/2.6.18-118.el5/vmlinux
    DUMPFILE: /var/crash/2008-10-13-06:05/vmcore
        CPUS: 1
        DATE: Mon Oct 13 06:04:21 2008
      UPTIME: 00:01:53
LOAD AVERAGE: 1.10, 0.50, 0.19
       TASKS: 77
    NODENAME: dellgx240.rhts.bos.redhat.com
     RELEASE: 2.6.18-118.el5
     VERSION: #1 SMP Sat Oct 4 00:21:41 EDT 2008
     MACHINE: i686  (1694 Mhz)
      MEMORY: 1 GB
       PANIC: "kernel BUG at /mnt/tests/kernel/kdump/crash-lkdtm/lkdtm/lkdtm.c:258!"
         PID: 330
     COMMAND: "udevd"
        TASK: f7f50000  [THREAD_INFO: f7fad000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash> bt -t
PID: 330    TASK: f7f50000  CPU: 0   COMMAND: "udevd"
      START: crash_kexec at c0443d42
  [c075dea8] __do_IRQ at c044e22c
  [c075dec0] lkdtm_handler at f89b40a4
  [c075dedc] die at c04064cb
  [c075df04] do_invalid_op at c0406b98
  [c075df0c] do_invalid_op at c0406c29
  [c075df2c] lkdtm_handler at f89b40a4
  [c075df48] release_console_sem at c042544c
  [c075df70] __do_IRQ at c044e22c
  [c075df74] kprobe_exceptions_notify at c06153a9
  [c075df90] notifier_call_chain at c0615ebb
  [c075df9c] __do_IRQ at c044e22c
  [c075dfbc] error_code at c0405a89
  [c075dfd0] __do_IRQ at c044e22c
  [c075dfe8] lkdtm_handler at f89b40a4
  [c075dff8] jp_do_irq at f89b4148
  [c075dffc] do_IRQ at c04074ce
--- <hard IRQ> ---
bt: invalid stack address for this task: f89b4148
    (valid range: f7fad000 - f7fae000)
      START: do_IRQ at c0407435
crash> bt -r
PID: 330    TASK: f7f50000  CPU: 0   COMMAND: "udevd"
f7fad000:  f7f50000 default_exec_domain 00000000 00000000 
f7fad010:  00000000 00010000 c0000000 00b0e410 
f7fad020:  do_no_restart_syscall 00000000 00000000 00000000 
f7fad030:  00000000 00000000 00000000 00000000 
f7fad040:  00000000 00000000 00000000 00000000 
f7fad050:  00000000 00000000 00000000 00000000 
f7fad060:  00000000 00000000 00000000 00000000 
f7fad070:  00000000 00000000 00000000 00000000 
f7fad080:  00000000 00000000 00000000 00000000 
f7fad090:  00000000 00000000 00000000 00000000 
f7fad0a0:  00000000 00000000 00000000 00000000 
f7fad0b0:  00000000 00000000 00000000 00000000 
f7fad0c0:  00000000 00000000 00000000 00000000 
f7fad0d0:  00000000 00000000 00000000 00000000 
f7fad0e0:  00000000 00000000 00000000 00000000 
f7fad0f0:  00000000 00000000 00000000 00000000 
f7fad100:  00000000 00000000 00000000 00000000 
f7fad110:  00000000 00000000 00000000 00000000 
f7fad120:  00000000 00000000 00000000 00000000 
f7fad130:  00000000 00000000 00000000 00000000 
f7fad140:  00000000 00000000 00000000 00000000 
f7fad150:  00000000 00000000 00000000 00000000 
f7fad160:  00000000 00000000 00000000 00000000 
f7fad170:  00000000 00000000 00000000 00000000 
f7fad180:  00000000 00000000 00000000 00000000 
f7fad190:  00000000 00000000 00000000 00000000 
f7fad1a0:  00000000 00000000 00000000 00000000 
f7fad1b0:  00000000 00000000 00000000 00000000 
f7fad1c0:  00000000 00000000 00000000 00000000 
f7fad1d0:  00000000 00000000 00000000 00000000 
f7fad1e0:  00000000 00000000 00000000 00000000 
f7fad1f0:  00000000 00000000 00000000 00000000 
f7fad200:  00000000 00000000 00000000 00000000 
f7fad210:  00000000 00000000 00000000 00000000 
f7fad220:  00000000 00000000 00000000 00000000 
f7fad230:  00000000 00000000 00000000 00000000 
f7fad240:  00000000 00000000 00000000 00000000 
f7fad250:  00000000 00000000 00000000 00000000 
f7fad260:  00000000 00000000 00000000 00000000 
f7fad270:  00000000 00000000 00000000 00000000 
f7fad280:  00000000 00000000 00000000 00000000 
f7fad290:  00000000 00000000 00000000 00000000 
f7fad2a0:  00000000 00000000 00000000 00000000 
f7fad2b0:  00000000 00000000 00000000 00000000 
f7fad2c0:  00000000 00000000 00000000 00000000 
f7fad2d0:  00000000 00000000 00000000 00000000 
f7fad2e0:  00000000 00000000 00000000 00000000 
f7fad2f0:  00000000 00000000 00000000 00000000 
f7fad300:  00000000 00000000 00000000 00000000 
f7fad310:  00000000 00000000 00000000 00000000 
f7fad320:  00000000 00000000 00000000 00000000 
f7fad330:  00000000 00000000 00000000 00000000 
f7fad340:  00000000 00000000 00000000 00000000 
f7fad350:  00000000 00000000 00000000 00000000 
f7fad360:  00000000 00000000 00000000 00000000 
f7fad370:  00000000 00000000 00000000 00000000 
f7fad380:  00000000 00000000 00000000 00000000 
f7fad390:  00000000 00000000 00000000 00000000 
f7fad3a0:  00000000 00000000 00000000 00000000 
f7fad3b0:  00000000 00000000 00000000 00000000 
f7fad3c0:  00000000 00000000 00000000 00000000 
f7fad3d0:  00000000 00000000 00000000 00000000 
f7fad3e0:  00000000 00000000 00000000 00000000 
f7fad3f0:  00000000 00000000 00000000 00000000 
f7fad400:  00000000 00000000 00000000 00000000 
f7fad410:  00000000 00000000 00000000 00000000 
f7fad420:  00000000 00000000 00000000 00000000 
f7fad430:  00000000 00000000 00000000 00000000 
f7fad440:  00000000 00000000 00000000 00000000 
f7fad450:  00000000 00000000 00000000 00000000 
f7fad460:  00000000 00000000 00000000 00000000 
f7fad470:  00000000 00000000 00000000 00000000 
f7fad480:  00000000 00000000 00000000 00000000 
f7fad490:  00000000 00000000 00000000 00000000 
f7fad4a0:  00000000 00000000 00000000 00000000 
f7fad4b0:  00000000 00000000 00000000 00000000 
f7fad4c0:  00000000 00000000 00000000 00000000 
f7fad4d0:  00000000 00000000 00000000 00000000 
f7fad4e0:  00000000 00000000 00000000 00000000 
f7fad4f0:  00000000 00000000 00000000 00000000 
f7fad500:  00000000 00000000 00000000 00000000 
f7fad510:  00000000 00000000 00000000 00000000 
f7fad520:  00000000 00000000 00000000 00000000 
f7fad530:  00000000 00000000 00000000 00000000 
f7fad540:  00000000 00000000 00000000 00000000 
f7fad550:  00000000 00000000 00000000 00000000 
f7fad560:  00000000 00000000 00000000 00000000 
f7fad570:  00000000 00000000 00000000 00000000 
f7fad580:  00000000 00000000 00000000 00000000 
f7fad590:  00000000 00000000 00000000 00000000 
f7fad5a0:  00000000 00000000 00000000 00000000 
f7fad5b0:  00000000 00000000 00000000 00000000 
f7fad5c0:  00000000 00000000 00000000 00000000 
f7fad5d0:  00000000 00000000 00000000 00000000 
f7fad5e0:  00000000 00000000 00000000 00000000 
f7fad5f0:  00000000 00000000 00000000 00000000 
f7fad600:  00000000 00000000 00000000 00000000 
f7fad610:  00000000 00000000 00000000 00000000 
f7fad620:  00000000 00000000 00000000 00000000 
f7fad630:  00000000 00000000 00000000 00000000 
f7fad640:  00000000 00000000 00000000 00000000 
f7fad650:  00000000 00000000 00000000 00000000 
f7fad660:  00000000 00000000 00000000 00000000 
f7fad670:  00000000 00000000 00000000 00000000 
f7fad680:  00000000 00000000 00000000 00000000 
f7fad690:  00000000 00000000 00000000 00000000 
f7fad6a0:  00000000 00000000 00000000 00000000 
f7fad6b0:  00000000 00000000 00000000 00000000 
f7fad6c0:  00000000 00000000 00000000 00000000 
f7fad6d0:  00000000 00000000 00000000 00000000 
f7fad6e0:  00000000 00000000 00000000 00000000 
f7fad6f0:  00000000 00000000 00000000 00000000 
f7fad700:  00000000 00000000 00000000 00000000 
f7fad710:  00000000 00000000 00000000 00000000 
f7fad720:  00000000 00000000 00000000 00000000 
f7fad730:  00000000 00000000 00000000 00000000 
f7fad740:  00000000 00000000 00000000 00000000 
f7fad750:  00000000 00000000 00000000 00000000 
f7fad760:  00000000 00000000 00000000 00000000 
f7fad770:  00000000 00000000 00000000 00000000 
f7fad780:  00000000 00000000 00000000 00000000 
f7fad790:  00000000 00000000 00000000 00000000 
f7fad7a0:  00000000 00000000 00000000 00000000 
f7fad7b0:  00000000 00000000 00000000 00000000 
f7fad7c0:  00000000 00000000 00000000 00000000 
f7fad7d0:  00000000 00000000 00000000 00000000 
f7fad7e0:  00000000 00000000 00000000 00000000 
f7fad7f0:  00000000 00000000 00000000 00000000 
f7fad800:  00000000 00000000 00000000 00000000 
f7fad810:  00000000 00000000 00000000 00000000 
f7fad820:  00000000 00000000 00000000 00000000 
f7fad830:  00000000 00000000 00000000 00000000 
f7fad840:  00000000 00000000 00000000 00000000 
f7fad850:  00000000 00000000 00000000 00000000 
f7fad860:  00000000 00000000 00000000 00000000 
f7fad870:  00000000 00000000 00000000 00000000 
f7fad880:  00000000 00000000 00000000 00000000 
f7fad890:  00000000 00000000 00000000 00000000 
f7fad8a0:  00000000 00000000 00000000 00000000 
f7fad8b0:  00000000 00000000 00000000 00000000 
f7fad8c0:  00000000 00000000 00000000 00000000 
f7fad8d0:  00000000 00000000 00000000 00000000 
f7fad8e0:  00000000 00000000 00000000 00000000 
f7fad8f0:  00000000 00000000 00000000 00000000 
f7fad900:  00000000 00000000 00000000 00000000 
f7fad910:  00000000 00000000 00000000 00000000 
f7fad920:  00000000 00000000 00000000 00000000 
f7fad930:  00000000 00000000 00000000 00000000 
f7fad940:  00000000 00000000 00000000 00000000 
f7fad950:  00000000 00000000 00000000 00000000 
f7fad960:  00000000 00000000 00000000 00000000 
f7fad970:  00000000 00000000 00000000 00000000 
f7fad980:  00000000 00000000 00000000 00000000 
f7fad990:  00000000 00000000 00000000 00000000 
f7fad9a0:  00000000 00000000 00000000 00000000 
f7fad9b0:  00000000 00000000 00000000 00000000 
f7fad9c0:  00000000 00000000 00000000 00000000 
f7fad9d0:  00000000 00000000 00000000 00000000 
f7fad9e0:  00000000 00000000 00000000 00000000 
f7fad9f0:  00000000 00000000 00000000 00000000 
f7fada00:  00000000 00000000 00000000 00000000 
f7fada10:  00000000 00000000 00000000 00000000 
f7fada20:  00000000 00000000 00000000 00000000 
f7fada30:  00000000 00000000 00000000 00000000 
f7fada40:  00000000 00000000 00000000 00000000 
f7fada50:  00000000 00000000 00000000 f7d86204 
f7fada60:  c9806de0 f7d86200 f7fadaf4 __next_cpu+18 
f7fada70:  00000000 find_busiest_group+375 00000031 00000031 
f7fada80:  f7fadb58 00000000 c9807940 00000031 
f7fada90:  00000000 f7d86200 00000000 00000280 
f7fadaa0:  00000005 00000005 00000080 00000000 
f7fadab0:  00000000 00000000 00000000 00000002 
f7fadac0:  00000001 00000000 00000000 ffffffff 
f7fadad0:  00000000 00000000 ffffffff 00000000 
f7fadae0:  f7f50000 f4758700 c9803d00 f7f50000 
f7fadaf0:  f4758550 00000000 00400000 f741b040 
f7fadb00:  f7fadb68 schedule+2505 88b27e00 0000001a 
f7fadb10:  f7fadb48 00000031 00000031 00000001 
f7fadb20:  f4758550 init_task 88b58f5e 0000001a 
f7fadb30:  0003115e 00000000 f7f5010c c9806de0 
f7fadb40:  00000001 00000000 f7fadbf0 00000203 
f7fadb50:  ffffffff 00000000 00000000 7fffffff 
f7fadb60:  f7fadc48 remove_wait_queue+22 f7fadc44 f7fadbe0 
f7fadb70:  00000203 free_poll_entry+14 f7fadc44 poll_freewait+24 
f7fadb80:  00000001 f7cc35c0 00000008 00000008 
f7fadb90:  do_select+942 f7fadfa0 f7fadf4c 00000000 
f7fadba0:  00000008 f7fade5c f7fade60 f7fade64 
f7fadbb0:  f7fade50 f7fade54 f7fade58 000000b8 
f7fadbc0:  00000000 00000000 000000b8 00000010 
f7fadbd0:  00000000 00000000 00000001 00000082 
f7fadbe0:  __pollwait 00000000 00000000 f7f50000 
f7fadbf0:  00000000 00000003 fffd27ae 00000029 
f7fadc00:  00000050 00000100 00000029 f7f50000 
f7fadc10:  00000000 c9806de0 f7f50000 00000000 
f7fadc20:  f7fadc4c 00000000 kprobe_exceptions_nb f7fadc4c 
f7fadc30:  0000000c notifier_call_chain+25 00000021 f7fadc74 
f7fadc40:  00000021 00000021 do_nmi+163 f7fadc74 
f7fadc50:  __func__.18214+6516 00000021 do_IRQ+181 f7aebf9c 
f7fadc60:  f7a645dc f7aebf9c constraint_expr_eval+934 00000000 
f7fadc70:  f7aebf84 f7a645c4 00000000 00000001 
f7fadc80:  00000001 f7fadd34 f7fadd34 f7fadcc4 
f7fadc90:  c9908380 f7a446e0 f7fadd34 00000280 
f7fadca0:  context_struct_compute_av+548 c9908360 000728c0 f7aebf84 
f7fadcb0:  f7a645c4 f7cbf9a0 f7a32928 f7a312a8 
f7fadcc0:  00000700 025606db 00070007 f7fadd34 
f7fadcd0:  f7a645c4 00000246 f7fadd28 00100000 
f7fadce0:  f7fadd68 avc_alloc_node+22 00072622 0000005b 
f7fadcf0:  c96d3020 c96d3020 contig_page_data+9600 00000286 
f7fadd00:  00000286 contig_page_data+9472 00000000 f7fadd2c 
f7fadd10:  f7faddac 00000001 __pagevec_free+20 c96d3020 
f7fadd20:  contig_page_data+9472 release_pages+287 00000001 f7fadda8 
f7fadd30:  f7b299d0 00000000 00000000 f7fadd50 
f7fadd40:  0000000e f7fadda8 00000000 f7fadda8 
f7fadd50:  00000000 f7b299cc find_get_pages+37 0000000e 
f7fadd60:  00000000 f7fadda0 shmem_free_blocks+32 20080010 
f7fadd70:  f7b29924 f7b298c4 shmem_truncate_range+1555 00000000 
f7fadd80:  00000000 00000096 f7a312a8 00000000 
f7fadd90:  00000000 f7f50000 00000000 00000003 
f7fadda0:  fffd27af 00000000 kprobe_exceptions_nb f7faddcc 
f7faddb0:  0000000c notifier_call_chain+25 00000021 f7faddf4 
f7faddc0:  00000021 00000021 do_nmi+163 f7faddf4 
f7faddd0:  __func__.18214+6516 00000021 00000002 f7faddf4 
f7fadde0:  f7a446e0 f7a44880 f7fadeb4 00000700 
f7faddf0:  common_interrupt+26 f7a446e0 00000013 00000000 
f7fade00:  f7a44880 f7fadeb4 00000700 08000000 
f7fade10:  f7fa007b 0000007b ffffffff context_struct_compute_av+262 
f7fade20:  00000060 00000246 00041d25 f7a645c4 
f7fade30:  f7a645c4 c9aeeb20 f7a32928 f7a32928 
f7fade40:  00000253 06db0232 00070004 f7fadeb4 
f7fade50:  f7a645c4 00000056 00000056 security_compute_av+152 
f7fade60:  00200000 f7fadeb4 00040000 f7fadea8 
f7fade70:  00000001 00000000 avc_cache+2636 avc_has_perm_noaudit+282 
f7fade80:  00200000 f7fadeb4 00000056 00000056 
f7fade90:  00000004 0000010e 00000004 0004ffff 
f7fadea0:  00000000 f7fadee0 f7b298c4 f7b298c4 
f7fadeb0:  shmem_swp_entry+42 00000000 ffffffff 00000000 
f7fadec0:  ffffffff 00000001 00000003 f7a61ac0 
f7faded0:  00000001 f7ca1ac0 00000000 selinux_vm_enough_memory_mm+62 
f7fadee0:  00200000 00000000 f7d7ce40 f7b298c4 
f7fadef0:  f7b298dc shmem_getpage+732 f7d77890 00000000 
f7fadf00:  f7fadf6c 00000000 f7b29924 01a61ac0 
f7fadf10:  f7b299cc f7fadf30 file_has_perm+127 f7fadf30 
f7fadf20:  00000000 00000000 00000000 bffaff78 
f7fadf30:  00000000 shmem_file_write+291 00000003 00000000 
f7fadf40:  f7b29998 f7b29924 00000000 00000000 
f7fadf50:  00000000 00000004 48f31d25 3b35eb1e 
f7fadf60:  00000004 00000000 00000000 00000000 
f7fadf70:  f7b29924 f7f47ec0 shmem_file_write bffaff78 
f7fadf80:  00000004 vfs_write+161 f7fadfa4 f7f47ec0 
f7fadf90:  fffffff7 0911d410 f7fad000 sys_write+60 
f7fadfa0:  f7fadfa4 00000000 00000000 00000000 
f7fadfb0:  00000008 00000008 syscall_call+7 00000008 
f7fadfc0:  bffaff78 00000004 00000008 0911d410 
f7fadfd0:  bffaffa8 00000004 0000007b 0000007b 
f7fadfe0:  00000004 00b0e402 00000073 00000246 
f7fadff0:  bffafe44 0000007b 00000000 00000000 
crash> bt -T
PID: 330    TASK: f7f50000  CPU: 0   COMMAND: "udevd"
  [c075dbe4] notifier_call_chain at c0615ebb
  [c075dbf8] do_nmi at c04068a0
  [c075dc20] nmi_stack_correct at c0405b2e
  [c075dc4c] serial_in at c054f115
  [c075dc64] notifier_call_chain at c0615ebb
  [c075dc78] do_nmi at c04068a0
  [c075dca0] nmi_stack_correct at c0405b2e
  [c075dccc] serial_in at c054f115
  [c075dcd8] __delay at c04ea970
  [c075dcdc] serial8250_console_putchar at c05516e6
  [c075dcf0] uart_console_write at c054cb7b
  [c075dd18] serial8250_console_write at c0551035
  [c075dd24] __call_console_drivers at c0425177
  [c075dd4c] notifier_call_chain at c0615ebb
  [c075dd88] nmi_stack_correct at c0405b2e
  [c075ddb8] vsnprintf at c04ea4db
  [c075de40] __do_IRQ at c044e22c
  [c075de50] machine_kexec at c04199c5
  [c075de6c] relocate_kernel at c041a000
  [c075de94] crash_kexec at c0443d42
  [c075dea8] __do_IRQ at c044e22c
  [c075dec0] lkdtm_handler at f89b40a4
  [c075dedc] die at c04064cb
  [c075df04] do_invalid_op at c0406b98
  [c075df0c] do_invalid_op at c0406c29
  [c075df2c] lkdtm_handler at f89b40a4
  [c075df48] release_console_sem at c042544c
  [c075df70] __do_IRQ at c044e22c
  [c075df74] kprobe_exceptions_notify at c06153a9
  [c075df90] notifier_call_chain at c0615ebb
  [c075df9c] __do_IRQ at c044e22c
  [c075dfbc] error_code at c0405a89
  [c075dfd0] __do_IRQ at c044e22c
  [c075dfe8] lkdtm_handler at f89b40a4
  [c075dff8] jp_do_irq at f89b4148
  [c075dffc] do_IRQ at c04074ce
--- <hard IRQ> ---
bt: invalid stack address for this task: f89b4148
    (valid range: f7fad000 - f7fae000)
  [f7fada6c] __next_cpu at c04e6c9c
  [f7fada74] find_busiest_group at c041dfe3
  [f7fadb04] schedule at c0613405
  [f7fadb64] remove_wait_queue at c0435803
  [f7fadb74] free_poll_entry at c0483996
  [f7fadb7c] poll_freewait at c04839b6
  [f7fadb90] do_select at c0484115
  [f7fadbe0] __pollwait at c048465b
  [f7fadc34] notifier_call_chain at c0615ebb
  [f7fadc48] do_nmi at c04068a0
  [f7fadc58] do_IRQ at c04074ea
  [f7fadc68] constraint_expr_eval at c04ce7d1
  [f7fadca0] context_struct_compute_av at c04cea79
  [f7fadce4] avc_alloc_node at c04c2367
  [f7fadd18] __pagevec_free at c045baad
  [f7fadd24] release_pages at c045daf0
  [f7fadd58] find_get_pages at c045a011
  [f7fadd68] shmem_free_blocks at c046c925
  [f7fadd78] shmem_truncate_range at c046d0f3
  [f7faddb4] notifier_call_chain at c0615ebb
  [f7faddc8] do_nmi at c04068a0
  [f7faddf0] common_interrupt at c0405946
  [f7fade1c] context_struct_compute_av at c04ce95b
  [f7fade5c] security_compute_av at c04cf311
  [f7fade7c] avc_has_perm_noaudit at c04c25bb
  [f7fadeb0] shmem_swp_entry at c046ca12
  [f7fadedc] selinux_vm_enough_memory_mm at c04c33f5
  [f7fadef4] shmem_getpage at c046dca1
  [f7fadf18] file_has_perm at c04c38f7
  [f7fadf34] shmem_file_write at c046e793
  [f7fadf78] shmem_file_write at c046e670
  [f7fadf84] vfs_write at c0473c7b
  [f7fadf9c] sys_write at c047426d
  [f7fadfb8] syscall_call at c0404f17

Dave, please let me know if you think we need to create a new BZ for this case and make the vmcore available for you.

Comment 4 Qian Cai 2008-10-13 11:05:00 UTC
(In reply to comment #3)
> I don't know if this is related, but I have also seen crash fail to analyse a Xen
> Domain 0 kernel:
> 

Sorry, I meant that I have seen crash fail to analyse even a bare-metal kernel.

Comment 5 Dave Anderson 2008-10-13 14:48:37 UTC
> Dave, please let me know if you think we need to create a new BZ for this
> case and make the vmcore available for you.

Yes, please file a new bugzilla for the vmlinux-related bug.

Also, it appears that you are referencing 3 different vmcore files,
so please make them all available.

Also, when making xen hypervisor dumps available, please make all *three*
files available instead of only the two files you have been providing.  For
example:

>    KERNEL: /boot/xen-syms-2.6.18-118.el5
> DEBUGINFO: /usr/lib/debug/boot/xen-syms-2.6.18-118.el5.debug
>  DUMPFILE: /var/crash/127.0.0.1-2008-10-10-07:08:55/vmcore

Please copy the vmcore, xen-syms-2.6.18-118.el5 *and* the xen-syms-2.6.18-118.el5.debug to the saved directory.

Comment 6 Qian Cai 2008-10-13 15:15:24 UTC
(In reply to comment #5)
> Yes, please file a new bugzilla for the vmlinux-related bug.
> 
> Also, it appears that you are referencing 3 different vmcore files,
> so please make them all available.
> 
> Also, when making xen hypervisor dumps available, please make all *three*
> files available instead of only the two files you have been providing.  For
> example:
> 
> >    KERNEL: /boot/xen-syms-2.6.18-118.el5
> > DEBUGINFO: /usr/lib/debug/boot/xen-syms-2.6.18-118.el5.debug
> >  DUMPFILE: /var/crash/127.0.0.1-2008-10-10-07:08:55/vmcore
> 
> Please copy the vmcore, xen-syms-2.6.18-118.el5 *and* the
> xen-syms-2.6.18-118.el5.debug to the saved directory.

Done copying files. I'll file other bugs shortly.

Comment 7 Qian Cai 2008-10-13 15:22:54 UTC
(In reply to comment #5)
> Also, it appears that you are referencing 3 different vmcore files,
> so please make them all available.
> 

The one in comment #2 was generated by an automated test, and I have not yet found a permanent place to save the vmcores generated by those tests. In addition, I don't know how to reproduce that one, because it needs the CPUs to be running certain tasks at the time of the crash. However, I can re-create the other two vmcores reliably.

Comment 8 Dave Anderson 2008-10-14 12:35:05 UTC
Also, I am not interested in any crash dump that was
generated with a kprobe that in turn purposely generated
the crash.  The kprobe mechanism itself violates the
architecture's calling convention, and therefore the
crash utility backtrace code cannot follow the trace.

Please do not file bugs against the crash backtrace
capability when the crash was generated by a kprobe.
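
To illustrate why such a trace cannot be followed, here is a minimal
sketch of the kind of range check behind the "invalid stack address for
this task" message in comment #3.  The function names are hypothetical
and are not taken from the crash utility source; a real unwinder reads
each frame address from the previous frame, whereas this sketch takes
them as an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a crash utility function): true when addr
 * lies in the half-open stack range [base, top). */
bool in_stack_range(uint32_t addr, uint32_t base, uint32_t top)
{
    assert(base < top);  /* a stack range must be non-empty */
    return addr >= base && addr < top;
}

/* Hypothetical helper: count how many candidate frame addresses can be
 * followed before one falls outside the task's stack.  The addresses
 * are passed in as an array to keep the sketch self-contained. */
size_t count_followable_frames(const uint32_t *frames, size_t n,
                               uint32_t base, uint32_t top)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (!in_stack_range(frames[i], base, top))
            break;  /* the chain is broken; stop unwinding here */
    }
    return i;
}
```

With the valid range f7fad000 - f7fae000 reported in comment #3, the
address f89b4148 fails this check, so a walk that reaches it has to
stop at that point.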

Comment 9 Dave Anderson 2008-10-14 20:39:53 UTC
With respect to the "bt: invalid structure size: task_struct" seen on
the xen-syms vmcore analysis, it is clear that the backtrace error-handling
code is incorrectly using a vmlinux-related function that searches for
exception frames.  And that is because the backtrace runs into an
assembly-language entry point that the upstream maintainers had
apparently never seen.  However, I don't know whether it is being seen
in this case only because of the type of trap that was generated by the
bogus kprobe operation done on the dom0 kernel.

In any case, I have posted a suggested patch and a query to the upstream
maintainers of the xen hypervisor parts of the crash utility:

  [Crash-utility] Question re: xen hypervisor backtrace problem
  https://www.redhat.com/archives/crash-utility/2008-October/msg00063.html
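
The idea described above can be sketched as a small decision helper.
This is not the patch posted upstream: the enum and function names are
hypothetical and exist only for illustration.  The assumption it encodes
is the one stated above, namely that the exception-frame search depends
on vmlinux data such as the size of task_struct, which a xen-syms image
does not provide.

```c
#include <assert.h>

/* Hypothetical labels for what the debugger was pointed at. */
enum dump_target {
    TARGET_VMLINUX,         /* a Linux kernel image */
    TARGET_XEN_HYPERVISOR   /* a xen-syms hypervisor image */
};

/* Hypothetical labels for what to do after a failed backtrace. */
enum recovery_step {
    RECOVERY_EFRAME_SEARCH, /* look for exception frames on the stack */
    RECOVERY_STOP           /* end the backtrace without searching */
};

/* Hypothetical helper: choose the recovery step for a failed trace.
 * The eframe search is only attempted for a vmlinux target, because
 * it relies on Linux kernel structures such as task_struct. */
enum recovery_step pick_recovery_step(enum dump_target target)
{
    if (target == TARGET_XEN_HYPERVISOR)
        return RECOVERY_STOP;
    return RECOVERY_EFRAME_SEARCH;
}

/* Example calls documenting the intended behaviour. */
void recovery_step_examples(void)
{
    assert(pick_recovery_step(TARGET_XEN_HYPERVISOR) == RECOVERY_STOP);
    assert(pick_recovery_step(TARGET_VMLINUX) == RECOVERY_EFRAME_SEARCH);
}
```

Keeping the decision in one place means the "text symbols on stack"
listing can still be printed for a hypervisor dump, while the
vmlinux-only search is skipped.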

Comment 10 Dave Anderson 2008-10-15 12:30:35 UTC
The upstream maintainer has agreed with my fixes.  Here is his response,
which also contains my full initial query:

* From: Itsuro ODA <oda valinux co jp>
* To: Dave Anderson <anderson redhat com>
* Cc: crash-utility redhat com
* Subject: [Crash-utility] Re: Question re: xen hypervisor backtrace problem
* Date: Wed, 15 Oct 2008 08:22:39 +0900

Hi Dave,

> Do you agree with these changes?

Yes.

Thank you.
Itsuro Oda

On Tue, 14 Oct 2008 16:30:18 -0400 (EDT)
Dave Anderson <anderson redhat com> wrote:

> 
> Hello Oda-san,
> 
> I have a xen-syms vmcore that finds a path that the hypervisor-related
> changes in lkcd_x86_trace.c cannot handle.  When the back trace runs 
> into the "process_softirqs" text return address reference from 
> "xen/arch/x86/x86_32/entry.S", it cannot go any further.  Therefore 
> the backtrace fails, and in the recovery code it incorrectly searches 
> for a (vmlinux) eframe: 
> 
>   crash> bt -a
>   PCPU:  0  VCPU: ffbc7080
>   bt: cannot resolve stack trace:
>    #0 [ff1d3ebc] elf_core_save_regs at ff10a810
>    #1 [ff1d3ec4] common_interrupt at ff1222ed
>    #2 [ff1d3ed0] do_nmi at ff1335bb
>    #3 [ff1d3ef0] handle_nmi_mce at ff17442e
>    #4 [ff1d3f24] csched_tick at ff110aa7
>    #5 [ff1d3f80] timer_softirq_action at ff1155d2
>    #6 [ff1d3fa0] do_softirq at ff1143fe
>    #7 [ff1d3fb0] process_softirqs at ff173f61
>   bt: text symbols on stack:
>       [ff1d3ebc] disable_local_APIC at ff11db75
>       [ff1d3ec0] crash_nmi_callback at ff13cc96
>       [ff1d3ec4] common_interrupt at ff1222f2
>       [ff1d3ed0] do_nmi at ff1335c1
>       [ff1d3ef0] handle_nmi_mce at ff174435
>       [ff1d3f18] csched_tick at ff110aa7
>       [ff1d3f80] timer_softirq_action at ff1155d4
>       [ff1d3fa0] do_softirq at ff114405
>       [ff1d3fb0] process_softirqs at ff173f66
>   
>   bt: invalid structure size: task_struct
>       FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()
>   
>   [/usr/bin/crash] error trace: 816373b => 8164497 => 810c40c => 813ed94
>   
>     813ed94: SIZE_verify+126
>     810c40c: x86_eframe_search+1075
>     8164497: handle_trace_error+692
>     816373b: lkcd_x86_back_trace+2370
>   
>   bt: invalid structure size: task_struct
>       FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()
>   
>   crash> 
>   
> Now, the bogus vmlinux eframe search can be avoided by doing this in 
> handle_trace_error():
> 
> --- lkcd_x86_trace.c.orig       2008-10-14 15:46:33.000000000 -0400
> +++ lkcd_x86_trace.c    2008-10-14 16:09:26.000000000 -0400
> @@ -2440,12 +2441,14 @@ handle_trace_error(struct bt_info *bt, i
>          bt->flags |= BT_TEXT_SYMBOLS_PRINT|BT_ERROR_MASK;
>          back_trace(bt);
>  
> -        bt->flags = BT_EFRAME_COUNT;
> -        if ((cnt = machdep->eframe_search(bt))) {
> -               error(INFO, "possible exception frame%s:\n", 
> -                       cnt > 1 ? "s" : "");
> -               bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
> -               machdep->eframe_search(bt); 
> +       if (!XEN_HYPER_MODE()) {
> +               bt->flags = BT_EFRAME_COUNT;
> +               if ((cnt = machdep->eframe_search(bt))) {
> +                       error(INFO, "possible exception frame%s:\n", 
> +                               cnt > 1 ? "s" : "");
> +                       bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
> +                       machdep->eframe_search(bt); 
> +               }
>         }
>  }
> 
> After doing the above, the bt -a shows this, and therefore does 
> not fail prematurely:
>   
>   crash> bt -a
>   PCPU:  0  VCPU: ffbc7080
>   bt: cannot resolve stack trace:
>    #0 [ff1d3ebc] elf_core_save_regs at ff10a810
>    #1 [ff1d3ec4] common_interrupt at ff1222ed
>    #2 [ff1d3ed0] do_nmi at ff1335bb
>    #3 [ff1d3ef0] handle_nmi_mce at ff17442e
>    #4 [ff1d3f24] csched_tick at ff110aa7
>    #5 [ff1d3f80] timer_softirq_action at ff1155d2
>    #6 [ff1d3fa0] do_softirq at ff1143fe
>    #7 [ff1d3fb0] process_softirqs at ff173f61
>   bt: text symbols on stack:
>       [ff1d3ebc] disable_local_APIC at ff11db75
>       [ff1d3ec0] crash_nmi_callback at ff13cc96
>       [ff1d3ec4] common_interrupt at ff1222f2
>       [ff1d3ed0] do_nmi at ff1335c1
>       [ff1d3ef0] handle_nmi_mce at ff174435
>       [ff1d3f18] csched_tick at ff110aa7
>       [ff1d3f80] timer_softirq_action at ff1155d4
>       [ff1d3fa0] do_softirq at ff114405
>       [ff1d3fb0] process_softirqs at ff173f66
> 
>   PCPU:  1  VCPU: ff1b6080
>   ...
>   
> Carrying it one step further, and given that the relevant part 
> of the stack from above looks like this:
> 
>   crash> rd -s ff1d3ebc 84
>   ff1d3ebc:  disable_local_APIC+5 crash_nmi_callback+38 common_interrupt+82 cpu0_stack+16076 
>   ff1d3ecc:  0003d027 do_nmi+49 cpu0_stack+16120 00000000 
>   ff1d3edc:  ffbca000 ffbcbeb0 00000030 cpu0_stack+16308 
>   ff1d3eec:  0000e010 handle_nmi_mce+91 cpu0_stack+16120 00000100 
>   ff1d3efc:  00000005 000000ff 000005dc ffbdee88 
>   ff1d3f0c:  00000000 00000960 00020000 csched_tick+1239 
>   ff1d3f1c:  0000e008 00000083 ffbc7080 00000030 
>   ff1d3f2c:  0003d027 80000003 000583a8 per_cpu__schedule_data 
>   ff1d3f3c:  c840ceb2 00000000 ffbfda80 00000000 
>   ff1d3f4c:  00000000 00000000 00000100 00000960 
>   ff1d3f5c:  ffbdee80 00000246 000000ff csched_priv+4 
>   ff1d3f6c:  00000000 ffbfda8c __per_cpu_data_end+54972 e4c5d8d9 
>   ff1d3f7c:  0000008b timer_softirq_action+132 00000000 ffbc7080 
>   ff1d3f8c:  per_cpu__timers 00000000 cpu0_stack+16308 0000007b 
>   ff1d3f9c:  eaed7700 do_softirq+53 00000000 ffbc7080 
>   ff1d3fac:  0000007b process_softirqs+6 eb396d84 00000002 
>   ff1d3fbc:  c0678470 c0678470 00000002 eaed7700 
>   ff1d3fcc:  00000000 000d0000 c04011a7 00000061 
>   ff1d3fdc:  00000202 eb396d48 00000069 0000007b 
>   ff1d3fec:  0000007b 00000000 00000000 00000000 
>   ff1d3ffc:  ffbc7080 ffffffff ffffffff ffffffff
>   crash> 
>   
> Clearly "process_softirqs" is the last text return address
> reference that the backtrace code can work with.  So to try
> to clean up the backtrace, I added this:
> 
> --- lkcd_x86_trace.c.orig       2008-10-14 15:46:33.000000000 -0400
> +++ lkcd_x86_trace.c    2008-10-14 16:09:26.000000000 -0400
> @@ -1423,6 +1423,7 @@ find_trace(
>                 if (XEN_HYPER_MODE()) {
>                         func_name = kl_funcname(pc);
>                         if (STREQ(func_name, "idle_loop") || STREQ(func_name, "hypercall")
> +                               || STREQ(func_name, "process_softirqs")
>                                 || STREQ(func_name, "tracing_off")
>                                 || STREQ(func_name, "handle_exception")) {
>                                 UPDATE_FRAME(func_name, pc, 0, sp, bp, asp, 0, 0, bp - sp, 0);
> 
> which shows:
>   
>   crash> bt -a
>   PCPU:  0  VCPU: ffbc7080
>    #0 [ff1d3ebc] elf_core_save_regs at ff10a810
>    #1 [ff1d3ec4] common_interrupt at ff1222ed
>    #2 [ff1d3ed0] do_nmi at ff1335bb
>    #3 [ff1d3ef0] handle_nmi_mce at ff17442e
>    #4 [ff1d3f24] csched_tick at ff110aa7
>    #5 [ff1d3f80] timer_softirq_action at ff1155d2
>    #6 [ff1d3fa0] do_softirq at ff1143fe
>    #7 [ff1d3fb0] process_softirqs at ff173f61
>   
>   PCPU:  1  VCPU: ff1b6080
>   ...
>         
> Applying the second patch makes the first patch (which skips the eframe
> search) unnecessary for this vmcore, but it seems that it should be left
> in place for other unforeseen possibilities in the future.
> 
> Do you agree with these changes?
> 
> Thanks,
>   Dave
> 

-- 
Itsuro ODA <oda valinux co jp>

Comment 11 Dave Anderson 2008-10-15 12:50:34 UTC
With this patch, the backtrace terminates with no complaints:

--- lkcd_x86_trace.c.orig       2008-10-14 15:46:33.000000000 -0400
+++ lkcd_x86_trace.c    2008-10-14 16:09:26.000000000 -0400
@@ -1423,6 +1423,7 @@
                if (XEN_HYPER_MODE()) {
                        func_name = kl_funcname(pc);
                        if (STREQ(func_name, "idle_loop") || STREQ(func_name, "hypercall")
+                               || STREQ(func_name, "process_softirqs")
                                || STREQ(func_name, "tracing_off")
                                || STREQ(func_name, "handle_exception")) {
                                UPDATE_FRAME(func_name, pc, 0, sp, bp, asp, 0, 0, bp - sp, 0);
@@ -2440,12 +2441,14 @@
         bt->flags |= BT_TEXT_SYMBOLS_PRINT|BT_ERROR_MASK;
         back_trace(bt);
 
-        bt->flags = BT_EFRAME_COUNT;
-        if ((cnt = machdep->eframe_search(bt))) {
-               error(INFO, "possible exception frame%s:\n", 
-                       cnt > 1 ? "s" : "");
-               bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
-               machdep->eframe_search(bt); 
+       if (!XEN_HYPER_MODE()) {
+               bt->flags = BT_EFRAME_COUNT;
+               if ((cnt = machdep->eframe_search(bt))) {
+                       error(INFO, "possible exception frame%s:\n", 
+                               cnt > 1 ? "s" : "");
+                       bt->flags &= ~(ulonglong)BT_EFRAME_COUNT;
+                       machdep->eframe_search(bt); 
+               }
        }
 }
 

With the patch above:

crash> bt -a
PCPU:  0  VCPU: ffbc7080
 #0 [ff1d3ebc] elf_core_save_regs at ff10a810
 #1 [ff1d3ec4] common_interrupt at ff1222ed
 #2 [ff1d3ed0] do_nmi at ff1335bb
 #3 [ff1d3ef0] handle_nmi_mce at ff17442e
 #4 [ff1d3f24] csched_tick at ff110aa7
 #5 [ff1d3f80] timer_softirq_action at ff1155d2
 #6 [ff1d3fa0] do_softirq at ff1143fe
 #7 [ff1d3fb0] process_softirqs at ff173f61

PCPU:  1  VCPU: ff1b6080
 #0 [ff1bff40] elf_core_save_regs at ff10a810
 #1 [ff1bff44] crash_nmi_callback at ff13cc91
 #2 [ff1bff54] do_nmi at ff1335bb
 #3 [ff1bff74] handle_nmi_mce at ff17442e
 #4 [ff1bffa8] idle_loop at ff11f975

PCPU:  2  VCPU: ff1cc080
 #0 [ff1c7f40] elf_core_save_regs at ff10a810
 #1 [ff1c7f44] crash_nmi_callback at ff13cc91
 #2 [ff1c7f54] do_nmi at ff1335bb
 #3 [ff1c7f74] handle_nmi_mce at ff17442e
 #4 [ff1c7fa8] idle_loop at ff11f975

PCPU:  3  VCPU: ffbc4080
 #0 [ff1c3f78] elf_core_save_regs at ff10a810
 #1 [ff1c3f7c] crash_nmi_callback at ff13cc91
 #2 [ff1c3f8c] do_nmi at ff1335bb
 #3 [ff1c3fac] handle_nmi_mce at ff17442e

PCPU:  4  VCPU: ffbc3080
 #0 [ff23bedc] elf_core_save_regs at ff10a810
 #1 [ff23bee0] crash_nmi_callback at ff13cc91
 #2 [ff23bef0] do_nmi at ff1335bb
 #3 [ff23bf10] handle_nmi_mce at ff17442e
 #4 [ff23bf44] get_s_time at ff131cb4
 #5 [ff23bf70] reprogram_timer at ff11de46
 #6 [ff23bf80] timer_softirq_action at ff115619
 #7 [ff23bfa0] do_softirq at ff1143fe
 #8 [ff23bfb0] process_softirqs at ff173f61

PCPU:  5  VCPU: ff23d080
 #0 [ff237f40] elf_core_save_regs at ff10a810
 #1 [ff237f44] crash_nmi_callback at ff13cc91
 #2 [ff237f54] do_nmi at ff1335bb
 #3 [ff237f74] handle_nmi_mce at ff17442e
 #4 [ff237fa8] idle_loop at ff11f975

PCPU:  6  VCPU: ffbc1080
 #0 [ffbefee4] elf_core_save_regs at ff10a810
 #1 [ffbefee8] kexec_crash at ff10abe0
 #2 [ffbefef8] do_kexec_op at ff10adbf
 #3 [ffbeff98] hypercall at ff173efb

PCPU:  7  VCPU: ff231080
 #0 [ffbebf40] elf_core_save_regs at ff10a810
 #1 [ffbebf44] crash_nmi_callback at ff13cc91
 #2 [ffbebf54] do_nmi at ff1335bb
 #3 [ffbebf74] handle_nmi_mce at ff17442e
 #4 [ffbebfa8] idle_loop at ff11f975
crash>

Comment 12 Qian Cai 2008-10-22 08:14:14 UTC
(In reply to comment #9)
> With respect to the "bt: invalid structure size: task_struct" seen on
> the xen-syms vmcore analysis, it is clear that the backtrace error-handling
> code is incorrectly using a vmlinux-related function that searches for
> exception frames.  And that is because the backtrace runs into an
> assembly-language entry point that the upstream maintainers had
> apparently never seen.  However, I don't know whether the only reason it is
> being seen in this case is because of the type of trap that was generated
> by the bogus kprobe operation done on the dom0 kernel or not.  
> 

It is not. The same error happens even without using jprobe().

crash 4.0-7.2.3
Copyright (C) 2002, 2003, 2004, 2005, 2006, 2007, 2008  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
NOTE: stdin: not a tty

GNU gdb 6.1
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...

   KERNEL: /boot/xen-syms-2.6.18-120.el5
DEBUGINFO: /usr/lib/debug/boot/xen-syms-2.6.18-120.el5.debug
 DUMPFILE: /var/crash/127.0.0.1-2008-10-22-03:10:54/vmcore
     CPUS: 8
  DOMAINS: 4
   UPTIME: 00:01:33
  MACHINE: Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz  (3000 Mhz)
   MEMORY: 4 GB
  PCPU-ID: 0
     PCPU: ff1d3fb4
  VCPU-ID: 0
     VCPU: ffbd0080  (VCPU_RUNNING)
DOMAIN-ID: 0
   DOMAIN: ffbdc080  (DOMAIN_RUNNING)
    STATE: CRASH

crash> bt
PCPU:  0  VCPU: ffbd0080
 #0 [ff1d3ee4] elf_core_save_regs at ff10a810
 #1 [ff1d3ee8] kexec_crash at ff10abe0
 #2 [ff1d3ef8] do_kexec_op at ff10adbf
 #3 [ff1d3f98] hypercall at ff173efb
crash> bt -a
PCPU:  0  VCPU: ffbd0080
 #0 [ff1d3ee4] elf_core_save_regs at ff10a810
 #1 [ff1d3ee8] kexec_crash at ff10abe0
 #2 [ff1d3ef8] do_kexec_op at ff10adbf
 #3 [ff1d3f98] hypercall at ff173efb

PCPU:  1  VCPU: ff1b7080
 #0 [ffbf3f40] elf_core_save_regs at ff10a810
 #1 [ffbf3f44] crash_nmi_callback at ff13cc91
 #2 [ffbf3f54] do_nmi at ff1335bb
 #3 [ffbf3f74] handle_nmi_mce at ff17442e
 #4 [ffbf3fa8] idle_loop at ff11f975

PCPU:  2  VCPU: ff1cd080
 #0 [ff1bff40] elf_core_save_regs at ff10a810
 #1 [ff1bff44] crash_nmi_callback at ff13cc91
 #2 [ff1bff54] do_nmi at ff1335bb
 #3 [ff1bff74] handle_nmi_mce at ff17442e
 #4 [ff1bffa8] idle_loop at ff11f975

PCPU:  3  VCPU: ff1ba080
 #0 [ff1c7f40] elf_core_save_regs at ff10a810
 #1 [ff1c7f44] crash_nmi_callback at ff13cc91
 #2 [ff1c7f54] do_nmi at ff1335bb
 #3 [ff1c7f74] handle_nmi_mce at ff17442e
 #4 [ff1c7fa8] idle_loop at ff11f975

PCPU:  4  VCPU: ffbc8080
bt: cannot resolve stack trace:
 #0 [ff1c3eac] elf_core_save_regs at ff10a810
 #1 [ff1c3eb0] crash_nmi_callback at ff13cc91
 #2 [ff1c3ec0] do_nmi at ff1335bb
 #3 [ff1c3ee0] handle_nmi_mce at ff17442e
 #4 [ff1c3f14] read_platform_stime at ff131a08
 #5 [ff1c3f20] local_time_calibration at ff1325c2
 #6 [ff1c3f80] timer_softirq_action at ff1155d2
 #7 [ff1c3fa0] do_softirq at ff1143fe
 #8 [ff1c3fb0] process_softirqs at ff173f61
bt: text symbols on stack:

bt: invalid structure size: task_struct
    FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()

    [ff1c3eac] disable_local_APIC at ff11db75
    [ff1c3eb0] crash_nmi_callback at ff13cc96
    [ff1c3ec0] do_nmi at ff1335c1
    [ff1c3ee0] handle_nmi_mce at ff174435
    [ff1c3f08] read_platform_stime at ff131a08
    [ff1c3f20] local_time_calibration at ff1325c7
    [ff1c3f80] timer_softirq_action at ff1155d4
    [ff1c3fa0] do_softirq at ff114405
    [ff1c3fb0] process_softirqs at ff173f66
[/usr/bin/crash] error trace: 81637af => 816450b => 810c544 => 813eebc

  813eebc: SIZE_verify+126
  810c544: (undetermined)
  816450b: (undetermined)
  81637af: lkcd_x86_back_trace+2370

bt: invalid structure size: task_struct
    FILE: x86.c  LINE: 1576  FUNCTION: x86_eframe_search()

crash> bt -f
PCPU:  0  VCPU: ffbd0080
 #0 [ff1d3ee4] elf_core_save_regs at ff10a810
    [RA: ff10abe5  SP: ff1d3ee4  FP: ff1d3ee8  SIZE: 8]
    ff1d3ee4: 0240a498  ff10abe5  
 #1 [ff1d3ee8] kexec_crash at ff10abe0
    [RA: ff10adc4  SP: ff1d3eec  FP: ff1d3ef8  SIZE: 16]
    ff1d3eec: ff1dfe58  c0993e30  00000002  ff10adc4  
 #2 [ff1d3ef8] do_kexec_op at ff10adbf
    [RA: ff173f02  SP: ff1d3efc  FP: ff1d3f98  SIZE: 160]
    ff1d3efc: ff1d3f30  c0993e30  00000004  3a530020  
    ff1d3f0c: ffffffda  ffffffda  0000007b  ff1050ab  
    ff1d3f1c: 0000000d  c0993cd8  00000004  3063005d  
    ff1d3f2c: 30333939  00000001  3d6b7361  32313163  
    ff1d3f3c: 30303064  73617420  69742e6b  3930633d  
    ff1d3f4c: 30303339  00002930  ffbd0080  ffbdc080  
    ff1d3f5c: 00000000  0b80ffff  1da4c067  00000000  
    ff1d3f6c: ff1ae024  00000000  00000007  04a640db  
    ff1d3f7c: 0000000d  00000005  00000002  ffbd0080  
    ff1d3f8c: 0000007b  0000007b  00000000  ff173f02  
 #3 [ff1d3f98] hypercall at ff173efb
    [RA: 0  SP: ff1d3f9c  FP: ff1d3fd0  SIZE: 52]
    ff1d3f9c: 00000000  c0993e30  c75bec14  c0993f68  
    ff1d3fac: c0993e78  00000000  00000000  c0993e30  
    ff1d3fbc: c75bec14  c0993f68  c0993e78  00000000  
    ff1d3fcc: 00000025

Comment 13 Dave Anderson 2008-10-22 12:44:06 UTC
(In reply to comment #12)
> (In reply to comment #9)
> > With respect to the "bt: invalid structure size: task_struct" seen on
> > the xen-syms vmcore analysis, it is clear that the backtrace error-handling
> > code is incorrectly using a vmlinux-related function that searches for
> > exception frames.  And that is because the backtrace runs into an
> > assembly-language entry point that the upstream maintainers had
> > apparently never seen.  However, I don't know whether the only reason it is
> > being seen in this case is because of the type of trap that was generated
> > by the bogus kprobe operation done on the dom0 kernel or not.  
> > 
> 
> It is not. The same error happens even without using jprobe().

Right -- it has nothing to do with kprobes/jprobes, it's just a
matter of not "stopping" the hypervisor backtrace at the 
"process_softirqs" entry point.

This was the discussion on the crash utility mailing list:

  https://www.redhat.com/archives/crash-utility/2008-October/msg00063.html

Comment 14 Dave Anderson 2008-11-05 22:06:19 UTC
Cai,

Do you think that this BZ is necessary for RHEL5.3?  (i.e., requiring
a respin of the current crash package errata).

And if so, why is it any more important than these others that you filed?:

  https://bugzilla.redhat.com/show_bug.cgi?id=462819  (rhel5.4.0 +)
  https://bugzilla.redhat.com/show_bug.cgi?id=464116  (no flags)
  https://bugzilla.redhat.com/show_bug.cgi?id=464288  (no flags)
  https://bugzilla.redhat.com/show_bug.cgi?id=466797  (no flags)

Seems like they all could be deferred to RHEL5.4.

Thanks,
  Dave

Comment 15 Qian Cai 2008-11-06 01:12:56 UTC
OK. I am fine with deferring those to RHEL5.4, since they are either Xen Domain 0 related or look like corner cases. I'll adjust the flag.

Comment 19 Ruediger Landmann 2009-05-18 06:47:20 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
When run on a Xen hypervisor in which the backtrace leads to either "process_softirqs" or "page_fault", the "bt" command backtrace would indicate: "bt: cannot resolve stack trace". The recovery code would then terminate the command with the nonsensical error message: "bt: invalid structure size: task_struct". The command now properly terminates the backtrace.

Comment 20 Dave Anderson 2009-05-18 14:03:09 UTC
Release note looks fine.

Comment 22 errata-xmlrpc 2009-09-02 09:40:19 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1283.html

