Bug 462624 - [5.3][crash] bt command does not show interruption frame
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: crash
Version: 5.2
Hardware: i386 Linux
Priority: medium  Severity: medium
Target Milestone: rc
Assigned To: Dave Anderson
QA Contact: CAI Qian
Reported: 2008-09-17 12:04 EDT by Takao Indoh
Modified: 2010-01-28 05:10 EST (History)
Doc Type: Bug Fix
Last Closed: 2009-01-20 17:13:48 EST
Description Takao Indoh 2008-09-17 12:04:26 EDT
Description of problem:
The bt command of crash does not show the interruption frame.

crash> bt
PID: 5983   TASK: eb2a0aa0  CPU: 0   COMMAND: "dd"
 #0 [c075de8c] crash_kexec at c0443cb2
 #1 [c075ded4] __handle_sysrq at c053fe4f
 #2 [c075defc] handle_sysrq at c053fed4
 #3 [c075df04] kbd_event at c053adec
 #4 [c075df2c] input_event at c05987ee
 #5 [c075df48] atkbd_interrupt at c059c056
 #6 [c075df74] serio_interrupt at c0595685
 #7 [c075df94] i8042_interrupt at c05962e1
 #8 [c075dfcc] handle_IRQ_event at c044e153
 #9 [c075dfe4] __do_IRQ at c044e21b
crash>

The stack frames that follow __do_IRQ are not displayed.


Version-Release number of selected component (if applicable):
crash-4.0-5.0.3

How reproducible:
Always

Steps to Reproduce:
1. On a RHEL5.2 system, trigger kdump using the sysrq+c key
2. Open vmcore using crash
3. Issue bt command
  
Actual results:
The interruption frame is not displayed.

Expected results:
The interruption frame is displayed.
The following is the output of bt on RHEL4.

 #0 [c0403d84] disk_dump at f8a6c1b2
 #1 [c0403d88] printk at c0122b47
 #2 [c0403d94] freeze_other_cpus at f8a6bef5
(snip)
#12 [c0403f68] atkbd_report_key at c02760aa
#13 [c0403f78] atkbd_interrupt at c027656d
#14 [c0403fa4] serio_interrupt at c021d39b
#15 [c0403fc4] i8042_interrupt at c021d80f
#16 [c0403fdc] handle_IRQ_event at c01074d0
--- <hard IRQ> ---
 #0 [f7e04f88] do_IRQ at c0107916
 #1 [f7e04f84] common_interrupt at c02e0ed3
    EAX: 00000000  EBX: f7e04000  ECX: 00000000  EDX: 00000000  EBP: 00000000
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 00000000
    CS:  0060      EIP: c01040e8  ERR: ffffff01  EFLAGS: 00000246
 #2 [f7e04fb8] mwait_idle at c01040e8
 #3 [f7e04fc0] cpu_idle at c010409e


Additional info:
I confirmed that the same problem occurs on the following systems:

RHEL5.1
RHEL5.2
RHEL5.2 + kernel-2.6.18-115.el5 and crash-4.0-7.2
Comment 1 Takao Indoh 2008-09-17 12:16:57 EDT
I guess that the following part of restore_stack() in kernel.c causes this problem.

        case BT_HARDIRQ:
                bt->instptr = symbol_value("do_IRQ");
                bt->stkptr = ULONG(bt->stackbuf +
                        SIZE(irq_ctx) - (sizeof(unsigned int)*2));
                type = BT_HARDIRQ;

I think bt->stkptr is not correct.
I don't know the exact meaning of this code, but I think it calculates
the address of pt_regs in the IRQ stack (irqctx) and sets the stack
pointer to that address. In the 5.2 kernel, the address of pt_regs is passed
from do_IRQ to __do_IRQ in a register, so I think this calculation is not
correct for RHEL5.

Takao
Comment 2 Dave Anderson 2008-09-17 12:26:47 EDT
> I think bt->stkptr is not correct.

Yes, using "bt -t" I can see that the transition back to the process
stack is not working:

crash> bt -t
PID: 10524  TASK: ead2f550  CPU: 0   COMMAND: "dd"
      START: crash_kexec at c04440f2
  [c074bea8] startup_32 at c040007b
  [c074bed0] sysrq_handle_crashdump at c053a2f4
  [c074bed4] __handle_sysrq at c053a14d
  [c074befc] handle_sysrq at c053a1d5
  [c074bf04] kbd_event at c053567d
  [c074bf2c] input_event at c058e1c1
  [c074bf48] atkbd_interrupt at c0591a2b
  [c074bf74] serio_interrupt at c058b058
  [c074bf94] i8042_interrupt at c058bcb6
  [c074bfcc] handle_IRQ_event at c044dd9b
  [c074bfe4] __do_IRQ at c044de45
  [c074bffc] do_IRQ at c04073f4
--- <hard IRQ> ---
bt: invalid stack address for this task: bfcb06a8
    (valid range: e9852000 - e9853000)
      START: do_IRQ at c0407361
crash>

Hopefully there will be something in the hard IRQ stack that I
can use to find the path back...

Thanks,
  Dave
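
For illustration, the range check behind the "invalid stack address for
this task" message can be sketched as below. This is a minimal sketch, not
crash's actual code: the helper name in_stack_range is hypothetical, and the
half-open range [base, top) is an assumption about how the printed
"valid range" should be read.

```c
/*
 * Hypothetical helper (not part of crash): returns 1 when addr lies
 * inside the task's stack, treated here as the half-open range
 * [base, top), and 0 otherwise.
 */
static int in_stack_range(unsigned long addr,
                          unsigned long base,
                          unsigned long top)
{
        return addr >= base && addr < top;
}
```

With the values from the output above, in_stack_range(0xbfcb06a8,
0xe9852000, 0xe9853000) returns 0, which matches the error: the stack
pointer that bt tried to use lies outside the task's kernel stack.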
Comment 3 Dave Anderson 2008-09-17 15:37:25 EDT
Another question -- when you tested the RHEL5.1 and RHEL5.2 kernels,
were you interrupting a user-space process while it was running in
user space?  In the sample kernel-2.6.18-115.el5 case, the interrupted
"dd" task was running in user-space when the alt-sysrq-c was entered,
and so the faulty bt->stkptr is the user-space stack pointer value
when the alt-sysrq-c was entered.

In the RHEL4 example above, you interrupted the kernel while it was
operating/idling in kernel space, and that worked OK.

That's why I'm interested in whether the RHEL5.1 and RHEL5.2 kernels
were running in user space or kernel space when the alt-sysrq-c was entered.

For that matter, I'm wondering whether the same problem could occur
in RHEL4 if the alt-sysrq-c were entered while the interrupted process
was running in user-space?

We really need this test matrix, where your testing has confirmed
only 2 of the 6 possibilities (so far), because I don't know whether
your RHEL5.1 and RHEL5.2 tests interrupted user-space or kernel-space:

  RHEL4      user-space (?)      kernel-space (OK)
  RHEL5.[12] user-space (?)      kernel-space (?)
  RHEL5.3    user-space (FAIL)   kernel-space (?)

The RHEL5.[12]/RHEL5.3 differentiation applies because of the recent
linux-2.6-x86-execute-stack-overflow-warning-on-interrupt-stack.patch
that went into 2.6.18-108.el5, which modified the process-to-IRQ stack
transition.
Comment 4 Dave Anderson 2008-09-17 16:07:40 EDT
> I guess that the following part of restore_stack() in kernel.c causes this
> problem.
>
>        case BT_HARDIRQ:
>                bt->instptr = symbol_value("do_IRQ");
>                bt->stkptr = ULONG(bt->stackbuf +
>                        SIZE(irq_ctx) - (sizeof(unsigned int)*2));
>                type = BT_HARDIRQ;

Applying this patch to restore_stack():

--- kernel.c    2008-09-03 13:32:09.000000000 -0400
+++ kernel.c.test      2008-09-17 15:59:26.681603000 -0400
@@ -2104,8 +2104,8 @@
        {
        case BT_HARDIRQ:
                bt->instptr = symbol_value("do_IRQ");
-               bt->stkptr = ULONG(bt->stackbuf +
-                       SIZE(irq_ctx) - (sizeof(unsigned int)*2));
+               bt->stkptr = ULONG(bt->stackbuf +
+                       OFFSET(thread_info_previous_esp));
                type = BT_HARDIRQ;
                break;

makes the backtrace work on the supplied dumpfile:

  crash> bt
  PID: 10524  TASK: ead2f550  CPU: 0   COMMAND: "dd"
   #0 [c074be8c] crash_kexec at c04440f2
   #1 [c074bed4] __handle_sysrq at c053a14b
   #2 [c074befc] handle_sysrq at c053a1d0
   #3 [c074bf04] kbd_event at c0535678
   #4 [c074bf2c] input_event at c058e1be
   #5 [c074bf48] atkbd_interrupt at c0591a26
   #6 [c074bf74] serio_interrupt at c058b055
   #7 [c074bf94] i8042_interrupt at c058bcb1
   #8 [c074bfcc] handle_IRQ_event at c044dd99
   #9 [c074bfe4] __do_IRQ at c044de40
  --- <hard IRQ> ---
   #0 [e9852fac] do_IRQ at c0407361
   #1 [e9852fb8] common_interrupt at c0405929
      EAX: 00000001  EBX: 0805339c  ECX: 00000800  EDX: 00000000
      DS:  007b      ESI: 00000800  ES:  007b      EDI: bfcb061c
      SS:  007b      ESP: bfcb0610  EBP: bfcb06a8
      CS:  0073      EIP: 08049b46  ERR: fffffffe  EFLAGS: 00000282
  crash>

So using the IRQ context's thread_info.previous_esp works in this
case, and in fact is what restore_stack() always does for the soft
IRQ stack transition.

But I don't remember why I have always used passed parameter values for
hard IRQs instead of the thread_info.previous_esp value.

So for sanity's sake, the patch will require testing in all 6 test
cases above.
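
The difference between the two calculations in the patch can be sketched
on a mock copy of the IRQ stack. This is only an illustration under stated
assumptions: struct mock_thread_info, read_u32 and both stkptr_* helpers are
hypothetical names, the layout of the fields before previous_esp is a
placeholder, and the 4 KB stack size is assumed from the valid range printed
in comment #2. The real code uses crash's ULONG(), SIZE(irq_ctx) and
OFFSET(thread_info_previous_esp).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IRQ_STACK_SIZE 4096u    /* assumption: 4 KB irq_ctx */

/* Hypothetical, simplified stand-in for the thread_info that sits at
 * the base of the IRQ stack; only previous_esp matters here. */
struct mock_thread_info {
        uint32_t other_fields[4];       /* placeholder layout */
        uint32_t previous_esp;          /* saved process stack pointer */
};

/* Read a 32-bit word from a copied stack buffer, similar in spirit to
 * what ULONG() does with bt->stackbuf on a 32-bit dump. */
static uint32_t read_u32(const unsigned char *buf, size_t off)
{
        uint32_t v;
        memcpy(&v, buf + off, sizeof v);
        return v;
}

/* Original calculation: the word two unsigned ints below the top of
 * the IRQ stack. */
static uint32_t stkptr_from_top(const unsigned char *stackbuf)
{
        return read_u32(stackbuf,
                        IRQ_STACK_SIZE - 2 * sizeof(unsigned int));
}

/* Patched calculation: thread_info.previous_esp at the stack base. */
static uint32_t stkptr_from_previous_esp(const unsigned char *stackbuf)
{
        return read_u32(stackbuf,
                        offsetof(struct mock_thread_info, previous_esp));
}
```

If the buffer is filled so that previous_esp holds a process-stack address
such as e9852fac while the word near the top holds bfcb06a8, the first
helper returns the out-of-range value and the second returns the usable one.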
Comment 5 Takao Indoh 2008-09-17 16:30:20 EDT
>Another question -- when you tested the RHEL5.1 and RHEL5.2 kernels,
>were you interrupting a user-space process while it was running in
>user space?  In the sample kernel-2.6.18-115.el5 case, the interrupted
>"dd" task was running in user-space when the alt-sysrq-c was entered,
>and so the faulty bt->stkptr is the user-space stack pointer value
>when the alt-sysrq-c was entered.

Yes, I was running scripts when I pressed alt-sysrq-c.

>We really need this test matrix, where your testing has confirmed
>only 2 of the 6 possibilities (so far), because I don't know whether
>your RHEL5.1 and RHEL5.2 tests interrupted user-space or kernel-space:

Ok, I'll check all these test items of the matrix using your patch and
report the result to you.
Please wait...
Comment 6 Takao Indoh 2008-09-18 15:50:14 EDT
I confirmed that the patch in comment #4 works well.
The following are the test results.

Test Items
-------------------------
[A]RHEL4.6
A-1. user-space
A-2. kernel-space

[B]RHEL5.2
B-1. user-space
B-2. kernel-space

[C]RHEL5.3(kernel-2.6.18-115.el5)
C-1. user-space
C-2. kernel-space


Test Result
-------------------------
case 1: without patch

A-1  OK
A-2  OK
B-1  NG
B-2  NG
C-1  NG
C-2  NG

case 2: with patch

A-1  OK
A-2  OK
B-1  OK
B-2  OK
C-1  OK
C-2  OK
Comment 7 Dave Anderson 2008-09-18 16:04:46 EDT
Takao,

Thank you very much for carrying out all of the test possibilities
in the matrix.  I'd like to see if I can get this bugzilla approved
for the in-progress RHEL5.3 errata:

  RHBA-2009:8121-03 - crash bug fix update
  http://errata.devel.redhat.com/errata/info/7813

Thanks again,
  Dave
Comment 9 RHEL Product and Program Management 2008-09-18 16:21:50 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 10 Dave Anderson 2008-09-18 16:38:15 EDT
Linda, can you set devel_ack+ on this one?  

(I don't have the permissions to do it)

Thanks,
  Dave
Comment 15 errata-xmlrpc 2009-01-20 17:13:48 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0240.html
