Bug 515753

Summary:

kdump corefile cannot be backtraced in IA64

Product:

Red Hat Enterprise Linux 5

Reporter:

Jon Thomas <jthomas>

Component:

kernel

Assignee:

Takao Indoh <tindoh>

Status:

CLOSED ERRATA

QA Contact:

Red Hat Kernel QE team <kernel-qe>

Severity:

high

Docs Contact:

Priority:

urgent

Version:

5.4

CC:

anderson, daobrien, dhoward, dzickus, jolsa, jpirko, lwang, qcai, rlerch, tao

Target Milestone:

Keywords:

ZStream

Target Release:

5.5

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2010-03-30 06:59:39 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

499522, 525215, 527955, 533192, 541103, 542581, 542582

Attachments:

Description	Flags
Test program to reproduce	none

Description Jon Thomas 2009-08-05 15:54:22 UTC

Description of problem:

The dump taken by using kdump cannot be backtraced in RHEL5.4 GA Snapshot2.
This problem occurs in RHEL5.3 too.

This has some of the same signature noted in bz471136. But are also some differences in the bt, hence a new bug was opened. 


Version-Release number of selected component (if applicable):

Kernel Version: 2.6.18-157.el5
kexec-tools-1.102pre-76.el5
crash 4.0-8.9.1.el5


How reproducible:

When kdump works on disk I/O high load system, it dumps incorrect vmcore.
It occurs once every about 40 times.

Steps to Reproduce:
1) Load the disk I/O.
2) Collect dumps with the INIT button.
  
Actual results:

core can't be backtraced

Expected results:

core should be backtraced

Additional info:


crash> bt -a
PID: 0      TASK: e000000001010000  CPU: 0   COMMAND: "INIT 3043"
#0 [BSP:e000000001011378] kdump_init_notifier at a00000010005f680
#1 [BSP:e000000001011340] notifier_call_chain at a00000010066df10
#2 [BSP:e000000001011310] atomic_notifier_call_chain at a00000010066dfb0
#3 [BSP:e000000001011290] ia64_init_handler at a000000100048ef0
(INIT) INTERRUPTED TASK
PID: 3043   TASK: e00000012b8e0000  CPU: 0   COMMAND: "dd"
#0 [interrupted in user space]
 EFRAME: e000000001017e40
 [exception frame incomplete -- check salinfo for complete context]
     B0: 0000000000000000      CR_IIP: 0000000000000000
CR_IPSR: 0000101308526030      CR_IFS: 0000000000000000
 AR_PFS: c000000000000a9a      AR_RSC: 0000000000000000
AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
 AR_CCV: 0000000000000000     AR_FPSR: 0009804c8a70033f
 LOADRS: 0000000001f00000 AR_BSPSTORE: 600007ffffd18110
     B6: 200000000016ad20          B7: a000000000010640
     PR: 0000000000000000          R1: 0000000000000000
     R2: 0000000000000000          R3: 0000000000000000
     R8: 0000000000000000          R9: 0000000000000000
    R10: 0000000000000000         R11: 0000000000000000
    R12: 0000000000000000         R13: 0000000000000000
    R14: 0000000000000000         R15: 0000000000000000
    R16: 0000000000000000         R17: 0000000000000000
    R18: 0000000000000000         R19: 0000000000000000
    R20: 0000000000000000         R21: 0000000000000000
    R22: 0000000000000000         R23: 0000000000000000
    R24: 0000000000000000         R25: 0000000000000000
    R26: 0000000000000000         R27: 0000000000000000
    R28: 0000000000000000         R29: 0000000000000000
    R30: 0000000000000000         R31: 0000000000000000
     F6: 000000000000000000000     F7: 000000000000000000000
     F8: 000000000000000000000     F9: 000000000000000000000
    F10: 000000000000000000000    F11: 000000000000000000000

PID: 0      TASK: e00000010a6c8000  CPU: 1   COMMAND: "INIT 3046"
#0 [BSP:e00000010a6c92c8] machine_kexec at a00000010005a380
#1 [BSP:e00000010a6c92b0] machine_kdump_on_init at a00000010005f3c0
#2 [BSP:e00000010a6c9278] kdump_init_notifier at a00000010005f6c0
#3 [BSP:e00000010a6c9240] notifier_call_chain at a00000010066df10
#4 [BSP:e00000010a6c9210] atomic_notifier_call_chain at a00000010066dfb0
#5 [BSP:e00000010a6c9188] ia64_init_handler at a000000100049280
(INIT) INTERRUPTED TASK
PID: 3046   TASK: e000000129128000  CPU: 1   COMMAND: "dd"
#0 [BSP:e000000129129278] __ia64_leave_kernel at a00000010000bfe0
 EFRAME: e00000012912fc60
     B0: a000000100178240      CR_IIP: a000000100178290
CR_IPSR: 0000101008526030      CR_IFS: 800000000000050f
 AR_PFS: 000000000000050f      AR_RSC: 0000000000000003
AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
 AR_CCV: 0000000000000000     AR_FPSR: 0009804c0270033f
 LOADRS: 0000000000f00000 AR_BSPSTORE: e000000129129188
     B6: a000000100112da0          B7: a000000100112480
     PR: 0000000000569999          R1: a000000100c36560
     R2: e00000012912fc78          R3: e00000012912fc70
     R8: 0000000000000200          R9: 0000000027d0df00
    R10: 0000000000000000         R11: 0000000000560559
    R12: e00000012912fe20         R13: e000000129128000
    R14: fffffffffffffdef         R15: e00000012912fd50
    R16: 000000004a667ff6         R17: 0000000027d0df00
    R18: e000000110e088c8         R19: a000000100d21008
    R20: e00000010b0c4668         R21: 0000000000000010
    R22: 2ffc5adb40000000         R23: 000000004a667ff6
    R24: e00000012912fcb8         R25: 0000000000000200
    R26: e00000012912fd20         R27: 0000000000000000
    R28: e000000110e088d0         R29: 0000000000000002
    R30: 0000000000000010         R31: 0000000000004000
     F6: 000000000000000000000     F7: 000000000000000000000
     F8: 000000000000000000000     F9: 000000000000000000000
    F10: 000000000000000000000    F11: 000000000000000000000
#1 [BSP:e000000129129200] vfs_read at a000000100178290
#2 [BSP:e000000129129180] sys_read at a000000100178910
#3 [BSP:e000000129129180] __ia64_trace_syscall at a00000010000bd70
 EFRAME: e00000012912fe40
     B0: 4000000000003200      CR_IIP: a000000000010620
CR_IPSR: 0000121308526030      CR_IFS: 0000000000000003
 AR_PFS: c000000000000003      AR_RSC: 000000000000000f
AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
 AR_CCV: 0000000000000000     AR_FPSR: 0009804c8a70033f
 LOADRS: 0000000000e00000 AR_BSPSTORE: 600007ffffae0110
     B6: 2000000000211f40          B7: a000000000010640
     PR: 000000000056a241          R1: 2000000000280238
     R2: c000000000000a9a          R3: e00000012912fef8
     R8: 0000000000000001          R9: 2000000000303c88
    R10: 0000000000000000         R11: c00000000000038a
    R12: 60000fffffadb8a0         R13: 200000000034cfe0
    R14: 000000000000000f         R15: 0000000000000402
    R16: 000000000056a241         R17: 0000000000000000
    R18: 6000000000010010         R19: 6000000000010000
    R20: 6000000000008010         R21: 6000000000008000
    R22: 0000000000000200         R23: 600007ffffae0110
    R24: 0000000000000000         R25: 0000000000000000
    R26: c000000000000003         R27: 000000000000000f
    R28: a000000000010620         R29: 00001213085a6010
    R30: 60000000000081f4         R31: 6000000000010214
     F6: 000000000000000000000     F7: 000000000000000000000
     F8: 000000000000000000000     F9: 000000000000000000000
    F10: 000000000000000000000    F11: 000000000000000000000
#4 [BSP:e000000129129180] __kernel_syscall_via_break at a000000000010620

PID: 0      TASK: e00000010a718000  CPU: 2   COMMAND: "INIT swapper 2"
#0 [BSP:e00000010a719240] kdump_init_notifier at a00000010005f680
#1 [BSP:e00000010a719208] notifier_call_chain at a00000010066df10
#2 [BSP:e00000010a7191d0] atomic_notifier_call_chain at a00000010066dfb0
#3 [BSP:e00000010a719150] ia64_init_handler at a000000100048ef0
(INIT) INTERRUPTED TASK
PID: 0      TASK: e00000010a708000  CPU: 2   COMMAND: "swapper"
#0 [BSP:e00000010a709198] __ia64_leave_kernel at a00000010000bfe0
 EFRAME: e00000010a70fbd0
     B0: e00000007fa1e5f0      CR_IIP: a000000100664a30
CR_IPSR: 0000121008522030      CR_IFS: 0000000000000000
 AR_PFS: 0000000000000209      AR_RSC: 0000000000000000
AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
 AR_CCV: 0000000000000010     AR_FPSR: 0009804c8a70433f
 LOADRS: 0000000000b00000 AR_BSPSTORE: e00000010a7090e8
     B6: a000000100014b20          B7: e00000007fa1e560
     PR: 0bad0bad0bad5519          R1: a000000100c36560
     R2: e00000010a70fe30          R3: 0000000000000209
     R8: fffffffffffffffe          R9: 0000000000000000
    R10: 0000000000000000         R11: 0000000000000000
    R12: e00000010a70fd90         R13: e00000010a708000
    R14: 0000000000000001         R15: 0000101008522030
    R16: 0000000000000001         R17: 0bad0bad0bad5519
    R18: 0000000000000000         R19: 00000000000002ab
    R20: 0000101008520030         R21: a000000100011be0
    R22: 0000001008522030         R23: 0000000080000100
    R24: 0000000080000100         R25: 0003000000000000
    R26: ffffffffffff0028         R27: 0000000000000000
    R28: 0000000000000000         R29: 0000000000000000
    R30: 0000000000000000         R31: 0000000000000000
     F6: 0fffcfffffffff0000000     F7: 0ffebef3a800000000000
     F8: 1000fef3a800000000000     F9: 100018000000000000000
    F10: 1000def3a7ffff10c5800    F11: 1003e000000000000779d
#1 [BSP:e00000010a709198] schedule at a000000100664a30
#2 [BSP:e00000010a708f20] __gp at a000000100c36560
#3 [BSP:e00000010a708ca8] v at e00000007fa1e5f0
#4 [BSP:e00000010a708a30] v at e00000007fa1e5f0
#5 [BSP:e00000010a7087b0] v at e00000007fa1e5f0
#6 [BSP:e00000010a708538] v at e00000007fa1e5f0
#7 [BSP:e00000010a7082c0] v at e00000007fa1e5f0
#8 [BSP:e00000010a708048] v at e00000007fa1e5f0
bt: unwind: bsp (0xe00000010a707dc8) out of range [0xe00000010a708000-0xe0000001
0a709198]

PID: 0      TASK: e00000010a748000  CPU: 3   COMMAND: "INIT"
#0 [BSP:e00000010a749248] kdump_init_notifier at a00000010005f680
#1 [BSP:e00000010a749210] notifier_call_chain at a00000010066df10
#2 [BSP:e00000010a7491d8] atomic_notifier_call_chain at a00000010066dfb0
#3 [BSP:e00000010a749158] ia64_init_handler at a000000100048ef0
 EFRAME: e00000010a74fe40
     B0: 0000000000000000      CR_IIP: 0000000000000000
CR_IPSR: 0000121008520030      CR_IFS: 0000000000000000
 AR_PFS: 0000000000000710      AR_RSC: 0000000000000000
AR_UNAT: 0000000000000000     AR_RNAT: 0000000000000000
 AR_CCV: 0000000000000000     AR_FPSR: 0009804c0270033f
 LOADRS: 0000000000b80000 AR_BSPSTORE: e00000012a769110
     B6: 2000000000212040          B7: a000000000010640
     PR: 0000000000000000          R1: 0000000000000000
     R2: 0000000000000000          R3: 0000000000000000
     R8: 0000000000000000          R9: 0000000000000000
    R10: 0000000000000000         R11: 0000000000000000
    R12: 0000000000000000         R13: 0000000000000000
    R14: 0000000000000000         R15: 0000000000000000
    R16: 0000000000000000         R17: 0000000000000000
    R18: 0000000000000000         R19: 0000000000000000
    R20: 0000000000000000         R21: 0000000000000000
    R22: 0000000000000000         R23: 0000000000000000
    R24: 0000000000000000         R25: 0000000000000000
    R26: 0000000000000000         R27: 0000000000000000
    R28: 0000000000000000         R29: 0000000000000000
    R30: 0000000000000000         R31: 0000000000000000
     F6: 000000000000000000000     F7: 000000000000000000000
     F8: 000000000000000000000     F9: 000000000000000000000
    F10: 000000000000000000000    F11: 000000000000000000000
bt: unwind: failed to locate return link (ip=0x0)!

Comment 2 Takao Indoh 2009-08-05 19:25:15 UTC

Here is the detail informaiton about this problem.

I investigated this problem and I found a bug in
ia64_mca_modify_original_stack(arch/ia64/kernel/mca.c). If INIT is issued
while kernel is in fsys-mode, the register is not saved in the stack. So,
when crash executes bt command, crash cannot find registers in vmcore and
it fails. I have not found the solution of this problem yet, but at first
I have to start discussion in upstream.

The following is the result of my investigation.
------

I found the following message in the vmcore.

<6>Entered OS INIT handler. PSP=fff301a0 cpu=6 monarch=0.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=0 monarch=0.
<6>Entered OSINIT handler. PSP=fff301a0 cpu=3 monarch=0.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=7 monarch=0.
<6>Entered OS INIT handler.PSP=fff301a0 cpu=1 monarch=0.
<6>Entered OS INIThandler. PSP=fff301a0 cpu=4 monarch=0.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=5 monarch=0.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=2 monarch=1.
<6>cpu 2, INIT inconsistent previous current and r13, original stack not modified.

This is a message which is printed when INIT is issued. The last line
of this message means register was not saved before jumping to 2nd
kernel.

arch/ia64/kernel/mca.c:ia64_mca_modify_original_stack()
(snip)
        if (!mca_recover_range(ms->pmsa_iip)) {
                if (r13 != sos->prev_IA64_KR_CURRENT) {
                        msg = "inconsistent previous current and r13";
                        goto no_mod;
                }

This error message is displayed because there is conflict between
r13 and ar.k6 register. To dig more, I added some printk into kernel
source so that kernel can print the value of the register when this
problem happens. The result is as follows.

r13    = 2000000000350fe0
ar.k6  = e0000040e4148000
cr.iip = 0xa000000100010220

What we should note here is that cr.iip points 0xa000000100010220,
which is fsys_bubble_down+32. This means that kernel was in fsys-mode
when INIT is issued. In the fsys-mode, r13 is not always the same as
ar.k6. Therefore,

        if (!mca_recover_range(ms->pmsa_iip)) {
                if (r13 != sos->prev_IA64_KR_CURRENT) {
                        msg = "inconsistent previous current and r13";
                        goto no_mod;
                }

This is not correct. We have to change this like this:

        if (!mca_recover_range(ms->pmsa_iip)) {
+          if (not fsys_mode ?) {
                if (r13 != sos->prev_IA64_KR_CURRENT) {
                        msg = "inconsistent previous current and r13";
                        goto no_mod;
                }
+          } else {
+                   /* do something */
+          }

Comment 3 RHEL Program Management 2009-10-05 15:19:27 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Qian Cai 2009-10-14 14:54:25 UTC

Hello!

Two questions to understand this bug better if we can test this in-house.

* Is it always reproducible? The rate is always around one in 40 times?
  what was the tool to load disk IO?

* How difficult is to reproduce this on other IA-64 systems? I don't have
  physical access to any PRIMEQUEST system here.

If it turns out it is difficult to reproduce in-house. Can customer help verify it for RHEL5.5?

Thanks!
CAI Qian

Comment 5 Takao Indoh 2009-10-14 20:24:46 UTC

Hi Qian,

>* Is it always reproducible? The rate is always around one in 40 times?
> what was the tool to load disk IO?

Usually it can be reproduced within 50 times by just running many dd commands.

>* How difficult is to reproduce this on other IA-64 systems? I don't have
> physical access to any PRIMEQUEST system here.

You can reproduce this on any IA64 machine as well. This problem does not depend on the platform.

Thanks,
Takao

Comment 6 Takao Indoh 2009-10-14 20:33:29 UTC

Created attachment 364806 [details]
Test program to reproduce

Qian, please try this program. I think we can reproduce within a few times using this.

How to reproduce
1. Build this program
gcc -Wall -lrt dumptest.c
2. Run
./a.out
3. Send INIT to start kdump

Thanks,
Takao

Comment 7 Qian Cai 2009-10-14 23:48:07 UTC

Great! That is all the information need at this stage. Looks like reproducible. Thanks Takao!

Comment 12 Don Zickus 2009-12-02 21:02:50 UTC

in kernel-2.6.18-176.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 13 Don Zickus 2009-12-02 21:14:14 UTC

in kernel-2.6.18-176.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 23 errata-xmlrpc 2010-03-30 06:59:39 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html