Bug 515753
| Summary: | kdump corefile cannot be backtraced in IA64 | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Jon Thomas <jthomas> | ||||
| Component: | kernel | Assignee: | Takao Indoh <tindoh> | ||||
| Status: | CLOSED ERRATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 5.4 | CC: | anderson, daobrien, dhoward, dzickus, jolsa, jpirko, lwang, qcai, rlerch, tao | ||||
| Target Milestone: | rc | Keywords: | ZStream | ||||
| Target Release: | 5.5 | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2010-03-30 06:59:39 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | |||||||
| Bug Blocks: | 499522, 525215, 527955, 533192, 541103, 542581, 542582 | ||||||
| Attachments: |
Description
Jon Thomas
2009-08-05 15:54:22 UTC
Here is the detailed information about this problem.
I investigated this problem and found a bug in
ia64_mca_modify_original_stack() (arch/ia64/kernel/mca.c). If INIT is issued
while the kernel is in fsys-mode, the registers are not saved on the stack. So,
when crash executes the bt command, it cannot find the registers in the vmcore and
the backtrace fails. I have not found a solution to this problem yet, but first
I have to start a discussion upstream.
The following is the result of my investigation.
------
I found the following messages in the vmcore.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=6 monarch=0.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=0 monarch=0.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=3 monarch=0.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=7 monarch=0.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=1 monarch=0.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=4 monarch=0.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=5 monarch=0.
<6>Entered OS INIT handler. PSP=fff301a0 cpu=2 monarch=1.
<6>cpu 2, INIT inconsistent previous current and r13, original stack not modified.
These messages are printed when INIT is issued. The last line
means that the registers were not saved before jumping to the 2nd
kernel.
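When triaging a vmcore log like the one above, it can help to pull the CPU number and monarch flag out of each entry line. The following is only a minimal user-space sketch: the struct and function names are hypothetical, and the format string is simply copied from the log lines quoted in this report.

```c
#include <stdio.h>

/* Fields carried by one "Entered OS INIT handler" log line.
 * Hypothetical helper type, not part of the kernel or of crash. */
struct init_entry {
    unsigned long psp; /* PSP value, printed in hex */
    int cpu;           /* CPU that entered the handler */
    int monarch;       /* 1 for the monarch CPU, 0 otherwise */
};

/* Returns 1 and fills *out if the line is an INIT handler entry,
 * otherwise returns 0. */
static int parse_init_entry(const char *line, struct init_entry *out)
{
    int n = sscanf(line,
                   "<6>Entered OS INIT handler. PSP=%lx cpu=%d monarch=%d",
                   &out->psp, &out->cpu, &out->monarch);
    return n == 3;
}
```

Calling parse_init_entry() on each line of the log and keeping the entry whose monarch field is 1 identifies the CPU that the "original stack not modified" message refers to.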
arch/ia64/kernel/mca.c:ia64_mca_modify_original_stack()
(snip)
    if (!mca_recover_range(ms->pmsa_iip)) {
        if (r13 != sos->prev_IA64_KR_CURRENT) {
            msg = "inconsistent previous current and r13";
            goto no_mod;
        }
This error message is displayed because the r13 and ar.k6 registers
disagree. To dig further, I added some printk calls to the kernel
source so that the kernel prints the register values when this
problem happens. The result is as follows.
r13 = 2000000000350fe0
ar.k6 = e0000040e4148000
cr.iip = 0xa000000100010220
Note that cr.iip points to 0xa000000100010220,
which is fsys_bubble_down+32. This means that the kernel was in fsys-mode
when INIT was issued. In fsys-mode, r13 is not always the same as
ar.k6. Therefore,
    if (!mca_recover_range(ms->pmsa_iip)) {
        if (r13 != sos->prev_IA64_KR_CURRENT) {
            msg = "inconsistent previous current and r13";
            goto no_mod;
        }
This check is not correct in fsys-mode. We have to change it along these lines:
     if (!mca_recover_range(ms->pmsa_iip)) {
+        if (not fsys_mode ?) {
             if (r13 != sos->prev_IA64_KR_CURRENT) {
                 msg = "inconsistent previous current and r13";
                 goto no_mod;
             }
+        } else {
+            /* do something */
+        }
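To make the reasoning concrete, here is a small user-space model of the proposed branch. It is a sketch under stated assumptions, not the fix that was eventually shipped. It assumes the usual Linux/ia64 layout in which the top three bits of a virtual address select a region, with regions 0 to 4 used by user space and regions 5 to 7 by the kernel. The helper looks_like_fsys_mode() is a hypothetical stand-in for the open "not fsys_mode ?" question, and FSYS_TODO marks the branch that the report leaves undecided.

```c
#include <stdint.h>

/* An IA-64 virtual address carries its region number in bits 63..61. */
static unsigned region_of(uint64_t addr)
{
    return (unsigned)(addr >> 61);
}

/* Assumption: Linux/ia64 uses regions 5..7 for kernel addresses. */
static int is_kernel_address(uint64_t addr)
{
    return region_of(addr) >= 5;
}

enum stack_action { MODIFY_STACK, NO_MOD, FSYS_TODO };

/* Hypothetical heuristic: kernel code is executing (iip is a kernel
 * address) while r13 still holds a user-space value. */
static int looks_like_fsys_mode(uint64_t iip, uint64_t r13)
{
    return is_kernel_address(iip) && !is_kernel_address(r13);
}

/* Model of the proposed check: compare r13 with the saved ar.k6 value
 * only when the interrupted context does not look like fsys-mode. */
static enum stack_action check_r13(uint64_t iip, uint64_t r13,
                                   uint64_t kr_current)
{
    if (!looks_like_fsys_mode(iip, r13)) {
        if (r13 != kr_current)
            return NO_MOD;   /* "inconsistent previous current and r13" */
        return MODIFY_STACK;
    }
    return FSYS_TODO;        /* the "do something" branch, left open */
}
```

With the values printed above (r13 = 0x2000000000350fe0, ar.k6 = 0xe0000040e4148000, cr.iip = 0xa000000100010220), r13 falls in region 1 while cr.iip and ar.k6 fall in regions 5 and 7, so the model takes the fsys branch instead of going to no_mod.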
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Hello! Two questions to understand this bug better and to see if we can test this in-house.
* Is it always reproducible? Is the rate always around one in 40 times? What was the tool used to load disk IO?
* How difficult is it to reproduce this on other IA-64 systems? I don't have physical access to any PRIMEQUEST system here.
If it turns out to be difficult to reproduce in-house, can the customer help verify it for RHEL5.5?
Thanks!
CAI Qian

Hi Qian,
>* Is it always reproducible? Is the rate always around one in 40 times?
> What was the tool used to load disk IO?
Usually it can be reproduced within 50 attempts by just running many dd commands.
>* How difficult is it to reproduce this on other IA-64 systems? I don't have
> physical access to any PRIMEQUEST system here.
You can reproduce this on any IA64 machine as well. This problem does not depend on the platform.
Thanks,
Takao

Created attachment 364806
Test program to reproduce
Qian, please try this program. I think we can reproduce within a few times using this.
How to reproduce
1. Build this program
gcc -Wall -lrt dumptest.c
2. Run
./a.out
3. Send INIT to start kdump
Thanks,
Takao
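The attached dumptest.c is not reproduced in this report, so the following is not that program. As a rough illustration only, and assuming that the goal of a reproducer is to keep CPUs inside system call entry as often as possible so that an INIT is more likely to land there, a tight loop of cheap system calls might look like this. The function name is hypothetical; clock_gettime() is chosen only because the build line links with -lrt.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Calls clock_gettime() the requested number of times and returns how
 * many calls succeeded. Illustrative stand-in, not the attached test. */
static unsigned long hammer_clock_gettime(unsigned long iterations)
{
    struct timespec ts;
    unsigned long ok = 0;
    unsigned long i;

    for (i = 0; i < iterations; i++) {
        if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
            ok++;
    }
    return ok;
}
```

Running one such loop per CPU with a large iteration count while sending INIT would be one way to try this, though whether it matches the behaviour of the attached program is an assumption.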
Great! That is all the information needed at this stage. Looks reproducible. Thanks Takao!

in kernel-2.6.18-176.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2010-0178.html