Bug 436283 - [5.2][kdump] kdump not work on intel-s6e5231-01
Summary: [5.2][kdump] kdump not work on intel-s6e5231-01
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel   
Version: 5.2
Hardware: ia64 Linux
Target Milestone: rc
: ---
Assignee: Luming Yu
QA Contact: Martin Jenner
Depends On:
Reported: 2008-03-06 11:36 UTC by Qian Cai
Modified: 2013-08-06 01:43 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2008-04-07 03:41:41 UTC

Attachments
1.102pre-11.el5 + kexec-tools patch from BZ434927#28 + 2.6.18-84.el5 (22.06 KB, text/plain)
2008-03-06 11:36 UTC, Qian Cai
1.102pre-11.el5 + kexec-tools patch from BZ434927#28 + 2.6.18-53.el5 (7.94 KB, text/plain)
2008-03-06 11:38 UTC, Qian Cai
1.102pre-11.el5 + 2.6.18-84.el5 (1.30 KB, text/plain)
2008-03-06 11:39 UTC, Qian Cai
1.101-194.el5 + 2.6.18-84.el5 (6.76 KB, text/plain)
2008-03-06 11:40 UTC, Qian Cai
patch to set MCA SAL dispatch length (689 bytes, patch)
2008-03-07 17:12 UTC, Neil Horman
sosreport (2.15 MB, application/octet-stream)
2008-03-27 10:26 UTC, Qian Cai

Description Qian Cai 2008-03-06 11:36:46 UTC
Description of problem:
The capture kernel usually Oopses or resets just after INIT (only about 1 in
10 attempts produced a vmcore). In addition, when booting into the capture kernel,
the remote serial console output is garbled. I only managed to get
some data from log files later. Please see the attachments.

Version-Release number of selected component (if applicable):

I have tried the following kexec-tools + kernel combinations:

1.102pre-11.el5 + kexec-tools patch from BZ434927#28 + 2.6.18-84.el5
1.102pre-11.el5 + kexec-tools patch from BZ434927#28 + 2.6.18-53.el5
1.102pre-11.el5 + 2.6.18-84.el5
1.101-194.el5   + 2.6.18-84.el5
1.101-194.el5   + 2.6.18-53.el5

The result was almost the same in every case.

How reproducible:
Usually on intel-s6e5231-01.rhts.boston.redhat.com

Steps to Reproduce:
1. Configure kdump with crashkernel=512M@256M
2. Trigger a crash with SysRq-c
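
The crashkernel=512M@256M value in step 1 asks the kernel to reserve 512 MB for the capture kernel at a physical offset of 256 MB. As a rough illustration only, the sketch below reads such a value back from a kernel command line string. The helper name parse_crashkernel is hypothetical, and the code assumes the simple size[KMG][@offset[KMG]] form of the parameter.

```python
import re

# Multipliers for the K/M/G suffixes used in the simple crashkernel= syntax.
_UNITS = {"": 1, "K": 2**10, "M": 2**20, "G": 2**30}


def _to_bytes(text):
    """Convert a string such as '512M' to a byte count."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def parse_crashkernel(cmdline):
    """Return (size, offset) in bytes, or None if crashkernel= is absent.

    offset is None when the parameter has no '@offset' part.
    (Hypothetical helper; handles only the simple size@offset form.)
    """
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            value = token[len("crashkernel="):]
            size, _, offset = value.partition("@")
            return _to_bytes(size), (_to_bytes(offset) if offset else None)
    return None
```

On a running system the input would typically be the contents of /proc/cmdline, which makes it easy to confirm that the reservation requested in step 1 is actually in effect before triggering the crash.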

Comment 1 Qian Cai 2008-03-06 11:36:46 UTC
Created attachment 297029 [details]
1.102pre-11.el5 + kexec-tools patch from BZ434927#28 + 2.6.18-84.el5

Comment 2 Qian Cai 2008-03-06 11:38:08 UTC
Created attachment 297030 [details]
1.102pre-11.el5 + kexec-tools patch from BZ434927#28 + 2.6.18-53.el5

Comment 3 Qian Cai 2008-03-06 11:39:26 UTC
Created attachment 297032 [details]
1.102pre-11.el5 + 2.6.18-84.el5

Comment 4 Qian Cai 2008-03-06 11:40:17 UTC
Created attachment 297033 [details]
1.101-194.el5   + 2.6.18-84.el5

Comment 5 Neil Horman 2008-03-06 18:52:51 UTC
is this the only system this occurs on?

Comment 6 Qian Cai 2008-03-07 02:34:19 UTC

Comment 7 Neil Horman 2008-03-07 03:41:36 UTC
Ugh, OK.  I'm not sure what we're going to do about this if it's just on one system.
 I have an outstanding bug with an HP system that is unique in this fashion,
except it's reproducible consistently (you may have reported it, in fact).  That
was about SAL checksumming with OS callouts.  I wonder if something similar is
going on here.  I'll dig out the notes/patch I sent to HP.  If it works, we may
just need to inform Intel about the problem (as it's looking like a SAL firmware
issue at the moment).

Comment 8 Neil Horman 2008-03-07 17:12:38 UTC
Created attachment 297218 [details]
patch to set MCA SAL dispatch length

It was bz 277531 that I'm taking notes from here.  I believe this was the patch
that got me past the reset, which led me to believe that there was a SAL issue,
which Doug is looking into.  I may need to set the other handler lengths as
well, but this will be a good starting confirmation.  Cai, can you please test
this out and see if it gets you any further?  Thanks!

Comment 9 Qian Cai 2008-03-11 01:59:11 UTC
It looks like it did not make any difference. The first time, the capture kernel
failed after INIT:

ip[3260]: Oops 11012296146944

The second time, the machine was dead, as we have seen before. It responded
neither to ping nor to the reboot command from the RHTS WebUI, and it never came
back. The output from the serial console was:

Error: No response to keepalive - Terminating session
Error: No response de-activating SOL payload
<<<PAYLOAD LOST ... retrying in 30 secs>>>
Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload
<<<PAYLOAD LOST ... retrying in 10 secs>>>
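
The decimal number printed after "Oops" earlier in this comment is hard to read in that form. Assuming (this is not confirmed anywhere in this report) that it is a 64-bit status word printed in decimal, plausibly the ia64 interruption status register, viewing it in hex makes the set bits easier to see. The helper name below is hypothetical and the sketch does no decoding beyond splitting the value.

```python
def split_oops_code(value):
    """Return (hex string, upper 32 bits, lower 32 bits) of a 64-bit code.

    Hypothetical helper: it only reformats the number and makes no claim
    about what the individual bits mean.
    """
    upper = (value >> 32) & 0xFFFFFFFF
    lower = value & 0xFFFFFFFF
    return hex(value), upper, lower


if __name__ == "__main__":
    # Value taken from the "ip[3260]: Oops ..." line in this comment.
    print(split_oops_code(11012296146944))
```

For the value in this comment, only the upper 32 bits are set (0xa04) and the lower half is zero.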

Comment 10 Neil Horman 2008-03-11 10:34:31 UTC
OK, I'll have Doug try to comment on this as well. I'm really not sure what I
can do in situations like this on ia64.  Doug, any thoughts you may have are

Comment 11 Luming Yu 2008-03-12 09:06:47 UTC

Does the problem happen with RHEL 5.1?
Is it a regression or just a new bug?
Is there any upstream solution for this bug?


Comment 12 Qian Cai 2008-03-12 09:26:28 UTC
Q: Does the problem happen with RHEL 5.1?
A: Yes, at least I have tried the latest RHEL 5.2 tree but with the RHEL 5.1
versions of the kernel and kexec-tools packages.

Q: Is it a regression or just a new bug?
A: Looks like new.

Comment 13 Luming Yu 2008-03-13 05:25:54 UTC
What is intel-s6e5231-01? Is it an Intel Tiger box?
I'm pretty sure the Intel Tiger box should have been tested with RHEL 5.1 on the
kdump feature, and got positive results.

Comment 14 Qian Cai 2008-03-16 05:06:40 UTC
I guess it is S6E5200 SDP Series (Intel S6E5200 Series Software Development

Comment 15 Luming Yu 2008-03-17 03:17:51 UTC
Is this bug related to bz#434927:  Zero-size /proc/vmcore after kdump? 
It sounds like there are at least two different boxes with kdump issues now...

Comment 16 Neil Horman 2008-03-17 12:06:50 UTC
I don't think so (at least not as far as I understand it currently).  Bz 434927
seems to be a problem relating to ELF core header alignment on ia64. 
Apparently, upstream forced the alignment of the buffer that holds the vmcore
header to 4096 bytes to prevent it from running off the edge of a grain of memory
on ia64 if CONFIG_DISCONTIG is set.  However, that same alignment change causes
that buffer to get lost on reboot in the RHEL kernel.  Given that the previous
kexec-tools worked fine with the previous alignment of 1024 bytes, I'm assuming
that some change in the discontig memory setup in the upstream kernel necessitated
this, and I'm reverting the alignment in that bug once it's approved. 
Conversely, this bug seems to be a reset (possibly during a SAL callout to the
OS) when we're just starting the kdump process.  Previously I've seen this
problem on an isolated HP machine, and was able to correct it by setting the
length of the registered callout function for MCA dispatch to something non-zero
(the code commentary apparently incorrectly indicating that zero was supposed
to disable checksum verification).  That fix, however, doesn't seem to work in
this case.

Comment 17 Luming Yu 2008-03-18 04:02:16 UTC
I have verified RHEL 5.2 kdump on one Intel Tiger box. Basically it works after
downgrading the kexec-tools package.
So please describe the configuration details of the Intel S6E5200 box, and
post dmesg here.


Comment 18 Qian Cai 2008-03-18 04:27:18 UTC
The system is currently unavailable at our lab in the U.S. I'll let you know when I
can access it again.

Comment 19 Luming Yu 2008-03-20 03:19:16 UTC
Is the information available?

Comment 20 Qian Cai 2008-03-20 06:16:24 UTC
Not yet. It looks like it only fails on this particular system, though. There is
no such problem on similar boxes here:

S6E5132 - HITACHI Cold Fusion-3e 4S4U
S6E4433 - HITACHI Cold Fusion-3e 2S4U

Comment 21 Neil Horman 2008-03-20 11:39:46 UTC
Based on the last few comments, I'm guessing this is still waiting on input from
Doug (or perhaps Luming?)

Comment 22 Doug Chapman 2008-03-20 16:26:38 UTC
(In reply to comment #21)
> Based on the last few comments, I'm guessing this is still waiting on input from
> doug (or perhaps luming?)

Not waiting on me.  From what I see, these are all issues reported on an Intel
system, not HP.

- Doug

Comment 23 Neil Horman 2008-03-20 16:35:22 UTC
Sorry, should be Luming, then.

Comment 24 Luming Yu 2008-03-21 01:33:09 UTC
Please describe the configuration details of the Intel S6E5200 box, and
post dmesg. I don't see any kdump problem now on the Intel Tiger box that I have

Comment 25 Qian Cai 2008-03-21 02:14:16 UTC
Arlinton, could you let us know when intel-s6e5231-01.rhts.boston.redhat.com will
be available again, so we can generate an sosreport from it? Thanks.

Comment 26 Luming Yu 2008-03-25 02:03:41 UTC
Any news on this bug?

Comment 27 Luming Yu 2008-03-25 02:17:08 UTC
Also please try crashkernel=512M@1G

Comment 28 Qian Cai 2008-03-27 10:21:21 UTC
OK. The machine is back. You can reserve it via RHTS, and I have tried
crashkernel=512M@1G with the latest kexec-tools-1.102pre-16.el5 (fixed zero-size
vmcore in ia64 bug). Unfortunately, it had the same problem on this box.

Comment 29 Qian Cai 2008-03-27 10:26:02 UTC
Created attachment 299297 [details]
sosreport

Comment 30 Neil Horman 2008-03-27 15:20:51 UTC
Cai, are you sure you tried the latest kexec-tools and kernel?  I just reserved
the system, set up kernel 2.6.18-86.el5 and kexec-tools-1.102pre-16.el5, and have
a core file saved on the system.  I've got it reserved, so feel free to hop on.
 If you would, please repeat/confirm these findings, and if you do, I think we
can close this as fixed currentrelease.

Note that the serial console on this system does seem to encounter some odd
behavior.  Specifically, it keeps issuing screen clears or line feeds, which leads
to us only seeing bits of the console output, but the core will be present on reset.

Comment 31 Qian Cai 2008-03-28 04:44:47 UTC
Neil, well, I believe that it is _possible_ to get a vmcore occasionally. I
tried it today: of 10 attempts, 3 failed. Looking through the tty
log files, one failure seems to have occurred just after the capture kernel booted, in init:

"Unable to handle kernel paging request at virtual address 5334642f5b795c73
exe[1081]: Oops 8813272891392
Kernel panic - not syncing: Fatal exception"

while another probably failed to boot the capture kernel, with no further
output after SysRq-c.

Importantly, I noticed that the system you tried was installed with
RHEL5-U1-Server. However, the original bug report was using the latest RHEL-5-U2
tree. The failure rate looks much higher with the latter distro, and I seldom
got a vmcore. I just got this a few minutes ago:

ip[3317]: Oops 8813272891392
Kernel panic - not syncing: Fatal exception

I have reserved the machine with RHEL5.2-Server-20080326.nightly installed, so
feel free to grab it.

Comment 32 Neil Horman 2008-03-28 22:09:59 UTC
Hmm, no stack trace?  That stinks.

I'm not sure what to tell you.  Given that this doesn't happen on any other
platform, I'd strongly encourage Intel to look into it.  I'd jump on and look
some more, but I'm out of the office until next Friday, so that may be your best
shot regardless (at least until then).  I know Luming said above that he
verified that it worked on Tiger, which he believes is the system family this
particular system belongs to.  Is it possible this one machine just has some
sort of hardware issue?  Random crashes like this might be the result of bad
memory, perhaps?  Is there an equivalent of memtest86 for ia64?  That might
actually be worth investigating.

Comment 33 Qian Cai 2008-03-29 04:21:25 UTC
The kernel panic stack trace is always something like,

St^[[m^[[2J^[[H^[[m^[[2J^[[H^[[m^[[2J^[[HUnable to handle kernel paging
pfs : 000000000000040e rsc : 0000000000000003^M
rnat: 0000000000000000 bsps: 00000000000000^[0 [pm^[r [ 2J:^[ 0[0H00000000555a19^M
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f^M
csd : 000000^[[0m00^[0[200J0^[[0H0 ssd : 0000000000000000^M
b0  : e00000007ff3a970 b6  : e00000007fe1e940 b7  : a0000001002d6300
^[[m^[[2J^[[Hf6  : 000000000000000000000 f7  : 1003e0000000000000003^M
f8  : 100300^[0[m00^[0[20J00^[[0H001000 f9  : 1003e0000000000000000^M
f10 : 1003e0000000000000003 f11 : 1003e0000000000001000^M
r ^[ :[ m^[e[0020J^[00[H007f1b8000 r2  : 000000000000040e r3  : e000000016d97c60^M
r8  : 0000000000009900 r9  : e0000000169^[7a[m4^[0 [r2J1^[0 [:H e000000016d979c8^M
r11 : e000000016d97ac0 r12 : e000000016d979b0 r13 : e000000016d90000^M
r14 : 0^[0[00m0^[[0020J^[0[00H4000 r15 : e000000016d97a40 r16 : 000000000000ff00^M
r17 : e000000016d979e8 r18 : e000000016d972^[0 [mr^[19[ 2J:^[ e[H000000016d97a28^M
r20 : e000000016d97a30 r21 : 0000000000009901 r22 : 0000000000009900^M
r23 : a000^[0[m01^[0[20J9e^[1[H3e0 r24 : 0000000000010300 r25 : 0000000000009900^M
r26 : e000000016d979b0 r27 : e000000016d979d0r2^[[8m :^[ [2eJ00^[0[H000016d979c8^M
r29 : e000000016d979c0 r30 : e000000016d979b8 r31 : e000000016d979b0^M
Call Trace:
^[[m^[[2J^[[H [<a000000100013ae0>] show_stack+0x40/0xa0^M
                     ^[  [ m^[ [  2 J ^[ [ Hsp=e000000016d97540
 [<a0000001000143e0>] show_regs+0x840/0x880^M
          ^[ [ m^[  [ 2J  ^[ [H             sp=e000000016d97710
 [<a000000100037bc0>] die+0x1c0/0x2c0^M
   ^[  [ m^[  [ 2J ^[  [H                   sp=e000000016d97710
 [<a0000001006361e0>] ia64_do_pagefa^[u[ml^[t+[02Jx8^[e[H0/0xa20^M
                                sp=e000000016d97730 bsp=e000000016d91740^M
 [<a000000100^[0[c0m2^[[02>J] ^[[_H_ia64_leave_kernel+0x0/0x280^M
                                sp=e000000016d977e0 bsp=e00000001d^[91[m74^[0[2^MJ
^[[H [<e00000007ff3a0e0>] 0xe00000007ff3a0e0^M
 [<e00000007ff3a970>] 0xe00000007ff3a970^M
 [<e00000007ff3a970>] 0xe00000007ff3a970^M
 [<e00000007ff3a970>] 0xe00000007ff3a970^M
                              ^[[ smp^[[=2e0J0^[[00H00016d979b0
[<e00000007ff3a970>] 0xe00000007ff3a970^M

Because of the serial console problem, the stack trace is unpleasant to look at.
I may double-check whether it is a hardware problem next week. Luming, the machine is finally
back, and the sosreport is there. Do you have any insight into this problem?
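
Part of the noise in the trace above comes from intact screen-clear sequences (^[[m, ^[[2J, ^[[H) and ^M markers, which can be filtered out mechanically. The sketch below is a minimal illustration, assuming the log stores escape characters in caret notation exactly as pasted here; the function name is hypothetical. It cannot repair lines where the escape sequence is interleaved character by character with the register dump.

```python
import re

# Caret-notation escape sequences as they appear in the captured log,
# e.g. "^[[m", "^[[2J", "^[[H", plus "^M" for carriage return.
_CSI = re.compile(r"\^\[\[[0-9;]*[A-Za-z]")
_CR = re.compile(r"\^M")


def strip_console_noise(line):
    """Remove intact caret-notation CSI sequences and ^M markers from a line.

    Hypothetical helper: sequences that were split up and interleaved with
    other output are left untouched, because they no longer match.
    """
    return _CR.sub("", _CSI.sub("", line))
```

Applied line by line to the saved tty log, this leaves the call-trace entries such as show_stack+0x40/0xa0 readable, while the interleaved register lines still need to be read by hand.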

Comment 34 Luming Yu 2008-04-02 02:16:17 UTC
I'm not aware of anything significant that could make this Cold Fusion box
different from other Cold Fusion boxes so far.
You may need to disable "headless support" in the COM1 Console Redirection menu
in System Setup.

Comment 35 Neil Horman 2008-04-02 13:11:17 UTC
Cai, I'm still on vacation.  Can you try Luming's suggestion?

Comment 36 Qian Cai 2008-04-03 07:47:16 UTC
I am afraid disabling "headless support" does not make any difference.

Comment 37 Luming Yu 2008-04-03 08:08:15 UTC
Please try it. This is the suggestion I got from Intel support for the issue on
the box. If it doesn't work, and other Cold Fusion boxes don't have the same problem,
please just let me know. 


Comment 38 Qian Cai 2008-04-03 08:28:24 UTC
Yes, and comment #36 is based on what I have just tried. Arlinton also tried to
switch between the physical serial port and the BMC directed serial, but without
any luck.

Comment 39 Arlinton Bourne 2008-04-03 20:24:51 UTC
Hi Luming Yu, what baud rate are you running your serial console at? We are
running ours at 19200.

Comment 40 Luming Yu 2008-04-07 02:07:35 UTC
Please open a Premier support case. 

Please provide:
BIOS version
System serial number
SEL dump (use selview.efi utility included on BIOS update image on Premier)
Processor type and speed and cache size

Product registration is at http://support.intel.com/support/go/s6e5200SDP
(information is also on the support label on the system's top cover)
Premier Product info is at: http://premier.intel.com under product S6E5200
Series SDP 

Comment 41 Luming Yu 2008-04-07 03:41:41 UTC
Based on comment# 20 and my testing, the problem doesn't happen on other
architecturally similar Cold Fusion boxes. Please follow the steps in comment# 40
and contact Intel Premier support for a firmware/parts update or replacement,
i.e. I don't think this is a kernel problem that needs a patch to be chased down.

Reassigning the owner to me.

Comment 42 Luming Yu 2008-04-07 03:43:31 UTC
Changing the resolution to "NOT A BUG", since it is _not_ a kernel bug.

Comment 43 Luming Yu 2008-04-07 03:47:58 UTC
Please feel free to re-open the bug if the statement in comment# 42 turns out to be
wrong, or if it would be preferable to have a patch to work around or solve the problem
in the kernel.

Comment 44 Qian Cai 2008-04-10 09:27:28 UTC
Thanks Arlinton for updating BIOS on this box. I confirm that it solved the
problem here.
