Description of problem: Capture kernel seems usually Oops or reset just after INIT (there is only 1 in around 10 attempts got a vmcore). In addition, when booting into capture kernel, the remote serial console output has been scratched. I have only managed to get some data from log files later. Please see attachments. Version-Release number of selected component (if applicable): RHEL5.2-20080303.0 I have tried the following kexec-tools + kernel combination, 1.102pre-11.el5 + kexec-tools patch from BZ434927#28 + 2.6.18-84.el5 1.102pre-11.el5 + kexec-tools patch from BZ434927#28 + 2.6.18-53.el5 1.102pre-11.el5 + 2.6.18-84.el5 1.101-194.el5 + 2.6.18-84.el5 1.101-194.el5 + 2.6.18-53.el5 It was almost the same effect. How reproducible: Usually on intel-s6e5231-01.rhts.boston.redhat.com Steps to Reproduce: 1. configure kdump with crashkernel=512M@256M 2. SysRq-c
Created attachment 297029 [details] 1.102pre-11.el5 + kexec-tools patch from BZ434927#28 + 2.6.18-84.el5
Created attachment 297030 [details] 1.102pre-11.el5 + kexec-tools patch from BZ434927#28 + 2.6.18-53.el5
Created attachment 297032 [details] 1.102pre-11.el5 + 2.6.18-84.el5
Created attachment 297033 [details] 1.101-194.el5 + 2.6.18-84.el5
is this the only system this occurs on?
Yes.
ugh, Ok. I'm not sure what we're going to about this if its just on one system. I have an outstanding bug with an hp system that is unique in this fashion, except its reproducible consistently (you may have reported it in fact). That was about SAL checksumming with OS callouts. I wonder if something simmilar is going on here. I'll dig out the notes/patch I sent to HP. If it works we may just need to inform intel about the problem (as its looking like a sal firmware issue at the moment)
Created attachment 297218 [details] patch to set MCA SAL displatch length it was bz 277531 that I'm taking notes from here. I believe this was the patch that got me past the reset which led me to believe that there was a SAL issue, which doug is looking into. I may need to set the other handler lengths as well, but this will be a good starting confirmation. Cai, can you please test this out and see if it gets you any farther? Thanks!
Looks like it did not make any difference. The first time, the capture kernel failed after INIT, ip[3260]: Oops 11012296146944 The second time, the machine was dead like we have seen before. It neither responded to any ping nor reboot command from RHTS WebUI, and it never came back again. Output from the serial console was, Error: No response to keepalive - Terminating session Error: No response de-activating SOL payload <<<PAYLOAD LOST ... retrying in 30 secs>>> <<<PAYLOAD RESTART>>> Error: Unable to establish IPMI v2 / RMCP+ session Error: No response activating SOL payload <<<PAYLOAD LOST ... retrying in 10 secs>>> <<<PAYLOAD RESTART>>> ...
ok, I'll have doug try to comment on this as well, I'm really not sure what I can do in situations like this on ia64. Doug, any thouhgts you may have are appreciated.
Hi, Does the problem happen with RHEL 5.1? Is it a regression or just a new bug? Is there any upstream solution for this bug? Thanks, Luming
Q: Does the problem happen with RHEL 5.1? A: Yes, at least I have tried the the latest RHEL 5.2 tree but with RHEL 5.1 versions of kernel and kexec-tools packages. Q: Is it a regression or just a new bug? A: Looks like new.
what is intel-s6e5231-01? Is it intel tiger box? I'm pretty sure intel tiger box should have been tested with RHEL 5.1 on the kdump fetature, and got postive results..
I guess it is S6E5200 SDP Series (Intel S6E5200 Series Software Development Platform)
Is this bug relative to bz#434927: Zero-size /proc/vmcore after kdump? Sounds like there are at least two different box have kdump issue now...
I don't think so (at least not as far as I undersand it currently). Bz 434972 seems to be a problem relating to elf core header alignment on ia64. Apparently, upstream forced the alignment of the buffer that holds the vmcore header to be 4096 bytes to prevent it running of the edge of a grain of memory on ia64 if CONFIG_DISCONTIG is set. However, that same alignment change causes that buffer to get lost on reboot in the RHEL kernel. Given that the previous kexec tools worked fine with the previous alignment of 1024 bytes, I'm assuming that some change in the disconig memory setup in upstream kernel necessitated this, and I'm reverting the alignment in that bug once its approved. Conversely, this bug seems to be a reset (possibly during a SAL callout to the OS), when we're just starting the kdump process. Previously I've seen this problem on an isloated HP machine, and was able to correct it by setting the length of the registered callout function for MCA dispatch to something non-zero (the code commentatry apparently incorrectly indicating that zero was supposed to disable checksum verification). That fix however doesn't seem to work in this case however.
I have verfied rhel 5.2 kdump on one intel tiger box. Basically it works after downgrading kexec-tool package... So please describe the configuration details of the Intel S6E5200 box.., and post dmesg here. Thanks, Luming
The system is currently unavailable at our lab in U.S.. I'll let you know when I can access it again.
Is the information available?
Not yet. Looks like it only failed for this particular system though. There are no such problem on similar boxes here, S6E5132 - HITACHI Cold Fusion-3e 4S4U S6E4433 - HITACHI Cold Fusion-3e 2S4U
Based on the last few comments, I'm guessing this is still waiting on input from doug (or perhaps luming?)
(In reply to comment #21) > Based on the last few comments, I'm guessing this is still waiting on input from > doug (or perhaps luming?) Not waiting on me. From what I see this is all issues reported on an intel system not HP. - Doug
sorry, should beLuming, then.
please describe the configuration details of the Intel S6E5200 box.., and post dmesg, I don't see any kdump problem now on intel tiger box that I have access..
Arlinton, could you let us know when intel-s6e5231-01.rhts.boston.redhat.com can be available again? so we could generate sosreport of it. Thanks.
Any news on this bug?
Also please try crashkernel=512M@1G
OK. The machine is back. You can reserve it via RHTS, and I have tried crashkernel=512M@1G with the latest kexec-tools-1.102pre-16.el5 (fixed zero-size vmcore in ia64 bug). Unfortunately, it had the same problem on this box.
Created attachment 299297 [details] sosreport
Cai, are you sure you tried the latest kexec-tools and kernel? I just reserved the system, setup kernel 2.6.18-86.el5 and kexec-tools-1.102pre-16.el5, and have a core file saved on the system. I've got it reserved, so feel free to hop on. If you would please, repeat/confirm these findings, and if you do, I think we can close this as fixed currentrelease Note that the serial console on this system does seem to encounter some odd behavior. Specfically it keeps issuing screen clears or line feeds , which lead to us only seeing bits of the console output, but the core will be present on reset.
Neil, well, I believe that it is _possible_ to get a vmcore occasionally. I have tried it today with 10 attempts, that 3 of them failed. Looking through tty logging files, one of them seems was just after capture kernel booting in init stage, "Unable to handle kernel paging request at virtual address 5334642f5b795c73 exe[1081]: Oops 8813272891392 ... Kernel panic - not syncing: Fatal exception" while another was probably failed to boot capture kernel without any further output after sysrq-c. Importantly, I noticed that the system you tried was installed with RHEL5-U1-Server. However, the original bug report was using the latest RHEL-5-U2 tree. The failure rates looks much higher in the later distro, and I seldom successfully got a vmcore. I just got this a few minutes ago, ip[3317]: Oops 8813272891392 ... Kernel panic - not syncing: Fatal exception" I have reserved the machine with RHEL5.2-Server-20080326.nightly installed, so free feel to grab it.
hmm, no stack trace? that stinks. I'm not sure what to tell you. Given that this doesn't happen on any other platform, I'd strongly encourage intel to look into it. I'd jump on and look somemore. But I'm out of the office until next friday, so that may be your best shot regardless (at least until then). I know luming above said that he verified that it worked on tiger, which he believies is the system family this system in particular belongs to. Is it possible this one machine just has some sort of hardware issue? Random crashes like this might be the result of bad memory perhaps? is there an equivalent for ia64 of memtest86? That might actually be worth investigating.
The kernel panic stack trace is always something like, St^[[m^[[2J^[[H^[[m^[[2J^[[H^[[m^[[2J^[[HUnable to handle kernel paging request²mµ¥ÉÑmÕ)<85>±m!<85><91><91>É<95>Í͵½<91>Áɽ<89><95>mu=½ÁÍmu5½<91>Õ±<95>ͱ¥¹<95><91>¥¹Í¡Á<8d>¡Á¥<91><95><8d><91>mµm)m!<8d><91>ɽµ<91>µ}͹<85>ÁÍ¡½Ñ<91>µ}é<95>ɽ<91>µ}µ¥ÉɽÉ<91>µ}µ½<91><95>áÑ©<89><91><85>Ñ<85>}Á¥¥á±<89><85>Ñ<85>µÁÑm͵<85>mÍ)m!Í<8d>Í¥}ÑÉ<85>¹ÍÁ½ÉÑ}Í<85>͵ÁÑÍ<8d>Í¥¡Í<91>}µ½<91>Í<8d>Í¥}µ½<91>µÁÑ<89><85>Í<95>A¥<91>^MAU<8d>½µµµ½<91>ɽm<89>µ<95>m)m!ÁÍÉ<85>¥<99>Í¥Ám<95><99><99><85><95>uQ<85>¥¹Ñ<95><91>^] ¥Á¥Í<85>àððððàððÀððððððÀàÀüððÀüüÀðþðàÀððÀðððððÀ<80>ððàþÀðüðððàðð00^[[0m00^[[002J00^[0[H0000 pfs : 000000000000040e rsc : 0000000000000003^M rnat: 0000000000000000 bsps: 00000000000000^[0 [pm^[r [ 2J:^[ 0[0H00000000555a19^M ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f^M csd : 000000^[[0m00^[0[200J0^[[0H0 ssd : 0000000000000000^M b0 : e00000007ff3a970 b6 : e00000007fe1e940 b7 : a0000001002d6300 ^[[m^[[2J^[[Hf6 : 000000000000000000000 f7 : 1003e0000000000000003^M f8 : 100300^[0[m00^[0[20J00^[[0H001000 f9 : 1003e0000000000000000^M f10 : 1003e0000000000000003 f11 : 1003e0000000000001000^M r ^[ :[ m^[e[0020J^[00[H007f1b8000 r2 : 000000000000040e r3 : e000000016d97c60^M r8 : 0000000000009900 r9 : e0000000169^[7a[m4^[0 [r2J1^[0 [:H e000000016d979c8^M r11 : e000000016d97ac0 r12 : e000000016d979b0 r13 : e000000016d90000^M r14 : 0^[0[00m0^[[0020J^[0[00H4000 r15 : e000000016d97a40 r16 : 000000000000ff00^M r17 : e000000016d979e8 r18 : e000000016d972^[0 [mr^[19[ 2J:^[ e[H000000016d97a28^M r20 : e000000016d97a30 r21 : 0000000000009901 r22 : 0000000000009900^M r23 : a000^[0[m01^[0[20J9e^[1[H3e0 r24 : 0000000000010300 r25 : 0000000000009900^M r26 : e000000016d979b0 r27 : e000000016d979d0r2^[[8m :^[ [2eJ00^[0[H000016d979c8^M r29 : e000000016d979c0 r30 : e000000016d979b8 r31 : e000000016d979b0^M ^M Call Trace: ^[[m^[[2J^[[H [<a000000100013ae0>] show_stack+0x40/0xa0^M ^[ [ m^[ [ 2 J ^[ [ Hsp=e000000016d97540 bsp=e000000016d91838^M [<a0000001000143e0>] show_regs+0x840/0x880^M ^[ [ m^[ [ 2J ^[ [H sp=e000000016d97710 bsp=e000000016d917d8^M [<a000000100037bc0>] die+0x1c0/0x2c0^M ^[ [ m^[ [ 2J ^[ [H sp=e000000016d97710 bsp=e000000016d91790^M [<a0000001006361e0>] ia64_do_pagefa^[u[ml^[t+[02Jx8^[e[H0/0xa20^M sp=e000000016d97730 bsp=e000000016d91740^M [<a000000100^[0[c0m2^[[02>J] ^[[_H_ia64_leave_kernel+0x0/0x280^M sp=e000000016d977e0 bsp=e00000001d^[91[m74^[0[2^MJ ^[[H [<e00000007ff3a0e0>] 0xe00000007ff3a0e0^M sp=e000000016d979b0 bsp=0^[00[0m^[0[0021J^[6d[9H1710^M [<e00000007ff3a970>] 0xe00000007ff3a970^M sp=e000000016d99b^[[0m b^[[s2pJ=e^[0[H00000016d916d0^M [<e00000007ff3a970>] 0xe00000007ff3a970^M sp=e000^[00[m0^[16[d29J^[7[9bH0 bsp=e000000016d91690^M [<e00000007ff3a970>] 0xe00000007ff3a970^M ^[[ smp^[[=2e0J0^[[00H00016d979b0 bsp=e000000016d91650^M [<e00000007ff3a970>] 0xe00000007ff3a970^M Because of the serial console problem, the stack trace is unpleasant to look at. I may double-check if it is a hardware next week. Luming, the machine is finally back, and sosreport is there, do you have any insight about this problem?
I'm not aware of any significant things that could make this coldfuison box different with other coldfuions so far... you may need to disalbe "headless support" in the COM1 Console Redirection Menu in System steup...
Cai, I'm still on vacation. Can you try Lumings sugestion?
I am afraid disabling "headless support" does not make any difference.
Please try it.. this is the suggestion I got from intel support for the issue on the box.. If it doesn't work, and other coldfusion doesn't have same problem, please just let me know. Thanks, Luming
Yes, and comment #36 is based on what I have just tried. Arlinton also tried to switch between the physical serial port and the BMC directed serial, but without any luck.
Hi Luming Yu, what baud rate are you running your serial console at? We are running ours at 19200.
please open a premier support case. Please provide: BIOS version System serial number SEL dump (use selview.efi utility included on BIOS update image on Premier) Processor type and speed and cache size Product registration is at http://support.intel.com/support/go/s6e5200SDP (information is also on the support label on the systems top cover) Premier Product info is at: http://premier.intel.com under product S6E5200 Series SDP
based on comment# 20 and my testing, the problem doesn't happen on other architectually similar coldfusion box , Please follow the steps in comment# 40 to look for intel premier support to get firmware/parts update/replacement.., i.e. I don't think this is a kernel problem that need a patch to be chased down. reassigning the owner to me..
changing the resolution to "NOT A BUG", since it is _not_ a kernel bug.
please feel free to re-open the bug if the statement in comment# 42 could be wrong, or it could be perferable to have a patch to workaround/solve the problem in kernel.
Thanks Arlinton for updating BIOS on this box. I confirm that it solved the problem here.