Created attachment 313546 [details]
dmesg

Description of problem:
For Sun Fire X4600 M2 machines (sun-x4600-01.rhts.bos.redhat.com), Kdump failed to capture a vmcore, because the capture Kernel reset to BIOS when copying vmcore to disk. The size of vmcore is around 30G. The disk device of it was MPT SAS,

07:04.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064 PCI-X Fusion-MPT SAS (rev 02)

Version-Release number of selected component (if applicable):
kernel-2.6.18-92.el5.x86_64
kexec-tools-1.102pre-21.el5.x86_64

How reproducible:
always

Steps to Reproduce:
1. configure Kdump with 128M@16M.
2. SysRq-C

Actual results:
Either no vmcore or vmcore-incomplete

Expected results:
vmcore around 30G in size

Additional info:
It could also be reproduced this way,
1. configure Kdump with 128M@16M.
2. use an empty Kdump configuration file, so it will mount rootfs and run init.
3. disable Kdump service (chkconfig kdump off), so we could login within the capture Kernel.
4. SysRq-C
5. after init finished, login and run the following commands,

dd if=/dev/zero of=vmcore bs=3M count=15000

Then, the machine will be reset in the middle of copying.
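For reference, the numbers in the reproduction steps can be worked out with a small sketch. It assumes the usual size@offset reading of the 128M@16M reservation and dd's binary meaning of the M suffix; the helper names are made up for illustration and are not part of kexec-tools.

```python
import re

# Binary multipliers, matching how dd interprets the K/M/G suffixes.
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a size such as '128M' or '3M' into bytes."""
    match = re.fullmatch(r"(\d+)([KMG])", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def crashkernel_region(spec):
    """Return (start, end) in bytes for a 'size@offset' spec; end is exclusive."""
    size_text, offset_text = spec.split("@")
    start = parse_size(offset_text)
    return start, start + parse_size(size_text)


def dd_total_bytes(bs, count):
    """Total bytes a dd run with the given bs and count would transfer."""
    return parse_size(bs) * count


start, end = crashkernel_region("128M@16M")
print(f"reserved region: {start >> 20} MiB up to {end >> 20} MiB")

total = dd_total_bytes("3M", 15000)
print(f"dd would write {total} bytes ({total / (1 << 30):.1f} GiB)")
```

So the dd test asks for more data than the roughly 30G vmcore, which is consistent with the reset showing up part-way through the copy rather than at the end.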
As we previously discussed, will it reset if you just leave it sitting there? What if you idle it in the initramfs using a kdump_pre script?
Somehow, the kdump_pre script did not work for me, but within the capture Kernel, I have tried leaving it idle for a period of time, echoing messages in a loop, and reading through all entries except vmcore in the /proc directory in a loop, and have not seen the reset. However, those cp and dd commands caused the reset almost immediately.
How did kdump_pre not work for you? I don't know that I'm going to be able to do much about this if we're getting a hard reset without any indication as to why it occurred. Can you attach the serial console log from the system as it boots the kdump kernel? I might be able to note a discrepancy with the dmesg log above. Thanks.
It happened when reading or writing huge files in the capture Kernel like, dd if=/dev/zero of=vmcore bs=3M count=15000 I don't have Kdump Kernel boot logs on hand at the moment, but the machine is in RHTS, so feel free to reserve it.
ahh, so it doesn't just happen when reading /proc/vmcore then? You can dd from any source and the problem will present on this system?
Yes, that is correct. I have had Kdump boot logs, and dmidecode output this time. Interesting bits in Kdump boot logs are,

DMI 2.3 present.
>>> ERROR: Invalid checksum
ACPI: 2 duplicate SRAT table ignored.
SRAT: PXM 0 Console lost, exiting

Looks like the BIOS is a little bit dated,

Version: 080012
Release Date: 04/19/2007

I have also had the machine reserved for the next few hours.
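To compare boot logs across machines or BIOS versions, the firmware-table complaints can be pulled out mechanically. This is only a sketch: the two patterns are taken from the messages quoted in this report, the function name is hypothetical, and the list would need extending for other warnings.

```python
import re

# Patterns copied from the messages quoted in this bug; not an exhaustive list.
PATTERNS = [
    re.compile(r"Invalid checksum"),
    re.compile(r"duplicate SRAT table ignored"),
]


def firmware_warnings(log_text):
    """Return (line_number, text) for each log line matching a known pattern."""
    hits = []
    for number, raw in enumerate(log_text.splitlines(), start=1):
        # Serial console captures sometimes carry a literal ^M; drop it.
        line = raw.replace("^M", "").strip()
        if any(pattern.search(line) for pattern in PATTERNS):
            hits.append((number, line))
    return hits


sample = (
    "DMI 2.3 present.^M\n"
    ">>> ERROR: Invalid checksum^M\n"
    "ACPI: 2 duplicate SRAT table ignored.^M\n"
)
for number, line in firmware_warnings(sample):
    print(f"{number}: {line}")
```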
Created attachment 313696 [details] Kdump Kernel boot logs
Created attachment 313697 [details] dmidecode output
It does look like the BIOS could use an update, given that the kdump kernel boot seems to have an issue parsing the DMI and SRAT tables. I imagine that's what's leading to the difficulty in starting the cpufreq governor service later on. There have apparently been a few BIOS updates since the one you reference: http://www.sun.com/servers/x64/x4600/downloads.jsp#M2 It would likely be worth installing the latest just to see if the issue goes away. I'm also concerned that we're still seeing 16 cpu cores (as would appear evident from the fact that cpuspeed sees 16 cpu subdirectories under /sys), despite the fact that we booted with maxcpus=1. It seems that not being able to properly parse the SRAT and DMI tables is leading to us starting all the cpus, which can lead to various problems including resets. Are you able to update the BIOS on this system?
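The maxcpus concern can be checked with something along these lines. It is a rough sketch, assuming the command line comes from /proc/cmdline and the directory names from /sys/devices/system/cpu; the function names are hypothetical, and a cpuN directory existing does not by itself prove that the CPU was brought online.

```python
import re


def maxcpus_from_cmdline(cmdline):
    """Return the maxcpus= value from a kernel command line, or None if absent."""
    match = re.search(r"(?:^|\s)maxcpus=(\d+)(?:\s|$)", cmdline)
    return int(match.group(1)) if match else None


def count_cpu_dirs(names):
    """Count entries named cpu0, cpu1, ... and ignore others such as cpufreq."""
    return sum(1 for name in names if re.fullmatch(r"cpu\d+", name))


def exceeds_limit(cmdline, names):
    """True when more cpuN directories are visible than maxcpus= allows."""
    limit = maxcpus_from_cmdline(cmdline)
    return limit is not None and count_cpu_dirs(names) > limit


# Illustrative inputs only. On a live capture kernel these would come from
# open("/proc/cmdline").read() and os.listdir("/sys/devices/system/cpu").
sample_cmdline = "ro root=/dev/VolGroup00/LogVol00 irqpoll maxcpus=1"
sample_dirs = [f"cpu{n}" for n in range(16)]
print(exceeds_limit(sample_cmdline, sample_dirs))
```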
Arlinton, is it possible to update sun-x4600-01.rhts.bos.redhat.com to the latest BIOS? We suspect Kdump is not working on this system because of dated firmware. Thanks in advance.
I have also tried Kdump on a similar model, sun-x4200. Although it had the same error messages in the Kdump Kernel boot logs,

DMI 2.3 present.
>>> ERROR: Invalid checksum
ACPI: 2 duplicate SRAT table ignored.
SRAT: PXM 0 Console lost, exiting

Kdump could save a complete vmcore on that box despite some EDAC errors while copying it,

EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: processor context corrupt

In addition, it survived the above dd commands.
any word on the firmware update here?
The BIOS update is complete; it's running the latest version available from SUN: http://www.sun.com/servers/x64/x4600/downloads.jsp Version 2.1 of the software update package.
Thanks Arlinton! However, the machine still reset during copying of vmcore in the Kdump Kernel (although it seems it could survive the above dd commands).
Neil, the machine has been reserved for the next few days in case you would like to have a look.
looks like we lost it, I've put in a request to reserve it again. I'll look when I get it
Note to self: Looks like the reset happens when we access a specific region of the vmcore. About 132MB into the file, or a dd of the 33912-th 4096 byte block, seems to consistently reset us, so I'm guessing that for some reason that is an area of memory that should be reserved but isn't
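The block number and the "about 132MB" figure agree, as a quick calculation shows. The sketch also builds a single-block dd command for re-testing; that the reads were made against /proc/vmcore is an assumption here, and the helper names are made up.

```python
BLOCK_SIZE = 4096  # block size used for the dd reads in this comment


def block_to_offset(block, block_size=BLOCK_SIZE):
    """Byte offset into the file at which the given block starts."""
    return block * block_size


def dd_probe_command(block, source="/proc/vmcore", block_size=BLOCK_SIZE):
    """dd command line that reads exactly one block and discards it.

    The default source path is an assumption, not something stated in the bug.
    """
    return f"dd if={source} of=/dev/null bs={block_size} skip={block} count=1"


offset = block_to_offset(33912)
print(f"block 33912 starts at byte {offset} ({offset / (1 << 20):.2f} MiB)")
print(dd_probe_command(33912))
```

Note that this is an offset into the vmcore file, which begins with ELF headers, so it is not necessarily the same number as the physical address being read.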
if I dd from 4k block number 34000 onward there is no hang, so it looks like something between 4k block 33912 and 34000 is not supposed to be accessible from kdump
Don't know if it is related, but hp-dl785g5-01.rhts.bos.redhat.com has a similar problem. It reset during the copying of vmcore. It failed on both RHEL-5.2 and RHEL-5.3 beta (.115.el5 kernel and .37.el5 kexec-tools). I have the machine reserved, feel free to grab it.
A patch just came across rhkl regarding the use of the memmap boot time parameter. I'm strongly suspicious it might affect this problem. I'll be building a kernel with that patch for testing soon
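The suspect window is small, and it could be narrowed further by halving it. The sketch below assumes the problematic blocks form one contiguous run starting at block 33912, which the comments do not confirm; each probe would also cost a reboot if it hits a bad block. Function names are hypothetical.

```python
BLOCK_SIZE = 4096
KNOWN_BAD = 33912   # reading this block resets the machine
KNOWN_GOOD = 34000  # reading from this block onward works


def window_bytes(known_bad, known_good, block_size=BLOCK_SIZE):
    """Size in bytes of the range [known_bad, known_good)."""
    return (known_good - known_bad) * block_size


def next_probe(known_bad, known_good):
    """Midpoint block to read next, or None once the boundary is pinned down."""
    if known_good - known_bad <= 1:
        return None
    return (known_bad + known_good) // 2


def max_probes(known_bad, known_good):
    """Upper bound on halving steps needed to find the first good block."""
    return (known_good - known_bad - 1).bit_length()


size = window_bytes(KNOWN_BAD, KNOWN_GOOD)
print(f"window: {KNOWN_GOOD - KNOWN_BAD} blocks, {size} bytes ({size // 1024} KiB)")
print(f"first probe: block {next_probe(KNOWN_BAD, KNOWN_GOOD)}")
print(f"at most {max_probes(KNOWN_BAD, KNOWN_GOOD)} probes")
```

If a probe resets the machine, it becomes the new known-bad block; if it succeeds, it becomes the new known-good block.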
prarit posted a fix for a bug in the parsing of user memory map specifications recently, and I think this may be related to that. I've built kernels here: /mnt/redhat/brewroot/scratch/nhorman/task_1499612 If you could please, try that kernel on this system and see if it doesn't clear up the problem. I have a feeling that it will
Cai, yes, it looks like the above lab ticket wound up modifying the system name in RHTS. The new name is sun-x4600m2-01.rhts.bos.redhat.com. I just reset it, and it appears to be working in RHTS for you. Kernels -110 forward should have Prarit's fix in place for this issue, so you should be able to test with any of them. Bear in mind, I don't recall what became of the decision, but Prarit's patch exposed another bug on another system, and there was talk of reverting it. I think we figured out what that problem was, and so it's all moot, but if that changes, this bug will re-appear (although that doesn't appear to be the case at the moment)
It still resets during copying of VMCore. I could not get a freshly installed system on sun-x4600m2-01.rhts.bos.redhat.com, because the RHTS scheduler could not pick this machine up for reservation. I'll file a ticket for that. Anyway, since it already has the RHEL5-Server-U2-RC-1 tree installed, I have installed the latest RHEL 5.3 packages, kernel-2.6.18-122.el5 and kexec-1.102pre-49.el5, to try Kdump on it. Next, I'll try to build a new Kernel with the patch mentioned by Neil, which is targeted for RHEL 5.4, on top of -122.el5. http://post-office.corp.redhat.com/archives/rhkernel-list/2008-September/msg01193.html
cai, I wouldn't trust results on that system until you've been able to reinstall. I'm looking at it right now, and its behavior in general is very odd. issuing a mount command returns no results, yet /proc/mounts is fully and correctly populated, and starting the kdump service fails for no apparent reason. I strongly recommend you reinstall the system before testing.
(In reply to comment #28)
> It is still reset during copying of VMCore.
>
> I could not get a fresh installed system on sun-x4600m2-01.rhts.bos.redhat.com,
> because RHTS scheduler could not pick this machine up for reservation. I'll
> fill a ticket for that.
>
> Anyway, since it has already installed RHEL5-Server-U2-RC-1 tree, I have
> installed the latest RHEL 5.3 packages, kernel-2.6.18-122.el5 and
> kexec-1.102pre-49.el5 to try Kdump on it.
>
> Next, I'll try to build a new Kernel with the patch on top of -122.el5
> mentioned by Neil, which is targeted for RHEL 5.4.
>
> http://post-office.corp.redhat.com/archives/rhkernel-list/2008-September/msg01193.html

OK, I re-installed the system, and both the original -122.el5 and the build with the above patch reset the system. So, basically Kdump is not working on this system.
I just tested this on the RHEL 5 GA kernel and RHEL 5.1, and both fail, so I think we can safely say this isn't a regression as it's never worked. That being said, in my testing today, I've noticed a strange BIOS e820 map type that is changing on the system. I don't know if I'll find the root cause by the deadline here, but I have a lead to follow
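One way to spot an e820 entry whose type changes would be to compare the BIOS-e820 lines from two boot logs. The sketch below assumes the "BIOS-e820: start - end (type)" line format printed by kernels of this era; the sample maps are invented for illustration and are not taken from this system, and the function names are hypothetical.

```python
import re

# Assumed line format: " BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)"
E820_LINE = re.compile(
    r"BIOS-e820:\s*([0-9a-fA-F]+)\s*-\s*([0-9a-fA-F]+)\s*\((.+?)\)"
)


def parse_e820(dmesg_text):
    """Map (start, end) address pairs to the region type reported for them."""
    entries = {}
    for match in E820_LINE.finditer(dmesg_text):
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        entries[(start, end)] = match.group(3)
    return entries


def changed_types(first_log, second_log):
    """Regions present in both logs whose reported type differs."""
    first, second = parse_e820(first_log), parse_e820(second_log)
    return {
        region: (first[region], second[region])
        for region in first.keys() & second.keys()
        if first[region] != second[region]
    }


# Made-up example data, only to show the shape of the output.
boot_a = """ BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)"""
boot_b = """ BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (ACPI NVS)"""

for (start, end), (old, new) in changed_types(boot_a, boot_b).items():
    print(f"{start:016x} - {end:016x}: {old} -> {new}")
```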
Created attachment 323107 [details] sun loaner system dmidecode As requested by jwest, this is the dmidecode of the system on loan from sun that works
that seems like rather a non-starter then. The sun system in question here has no AGP card in place, but there certainly could be several out there. I'll test this patch though. Thanks
I've got sun-x4600m2-01 reserved and am building a kernel to test the above proposed fix
Negative on the fix as referenced. Same behavior as previous. Looks like we're back to a hardware issue here, which, looking at it, makes sense, given that the loaner I tested on from Sun worked.
Thanks for the update, Neil, and for testing that patch. I will inform Sun and continue to investigate this with them.
Negative. Tried the bios change, no difference, still fails exactly the same way
Neil, you set the NEEDINFO for the wrong person again.
Created attachment 325188 [details]
SEL logs

Jeremy, here are the SEL logs captured during the reset in the kdump kernel.

[root@sun-x4600m2-01 ~]# date
Mon Dec 1 03:41:01 EST 2008
[root@sun-x4600m2-01 ~]# echo c >/proc/sysrq-trigger
Hey guys, have we tried the latest kernel with this system? This reeks of the gart bug that dchapman and I dealt with recently
gary ping, any update here? You should be able to fix this with the 5.3 kexec-tools and kernel
I have tested kexec-tools-1.102pre-62.el5 and kernel-2.6.18-138.el5 on sun-x4600m2-01.rhts.bos.redhat.com again, and it has been able to capture VMCores both with and without compression. Neil, do you think we can close this one out as a dup of [Bug 475507] [5.3] hp-dl785g5 Reset During Copying of Vmcore?
Given your test results, definitely. I'm closing this as a dup. *** This bug has been marked as a duplicate of bug 475507 ***