458047 – Capture Kernel Reset under Heavy IO

Summary: Capture Kernel Reset under Heavy IO

Keywords:
Status:	CLOSED DUPLICATE of bug 475507
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kexec-tools
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Neil Horman
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-08-06 10:14 UTC by Qian Cai
Modified:	2009-09-09 05:08 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-04-07 13:33:09 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
dmesg (38.85 KB, text/plain) 2008-08-06 10:14 UTC, Qian Cai
Kdump Kernel boot logs (7.71 KB, text/plain) 2008-08-07 14:35 UTC, Qian Cai
dmidecode output (27.88 KB, text/plain) 2008-08-07 14:36 UTC, Qian Cai
sun loaner system dmidecode (68.09 KB, text/plain) 2008-11-10 19:32 UTC, Neil Horman

Description Qian Cai 2008-08-06 10:14:43 UTC

Created attachment 313546 [details]
dmesg

Description of problem:
On Sun Fire X4600 M2 machines (sun-x4600-01.rhts.bos.redhat.com), Kdump failed to capture a vmcore because the capture Kernel reset to BIOS while copying the vmcore to disk. The vmcore is around 30G in size. The disk controller is MPT SAS:

07:04.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064 PCI-X Fusion-MPT SAS (rev 02)

Version-Release number of selected component (if applicable):
kernel-2.6.18-92.el5.x86_64
kexec-tools-1.102pre-21.el5.x86_64

How reproducible:
always

Steps to Reproduce:
1. configure Kdump with 128M@16M.
2. SysRq-C
  
Actual results:
Either no vmcore or vmcore-incomplete

Expected results:
vmcore around 30G in size

Additional info:
It could also be reproduced this way,

1. configure Kdump with 128M@16M.
2. use an empty Kdump configuration file, so it will mount rootfs and run init.
3. disable Kdump service (chkconfig kdump off), so we could login within the capture Kernel.
4. SysRq-C
5. after init finished, login and run the following commands,
dd if=/dev/zero of=vmcore bs=3M count=15000

Then, the machine will be reset in the middle of copying.
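
The `128M@16M` value in the steps above is the size@offset reservation passed on the `crashkernel=` kernel boot parameter. As a minimal sketch for confirming which reservation a kernel was actually booted with, the helper below pulls that value out of a command-line string. The function name `crashkernel_arg` is hypothetical and is not part of kexec-tools.

```shell
#!/bin/sh
# Print the crashkernel= value found in a kernel command-line string.
# crashkernel_arg is a hypothetical helper name used only for this sketch.
crashkernel_arg() {
    # Unquoted $1 is intentional: split the command line into words.
    for word in $1; do
        case "$word" in
            crashkernel=*) echo "${word#crashkernel=}" ;;
        esac
    done
}

# Usage on a live system:
#   crashkernel_arg "$(cat /proc/cmdline)"
```

If nothing is printed, no `crashkernel=` reservation was present on the command line that was passed in.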

Comment 1 Neil Horman 2008-08-06 11:40:01 UTC

As we previously discussed, will it reset if you just leave it sitting there?  What if you idle it in the initramfs using a kdump_pre script?

Comment 2 Qian Cai 2008-08-06 14:14:56 UTC

Somehow, the kdump_pre script did not work for me, but within the capture Kernel I have tried leaving it idle for a period of time, echoing messages in a loop, and reading through all entries except vmcore in the /proc directory in a loop, and have not seen the reset. However, cp and dd commands caused the reset almost immediately.

Comment 3 Neil Horman 2008-08-06 14:22:57 UTC

How did kdump_pre not work for you?

I don't know that I'm going to be able to do much about this if we're getting a hard reset without any indication as to why it occurred.  Can you attach the serial console log from the system as it boots the kdump kernel?  I might be able to note a discrepancy with the dmesg log above.  Thanks.

Comment 4 Qian Cai 2008-08-06 15:22:56 UTC

It happened when reading or writing huge files in the capture Kernel like,

dd if=/dev/zero of=vmcore bs=3M count=15000

I don't have Kdump Kernel boot logs on hand at the moment, but the machine is in RHTS, so feel free to reserve it.

Comment 5 Neil Horman 2008-08-07 12:14:31 UTC

ahh, so it doesn't just happen when reading /proc/vmcore then?  You can dd from any source and the problem will present on this system?

Comment 6 Qian Cai 2008-08-07 14:34:53 UTC

Yes, that is correct. I have the Kdump boot logs and dmidecode output this time. The interesting bits in the Kdump boot logs are,

DMI 2.3 present.
  >>> ERROR: Invalid checksum
ACPI: 2 duplicate SRAT table ignored.
SRAT: PXM 0 Console lost, exiting

Looks like the BIOS is a little bit dated,

Version: 080012 
Release Date: 04/19/2007

I have also had the machine reserved for the next few hours.

Comment 7 Qian Cai 2008-08-07 14:35:46 UTC

Created attachment 313696 [details]
Kdump Kernel boot logs

Comment 8 Qian Cai 2008-08-07 14:36:28 UTC

Created attachment 313697 [details]
dmidecode output

Comment 9 Neil Horman 2008-08-07 15:29:58 UTC

It does look like the bios could use an update, given that the kdump kernel boot seems to have an issue parsing the dmi and srat tables.  I imagine that's what's leading to the difficulty in starting the cpufreq governor service later on.  There have apparently been a few bios updates since the one you reference:
http://www.sun.com/servers/x64/x4600/downloads.jsp#M2

It would likely be worth installing the latest just to see if the issue goes away.   I'm also concerned that we're still seeing 16 cpu cores (as would appear evident from the fact that cpuspeed sees 16 cpu subdirectories under /sys), despite the fact that we booted with maxcpus=1.  It seems that not being able to properly parse the srat and dmi tables is leading to us starting all the cpus, which can lead to various problems including resets.

Are you able to update the bios on this system?
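
The observation above (16 cpu subdirectories despite maxcpus=1) can be checked by hand in the capture kernel. The following is a minimal sketch, assuming the per-CPU directories live in the standard sysfs location `/sys/devices/system/cpu`; the helper name `count_cpu_dirs` is hypothetical. Whether the presence of a `cpuN` directory means that CPU was actually started is an assumption carried over from the comment, not something this check proves.

```shell
#!/bin/sh
# Count cpuN entries under a sysfs-style CPU directory.
# count_cpu_dirs is a hypothetical helper; the default path is the
# standard sysfs location for per-CPU directories.
count_cpu_dirs() {
    dir=${1:-/sys/devices/system/cpu}
    n=0
    for d in "$dir"/cpu[0-9]*; do
        # If the glob matches nothing it stays literal and fails -d.
        [ -d "$d" ] && n=$((n + 1))
    done
    echo "$n"
}

# Usage inside the capture kernel:
#   grep -o 'maxcpus=[0-9]*' /proc/cmdline
#   count_cpu_dirs
```

Comparing the count against the `maxcpus=` value on the command line shows the discrepancy described in the comment.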

Comment 10 Qian Cai 2008-08-07 15:58:13 UTC

Arlinton, is it possible to update sun-x4600-01.rhts.bos.redhat.com to the latest BIOS? We suspect Kdump is not working on this system because of dated firmware. Thanks in advance.

Comment 11 Qian Cai 2008-08-07 17:01:17 UTC

I have also tried Kdump on a similar model sun-x4200. Although it had the same error messages in Kdump Kernel booting logs,

DMI 2.3 present.
  >>> ERROR: Invalid checksum
ACPI: 2 duplicate SRAT table ignored.
SRAT: PXM 0 Console lost, exiting

Kdump could save a complete vmcore on that box despite some EDAC errors while copying it,

EDAC k8 MC0: GART TLB errorr: transaction type(generic), cache level(generic)
EDAC k8 MC0: extended error code: GART error
EDAC k8 MC0: processor context corrupt

In addition, it survived the dd command above.

Comment 12 Neil Horman 2008-08-12 19:16:34 UTC

any word on the firmware update here?

Comment 14 Arlinton Bourne 2008-08-19 00:32:33 UTC

The BIOS update is complete. It's running the latest version available from SUN:

http://www.sun.com/servers/x64/x4600/downloads.jsp

Version 2.1 of the software update package.

Comment 15 Qian Cai 2008-08-19 05:41:00 UTC

Thanks Arlinton!

However, the machine still resets while copying the vmcore in the Kdump Kernel (although it seems it can now survive the dd command above).

Comment 16 Qian Cai 2008-08-19 05:45:28 UTC

Neil, the machine has been reserved for the next few days in case you would like to have a look.

Comment 17 Neil Horman 2008-08-20 20:05:46 UTC

looks like we lost it, I've put in a request to reserve it again.  I'll look when I get it

Comment 18 Neil Horman 2008-08-21 17:31:51 UTC

Note to self: Looks like the reset happens when we access a specific region of the vmcore.  Reading about 132MB into the file, or a dd of the 33912-th 4096 byte block, seems to consistently reset us, so I'm guessing that for some reason that is an area of memory that should be reserved but isn't

Comment 19 Neil Horman 2008-08-22 19:56:58 UTC

if I dd from 4k block number 34000 onward there is no hang, so it looks like something between 4k block 33912 and 34000 is not supposed to be accessible from kdump
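
To make the block arithmetic above concrete, the sketch below converts a 4096-byte block index into a byte offset and whole MiB. The helper name `block_offset` is hypothetical, and whether the reported block numbers were zero- or one-based is not stated in the comments, so the figures are approximate to within one block.

```shell
#!/bin/sh
# Convert a 4096-byte block index into a byte offset and whole MiB.
# block_offset is a hypothetical helper name for this sketch.
block_offset() {
    bytes=$(( $1 * 4096 ))
    echo "$bytes bytes ($(( bytes / 1048576 )) MiB)"
}

block_offset 33912   # block reported to trigger the reset
block_offset 34000   # reading from here onward was reported safe

# Reading a single block at a given index (do NOT run this on the
# affected system unless a reset is acceptable):
#   dd if=/proc/vmcore of=/dev/null bs=4096 skip=33912 count=1
```

Both indices fall at roughly 132 MiB into the file, and the window between them is 88 blocks, or 352 KiB.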

Comment 21 Qian Cai 2008-09-18 10:37:37 UTC

Don't know if it is related, but hp-dl785g5-01.rhts.bos.redhat.com has a similar problem. It reset while copying the vmcore. It failed on both RHEL-5.2 and RHEL-5.3 beta (.115.el5 kernel and .37.el5 kexec-tools). I have the machine reserved, feel free to grab it.

Comment 22 Neil Horman 2008-09-29 18:47:30 UTC

A patch just came across rhkl regarding the use of the memmap boot time parameter.  I strongly suspect it might affect this problem.  I'll be building a kernel with that patch for testing soon

Comment 23 Neil Horman 2008-09-30 15:58:15 UTC

prarit posted a fix for a bug in the parsing of user memory map specifications recently, and I think this may be related to that.  I've built kernels here:
/mnt/redhat/brewroot/scratch/nhorman/task_1499612

If you could please, try that kernel on this system and see if it doesn't clear up the problem.  I have  a feeling that it will

Comment 27 Neil Horman 2008-11-06 01:29:25 UTC

Cai, yes. It looks like the lab ticket above wound up changing the system name
in RHTS.  The new name is sun-x4600m2-01.rhts.bos.redhat.com.  I just reset it,
and it appears to be working in RHTS for you.  Kernels from -110 forward should
have Prarit's fix in place for this issue, so you should be able to test with
any of them.  Bear in mind that Prarit's patch exposed another bug on another
system, and there was talk of reverting it; I don't recall what became of that
decision.  I think we figured out what that problem was, so it's all moot, but
if that changes, this bug will re-appear (although that doesn't appear to be
the case at the moment)

Comment 28 Qian Cai 2008-11-06 02:23:17 UTC

It still resets while copying the VMCore.

I could not get a freshly installed system on sun-x4600m2-01.rhts.bos.redhat.com,
because the RHTS scheduler could not pick this machine up for reservation. I'll
file a ticket for that.

Anyway, since it already has the RHEL5-Server-U2-RC-1 tree installed, I have
installed the latest RHEL 5.3 packages, kernel-2.6.18-122.el5 and
kexec-1.102pre-49.el5, to try Kdump on it.

Next, I'll try to build a new Kernel with the patch mentioned by Neil on top
of -122.el5; the patch is targeted for RHEL 5.4.

http://post-office.corp.redhat.com/archives/rhkernel-list/2008-September/msg01193.html

Comment 29 Neil Horman 2008-11-06 02:50:49 UTC

cai, I wouldn't trust results on that system until you've been able to reinstall. I'm looking at it right now, and its behavior in general is very odd.  issuing a mount command returns no results, yet /proc/mounts is fully and correctly populated, and starting the kdump service fails for no apparent reason.  I strongly recommend you reinstall the system before testing.

Comment 31 Qian Cai 2008-11-06 10:08:12 UTC

(In reply to comment #28)
> It still resets while copying the VMCore.
> 
> I could not get a freshly installed system on sun-x4600m2-01.rhts.bos.redhat.com,
> because the RHTS scheduler could not pick this machine up for reservation. I'll
> file a ticket for that.
> 
> Anyway, since it already has the RHEL5-Server-U2-RC-1 tree installed, I have
> installed the latest RHEL 5.3 packages, kernel-2.6.18-122.el5 and
> kexec-1.102pre-49.el5, to try Kdump on it.
> 
> Next, I'll try to build a new Kernel with the patch mentioned by Neil on top
> of -122.el5; the patch is targeted for RHEL 5.4.
> 
> http://post-office.corp.redhat.com/archives/rhkernel-list/2008-September/msg01193.html

OK, I re-installed the system and tried both the original -122.el5 kernel and a build with the above patch; both reset the system. So, basically, Kdump is not working on this system.

Comment 32 Neil Horman 2008-11-06 19:38:27 UTC

I just tested this on the RHEL 5 GA kernel and RHEL 5.1, and both fail, so I think we can safely say this isn't a regression, as it's never worked.

That being said, in my testing today I've noticed a strange bios e820 map type that is changing on the system.  I don't know if I'll find the root cause by the deadline here, but I have a lead to follow

Comment 47 Neil Horman 2008-11-10 19:32:29 UTC

Created attachment 323107 [details]
sun loaner system dmidecode

As requested by jwest, this is the dmidecode of the system on loan from sun that works

Comment 49 Neil Horman 2008-11-13 18:29:03 UTC

that seems like rather a non-starter then.  The sun system in question here has no  AGP card in place, but there certainly could be several out there.  I'll test this patch though.  Thanks

Comment 50 Neil Horman 2008-11-14 02:07:18 UTC

I've got sun-x4600m2-01 reserved and am building a kernel to test the above proposed fix

Comment 51 Neil Horman 2008-11-14 02:57:36 UTC

Negative on the fix as referenced.  Same behavior as before.  Looks like we're back to a hardware issue here, which makes sense given that the loaner from sun that I tested on worked.

Comment 52 Jeremy West 2008-11-14 04:20:43 UTC

Thanks for the update, Neil, and for testing that patch.  I'll inform Sun and continue to investigate this with them.

Comment 54 Neil Horman 2008-11-21 20:49:15 UTC

Negative.  Tried the bios change, no difference, still fails exactly the same way

Comment 55 Qian Cai 2008-11-24 02:46:06 UTC

Neil, you set the NEEDINFO for the wrong person again.

Comment 57 Qian Cai 2008-12-01 08:53:38 UTC

Created attachment 325188 [details]
SEL logs

Jeremy, here are the SEL logs captured during the reset in the kdump kernel.

[root@sun-x4600m2-01 ~]# date
Mon Dec  1 03:41:01 EST 2008
[root@sun-x4600m2-01 ~]# echo c >/proc/sysrq-trigger

Comment 59 Neil Horman 2009-02-20 19:39:40 UTC

Hey guys, have we tried the latest kernel with this system?  This reeks of the gart bug that dchapman and I dealt with recently

Comment 60 Neil Horman 2009-03-23 11:04:46 UTC

gary ping, any update here?  You should be able to fix this with the 5.3 kexec-tools and kernel

Comment 61 Qian Cai 2009-04-07 11:48:50 UTC

I have tested kexec-tools-1.102pre-62.el5 and kernel-2.6.18-138.el5 on sun-x4600m2-01.rhts.bos.redhat.com again, and it has been able to capture VMCores with and without compression. Neil, do you think we can close this one out as a dup of,
[Bug 475507] [5.3] hp-dl785g5 Reset During Copying of Vmcore ?

Comment 62 Neil Horman 2009-04-07 13:33:09 UTC

Given your test results, definitely.  I'm closing this as a dup.

*** This bug has been marked as a duplicate of bug 475507 ***
