Bug 242648

Summary: kdump broken on x86_64
Product: Red Hat Enterprise Linux 5 Reporter: Gerd Hoffmann <kraxel>
Component: kernel Assignee: Neil Horman <nhorman>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: low Docs Contact:
Priority: low    
Version: 5.0 CC: jnomura, sprabhu
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0959 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-11-07 19:51:22 UTC Type: ---
Bug Depends On:    
Bug Blocks: 223736, 230752, 243442    
Attachments:
Description Flags
workaround none
boot log diff none
patch to correct x86_64 crashdump validation none

Description Gerd Hoffmann 2007-06-05 09:41:30 UTC
Description of problem:
kdump broken on x86_64

Version-Release number of selected component (if applicable):
2.6.18-20.el5

How reproducible:
Boot with crashkernel=something, then check dmesg and /proc/iomem:
reserving the crash kernel memory has failed.
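
A minimal sketch of the /proc/iomem check described above, in Python. It assumes that a successful reservation shows up as a resource line named "Crash kernel". The helper name find_crash_kernel and the sample listing are made up for illustration and are not taken from this report.

```python
import re

# Matches a /proc/iomem line such as "  01000000-08ffffff : Crash kernel".
_CRASH_LINE = re.compile(
    r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*Crash kernel\s*$"
)


def find_crash_kernel(iomem_text):
    """Return (start, end) of the 'Crash kernel' region, or None if absent."""
    for line in iomem_text.splitlines():
        match = _CRASH_LINE.match(line)
        if match:
            return int(match.group(1), 16), int(match.group(2), 16)
    return None


# Hypothetical listing for illustration only (not output from the reporter's machine).
SAMPLE_IOMEM = """\
00100000-3fffffff : System RAM
  00200000-0045ffff : Kernel code
  01000000-08ffffff : Crash kernel
"""

region = find_crash_kernel(SAMPLE_IOMEM)
if region is None:
    print("no crash kernel region reserved")
else:
    print("crash kernel reserved at %#x-%#x" % region)
```

On a real system the text would come from reading /proc/iomem; a result of None corresponds to the failure reported here.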

Additional info:
linux-2.6-kdump-bounds-checking-for-crashkernel-args.patch is broken.
Initialization order bug: On x86_64 the sanity check happens before
max_low_pfn is initialized, thus it fails no matter what arguments
were given.

Comment 1 Neil Horman 2007-06-05 20:07:46 UTC
I'm looking at the boot code, and from what I see, max_low_pfn is set prior to
every call to setup_bootmem_allocator, where that check takes place, so I'm not
sure that your analysis is correct (at least not yet).  What you're describing
however, sounds to me like the result of a patch that was added in 2.6.18-17.EL5
for bz 236759.  It was supposed to fix a broken northbridge setup, but for some
systems it resulted in a hang (and as a side effect, a failure to validate
crashkernel resources, probably from a miscomputed max_low_pfn value).  I'm not
sure of the details, but you can likely confirm this is the case by testing with
kernel-2.6.18-16.EL5, and then again with kernel-2.6.18-17.EL5.  If the problem
is not present in the former, and is in the latter, then we can assume that's the
problem we need to track down.

Comment 2 Gerd Hoffmann 2007-06-06 08:09:55 UTC
Created attachment 156327 [details]
workaround

I'm using the attached patch to work around the broken test (it just comments
it out).  I had also added max_low_pfn output to the error message for debugging;
it showed max_low_pfn still being 0.  I'll go fetch and test 16+17 soon.

Comment 3 Gerd Hoffmann 2007-06-06 08:44:02 UTC
Hmm, both 16 and 17 fail, so it must be something else.  Current release
(8.something) works ok though.  Will try the builds in between now ...

Comment 4 Gerd Hoffmann 2007-06-06 09:20:35 UTC
Created attachment 156329 [details]
boot log diff

The regression was added between 14.el5 and 15.el5.

Comment 5 Neil Horman 2007-06-06 13:37:26 UTC
Well, that's odd.  That correlates to prarit's addition of the bounds
checking, but that just brings me back to wondering why max_low_pfn is set
improperly for you.  I'll try to reproduce on an x86_64 system here, but I'm
beginning to suspect that this is a problem with the max_low_pfn computation
specific to your system.

About the only other thing that might relate is my patch for support for the
calgary iommu.  Is this an IBM system by any chance that you're working with?  If
so could you please try booting your system with iommu=soft on the command line?

Comment 6 Gerd Hoffmann 2007-06-06 13:59:33 UTC
No ibm box, it is a virtual machine, using kvm (i.e. pretty standard pc hardware
as emulated by qemu, intel ich3 chipset IIRC ...).

Comment 7 Neil Horman 2007-06-06 14:24:59 UTC
Ok, so it is prarit's check addition that has done this, and I still don't
understand why max_low_pfn isn't set yet.  On the up side, I just recreated it
here.  I'll start debugging right away.  Thanks!

Comment 8 Neil Horman 2007-06-06 18:05:37 UTC
Created attachment 156376 [details]
patch to correct x86_64 crashdump validation 

Think I found the problem.  As it turns out I was wrong before: x86_64 doesn't
even initialize max_low_pfn.  max_low_pfn represents the maximum page frame
number that is allowed in lowmem, and x86_64, not currently having the concept
of lowmem (or rather, having the address space to treat all memory as de facto
lowmem), never sets it.  That being the case, kexec should be able to
reserve memory from anywhere in the physical address space, as long as the RAM
actually exists.  As such, instead of checking our crashkernel parameter
against max_low_pfn, we need to be checking it against end_pfn, which should
allow any address that is physically populated with RAM.

I've tested the attached patch, and it works well for me.  Please test it on
your system as well and confirm.  Thanks!
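
To make the reasoning above concrete, here is a small Python model of the bounds check. It is a sketch under stated assumptions, not the attached patch: the function name crash_region_fits, the 512 MiB machine size, and the 128M@16M request are all hypothetical. It only shows why a limit of 0 (max_low_pfn never set) rejects every request, while a limit of end_pfn accepts any region backed by RAM.

```python
PAGE_SHIFT = 12  # 4 KiB pages, as on x86_64
MIB = 1 << 20


def crash_region_fits(base, size, limit_pfn):
    """True if the byte range [base, base + size) ends at or below limit_pfn pages."""
    if size <= 0:
        return False
    limit_bytes = limit_pfn << PAGE_SHIFT
    return base + size <= limit_bytes


# Hypothetical request, e.g. crashkernel=128M@16M on a machine with 512 MiB of RAM.
base, size = 16 * MIB, 128 * MIB
end_pfn = (512 * MIB) >> PAGE_SHIFT  # assumed last populated page frame
max_low_pfn = 0                      # never set on x86_64, per the comment above

print(crash_region_fits(base, size, max_low_pfn))  # check against the unset value
print(crash_region_fits(base, size, end_pfn))      # check against end_pfn
```

With the limit at 0, no non-empty region can pass, which matches the reported behaviour that the check fails regardless of the arguments.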

Comment 9 Neil Horman 2007-06-06 18:10:57 UTC
I'm posting this, since today is the deadline for this.

Comment 10 RHEL Program Management 2007-06-06 19:01:59 UTC
This request was evaluated by Red Hat Kernel Team for inclusion in a Red
Hat Enterprise Linux maintenance release, and has moved to bugzilla 
status POST.

Comment 11 Gerd Hoffmann 2007-06-07 06:34:50 UTC
Patch works fine for me, thanks.

Comment 12 Jun'ichi Nomura (Red Hat) 2007-06-07 14:27:31 UTC
*** Bug 242817 has been marked as a duplicate of this bug. ***

Comment 13 Neil Horman 2007-06-12 16:06:33 UTC
*** Bug 238987 has been marked as a duplicate of this bug. ***

Comment 14 Don Zickus 2007-06-16 00:38:41 UTC
in 2.6.18-27.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 17 errata-xmlrpc 2007-11-07 19:51:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html