Bug 585979 - kdump fails to save vmcore on machine with 1TB memory
Summary: kdump fails to save vmcore on machine with 1TB memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kexec-tools
Version: 5.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Cong Wang
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 563345 590547
Reported: 2010-04-26 15:01 UTC by Dave Maley
Modified: 2018-10-27 16:06 UTC

Fixed In Version: kexec-tools-1.102pre-98.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 23:18:43 UTC
Target Upstream Version:
Embargoed:


Attachments
kexec-phys40bit-fix.patch (1.17 KB, patch)
2010-04-26 15:01 UTC, Dave Maley
panic trace (10.45 KB, application/octet-stream)
2010-05-10 22:14 UTC, Shyam Iyer


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2011:0061 | 0 | normal | SHIPPED_LIVE | kexec-tools bug fix update | 2011-01-12 17:22:27 UTC

Description Dave Maley 2010-04-26 15:01:07 UTC
Created attachment 409205
kexec-phys40bit-fix.patch

Description of problem:
The RHEL5 kernel supports physical addresses of up to 40 bits and discards pages above that boundary. /proc/vmcore refuses to access areas beyond 40 bits.

On the other hand, /sbin/kexec generates an ELF header that includes the area above 40 bits as physical memory, because that area is listed in /proc/iomem.

As a result, kdump tries to read /proc/vmcore beyond the 40-bit boundary, gets -EINVAL, and fails.

The proposed patch fixes the problem by applying the same limit in kexec that the kernel applies.
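
The mismatch described above can be modelled in a few lines of C. This is only a sketch under stated assumptions: `PHYS_LIMIT_40BIT`, `model_vmcore_read`, and `first_unreadable` are hypothetical names for illustration, not kernel or kexec-tools code, and the model reduces the /proc/vmcore behaviour to the single rule given in this report (addresses beyond 40 bits are rejected with -EINVAL).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Highest physical address the kernel keeps, per this report: 40 bits. */
#define PHYS_LIMIT_40BIT ((UINT64_C(1) << 40) - 1)

/* Hypothetical model of a /proc/vmcore read: reject anything past the limit. */
static int model_vmcore_read(uint64_t paddr)
{
    return paddr > PHYS_LIMIT_40BIT ? -EINVAL : 0;
}

/*
 * For an inclusive range [start, end] taken from the ELF header, return the
 * first address the model would refuse, or 0 if the whole range is readable.
 */
static uint64_t first_unreadable(uint64_t start, uint64_t end)
{
    assert(start <= end);
    if (end <= PHYS_LIMIT_40BIT)
        return 0;
    return start > PHYS_LIMIT_40BIT ? start : PHYS_LIMIT_40BIT + 1;
}
```

For a made-up range of 0x100000000-0x1007fffffff, `first_unreadable` returns 0x10000000000, the first address past 1 TiB. Such a range is readable for most of its length and only fails at the top, which would be consistent with the save aborting near the end.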


Version-Release number of selected component (if applicable):
kernel-2.6.18-194.el5


How reproducible:
100%


Steps to Reproduce:
1. Set up kdump
2. Trigger kdump
    # echo c > /proc/sysrq-trigger

  
Actual results:
Saving vmcore aborts near the end.


Expected results:
Saving vmcore succeeds.


Additional info:

Comment 1 Neil Horman 2010-04-30 13:49:25 UTC
I fixed this a few weeks back already, but thank you, Dave! 1.102pre-97 should have the fix you need.

*** This bug has been marked as a duplicate of bug 559928 ***

Comment 2 Dave Maley 2010-04-30 20:19:51 UTC
Hi Neil - These bugs certainly appear to be similar, however this one's x86_64.  The patch (provided by NEC) appears to do essentially the same thing as the patch from bug 559928, but at a different boundary.

-----

Fix kdump on a machine with 1TB memory.

The RHEL5 kernel supports physical addresses of up to 40 bits and discards pages above that boundary. /proc/vmcore refuses to access areas beyond 40 bits.

On the other hand, /sbin/kexec generates an ELF header that includes the area above 40 bits as physical memory, because that area is listed in /proc/iomem.

As a result, kdump tries to read /proc/vmcore beyond the 40-bit boundary, gets -EINVAL, and fails.

This patch fixes the problem by applying the same limit in kexec that the kernel applies.

--- kexec-tools-testing-20070330/kexec/arch/x86_64/crashdump-x86_64.c   2010-04-20 11:38:56.000000000 +0900
+++ kexec-tools-testing-20070330.1TB/kexec/arch/x86_64/crashdump-x86_64.c       2010-04-21 16:38:18.000000000 +0900
@@ -203,6 +203,11 @@ static int get_crash_memory_ranges(struc
                /* Only Dumping memory of type System RAM. */
                if (memcmp(str, "System RAM\n", 11) == 0) {
                        type = RANGE_RAM;
+#define MAX_PHYSMEM_40BIT ((1UL << 40) - 1)
+                       if (start > MAX_PHYSMEM_40BIT)
+                               continue;
+                       else if (end > MAX_PHYSMEM_40BIT)
+                               end = MAX_PHYSMEM_40BIT;
                } else if (memcmp(str, "Crash kernel\n", 13) == 0) {
                                /* Reserved memory region. New kernel can
                                 * use this region to boot into. */
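
The clamp in the patch can also be tried outside kexec. The following is a minimal sketch, assuming input lines in the /proc/iomem "start-end : name" format; `clamp_system_ram` is a hypothetical helper written for illustration and is not part of kexec-tools. Only the limit macro and the skip/truncate rule are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same limit as in the patch: the last address that fits in 40 bits. */
#define MAX_PHYSMEM_40BIT ((UINT64_C(1) << 40) - 1)

/*
 * Parse one /proc/iomem-style line and, if it names System RAM, store the
 * inclusive range clamped to the 40-bit limit.
 * Returns 1 if a range was stored, 0 if the line should be skipped.
 */
static int clamp_system_ram(const char *line, uint64_t *start, uint64_t *end)
{
    unsigned long long s, e;
    int consumed = 0;

    assert(line && start && end);

    /* %n records where the region name begins; it stays 0 on a bad line. */
    if (sscanf(line, "%llx-%llx : %n", &s, &e, &consumed) != 2 || consumed == 0)
        return 0;
    if (strncmp(line + consumed, "System RAM", 10) != 0)
        return 0;

    if (s > MAX_PHYSMEM_40BIT)
        return 0;                    /* entirely above the limit: drop it */
    if (e > MAX_PHYSMEM_40BIT)
        e = MAX_PHYSMEM_40BIT;       /* straddles the limit: truncate it */

    *start = s;
    *end = e;
    return 1;
}
```

With a made-up line such as "100000000-1007fffffff : System RAM", the helper stores 0x100000000-0xffffffffff. A System RAM line that starts above 0xffffffffff is skipped, as is any line that is not System RAM.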

Comment 3 Neil Horman 2010-05-03 17:12:19 UTC
Ah, I see what they're doing now. OK, I can take this once all the flags get set.

Comment 5 Marizol Martinez 2010-05-07 14:35:33 UTC
Neil -- We happen to have a 1TB system in Westford *today*. I have added Shyam so he can test the patch. Could you please upload or provide a link to a test rpm that he could try? Thanks!

Comment 6 Marizol Martinez 2010-05-07 14:36:51 UTC
The 1TB system I referred to in comment 5 is x86_64.

Comment 7 Larry Troan 2010-05-07 14:47:39 UTC
Neil, does the suggested patch cover the case where the BIOS remaps memory (creates holes in the physical address space) so kdump would be physically trying to reference addresses over the 1TB boundary?

Comment 10 Neil Horman 2010-05-07 17:11:45 UTC
fixed, thanks!

Comment 11 Shyam Iyer 2010-05-07 19:10:27 UTC
Would all crashkernel parameters work here?

crashkernel=128M@16M would not work for me, whereas crashkernel=128M@32M would work when I was debugging another issue on this system.

Comment 13 Shyam Iyer 2010-05-10 22:14:23 UTC
So, the kernel still crashes if I pass crashkernel=128M@16M for the kdump configuration.

crashkernel=128M@32M always passes. This was the workaround used to pass certification.

In my environment I don't see kdump itself failing. Instead, a driver that needs memory already reserved by kdump panics. It is either the storage (boot controller) driver or the USB driver.

Tested with kexec-tools-1.102pre-96.el5_5.2.x86_64.rpm

Comment 14 Shyam Iyer 2010-05-10 22:14:58 UTC
Created attachment 412985
panic trace

Comment 16 Han Pingtian 2010-05-19 03:39:47 UTC
(In reply to comment #11)
> Would all crashkernel parameters work here?
> 
> crashkernel=128M@16M would not work for me, whereas crashkernel=128M@32M would
> work when I was debugging another issue on this system.

Hi Shyam, I think this may be a different bug. Is there any chance you could verify this 1TB bug with kexec-tools-1.102pre-96.el5_5? Thanks.

Comment 17 Han Pingtian 2010-05-19 03:40:57 UTC
(In reply to comment #16)
> (In reply to comment #11)
> > Would all crashkernel parameters work here?
> > 
> > crashkernel=128M@16M would not work for me, whereas crashkernel=128M@32M would
> > work when I was debugging another issue on this system.
> 
> Hi Shyam, I think this may be a different bug. Is there any chance you could
> verify this 1TB bug with kexec-tools-1.102pre-96.el5_5? Thanks.

Sorry, should be 1.102pre-96.el5_5.2. Thanks.

Comment 20 Chris Ward 2010-05-21 07:42:03 UTC
Partners, 

Please grab the latest available bits here to test whether the new kdump can save vmcore to local disk.

http://people.redhat.com/qcai/kexec-tools/

Comment 21 Marizol Martinez 2010-05-21 18:57:48 UTC
We no longer have in Westford the 1TB Dell system I referred to in comments 5, 6 (it had to be sent back and it's currently not available to RH). Shyam had kindly volunteered to assist with the testing, but given that the system is no longer available, he won't be able to provide any additional testing feedback.

Comment 22 Issue Tracker 2010-05-24 05:18:26 UTC
Event posted on 05-24-2010 01:54pm JST by jnomura

File uploaded: kexec-1.102pre-96.el5_5.2.log

This event sent from IssueTracker by mfuruta 
 issue 795343
it_file 694393

Comment 23 Issue Tracker 2010-05-24 05:18:28 UTC
Event posted on 05-24-2010 01:54pm JST by jnomura

Furuta-san,

With kexec-1.102pre-96.el5_5.2 and "crashkernel=128M@16M" boot option,
  - vmcore was saved to the local disk without error
  - crash can open the vmcore without error
on our 1TB-memory machine.
Attached is a short log.


Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

This event sent from IssueTracker by mfuruta 
 issue 795343

Comment 24 Masaki Furuta ( RH ) 2010-05-24 05:22:58 UTC
Hi,

(In reply to comment #20)
> Partners, 
> 
> Please grab the latest available bits here to test whether the new kdump can
> save vmcore to local disk.
> 
> http://people.redhat.com/qcai/kexec-tools/    

NEC has verified this with kexec-1.102pre-96.el5_5.2. I've forwarded their result from IT#795343; could you please check their last comment?

Thank you in advance.

Best Regards,
Masaki Furuta

Comment 25 Chris Ward 2010-05-24 07:53:01 UTC
Thank you Masaki-san.

Comment 30 errata-xmlrpc 2011-01-13 23:18:43 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0061.html

