Bug 585979 - kdump fails to save vmcore on machine with 1TB memory
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kexec-tools
Version: 5.5
Hardware: x86_64 Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Cong Wang
QA Contact: Red Hat Kernel QE team
Keywords: OtherQA, Reopened, ZStream
Depends On:
Blocks: 563345 590547
Reported: 2010-04-26 11:01 EDT by Dave Maley
Modified: 2013-09-29 22:14 EDT (History)
Fixed In Version: kexec-tools-1.102pre-98.el5
Doc Type: Bug Fix
Last Closed: 2011-01-13 18:18:43 EST
Attachments
kexec-phys40bit-fix.patch (1.17 KB, patch), 2010-04-26 11:01 EDT, Dave Maley
panic trace (10.45 KB, application/octet-stream), 2010-05-10 18:14 EDT, Shyam Iyer

External Trackers
Red Hat Product Errata RHBA-2011:0061 (priority: normal, status: SHIPPED_LIVE), "kexec-tools bug fix update", last updated 2011-01-12 12:22:27 EST

Description Dave Maley 2010-04-26 11:01:07 EDT
Created attachment 409205 [details]
kexec-phys40bit-fix.patch

Description of problem:
The RHEL5 kernel supports physical addresses of up to 40 bits and discards pages above that boundary. /proc/vmcore refuses to access areas above 40 bits.

/sbin/kexec, on the other hand, generates an ELF header that includes the area above 40 bits as physical memory, because it takes its ranges from /proc/iomem.

As a result, kdump tries to read /proc/vmcore beyond the 40-bit boundary, gets -EINVAL, and fails.

The proposed patch fixes the problem by applying the same limit to kexec that the kernel applies.
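
For readers who want to check whether a machine is affected, the following is a minimal sketch, not code from kexec-tools. It assumes /proc/iomem lines of the usual "start-end : name" form with inclusive hexadecimal addresses. The helper names parse_system_ram and extends_past_40bit and the macro PHYS_LIMIT_40BIT are hypothetical, introduced only for illustration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* First physical address beyond the 40-bit range (1 TiB). */
#define PHYS_LIMIT_40BIT (UINT64_C(1) << 40)

/* Hypothetical helper: parse one /proc/iomem-style line such as
 * "100000000-1007fffffff : System RAM".
 * Returns 1 and fills *start / *end (inclusive) when the line describes
 * System RAM; returns 0 for any other line or on a parse failure. */
static int parse_system_ram(const char *line, uint64_t *start, uint64_t *end)
{
    int consumed = 0;

    if (sscanf(line, "%" SCNx64 "-%" SCNx64 " : %n", start, end, &consumed) != 2)
        return 0;
    if (consumed <= 0)          /* the " : " separator was not matched */
        return 0;
    return strncmp(line + consumed, "System RAM", 10) == 0;
}

/* Returns 1 if an inclusive end address lies at or above 1 TiB,
 * i.e. the range reaches into the area /proc/vmcore refuses to read. */
static int extends_past_40bit(uint64_t end)
{
    return end >= PHYS_LIMIT_40BIT;
}
```

Feeding each line of /proc/iomem through parse_system_ram and then testing the end address with extends_past_40bit would flag the System RAM ranges that, according to this report, make the dump fail.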


Version-Release number of selected component (if applicable):
kernel-2.6.18-194.el5


How reproducible:
100%


Steps to Reproduce:
1. Set up kdump
2. Trigger kdump
    # echo c > /proc/sysrq-trigger

  
Actual results:
Saving vmcore aborts near the end.


Expected results:
Saving vmcore succeeds.


Additional info:
Comment 1 Neil Horman 2010-04-30 09:49:25 EDT
I fixed this a few weeks back already, but thank you, Dave!  1.102pre-97 should have the fix you need.

*** This bug has been marked as a duplicate of bug 559928 ***
Comment 2 Dave Maley 2010-04-30 16:19:51 EDT
Hi Neil - These bugs certainly appear to be similar; however, this one's x86_64.  The patch (provided by NEC) appears to do essentially the same thing as the patch from bug 559928, but at a different boundary.

-----

Fix kdump on a machine with 1TB memory.

The RHEL5 kernel supports physical addresses of up to 40 bits and discards pages above that boundary. /proc/vmcore refuses to access areas above 40 bits.

/sbin/kexec, on the other hand, generates an ELF header that includes the area above 40 bits as physical memory, because it takes its ranges from /proc/iomem.

As a result, kdump tries to read /proc/vmcore beyond the 40-bit boundary, gets -EINVAL, and fails.

This patch fixes the problem by applying the same limit to kexec that the kernel applies.

--- kexec-tools-testing-20070330/kexec/arch/x86_64/crashdump-x86_64.c   2010-04-20 11:38:56.000000000 +0900
+++ kexec-tools-testing-20070330.1TB/kexec/arch/x86_64/crashdump-x86_64.c       2010-04-21 16:38:18.000000000 +0900
@@ -203,6 +203,11 @@ static int get_crash_memory_ranges(struc
                /* Only Dumping memory of type System RAM. */
                if (memcmp(str, "System RAM\n", 11) == 0) {
                        type = RANGE_RAM;
+#define MAX_PHYSMEM_40BIT ((1UL << 40) - 1)
+                       if (start > MAX_PHYSMEM_40BIT)
+                               continue;
+                       else if (end > MAX_PHYSMEM_40BIT)
+                               end = MAX_PHYSMEM_40BIT;
                } else if (memcmp(str, "Crash kernel\n", 13) == 0) {
                                /* Reserved memory region. New kernel can
                                 * use this region to boot into. */
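
The boundary handling in the hunk above can be exercised on its own. The following is a minimal standalone sketch of the same clamping rule, not the code that shipped. The function name clamp_range_40bit is hypothetical, and the limit is written with UINT64_C so that it does not depend on the width of unsigned long.

```c
#include <stdint.h>

/* Highest physical address inside the 40-bit range: 1 TiB minus one byte. */
#define MAX_PHYSMEM_40BIT ((UINT64_C(1) << 40) - 1)

/* Hypothetical helper (not part of kexec-tools): apply the 40-bit limit
 * to an inclusive [start, *end] memory range.
 * Returns 0 if the range lies entirely above the limit and should be
 * skipped; returns 1 otherwise, truncating *end to the limit if needed. */
static int clamp_range_40bit(uint64_t start, uint64_t *end)
{
    if (start > MAX_PHYSMEM_40BIT)
        return 0;
    if (*end > MAX_PHYSMEM_40BIT)
        *end = MAX_PHYSMEM_40BIT;
    return 1;
}
```

A range that straddles 1 TiB is kept but shortened to end at 0xffffffffff, while a range that starts above 1 TiB is dropped, which mirrors the "continue" branch of the patch.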
Comment 3 Neil Horman 2010-05-03 13:12:19 EDT
Ah, I see what they're doing now.  OK, I can take this once all the flags get set.
Comment 5 Marizol Martinez 2010-05-07 10:35:33 EDT
Neil -- We happen to have a 1TB system in Westford *today*. I have added Shyam so he can test the patch. Could you please upload or provide a link to a test rpm that he could try? Thanks!
Comment 6 Marizol Martinez 2010-05-07 10:36:51 EDT
The 1TB system I refer to in comment 5 is x86_64.
Comment 7 Larry Troan 2010-05-07 10:47:39 EDT
Neil, does the suggested patch cover the case where the BIOS remaps memory (creates holes in the physical address space) so kdump would be physically trying to reference addresses over the 1TB boundary?
Comment 10 Neil Horman 2010-05-07 13:11:45 EDT
fixed, thanks!
Comment 11 Shyam Iyer 2010-05-07 15:10:27 EDT
Would all crashkernel parameters work here?

crashkernel=128M@16M would not work for me, whereas crashkernel=128M@32M would work
when I was debugging another issue on this system.
Comment 13 Shyam Iyer 2010-05-10 18:14:23 EDT
So, the kernel still crashes if I pass
crashkernel=128M@16M for the kdump configuration.

crashkernel=128M@32M always passes. This was the workaround used to pass certification.

In my environment kdump itself does not fail; instead, a driver that needs memory already reserved by kdump panics. It is either the storage (boot controller) driver or the USB driver.

Tested with kexec-tools-1.102pre-96.el5_5.2.x86_64.rpm
Comment 14 Shyam Iyer 2010-05-10 18:14:58 EDT
Created attachment 412985 [details]
panic trace
Comment 16 Han Pingtian 2010-05-18 23:39:47 EDT
(In reply to comment #11)
> Would all crashkernel parameters work here?
>
> crashkernel=128M@16M would not work for me, whereas crashkernel=128M@32M would
> work
> when I was debugging another issue on this system.

Hi Shyam, I think this may be a separate bug. Is there any chance you could verify this 1TB bug with kexec-tools-1.102pre-96.el5_5? Thanks.
Comment 17 Han Pingtian 2010-05-18 23:40:57 EDT
(In reply to comment #16)
> (In reply to comment #11)
> > Would all crashkernel parameters work here?
> >
> > crashkernel=128M@16M would not work for me, whereas crashkernel=128M@32M would
> > work
> > when I was debugging another issue on this system.
>
> Hi Shyam, I think this may be a separate bug. Is there any chance you could
> verify this 1TB bug with kexec-tools-1.102pre-96.el5_5? Thanks.

Sorry, that should be 1.102pre-96.el5_5.2. Thanks.
Comment 20 Chris Ward 2010-05-21 03:42:03 EDT
Partners, 

Please grab the latest available bits here to test whether the new kdump can save vmcore to local disk.

http://people.redhat.com/qcai/kexec-tools/
Comment 21 Marizol Martinez 2010-05-21 14:57:48 EDT
The 1TB Dell system I referred to in comments 5 and 6 is no longer in Westford (it had to be sent back and is currently not available to RH). Shyam had kindly volunteered to assist with the testing, but since the system is no longer available, he won't be able to provide any additional testing feedback.
Comment 22 Issue Tracker 2010-05-24 01:18:26 EDT
Event posted on 05-24-2010 01:54pm JST by jnomura

File uploaded: kexec-1.102pre-96.el5_5.2.log

This event sent from IssueTracker by mfuruta@redhat.com 
 issue 795343
it_file 694393
Comment 23 Issue Tracker 2010-05-24 01:18:28 EDT
Event posted on 05-24-2010 01:54pm JST by jnomura

Furuta-san,

With kexec-1.102pre-96.el5_5.2 and "crashkernel=128M@16M" boot option,
  - vmcore was saved to the local disk without error
  - crash can open the vmcore without error
on our 1TB-memory machine.
Attached is a short log.


Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

This event sent from IssueTracker by mfuruta@redhat.com 
 issue 795343
Comment 24 Masaki Furuta 2010-05-24 01:22:58 EDT
Hi,

(In reply to comment #20)
> Partners, 
> 
> Please grab the latest available bits here to test whether the new kdump can
> save vmcore to local disk.
> 
> http://people.redhat.com/qcai/kexec-tools/    

NEC has verified this on kexec-1.102pre-96.el5_5.2. I've forwarded their result from IT#795343; could you please check their last comment?

Thank you in advance.

Best Regards,
Masaki Furuta
Comment 25 Chris Ward 2010-05-24 03:53:01 EDT
Thank you Masaki-san.
Comment 30 errata-xmlrpc 2011-01-13 18:18:43 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0061.html
