Bug 696085

Summary: Regression: kexec-tools update results in invalid vmcore
Product: Red Hat Enterprise Linux 5
Reporter: Takuma Umeya <tumeya>
Component: kexec-tools
Assignee: Cong Wang <amwang>
Status: CLOSED DUPLICATE
QA Contact: Kernel Dump QE <kernel-dump-qe>
Severity: urgent
Priority: urgent
Version: 5.6
CC: cww, phan, qcai, rkhan
Target Milestone: rc
Keywords: Regression
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-05-06 01:55:33 UTC
Attachments:
Description        Flags
initial patch      none
Proposed patch     none

Description Takuma Umeya 2011-04-13 08:47:27 UTC
Description of problem:
The kexec-tools update released for bz682085 causes a kdump regression. Two symptoms have been observed:
1. Data below 640kB is invalid because it is stored at an incorrect address. As a result, the vmcore cannot be read by the crash command.
2. A partial dump may fail when makedumpfile is configured to exclude free pages.

Version-Release number of selected component (if applicable):
The issue appears after applying RHBA-2011-0382 (kexec-tools-1.102pre-126.el5_6.5).

How reproducible:
Always

Steps to Reproduce:
For 1: 
1. Configure /etc/kdump.conf with "ext3 /dev/sda6" and "core_collector makedumpfile -d 0"; leave /etc/sysconfig/kdump at its default settings.
2. Invoke kdump: echo c > /proc/sysrq-trigger
3. Open the resulting vmcore with crash.

For 2: 
1. Configure /etc/kdump.conf with "ext3 /dev/sda6" and "core_collector makedumpfile -d 31"; leave /etc/sysconfig/kdump at its default settings.
2. Invoke kdump: echo c > /proc/sysrq-trigger

Actual results:
For 1: 
zone_table, which is stored below 640kB, shows invalid data (all zeros).

For 2: 
The dump fails with the following error:
page_to_pfn: Can't convert the address of page descriptor (21eb6fefa7dfded5) to pfn.

Expected results:
The vmcore should be collected and should contain valid data.

Additional info:
The 0-640k region is used by the second kernel, so its original contents are copied to a backup area first. Because of the erratum, the reserved areas within this region are omitted from the copy.
For example, with the following memory map in the first kernel:
  0k   - 64k   reserved
  64k  - 638k  usable
  638k - 640k  reserved
only the 64k-638k range is copied, but the copy is treated as if it started at 0k instead of 64k. The data therefore ends up at the wrong offset and the vmcore is corrupted. The vendor has proposed an initial patch.

Comment 1 Takuma Umeya 2011-04-13 08:48:57 UTC
Created attachment 491685 [details]
initial patch

Comment 3 Qian Cai 2011-04-15 17:34:15 UTC
Takuma,

Please elaborate on the reproducer and the environment for this issue. We have not been able to reproduce it in-house so far.

CAI Qian

Comment 4 Takuma Umeya 2011-04-18 01:07:03 UTC
bz678308 reports that it has been reproduced. I am not sure why you are not hitting this. The key is that the first memory region must be marked as reserved in the first kernel for the issue to occur. If you can use the machine that was used in bz678303, it should reproduce the issue.

Comment 8 Cong Wang 2011-04-20 13:19:19 UTC
Takuma, could you try the patch here?
https://bugzilla.redhat.com/attachment.cgi?id=493217&action=diff
I think this is a duplicate of bug 696547.

Thanks.

Comment 12 Cong Wang 2011-04-22 07:56:53 UTC
Created attachment 494095 [details]
Proposed patch

(In reply to comment #11)
Takuma, please try the attached patch.

Comment 15 Takuma Umeya 2011-05-06 01:55:33 UTC

*** This bug has been marked as a duplicate of bug 696547 ***