Bug 1439170

Summary:	crash: "vmlinux and vmcore do not match!" with ELF vmcores generated with makedumpfile -E or scp
Product:	Red Hat Enterprise Linux 7	Reporter:	Emma Wu <xiawu>
Component:	crash	Assignee:	Dave Anderson <anderson>
Status:	CLOSED ERRATA	QA Contact:	Emma Wu <xiawu>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	7.4	CC:	bhe, dyoung, panand, qzhao, xiawu
Target Milestone:	rc	Keywords:	Reopened
Target Release:	---
Hardware:	x86_64
OS:	Unspecified
Whiteboard:
Fixed In Version:	crash-7.1.9-1.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-08-01 22:04:38 UTC	Type:	Bug

Comment 3 Dave Anderson 2017-04-05 13:06:16 UTC

Please (always) provide a pointer to the actual vmcore.

Comment 5 Dave Anderson 2017-04-05 15:06:52 UTC

Thanks!

In the meantime, I was able to reproduce the issue with: 

  kernel-3.10.0-640.el7.x86_64
  kexec-tools-2.0.14-4.el7.x86_64

As you mentioned, crash works OK with compressed kdumps, but fails with
ELF format dumpfiles.  I simplified my test to remove any filtering:

  core_collector makedumpfile -E
 
As it turns out, the problem is due to the crash utility's calculation of 
the kernel's "phys_base" value.

Something must have changed with the kernel's /proc/vmcore output, more
specifically the contents of the ELF PT_LOAD segments.  

I will update this bugzilla when I have more information.

Comment 6 Dave Anderson 2017-04-05 15:30:17 UTC

The problem also occurs with:

  core_collector scp

So we can take makedumpfile out of the picture entirely.

Given that it appears to be related to the kernel's creation of /proc/vmcore,
do you know the kernel version where this problem started happening?

Comment 8 Dave Anderson 2017-04-05 18:23:44 UTC

OK, thanks.  I saw that segmentation fault when using the installed version of kexec-tools.  When I upgraded to kexec-tools-2.0.14-4.el7, it fixed itself.

Anyway, upon further investigation, it is not an issue with the /proc/vmcore
PT_LOAD segments, but rather with recent KASLR-related kernel changes
related to KERNEL_IMAGE_SIZE, which affects the virtual memory address
space layout.  

The problem at hand is that the value of KERNEL_IMAGE_SIZE is not exported
by your 3.10.0-609.el7 kernel (or by my 3.10.0-640.el7 kernel).

If you run "crash vmlinux vmcore --machdep kernel_image_size=1g" it
should work OK. 
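
To illustrate why the image size matters, here is a minimal sketch of the address arithmetic. It is not the crash utility's code. It assumes the upstream x86_64 definitions (__START_KERNEL_map = 0xffffffff80000000, with the module region starting at __START_KERNEL_map + KERNEL_IMAGE_SIZE), an unrelocated text address of 0xffffffff81000000, and an illustrative 634MB KASLR relocation. The function names are hypothetical.

```python
# Sketch of the x86_64 kernel/module address split (assumed upstream layout).
START_KERNEL_MAP = 0xFFFFFFFF80000000   # assumed __START_KERNEL_map
DEFAULT_TEXT = 0xFFFFFFFF81000000       # assumed unrelocated _text address
MB = 1 << 20


def modules_vaddr(kernel_image_size):
    """Start of the module region: it begins where the kernel image window ends."""
    return START_KERNEL_MAP + kernel_image_size


def in_kernel_image(vaddr, kernel_image_size):
    """True if vaddr lies inside the kernel image window."""
    return START_KERNEL_MAP <= vaddr < modules_vaddr(kernel_image_size)


relocated_text = DEFAULT_TEXT + 634 * MB  # illustrative KASLR relocation

for size in (512 * MB, 1024 * MB):
    print(f"kernel_image_size={size // MB}MB "
          f"modules_vaddr={modules_vaddr(size):#x} "
          f"text_in_image={in_kernel_image(relocated_text, size)}")
```

Under these assumptions, a 512MB image size places the relocated text address in what would be treated as module space, while 1GB keeps it inside the kernel image window, which is consistent with the kernel_image_size=1g workaround.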

However, I see this recent rhel7 commit, which exports the values
of PHYS_BASE and KERNEL_IMAGE_SIZE:

commit 2a74f863738828916976c987e70e7d4c76099394
Author: Baoquan He <bhe>
Date:   Fri Mar 24 13:57:28 2017 -0400

    [kernel] kexec: export the value of phys_base instead of symbol address

... [ cut ] ...
 
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index e58b0f8..9075516 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -326,7 +326,7 @@ void machine_kexec(struct kimage *image)

 void arch_crash_save_vmcoreinfo(void)
 {
-       VMCOREINFO_SYMBOL(phys_base);
+       VMCOREINFO_NUMBER(phys_base);
        VMCOREINFO_SYMBOL(init_level4_pgt);

 #ifdef CONFIG_NUMA
@@ -335,6 +335,7 @@ void arch_crash_save_vmcoreinfo(void)
 #endif
        vmcoreinfo_append_str("KERNELOFFSET=%lx\n",
                              kaslr_offset());
+       VMCOREINFO_NUMBER(KERNEL_IMAGE_SIZE);
 }


and which will show up in kernel-3.10.0-641.el7:

  $ git describe --contains 2a74f863738828916976c987e70e7d4c76099394
  kernel-3.10.0-641.el7~10
  $ 

So with that kernel patch in place, the problem should go away.
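
As a rough illustration of what the patch changes for a consumer of the vmcoreinfo data, the sketch below parses a made-up excerpt. It assumes the upstream macro formats, where VMCOREINFO_SYMBOL emits "SYMBOL(name)=<hex address>" and VMCOREINFO_NUMBER emits "NUMBER(name)=<decimal value>". The sample values and the parse_vmcoreinfo helper are hypothetical, and this is not how the crash utility itself reads the note.

```python
import re

# Made-up VMCOREINFO excerpt; the values are illustrative only.
SAMPLE = """\
SYMBOL(init_level4_pgt)=ffffffffa9a0e000
NUMBER(phys_base)=56623104
KERNELOFFSET=27a00000
NUMBER(KERNEL_IMAGE_SIZE)=1073741824
"""

ENTRY = re.compile(r"^(?:(SYMBOL|NUMBER)\((\w+)\)|(\w+))=(\S+)$")


def parse_vmcoreinfo(text):
    """Return {'SYMBOL': {...}, 'NUMBER': {...}, 'OTHER': {...}} from note text."""
    info = {"SYMBOL": {}, "NUMBER": {}, "OTHER": {}}
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if not m:
            continue  # skip lines this sketch does not understand
        kind, name, plain, value = m.groups()
        if kind == "SYMBOL":
            info["SYMBOL"][name] = int(value, 16)   # symbol address, hex
        elif kind == "NUMBER":
            info["NUMBER"][name] = int(value, 10)   # value, signed decimal
        else:
            info["OTHER"][plain] = value            # e.g. KERNELOFFSET, kept raw
    return info


info = parse_vmcoreinfo(SAMPLE)
print(info["NUMBER"].get("phys_base"))
print(info["NUMBER"].get("KERNEL_IMAGE_SIZE"))
```

With the older SYMBOL(phys_base) form, only the address of the variable is available, so a reader of the note would still have to look up the value in the dump. With NUMBER(phys_base) the value itself is present.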

Comment 9 Dave Anderson 2017-04-05 18:44:34 UTC

> So with that kernel patch in place, the problem should go away.

With "core_collector scp", here is a 3.10.0-643.el7 kernel:

# crash /var/crash/127.0.0.1-2017-04-05-14:32:04/vmcore /usr/lib/debug/lib/modules/3.10.0-643.el7.x86_64/vmlinux
 
crash 7.1.8-2.el7
Copyright (C) 2002-2016  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [634MB]: patching 77552 gdb minimal_symbol values

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-643.el7.x86_64/vmlinux 
    DUMPFILE: /var/crash/127.0.0.1-2017-04-05-14:32:04/vmcore
        CPUS: 12
        DATE: Wed Apr  5 14:31:55 2017
      UPTIME: 00:04:34
LOAD AVERAGE: 0.29, 0.58, 0.30
       TASKS: 266
    NODENAME: hp-z400-02.ml3.eng.bos.redhat.com
     RELEASE: 3.10.0-643.el7.x86_64
     VERSION: #1 SMP Tue Apr 4 19:00:14 EDT 2017
     MACHINE: x86_64  (3066 Mhz)
      MEMORY: 4 GB
       PANIC: "SysRq : Trigger a crash"
         PID: 14248
     COMMAND: "bash"
        TASK: ffff91df76f78fb0  [THREAD_INFO: ffff91def3f98000]
         CPU: 2
       STATE: TASK_RUNNING (SYSRQ)

crash> 

Since the RHEL7 kernel code has been "in transition" with respect to KASLR,
I would prefer to close this bugzilla since it works with the more recent
kernels.

Do you all agree?

Comment 10 Dave Anderson 2017-04-05 20:41:49 UTC

> Do you all agree?

FWIW, I do have a patch that would fix the "interim KASLR kernel" problem,
but again, does it make sense to do it?

Comment 11 Baoquan He 2017-04-06 00:10:38 UTC

Hi Dave,

Looks good to me if we close it as WORKSFORME or CURRENTRELEASE. Of course, if you have an easy fix that doesn't bring maintenance confusion to the latest code, that's also good.

Thanks
Baoquan

Comment 12 Baoquan He 2017-04-06 00:13:21 UTC

Oops, I was only checking which resolution flag should be used to close the bug, and forgot to cancel it when submitting the comment. Sorry about that. I'll leave it to Dave to decide whether it should be closed or not.

Comment 13 Dave Anderson 2017-04-06 13:16:09 UTC

Although I don't believe that this bug could ever happen with an older
upstream kernel, I am going to commit a fix into the upstream github crash
repository.

Comment 14 Dave Anderson 2017-04-06 17:15:21 UTC

Patch pushed upstream:

https://github.com/crash-utility/crash/commit/eb1057eff00620d4519c60db8a3a88ecc6c92fea

Fix for the determination of the x86_64 "phys_base" value when it is
not passed in the VMCOREINFO data of ELF vmcores.  Without the patch,
it is possible that the base address of the vmalloc region is unknown
and initialized to an incorrect default address during the very early
stages of initialization, which causes the parsing of the PT_LOAD
segments for the START_KERNEL_map region to fail.
(anderson)
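
For readers unfamiliar with the calculation, the following is a simplified model of deriving phys_base from a PT_LOAD segment when VMCOREINFO does not supply it. It does not reproduce the upstream patch. It assumes __START_KERNEL_map = 0xffffffff80000000 and treats the kernel image as the window [__START_KERNEL_map, __START_KERNEL_map + kernel_image_size). The Segment type, calc_phys_base, and all addresses are hypothetical.

```python
from collections import namedtuple

START_KERNEL_MAP = 0xFFFFFFFF80000000  # assumed x86_64 __START_KERNEL_map

# Only the two ELF program-header fields this sketch needs.
Segment = namedtuple("Segment", "p_vaddr p_paddr")


def calc_phys_base(segments, kernel_image_size):
    """Derive phys_base from the PT_LOAD segment covering the kernel image.

    Returns None when no segment falls inside the assumed kernel window,
    which is the failure mode a wrong kernel_image_size would produce.
    """
    window_end = START_KERNEL_MAP + kernel_image_size
    for seg in segments:
        if START_KERNEL_MAP <= seg.p_vaddr < window_end:
            return seg.p_paddr - (seg.p_vaddr - START_KERNEL_MAP)
    return None


# Made-up segments: one direct-mapped region and one relocated kernel image.
segments = [
    Segment(p_vaddr=0xFFFF91DE80000000, p_paddr=0x0),
    Segment(p_vaddr=0xFFFFFFFFA8A00000, p_paddr=0x2C000000),
]

print(calc_phys_base(segments, 512 << 20))
print(calc_phys_base(segments, 1 << 30))
```

In this model, a 512MB window misses the relocated kernel segment and no phys_base can be derived, while a 1GB window finds it.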

Comment 18 errata-xmlrpc 2017-08-01 22:04:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2019