Bug 981582 - crash command can not read the dump-guest-memory file when paging=false [RHEL-7]
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm
Version: 7.0
Hardware: x86_64 Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Laszlo Ersek
QA Contact: Virtualization Bugs
Keywords: Reopened
Depends On:
Blocks: 990118
Reported: 2013-07-05 03:58 EDT by zhonglinzhang
Modified: 2014-06-17 23:30 EDT

See Also:
Fixed In Version: qemu-kvm-1.5.2-4.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 989585
Environment:
Last Closed: 2014-06-13 06:54:20 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
'readelf -W -a' output for comment 20 (76.32 KB, text/plain)
2013-07-25 17:01 EDT, Laszlo Ersek
proposed RHEL-6 patch: dump: clamp guest-provided mapping lengths to ramblock sizes (6.29 KB, patch)
2013-07-25 20:56 EDT, Laszlo Ersek
RHEL-7 / upstream version of the patch (6.17 KB, patch)
2013-07-25 21:09 EDT, Laszlo Ersek
[2/4] dump: introduce GuestPhysBlockList (4.84 KB, patch)
2013-07-27 17:49 EDT, Laszlo Ersek
[3/4] dump_init(): populate guest_phys_blocks (6.78 KB, patch)
2013-07-27 17:50 EDT, Laszlo Ersek
[4/4] dump: rebase from host-private RAMBlock offsets to guest-physical addresses (16.06 KB, patch)
2013-07-27 17:50 EDT, Laszlo Ersek
RHEL-7 patches 2/4 to 4/4 (mbox) (27.89 KB, application/mbox)
2013-07-28 08:00 EDT, Laszlo Ersek

Description zhonglinzhang 2013-07-05 03:58:46 EDT
Description of problem:
Boot a guest with QMP enabled and use the dump-guest-memory command with paging=false to create a crash dump file. The crash command then fails to read the resulting file.

Version-Release number of selected component (if applicable):
host and guest kernel:
3.10.0-0.rc7.64.el7.x86_64
qemu-kvm:
qemu-img-1.5.1-2.el7.x86_64
crash version:
crash-7.0.1-1.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. boot guest with QMP:
/usr/libexec/qemu-kvm -M q35 -cpu SandyBridge -enable-kvm -m 4G -smp 4,sockets=2,cores=2,threads=2 -name network-test -uuid 389d06a7-ed31-4fae-baf4-87bcb9b5596e -rtc base=utc,clock=host,driftfix=slew -k en-us -boot menu=on -device ioh3420,bus=pcie.0,id=root.0 -device x3130-upstream,bus=root.0,id=upstream -device xio3130-downstream,bus=upstream,id=downstream0,chassis=1 -drive file=/home/guest-rhel7.0-64.qcow3,if=none,id=drive-system-disk,media=disk,format=qcow2,aio=native,werror=stop,rerror=stop -device virtio-blk-pci,bus=downstream0,drive=drive-system-disk,id=system-disk,bootindex=1  -device xio3130-downstream,bus=upstream,id=downstream1,chassis=2      -device e1000,netdev=hostnet0,id=net0,bus=downstream1,mac=52:54:00:13:10:20 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup     -monitor stdio -vnc :1 -spice disable-ticketing,port=5931 -qmp tcp:0:5555,server,nowait

2. connect to QMP and create the crash dump file:
{"execute":"dump-guest-memory","arguments":{"paging": false,"protocol":"file:/tmp/guest-memory"}}

3. read the crash dump file by "crash"
crash /usr/lib/debug/lib/modules/3.10.0-0.rc7.64.el7.x86_64/vmlinux /tmp/guest-memory
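
Step 2 can also be scripted. The following is a minimal sketch, assuming the QMP TCP socket from step 1 (port 5555) is reachable and that QEMU follows the usual QMP handshake (greeting, then "qmp_capabilities", then commands). The helper names build_dump_command and qmp_dump are hypothetical and are not part of QEMU.

```python
import json
import socket


def build_dump_command(path, paging=False):
    """Return the dump-guest-memory QMP request as a JSON string."""
    return json.dumps({
        "execute": "dump-guest-memory",
        "arguments": {"paging": paging, "protocol": "file:" + path},
    })


def qmp_dump(host, port, path, paging=False):
    """Connect to a QMP TCP socket, negotiate capabilities, request the dump.

    Hypothetical helper: it assumes one JSON message per line on the socket.
    """
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rw", encoding="utf-8")
        json.loads(stream.readline())  # greeting: {"QMP": {...}}
        reply = None
        for request in ('{"execute": "qmp_capabilities"}',
                        build_dump_command(path, paging)):
            stream.write(request + "\n")
            stream.flush()
            reply = json.loads(stream.readline())
            # Skip asynchronous events until the command's own reply arrives.
            while "return" not in reply and "error" not in reply:
                reply = json.loads(stream.readline())
            if "error" in reply:
                raise RuntimeError(reply["error"])
        return reply
```

With the guest from step 1 running, qmp_dump("localhost", 5555, "/tmp/guest-memory") would issue the same request as the JSON line in step 2.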

Actual results:
crash cannot read the crash dump file and shows the following:
crash 7.0.1-1.el7
Copyright (C) 2002-2013  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

crash: read error: kernel virtual address: ffff88014fc0c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fc8c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fd0c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fd8c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fc11ae4  type: "tss_struct ist array"


Expected results:
crash can read the crash dump file successfully

Additional info:
gdb can read the crash dump file when paging=true.
gdb can not read the crash dump file when paging=false.
crash can not read the crash dump file when paging=true.
crash can not read the crash dump file when paging=false.
Comment 2 Hai Huang 2013-07-10 15:30:18 EDT
The results from the following comments:
  > gdb can read the crash dump file when paging=true.
  > gdb can not read the crash dump file when paging=false.
appear to be the expected behavior.

The failure case appears to be with:
  > crash can not read the crash dump file when paging=true.

According to the comments in qmp code: "The file can be processed 
with crash or gdb."
Comment 3 Laszlo Ersek 2013-07-22 17:39:29 EDT
Where does this use case come from? Does it work with RHEL-6 guests? I doubt
it.

The guest kernel uses paging, as in, when it is running the VCPU resolves
virtual addresses to pseudo-physical (=guest-physical) addresses using page
tables that are in guest memory, and TLBs that are (I assume) in VMCSs.
Without this mapping (or with a bogus mapping) it is a theoretical
impossibility for gdb or crash to resolve pointer values in the dump. In
other words, I think the use case is wrong.

When paging=false is passed to the HMP command, a simple identity mapping
seems to be generated with qemu_get_guest_simple_memory_mapping(). This is
maybe useful when dumping a guest that runs in real mode, eg. SeaBIOS or
DOS.
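
As a rough illustration of the point above (a toy model, not QEMU's code), consider a tool that resolves pointers purely through the virtual-to-physical mappings advertised in the dump. The Mapping tuple, the helper names and the single 4 GiB RAM block are assumptions made for this sketch; the kernel virtual address is taken from the read errors in comment 0.

```python
from collections import namedtuple

# One entry per advertised mapping: virtual start, physical start, length.
Mapping = namedtuple("Mapping", "virt phys length")


def identity_mappings(ram_blocks):
    """Model a paging=false style list: virtual address == physical address."""
    return [Mapping(start, start, length) for start, length in ram_blocks]


def resolve(mappings, vaddr):
    """Return the physical address for vaddr, or None if nothing covers it."""
    for m in mappings:
        if m.virt <= vaddr < m.virt + m.length:
            return m.phys + (vaddr - m.virt)
    return None


# Hypothetical RAM layout: a single 4 GiB block starting at address 0.
simple = identity_mappings([(0x0, 4 << 30)])

low = resolve(simple, 0x1000)                 # resolves to itself
high = resolve(simple, 0xffff88014fc0c700)    # kernel virtual address: no match
print(low, high)
```

Under an identity mapping, only addresses that happen to fall inside the RAM block ranges resolve; a kernel virtual address in the upper half of the address space finds no mapping at all.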

From the hmp-commands.hx file (dump-guest-memory documentation):

    paging: do paging to get guest's memory mapping

The "qapi-schema.json" file is more verbose:

# @paging: if true, do paging to get guest's memory mapping. This allows
#          using gdb to process the core file.
#
# IMPORTANT: this option can make QEMU allocate several gigabytes
#            of RAM. This can happen for a large guest, or a
#            malicious guest pretending to be large.
#
# Also, paging=true has the following limitations:
#
#    1. The guest may be in a catastrophic state or can have corrupted
#       memory, which cannot be trusted
#    2. The guest can be in real-mode even if paging is enabled. For
#       example, the guest uses ACPI to sleep, and ACPI sleep state
#       goes in real-mode


Adding Dave, Paolo and Gleb for sanity checking, but for now I'm closing
this as NOTABUG. Feel free to reopen if I'm wrong.
Comment 4 juzhang 2013-07-22 23:35:06 EDT
Hi Laszlo,

According to the additional info in comment 0, "crash can not read the crash dump file when paging=true." In addition, I checked "hmp-commands.hx" and found: "Dump guest memory to @var{protocol}. The file can be processed with crash or gdb". Based on these two points, QE reopened this bug. If this is a mistake, please correct me.
Comment 5 Sibiao Luo 2013-07-22 23:42:46 EDT
(In reply to Laszlo Ersek from comment #3)
> Where does this use case come from? Does it work with RHEL-6 guests? I doubt
> it.
Tried it on a RHEL 6.5 host:
The gdb tool can read the dump file with both paging=true and paging=false.
The crash tool can't read the dump file with paging=true or paging=false.
According to bug 832458#c10, crash can read the dump file with paging=true. Is this a regression?
IIRC, paging=true should be set for gdb and paging=false for crash. Could you help clarify this? I will modify our test case according to your comments.
> The guest kernel uses paging, as in, when it is running the VCPU resolves
> virtual addresses to pseudo-physical (=guest-physical) addresses using page
> tables that are in guest memory, and TLBs that are (I assume) in VMCSs.
> Without this mapping (or with a bogus mapping) it is a theoretical
> impossibility for gdb or crash to resolve pointer values in the dump. In
> other words, I think the use case is wrong.
> 
> When paging=false is passed to the HMP command, a simple identity mapping
> seems to be generated with qemu_get_guest_simple_memory_mapping(). This is
> maybe useful when dumping a guest that runs in real mode, eg. SeaBIOS or
> DOS.
> 
> From the hmp-commands.hx file (dump-guest-memory documentation):
> 
>     paging: do paging to get guest's memory mapping
> 
> The "qapi-schema.json" file is more verbose:
> 
> # @paging: if true, do paging to get guest's memory mapping. This allows
> #          using gdb to process the core file.
> #
> # IMPORTANT: this option can make QEMU allocate several gigabytes
> #            of RAM. This can happen for a large guest, or a
> #            malicious guest pretending to be large.
> #
> # Also, paging=true has the following limitations:
> #
> #    1. The guest may be in a catastrophic state or can have corrupted
> #       memory, which cannot be trusted
yes, this can be understood.
> #    2. The guest can be in real-mode even if paging is enabled. For
> #       example, the guest uses ACPI to sleep, and ACPI sleep state
> #       goes in real-mode
I can't understand what this means. Could you explain it in more detail? Thanks.
> 
> Adding Dave, Paolo and Gleb for sanity checking, but for now I'm closing
> this as NOTABUG. Feel free to reopen if I'm wrong.
Retested on a RHEL 7 host:
The crash tool can't read the dump file with paging=true or paging=false.
Comment 6 Laszlo Ersek 2013-07-23 08:15:07 EDT
Can you please fill in the bottom two rows of the following table (16 test
cases, "0" means "failure", "1" means "success"), summarizing your test
results?

  +------------------------------+---------------+---------------+
  | host (including kernel and   |               |               |
  | qemu-kvm used to produce     |     RHEL6     |     RHEL7     |
  | the dump)                    |               |               |
  +------------------------------+-------+-------+-------+-------+
  | guest (including kernel, and |       |       |       |       |
  | the crash & gdb utilities    | RHEL6 | RHEL7 | RHEL6 | RHEL7 |
  | used to open guest dump)     |       |       |       |       |
  +------------------------------+---+---+---+---+---+---+---+---+
  | "paging" argument of the     | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
  | "dump-guest-memory" command  |   |   |   |   |   |   |   |   |
  +------------------------------+---+---+---+---+---+---+---+---+
  | the "crash" utility          |   |   |   |   |   |   |   |   |
  | matching the guest RHEL      | ? | ? | ? | ? | ? | ? | ? | ? |
  | release can successfully     |   |   |   |   |   |   |   |   |
  | read the dump                |   |   |   |   |   |   |   |   |
  +------------------------------+---+---+---+---+---+---+---+---+
  | the "gdb" utility matching   |   |   |   |   |   |   |   |   |
  | the guest RHEL release can   | ? | ? | ? | ? | ? | ? | ? | ? |
  | successfully read the        |   |   |   |   |   |   |   |   |
  | dump                         |   |   |   |   |   |   |   |   |
  +------------------------------+---+---+---+---+---+---+---+---+

If you didn't test all 16 cases, please fill in those that you did test.
Thanks!
Comment 7 Dave Anderson 2013-07-23 10:24:17 EDT
Also, please save the vmcore files and make them available, as it's pretty 
much impossible to debug something like this without them. 

From the crash output above, it can be deduced that there have been a 
handful of successful reads from the dumpfile prior to the set of 
"current_task (per-cpu)" read failures.  However, it's not clear whether
those reads contained valid data.  If you at least were to post the output
of "crash -d8 vmlinux vmcore", then there would be some debug statements
that would be far more helpful than what you've given us.

This is the first I've heard of "paging=true/false".
Comment 8 Dave Anderson 2013-07-24 09:12:58 EDT
> gdb can read the crash dump file when paging=true.
> gdb can not read the crash dump file when paging=false.
> crash can not read the crash dump file when paging=true.
> crash can not read the crash dump file when paging=false.

I'm also curious as to how gdb was tested? 

I ask because you can pass completely unrelated vmlinux and vmcore
files to gdb, and it will still make it to the "(gdb)" prompt.  So
I'm wondering whether you actually looked at some kernel data?
Comment 9 Sibiao Luo 2013-07-24 22:20:25 EDT
(In reply to Laszlo Ersek from comment #6)
> Can you please fill in the bottom two rows of the following table (16 test
> cases, "0" means "failure", "1" means "success"), summarizing your test
> results?
  +------------------------------+---------------+---------------+
  | host (including kernel and   |               |               |
  | qemu-kvm used to produce     |     RHEL6     |     RHEL7     |
  | the dump)                    |               |               |
  +------------------------------+-------+-------+-------+-------+
  | guest (including kernel, and |       |       |       |       |
  | the crash & gdb utilities    | RHEL6 | RHEL7 | RHEL6 | RHEL7 |
  | used to open guest dump)     |       |       |       |       |
  +------------------------------+---+---+---+---+---+---+---+---+
  | "paging" argument of the     | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
  | "dump-guest-memory" command  |   |   |   |   |   |   |   |   |
  +------------------------------+---+---+---+---+---+---+---+---+
  | the "crash" utility          |   |   |   |   |   |   |   |   |
  | matching the guest RHEL      | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
  | release can successfully     |   |   |   |   |   |   |   |   |
  | read the dump                |   |   |   |   |   |   |   |   |
  +------------------------------+---+---+---+---+---+---+---+---+
  | the "gdb" utility matching   |   |   |   |   |   |   |   |   |
  | the guest RHEL release can   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
  | successfully read the        |   |   |   |   |   |   |   |   |
  | dump                         |   |   |   |   |   |   |   |   |
  +------------------------------+---+---+---+---+---+---+---+---+

host & guest info:
3.10.0-0.rc7.64.el7.x86_64
qemu-kvm-1.5.1-2.el7.x86_64
2.6.32-400.el6.x86_64
qemu-kvm-0.12.1.2-2.379.el6.x86_64

Best Regards,
sluo
Comment 10 Sibiao Luo 2013-07-24 22:24:13 EDT
(In reply to Dave Anderson from comment #8)
> > gdb can read the crash dump file when paging=true.
> > gdb can not read the crash dump file when paging=false.
> > crash can not read the crash dump file when paging=true.
> > crash can not read the crash dump file when paging=false.
> 
> I'm also curious as to how gdb was tested? 
# gdb /usr/lib/debug/lib/modules/`uname -r`/vmlinux /home/$dump-guest-memory-file
> I ask because you can pass completely unrelated vmlinux and vmcore
> files to gdb, and it will still make it to the "(gdb)" prompt.  So
> I'm wondering whether you actually looked at some kernel data?
e.g1:
# gdb /usr/lib/debug/lib/modules/`uname -r`/vmlinux /home/rhel7-true
#0  native_safe_halt ()
    at /usr/src/debug/kernel-3.10.0-0.rc7.64.el7/linux-3.10.0-0.rc7.64.el7.x86_64/arch/x86/include/asm/irqflags.h:50
50	}
(gdb) bt
#0  native_safe_halt ()
    at /usr/src/debug/kernel-3.10.0-0.rc7.64.el7/linux-3.10.0-0.rc7.64.el7.x86_64/arch/x86/include/asm/irqflags.h:50
#1  0xffffffff8101955f in arch_safe_halt ()
    at /usr/src/debug/kernel-3.10.0-0.rc7.64.el7/linux-3.10.0-0.rc7.64.el7.x86_64/arch/x86/include/asm/paravirt.h:111
#2  default_idle () at arch/x86/kernel/process.c:313
#3  0xffffffff81019e36 in arch_cpu_idle () at arch/x86/kernel/process.c:302
#4  0xffffffff810b044e in cpu_idle_loop () at kernel/cpu/idle.c:99
#5  cpu_startup_entry (state=state@entry=CPUHP_ONLINE) at kernel/cpu/idle.c:134
#6  0xffffffff815e1957 in rest_init () at init/main.c:389
#7  0xffffffff81a26ee9 in start_kernel () at init/main.c:641
#8  0xffffffff81a265dc in x86_64_start_reservations (
    real_mode_data=real_mode_data@entry=0x8b000 <Address 0x8b000 out of bounds>) at arch/x86/kernel/head64.c:193
#9  0xffffffff81a266d1 in x86_64_start_kernel (real_mode_data=0x8b000 <Address 0x8b000 out of bounds>)
    at arch/x86/kernel/head64.c:182
#10 0x0000000000000000 in ?? ()
(gdb)

e.g2:
# gdb /usr/lib/debug/lib/modules/`uname -r`/vmlinux /home/rhel7-false
#0  native_safe_halt ()
    at /usr/src/debug/kernel-3.10.0-0.rc7.64.el7/linux-3.10.0-0.rc7.64.el7.x86_64/arch/x86/include/asm/irqflags.h:50
50	}
(gdb) bt
#0  native_safe_halt ()
    at /usr/src/debug/kernel-3.10.0-0.rc7.64.el7/linux-3.10.0-0.rc7.64.el7.x86_64/arch/x86/include/asm/irqflags.h:50
Cannot access memory at address 0xffffffff818edec8
(gdb) 

Best Regards,
sluo
Comment 11 Laszlo Ersek 2013-07-25 04:24:57 EDT
Hello Sibiao,

thank you for the new info.

However I'm getting more and more confused by it. From comment 9:

(In reply to Sibiao Luo from comment #9)
> (In reply to Laszlo Ersek from comment #6)
> > Can you please fill in the bottom two rows of the following table (16
> > test cases, "0" means "failure", "1" means "success"), summarizing your
> > test results?
>   +------------------------------+---------------+---------------+
>   | host (including kernel and   |               |               |
>   | qemu-kvm used to produce     |     RHEL6     |     RHEL7     |
>   | the dump)                    |               |               |
>   +------------------------------+-------+-------+-------+-------+
>   | guest (including kernel, and |       |       |       |       |
>   | the crash & gdb utilities    | RHEL6 | RHEL7 | RHEL6 | RHEL7 |
>   | used to open guest dump)     |       |       |       |       |
>   +------------------------------+---+---+---+---+---+---+---+---+
>   | "paging" argument of the     | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
>   | "dump-guest-memory" command  |   |   |   |   |   |   |   |   |
>   +------------------------------+---+---+---+---+---+---+---+---+
>   | the "crash" utility          |   |   |   |   |   |   |   |   |
>   | matching the guest RHEL      | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
>   | release can successfully     |   |   |   |   |   |   |   |   |
>   | read the dump                |   |   |   |   |   |   |   |   |
>   +------------------------------+---+---+---+---+---+---+---+---+
>   | the "gdb" utility matching   |   |   |   |   |   |   |   |   |
>   | the guest RHEL release can   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
>   | successfully read the        |   |   |   |   |   |   |   |   |
>   | dump                         |   |   |   |   |   |   |   |   |
>   +------------------------------+---+---+---+---+---+---+---+---+

You've filled row #4 of the table invariably with zeros, and row #5
invariably with ones.

(a) This means that "crash" can *never* read the dump file, *independently*
of host qemu-kvm version (row #1), guest kernel version (row #2), and the
"paging" QMP command argument (row #3).

(b) It further means that "gdb" can *always* read the dump, again,
independently of the input variables (row #1 to row #3). Is that so?

Claim (b) would be actually very useful if it were true. However comment 10
makes me doubt it:

(In reply to Sibiao Luo from comment #10)
> e.g2:
> # gdb /usr/lib/debug/lib/modules/`uname -r`/vmlinux /home/rhel7-false
> #0  native_safe_halt ()
>     at
> /usr/src/debug/kernel-3.10.0-0.rc7.64.el7/linux-3.10.0-0.rc7.64.el7.x86_64/
> arch/x86/include/asm/irqflags.h:50
> 50	}
> (gdb) bt
> #0  native_safe_halt ()
>     at
> /usr/src/debug/kernel-3.10.0-0.rc7.64.el7/linux-3.10.0-0.rc7.64.el7.x86_64/
> arch/x86/include/asm/irqflags.h:50
> Cannot access memory at address 0xffffffff818edec8
> (gdb)

This implies that at least one cell in row #5 of the table should have been
zero, as gdb failed to read the dump file *in depth*. Specifically, the cell
for host=RHEL7/guest=RHEL7/paging=false:

+------------------------------+---------------+---------------+
| host (including kernel and   |               |               |
| qemu-kvm used to produce     |     RHEL6     |     RHEL7     |
| the dump)                    |               |               |
+------------------------------+-------+-------+-------+-------+
| guest (including kernel, and |       |       |       |       |
| the crash & gdb utilities    | RHEL6 | RHEL7 | RHEL6 | RHEL7 |
| used to open guest dump)     |       |       |       |       |
+------------------------------+---+---+---+---+---+---+---+---+
| "paging" argument of the     | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| "dump-guest-memory" command  |   |   |   |   |   |   |   |   |
+------------------------------+---+---+---+---+---+---+---+---+
| the "crash" utility          |   |   |   |   |   |   |   |   |
| matching the guest RHEL      |   |   |   |   |   |   |   |   |
| release can successfully     |   |   |   |   |   |   |   |   |
| read the dump                |   |   |   |   |   |   |   |   |
+------------------------------+---+---+---+---+---+---+---+---+
| the "gdb" utility matching   |   |   |   |   |   |   |   |   |
| the guest RHEL release can   |   |   |   |   |   |   | 0 |   | <-
| successfully read the        |   |   |   |   |   |   |   |   |
| dump                         |   |   |   |   |   |   |   |   |
+------------------------------+---+---+---+---+---+---+---+---+
                                                         ^
                                                         |
Comment 12 Sibiao Luo 2013-07-25 04:53:00 EDT
(In reply to Laszlo Ersek from comment #11)
> Hello Sibiao,
> 
> thank you for the new info.
> 
> However I'm getting more and more confused by it. From comment 9:
> 
> (In reply to Sibiao Luo from comment #9)
> > (In reply to Laszlo Ersek from comment #6)
> > > Can you please fill in the bottom two rows of the following table (16
> > > test cases, "0" means "failure", "1" means "success"), summarizing your
> > > test results?
> >   +------------------------------+---------------+---------------+
> >   | host (including kernel and   |               |               |
> >   | qemu-kvm used to produce     |     RHEL6     |     RHEL7     |
> >   | the dump)                    |               |               |
> >   +------------------------------+-------+-------+-------+-------+
> >   | guest (including kernel, and |       |       |       |       |
> >   | the crash & gdb utilities    | RHEL6 | RHEL7 | RHEL6 | RHEL7 |
> >   | used to open guest dump)     |       |       |       |       |
> >   +------------------------------+---+---+---+---+---+---+---+---+
> >   | "paging" argument of the     | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
> >   | "dump-guest-memory" command  |   |   |   |   |   |   |   |   |
> >   +------------------------------+---+---+---+---+---+---+---+---+
> >   | the "crash" utility          |   |   |   |   |   |   |   |   |
> >   | matching the guest RHEL      | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
> >   | release can successfully     |   |   |   |   |   |   |   |   |
> >   | read the dump                |   |   |   |   |   |   |   |   |
> >   +------------------------------+---+---+---+---+---+---+---+---+
> >   | the "gdb" utility matching   |   |   |   |   |   |   |   |   |
> >   | the guest RHEL release can   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
> >   | successfully read the        |   |   |   |   |   |   |   |   |
> >   | dump                         |   |   |   |   |   |   |   |   |
> >   +------------------------------+---+---+---+---+---+---+---+---+
> 
> You've filled row #4 of the table invariably with zeros, and row #5
> invariably with ones.
Yes, row #4 is all zeros. I may have made a mistake in row #5; I will correct it.
> (a) This means that "crash" can *never* read the dump file, *independently*
> of host qemu-kvm version (row #1), guest kernel version (row #2), and the
> "paging" QMP command argument (row #3).
The crash tool can't read the dump file at all.
> (b) It further means that "gdb" can *always* read the dump, again,
> independently of the input variables (row #1 to row #3). Is that so?
gdb can read the dump file when paging=true.
> Claim (b) would be actually very useful if it were true. However comment 10
> makes me doubt it:
> 
> (In reply to Sibiao Luo from comment #10)
> > e.g2:
> > # gdb /usr/lib/debug/lib/modules/`uname -r`/vmlinux /home/rhel7-false
> > #0  native_safe_halt ()
> >     at
> > /usr/src/debug/kernel-3.10.0-0.rc7.64.el7/linux-3.10.0-0.rc7.64.el7.x86_64/
> > arch/x86/include/asm/irqflags.h:50
> > 50	}
> > (gdb) bt
> > #0  native_safe_halt ()
> >     at
> > /usr/src/debug/kernel-3.10.0-0.rc7.64.el7/linux-3.10.0-0.rc7.64.el7.x86_64/
> > arch/x86/include/asm/irqflags.h:50
> > Cannot access memory at address 0xffffffff818edec8
Does this mean that gdb can't read the dump file? I thought it could not access memory because of paging=false. Maybe I made a mistake here; I will modify the table.
> > (gdb)
> 
> This implies that at least one cell in row #5 of the table should have been
> zero, as gdb failed to read the dump file *in depth*. Specifically, the cell
> for host=RHEL7/guest=RHEL7/paging=false:
> 
> +------------------------------+---------------+---------------+
> | host (including kernel and   |               |               |
> | qemu-kvm used to produce     |     RHEL6     |     RHEL7     |
> | the dump)                    |               |               |
> +------------------------------+-------+-------+-------+-------+
> | guest (including kernel, and |       |       |       |       |
> | the crash & gdb utilities    | RHEL6 | RHEL7 | RHEL6 | RHEL7 |
> | used to open guest dump)     |       |       |       |       |
> +------------------------------+---+---+---+---+---+---+---+---+
> | "paging" argument of the     | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
> | "dump-guest-memory" command  |   |   |   |   |   |   |   |   |
> +------------------------------+---+---+---+---+---+---+---+---+
> | the "crash" utility          |   |   |   |   |   |   |   |   |
> | matching the guest RHEL      |   |   |   |   |   |   |   |   |
> | release can successfully     |   |   |   |   |   |   |   |   |
> | read the dump                |   |   |   |   |   |   |   |   |
> +------------------------------+---+---+---+---+---+---+---+---+
> | the "gdb" utility matching   |   |   |   |   |   |   |   |   |
> | the guest RHEL release can   |   |   |   |   |   |   | 0 |   | <-
> | successfully read the        |   |   |   |   |   |   |   |   |
> | dump                         |   |   |   |   |   |   |   |   |
> +------------------------------+---+---+---+---+---+---+---+---+
>                                                          ^
>                                                          |
  +------------------------------+---------------+---------------+
  | host (including kernel and   |               |               |
  | qemu-kvm used to produce     |     RHEL6     |     RHEL7     |
  | the dump)                    |               |               |
  +------------------------------+-------+-------+-------+-------+
  | guest (including kernel, and |       |       |       |       |
  | the crash & gdb utilities    | RHEL6 | RHEL7 | RHEL6 | RHEL7 |
  | used to open guest dump)     |       |       |       |       |
  +------------------------------+---+---+---+---+---+---+---+---+
  | "paging" argument of the     | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
  | "dump-guest-memory" command  |   |   |   |   |   |   |   |   |
  +------------------------------+---+---+---+---+---+---+---+---+
  | the "crash" utility          |   |   |   |   |   |   |   |   |
  | matching the guest RHEL      | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
  | release can successfully     |   |   |   |   |   |   |   |   |
  | read the dump                |   |   |   |   |   |   |   |   |
  +------------------------------+---+---+---+---+---+---+---+---+
  | the "gdb" utility matching   |   |   |   |   |   |   |   |   |
  | the guest RHEL release can   | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
  | successfully read the        |   |   |   |   |   |   |   |   |
  | dump                         |   |   |   |   |   |   |   |   |
  +------------------------------+---+---+---+---+---+---+---+---+

Best Regards,
sluo
Comment 13 Laszlo Ersek 2013-07-25 05:03:36 EDT
Thanks! This update is enlightening. So, gdb can read the dump in-depth *iff* paging=true; no other variable matters. This is probably also the goal we should aim at with the crash tool.

Furthermore, regarding "crash", the problem is not a regression relative to RHEL-6, it seems to be a new feature for RHEL-6 as well.

We can now select one *column* in the table where "gdb" works, and look at what "crash" does with the corresponding vmcore.
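
The "iff" reading of the table can be checked mechanically. This is only a sketch: the two result rows are copied from the corrected table in comment 12, and the column order (host, then guest, then paging) is assumed to match the table's left-to-right layout.

```python
from itertools import product

hosts = guests = ("RHEL6", "RHEL7")
paging_values = (False, True)

# Column order as in the table: host varies slowest, paging fastest.
columns = list(product(hosts, guests, paging_values))

# Rows copied from comment 12 (1 = success, 0 = failure).
crash_ok = [0, 0, 0, 0, 0, 0, 0, 0]
gdb_ok = [0, 1, 0, 1, 0, 1, 0, 1]

# gdb succeeds exactly when paging=true, regardless of host and guest.
gdb_iff_paging = all(bool(ok) == paging
                     for (_, _, paging), ok in zip(columns, gdb_ok))

# crash never succeeds in any tested configuration.
crash_never = not any(crash_ok)

# Columns where gdb works: candidates for inspecting what crash does.
candidates = [col for col, ok in zip(columns, gdb_ok) if ok]
print(gdb_iff_paging, crash_never, candidates)
```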
Comment 14 Laszlo Ersek 2013-07-25 07:41:08 EDT
Dave,

now that I know what to target exactly, I'll try to get us a reproducer vmcore; no need to bog down QE. I'll probably look at RHEL-6 (host & guest) first because that's what I run on my laptop.

Thanks
Laszlo
Comment 15 Dave Anderson 2013-07-25 09:16:17 EDT
> now that I know what to target exactly, I'll try to get us a reproducer
> vmcore; no need to bog down QE.

OK thanks -- when you create one or more that fail, please give me a pointer
to them.

Although I have to say -- the QE guys that do kdump/crash testing have been
*trained* to always save the dumpfiles when there are failures like this.
I mean they *have* the dumpfile on hand -- why not just save it instead of
manually removing it?

Anyway, one thing that does stand out from the original report is this:

crash: read error: kernel virtual address: ffff88014fc0c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fc8c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fd0c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fd8c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fc11ae4  type: "tss_struct ist array"

Those are all unity-mapped addresses calculated for per-cpu data symbols
by taking their offset values and adding to the appropriate starting 
per-cpu base addresses found in the __per_cpu_offset[NR_CPUS] array.
The resultant physical addresses that would be searched for in the dumpfile
would be 14fc0c700, 14fc8c700, 14fd0c700, 14fd8c700 and 14fc11ae4.  All of
those physical addresses are above 5GB (140000000), and the read errors
above would be generated if those physical addresses were not advertised
in any of the PT_LOAD segments in the ELF header.
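
The arithmetic above can be sketched as follows. The PAGE_OFFSET value is an assumption inferred from the addresses quoted in this comment (the base of the x86_64 unity mapping in these kernels), and the single 0..4 GiB PT_LOAD range is purely hypothetical; the real segment list would have to come from the dump's ELF header.

```python
PAGE_OFFSET = 0xffff880000000000  # assumed base of the unity-mapped region
FIVE_GB = 0x140000000

# Virtual addresses from the read errors in the original report.
failed_vaddrs = [
    0xffff88014fc0c700,
    0xffff88014fc8c700,
    0xffff88014fd0c700,
    0xffff88014fd8c700,
    0xffff88014fc11ae4,
]


def unity_to_phys(vaddr):
    """Translate a unity-mapped kernel virtual address to a physical one."""
    return vaddr - PAGE_OFFSET


def covered(paddr, segments):
    """True if paddr falls inside any (start, size) PT_LOAD physical range."""
    return any(start <= paddr < start + size for start, size in segments)


# Hypothetical segment list: a dump that only advertises 0..4 GiB.
segments = [(0x0, 4 << 30)]

for vaddr in failed_vaddrs:
    paddr = unity_to_phys(vaddr)
    print(f"{vaddr:#x} -> {paddr:#x} "
          f"above_5GB={paddr > FIVE_GB} in_PT_LOAD={covered(paddr, segments)}")
```

Any physical address for which no segment matches would produce a read error of the kind shown above.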

But as I understand it, the guests are created with 4GB:

1. boot guest with QMP:
/usr/libexec/qemu-kvm -M q35 -cpu SandyBridge -enable-kvm -m 4G -smp 
...

With those arguments, would it be possible that the guest image would
be created with over 5GB of addressable physical address space?
Comment 16 Dave Anderson 2013-07-25 10:47:12 EDT
> With those arguments, would it be possible that the guest image would
> be created with over 5GB of addressable physical address space?

OK, so I created a 4GB RHEL6 guest on a RHEL7 host, and took a
dump like so:

 $ virsh dump --memory-only 2 /tmp/vmcore

When reading the dumpfile with the crash utility, it fails in a 
somewhat similar manner as reported by QE when it tries to access
high physical memory addresses.  Here it fails (but continues) after
failing to read address ffff8801190a6c04 (physical 1190a6c04), and 
subsequently fails (fatally) reading a page table address at physical
address 11bf11000:
  
  # crash vmlinux vmcore
  
  crash 7.0.1-1.el7
  Copyright (C) 2002-2013  Red Hat, Inc.
  Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
  Copyright (C) 1999-2006  Hewlett-Packard Co
  Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
  Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
  Copyright (C) 2005, 2011  NEC Corporation
  Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
  Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
  This program is free software, covered by the GNU General Public License,
  and you are welcome to change it and/or distribute copies of it under
  certain conditions.  Enter "help copying" to see the conditions.
  This program has absolutely no warranty.  Enter "help warranty" for details.
   
  GNU gdb (GDB) 7.6
  Copyright (C) 2013 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
  and "show warranty" for details.
  This GDB was configured as "x86_64-unknown-linux-gnu"...
  
  WARNING: failed to init kexec backup region
  please wait... (gathering kmem slab cache data)
  crash: read error: kernel virtual address: ffff8801190a6c04  type: "array cache limit"
  
  crash: unable to initialize kmem slab cache subsystem
  
  please wait... (gathering module symbol data)
  crash: read error: physical address: 11bf11000  type: "page table"
  #

Running crash live on the guest system shows that the 4GB guest
system has a 512MB memory hole, such that the highest physical 
page is at 11ffff000:
  
  crash> kmem -p | tail
  ffffea0003effdd0 11fff6000                0        0  1 40000000000000
  ffffea0003effe08 11fff7000                0        0  1 40000000000000
  ffffea0003effe40 11fff8000                0        0  1 40000000000000
  ffffea0003effe78 11fff9000                0        0  1 40000000000000
  ffffea0003effeb0 11fffa000                0        0  1 40000000000000
  ffffea0003effee8 11fffb000                0        0  1 40000000000000
  ffffea0003efff20 11fffc000                0        0  1 40000000000000
  ffffea0003efff58 11fffd000                0        0  1 40000000000000
  ffffea0003efff90 11fffe000                0        0  1 40000000000000
  ffffea0003efffc8 11ffff000                0        0  1 40000000000000
  crash>

The addresses of the two failing reads from the dumpfile are 
legitimate and accessible on the live system.  Here, the first failure 
at virtual ffff8801190a6c04 / physical 1190a6c04:
  
  crash> rd ffff8801190a6c04
  ffff8801190a6c04:  0000001b00000036                    6.......
  crash> rd -p 1190a6c04
         1190a6c04:  0000001b00000036                    6.......
  crash>

And the page table physical address is here:

  crash> rd -p 11bf11000
         11bf11000:  000000011ac6c163                    c.......
  crash> 

But those physical address locations are not accounted for in the 
vmcore file:
  
  # readelf -a vmcore
  ELF Header:
    Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
    Class:                             ELF64
    Data:                              2's complement, little endian
    Version:                           1 (current)
    OS/ABI:                            UNIX - System V
    ABI Version:                       0
    Type:                              CORE (Core file)
    Machine:                           Advanced Micro Devices X86-64
    Version:                           0x1
    Entry point address:               0x0
    Start of program headers:          64 (bytes into file)
    Start of section headers:          0 (bytes into file)
    Flags:                             0x0
    Size of this header:               64 (bytes)
    Size of program headers:           56 (bytes)
    Number of program headers:         9
    Size of section headers:           0 (bytes)
    Number of section headers:         0
    Section header string table index: 0
  
  There are no sections in this file.
  
  There are no sections to group in this file.
  
  Program Headers:
    Type           Offset             VirtAddr           PhysAddr
                   FileSiz            MemSiz              Flags  Align
    NOTE           0x0000000000000238 0x0000000000000000 0x0000000000000000
                   0x0000000000000ca0 0x0000000000000ca0         0
    LOAD           0x0000000000000ed8 0x0000000000000000 0x0000000000000000
                   0x0000000100000000 0x0000000100000000         0
    LOAD           0x0000000108040ed8 0x0000000000000000 0x0000000100000000
                   0x0000000000020000 0x0000000000020000         0
    LOAD           0x0000000108060ed8 0x0000000000000000 0x0000000100020000
                   0x0000000000020000 0x0000000000020000         0
    LOAD           0x0000000100000ed8 0x0000000000000000 0x0000000100040000
                   0x0000000004000000 0x0000000004000000         0
    LOAD           0x0000000108090ed8 0x0000000000000000 0x0000000104040000
                   0x0000000000002000 0x0000000000002000         0
    LOAD           0x0000000104000ed8 0x0000000000000000 0x0000000104042000
                   0x0000000004000000 0x0000000004000000         0
    LOAD           0x0000000108080ed8 0x0000000000000000 0x0000000108042000
                   0x0000000000010000 0x0000000000010000         0
    LOAD           0x0000000108000ed8 0x0000000000000000 0x0000000108052000
                   0x0000000000040000 0x0000000000040000         0
  
  There is no dynamic section in this file.
  
  There are no relocations in this file.
  
  The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported.
  
  No version information found in this file.
  
  Notes at offset 0x00000238 with length 0x00000ca0:
    Owner                 Data size	Description
    CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
    CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
    CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
    CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
    QEMU                 0x000001b0	Unknown note type: (0x00000000)
    QEMU                 0x000001b0	Unknown note type: (0x00000000)
    QEMU                 0x000001b0	Unknown note type: (0x00000000)
    QEMU                 0x000001b0	Unknown note type: (0x00000000)
  $  
    
Note that the highest physical memory region is a 256k region starting
at 0x0000000108052000, so the highest possible physical address that
is advertised in the ELF header would be 108092000.  So there is 
most definitely a truncation of physical memory in the dumpfile.

Dave
Comment 17 Dave Anderson 2013-07-25 11:02:47 EDT
I should also mention that looking at "crash -d8 vmlinux vmcore" debug
output shows that the successful reads return legitimate/correct data.
Problems only arise when attempting to read the truncated physical memory.
Comment 18 Dave Anderson 2013-07-25 14:07:46 EDT
And for sanity's sake, here's /proc/iomem on the guest:

$ cat /proc/iomem
00000000-00000fff : reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c8bff : Video ROM
000c9000-000c99ff : Adapter ROM
000ca000-000cc3ff : Adapter ROM
000f0000-000fffff : reserved
  000f0000-000fffff : System ROM
00100000-dfffdfff : System RAM
  01000000-01519194 : Kernel code
  01519195-01c0d8af : Kernel data
  01d55000-0201b0a3 : Kernel bss
  03000000-0affffff : Crash kernel
dfffe000-dfffffff : reserved
e0000000-febfffff : PCI Bus 0000:00
  f4000000-f7ffffff : 0000:00:02.0
  f8000000-fbffffff : 0000:00:02.0
  fc000000-fc03ffff : 0000:00:03.0
  fc040000-fc04ffff : 0000:00:02.0
  fc050000-fc053fff : 0000:00:04.0
    fc050000-fc053fff : ICH HD audio
  fc054000-fc055fff : 0000:00:02.0
  fc056000-fc056fff : 0000:00:03.0
    fc056000-fc056fff : virtio-pci
  fc057000-fc057fff : 0000:00:05.0
    fc057000-fc057fff : virtio-pci
  fc058000-fc058fff : 0000:00:06.0
    fc058000-fc058fff : virtio-pci
fec00000-fec003ff : IOAPIC 0
fee00000-fee00fff : Local APIC
feffc000-feffffff : reserved
fffc0000-ffffffff : reserved
100000000-11fffffff : System RAM
$
Comment 19 Laszlo Ersek 2013-07-25 14:25:10 EDT
The lack of the last PT_LOAD segment(s) in the ELF header is alarming. Those are written by the following (trimmed down) call tree, RHEL-7 qemu-kvm:

qmp_dump_guest_memory() [dump.c]
  dump_init()
    memory_mapping_list_init()
      qemu_get_guest_memory_mapping() -- with paging==true
        cpu_get_memory_mapping() [target-i386/arch_memory_mapping.c]
          ... walks page tables ...
  create_vmcore()
    dump_begin()
      write_elf_loads()
    dump_iterate()
      ...

static int write_elf_loads(DumpState *s)
{
    hwaddr offset;
    MemoryMapping *memory_mapping;
    uint32_t phdr_index = 1;
    int ret;
    uint32_t max_index;

    if (s->have_section) {
        max_index = s->sh_info;
    } else {
        max_index = s->phdr_num;
    }

    QTAILQ_FOREACH(memory_mapping, &s->list.head, next) {
        offset = get_offset(memory_mapping->phys_addr, s);
        if (s->dump_info.d_class == ELFCLASS64) {
            ret = write_elf64_load(s, memory_mapping, phdr_index++, offset);
        } else {
            ret = write_elf32_load(s, memory_mapping, phdr_index++, offset);
        }

        if (ret < 0) {
            return -1;
        }

        if (phdr_index >= max_index) {
            break;
        }
    }

    return 0;
}

So, this iterates over the list of mappings that was retrieved in dump_init() / cpu_get_memory_mapping(). The list fetched there could be too short, but the loop can also stop early if (phdr_index >= max_index). I'll have to debug into this, I'm unfamiliar with this code.
Comment 20 Laszlo Ersek 2013-07-25 16:59:58 EDT
Here's a partial result on a RHEL-6 host with a RHEL-6 guest: the "virsh
dump --memory-only" command in comment 16 actually passes paging=false. I
verified that this command produces an ELF header very similar to the one
listed in comment 16.

This libvirt behavior is due to commit

  http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=d239085e

Search the commitdiff for "b:paging".

The following "direct" monitor command produces a very different vmcore:

# virsh qemu-monitor-command DOMAIN --hmp \
  dump-guest-memory -p /tmp/vmcore.monitor.p

(Note the -p parameter.)

The vmcore written like this contains several hundred PT_LOAD headers; the
total number of program headers is 783. The last PT_LOAD looks like:

  Program Headers:
    Type           Offset             VirtAddr           PhysAddr
                   FileSiz            MemSiz              Flags  Align
  ...
    LOAD           0xffffffffffffffff 0xffffc90000003000 0x000000011fee0000
                   0x0000000000000000 0x0000000000020000         0

Going up to 4607MB.

However, crash (6.1.0-1.el6) rejects it just the same:

  WARNING: vmcore.monitor.p: may be truncated or incomplete
           PT_LOAD p_offset: 4311920304
                   p_filesz: 536870912
             bytes required: 4848791216
              dumpfile size: 4312182448

  ...

  please wait... (gathering kmem slab cache data)
  crash: read error: kernel virtual address: ffff88011c732d80  type:
  "kmem_cache buffer"

  crash: unable to initialize kmem slab cache subsystem

  please wait... (gathering module symbol data)
  crash: read error: physical address: 1190df000  type: "page table"

The offending PT_LOAD entry:

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x000000010102aeb0 0xffff880100000000 0x0000000100000000
                 0x0000000020000000 0x0000000020000000         0

Opening the same with gdb:

  $ gdb /usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux \
        vmcore.monitor.p

  BFD: Warning: /tmp/vmcore.monitor.p is truncated: expected core file size
  >= 4848791216, found: 4312182448.

The "bt" command does seem to work though.

... Interestingly, the huge number of small mappings / PT_LOAD entries after
the offending one fall *within* that range, and they all have:
- Offset = 0xffffffffffffffff
- FileSiz = 0x000000 

They look like some kind of sub-mappings that don't take up room in the
vmcore.

The vmcore belongs to kernel 2.6.32-358.el6.x86_64; I'll compress it and
upload it somewhere. I'll also attach the readelf -W -a output here.
Comment 21 Laszlo Ersek 2013-07-25 17:01:02 EDT
Created attachment 778539 [details]
'readelf -W -a' output for comment 20
Comment 23 Laszlo Ersek 2013-07-25 19:18:59 EDT
(ABI reference: http://refspecs.linuxbase.org/elf/gabi4+/ch5.pheader.html)

Okay I think I understand what's going on.

We're working with two lists here. The first list, a persistent list in qemu, is the list of RAMBlocks.

typedef struct RAMBlock {
    uint8_t *host;
    ram_addr_t offset;
    ram_addr_t length;
    uint32_t flags;
    char idstr[256];
    QLIST_ENTRY(RAMBlock) next;
#if defined(__linux__) && !defined(TARGET_S390X)
    int fd;
#endif
} RAMBlock;

This list describes the memory assigned to the guest. "offset" and "length" have guest-physical meaning; "host" is the virtual address inside the qemu process.

Then, at dump time, we collect another temporary list, namely the list of memory mappings. This comes from the page tables of the guest, and is untrusted.

typedef struct MemoryMapping {
    target_phys_addr_t phys_addr;
    target_ulong virt_addr;
    ram_addr_t length;
    QTAILQ_ENTRY(MemoryMapping) next;
} MemoryMapping;

The two important parts of the vmcore are the set of PT_LOAD entries, and the memory dump itself. The memory dump only contains the RAMBlocks. The RAMBlocks are distinct and there can be holes between them (in guest-physical address space). So, those gaps are not dumped.

The PT_LOAD entries are written in an "almost straightforward" way. (See write_elf64_load()):

p_type = PT_LOAD
p_offset = black magic <--- we'll return to this
p_paddr = MemoryMapping.phys_addr
p_filesz = p_memsz = MemoryMapping.length
p_vaddr = MemoryMapping.virt_addr

So, while the memory part comes from RAMBlocks, the PT_LOAD entries come from MemoryMappings. A PT_LOAD entry says (simplified):
- from the vmcore file, map the left-inclusive, right-exclusive range
  [p_offset, p_offset + p_filesz) at guest-virtual address p_vaddr,
  guest-physical address p_paddr, for a length of p_memsz.

If p_memsz > p_filesz, then fill the remaining bytes with zeros. p_memsz < p_filesz is forbidden. Anyway, in our case p_filesz = p_memsz, for each PT_LOAD entry.

qemu calculates the p_offset field with the tricky get_offset() function. It is called with "MemoryMapping.phys_addr", and looks up the one RAMBlock that contains it. It finds out the starting offset of the RAMBlock in the vmcore file (remember that gaps are not written, plus there are some headers at the beginning of the file). Then, p_offset in PT_LOAD is set to point into the containing RAMBlock in the file, at the correct RAMBlock-relative offset. All fine.

What is not enforced however is that "p_filesz" (coming straight from the guest pagetables, ie MemoryMapping.length) fits into the RAMBlock! The truncation related to the last PT_LOAD entry is not actual truncation -- the full last RAMBlock has been written by qemu, but the guest set up its page tables for *more memory* than qemu actually gave it.

So, the guest's idea of the memory size (p_filesz = p_memsz = MemoryMapping.length) overflows the RAMBlock size. This being the last RAMBlock dumped to the file, we overflow the file size too, and libbfd catches it. It could result in libbfd / crash ignoring the entire PT_LOAD entry, losing even that portion of the mapping that does fit into the last RAMBlock.
Comment 24 Laszlo Ersek 2013-07-25 20:56:10 EDT
Created attachment 778589 [details]
proposed RHEL-6 patch: dump: clamp guest-provided mapping lengths to ramblock sizes

I built this patch with "qemu-kvm-0.12.1.2-2.381.el6" (ie. RHEL-6 host),
with the following effects (same RHEL-6 guest as above):

(1) In the readelf output, the offending PT_LOAD entry has changed from

  Program Headers:
    Type           Offset             VirtAddr           PhysAddr
                   FileSiz            MemSiz              Flags  Align
  LOAD           0x000000010102b038 0xffff880100000000 0x0000000100000000
                 0x0000000020000000 0x0000000020000000         0

to

  LOAD           0x00000001010271e0 0xffff880100000000 0x0000000100000000
                 0x0000000000020000 0x0000000020000000         0

Notice how the p_filesz field sank from 0x0000000020000000 (512MB) to
0x0000000000020000 (128KB).

(2) The libbfd complaint in gdb is gone. I can get a 'bt'.

(3) "crash" still spews the following errors when starting up:

  please wait... (gathering kmem slab cache data)
  crash: invalid kernel virtual address: 0  type: "kmem_cache buffer"

  crash: unable to initialize kmem slab cache subsystem

  please wait... (gathering module symbol data)
  WARNING: cannot access vmalloc'd module memory

However it does not bail out, it gives me the info summary:

        KERNEL: /usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux
      DUMPFILE: vmcore.monitor.p
          CPUS: 1
          DATE: Fri Jul 26 02:41:27 2013
        UPTIME: 00:01:02
  LOAD AVERAGE: 0.03, 0.01, 0.00
         TASKS: 1
      NODENAME: seabios-rhel6
       RELEASE: 2.6.32-358.el6.x86_64
       VERSION: #1 SMP Tue Jan 29 11:47:41 EST 2013
       MACHINE: x86_64  (2659 Mhz)
        MEMORY: 4 GB
         PANIC: ""
           PID: 0
       COMMAND: "swapper"
          TASK: ffffffff81a8d020  [THREAD_INFO: ffffffff81a00000]
           CPU: 0
         STATE: TASK_RUNNING (PANIC)

It reacts to bt:

  crash> bt
  PID: 0      TASK: ffffffff81a8d020  CPU: 0   COMMAND: "swapper"
   #0 [ffffffff81a01ed0] default_idle at ffffffff8101495d
   #1 [ffffffff81a01ef0] cpu_idle at ffffffff81009fc6

And it can even display the swapper task:

  crash> task ffffffff81a8d020

  PID: 0      TASK: ffffffff81a8d020  CPU: 0   COMMAND: "swapper"
  struct task_struct {
  /* bunch of data */
  }

With regard to the leading crash warnings, the patch is probably not a full
solution, but I think it's a step in the right direction.
Comment 26 Laszlo Ersek 2013-07-25 21:09:46 EDT
Created attachment 778596 [details]
RHEL-7 / upstream version of the patch
Comment 28 Laszlo Ersek 2013-07-26 05:31:06 EDT
Dave,

before I post the patch upstream, I'd like to run its results by you.

Based on comment 16 you can generate dumps on a RHEL-7 host. Can you please grab the patched RHEL-7 brew build from comment 27 and test it (and the vmcore it writes) with the command

  # virsh qemu-monitor-command DOMAIN --hmp \
    dump-guest-memory -p /tmp/vmcore

(Sibiao, please feel free to join this test!)

Thank you both,
Laszlo
Comment 29 Paolo Bonzini 2013-07-26 08:58:46 EDT
I haven't yet digested the whole thread, but perhaps these notes help:

- gdb is only supposed to understand the paging-enabled dumps; that's their primary purpose but they're otherwise of limited utility.

- crash is supposed to understand the paging-disabled dumps, which are the most useful; I don't know about the paging-enabled dumps, but I wouldn't lose time on them.

Paging-enabled dumps are entirely useless for 32-bit guests using high memory, for example.

That said, I think Laszlo's patch is okay and comment 23 makes sense.  If paging=false doesn't work, that would be a separate (and more urgent :)) bug.
Comment 30 Dave Anderson 2013-07-26 09:35:38 EDT
> Based on comment 16 you can generate dumps on a RHEL-7 host. Can you
> please grab the patched RHEL-7 brew build from comment 27 and test it
> (and the vmcore it writes) with the command
>
>  # virsh qemu-monitor-command DOMAIN --hmp \
>    dump-guest-memory -p /tmp/vmcore

It still fails as before:
  
  # rpm -qa | grep qemu
  qemu-kvm-common-1.5.1-2.el7.bz981582_clamp.x86_64
  libvirt-daemon-driver-qemu-1.0.6-1.el7.x86_64
  qemu-img-1.5.1-2.el7.bz981582_clamp.x86_64
  ipxe-roms-qemu-20130517-1.gitc4bce43.el7.noarch
  qemu-kvm-1.5.1-2.el7.bz981582_clamp.x86_64
  #
  
Here's a RHEL6 guest:

  # crash vmlinux vmcore
  
  crash 7.0.1-1.el7
  Copyright (C) 2002-2013  Red Hat, Inc.
  Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
  Copyright (C) 1999-2006  Hewlett-Packard Co
  Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
  Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
  Copyright (C) 2005, 2011  NEC Corporation
  Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
  Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
  This program is free software, covered by the GNU General Public License,
  and you are welcome to change it and/or distribute copies of it under
  certain conditions.  Enter "help copying" to see the conditions.
  This program has absolutely no warranty.  Enter "help warranty" for details.
   
  WARNING: vmcore: may be truncated or incomplete
           PT_LOAD p_offset: 4429632408
                   p_filesz: 536870912
             bytes required: 4966503320
              dumpfile size: 4429968280
  
  GNU gdb (GDB) 7.6
  Copyright (C) 2013 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
  and "show warranty" for details.
  This GDB was configured as "x86_64-unknown-linux-gnu"...
  
  WARNING: failed to init kexec backup region
  please wait... (gathering kmem slab cache data)
  crash: read error: kernel virtual address: ffff880119148004  type: "array cache limit"
  
  crash: unable to initialize kmem slab cache subsystem
  
  please wait... (gathering module symbol data)
  crash: read error: physical address: 11906a000  type: "page table"
  # 

Which is bailing out because the "big" PT_LOAD segment is
advertising memory that's not there:

  LOAD           0x000000010806d398 0xffff880100000000 0x0000000100000000
                 0x0000000020000000 0x0000000020000000         0


And a RHEL7 guest results in a similar failure as the original report:

  # crash vmlinux7.gz vmcore
  
  crash 7.0.1-1.el7
  Copyright (C) 2002-2013  Red Hat, Inc.
  Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
  Copyright (C) 1999-2006  Hewlett-Packard Co
  Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
  Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
  Copyright (C) 2005, 2011  NEC Corporation
  Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
  Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
  This program is free software, covered by the GNU General Public License,
  and you are welcome to change it and/or distribute copies of it under
  certain conditions.  Enter "help copying" to see the conditions.
  This program has absolutely no warranty.  Enter "help warranty" for details.
   
  WARNING: vmcore: may be truncated or incomplete
           PT_LOAD p_offset: 4430330336
                   p_filesz: 536870912
             bytes required: 4967201248
              dumpfile size: 4430666208
  
  GNU gdb (GDB) 7.6
  Copyright (C) 2013 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
  and "show warranty" for details.
  This GDB was configured as "x86_64-unknown-linux-gnu"...
  
  WARNING: failed to init kexec backup region
  crash: read error: kernel virtual address: ffff88011fc0c700  type: "current_task (per_cpu)"
  crash: read error: kernel virtual address: ffff88011fc8c700  type: "current_task (per_cpu)"
  crash: read error: kernel virtual address: ffff88011fd0c700  type: "current_task (per_cpu)"
  crash: read error: kernel virtual address: ffff88011fd8c700  type: "current_task (per_cpu)"
  crash: read error: kernel virtual address: ffff88011fc11ce4  type: "tss_struct ist array"
  #

Again, failing due to the same segment:

  LOAD           0x00000001081179e0 0xffff880100000000 0x0000000100000000
                 0x0000000020000000 0x0000000020000000  


What I don't understand is -- this *used* to work correctly, right?
I mean, I have seen virsh dump --memory-only vmcores before that
were perfectly fine.
Comment 31 Laszlo Ersek 2013-07-26 09:48:23 EDT
(In reply to Paolo Bonzini from comment #29)
> I haven't yet digested the whole thread, but perhaps these notes help:
> 
> - gdb is only supposed to understand the paging-enabled dumps; that's their
> primary purpose but they're otherwise of limited utility.
> 
> - crash is supposed to understand the paging-disabled dumps, which are the
> most useful; I don't know about the paging-enabled dumps, but I wouldn't
> lose time on them.
> 
> Paging-enabled dumps are entirely useless for 32-bit guests using high
> memory, for example.
> 
> That said, I think Laszlo's patch is okay and comment 23 makes sense.  If
> paging=false doesn't work, that would be a separate (and more urgent :)) bug.

Thanks for the comment!

Maybe I should retest the paging-disabled (ie. more worthwhile) dump with my patch applied too.

I haven't tested that at all yet, because I've been under the opposite impression regarding the utility of paging-enabled vs. -disabled dumps -- in comment 13 I started focusing on the paging-enabled dumps because that's what gdb seemed to support, and I assumed we would want to reach gdb's level of support with crash.

So now that you're saying crash's killer feature is in fact understanding paging-disabled dumps, that's a new direction. In any case, my patch clamps the p_filesz fields in the PT_LOAD entries independently of how the MemoryMapping list was prepared. From comment 19:

> qmp_dump_guest_memory() [dump.c]
>   dump_init()
>     memory_mapping_list_init()
>     qemu_get_guest_memory_mapping() -- with paging==true
>       cpu_get_memory_mapping() [target-i386/arch_memory_mapping.c]
>         ... walks page tables ...
>   create_vmcore()
>     dump_begin()
>       write_elf_loads()
>     dump_iterate()
>       ...

The patch affects write_elf_loads(), while paging=true vs. paging=false affects memory_mapping_list_init(). The patch affects how the list of MemoryMappings is turned into p_filesz fields, against the list of RAMBlocks; it should be independent of how the list of MemoryMappings is produced (which depends on paging=false vs. paging=true, qemu_get_guest_memory_mapping() vs. qemu_get_guest_simple_memory_mapping()):

qmp_dump_guest_memory() [dump.c]
  dump_init()
    memory_mapping_list_init()
    qemu_get_guest_simple_memory_mapping() -- with paging==false
      ... creates one identity MemoryMapping for each RAMBlock ...
  create_vmcore()
    dump_begin()
      write_elf_loads()
    dump_iterate()
      ...

Actually, due to the simple mapping theoretically covering the RAMBlocks 1-to-1, my patch should have no visible effect on the paging-disabled dump (no clamping should be needed in that case, ever).

However in all our tests until now, the paging-disabled vmcore, as interpreted by "crash", *was* reported truncated (see comment 16 -- that dump *had* disabled paging). So testing my patch with a paging-disabled dump might be worth a shot.
Comment 32 Laszlo Ersek 2013-07-26 09:53:57 EDT
... Hm, right. I retested the patched RHEL-6 host with the RHEL-6 guest, paging=false, and it fails:

  please wait... (gathering kmem slab cache data)
  crash: read error: kernel virtual address: ffff8801185a2d80  type: "kmem_cache
  buffer"

  crash: unable to initialize kmem slab cache subsystem

  please wait... (gathering module symbol data)
  crash: read error: physical address: 1195f8000  type: "page table"

Maybe the PT_LOAD entries are correct, and we have a genuine bug in dump_iterate()... Or a bug while merging adjacent MemoryMappings...
Comment 33 Dave Anderson 2013-07-26 10:19:00 EDT
> I haven't tested that at all yet, because I've been under the opposite
> impression regarding the utility of paging-enabled vs. -disabled dumps
> -- in comment 13 I started focusing on the paging-enabled dumps because
> that's what gdb seemed to support, and I assumed we would want to reach
> gdb's level of support with crash.

The crash utility does not need, or want, to "reach gdb's level of support". 

The hundreds of extra PT_LOAD segments for all of the individual vmalloc
and module addresses were apparently added because gdb cannot access their
virtual memory without them.

The crash utility translates all virtual addresses to physical first,
and then looks in the ELF header for a segment containing that physical
address range.  So while all of the paging-enabled stuff is completely 
useless for crash, it has apparently broken the original scheme
of the virsh dump --memory-only implementation, which was to clone
the simple manner in which kdump creates ELF vmcore files, which doesn't
get involved with the creation of individual vmalloc/module
address regions.
Comment 34 Laszlo Ersek 2013-07-26 16:21:34 EDT
I tested a bunch of guest RAM sizes. As long as I stay <= 3584 (0xE00) MB,
the vmcore (paging disabled) works *perfectly* with the "crash" utility.

As soon as I go above, even just with 1 MB (--> 3585 (0xE01) MB),

(a) "crash" starts complaining about virtual addresses. If I go much higher,
it sometimes doesn't even start. If I go just a bit higher, it complains but
starts. If I then use "vtop" to check out the vaddr given in the complaint,
the gpa is invariably above 4GB.

In addition (continuing with the "just one meg over 3.5 GB" scenario):

(b) the guest dmesg starts to differ like this:

    --- mem-3584-ok 2013-07-26 21:56:12.071219676 +0200
    +++ mem-3585-fail       2013-07-26 21:56:15.927219672 +0200
    @@ -14,12 +14,13 @@
      BIOS-e820: 0000000000100000 - 00000000dfffd000 (usable)
      BIOS-e820: 00000000dfffd000 - 00000000e0000000 (reserved)
      BIOS-e820: 00000000fffbc000 - 0000000100000000 (reserved)
    + BIOS-e820: 0000000100000000 - 0000000100100000 (usable)
     DMI 2.4 present.
     SMBIOS version 2.4 @ 0xFDA20
     DMI: Red Hat KVM, BIOS 0.5.1 01/01/2007
     e820 update range: 0000000000000000 - 0000000000001000 (usable) ==>
         (reserved)
     e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
    -last_pfn = 0xdfffd max_arch_pfn = 0x400000000
    +last_pfn = 0x100100 max_arch_pfn = 0x400000000
     MTRR default type: write-back
     MTRR fixed ranges enabled:
       00000-9FFFF write-back
    @@ -35,28 +36,32 @@
       6 disabled
       7 disabled
     PAT not supported by CPU.
    +last_pfn = 0xdfffd max_arch_pfn = 0x400000000
     initial memory mapped : 0 - 20000000
     init_memory_mapping: 0000000000000000-00000000dfffd000
      0000000000 - 00dfe00000 page 2M
      00dfe00000 - 00dfffd000 page 4k
     kernel direct mapping tables up to dfffd000 @ 8000-e000
    +init_memory_mapping: 0000000100000000-0000000100100000
    + 0100000000 - 0100100000 page 4k
    +kernel direct mapping tables up to 100100000 @ c000-13000

    (snip)

That is, the guest kernel remaps the memory in the 512MB hole (the PCI
hole?) over 4GB. However qemu's dump feature doesn't know anything about
this.

(c) Namely, the structure of the vmcore doesn't change at all when crossing
the 3.5 GB barrier. The exact same PT_LOAD entries are there, and qemu has
the exact same number and structure of RAMBlocks internally. When crossing
the barrier, only the size of the "main" RAMBlock, and that of the
respective PT_LOAD entry, grow by 1MB (into the PCI hole range), but that's
it.

So, this is something that the guest sets up, and what KVM is certainly
aware of (otherwise crash could not work well when run inside / on the live
guest), but what the RAMBlock list, which is the basis of the dump feature,
doesn't reflect at all. When "crash" resolves a gva to a gpa above 4GB, that
gpa exists in the live guest, but not in the vmcore (and not because of
truncation: there isn't a PT_LOAD entry that would cover the gpa >= 4GB at
all.)

Note that when this feature was backported to RHEL-6 qemu-kvm, it was almost
certainly not tested with guests > 3.5GB. For example, QE tested the feature
with a 2 GB guest (see bug 832458 comment 10).

Maybe the main PT_LOAD entry, once it grows into the PCI hole, could be
split: the RAMBlock in the file could remain contiguous, and the p_offset /
p_filesz fields in the entry intact, but the paddr base should reflect the
remapping to above 4GB. This should be somehow derived from the "dynamic"
memory view (which unfortunately underwent huge changes after RHEL-6, with
the MemoryRegion stuff).
Comment 35 Laszlo Ersek 2013-07-26 17:07:15 EDT
... I guess RHEL-7 / upstream could register a MemoryListener with memory_listener_register(), for address space "address_space_memory", and key the dumping process off that.

No idea what we could do in RHEL-6.
Comment 36 Laszlo Ersek 2013-07-27 07:01:22 EDT
In RHEL-6, pc_init1() [hw/pc.c] handles the setup with variables like
"below_4g_mem_size", "above_4g_mem_size", and the following code:

    if (ram_size >= 0xe0000000 ) {
        above_4g_mem_size = ram_size - 0xe0000000;
        below_4g_mem_size = 0xe0000000;
    } else {
        below_4g_mem_size = ram_size;
    }

...

    ram_addr = qemu_ram_alloc(NULL, "pc.ram",
                              below_4g_mem_size + above_4g_mem_size);
    cpu_register_physical_memory(0, 0xa0000, ram_addr);
    cpu_register_physical_memory(0x100000,
                 below_4g_mem_size - 0x100000,
                 ram_addr + 0x100000);
#if TARGET_PHYS_ADDR_BITS > 32
    if (above_4g_mem_size > 0) {
        cpu_register_physical_memory(0x100000000ULL, above_4g_mem_size,
                                     ram_addr + below_4g_mem_size);
    }
#endif

The qemu_ram_alloc() call establishes the RAMBlock that has the correct
size.

The cpu_register_physical_memory(start_addr, size, phys_offset) calls
register the memory under the correct physical addresses. (This is RHEL-6
specific, upstream has moved to "AddressSpace"s and "MemoryRegion"s.)

  offset
  relative
  to ram_addr RAMBlock                  visible
            0 +-------------------+.....+-------------------+ 0
              |         ^         |     |        ^          |
              |       640 KB      |     |      640 KB       |
              |         v         |     |        v          |
  0x0000a0000 +-------------------+.....+-------------------+ 0x0000a0000
              |         ^         |     |XXXXXXXXXXXXXXXXXXX|
              |       384 KB      |     |XXXXXXXXXXXXXXXXXXX|
              |         v         |     |XXXXXXXXXXXXXXXXXXX|
  0x000100000 +-------------------+.....+-------------------+ 0x000100000
              |         ^         |     |        ^          |
              |       3583 MB     |     |      3583 MB      |
              |         v         |     |        v          |
  0x0e0000000 +-------------------+.....+-------------------+ 0x0e0000000
              |         ^         |.    |XXXXXXXXXXXXXXXXXXX|
              | above_4g_mem_size | .   |XXXX PCI hole XXXXX|
              |         v         |  .  |XXXX          XXXXX|
     ram_size +-------------------+   . |XXXX  512 MB  XXXXX|
                                   .   .|XXXXXXXXXXXXXXXXXXX|
                                    .   +-------------------+ 0x100000000
                                     .  |         ^         |
                                      . | above_4g_mem_size |
                                       .|         v         |
                                        +-------------------+ ram_size
                                                              + 512 MB

The dump logic should write the PT_LOAD.p_paddr fields using the RHS, and
produce split (ie. two) entries if a RAMBlock straddles the PCI hole.

Unfortunately, the RHS is target- (ie. pc-) specific, but dump is general.
So we can't base the splitting logic on magic constants like 0x0e0000000,
nor solely on RAMBlocks. dump must be aware of the memory map.
Comment 37 Laszlo Ersek 2013-07-27 08:19:10 EDT
The RHEL-6 backport of the dump feature:

   1  dd196ef Add API to create memory mapping list
   2  b019795 exec: add cpu_physical_memory_is_io()
   3  41cb867 target-i386: cpu.h: add CPUArchState
   4  809cee8 implement cpu_get_memory_mapping()
   5  d2a2ac0 Add API to check whether paging mode is enabled
   6  4619317 Add API to get memory mapping
   7  2309994 Add API to get memory mapping without do paging
   8  864ba0c target-i386: Add API to write elf notes to core file
   9  ad05166 target-i386: Add API to write cpu status to core file
  10  11a5de5 target-i386: add API to get dump info
  11  ec9dfe5 target-i386: Add API to get note's size
  12  bb67c75 make gdb_id() generally avialable and rename it to cpu_index()
  13  d34159e hmp.h: include qdict.h
  14  d5fdc32 monitor: allow qapi and old hmp to share the same dispatch
              table
  15  b2d4fb1 introduce a new monitor command 'dump-guest-memory' to dump
              guest's memory
  16  0b6f8cf qmp: dump-guest-memory: improve schema doc
  17  3eea632 qmp: dump-guest-memory: improve schema doc (again)
  18  4d09e74 qmp: dump-guest-memory: don't spin if non-blocking fd would
              block
  19  89e133c hmp: dump-guest-memory: hardcode protocol argument to "file:"

introduced six references to "ram_list.blocks":
- get_offset(): 1
- get_start_block(): 2
- qemu_get_guest_memory_mapping(): 1
- qemu_get_guest_simple_memory_mapping(): 1
- cpu_get_dump_info(): 1

My idea is to rebase these functions to an ad-hoc list that conveys the same
information, but is based on the RHS of comment 36.

Such an ad-hoc list should be possible to produce by registering a
CPUPhysMemoryClient, and immediately removing it too. Immediately at
registration, the current memory map is given to the client through the
CPUPhysMemoryClient.set_memory() callback.

  cpu_register_phys_memory_client()
    phys_page_for_each()
      phys_page_for_each_in_l1_map()
        client->set_memory()
  cpu_unregister_phys_memory_client()

In the callback we should build this list.
Comment 38 Laszlo Ersek 2013-07-27 17:44:57 EDT
The following RHEL-6 host patches resolve the problem for me (note, this is very different from what should be written for RHEL-7 / upstream as the memory API has changed greatly). Of course the series may not be idiomatic for RHEL-6 either...

The patches apply on top of the one in comment 24.

Tested with a 4GB RHEL-6 guest, in two scenarios:

(1) paging=true, for "gdb", with

    # virsh qemu-monitor-command DOMAIN --hmp \
          dump-guest-memory -p /tmp/vmcore-p

(2) paging=false, for "crash", with

    # virsh dump DOMAIN /tmp/vmcore --memory-only

From the 2nd case, these are the PT_LOAD entries:

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NOTE           0x00000000000001c8 0x0000000000000000 0x0000000000000000
                 0x0000000000000328 0x0000000000000328         0
  LOAD           0x00000000200204f0 0x0000000000000000 0x0000000000000000
                 0x00000000000a0000 0x00000000000a0000         0
  LOAD           0x00000000200c04f0 0x0000000000000000 0x00000000000c0000
                 0x0000000000020000 0x0000000000020000         0
  LOAD           0x00000000000004f0 0x0000000000000000 0x00000000000e0000
                 0x0000000000020000 0x0000000000020000         0
  LOAD           0x00000000200e04f0 0x0000000000000000 0x0000000000100000
                 0x00000000dff00000 0x00000000dff00000         0
  LOAD           0x00000000fffe04f0 0x0000000000000000 0x00000000f0000000
                 0x0000000001000000 0x0000000001000000         0
  LOAD           0x00000000000204f0 0x0000000000000000 0x0000000100000000
                 0x0000000020000000 0x0000000020000000         0
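
As a rough illustration of how a reader of the vmcore uses these entries, the sketch below looks up a guest-physical address in the six LOAD entries listed above and returns the corresponding file offset. The struct and function names are hypothetical (only the p_offset / p_paddr / p_filesz field names follow the ELF program header), and this is not how "crash" itself is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of a PT_LOAD program header. */
struct load_seg {
    uint64_t p_offset;  /* position of the segment data in the file */
    uint64_t p_paddr;   /* guest-physical start address */
    uint64_t p_filesz;  /* number of bytes stored in the file */
};

/* The LOAD entries from the readelf output above, in listed order. */
static const struct load_seg vmcore_segs[] = {
    { 0x200204f0ULL, 0x000000000ULL, 0x000a0000ULL },
    { 0x200c04f0ULL, 0x0000c0000ULL, 0x00020000ULL },
    { 0x000004f0ULL, 0x0000e0000ULL, 0x00020000ULL },
    { 0x200e04f0ULL, 0x000100000ULL, 0xdff00000ULL },
    { 0xfffe04f0ULL, 0x0f0000000ULL, 0x01000000ULL },
    { 0x000204f0ULL, 0x100000000ULL, 0x20000000ULL },
};

/* Returns true and stores the file offset if some segment covers paddr. */
static bool paddr_to_file_offset(const struct load_seg *segs, size_t n,
                                 uint64_t paddr, uint64_t *off)
{
    assert(segs != NULL && off != NULL);
    for (size_t i = 0; i < n; i++) {
        if (paddr >= segs[i].p_paddr &&
            paddr - segs[i].p_paddr < segs[i].p_filesz) {
            *off = segs[i].p_offset + (paddr - segs[i].p_paddr);
            return true;
        }
    }
    return false;   /* no PT_LOAD entry covers this address */
}
```

With this table, a gpa of 0x100000000 is found in the last entry, whereas 0xe0000000 (the start of the PCI hole) is covered by no entry.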
Comment 39 Laszlo Ersek 2013-07-27 17:49:50 EDT
Created attachment 779222 [details]
[2/4] dump: introduce GuestPhysBlockList


The vmcore must use physical addresses that are visible to the guest, not
addresses that point into linear RAMBlocks. As first step, introduce the
list type into which we'll collect the physical mappings in effect at the
time of the dump.
---
 dump.c           |   31 +++++++++++++++++++------------
 memory_mapping.c |   18 ++++++++++++++++++
 memory_mapping.h |   22 ++++++++++++++++++++++
 3 files changed, 59 insertions(+), 12 deletions(-)
Comment 40 Laszlo Ersek 2013-07-27 17:50:08 EDT
Created attachment 779223 [details]
[3/4] dump_init(): populate guest_phys_blocks


While the machine is paused, in guest_phys_blocks_append() we register a
one-shot CPUPhysMemoryClient, solely for the initial collection of the
valid guest-physical memory ranges that happens at client registration
time.

For each range that is reported to guest_phys_blocks_set_memory(), we
attempt to merge the range with adjacent (preceding, subsequent, or both)
ranges. We use two hash tables for this purpose, both indexing the same
ranges, just by different keys (guest-phys-start vs. guest-phys-end).

Ranges can only be joined if they are contiguous in both guest-physical
address space, and contiguous in host-private RAMBlock offset space.

The "maximal" ranges that remain in the end constitute the guest-physical
memory map that the dump will be based on.
---
 dump.c           |    2 +-
 memory_mapping.c |  132 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 memory_mapping.h |    1 +
 3 files changed, 134 insertions(+), 1 deletions(-)
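
The joining rule can be sketched as follows. This is a simplified stand-in, not the patch: instead of the two hash tables it assumes that ranges are reported in ascending guest-physical order, so a new range only ever needs to be compared with the most recently stored one. The type and function names and the fixed-size array are hypothetical; the field names follow the ones used in comment 41.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS 16   /* arbitrary limit for the sketch */

struct phys_block {
    uint64_t target_start;  /* guest-physical start (inclusive) */
    uint64_t target_end;    /* guest-physical end (exclusive) */
    uint64_t ram_addr;      /* host-private RAMBlock offset of the start */
};

struct phys_block_list {
    struct phys_block blk[MAX_BLOCKS];
    size_t num;
};

/*
 * Add one reported range. It is merged into the last stored block only
 * if it continues that block both in guest-physical address space and
 * in RAMBlock offset space. Returns 0 on success, -1 if the list is full.
 */
static int block_list_add(struct phys_block_list *list, uint64_t start,
                          uint64_t size, uint64_t ram_addr)
{
    assert(list != NULL);
    if (size == 0) {
        return 0;
    }
    if (list->num > 0) {
        struct phys_block *last = &list->blk[list->num - 1];
        uint64_t last_len = last->target_end - last->target_start;

        if (last->target_end == start &&
            last->ram_addr + last_len == ram_addr) {
            last->target_end = start + size;    /* extend in place */
            return 0;
        }
    }
    if (list->num == MAX_BLOCKS) {
        return -1;
    }
    list->blk[list->num++] =
        (struct phys_block){ start, start + size, ram_addr };
    return 0;
}
```

Page-sized ranges that are contiguous on both sides collapse into one block, while a jump in either guest-physical address or RAMBlock offset starts a new block.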
Comment 41 Laszlo Ersek 2013-07-27 17:50:32 EDT
Created attachment 779224 [details]
[4/4] dump: rebase from host-private RAMBlock offsets to guest-physical addresses


RAMBlock.offset                   --> GuestPhysBlock.target_start
RAMBlock.offset + RAMBlock.length --> GuestPhysBlock.target_end
RAMBlock.length                   --> GuestPhysBlock.target_end -
                                      GuestPhysBlock.target_start

"GuestPhysBlock.ram_addr" is only used to get the host virtual address,
when dumping guest memory.

This patch should enable "crash" to work with the vmcore by rebasing the
vmcore from the left side of the following diagram to the right side:

host-private
offset
relative
to ram_addr   RAMBlock                  guest-visible paddrs
            0 +-------------------+.....+-------------------+ 0
              |         ^         |     |        ^          |
              |       640 KB      |     |      640 KB       |
              |         v         |     |        v          |
  0x0000a0000 +-------------------+.....+-------------------+ 0x0000a0000
              |         ^         |     |XXXXXXXXXXXXXXXXXXX|
              |       384 KB      |     |XXXXXXXXXXXXXXXXXXX|
              |         v         |     |XXXXXXXXXXXXXXXXXXX|
  0x000100000 +-------------------+.....+-------------------+ 0x000100000
              |         ^         |     |        ^          |
              |       3583 MB     |     |      3583 MB      |
              |         v         |     |        v          |
  0x0e0000000 +-------------------+.....+-------------------+ 0x0e0000000
              |         ^         |.    |XXXXXXXXXXXXXXXXXXX|
              | above_4g_mem_size | .   |XXXX PCI hole XXXXX|
              |         v         |  .  |XXXX          XXXXX|
     ram_size +-------------------+   . |XXXX  512 MB  XXXXX|
                                   .   .|XXXXXXXXXXXXXXXXXXX|
                                    .   +-------------------+ 0x100000000
                                     .  |         ^         |
                                      . | above_4g_mem_size |
                                       .|         v         |
                                        +-------------------+ ram_size
                                                              + 512 MB
---
 cpu-all.h               |    8 +++-
 dump.c                  |  103 ++++++++++++++++++++++++++++------------------
 memory_mapping.c        |   21 +++++----
 memory_mapping.h        |   10 +++-
 target-i386/arch_dump.c |    9 ++--
 5 files changed, 93 insertions(+), 58 deletions(-)
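
A minimal sketch of what the rebasing means for the paging=false program headers, under stated assumptions: one PT_LOAD entry per guest-physical block, p_paddr taken from target_start, the length taken from target_end - target_start, and segment data laid out in the file in list order starting at a caller-supplied offset. The types and the function are hypothetical and much simpler than the real dump.c code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in; field names follow the mapping listed above. */
struct guest_phys_block {
    uint64_t target_start;  /* guest-physical start (inclusive) */
    uint64_t target_end;    /* guest-physical end (exclusive) */
};

/* Hypothetical, trimmed-down PT_LOAD program header. */
struct load_hdr {
    uint64_t p_offset, p_vaddr, p_paddr, p_filesz, p_memsz;
};

/*
 * Fill out[0..n-1] from blk[0..n-1]. Returns the file offset just past
 * the data of the last segment.
 */
static uint64_t blocks_to_load_hdrs(const struct guest_phys_block *blk,
                                    size_t n, uint64_t data_offset,
                                    struct load_hdr *out)
{
    uint64_t off = data_offset;

    assert(blk != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t len = blk[i].target_end - blk[i].target_start;

        out[i] = (struct load_hdr){
            .p_offset = off,
            .p_vaddr  = 0,                   /* no paging: vaddr unused */
            .p_paddr  = blk[i].target_start, /* guest-visible address */
            .p_filesz = len,
            .p_memsz  = len,
        };
        off += len;
    }
    return off;
}
```

For a 4GB pc guest, the blocks [0x100000, 0xe0000000) and [0x100000000, 0x120000000) then yield two entries whose PhysAddr values sit on either side of the PCI hole, as in the right side of the diagram.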
Comment 43 Laszlo Ersek 2013-07-28 04:29:14 EDT
Restoring summary because paging=false is the main use case for "crash", I've learned (comment 29, comment 33).
Comment 44 Laszlo Ersek 2013-07-28 08:00:20 EDT
Created attachment 779324 [details]
RHEL-7 patches 2/4 to 4/4 (mbox)

I've finished forward-porting the RHEL-6 patches to RHEL-7 (esp. the new memory API). These three apply on top of the patch in comment 26.

An interesting characteristic of the new RAM handling is the number and continuity of ranges reported to the memory listener callback function. When dumping a 4GB guest in RHEL-6, the guest_phys_blocks_set_memory() callback added in patch 3 (comment 40) is called more than 900,000 times; patch 3 merges those ranges into <10 ranges.

Using a same-sized guest, the corresponding RHEL-7 callback is called with about 6-10 (much bigger) contiguous ranges, which can be merged down to 4-5.

Another interesting difference is the RAM representation itself. write_memory() in RHEL-6 (patch 4, comment 41) must not assume contiguous ranges in host virtual address space, even though the RAMBlock offsets are contiguous for an individual block. write_memory() is much easier to update in RHEL-7, because hva-continuity is ensured (or, at least, is ensured *more obviously*) already by the new MemoryRegion internals and the matching callback interface. (I looked at "hw/virtio/dataplane/hostmem.c" for an example memory listener, and the MemoryRegionSection documentation in "include/exec/memory.h".)

Since the new memory API is more expressive, I think I got lucky with the RHEL6->RHEL7 porting direction; a backport would have been harder.
Comment 49 Laszlo Ersek 2013-07-29 10:36:26 EDT
I tested the RHEL-7 build too and ported the patches forward to upstream (which I also tested of course):

http://thread.gmane.org/gmane.comp.emulators.qemu/225360
Comment 50 Dave Anderson 2013-07-29 11:04:31 EDT
I had no luck with re-using the old images so I'm doing a re-installation
of the guests.

The RHEL6 vmcore looks good.

The RHEL7 guest is still being installed.
Comment 51 Dave Anderson 2013-07-29 11:28:31 EDT
...and RHEL7 looks good too.

Nice work!
Comment 52 Laszlo Ersek 2013-07-29 11:42:29 EDT
Thanks for your help, Dave! :)
Comment 53 Laszlo Ersek 2013-08-05 04:19:11 EDT
Refreshed the upstream series:
http://thread.gmane.org/gmane.comp.emulators.qemu/226378
Comment 54 Laszlo Ersek 2013-08-06 06:37:05 EDT
Upstream v3:
http://thread.gmane.org/gmane.comp.emulators.qemu/226715
Comment 55 Laszlo Ersek 2013-08-08 09:01:43 EDT
I thought that maybe it would be useful to cherry-pick the v3 upstream patches (comment 54) cleanly to RHEL-7. Boy was I in for a rude awakening.

RHEL-7 is forked off 1.5.2, whereas the v3 series I posted for upstream is based on 1.6-rc1. So, basically, there's an entire minor release between them.

Now that I'm giving up on this approach, this is the list of prerequisite commits:

  88f62c2 dump: Move stubs into libqemustub.a
  444d559 cpu: Turn cpu_paging_enabled() into a CPUState hook
  6d4d3ae memory_mapping: Move MemoryMappingList typedef to qemu/typedefs.h
  a23bbfd cpu: Turn cpu_get_memory_mapping() into a CPUState hook
  1b3509c dump: Abstract dump_init() with cpu_synchronize_all_states()
  11ed09c memory_mapping: Improve qemu_get_guest_memory_mapping() error
          reporting
  7581766 dump: qmp_dump_guest_memory(): use error_setg_file_open()
  dd1750d kvm: Change kvm_cpu_synchronize_state() argument to CPUState
  cb446ec kvm: Change cpu_synchronize_state() argument to CPUState
  60a3e17 cpu: Change cpu_exit() argument to CPUState
  a98ae1d cpus: Change cpu_thread_is_idle() argument to CPUState
  fd529e8 cpus: Change qemu_kvm_wait_io_event() argument to CPUState
  491d6e8 kvm: Change kvm_set_signal_mask() argument to CPUState
  13618e0 cpus: Change qemu_kvm_init_cpu_signals() argument to CPUState
  878096e cpu: Turn cpu_dump_{state,statistics}() into CPUState hooks
  1458c36 kvm: Change kvm_cpu_exec() argument to CPUState
  64f6b34 gdbstub: Set gdb_set_stop_cpu() argument to CPUState
  9132504 cpus: Change cpu_handle_guest_debug() argument to CPUState
  48a106b cpus: Change qemu_kvm_start_vcpu() argument to CPUState
  10a9021 cpus: Change qemu_dummy_start_vcpu() argument to CPUState
  c643bed cpu: Change qemu_init_vcpu() argument to CPUState
  215e79c KVM: Don't assume that mpstate exists with in-kernel PIC always
  4917cf4 cpu: Replace cpu_single_env with CPUState current_cpu
  182735e cpu: Make first_cpu and next_cpu CPUState
  369ff01 target-i386: Don't overuse CPUArchState

And this *still* doesn't apply somewhere in the middle.

I started to look for any prerequisites with:

  git log --oneline --reverse c72bf468.. -- \
      dump.c \
      include/sysemu/dump.h \
      include/sysemu/memory_mapping.h \
      memory_mapping.c \
      dump-stub.c \
      target-i386/arch_dump.c

because we have "c72bf468". The idea was to apply these prereqs, then apply the v3 dump fix series. Unfortunately, the prereqs ran into conflicts between themselves, so I had to dig deeper and find more commits that would bridge the conflicts for specific files. The list kept growing and growing, and the above (still not applying) list is where I'm giving up. So, RHEL-7 will be a manual retrofit too.
Comment 56 Laszlo Ersek 2013-08-08 12:23:49 EDT
Actually, 11ed09c is reachable quite OK and eliminates almost all conflicts in the upstream v3 -> RHEL-7 backport. The remaining small conflicts would be fixed by backporting 182735e, which (together with its dependencies) would be insanely intrusive, so I've patched up those small conflicts manually.
Comment 57 Laszlo Ersek 2013-08-12 10:26:49 EDT
upstream commit hashes:

1  2cac260 dump: clamp guest-provided mapping lengths to ramblock sizes
2  5ee163e dump: introduce GuestPhysBlockList
3  c5d7f60 dump: populate guest_phys_blocks
4  56c4bfb dump: rebase from host-private RAMBlock offsets to guest-physical
           addresses
Comment 60 Miroslav Rezanina 2013-08-20 04:46:18 EDT
Fix included in qemu-kvm-1.5.2-4.el7
Comment 62 Qian Guo 2014-01-24 03:07:54 EST
Reproduced by qemu-kvm-1.5.1-2.el7.x86_64

Steps:
1.Boot guest:
# /usr/libexec/qemu-kvm -M q35 -cpu SandyBridge -enable-kvm -m 4G -smp 4,sockets=2,cores=2,threads=2 -name network-test -rtc base=utc,clock=host,driftfix=slew -k en-us -boot menu=on -device ioh3420,bus=pcie.0,id=root.0 -device x3130-upstream,bus=root.0,id=upstream -device xio3130-downstream,bus=upstream,id=downstream0,chassis=1 -drive file=/home/rhel7_64cp1.qcow2_v3,if=none,id=drive-system-disk,media=disk,format=qcow2,aio=native,werror=stop,rerror=stop -device virtio-blk-pci,bus=downstream0,drive=drive-system-disk,id=system-disk,bootindex=1 -device xio3130-downstream,bus=upstream,id=downstream1,chassis=2 -device virtio-net-pci,netdev=hostnet0,id=net0,bus=downstream1,mac=52:54:00:13:10:20 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -monitor stdio -spice disable-ticketing,port=5931 -qmp tcp:0:5555,server,nowait -vga qxl

2.Connect qmp session
# telnet 127.0.0.1 5555
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 1, "minor": 5, "major": 1}, "package": " (qemu-kvm-1.5.1-2.el7)"}, "capabilities": []}}
{"execute":"qmp_capabilities"}
{"return": {}}

3.create crash dump file
{"execute":"dump-guest-memory","arguments":{"paging": false,"protocol":"file:/home/guest-memory"}}
{"timestamp": {"seconds": 1390548357, "microseconds": 664440}, "event": "STOP"}
{"timestamp": {"seconds": 1390548415, "microseconds": 820675}, "event": "RESUME"}
{"return": {}}

4.In host, try to debug the dump file:
# crash /usr/lib/debug/lib/modules/3.10.0-0.rc7.64.el7.x86_64/vmlinux guest-memory 

crash 6.1.6-1.el7
Copyright (C) 2002-2013  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: failed to init kexec backup region
crash: read error: kernel virtual address: ffff88014fc0c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fc8c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fd0c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fd8c700  type: "current_task (per_cpu)"
crash: read error: kernel virtual address: ffff88014fc11ae4  type: "tss_struct ist array"


So according to the above, this bug is reproduced.

Verify this bug with qemu-kvm-1.5.3-40.el7.x86_64

Steps as reproducer, but the result is this:
# crash /usr/lib/debug/lib/modules/3.10.0-78.el7.x86_64/vmlinux  guest-memory 

crash 7.0.2-2.el7
Copyright (C) 2002-2013  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-78.el7.x86_64/vmlinux
    DUMPFILE: guest-memory
        CPUS: 4
        DATE: Fri Jan 24 16:03:56 2014
      UPTIME: 00:04:29
LOAD AVERAGE: 0.09, 0.41, 0.22
       TASKS: 254
    NODENAME: dhcp-66-82-210.nay.redhat.com
     RELEASE: 3.10.0-78.el7.x86_64
     VERSION: #1 SMP Tue Jan 21 17:56:28 EST 2014
     MACHINE: x86_64  (2825 Mhz)
      MEMORY: 4 GB
       PANIC: ""
         PID: 0
     COMMAND: "swapper/0"
        TASK: ffffffff818b1440  (1 of 4)  [THREAD_INFO: ffffffff8189e000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0      TASK: ffffffff818b1440  CPU: 0   COMMAND: "swapper/0"
 #0 [ffffffff8189fe70] __schedule at ffffffff815c373d
 #1 [ffffffff8189feb8] default_idle at ffffffff8101aecf
 #2 [ffffffff8189fed8] arch_cpu_idle at ffffffff8101b796
 #3 [ffffffff8189fee8] cpu_startup_entry at ffffffff810acc35
 #4 [ffffffff8189ff40] rest_init at ffffffff815a18c7
 #5 [ffffffff8189ff50] start_kernel at ffffffff819e3f3d
 #6 [ffffffff8189ff90] x86_64_start_reservations at ffffffff819e35de
 #7 [ffffffff8189ffa0] x86_64_start_kernel at ffffffff819e371e


So the dump file is debugged successfully; according to the above, this bug is fixed by qemu-kvm-1.5.3-40.el7.x86_64.
Comment 65 Ludek Smid 2014-06-13 06:54:20 EDT
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.
