1730691 – Script <dump-guest-memory.py> executes failed on RHEL8 host.

Summary: Script <dump-guest-memory.py> executes failed on RHEL8 host.

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux Advanced Virtualization
Classification:	Red Hat
Component:	qemu-kvm
Sub Component:
Version:	8.1
Hardware:	x86_64
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Virtualization Maintenance
QA Contact:	Michael
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-07-17 11:24 UTC by Michael
Modified:	2020-01-07 07:24 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-07-19 14:36:11 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Description Michael 2019-07-17 11:24:59 UTC

Description of problem:

As the subject says, the script <dump-guest-memory.py> does not work on a RHEL8 host. *HOWEVER*, it works on a RHEL7 host.



Version-Release number of selected component (if applicable):
kernel 4.18.0-114.el8.x86_64
kernel-debuginfo-4.18.0-114.el8.x86_64
kernel-debuginfo-common-x86_64-4.18.0-114.el8.x86_64
qemu-kvm-4.0.0-5.module+el8.1.0+3622+5812d9bf.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Use gdb to start a qemu process.
# gdb /usr/libexec/qemu-kvm

2. Execute the script.
(gdb) source /usr/share/qemu-kvm/dump-guest-memory.py


Actual results:

Traceback (most recent call last):
  File "/usr/share/qemu-kvm/dump-guest-memory.py", line 21, in <module>
    UINTPTR_T = gdb.lookup_type("uintptr_t")
gdb.error: No type named uintptr_t.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/qemu-kvm/dump-guest-memory.py", line 23, in <module>
    raise gdb.GdbError("Symbols must be loaded prior to sourcing dump-guest-memory.\n"
gdb.GdbError: Symbols must be loaded prior to sourcing dump-guest-memory.
Symbols may be loaded by 'attach'ing a QEMU process id or by 'load'ing a QEMU binary.



Expected results:
The script runs successfully and the dump-guest-memory command is created in gdb.


Additional info:
This issue only happens on a RHEL8 host. RHEL7 does not have it.

Comment 1 Michael 2019-07-17 11:48:54 UTC

Hi:

The output of the 'bt' command in gdb is also not correct.

[1] QEMU crashes while running a guest and dumps core.
boot.sh: line 14: 12096 Segmentation fault      (core dumped) /usr/libexec/qemu-kvm ... ...

[2] Open the core file in gdb and execute the bt command.
# gdb core.12096

GNU gdb (GDB) Red Hat Enterprise Linux 8.2-6.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
... ...
(gdb)
(gdb) bt
#0  0x00007f5fc09da2d6 in ?? ()
#1  0xffffffffeeccf7b0 in ?? ()
#2  0x00005645d826ffc0 in ?? ()
#3  0x0000000000000009 in ?? ()
#4  0x0000000000000000 in ?? ()


The expected result should be:

...
(gdb) bt
#0 0x00007fac35b4a263 in select () from /lib64/libc.so.6
#1 0x00007fac381358d0 in main_loop_wait (timeout=1000) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:3973
#2 0x00007fac3815791a in kvm_main_loop () at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2244
#3 0x00007fac381385ea in main_loop (argc=29, argv=<value optimized out>, envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4192
#4 main (argc=29, argv=<value optimized out>, envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:6528


Is the result correct?


Thanks

Comment 2 Andrew Jones 2019-07-17 13:14:22 UTC

It appears the qemu binary is getting stripped. However I see '--disable-strip' in our configure line in the spec file, so I don't know why. Mirek?

Comment 3 Miroslav Rezanina 2019-07-17 13:39:32 UTC

(In reply to Andrew Jones from comment #2)
> It appears the qemu binary is getting stripped. However I see
> '--disable-strip' in our configure line in the spec file, so I don't know
> why. Mirek?

No stripping should be done for binaries. Checking the build log shows no stripping.

Comment 4 Laszlo Ersek 2019-07-17 15:04:08 UTC

Is this perhaps related to the qemu-kvm-debuginfo / qemu-kvm-debugsource split we have in RHEL8?

In the past (under RHEL7), I believe we only required installing the debuginfo package, for "bt" to work. It's possible now (in RHEL8) we also need "debugsource", for the same. IOW, perhaps the virt-QE test plan should be updated.

(I have no background info on what the debuginfo/debugsource split is all about.)

Comment 5 Laszlo Ersek 2019-07-17 19:35:45 UTC

https://fedoraproject.org/wiki/Changes/SubpackageAndSourceDebuginfo

Comment 6 Laszlo Ersek 2019-07-17 19:39:57 UTC

The verification of bug 1421595 ran into a similar problem initially, and there the issue was solved by installing the debugsource package, in addition to the debuginfo package. Please check & confirm whether that solves the present issue.
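
A minimal sketch of the check suggested above, assuming the package names follow the qemu-kvm-debuginfo / qemu-kvm-debugsource split mentioned in comment 4. The helper names (debug_package_names, rpm_is_installed, missing_packages) are hypothetical and exist only for illustration; `rpm -q` is used because it exits non-zero for a package that is not installed.

```python
import subprocess


def debug_package_names(base="qemu-kvm"):
    """Return the two debug package names discussed in this bug."""
    return [base + "-debuginfo", base + "-debugsource"]


def rpm_is_installed(name):
    """True if `rpm -q NAME` succeeds; False if it fails or rpm cannot be run."""
    try:
        result = subprocess.run(
            ["rpm", "-q", name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # rpm is not available on this machine, so nothing can be confirmed.
        return False
    return result.returncode == 0


def missing_packages(names, is_installed=rpm_is_installed):
    """Return the names that the probe reports as not installed."""
    return [name for name in names if not is_installed(name)]


if __name__ == "__main__":
    missing = missing_packages(debug_package_names())
    if missing:
        print("install before running gdb:", " ".join(missing))
    else:
        print("debuginfo and debugsource are both installed")
```

The probe is passed in as a parameter so the logic can be exercised on a machine without rpm; on a real host the default probe is enough.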

Comment 7 Michael 2019-07-18 03:30:42 UTC

(In reply to Laszlo Ersek from comment #6)
> The verification of bug 1421595 ran into a similar problem initially, and
> there the issue was solved by installing the debugsource package, in
> addition to the debuginfo package. Please check & confirm whether that
> solves the present issue.

Thanks Laszlo:

I have tried it again. The problem has been solved. Please check the result. 

(gdb) source /usr/share/qemu-kvm/dump-guest-memory.py 
(gdb) 
(gdb) set height 0
(gdb) dump-guest-memory /tmp/vmcore X86_64
guest RAM blocks:
target_start     target_end       host_addr        message count
---------------- ---------------- ---------------- ------- -----
0000000000000000 00000000000a0000 00007f5dd7e00000 added       1
00000000000c0000 00000000000ca000 00007f5dd7ec0000 added       2
00000000000ca000 00000000000cd000 00007f5dd7eca000 joined      2
00000000000cd000 00000000000e8000 00007f5dd7ecd000 joined      2
00000000000e8000 00000000000f0000 00007f5dd7ee8000 joined      2
00000000000f0000 0000000000100000 00007f5dd7ef0000 joined      2
0000000000100000 0000000080000000 00007f5dd7f00000 joined      2
00000000f4000000 00000000f8000000 00007f5dd3c00000 added       3
00000000f8000000 00000000fc000000 00007f5dcfa00000 added       4
00000000fc810000 00000000fc812000 00007f5fb1200000 added       5
00000000fffc0000 0000000100000000 00007f5fb1600000 added       6
0000000100000000 0000000240000000 00007f5e57e00000 added       7
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7f5f90958000: 
Error occurred in Python command: Cannot access memory at address 0x7f5f90958000


(gdb) bt
#0  0x00007f5fc09da2d6 in __GI_ppoll (fds=0x5645d826ffc0, nfds=9, timeout=<optimized out>, 
    timeout@entry=0x7ffe588aa930, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x00005645d47b7df5 in ppoll (__ss=0x0, __timeout=0x7ffe588aa930, __nfds=<optimized out>, __fds=<optimized out>)
    at /usr/include/bits/poll2.h:77
#2  0x00005645d47b7df5 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=2839497482)
    at util/qemu-timer.c:334
#3  0x00005645d47b8e25 in os_host_main_loop_wait (timeout=2839497482) at util/main-loop.c:231
#4  0x00005645d47b8e25 in main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:512
#5  0x00005645d45ad839 in main_loop () at vl.c:1988
#6  0x00005645d4462328 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4642
(gdb)

Comment 8 Laszlo Ersek 2019-07-18 12:39:46 UTC

Yes, at least the debug symbols work fine now.

The "Cannot access memory at address 0x7f5f90958000" error could be due to genuine corruption of the QEMU process's memory (i.e. due to the reason that caused QEMU to crash in the first place).
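
As a rough aid for reading the output in comment 7, the sketch below checks whether the failing address falls inside one of the listed RAM blocks. It assumes that each row's host_addr is the start of a host mapping as long as target_end - target_start; the function names are hypothetical, and the rows are copied from the table in comment 7.

```python
# Rows from the "guest RAM blocks" table in comment 7:
# (target_start, target_end, host_addr)
RAM_BLOCKS = [
    (0x0000000000000000, 0x00000000000a0000, 0x00007f5dd7e00000),
    (0x00000000000c0000, 0x00000000000ca000, 0x00007f5dd7ec0000),
    (0x00000000000ca000, 0x00000000000cd000, 0x00007f5dd7eca000),
    (0x00000000000cd000, 0x00000000000e8000, 0x00007f5dd7ecd000),
    (0x00000000000e8000, 0x00000000000f0000, 0x00007f5dd7ee8000),
    (0x00000000000f0000, 0x0000000000100000, 0x00007f5dd7ef0000),
    (0x0000000000100000, 0x0000000080000000, 0x00007f5dd7f00000),
    (0x00000000f4000000, 0x00000000f8000000, 0x00007f5dd3c00000),
    (0x00000000f8000000, 0x00000000fc000000, 0x00007f5dcfa00000),
    (0x00000000fc810000, 0x00000000fc812000, 0x00007f5fb1200000),
    (0x00000000fffc0000, 0x0000000100000000, 0x00007f5fb1600000),
    (0x0000000100000000, 0x0000000240000000, 0x00007f5e57e00000),
]

# Address from the gdb.MemoryError message in comment 7.
FAILING_ADDR = 0x7F5F90958000


def find_block(host_addr, blocks=RAM_BLOCKS):
    """Return the row whose host range contains host_addr, or None."""
    for target_start, target_end, host_start in blocks:
        size = target_end - target_start
        if host_start <= host_addr < host_start + size:
            return (target_start, target_end, host_start)
    return None


def guest_physical(host_addr, block):
    """Translate a host address to guest-physical within the given row."""
    target_start, _target_end, host_start = block
    return target_start + (host_addr - host_start)


block = find_block(FAILING_ADDR)
if block is None:
    print("address is outside every listed RAM block")
else:
    print("block %#x..%#x, guest-physical %#x"
          % (block[0], block[1], guest_physical(FAILING_ADDR, block)))
```

With these rows the lookup lands in the last block (guest-physical 0x100000000 to 0x240000000), so under the stated assumption the unreadable address lies inside a range the script treats as guest RAM. That alone does not say why gdb cannot read it from the core.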

Comment 10 Laszlo Ersek 2019-07-18 21:36:05 UTC

The purpose of "dump-guest-memory.py" is to extract the guest-physical
RAM from the crashed QEMU process's coredump, for analysis with the
"crash" utility. In other words, the ultimate goal is to let an analyst
investigate the Linux guest kernel, post-mortem, that resides in the
guest-physical RAM.

In order to say that the "dump-guest-memory.py" works, the following
integration test should be done:

(1) install the suitable qemu-kvm debuginfo and debugsource packages

(2) enable the vmcoreinfo device on the QEMU command line (or in the
    libvirt domain XML)

(3) launch a supported RHEL guest

(4) when the guest has quiesced, kill the QEMU process for example with
    SIGQUIT (or SIGSEGV), so that the host kernel dumps its core

(5) run the "dump-guest-memory.py" script in gdb on the coredump

(6) install the kernel debuginfo / debugsource packages on the host that
    match the *guest* kernel version

(7) open the vmcore extracted in step (5) with the "crash" utility

(8) get a backtrace on all VCPUs in "crash"

If the backtrace looks sensible, from step (8), then the test passes,
and we can close this BZ as NOTABUG.

Importantly, step (5) may validly fail under some circumstances; after
all, it is poking around in the memory of a crashed process (that is,
the process memory could be corrupt, which could break the python
script). The idea here is that step (4) crashes QEMU forcefully at a
point where QEMU is otherwise "all fine" (modulo a small chance for QEMU
to be in the middle of updating data structures that matter to
"dump-guest-memory.py" -- and that is supposed to be minimal, due to the
guest having quiesced). Thus, if step (5) fails occasionally "in
production", that's OK, as the script is "best effort". However, if step
(5) fails *consistently* in synthetic testing like this, then we have a
problem, and more investigation is going to be necessary.

(Disclaimer: it's been a while since I last worked with
"dump-guest-memory.py"; hopefully I didn't forget any of the necessary
steps above.)
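
Step (5) above can be sketched as a non-interactive gdb run. This is only an illustration: the helper name gdb_extract_argv is hypothetical, the script path and the "dump-guest-memory /tmp/vmcore X86_64" invocation are taken from comment 7, and the core file name from comment 1. gdb is given both the QEMU binary and the coredump so that it can resolve symbols.

```python
import shlex

# Script location as used earlier in this bug.
SCRIPT = "/usr/share/qemu-kvm/dump-guest-memory.py"


def gdb_extract_argv(qemu_binary, coredump, vmcore_out,
                     arch="X86_64", script=SCRIPT):
    """Build a gdb command line that sources the script and extracts a vmcore."""
    return [
        "gdb", "--batch",
        # Register the dump-guest-memory command inside gdb.
        "-ex", "source " + script,
        # Write the guest-physical RAM to vmcore_out.
        "-ex", "dump-guest-memory %s %s" % (vmcore_out, arch),
        qemu_binary, coredump,
    ]


argv = gdb_extract_argv("/usr/libexec/qemu-kvm", "core.12096", "/tmp/vmcore")
# Print a shell-quoted form that can be pasted into a terminal.
print(" ".join(shlex.quote(arg) for arg in argv))
```

The list form can be handed to subprocess.run unchanged in a test harness. Steps (6) to (8) with the "crash" utility are not covered by this sketch.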

Comment 11 Michael 2019-07-19 10:59:17 UTC

(In reply to Laszlo Ersek from comment #10)
> The purpose of "dump-guest-memory.py" is to extract the guest-physical
> RAM from the crashed QEMU process's coredump, for analysis with the
> "crash" utility. In other words, the ultimate goal is to let an analyst
> investigate the Linux guest kernel, post-mortem, that resides in the
> guest-physical RAM.
> 
> In order to say that the "dump-guest-memory.py" works, the following
> integration test should be done:
> 
> (1) install the suitable qemu-kvm debuginfo and debugsource packages
> 
> (2) enable the vmcoreinfo device on the QEMU command line (or in the
>     libvirt domain XML)
> 
> (3) launch a supported RHEL guest
> 
> (4) when the guest has quiesced, kill the QEMU process for example with
>     SIGQUIT (or SIGSEGV), so that the host kernel dumps its core
> 
> (5) run the "dump-guest-memory.py" script in gdb on the coredump
> 
> (6) install the kernel debuginfo / debugsource packages on the host that
>     match the *guest* kernel version
> 
> (7) open the vmcore extracted in step (5) with the "crash" utility
> 
> (8) get a backtrace on all VCPUs in "crash"
> 
> If the backtrace looks sensible, from step (8), then the test passes,
> and we can close this BZ as NOTABUG.
> 
> Importantly, step (5) may validly fail under some circumstances; after
> all, it is poking around in the memory of a crashed process (that is,
> the process memory could be corrupt, which could break the python
> script). The idea here is that step (4) crashes QEMU forcefully at a
> point where QEMU is otherwise "all fine" (modulo a small chance for QEMU
> to be in the middle of updating data structures that matter to
> "dump-guest-memory.py" -- and that is supposed to be minimal, due to the
> guest having quiesced). Thus, if step (5) fails occasionally "in
> production", that's OK, as the script is "best effort". However, if step
> (5) fails *consistently* in synthetic testing like this, then we have a
> problem, and more investigation is going to be necessary.
> 
> (Disclaimer: it's been a while since I last worked with
> "dump-guest-memory.py"; hopefully I didn't forget any of the necessary
> steps above.)

Hi Laszlo:

Thank you for your explanation. Following your comment, I tested by following those steps. The result in comment 3 should be okay. Thus, I am moving this bug to NOTABUG.


Thanks.

Comment 12 Michael 2019-07-19 11:00:51 UTC

(In reply to Michael from comment #11)
> 
> Hi Laszlo:
> 
> Thank you for your explanation. Following your comment, I tested by
> following those steps. The result in comment 3 should be okay. Thus, I am
> moving this bug to NOTABUG.
> 
> 
> Thanks.

Sorry, the result in comment 7 should be Okay.

Comment 13 Laszlo Ersek 2019-07-19 14:36:11 UTC

Ah, you likely mean that you have now tested steps 1-8 from comment 10, and *therefore* the failure captured in comment 7 is the "expected" ("best effort") kind.

That sounds good. Closing this as NOTABUG.

Comment 14 Michael 2019-07-22 01:35:05 UTC

(In reply to Laszlo Ersek from comment #13)
> Ah, you likely mean that you have now tested steps 1-8 from comment 10, and
> *therefore* the failure captured in comment 7 is the "expected" ("best
> effort") kind.
> 
> That sounds good. Closing this as NOTABUG.

Thanks Laszlo:

Actually, I do not know whether the captured failure is expected. Sometimes I get this error, but sometimes not. It does not reproduce 100% of the time.

If you think this error is not an issue, we can leave this bug as it is and I will update the Polarion case.

If you think this error is an issue, we can do more investigation.



Thanks

Comment 15 Laszlo Ersek 2019-07-22 14:53:32 UTC

We can't do anything about the script failing *intermittently*. The QEMU process's core is dumped without regard to the consistency of QEMU's internal data structures. I suggest marking this on the Polarion test case, as an intermittent "known issue". Thanks.

Comment 16 Michael 2019-07-23 00:33:38 UTC

(In reply to Laszlo Ersek from comment #15)
> We can't do anything about the script failing *intermittently*. The QEMU
> process's core is dumped without regard to the consistency of QEMU's
> internal data structures. I suggest marking this on the Polarion test case,
> as an intermittent "known issue". Thanks.

Thanks Laszlo, I will update the Polarion case.
