990118 – dump-guest-memory + "crash" test case for AutoTest

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	autotest-framework
Sub Component:
Version:	20
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Laszlo Ersek
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:	981582
Blocks:

Reported:	2013-07-30 12:29 UTC by Laszlo Ersek
Modified:	2014-04-21 09:32 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-04-21 09:32:08 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
output file from "crash vmlinux vmcore > crash.init" (1.95 KB, text/plain) 2013-07-30 13:44 UTC, Dave Anderson
test output for "qmp_command.qmp_query-cpus" (16.64 KB, text/plain) 2013-08-22 22:01 UTC, Laszlo Ersek

Description Laszlo Ersek 2013-07-30 12:29:02 UTC

We should create an AutoTest case for the symptoms described in bug 981582:

(1) install a Fedora or RHEL guest with qemu-kvm -- the guest should have at
    least 4GB RAM,
(2) install the kernel debuginfo packages on the host,
    matching the guest kernel,
(3) install the "crash" utility on the host,
(4) start the guest,
(5) dump the guest's vmcore with "virsh dump DOMAIN /tmp/vmcore --memory-only",
(6) verify that crash can work with the vmcore:
    - no error or warning messages printed,
    - a few commands at the crash> prompt return meaningful output,
    - no immediate non-zero exit status,
    - etc.

Dave, can you please recommend some specifics for (6) that I could try to automate? Thanks.

Comment 1 Cleber Rosa 2013-07-30 12:47:33 UTC

Laszlo/Dave:

A few comments on the points you listed:

(1) Trivial;
(2) This limits the guest and host to be running the same distro, right? Or is the kernel-debuginfo free of deps? Or is it safe to --force its installation? I'm asking this because we'd need to coordinate host/guest distro/version.
(3) Trivial;
(4) Trivial;
(5) We'll probably define/use a virttest API for that so that it can use either libvirt or qemu directly;
(6) Ideally these commands would either return a meaningful exit status, or their output would be the same across versions. If that's not the case, we'd have to define a configuration per distro/version, which is usually not a good idea;

Cheers,
CR.

Comment 2 Dave Anderson 2013-07-30 13:01:03 UTC

(In reply to Laszlo Ersek from comment #0)
> We should create an AutoTest case for the symptoms described in bug 981582:
> 
> (1) install a Fedora or RHEL guest with qemu-kvm -- the guest should have at
>     least 4GB RAM,
> (2) install the kernel debuginfo packages on the host,
>     matching the guest kernel,
> (3) install the "crash" utility on the host,
> (4) start the guest,
> (5) dump the guest's vmcore with "virsh dump DOMAIN /tmp/vmcore --memory-only",
> (6) verify that crash can work with the vmcore:
>     - no error or warning messages printed,
>     - a few commands at the crash> prompt return meaningful output,
>     - no immediate non-zero exit status,
>     - etc.
> 
> Dave, can you please recommend some specifics for (6) that I could try to
> automate? Thanks.

Honestly, given that there are hundreds of dumpfile reads during initialization
that must succeed to make it to the "crash> " prompt, I would be happy if it
simply comes up "clean" without any errors/warnings.

Comment 3 Cleber Rosa 2013-07-30 13:20:10 UTC

And what's the clearest indication that it has come up "clean"? The lack of output on stderr? Or the lack of any string containing "error" on stdout/stderr?

If crash has a very formal way of reporting errors, it'd be great. If not, we can certainly work around it as long as we have knowledge of how it usually does.

Comment 4 Dave Anderson 2013-07-30 13:43:49 UTC

It will display the crash and gdb "banners", the system information,
and then the prompt:

# crash vmlinux vmcore

crash 7.0.1-1.el7
Copyright (C) 2002-2013  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: vmlinux6                          
    DUMPFILE: vmcore6
        CPUS: 4
        DATE: Mon Jul 29 10:55:28 2013
      UPTIME: 00:05:31
LOAD AVERAGE: 10.76, 4.73, 1.80
       TASKS: 319
    NODENAME: localhost.localdomain
     RELEASE: 2.6.32-358.el6.x86_64
     VERSION: #1 SMP Tue Jan 29 11:47:41 EST 2013
     MACHINE: x86_64  (2992 Mhz)
      MEMORY: 4 GB
       PANIC: ""
         PID: 7625
     COMMAND: "usex"
        TASK: ffff8800d8656080  [THREAD_INFO: ffff8800d4102000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash>

Complicating matters just a bit is that there are a few "please wait..."
messages that are shown only while a given subsystem is being
initialized, and then are overwritten.  So if you redirected the
above into a file, it would look like this:


crash 7.0.1-1.el7
Copyright (C) 2002-2013  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
[?1034hGNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...


please wait... (gathering kmem slab cache data)
                                                

please wait... (gathering module symbol data)
                                              

please wait... (gathering task table data)
                                           

please wait... (determining panic task)
                                        
      KERNEL: vmlinux6
    DUMPFILE: vmcore6
        CPUS: 4
        DATE: Mon Jul 29 10:55:28 2013
      UPTIME: 00:05:31
LOAD AVERAGE: 10.76, 4.73, 1.80
       TASKS: 319
    NODENAME: localhost.localdomain
     RELEASE: 2.6.32-358.el6.x86_64
     VERSION: #1 SMP Tue Jan 29 11:47:41 EST 2013
     MACHINE: x86_64  (2992 Mhz)
      MEMORY: 4 GB
       PANIC: ""
         PID: 7625
     COMMAND: "usex"
        TASK: ffff8800d8656080  [THREAD_INFO: ffff8800d4102000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash> q

Each "please wait" line is preceded by a ^M, and has no ending linefeed.
After the initialization it advertises is complete, there is another
^M and a line of spaces to cover up the line.

So in reality, the "KERNEL" line would be preceded by all of the
please wait messages:

^Mplease wait... (gathering kmem slab cache data)^M                                                ^M^Mplease wait... (gathering module symbol data)^M                                              ^M^Mplease wait... (gathering task table data)^M                                           ^M^Mplease wait... (determining panic task)^M                                        ^M      KERNEL: vmlinux6

I'll attach the output above so you can see exactly what I mean.

Anyway, there are a myriad of warning and/or error messages that *could*
be displayed, so if anything other than what is shown above is displayed,
it's definitely worth noting.
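
A test that compares captured output against the expected screen first has to undo the ^M overwrites described above. The following is only a sketch of one way to do that; the function name render_carriage_returns is hypothetical, and it assumes the capture contains plain carriage returns and no other cursor-control sequences.

```python
def render_carriage_returns(raw: str) -> str:
    """Return the text a terminal would show after applying '\r' overwrites."""
    rendered = []
    for line in raw.split("\n"):
        cells = []  # characters currently visible on this line
        col = 0     # cursor column
        for ch in line:
            if ch == "\r":
                col = 0  # carriage return: cursor back to column 0
            else:
                if col < len(cells):
                    cells[col] = ch  # overwrite what was printed earlier
                else:
                    cells.append(ch)
                col += 1
        # Trailing blanks left by the "cover up" line carry no information.
        rendered.append("".join(cells).rstrip())
    return "\n".join(rendered)
```

After this step, the "please wait..." messages are gone and the "KERNEL" line stands alone, so the result can be checked line by line.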

Comment 5 Dave Anderson 2013-07-30 13:44:43 UTC

Created attachment 780670 [details]
output file from "crash vmlinux vmcore > crash.init"

Comment 6 Laszlo Ersek 2013-07-30 14:05:50 UTC

(In reply to Cleber Rodrigues from comment #1)

> (2) This limits the guest and host to be running the same distro, right? Or
> is the kernel-debuginfo free of deps? Or is it safe to --force its
> installation? I'm asking this because we'd need to coordinate host/guest
> distro/version.

I think we could exploit some "regularities" here (like, a later minor release of the same major release supports debuginfos from an earlier minor release, plus, a later major release *probably* supports debuginfos from an earlier major release, considering rpm compression method compatibility etc). But, for simplicity and robustness, I think we should stick with guest release == host release.

That would also ensure that "crash", installed on the host side, would match the guest kernel's vmcore. Normally you use "crash" with kdump-written vmcores, that is, the vmcore and crash itself belong to the same major.minor release.

Then, debuginfo dependencies are major release specific. In RHEL-6 you need debuginfo-common and debuginfo, eg. for RHEL-6.4,

  kernel-debuginfo-2.6.32-358.el6.x86_64
  kernel-debuginfo-common-x86_64-2.6.32-358.el6.x86_64

RHEL-7 seems to follow suit:

  kernel-debuginfo-3.10.0-3.el7.x86_64.rpm
  kernel-debuginfo-common-x86_64-3.10.0-3.el7.x86_64.rpm

I expect future Fedora releases to match this. In any case, installing the top package (= kernel-debuginfo-VERSION-RELEASE.x86_64) with "yum" should pull in any dependencies.

> (5) We'll probably define/use a virttest API for that so that it can use
> either libvirt or qemu directly;

Right. The virsh command is cited above; the direct monitor command is available over HMP ("dump-guest-memory /tmp/vmcore") and QMP as well:

  {
    "execute" : "dump-guest-memory",
    "arguments" : {
      "paging"   : false,
      "protocol" : "file:/tmp/vmcore"
    }
  }
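
For a test that drives the monitor programmatically, the QMP command above can be assembled as a JSON string. This is a minimal sketch; the helper name build_dump_guest_memory is hypothetical, and only the command name and its two arguments are taken from this comment.

```python
import json


def build_dump_guest_memory(path: str, paging: bool = False) -> str:
    """Serialize a dump-guest-memory QMP command that writes to a local file."""
    command = {
        "execute": "dump-guest-memory",
        "arguments": {
            "paging": paging,
            # The "file:" prefix selects a plain file as the dump target.
            "protocol": "file:" + path,
        },
    }
    return json.dumps(command)
```

Sending the string to the monitor socket is left out on purpose, since that part depends on the virttest API mentioned in point (5).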

> (6) Ideally these commands would either return meaningful exit status or the
> output would be the same across versions. If that's not the case, we'd have
> to  define configuration per distro/version which is not usually a good idea;

I think the following should work:

BINARY=/usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux
VMCORE=/tmp/vmcore

cat >crash.cmd <<EOT
bt
quit
EOT

if ! crash  -- "$BINARY" "$VMCORE" <crash.cmd >crash.out_err 2>&1 \
  || egrep --silent 'crash: read error:|WARNING:' crash.out_err; then
  printf 'vmcore is invalid\n' >&2
  # ...
fi

Comment 7 Laszlo Ersek 2013-07-30 14:12:49 UTC

(... Open bug 981582 and search the page for "crash: read error:" and "WARNING:".)

Comment 8 Dave Anderson 2013-07-30 14:16:53 UTC

There is a common "error()" function in the crash utility, which will
precede each error message with "crash:" during the initialization.
(During runtime it will precede the error message with the command
name, e.g., "bt:")

So if you go the route above, remove the "read error" part, and just
look for "crash:" alone.  That will catch a whole slew of other 
error types.  And yes, the "WARNING:" string will catch things
that are not necessarily errors, but most likely indicate that
something is not right.
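
The check discussed here (non-zero exit status, or any "crash:" or "WARNING:" string in the combined output) could look roughly like this in Python. It is a sketch only: the function names are hypothetical, and the patterns are exactly the two strings named in this thread, not a complete list of what crash can print.

```python
import re

# "crash:" prefixes errors during initialization; "WARNING:" flags
# conditions that are not necessarily errors but are worth reporting.
SUSPICIOUS = re.compile(r"crash:|WARNING:")


def find_suspicious_lines(output: str) -> list:
    """Return every output line that contains one of the patterns."""
    return [line for line in output.splitlines() if SUSPICIOUS.search(line)]


def vmcore_looks_valid(exit_status: int, output: str) -> bool:
    """True if crash exited with status 0 and printed nothing suspicious."""
    return exit_status == 0 and not find_suspicious_lines(output)
```

Returning the offending lines, rather than a bare boolean, lets the test log exactly what crash complained about.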

Comment 9 Laszlo Ersek 2013-07-30 22:07:29 UTC

Cleber,

do you have any docs pertaining to the implementation of such a test case? I think I should finally try my hand at autotest stuff.

Kudos
Laszlo

Comment 10 Laszlo Ersek 2013-08-22 13:07:25 UTC

Cleber, can you pls help me with comment 9?

Dep bug 981582 is now in MODIFIED -- the issue has been fixed in RHEL-6, RHEL-7, and upstream.

I know absolutely nothing about autotest. Can you pls provide pointers to docs, test machines, etc?

Thanks!
Laszlo

Comment 11 Lucas Meneghel Rodrigues 2013-08-22 14:21:31 UTC

First of all, in order to lower the complexity for newcomers, we split the virt test suite from the larger autotest project, so people don't really have to care about this big thing called 'autotest'. So, what you are looking for is 'virt-test'.

Main github page

https://github.com/autotest/virt-test

Main wiki URL

https://github.com/autotest/virt-test/wiki

The main README.rst file is useful on how to bootstrap your tests

https://github.com/autotest/virt-test/blob/master/README.rst

Writing tests documentation

https://github.com/autotest/virt-test/wiki/WritingTests

About test machines, we do have a test grid, but it is small (4 functional machines as of this writing) and usually the machines are running automated tests. Of course, it is fine to lock an unused machine to try out some new things, but it seems most of the development work could be done on your laptop.

Comment 12 Laszlo Ersek 2013-08-22 20:20:37 UTC

What is the recommended OS to work with virt-test? According to <https://github.com/autotest/virt-test>, RHEL-6 is supported (with EPEL-6). I run RHEL-6 Workstation + EPEL-6 on my laptop.

I did the following steps:

- installed "autotest-framework" and "p7zip",
- cloned the virt-test repo,
- ran "qemu/get_started.py" successfully,
- ran "./run -t qemu --list-tests" successfully,
- found a bunch of "qmp_command.*" tests.

Since dumping the guest vmcore too is available via a QMP command, I figured I'd test one:

  ./run -t qemu --tests qmp_command.qmp_query-cpus

This failed with the following Python trace (from logs/latest/debug.log):

--------v--------
22:04:42 ERROR| Fail to create qemucommand:
22:04:42 ERROR| Original Traceback (most recent call last):
  File "/home/lacos/src/upstream/virt-test/virttest/qemu_vm.py", line 2284, in create
    self.devices = self.make_create_command()
  File "/home/lacos/src/upstream/virt-test/virttest/qemu_vm.py", line 1382, in make_create_command
    params.get('workaround_qemu_qmp_crash') == "always")
  File "/home/lacos/src/upstream/virt-test/virttest/qemu_devices.py", line 923, in __init__
    timeout=10, ignore_status=True, verbose=False)
TypeError: system_output() got an unexpected keyword argument 'verbose'
--------^--------

git-blaming virttest/qemu_devices.py:923:

--------v--------
922 self.__qemu_help = utils.system_output("%s -help" % qemu_binary,
923                         timeout=10, ignore_status=True, verbose=False)
--------^--------

the following commit is fingered:

commit e9639fffc89569965cec85448ae12d8bef72e80c
Author: Lukáš Doktor <ldoktor>
Date:   Wed Jun 19 12:57:39 2013 +0200

    virttest.qemu_devices: Use utils.system_output() instead of commands
    
    Signed-off-by: Lukáš Doktor <ldoktor>

Indeed it passes "verbose=False" to utils.system_output().

Unfortunately, the "autotest-framework-0.14.3-1.el6.noarch", installed from EPEL-6, doesn't know about the "verbose" parameter:

[/usr/lib/python2.6/site-packages/autotest/client/shared/base_utils.py]

--------v--------
   1112 def system_output(command, timeout=None, ignore_status=False,
   1113                   retain_output=False, args=()):
--------^--------


So, what host OS should I use? Is Fedora 19 recent enough? Thanks.

Comment 13 Lucas Meneghel Rodrigues 2013-08-22 20:33:40 UTC

Ok, this bug slipped in, due to the way my development environment is set up. Most of us use Fedora 19 to work on virt-test, but RHEL6 should be supported.

I'll push a fix to this bug to master right now.

Comment 14 Laszlo Ersek 2013-08-22 20:39:08 UTC

Thanks. Don't rush it just for me, though; I keep an F19 installation around on my separate workstation, precisely for such cases. (I'm aware that most developers run bleeding edge Fedora.)

Comment 15 Lucas Meneghel Rodrigues 2013-08-22 21:15:52 UTC

Ok, after some thinking, I figured it'd be better to fix the problem on next, the development branch. So bear with us and use next for now. /home/lmr/Code/virt-test.git/logs/run-2013-08-22-18.06.45/debug.log

Explaining what is going on: autotest 0.14 is what Fedora ships, mainly because the guy that was the original package maintainer for autotest is not working on packaging anymore.

That means that our team has to take over, which we would gladly do, but there are some requirements that we need to fulfill to become maintainers. Cleber applied and is working on all requirements, which takes time (months now). Makes me want to shoot myself in the foot.

So both utils.system() and utils.system_output() from 0.14 lack the verbose parameter, which was introduced in the 0.15 release. Sometimes people forget, and we have this mess. Please have some patience with us and try out next on your current setup; it should work.
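
One generic way to tolerate this kind of signature drift is to drop keyword arguments that the installed function does not declare. The sketch below illustrates the idea and is not the fix that went into the next branch. Both names in it are hypothetical, the stand-in function only mimics the 0.14 signature quoted in comment 12, and inspect.signature assumes Python 3 rather than the Python 2.6 environment from the traceback.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, dropping keyword arguments that it does not declare."""
    params = inspect.signature(func).parameters
    takes_any_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not takes_any_kwargs:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return func(*args, **kwargs)


# Stand-in with the 0.14-style signature (no "verbose" parameter).
def system_output_014(command, timeout=None, ignore_status=False,
                      retain_output=False, args=()):
    return "ran: " + command
```

With this wrapper, a call that passes verbose=False succeeds against the stand-in instead of raising TypeError; the price is that an unsupported option is ignored silently.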

Comment 16 Lucas Meneghel Rodrigues 2013-08-22 21:16:25 UTC

Disregard the log path, it got pasted by accident...

Comment 17 Laszlo Ersek 2013-08-22 22:01:39 UTC

Created attachment 789381 [details]
test output for "qmp_command.qmp_query-cpus"

Thanks for the fix, now I can run

  ./run -t qemu --tests qmp_command.qmp_query-cpus

on the "next" branch. However the test fails -- please see the attachment.

Shouldn't I run this test as root? The GetStarted wiki page says, "For qemu and libvirt subtests, the default test set does not require root. However, other tests might fail due to lack of privileges."

However, "./run -t qemu --list-tests" doesn't say "(requires root)" for "qmp_command.qmp_query-cpus" (like it does eg. for the cgroup.* tests).

I'm asking because the debug log contains lines like "qemu will run in KVM mode", which is not possible without having root.

Thanks.

Comment 18 Lucas Meneghel Rodrigues 2013-08-22 22:06:35 UTC

I looked at the log, it seems to be a QEMU bug:

23:49:53 ERROR| QMP command output: '[{u'current': True, u'pc': -2130432042, u'halted': True, u'CPU': 0, u'thread_id': 17560}, {u'current': False, u'pc': -2130432042, u'halted': True, u'CPU': 1, u'thread_id': 17561}]'
23:49:53 ERROR| Human command output: '* CPU #0: pc=0xffffffff81042fd6 (halted) thread_id=17560 
23:49:53 ERROR|   CPU #1: pc=0xffffffff81042fd6 (halted) thread_id=17561 '
23:49:53 ERROR| Value in human monitor: '18446744071579119574'
23:49:53 ERROR| Value in qmp: '-2130432042'    [context: Verify that qmp command 'query-cpus' works as designed.]

Or am I completely off?

Comment 19 Laszlo Ersek 2013-08-22 22:34:49 UTC

I'm not sure yet. I got the same results on Fedora 19, running the test as root, using a qemu binary that I built from current upstream master.

Are you saying it's a qemu bug because the qmp output contains a negative PC? That's just the value of the PC taken as an int64_t.

{ 'type': 'CpuInfo',
  'data': {'CPU': 'int', 'current': 'bool', 'halted': 'bool', '*pc': 'int',
           '*nip': 'int', '*npc': 'int', '*PC': 'int', 'thread_id': 'int'} }

... Ah. So the test finds that 18446744071579119574 does not equal -2130432042.

(uint64_t)0xFFFFFFFF81042FD6 == 18446744071579119574 (dec)
 (int64_t)0xFFFFFFFF81042FD6 == -2130432042 (dec, in two's complement, 64-bit signed int)

I'll try another test.

Comment 20 Laszlo Ersek 2013-08-22 22:39:37 UTC

Alright, "./run -t qemu --tests qmp_command.qmp_query-name" succeeded. I guess my setup is OK then. Thanks!

Comment 21 Laszlo Ersek 2013-08-23 09:46:48 UTC

I'm reading the autotest documentation, and I think some changes to the original test plan would be beneficial -- we should not install any packages on the host side.

Replace, from comment #0:

> (1) install a Fedora or RHEL guest with qemu-kvm -- the guest should have at
>     least 4GB RAM,
> (2) install the kernel debuginfo packages on the host,
>     matching the guest kernel,
> (3) install the "crash" utility on the host,
> (4) start the guest,
> (5) dump the guest's vmcore with "virsh dump DOMAIN /tmp/vmcore --memory-only",
> (6) verify that crash can work with the vmcore:

With:

(1) install a Fedora guest with qemu-kvm & start it (RAM size: 4 GB,
    qcow2 disk size should be 20 GB),
(2) install the kernel debuginfo packages *in the guest*,
    (debuginfo-install kernel)
(3) install the "crash" utility *in the guest*,
(4) dump the guest vmcore on the host side, with a QMP or HMP monitor command
    (paging=false) -- when this completes, the guest should resume running
(5) copy the vmcore into the guest with scp,
(6) verify the vmcore inside the guest, with crash.

Comment 22 Lucas Meneghel Rodrigues 2013-08-23 13:05:29 UTC

About the QMP test failing, OK, I did not realize what you pointed out:

(uint64_t)0xFFFFFFFF81042FD6 == 18446744071579119574 (dec)
 (int64_t)0xFFFFFFFF81042FD6 == -2130432042 (dec, in two's complement, 64-bit signed int)

Now that I see it, clearly the test has a bug, it needs to be modified to properly compare the 2 values. It is doing a simple int comparison right now.
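
A comparison that treats both values as 64-bit quantities could be sketched as follows. The function names are hypothetical and this is not the code of the actual test; it only shows that reducing both sides modulo 2**64 makes the two numbers above agree.

```python
MASK64 = (1 << 64) - 1


def to_uint64(value: int) -> int:
    """Map a signed or unsigned 64-bit integer to its unsigned form."""
    return value & MASK64


def same_pc(hmp_value: str, qmp_value: int) -> bool:
    """Compare the PC printed by the human monitor with the QMP integer.

    hmp_value may be decimal or 0x-prefixed hex; int(..., 0) accepts both.
    """
    return to_uint64(int(hmp_value, 0)) == to_uint64(qmp_value)
```

For the values in the log, same_pc("0xffffffff81042fd6", -2130432042) holds, because -2130432042 & MASK64 is 18446744071579119574.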

Comment 23 Laszlo Ersek 2013-09-05 02:04:04 UTC

posted v1:
https://www.redhat.com/archives/virt-test-devel/2013-September/msg00011.html

Comment 24 Laszlo Ersek 2013-09-16 11:59:19 UTC

Lucas committed the patch with minor comment reformatting on the "next" branch:
https://github.com/autotest/virt-test/commit/81733cd3

Thanks!

I'm not sure if I should bump the status to CLOSED/RAWHIDE. I think I'll set it to MODIFIED for now. ("This bug report has been fixed, unit tested and checked into source control by the Assigned Engineer.")

Comment 25 Fedora End Of Life 2013-09-16 17:03:06 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 20 development cycle.
Changing version to '20'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora20