We should create an AutoTest case for the symptoms described in bug 981582:

(1) install a Fedora or RHEL guest with qemu-kvm -- the guest should have at least 4GB RAM,
(2) install the kernel debuginfo packages on the host, matching the guest kernel,
(3) install the "crash" utility on the host,
(4) start the guest,
(5) dump the guest's vmcore with "virsh dump DOMAIN /tmp/vmcore --memory-only",
(6) verify that crash can work with the vmcore:
    - no error or warning messages printed,
    - a few commands at the crash> prompt return meaningful output,
    - no immediate non-zero exit status,
    - etc.

Dave, can you please recommend some specifics for (6) that I could try to automate?

Thanks.
Laszlo/Dave:

A few comments on the points you listed:

(1) Trivial;
(2) This limits the guest and host to be running the same distro, right? Or is the kernel-debuginfo free of deps? Or is it safe to --force its installation? I'm asking this because we'd need to coordinate host/guest distro/version.
(3) Trivial;
(4) Trivial;
(5) We'll probably define/use a virttest API for that so that it can use either libvirt or qemu directly;
(6) Ideally these commands would either return a meaningful exit status or produce the same output across versions. If that's not the case, we'd have to define configuration per distro/version, which is not usually a good idea;

Cheers,
CR.
(In reply to Laszlo Ersek from comment #0)
> We should create an AutoTest case for the symptoms described in bug 981582:
>
> (1) install a Fedora or RHEL guest with qemu-kvm -- the guest should have at
>     least 4GB RAM,
> (2) install the kernel debuginfo packages on the host,
>     matching the guest kernel,
> (3) install the "crash" utility on the host,
> (4) start the guest,
> (5) dump the guest's vmcore with "virsh dump DOMAIN /tmp/vmcore --memory-only",
> (6) verify that crash can work with the vmcore:
>     - no error or warning messages printed,
>     - a few commands at the crash> prompt return meaningful output,
>     - no immediate non-zero exit status,
>     - etc.
>
> Dave, can you please recommend some specifics for (6) that I could try to
> automate? Thanks.

Honestly, given that there are hundreds of dumpfile reads during initialization that must succeed to make it to the "crash> " prompt, I would be happy if it simply comes up "clean" without any errors/warnings.
And what's the clearest indication that it has come up "clean"? The lack of output on stderr? Or the lack of any string containing "error" on stdout/stderr?

If crash has a very formal way of reporting errors, it'd be great. If not, we can certainly work around it as long as we know how it usually reports them.
It will display the crash and gdb "banners", the system information, and then the prompt:

# crash vmlinux vmcore

crash 7.0.1-1.el7
Copyright (C) 2002-2013 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: vmlinux6
    DUMPFILE: vmcore6
        CPUS: 4
        DATE: Mon Jul 29 10:55:28 2013
      UPTIME: 00:05:31
LOAD AVERAGE: 10.76, 4.73, 1.80
       TASKS: 319
    NODENAME: localhost.localdomain
     RELEASE: 2.6.32-358.el6.x86_64
     VERSION: #1 SMP Tue Jan 29 11:47:41 EST 2013
     MACHINE: x86_64 (2992 Mhz)
      MEMORY: 4 GB
       PANIC: ""
         PID: 7625
     COMMAND: "usex"
        TASK: ffff8800d8656080 [THREAD_INFO: ffff8800d4102000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash>

Complicating matters just a bit is that there are a few "please wait..." messages that are shown only while a given subsystem is being initialized, and then are overwritten. So if you redirected the above into a file, it would look like this:

crash 7.0.1-1.el7
Copyright (C) 2002-2013 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

[?1034hGNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

please wait... (gathering kmem slab cache data)
please wait... (gathering module symbol data)
please wait... (gathering task table data)
please wait... (determining panic task)
      KERNEL: vmlinux6
    DUMPFILE: vmcore6
        CPUS: 4
        DATE: Mon Jul 29 10:55:28 2013
      UPTIME: 00:05:31
LOAD AVERAGE: 10.76, 4.73, 1.80
       TASKS: 319
    NODENAME: localhost.localdomain
     RELEASE: 2.6.32-358.el6.x86_64
     VERSION: #1 SMP Tue Jan 29 11:47:41 EST 2013
     MACHINE: x86_64 (2992 Mhz)
      MEMORY: 4 GB
       PANIC: ""
         PID: 7625
     COMMAND: "usex"
        TASK: ffff8800d8656080 [THREAD_INFO: ffff8800d4102000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash> q

Each "please wait" line is preceded by a ^M, and has no ending linefeed. After the initialization it advertises is complete, there is another ^M and a line of spaces to cover up the line. So in reality, the "KERNEL" line would be preceded by all of the "please wait" messages:

^Mplease wait... (gathering kmem slab cache data)^M ^M^Mplease wait... (gathering module symbol data)^M ^M^Mplease wait... (gathering task table data)^M ^M^Mplease wait... (determining panic task)^M ^M KERNEL: vmlinux6

I'll attach the output above so you can see exactly what I mean.

Anyway, there are a myriad of warning and/or error messages that *could* be displayed, so if anything other than what is shown above is displayed, it's definitely worth noting.
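For a test that compares captured output against the expected banner, the ^M overwrites described above could be resolved before comparing. The following is only a sketch: the function name is made up for illustration and is not part of crash or virt-test, and it simply emulates what a terminal does with a carriage return (move back to column 0 and overwrite).

```python
def render_carriage_returns(raw):
    """Return the text as a terminal would show it after ^M overwrites.

    Hypothetical helper, not part of crash or virt-test.
    """
    rendered = []
    for line in raw.split("\n"):
        buf = []
        for segment in line.split("\r"):
            # A carriage return moves the cursor to column 0, so the next
            # segment overwrites the start of whatever is already there.
            buf[:len(segment)] = segment
        # Drop the blanks that were written to cover up a previous message.
        rendered.append("".join(buf).rstrip())
    return "\n".join(rendered)


if __name__ == "__main__":
    sample = ("\rplease wait... (determining panic task)\r" + " " * 50 +
              "\r      KERNEL: vmlinux6\n    DUMPFILE: vmcore6")
    print(render_carriage_returns(sample))
```

After this step the "please wait" messages are gone from the text, so any remaining unexpected line is easier to spot.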
Created attachment 780670 [details] output file from "crash vmlinux vmcore > crash.init"
(In reply to Cleber Rodrigues from comment #1)

> (2) This limits the guest and host to be running the same distro, right? Or
> is the kernel-debuginfo free of deps? Or is it safe to --force its
> installation? I'm asking this because we'd need to coordinate host/guest
> distro/version.

I think we could exploit some "regularities" here (like, a later minor release of the same major release supports debuginfos from an earlier minor release, plus, a later major release *probably* supports debuginfos from an earlier major release, considering rpm compression method compatibility etc).

But, for simplicity and robustness, I think we should stick with guest release == host release. That would also ensure that "crash", installed on the host side, would match the guest kernel's vmcore. Normally you use "crash" with kdump-written vmcores, that is, the vmcore and crash itself belong to the same major.minor release.

Then, debuginfo dependencies are major release specific. In RHEL-6 you need debuginfo-common and debuginfo, eg. for RHEL-6.4,

kernel-debuginfo-2.6.32-358.el6.x86_64
kernel-debuginfo-common-x86_64-2.6.32-358.el6.x86_64

RHEL-7 seems to follow suit:

kernel-debuginfo-3.10.0-3.el7.x86_64.rpm
kernel-debuginfo-common-x86_64-3.10.0-3.el7.x86_64.rpm

I expect future Fedora releases to match this. In any case, installing the top package (= kernel-debuginfo-VERSION-RELEASE.x86_64) with "yum" should pull in any dependencies.

> (5) We'll probably define/use a virttest API for that so that it can use
> either libvirt or qemu directly;

Right. The virsh command is cited above; the direct monitor command is available over HMP ("dump-guest-memory /tmp/vmcore") and QMP as well:

{ "execute" : "dump-guest-memory",
  "arguments" : { "paging" : false, "protocol" : "file:/tmp/vmcore" } }

> (6) Ideally these commands would either return meaningful exit status or the
> output would be the same across versions. If that's not the case, we'd have
> to define configuration per distro/version which is not usually a good idea;

I think the following should work:

BINARY=/usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux
VMCORE=/tmp/vmcore

cat >crash.cmd <<EOT
bt
quit
EOT

if ! crash -- "$BINARY" "$VMCORE" <crash.cmd >crash.out_err 2>&1 \
   || egrep --silent 'crash: read error:|WARNING:' crash.out_err; then
  printf 'vmcore is invalid\n' >&2
  # ...
fi
(... Open bug 981582 and search the page for "crash: read error:" and "WARNING:".)
There is a common "error()" function in the crash utility, which will precede each error message with "crash:" during the initialization. (During runtime it will precede the error message with the command name, e.g., "bt:") So if you go the route above, remove the "read error" part, and just look for "crash:" alone. That will catch a whole slew of other error types. And yes, the "WARNING:" string will catch things that are not necessarily errors, but most likely indicate that something is not right.
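The check suggested above (look for "crash:" alone, plus "WARNING:") could be sketched in Python roughly as follows. This is a minimal sketch, assuming the combined stdout/stderr of crash has already been captured as a string; the function name is hypothetical and the pattern is just the two strings named in this comment.

```python
import re

# The two markers discussed above: the prefix added by crash's error()
# during initialization, and the "WARNING:" string.
PROBLEM_PATTERN = re.compile(r"crash:|WARNING:")


def find_crash_problems(output):
    """Return the lines of captured crash output that look suspicious.

    Hypothetical helper; an empty list means nothing matched.
    """
    return [line for line in output.splitlines()
            if PROBLEM_PATTERN.search(line)]


if __name__ == "__main__":
    captured = "crash 7.0.1-1.el7\n      KERNEL: vmlinux6\ncrash> q\n"
    print(find_crash_problems(captured))
```

Note that the banner line ("crash 7.0.1-1.el7") and the prompt ("crash> ") do not contain "crash:", so they are not reported. Runtime errors prefixed with a command name such as "bt:" are not covered by this pattern and would need their own check, alongside the exit status of crash.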
Cleber, do you have any docs pertaining to the implementation of such a test case? I think I should finally try my hand at autotest stuff.

Thanks,
Laszlo
Cleber, can you pls help me with comment 9? Dep bug 981582 is now in MODIFIED -- the issue has been fixed in RHEL-6, RHEL-7, and upstream. I know absolutely nothing about autotest. Can you pls provide pointers to docs, test machines, etc? Thanks! Laszlo
First of all, in order to lower the complexity for newcomers, we split the virt test suite from the larger autotest project, so people don't really have to care about this big thing called 'autotest'. So, what you are looking for is 'virt-test'.

Main github page
https://github.com/autotest/virt-test

Main wiki URL
https://github.com/autotest/virt-test/wiki

The main README.rst file explains how to bootstrap your tests
https://github.com/autotest/virt-test/blob/master/README.rst

Writing tests documentation
https://github.com/autotest/virt-test/wiki/WritingTests

About test machines, we do have a test grid, but it is small (4 functional machines as of this writing) and usually the machines are running automated tests. Of course, it is fine to lock an unused machine to try out some new things, but it seems most of the development work could be done on your laptop.
What is the recommended OS to work with virt-test?

According to <https://github.com/autotest/virt-test>, RHEL-6 is supported (with EPEL-6). I run RHEL-6 Workstation + EPEL-6 on my laptop. I did the following steps:

- installed "autotest-framework" and "p7zip",
- cloned the virt-test repo,
- ran "qemu/get_started.py" successfully,
- ran "./run -t qemu --list-tests" successfully,
- found a bunch of "qmp_command.*" tests.

Since dumping the guest vmcore too is available via a QMP command, I figured I'd test one:

./run -t qemu --tests qmp_command.qmp_query-cpus

This failed with the following Python trace (from logs/latest/debug.log):

--------v--------
22:04:42 ERROR| Fail to create qemucommand:
22:04:42 ERROR| Original Traceback (most recent call last):
  File "/home/lacos/src/upstream/virt-test/virttest/qemu_vm.py", line 2284, in create
    self.devices = self.make_create_command()
  File "/home/lacos/src/upstream/virt-test/virttest/qemu_vm.py", line 1382, in make_create_command
    params.get('workaround_qemu_qmp_crash') == "always")
  File "/home/lacos/src/upstream/virt-test/virttest/qemu_devices.py", line 923, in __init__
    timeout=10, ignore_status=True, verbose=False)
TypeError: system_output() got an unexpected keyword argument 'verbose'
--------^--------

git-blaming virttest/qemu_devices.py:923:

--------v--------
922 self.__qemu_help = utils.system_output("%s -help" % qemu_binary,
923                     timeout=10, ignore_status=True, verbose=False)
--------^--------

the following commit is fingered:

commit e9639fffc89569965cec85448ae12d8bef72e80c
Author: Lukáš Doktor <ldoktor>
Date: Wed Jun 19 12:57:39 2013 +0200

    virttest.qemu_devices: Use utils.system_output() instead of commands

    Signed-off-by: Lukáš Doktor <ldoktor>

Indeed it passes "verbose=False" to utils.system_output(). Unfortunately, the "autotest-framework-0.14.3-1.el6.noarch", installed from EPEL-6, doesn't know about the "verbose" parameter:

[/usr/lib/python2.6/site-packages/autotest/client/shared/base_utils.py]
--------v--------
def system_output(command, timeout=None, ignore_status=False,
                  retain_output=False, args=()):
--------^--------

So, what host OS should I use? Is Fedora 19 recent enough?

Thanks.
Ok, this bug slipped in, due to the way my development environment is set up. Most of us use Fedora 19 to work on virt-test, but RHEL6 should be supported. I'll push a fix for this bug to master right now.
Thanks. Don't rush it just for me, though: I keep an F19 installation around on my separate workstation, precisely for such cases. (I'm aware that most developers run bleeding edge Fedora.)
Ok, after some thinking, I figured it'd be better to fix the problem on next, the development branch. So bear with us and use next for now.

/home/lmr/Code/virt-test.git/logs/run-2013-08-22-18.06.45/debug.log

Explaining what is going on: autotest 0.14 is what Fedora ships, mainly because the guy that was the original package maintainer for autotest is not working on packaging anymore. That means that our team has to take over, which we would gladly do, but there are some requirements that we need to fulfill to become maintainers. Cleber applied and is working on all requirements, which takes time (months now). Makes me want to shoot myself in the foot.

So both utils.system() and utils.system_output() from 0.14 don't have the verbose parameter, which was introduced in the 0.15 release. Sometimes people forget and we have this mess.

Please have some patience with us and try out next on your current setup, it should work.
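One defensive pattern for this kind of signature mismatch is to pass only the keyword arguments that the installed function actually accepts. The sketch below is not the fix that went into the next branch; the helper name is hypothetical, the system_output_014 function is a stand-in that only copies the 0.14 signature quoted earlier, and the code uses Python 3's inspect.signature (on the Python 2 of that era, inspect.getargspec would be the rough equivalent).

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, dropping keyword arguments it does not accept.

    Hypothetical helper for illustration only.
    """
    params = inspect.signature(func).parameters
    accepts_any = any(p.kind is inspect.Parameter.VAR_KEYWORD
                      for p in params.values())
    if accepts_any:
        return func(*args, **kwargs)
    supported = {name: value for name, value in kwargs.items()
                 if name in params}
    return func(*args, **supported)


# Stand-in with the 0.14 signature (no "verbose" parameter); it does not
# run anything, it only reports how it was called.
def system_output_014(command, timeout=None, ignore_status=False,
                      retain_output=False, args=()):
    return "called with %r, timeout=%r" % (command, timeout)


if __name__ == "__main__":
    print(call_with_supported_kwargs(system_output_014, "qemu -help",
                                     timeout=10, ignore_status=True,
                                     verbose=False))
```

The trade-off is that silently dropping an argument can hide a real incompatibility, so requiring the newer autotest release may still be the cleaner answer.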
Disregard the log path, it got pasted by accident...
Created attachment 789381 [details] test output for "qmp_command.qmp_query-cpus" Thanks for the fix, now I can run ./run -t qemu --tests qmp_command.qmp_query-cpus on the "next" branch. However the test fails -- please see the attachment. Shouldn't I run this test as root? The GetStarted wiki page says, "For qemu and libvirt subtests, the default test set does not require root. However, other tests might fail due to lack of privileges." However, "./run -t qemu --list-tests" doesn't say "(requires root)" for "qmp_command.qmp_query-cpus" (like it does eg. for the cgroup.* tests). I'm asking because the debug log contains lines like "qemu will run in KVM mode", which is not possible without having root. Thanks.
I looked at the log, and it seems to be a QEMU bug:

23:49:53 ERROR| QMP command output: '[{u'current': True, u'pc': -2130432042, u'halted': True, u'CPU': 0, u'thread_id': 17560}, {u'current': False, u'pc': -2130432042, u'halted': True, u'CPU': 1, u'thread_id': 17561}]'
23:49:53 ERROR| Human command output: '* CPU #0: pc=0xffffffff81042fd6 (halted) thread_id=17560
23:49:53 ERROR| CPU #1: pc=0xffffffff81042fd6 (halted) thread_id=17561 '
23:49:53 ERROR| Value in human monitor: '18446744071579119574'
23:49:53 ERROR| Value in qmp: '-2130432042' [context: Verify that qmp command 'query-cpus' works as designed.]

Or am I completely off?
I'm not sure yet. I got the same results on Fedora 19, running the test as root, using a qemu binary that I built from current upstream master.

Are you saying it's a qemu bug because the qmp output contains a negative PC? That's just the value of the PC taken as an int64_t.

{ 'type': 'CpuInfo',
  'data': {'CPU': 'int', 'current': 'bool', 'halted': 'bool', '*pc': 'int',
           '*nip': 'int', '*npc': 'int', '*PC': 'int', 'thread_id': 'int'} }

...

Ah. So the test finds that 18446744071579119574 does not equal -2130432042.

(uint64_t)0xFFFFFFFF81042FD6 == 18446744071579119574 (dec)
(int64_t)0xFFFFFFFF81042FD6 == -2130432042 (dec, in two's complement, 64-bit signed int)

I'll try another test.
Alright, "./run -t qemu --tests qmp_command.qmp_query-name" succeeded. I guess my setup is OK then. Thanks!
I'm reading the autotest documentation, and I think some changes to the original test plan would be beneficial -- we should not install any packages on the host side.

Replace, from comment #0:

> (1) install a Fedora or RHEL guest with qemu-kvm -- the guest should have at
>     least 4GB RAM,
> (2) install the kernel debuginfo packages on the host,
>     matching the guest kernel,
> (3) install the "crash" utility on the host,
> (4) start the guest,
> (5) dump the guest's vmcore with "virsh dump DOMAIN /tmp/vmcore --memory-only",
> (6) verify that crash can work with the vmcore:

With:

(1) install a Fedora guest with qemu-kvm & start it (RAM size: 4 GB, qcow2 disk size should be 20 GB),
(2) install the kernel debuginfo packages *in the guest* (debuginfo-install kernel),
(3) install the "crash" utility *in the guest*,
(4) dump the guest vmcore on the host side, with a QMP or HMP monitor command (paging=false) -- when this completes, the guest should resume running,
(5) copy the vmcore into the guest with scp,
(6) verify the vmcore inside the guest, with crash.
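For step (4), the QMP message quoted earlier in this bug could be assembled like this. It is only a sketch: the function name is hypothetical and not a virttest API, and the message fields ("execute", "paging", "protocol" with a "file:" prefix) are the ones cited in the earlier comment.

```python
import json


def build_dump_guest_memory(vmcore_path, paging=False):
    """Build the QMP "dump-guest-memory" message as a JSON string.

    Hypothetical helper; sending it over the monitor socket is left out.
    """
    command = {
        "execute": "dump-guest-memory",
        "arguments": {
            "paging": paging,
            # The dump target is given as a "file:" protocol string.
            "protocol": "file:" + vmcore_path,
        },
    }
    return json.dumps(command)


if __name__ == "__main__":
    print(build_dump_guest_memory("/tmp/vmcore"))
```

Building the message from a dict rather than a string template keeps the boolean as a JSON false instead of a Python False.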
About the QMP test failing: OK, I did not realize what you pointed out:

(uint64_t)0xFFFFFFFF81042FD6 == 18446744071579119574 (dec)
(int64_t)0xFFFFFFFF81042FD6 == -2130432042 (dec, in two's complement, 64-bit signed int)

Now that I see it, the test clearly has a bug: it needs to be modified to properly compare the 2 values. It is doing a simple int comparison right now.
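One way the comparison could be made sign-agnostic is to reduce both values to their unsigned 64-bit representation before comparing. This is a sketch of the idea only, not the patch that was eventually posted; both function names are hypothetical.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def to_uint64(value):
    """Reinterpret a (possibly negative) integer as an unsigned 64-bit value."""
    return value & MASK_64


def same_register_value(hmp_value, qmp_value):
    """Compare a human-monitor value (decimal or 0x-prefixed string)
    with a QMP integer that may have been reported as a signed int64."""
    return to_uint64(int(hmp_value, 0)) == to_uint64(qmp_value)


if __name__ == "__main__":
    print(to_uint64(-2130432042))
    print(same_register_value("0xffffffff81042fd6", -2130432042))
```

With the values from the log, to_uint64(-2130432042) gives 18446744071579119574, which is 2^64 - 2130432042, so the two monitors are in fact reporting the same PC.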
posted v1: https://www.redhat.com/archives/virt-test-devel/2013-September/msg00011.html
Lucas committed the patch with minor comment reformatting on the "next" branch: https://github.com/autotest/virt-test/commit/81733cd3 Thanks! I'm not sure if I should bump the status to CLOSED/RAWHIDE. I think I'll set it to MODIFIED for now. ("This bug report has been fixed, unit tested and checked into source control by the Assigned Engineer.")
This bug appears to have been reported against 'rawhide' during the Fedora 20 development cycle. Changing version to '20'. More information and reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora20