Bug 619846
Summary: virsh dump gives very cryptic error messages

Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.0
Status: CLOSED ERRATA
Severity: high
Priority: low
Reporter: Qian Cai <qcai>
Assignee: Osier Yang <jyang>
QA Contact: Virtualization Bugs <virt-bugs>
CC: ajia, clalance, dallan, dyuan, eblake, gren, kxiong, mzhan, phan, rwu, vbian, xen-maint, yoyzhang, yupzhang
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Fixed In Version: libvirt-0.9.9-1.el6
Doc Type: Bug Fix
Doc Text:
Cause: The qemu monitor command 'query-migrate' does not return the underlying error.
Consequence: libvirt has no way to know what error happened, and has no choice but to report a generic error such as "Migration unexpectedly failed".
Fix: Refactor the underlying migration code to use the "fd:" protocol, which lets libvirt obtain the exact error.
Result: A much more precise error is reported when something goes wrong.
Last Closed: 2012-06-20 06:23:58 UTC
Bug Depends On: 584077, 670727, 698496
Description

Qian Cai, 2010-07-30 17:10:04 UTC:

Created attachment 435605 [details]: guest.xml

Both i386 and x86_64 guests failed.
---

Chris Lalancette:

Hmm, it works here for me, both with and without SELinux in enforcing mode. Is there anything in /var/log/messages from libvirt? Can you edit /etc/libvirt/libvirtd.conf and set log_filters as follows?

    log_filters="1:qemu"

Then restart libvirtd and try again. There should then be much more debugging output in /var/log/messages, which may give more insight into the problem.

---

Qian Cai:

I did that, and there was only one such message:

    Jul 31 08:27:50 localhost libvirtd: 08:27:50.703: error : qemuDomainWaitForMigrationComplete:5371 : operation failed: Migration unexpectedly failed

It was certainly working before in RHEL 6.

---

I can virsh dump the guest's vmcore with the same software versions as Cai, with both host and guest running 2.6.32-54.el6.x86_64. The guest is a 2-CPU SMP system; the host is an AMD system.

---

Chris Lalancette:

Cai, what is the current working directory when you ran your virsh dump command? Also, if you disable SELinux (setenforce 0), does the dump work? Thanks.

---

(In reply to comment #7)
> Cai, what is the current working directory when you ran your virsh dump
> command? Also, if you disable SELinux (setenforce 0), does the dump work?

setenforce 0 works! Surprisingly, when it failed, there was nothing obvious in audit.log either.

---

> setenforce 0 works!
It looks like that was just dumb luck for this one attempt; subsequent attempts failed even with setenforce 0.
I did try virsh dump from both /root and /tmp. The only successful one was taken from /tmp.

---

This sounds like a plain old permissions problem. IIRC, the dump is going to be written as the user qemu, so it will be able to write to /tmp but not to /root.

---

Just to make things clear: the tests were conducted as root, and dumping to both /tmp and /root could fail with and without SELinux, tried many times. There was only one time it worked, and that was when SELinux was disabled, the target was under /tmp, and the guest was in the process of booting up (probably a timing issue?). However, further testing showed that trying virsh dump while the guest was booting up also failed.

---

(In reply to comment #13)
> Just to make things clear: the tests were conducted as root [...]

Yes, but it actually doesn't matter at all that they were conducted as root. The problem is that it is the qemu-kvm process that is responsible for generating the core dump, and that process runs as the user qemu:qemu. Hence it does not have permission to write to /root, and the dump fails. It's unintuitive, I admit. To confirm this, please try:

    # setenforce 1
    # virsh dump rhel6 /tmp/rhel6.core

Does that work?

Chris Lalancette

---

Yes, you are right. I also noticed that when there is insufficient disk space to save the vmcore, it gives the same slightly misleading message.

---

OK. So the problem is a user problem, but we could do much better with the error reporting.
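The failure mode described here is ordinary Unix permission checking: the dump file is created by the qemu user, not by root, so the target directory's mode bits decide the outcome. A minimal sketch of that check (hypothetical helper, not libvirt code; the qemu uid/gid of 107 is a made-up example):

```python
import stat

def writable_by(mode: int, dir_uid: int, dir_gid: int, uid: int, gid: int) -> bool:
    """Simplified Unix write check: owner bit, else group bit, else other bit.
    (Ignores root's override, capabilities, and ancestor-directory perms.)"""
    if uid == dir_uid:
        return bool(mode & stat.S_IWUSR)
    if gid == dir_gid:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)

QEMU_UID = QEMU_GID = 107        # hypothetical ids for the qemu user
ROOT_HOME = (0o700, 0, 0)        # /root is typically drwx------ root:root
TMP = (0o1777, 0, 0)             # /tmp is typically drwxrwxrwt root:root
```

With these values, `writable_by(*ROOT_HOME, QEMU_UID, QEMU_GID)` is False while `writable_by(*TMP, QEMU_UID, QEMU_GID)` is True, matching the observation that dumps succeed under /tmp but not /root.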
I'll leave this bug open to track better error reporting for:

1) low disk space
2) a permissions problem when trying to write the dump

Chris Lalancette

---

1) For the permissions problem:

```
[root@Osier libvirt]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda9              29G   28G    2G 100% /
# nc -U /var/lib/libvirt/qemu/$guest.monitor
{"execute": "qmp_capabilities"}
{"return": {}}
{"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"exec:cat | { dd bs=4096 seek=0 if=/dev/null && dd bs=1048576; } 1<>/root/f14.img"}}
{"return": {}}
{"timestamp": {"seconds": 1295951220, "microseconds": 249002}, "event": "STOP"}
{"timestamp": {"seconds": 1295951220, "microseconds": 252766}, "event": "RESUME"}
{"execute": "query-migrate"}
{"return": {"status": "failed"}}
```

2) For low disk space:

```
[root@Osier ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda9              29G   28G  2.6M 100% /
[root@Osier ~]# nc -U /var/lib/libvirt/qemu/f14.monitor
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 13, "major": 0}, "package": " (qemu-kvm-0.13.0)"}, "capabilities": []}}
{"execute": "qmp_capabilities"}
{"return": {}}
{"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"exec:cat | { dd bs=4096 seek=0 if=/dev/null && dd bs=1048576; } 1<>/tmp/f14.img"}}
{"return": {}}
{"execute": "query-migrate"}
{"return": {"status": "failed"}}
```

As we can see above, qemu returns no error from the "query-migrate" command, so we cannot know what exactly happened. It is also not practical for upper management layers to pre-check every condition that could cause a migration to fail. We need qemu to return the exact error in JSON format on migration failure.

As with https://bugzilla.redhat.com/show_bug.cgi?id=615941, this bug is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=670727.

---

*** Bug 664650 has been marked as a duplicate of this bug. ***

---

*** Bug 688807 has been marked as a duplicate of this bug. ***
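The transcripts above show why libvirt was stuck: on failure, the "query-migrate" reply carries only a bare status field with no error detail. A minimal sketch of what a client can extract from such a reply (plain Python for illustration, not libvirt code):

```python
import json

# A reply exactly as captured in the monitor sessions above.
reply = json.loads('{"return": {"status": "failed"}}')

def describe(reply: dict) -> str:
    """Everything a client can report from this reply - hence the vague message."""
    status = reply["return"]["status"]
    # There is no "error" key to inspect; the status string is all we get.
    return f"migration status: {status}"
```

Running `describe(reply)` yields only "migration status: failed", which is why libvirt could do no better than "Migration unexpectedly failed".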
Since the RHEL 6.1 External Beta has begun and this bug remains unresolved, it has been rejected, as it was not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, for the next release of Red Hat Enterprise Linux.

---

*** Bug 688804 has been marked as a duplicate of this bug. ***

---

This should be fixed, as we switched save/dump/restore to using the qemu "fd:" protocol several releases ago, e.g.:

```
commit 9497506fa0b2c8f34cf661f952aee2f5c0417974
Author: Eric Blake <eblake>
Date:   Tue Mar 1 21:59:25 2011 -0700

    qemu: allow simple domain save to use fd: protocol

    This allows direct saves (no compression, no root-squash NFS) to use
    the more efficient fd: migration, which in turn avoids a race where
    qemu exec: migration can sometimes fail because qemu does a generic
    waitpid() that conflicts with the pclose() used by exec:. Further
    patches will solve compression and root-squash NFS.

    * src/qemu/qemu_driver.c (qemudDomainSaveFlag): Use new function
      when there is no compression.
```

```
commit 6034ddd55954251f454ca0a0632d5bb6ef4a5db4
Author: Eric Blake <eblake>
Date:   Wed Mar 9 17:35:13 2011 -0700

    qemu: consolidate migration to file code

    This points out that core dumps (still) don't work for root-squash
    NFS, since the fd is not opened correctly. This patch should not
    introduce any functionality change, it is just a refactoring to
    avoid duplicated code.

    * src/qemu/qemu_migration.h (qemuMigrationToFile): New prototype.
    * src/qemu/qemu_migration.c (qemuMigrationToFile): New function.
    * src/qemu/qemu_driver.c (qemudDomainSaveFlag, doCoreDump): Use it.
```

With these patches, we no longer use qemu's "exec" protocol if the qemu version is greater than 0.12, which means we no longer suffer from qemu's popen (qemu's "exec" uses popen).

---

I think this BZ is saying that it's incorrect for the error message to say "migration" when the user requested a dump.
Of course, internally it's the same code, but the user has no way of knowing that. Unless that error message has been changed to make it clear that the dump operation failed, I don't think this has been fixed.

---

(In reply to comment #27)
> I think this BZ is saying that it's incorrect for the error message to say
> migration when the user requested a dump.

This was fixed a long time ago:

```
commit 5faf88fe9809045c15de5ddc7970aca46b516fe7
Author: Osier Yang <jyang>
Date:   Mon Dec 13 16:30:30 2010 +0800

    qemu: Introduce two new job types

    Currently, all of domain "save/dump/managed save/migration" use the
    same function "qemudDomainWaitForMigrationComplete" to wait for the
    job to finish, but the error messages are all about "migration";
    e.g. when a domain saving job is cancelled by the user, "migration
    was cancelled by client" is thrown as an error message, which is
    confusing for the user.

    As a solution, introduce two new job types (QEMU_JOB_SAVE,
    QEMU_JOB_DUMP), and set "priv->jobActive" to "QEMU_JOB_SAVE" before
    saving and to "QEMU_JOB_DUMP" before dumping, so that we can get
    the real job type in "qemudDomainWaitForMigrationComplete" and give
    a clearer message.

    And as it's not important to figure out what the exact job is in
    the DEBUG and WARN logs, and we don't need translated strings in
    logs, simply replace "migration" with "job" in some statements.

    * src/qemu/qemu_driver.c
```

> Of course, internally, it's the same code, but the user has no way of
> knowing that. Unless that error message has been changed to make it clear
> that the dump operation failed, I don't think this has been fixed.

And this bug is mainly complaining about the "unexpectedly failed" message, IMO, as nobody will know what it means.

---

(In reply to comment #28)
> (In reply to comment #27)
> > I think this BZ is saying that it's incorrect for the error message to say
> > migration when the user requested a dump.
>
> This was fixed a long time ago.
> commit 5faf88fe9809045c15de5ddc7970aca46b516fe7
> qemu: Introduce two new job types
> [full commit message quoted above]

The related fixed BZ: https://bugzilla.redhat.com/show_bug.cgi?id=639595

---

(In reply to comment #28)
> This was fixed a long time ago.

Ok, putting it in POST then.

---

Tested this issue with libvirt-0.9.9-1.el6.x86_64:

```
# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              16G   15G     0 100% /
tmpfs                 1.8G  376K  1.8G   1% /dev/shm

# virsh start test
Domain test started

# virsh dump test test.core
error: Failed to core dump domain test to test.core
error: operation failed: domain core dump job: unexpectedly failed

# virsh save test test.save
error: Failed to save domain test to test.save
error: operation failed: domain save job: unexpectedly failed
```

So, changing the status to VERIFIED.

---

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: The qemu monitor command 'query-migrate' does not return the underlying error.
Consequence: libvirt has no way to know what error happened, and has no choice but to report a generic error such as "Migration unexpectedly failed".
Fix: Refactor the underlying migration code to use the "fd:" protocol, which lets libvirt obtain the exact error.
Result: A much more precise error is reported when something goes wrong.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0748.html
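For reference, the clearer messages seen during verification ("domain save job: unexpectedly failed", "domain core dump job: unexpectedly failed") come from the job-type tracking introduced in commit 5faf88fe: remember which kind of job started, and name it when the shared wait loop fails. A minimal sketch of that mapping (hypothetical Python, not libvirt's actual C code):

```python
# Hypothetical labels keyed by the active job type; libvirt's real code
# uses QEMU_JOB_SAVE / QEMU_JOB_DUMP constants in priv->jobActive.
JOB_LABELS = {
    "migration": "migration job",
    "save": "domain save job",
    "dump": "domain core dump job",
}

def failure_message(active_job: str) -> str:
    """Build the user-facing error for a failed job of the given type,
    instead of always blaming "migration"."""
    return f"operation failed: {JOB_LABELS[active_job]}: unexpectedly failed"
```

For example, `failure_message("dump")` produces the same wording as the verified virsh output above.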