Bug 619846

Summary: virsh dump gives very cryptic error messages
Product: Red Hat Enterprise Linux 6 Reporter: Qian Cai <qcai>
Component: libvirt Assignee: Osier Yang <jyang>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: low    
Version: 6.0 CC: ajia, clalance, dallan, dyuan, eblake, gren, kxiong, mzhan, phan, rwu, vbian, xen-maint, yoyzhang, yupzhang
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: libvirt-0.9.9-1.el6 Doc Type: Bug Fix
Doc Text:
Cause: The qemu monitor command 'query-migrate' does not return the error. Consequence: libvirt has no way to know what error happened, and has no choice but to throw an error like "Migration unexpectedly failed". Fix: The underlying migration code was refactored to use the "fd:" protocol, so libvirt can get the exact error. Result: A much more exact error is reported if something goes wrong.
Last Closed: 2012-06-20 06:23:58 UTC Type: ---
Bug Depends On: 584077, 670727, 698496    
Bug Blocks:    
Attachments:
Description Flags
guest.xml none

Description Qian Cai 2010-07-30 17:10:04 UTC
Description of problem:
# virsh dump rhel6 rhel6.core
error: Failed to core dump domain rhel6 to rhel6.core
error: operation failed: Migration unexpectedly failed

Version-Release number of selected component (if applicable):
host & guest:
RHEL6.0-20100729.1
libvirt-0.8.1-20.el6
qemu-kvm-0.12.1.2-2.104.el6
kernel-2.6.32-54.el6

How reproducible:
always

Comment 1 Qian Cai 2010-07-30 17:11:40 UTC
Created attachment 435605 [details]
guest.xml

both i386 and x86_64 guest failed.

Comment 2 Chris Lalancette 2010-07-30 18:09:21 UTC
Hm, it works here for me, both with and without selinux in enforcing mode.  Is there anything in /var/log/messages from libvirt?  Can you edit /etc/libvirt/libvirtd.conf, set log_filters to:

log_filters="1:qemu"

Then restart libvirtd and try again?  Then there should be a lot more debugging output in /var/log/messages which may give more insight into the problem.

Chris Lalancette

Comment 3 Qian Cai 2010-07-31 00:29:34 UTC
I did that, and there was only one such message:
Jul 31 08:27:50 localhost libvirtd: 08:27:50.703: error : qemuDomainWaitForMigrationComplete:5371 : operation failed: Migration unexpectedly failed

It was certainly working before in rhel6.
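The libvirtd line above has a regular shape (timestamp, level, function:line, message), which makes it easy to pull out the reporting function when comparing logs. A minimal Python sketch, assuming only the layout visible in that one line; the pattern and the name parse_libvirtd_line are illustrative and not part of libvirt:

```python
import re

# Pattern derived from the single syslog line quoted above; other libvirt
# versions or log outputs may format their lines differently.
LIBVIRTD_LINE = re.compile(
    r"libvirtd: (?P<time>\d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>\w+) : (?P<func>\w+):(?P<line>\d+) : (?P<message>.*)$"
)


def parse_libvirtd_line(text):
    """Return the fields of a libvirtd log line as a dict, or None if no match."""
    match = LIBVIRTD_LINE.search(text)
    return match.groupdict() if match else None


sample = (
    "Jul 31 08:27:50 localhost libvirtd: 08:27:50.703: error : "
    "qemuDomainWaitForMigrationComplete:5371 : "
    "operation failed: Migration unexpectedly failed"
)
fields = parse_libvirtd_line(sample)
print(fields["level"], fields["func"], fields["line"], fields["message"])
```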

Comment 6 Han Pingtian 2010-08-05 04:31:00 UTC
I can virsh dump the guest's vmcore with the same software versions as Cai, with both host and guest running 2.6.32-54.el6.x86_64. The guest is a 2-CPU SMP system. The host is an AMD system.

Comment 7 Chris Lalancette 2010-08-05 15:37:31 UTC
Cai,
     What is the current working directory when you tried your virsh dump command?  Also, if you disable selinux (setenforce 0), does the dump work?

Thanks
Chris Lalancette

Comment 8 Qian Cai 2010-08-05 15:46:33 UTC
(In reply to comment #7)
> Cai,
>      What is the current working directory when you tried your virsh dump
> command?  Also, if you disable selinux (setenforce 0), does the dump work?
setenforce 0 works!

Comment 9 Qian Cai 2010-08-05 15:48:29 UTC
Surprisingly, when it failed, there was nothing obvious in audit.log as well.

Comment 10 Qian Cai 2010-08-05 15:57:22 UTC
> setenforce 0 works!    
It looks like that attempt was just dumb luck. The next attempts failed even with setenforce 0.

Comment 11 Qian Cai 2010-08-05 15:59:05 UTC
I tried virsh dump from both /root and /tmp. The only successful one was taken from /tmp.

Comment 12 Dave Allan 2010-08-05 20:26:30 UTC
This sounds like just a plain old permissions problem.  IIRC, the dump is going to be written as user qemu, so it's going to be able to write to /tmp but not /root.

Comment 13 Qian Cai 2010-08-06 07:07:00 UTC
Just to make things clear: the tests were run as root, and dumping to both /tmp and /root failed, with and without selinux, over many attempts. It worked only once, with selinux disabled, dumping to /tmp, while the guest was still booting (probably due to timing?). However, further testing showed that virsh dump during guest boot also failed.

Comment 14 Chris Lalancette 2010-08-06 13:58:20 UTC
(In reply to comment #13)
> Just to make things clear. The tests were conducted by root, and also dumping
> to both /tmp and /root could fail with and without selinux tried with many
> times. There was only one time it was worked under the circumstance that
> selinux was disabled and under /tmp, and also the guest was in the process of
> booting up (probably due to timing??). However, further testing showed that
> trying to virsh dump during the guest booting up also failed.    

Yes, but it actually doesn't matter at all that they were conducted by root.  The problem is that it's the qemu-kvm process that is responsible for generating the core-dump, and that is running as the user qemu:qemu.  Hence that process does not have permission to write to /root, and the dump fails.  It's unintuitive, I admit.

To confirm this, please do this:

# setenforce 1
# virsh dump rhel6 /tmp/rhel6.core

Does that work?

Chris Lalancette
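The permission reasoning in this comment can be sketched as a classic mode-bit check: what matters is the identity of the process that writes the file, not of the user who ran virsh. A minimal Python sketch; the uid/gid 107 for the qemu user and the 0550 mode for /root are assumptions for illustration, and the check deliberately ignores ACLs, SELinux and capabilities:

```python
import stat


def may_create_in(dir_mode, dir_uid, dir_gid, uid, gids):
    """Mode-bit check: can a process with (uid, gids) create a file in the directory?

    Creating a file needs both write and search (execute) permission on the
    directory. uid 0 is treated as always allowed.
    """
    if uid == 0:
        return True
    if uid == dir_uid:
        write_bit, search_bit = stat.S_IWUSR, stat.S_IXUSR
    elif dir_gid in gids:
        write_bit, search_bit = stat.S_IWGRP, stat.S_IXGRP
    else:
        write_bit, search_bit = stat.S_IWOTH, stat.S_IXOTH
    return bool(dir_mode & write_bit) and bool(dir_mode & search_bit)


QEMU_UID = 107      # hypothetical uid of the "qemu" user
QEMU_GIDS = {107}   # hypothetical gid of the "qemu" group

# A root-owned directory with mode 0550 (as /root commonly is) versus
# a world-writable sticky directory with mode 1777 (as /tmp commonly is).
print(may_create_in(0o550, 0, 0, QEMU_UID, QEMU_GIDS))
print(may_create_in(0o1777, 0, 0, QEMU_UID, QEMU_GIDS))
```

With these assumed values the first check is False and the second is True, which matches the /root versus /tmp behaviour described above.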

Comment 15 Qian Cai 2010-08-06 14:08:48 UTC
Yes, you are right. I also realized that when there is insufficient disk space to save the vmcore, it gives the same slightly misleading message.

Comment 16 Chris Lalancette 2010-08-06 14:35:49 UTC
OK.  So the problem is a user problem, but we could do much better with the error reporting.  I'll leave this bug open to track better error reporting for:

1)  Low disk space
2)  Permissions problem when trying to write the dump

Chris Lalancette
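Until the error reporting improves, the low disk space case can be ruled out by hand before dumping. A minimal Python sketch, assuming the caller supplies its own estimate of the dump size; has_room is an illustrative helper, not something virsh or libvirt provides:

```python
import shutil


def has_room(directory, needed_bytes):
    """Return True if the filesystem holding `directory` has at least
    `needed_bytes` free for an unprivileged writer."""
    return shutil.disk_usage(directory).free >= needed_bytes


# Example: check /tmp for 1 GiB before attempting a dump there.
print(has_room("/tmp", 1024 ** 3))
```

This only narrows down the cause; space can still run out between the check and the end of the dump.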

Comment 18 Osier Yang 2011-01-25 11:48:58 UTC
1) Permission problem:

[root@Osier libvirt]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda9              29G   28G   2G 100% /

# nc -U /var/lib/libvirt/qemu/$guest.monitor
{"execute": "qmp_capabilities"}
{"return": {}}
{"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"exec:cat | { dd bs=4096 seek=0 if=/dev/null && dd bs=1048576; } 1<>/root/f14.img"}}
{"return": {}}
{"timestamp": {"seconds": 1295951220, "microseconds": 249002}, "event": "STOP"}
{"timestamp": {"seconds": 1295951220, "microseconds": 252766}, "event": "RESUME"}
{"execute": "query-migrate"}
{"return": {"status": "failed"}}

2) Low disk space:

[root@Osier ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda9              29G   28G  2.6M 100% /
[root@Osier ~]# nc -U /var/lib/libvirt/qemu/f14.monitor
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 13, "major": 0}, "package": " (qemu-kvm-0.13.0)"}, "capabilities": []}}
{"execute": "qmp_capabilities"}
{"return": {}}
{"execute":"migrate","arguments":{"detach":true,"blk":false,"inc":false,"uri":"exec:cat | { dd bs=4096 seek=0 if=/dev/null && dd bs=1048576; } 1<>/tmp/f14.img"}} 
{"return": {}}
{"execute": "query-migrate"}
{"return": {"status": "failed"}}

As shown above, qemu returns no error detail from the "query-migrate" command, so we can't know what exactly happened. It is also not workable for the management layer to check in advance every condition that could cause the migration to fail. We need qemu to return the exact error in JSON format when migration fails.

Same as https://bugzilla.redhat.com/show_bug.cgi?id=615941, this bug is blocked
by https://bugzilla.redhat.com/show_bug.cgi?id=670727.
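The gap in the monitor replies above can be sketched from the client side: after skipping asynchronous events, the reply to "query-migrate" carries a status and nothing else. A minimal Python sketch working on already-captured monitor lines; the function names are illustrative, and the "error-desc" member is an assumption about newer QEMU releases that does not appear anywhere in the transcripts in this report:

```python
import json


def read_reply(lines):
    """Return the first QMP command reply, skipping the greeting and events."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if "return" in message or "error" in message:
            return message
    raise EOFError("monitor output ended before a reply arrived")


def migration_failure_detail(reply):
    """Return a failure description, or None if the migration did not fail."""
    info = reply.get("return", {})
    if info.get("status") != "failed":
        return None
    # Assumption: newer QEMU may add "error-desc"; the QEMU here does not.
    return info.get("error-desc", "no detail reported by QEMU")


transcript = [
    '{"timestamp": {"seconds": 1295951220, "microseconds": 249002}, "event": "STOP"}',
    '{"timestamp": {"seconds": 1295951220, "microseconds": 252766}, "event": "RESUME"}',
    '{"return": {"status": "failed"}}',
]
reply = read_reply(transcript)
print(migration_failure_detail(reply))
```

For the replies captured in this comment the only available answer is the fallback string, which is exactly why libvirt can report nothing better than "unexpectedly failed".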

Comment 19 Osier Yang 2011-01-25 12:32:58 UTC
*** Bug 664650 has been marked as a duplicate of this bug. ***

Comment 20 Osier Yang 2011-03-18 07:19:24 UTC
*** Bug 688807 has been marked as a duplicate of this bug. ***

Comment 21 Suzanne Logcher 2011-03-28 21:11:24 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains 
unresolved, it has been rejected as it is not proposed as an 
exception or blocker.

Red Hat invites you to ask your support representative to 
propose this request, if appropriate and relevant, in the 
next release of Red Hat Enterprise Linux.

Comment 23 Dave Allan 2011-06-08 17:59:32 UTC
*** Bug 688804 has been marked as a duplicate of this bug. ***

Comment 25 Osier Yang 2011-07-29 04:14:18 UTC
This should be fixed, as we switched save/dump/restore to using the qemu "fd:" protocol several releases ago, e.g.:

commit 9497506fa0b2c8f34cf661f952aee2f5c0417974
Author: Eric Blake <eblake>
Date:   Tue Mar 1 21:59:25 2011 -0700

    qemu: allow simple domain save to use fd: protocol
    
    This allows direct saves (no compression, no root-squash NFS) to use
    the more efficient fd: migration, which in turn avoids a race where
    qemu exec: migration can sometimes fail because qemu does a generic
    waitpid() that conflicts with the pclose() used by exec:.  Further
    patches will solve compression and root-squash NFS.
    
    * src/qemu/qemu_driver.c (qemudDomainSaveFlag): Use new function
    when there is no compression.

commit 6034ddd55954251f454ca0a0632d5bb6ef4a5db4
Author: Eric Blake <eblake>
Date:   Wed Mar 9 17:35:13 2011 -0700

    qemu: consolidate migration to file code
    
    This points out that core dumps (still) don't work for root-squash
    NFS, since the fd is not opened correctly.  This patch should not
    introduce any functionality change, it is just a refactoring to
    avoid duplicated code.
    
    * src/qemu/qemu_migration.h (qemuMigrationToFile): New prototype.
    * src/qemu/qemu_migration.c (qemuMigrationToFile): New function.
    * src/qemu/qemu_driver.c (qemudDomainSaveFlag, doCoreDump): Use
    it.

etc. etc.

With these patches, we no longer use qemu "exec:" if the qemu version is greater than 0.12, which means we no longer suffer from qemu's popen (qemu's "exec:" uses popen).
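The benefit of opening the target in the management process can be sketched in a few lines: a failed open() reports its errno directly, whereas a failure inside a shell pipeline started by "exec:" is hidden behind a generic status. A minimal Python illustration of the idea only; open_dump_target is a hypothetical helper, not libvirt's actual code, and handing the descriptor over to qemu is not shown:

```python
import os


def open_dump_target(path):
    """Open the dump file in the caller's process and return the descriptor.

    Any failure is reported with the exact OS error text instead of a
    generic "unexpectedly failed".
    """
    try:
        return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    except OSError as exc:
        reason = os.strerror(exc.errno)
        raise RuntimeError(f"cannot open dump target {path!r}: {reason}") from exc
```

Note that this covers errors at open time, such as a permission problem. A disk that fills up while the dump is being written is a separate case.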

Comment 27 Dave Allan 2011-07-29 13:28:24 UTC
I think this BZ is saying that it's incorrect for the error message to say migration when the user requested a dump.  Of course, internally, it's the same code, but the user has no way of knowing that.  Unless that error message has been changed to make it clear that the dump operation failed, I don't think this has been fixed.

Comment 28 Osier Yang 2011-07-29 14:15:55 UTC
(In reply to comment #27)
> I think this BZ is saying that it's incorrect for the error message to say
> migration when the user requested a dump. 

This was fixed a long time ago.

commit 5faf88fe9809045c15de5ddc7970aca46b516fe7
Author: Osier Yang <jyang>
Date:   Mon Dec 13 16:30:30 2010 +0800

    qemu: Introduce two new job types
    
    Currently, all of domain "save/dump/managed save/migration"
    use the same function "qemudDomainWaitForMigrationComplete"
    to wait the job finished, but the error messages are all
    about "migration", e.g. when a domain saving job is canceled
    by user, "migration was cancled by client" will be throwed as
    an error message, which will be confused for user.
    
    As a solution, intoduce two new job types(QEMU_JOB_SAVE,
    QEMU_JOB_DUMP), and set "priv->jobActive" to "QEMU_JOB_SAVE"
    before saving, to "QEMU_JOB_DUMP" before dumping, so that we
    could get the real job type in
    "qemudDomainWaitForMigrationComplete", and give more clear
    message further.
    
    And as It's not important to figure out what's the exact job
    is in the DEBUG and WARN log, also we don't need translated
    string in logs, simply repace "migration" with "job" in some
    statements.
    
    * src/qemu/qemu_driver.c

> Of course, internally, it's the same
> code, but the user has no way of knowing that.  Unless that error message has
> been changed to make it clear that the dump operation failed, I don't think
> this has been fixed.

And IMO this bug is mainly complaining about the "unexpectedly failed" part, as nobody will know what it means.
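The idea of that commit, one shared wait function whose message depends on the active job type, can be sketched as follows. A minimal Python illustration; the save and dump strings are taken from the verification output later in this report, while the migration string and all names here are assumptions, not libvirt's actual identifiers:

```python
from enum import Enum


class Job(Enum):
    MIGRATION = "migration job"      # assumed wording
    SAVE = "domain save job"         # as seen in the virsh save output
    DUMP = "domain core dump job"    # as seen in the virsh dump output


def job_error(job, reason):
    """Build an error message that names the job the user actually requested."""
    return f"operation failed: {job.value}: {reason}"


print(job_error(Job.DUMP, "unexpectedly failed"))
```

This addresses the "migration" wording only; the reason string itself is still whatever the lower layer can provide.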

Comment 29 Osier Yang 2011-07-29 14:21:18 UTC
(In reply to comment #28)

The related fixed BZ: https://bugzilla.redhat.com/show_bug.cgi?id=639595

Comment 30 Dave Allan 2011-07-29 20:01:35 UTC
(In reply to comment #28)
> (In reply to comment #27)
> > I think this BZ is saying that it's incorrect for the error message to say
> > migration when the user requested a dump. 
> 
> This was fixed long time ago.

Ok, putting in POST then.

Comment 34 yuping zhang 2012-01-10 07:32:18 UTC
Tested this issue with libvirt-0.9.9-1.el6.x86_64

# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              16G   15G     0 100% /
tmpfs                 1.8G  376K  1.8G   1% /dev/shm

# virsh start test
Domain test started

# virsh dump test test.core
error: Failed to core dump domain test to test.core
error: operation failed: domain core dump job: unexpectedly failed

# virsh save test test.save
error: Failed to save domain test to test.save
error: operation failed: domain save job: unexpectedly failed

So change the status to VERIFIED.

Comment 35 Osier Yang 2012-05-04 09:39:17 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: The qemu monitor command 'query-migrate' does not return the error.
Consequence: libvirt has no way to know what error happened, and has no choice but to throw an error like "Migration unexpectedly failed".
Fix: The underlying migration code was refactored to use the "fd:" protocol, so libvirt can get the exact error.
Result: A much more exact error is reported if something goes wrong.

Comment 37 errata-xmlrpc 2012-06-20 06:23:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0748.html