Bug 1481595 - [7.4-Alt] Unable to execute QEMU command 'dump-guest-memory': dump: failed to save memory
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.4-Alt
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.6
Assignee: Laurent Vivier
QA Contact: Minjia Cai
URL:
Whiteboard:
Depends On:
Blocks: 1513404 1528344 1572554 1578741
 
Reported: 2017-08-15 07:09 UTC by yilzhang
Modified: 2018-11-01 11:01 UTC (History)
13 users

Fixed In Version: qemu-kvm-rhev-2.12.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1572554 (view as bug list)
Environment:
Last Closed: 2018-11-01 11:01:10 UTC
Target Upstream Version:


Attachments
libvirt log. Time is not correct on Host (1.07 MB, text/plain)
2017-08-17 07:43 UTC, yilzhang


Links
System ID Priority Status Summary Last Updated
IBM Linux Technology Center 157818 None None None 2019-07-26 01:16:17 UTC
Red Hat Knowledge Base (Solution) 3437191 None None None 2018-05-09 05:33:16 UTC

Description yilzhang 2017-08-15 07:09:21 UTC
Description of problem:
Start a guest and then use the pvpanic device to trigger a guest crash; afterwards, the coredump file is not created on the host.


Version-Release number of selected component (if applicable):
Host: 
    kernel: 4.11.0-22.el7a.ppc64le
    qemu-kvm-2.9.0-20.el7a.ppc64le
    SLOF-20170303-4.git66d250e.el7.noarch
Guest kernel: 4.11.0-22.el7a.ppc64le

How reproducible: 100%



Steps to Reproduce:
1. Define a VM and boot it up, for example:
virsh define test.xml
virsh start test
2. Inside the guest, issue a command to make it crash
[Guest] # systemctl stop kdump
[Guest] # echo c > /proc/sysrq-trigger

3. Check that the crash coredump file is automatically created on the host
[Host]# ls -lh /var/lib/libvirt/qemu/dump/
total 4.3G
-rw-------. 1 root root 4.3G May 14 14:52 7-guest-2017-05-14-14:51:33

[Host]# ls -lh /var/lib/libvirt/qemu/dump/
total 0



Actual results:
In step 3, the core file was created, but after a while (doing nothing but waiting), the core file disappears.
In /var/log/messages:
May 14 14:20:02 virt8 libvirtd: 2017-05-14 18:20:02.265+0000: 14365: error : qemuMonitorJSONCheckError:389 : internal error: unable to execute QEMU command 'dump-guest-memory': dump: failed to save memory
May 14 14:20:02 virt8 libvirtd: 2017-05-14 18:20:02.268+0000: 14365: error : qemuMonitorJSONCheckError:389 : internal error: unable to execute QEMU command 'closefd': File descriptor named 'dump' not found
May 14 14:20:02 virt8 libvirtd: 2017-05-14 18:20:02.268+0000: 14365: warning : qemuMonitorDumpToFd:2733 : failed to close dumping handle
May 14 14:20:03 virt8 libvirtd: 2017-05-14 18:20:03.200+0000: 14365: error : doCoreDumpToAutoDumpPath:4117 : operation failed: Dump failed



Expected results:
The coredump file should be created successfully under /var/lib/libvirt/qemu/dump/, and "crash" tool can analyse it.

Additional info:
# cat test.xml
  <domain type='kvm' id='1'>
    <name>guest</name>
    <memory unit='KiB'>8388608</memory>
    <currentMemory unit='KiB'>8388608</currentMemory>
    <vcpu placement='static'>16</vcpu>
    <resource>
      <partition>/machine</partition>
    </resource>
    <os>
      <type arch='ppc64' machine='pseries'>hvm</type>
      <boot dev='hd'/>
      <boot dev='network'/>
      <bootmenu enable='yes'/>
    </os>
    <cpu>
      <topology sockets='4' cores='4' threads='1'/>
    </cpu>
    <clock offset='utc'/>
    <on_poweroff>destroy</on_poweroff>
    <on_reboot>restart</on_reboot>
    <on_crash>coredump-restart</on_crash>
    <devices>
      <emulator>/usr/libexec/qemu-kvm</emulator>
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2' cache='none'/>
        <source file='/home/yilzhang/dump/rhel7.4-alt-20170726.0__.qcow2'/>
        <backingStore/>
        <target dev='sda' bus='scsi'/>
        <alias name='scsi0-0-0-0'/>
        <address type='drive' controller='0' bus='0' target='0' unit='0'/>
      </disk>
      <disk type='file' device='cdrom'>
        <driver name='qemu' type="aio"  io='native' cache="none"/>
        <target dev='sdc' bus='scsi'/>
        <readonly/>
      </disk>
      <interface type='bridge'>
       <mac address='52:54:00:c3:e7:8e'/>
       <source bridge='switch'/>
       <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
      </interface>
    </devices>
  </domain>

Comment 2 yilzhang 2017-08-15 08:32:59 UTC
Power8 + qemu-kvm-rhev doesn't have this issue.


Host kernel: 3.10.0-693.el7.ppc64le
     qemu-kvm-rhev-2.9.0-16.el7_4.3.ppc64le
Guest kernel: 3.10.0-675.el7.ppc64le

Comment 3 Laurent Vivier 2017-08-16 14:24:09 UTC
Could you provide logs from libvirt?

Comment 4 yilzhang 2017-08-17 07:43:32 UTC
Created attachment 1314585 [details]
libvirt log. Time is not correct on Host

Comment 5 Laurent Vivier 2017-08-17 10:02:12 UTC
(In reply to yilzhang from comment #4)
> Created attachment 1314585 [details]
> libvirt log. Time is not correct on Host

Thank you.

Could you check you have enough space on the disk (with "df -h /var/lib/libvirt/qemu/dump/")?

As your VM is defined with 8GB of memory, you need at least 8GB of free space on the disk.
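Laurent's rule of thumb (free space ≥ guest memory) can be checked before triggering a dump. A minimal sketch in Python — `has_space_for_dump` is a hypothetical helper, not part of libvirt:

```python
import os

def has_space_for_dump(dump_dir: str, guest_mem_bytes: int) -> bool:
    """True if dump_dir's filesystem has at least guest_mem_bytes available."""
    st = os.statvfs(dump_dir)
    return st.f_bavail * st.f_frsize >= guest_mem_bytes

# A guest defined with 8 GiB of memory needs roughly 8 GiB free
# (libvirt's default dump directory is /var/lib/libvirt/qemu/dump/).
required = 8 * 1024 ** 3
print(has_space_for_dump(".", required))
```

Run against the actual dump directory on the host before issuing `virsh dump` or waiting on `on_crash=coredump-restart` to avoid hitting ENOSPC partway through.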

Comment 6 yilzhang 2017-08-18 05:58:43 UTC
Yes, there is not enough space left:
[root@virt8 ~]# df -h /var/lib/libvirt/qemu/dump/
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/rhel7--pegas-root   16G   11G  5.3G  68% /

I decreased the VM's memory to 4G and successfully got the dump file just now.

I don't know how the upper layer (e.g. virt-manager) handles this kind of failure.

I just want to know: is the error message printed by libvirtd expected? It should probably print an ENOSPC message in this case, I think.

Also, the incomplete dump file disappears automatically, which is a bit confusing to me. Please help clarify this, thank you very much.
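The disappearing file is consistent with a writer that unlinks its partial output when the write fails — a common cleanup pattern, sketched below in Python (illustrative only; this is not libvirt's actual code):

```python
import os

def write_dump(path, chunks):
    """Write chunks to path; on failure, remove the incomplete file and re-raise."""
    try:
        with open(path, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
    except OSError:
        if os.path.exists(path):
            os.unlink(path)  # discard the incomplete dump
        raise
```

With this pattern, an ENOSPC hit mid-dump leaves no file behind, matching the behaviour observed in step 3.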

Comment 7 Laurent Vivier 2017-08-18 07:39:00 UTC
(In reply to yilzhang from comment #6)
> Yes, there is not enough space left:
> [root@virt8 ~]# df -h /var/lib/libvirt/qemu/dump/
> Filesystem                     Size  Used Avail Use% Mounted on
> /dev/mapper/rhel7--pegas-root   16G   11G  5.3G  68% /
> 
> I decreased VM's memory to 4G, and successfully got the dump file just now.
> 
> 
> I don't know how upper layer(e.g. virt-manager) handles this kind of failure.
> 
> I just want to know is the error message printed by libvirtd expected?
> Probably it should print some ENOSPC message in this case, I think.
> 
> As well, the incomplete dump file disappears automatically, which is a bit
> confusing to me. Please help to clarify it, thank you very much.

So this bug is not specific to ppc64le.

We can modify QEMU to report a more accurate error message, but will it be used by libvirt?

Andrea, any comment?

Comment 8 David Gibson 2017-08-21 03:54:20 UTC
In any case this isn't especially urgent, since it's just about making an error message nicer.

Deferring.

Comment 9 Andrea Bolognani 2017-08-28 13:56:46 UTC
(In reply to Laurent Vivier from comment #7)
> We can modify QEMU to report a more accurate error message, but will it be
> used by libvirt?

It depends on your expectations.

If you initiate the dump manually from the host using virsh,
the QEMU error will be displayed:

  # sudo virsh dump guest /var/lib/libvirt/qemu/dump/guest --format elf --memory-only
  error: Failed to core dump domain guest to /var/lib/libvirt/qemu/dump/guest
  error: internal error: unable to execute QEMU command 'dump-guest-memory': dump: failed to save memory

However, in the situation described above there is no client
connected, so the only way libvirt can report the error is
through the log.

So the error message won't be any more visible to the user
than it is now, but at least it will be more helpful.

Comment 10 Karen Noel 2017-09-22 11:34:20 UTC
Move to qemu-kvm-rhev. This fix will apply to both RHEL KVM and qemu-kvm-rhev for RHV and RHOSP. Both packages are using the same code base.

Comment 11 IBM Bug Proxy 2018-01-25 13:51:22 UTC
------- Comment From yasmins@br.ibm.com 2018-01-25 08:48 EDT-------
I am working on it.

Comment 12 IBM Bug Proxy 2018-02-09 19:40:56 UTC
------- Comment From yasmins@br.ibm.com 2018-02-09 14:40 EDT-------
Sent the patch 'dump: Show custom message for ENOSPC' to qemu-devel for review.

Comment 13 IBM Bug Proxy 2018-03-05 19:11:02 UTC
------- Comment From yasmins@br.ibm.com 2018-03-05 14:03 EDT-------
The patch has been reviewed and approved. I'll update the bug status as soon as it gets merged to master.

Comment 14 Laurent Vivier 2018-03-21 12:39:26 UTC
Yasmin,

As your patch has not been merged I've sent a new patch addressing comments given by Eric Blake:

dump: display cause of write failure
http://patchwork.ozlabs.org/patch/888783/

Comment 15 IBM Bug Proxy 2018-03-22 14:11:25 UTC
------- Comment From hannsj_uhl@de.ibm.com 2018-03-22 10:04 EDT-------
(In reply to comment #14)
> Yasmin,
> As your patch has not been merged I've sent a new patch addressing comments
> given by Eric Blake:
> dump: display cause of write failure
> http://patchwork.ozlabs.org/patch/888783/
.
... which I think is now finally upstream accepted as git commit
https://git.qemu.org/gitweb.cgi?p=qemu.git;a=commit;h=0c33659d09f4a8ab926846295538d6a67e8c2c63
("dump.c: allow fd_write_vmcore to return errno on failure")
... please correct me if I am wrong ...

Comment 16 Laurent Vivier 2018-03-22 14:15:48 UTC
(In reply to IBM Bug Proxy from comment #15)
> ------- Comment From hannsj_uhl@de.ibm.com 2018-03-22 10:04 EDT-------
> (In reply to comment #14)
> > Yasmin,
> > As your patch has not been merged I've sent a new patch addressing comments
> > given by Eric Blake:
> > dump: display cause of write failure
> > http://patchwork.ozlabs.org/patch/888783/
> .
> ... which I think is now finally upstream accepted as git commit
> https://git.qemu.org/gitweb.cgi?p=qemu.git;a=commit;
> h=0c33659d09f4a8ab926846295538d6a67e8c2c63
> ("dump.c: allow fd_write_vmcore to return errno on failure")
> ... please correct me if I am wrong ...

In fact, the one merged is v3 of Yasmin's patch, but it does the same thing; I'm going to backport it.
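The idea of the merged patch ("dump.c: allow fd_write_vmcore to return errno on failure") is to propagate the write error code up so the final message can append strerror() text. A Python sketch of the concept (not the actual QEMU code; function names mirror the patch for illustration):

```python
import errno
import io
import os

def fd_write_vmcore(buf: bytes, fd) -> int:
    """Mimic the patched helper: return 0 on success, -errno on write failure."""
    try:
        fd.write(buf)
    except OSError as e:
        return -e.errno
    return 0

def report(ret: int) -> str:
    """Build the user-visible message, appending strerror() when ret < 0."""
    if ret == 0:
        return "ok"
    return f"dump: failed to save memory: {os.strerror(-ret)}"

print(report(fd_write_vmcore(b"data", io.BytesIO())))
print(report(-errno.ENOSPC))  # dump: failed to save memory: No space left on device
```

This is how the bare "dump: failed to save memory" becomes the more explicit "dump: failed to save memory: No space left on device".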

Comment 17 Laurent Vivier 2018-04-25 11:29:06 UTC
Moving state to POST as the fix will come with the rebase to qemu v2.12.0.

Comment 18 Minjia Cai 2018-04-26 08:00:57 UTC
Reproduce:

Version-Release number of selected component (if applicable):
Host: 
    kernel: 3.10.0-862.el7.ppc64le
    qemu-kvm-ma-2.10.0-21.el7.ppc64le
    SLOF-20170724-2.git89f519f.el7.noarch


Guest kernel: 3.10.0-862.el7.ppc64le

How reproducible: 100%



Steps to Reproduce:
1. Define a VM and boot it up, for example:
virsh define guest.xml
[root@ibm-p8-07 micai]# cat guest.xml
<domain type='kvm'>
  <name>rhel75</name>
  <memory unit='GB'>30</memory>
  <currentMemory unit='GB'>30</currentMemory>
  <vcpu placement='static'>24</vcpu>
  <os>
    <type arch='ppc64le'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>coredump-restart</on_crash>
   <devices>
   <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/home/micai/rhel75.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <graphics type='vnc' port='1' autoport='yes' listen='0.0.0.0'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
  </devices>
</domain>

virsh start rhel75
2. Inside the guest, issue a command to make it crash
[Guest] # systemctl stop kdump
[Guest] # echo c > /proc/sysrq-trigger

3. Check that the crash coredump file is automatically created on the host
[root@ibm-p8-07 micai]# ls -lh /var/lib/libvirt/qemu/dump/
total 21G
-rw------- 1 root root 21G Apr 26 03:50 5-rhel75-2018-04-26-03:50:40


[root@ibm-p8-07 micai]# ls -lh /var/lib/libvirt/qemu/dump/
total 0

This is the same result as comment 0.

Comment 19 Laurent Vivier 2018-04-26 08:37:45 UTC
(In reply to Minjia Cai from comment #18)
> Reproduce:
> 
> Version-Release number of selected component (if applicable):
> Host: 
>     kernel: 3.10.0-862.el7.ppc64le
>     qemu-kvm-ma-2.10.0-21.el7.ppc64le
>     SLOF-20170724-2.git89f519f.el7.noarch
> 
> 
> Guest kernel: kernel: 3.10.0-862.el7.ppc64le
> 
> How reproducible: 100%
...
> This is the same result as comment 0.

The fix will be in qemu-kvm-rhev-2.12.0 (coming with the rebase).

For qemu-kvm-ma-2.10.0, the rhel-7.5.z flag must be set to + and the BZ cloned.

And it will not change the behavior: the error message is only more explicit, and I don't know whether libvirt (virsh) will report it to you.

Comment 20 Minjia Cai 2018-04-26 09:30:57 UTC
(In reply to Laurent Vivier from comment #19)
> (In reply to Minjia Cai from comment #18)
> > Reproduce:
> > 
> > Version-Release number of selected component (if applicable):
> > Host: 
> >     kernel: 3.10.0-862.el7.ppc64le
> >     qemu-kvm-ma-2.10.0-21.el7.ppc64le
> >     SLOF-20170724-2.git89f519f.el7.noarch
> > 
> > 
> > Guest kernel: kernel: 3.10.0-862.el7.ppc64le
> > 
> > How reproducible: 100%
> ...
> > This is the same result as comment 0.
> 
> The fix will be in qemu-kvm-rhev-2.12.0 (coming with the rebase).
> 
> For qemu-kvm-ma-2.10.0, the rhel-7.5.z must be set to + and the BZ cloned.
> 
> And it will not change the behavior, the error message is only more explicit
> but I don't know if libvirt (virsh) will report it to you.

Sorry, comment 18 was misunderstood. I just took over this feature and planned to reproduce the issue myself first; I will then verify it with the fixed qemu-2.12 build.

Comment 24 Minjia Cai 2018-05-10 00:53:29 UTC



Version-Release number of selected component (if applicable):
Host: 
    kernel: 3.10.0-883.el7.ppc64le 
    qemu-kvm-rhev-2.12.0-1.el7.ppc64le
    SLOF-20170724-2.git89f519f.el7.noarch


Guest kernel: 3.10.0-883.el7.ppc64le


Steps to verify:
1. Define a VM and boot it up, for example:
virsh define guest.xml
[root@ibm-p8-rhevm-13  micai]# cat guest.xml
<domain type='kvm'>
  <name>rhel75</name>
  <memory unit='GB'>45</memory>
  <currentMemory unit='GB'>30</currentMemory>
  <vcpu placement='static'>24</vcpu>
  <os>
    <type arch='ppc64le'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>coredump-restart</on_crash>
   <devices>
   <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/home/micai/rhel75.qcow2'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <graphics type='vnc' port='1' autoport='yes' listen='0.0.0.0'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
  </devices>
</domain>

virsh start rhel75
2. Inside the guest, issue a command to make it crash
[Guest] # systemctl stop kdump
[Guest] # echo c > /proc/sysrq-trigger

3. Check that the crash coredump file is automatically created on the host
[root@ibm-p8-rhevm-13 dump]# df -h /var/lib/libvirt/qemu/dump/
Filesystem                                Size  Used Avail Use% Mounted on
/dev/mapper/rhel_ibm--p8--rhevm--13-root   50G   50G  745M  99% /
[root@ibm-p8-rhevm-13 dump]# ls -lh /var/lib/libvirt/qemu/dump/
total 44G
-rw------- 1 root root 43G May  9 05:57 8-rhel75-2018-05-09-05:53:26

Wait ten minutes.
[root@ibm-p8-rhevm-13 dump]# ls -lh /var/lib/libvirt/qemu/dump/
total 43G
-rw------- 1 root root 43G May  9 05:57 8-rhel75-2018-05-09-05:53:26
[root@ibm-p8-rhevm-13 dump]# df -h /var/lib/libvirt/qemu/dump/
Filesystem                                Size  Used Avail Use% Mounted on
/dev/mapper/rhel_ibm--p8--rhevm--13-root   50G   48G  2.8G  95% /

 
The coredump file is created on the host and it doesn't go away. The fix is verified.

Comment 26 Minjia Cai 2018-05-11 01:40:58 UTC
I used the qemu command line to start the guest directly.
(qemu) dump-guest-memory  /var/lib/libvirt/qemu/dump/test
dump: failed to save memory: No space left on device
(qemu) 

This is a clear message. Regarding comment 25: when using libvirt, where should I look for the error message?

Comment 27 Laurent Vivier 2018-05-14 09:16:33 UTC
(In reply to Minjia Cai from comment #26)
> I use the qemu command to start the guest.
> (qemu) dump-guest-memory  /var/lib/libvirt/qemu/dump/test
> dump: failed to save memory: No space left on device
> (qemu) 
> 
> This is a clear reminder. According to comment25. When using libvirt, where
> should I view the error message?

I think the answer is in comment 9: libvirt logs. But perhaps Andrea can give more details?

Comment 28 Andrea Bolognani 2018-05-24 14:13:37 UTC
(In reply to Laurent Vivier from comment #27)
> (In reply to Minjia Cai from comment #26)
> > I use the qemu command to start the guest.
> > (qemu) dump-guest-memory  /var/lib/libvirt/qemu/dump/test
> > dump: failed to save memory: No space left on device
> > (qemu) 
> > 
> > This is a clear reminder. According to comment25. When using libvirt, where
> > should I view the error message?
> 
> I think the answer is in comment 9: libvirt logs. But perhaps Andrea can
> give more details?

I too expected the error message to be in the guest log, but it's
not there.

It looks like libvirt is not able to retrieve the return value for
the dump job (which is started asynchronously) correctly, despite
QEMU reporting it:

  # In shell #1, run

  $ sudo virsh qemu-monitor-event guest --loop

  # In shell #2, run

  $ sudo virsh qemu-monitor-command guest '{"execute": "dump-guest-memory", "arguments": {"protocol": "file:/small/guest.dump", "paging": "false", "detach": true}}'
  {"return":{},"id":"libvirt-25"}

  # Back to shell #1, we now see

  event DUMP_COMPLETED at 1527170776.767879 for domain guest: {"result":{"total":4294967296,"status":"failed","completed":950140928},"error":"dump: failed to save memory: No space left on device"}

I'll look into it, but QEMU is clearly reporting all the expected
information at this point and libvirt not exposing it to the user
is the remaining issue; see Bug 1578741 for the latter.

Comment 29 errata-xmlrpc 2018-11-01 11:01:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443

