Bug 1892681 - [CBT] VM will be corrupted during full backup if the user performs reboot inside guest OS of the VM
Status: CLOSED CURRENTRELEASE
Product: vdsm
Classification: oVirt
Component: General
Version: 4.40.60.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.8
Target Release: 4.40.80.5
Assignee: Nir Soffer
QA Contact: Amit Sharir
Depends On: 1900326 1918966 1960137 1973829
 
Reported: 2020-10-29 12:49 UTC by Yuriy Khokhlov (Veeam)
Modified: 2021-11-04 19:28 UTC (History)
CC List: 15 users

Fixed In Version: vdsm-4.40.80.5
Last Closed: 2021-09-03 10:09:54 UTC
oVirt Team: Storage
pm-rhel: ovirt-4.4+


Attachments
  Backup and restore logs (2.21 MB, application/zip), 2020-11-05 15:03 UTC, Yury.Panchenko
  Backup and restore logs 4.4.2 (3.89 MB, application/zip), 2020-11-18 15:04 UTC, Yury.Panchenko
  Failed to verify logs (950.46 KB, application/zip), 2021-05-03 10:08 UTC, Ilan Zuckerman
  Failed to verify 2 with VM_1_raw_disk_virtio (9.90 MB, application/zip), 2021-05-12 06:19 UTC, Ilan Zuckerman


Links
  oVirt gerrit 114307: MERGED, spec: update qemu-kvm requirement (last updated 2021-07-26 10:02:48 UTC)
  oVirt gerrit 115865 (master): MERGED, spec: update qemu-kvm requirement (last updated 2021-08-09 18:37:50 UTC)

Description Yuriy Khokhlov (Veeam) 2020-10-29 12:49:34 UTC
Description of problem:
VM will be corrupted during full backup if the user performs a reboot inside the guest OS of the VM


Version-Release number of selected component (if applicable):
Two RHEV 4.4 nodes with layer rhvh-4.4.1.1-0.20200722.0+1 
Ovirt-manager 4.4.1
NFS 4.1 storage domain
ImageIO version:
# rpm -qa | grep image
ovirt-imageio-client-2.0.9-1.el8ev.x86_64
genisoimage-1.1.11-39.el8.x86_64
ovirt-imageio-common-2.0.9-1.el8ev.x86_64
ovirt-imageio-daemon-2.0.9-1.el8ev.x86_64
redhat-virtualization-host-image-update-placeholder-4.4.1-1.el8ev.noarch


How reproducible:


Steps to Reproduce:
1) Start VM with guest CentOS 8 system inside.
2) Start full backup.
3) When the backup starts processing the VM's disk, issue a reboot from inside the guest OS.
4) Make a restore to a new location (create a new VM from OVF).


Actual results:
The VM won't boot (because of FS metadata corruption, in our case).

Expected results:
The restored VM's disk should not be corrupted, and the VM should boot normally.

Additional info:
We will provide logs soon.

Comment 1 Eyal Shenitzky 2020-11-03 11:53:39 UTC
Peter,

Is there something that libvirt/qemu can do in order to prevent this?

From the RHV POV, we cannot tell which operations are done inside the guest during a backup, and we cannot block them from happening.

Comment 2 Peter Krempa 2020-11-03 12:59:35 UTC
This bug is very sparse on information, but I'll try to guess.

By default a guest OS reboot should not cause any problems as qemu still tracks which blocks were changed.

In certain cases the VM configuration actually causes the VM to be terminated and re-started on a guest-initiated reboot (missing VM XML, so I can't tell). In such cases, since the backup job doesn't survive QEMU termination, it will obviously fail and the backup will be incomplete. Trying to restore an incomplete backup obviously won't work well.

Comment 3 Yuriy Khokhlov (Veeam) 2020-11-04 08:40:13 UTC
Let me comment on Peter's words.

>In certain cases the VM configuration actually causes the VM to be terminated and re-started on guest-initiated reboot (missing VM xml so I can't tell).
Yury will provide you with the XML of the VM configuration which causes **silent** data corruption on backup, so you will be able to reproduce it on your own.

>In such cases obviously, as the backup job doesn't survive QEMU termination, will fail and the backup will be incomplete.
Yes, I understand this. Since RHEV blocks the VM from powering off while it is in the backup state, the hypervisor should ignore any VM settings (in the XML) which can cause "the VM to be terminated and re-started on guest-initiated reboot".
It currently looks like a user could provoke data corruption on any reboot by setting certain parameters in the VM configuration. Users should be protected from making such "mistakes".

>Trying to restore an incomplete backup obviously won't work well.
The current implementation doesn't warn users about an incomplete backup; users get **silent** data corruption.
They think they have protected a VM, but in reality they have not!

Veeam considers this issue very important because, in the case of an unsuccessful restore, the first point of contact will be Veeam Customer Support. It's a matter of the company's image.

Comment 4 Yury.Panchenko 2020-11-04 15:34:09 UTC
This case was confirmed on different VMs. I don't use any specific configuration,
just a new VM created with default parameters.

Comment 5 Eyal Shenitzky 2020-11-05 11:19:02 UTC
The VM XML can be fetched from the engine log.
Can you please attach the needed logs to the bug (engine/vdsm/libvirt)?

Comment 6 Yury.Panchenko 2020-11-05 15:03:43 UTC
Created attachment 1726903 [details]
Backup and restore logs

Hello, I uploaded the needed logs.
Backup started at 11/5/2020 15:42
Restore started at 11/5/2020 15:47

Comment 7 Eyal Shenitzky 2020-11-11 10:57:56 UTC
Yury, can you please update to the latest engine version (4.4.2)?
Also, can you please provide the versions of libvirt and QEMU?

Comment 8 Eyal Shenitzky 2020-11-11 11:00:09 UTC
Ilan, can you try to see if this issue reproduces on your environments?

Comment 9 Ilan Zuckerman 2020-11-11 12:30:42 UTC
(In reply to Eyal Shenitzky from comment #8)
> Ilan, can you try to see if this issue reproduces on your environments?

In order to try to reproduce this, I am missing the detailed flow.

The full backup flow, as I see it, is as follows [1].
Where in this process does the reboot occur? If it occurs between steps 1-3, then how were they able to download the disks?


[1]:

1. Issue full backup to the vm disk
2. wait for status of backup to be 'ok'
3. download the backed up disk
4. finalize the backup (to release the locked disk of the VM)
5. Upload the backed up disk + attach it to a new blank VM + mount it.

Comment 10 Yury.Panchenko 2020-11-11 15:49:28 UTC
> Where in this process the reboot occurs? If it occurs in between steps 1-3, then how were they able to download the disks?
When step 3 has started, you need to reboot the guest.
I don't understand how any activity inside the guest can break an external disk download.

Comment 11 Yury.Panchenko 2020-11-11 16:20:54 UTC
Hello, Eyal.
> Yury, can you please update the version to the latest engine version (4.4.2).
I confirm that the issue is still present in a 4.4.2 environment.

> Also, can you please provide also the version of Libvirt and QEMU?
I will provide the exact numbers later today. I use the release version of RHV 4.4.2.

Comment 12 Tal Nisan 2020-11-16 15:08:08 UTC
Yuri, we are still waiting for the version numbers. Also, please be sure to needinfo the right Eyal; you needinfo'd another Eyal.

Comment 13 Yury.Panchenko 2020-11-18 15:01:22 UTC
Hello Tal.
> Yuri, still waiting for the version numbers
[root@pan-rhv44rel1 ~]# rpm -qa | grep engine
ovirt-hosted-engine-setup-2.4.6-2.el8ev.noarch
ovirt-ansible-engine-setup-1.2.4-1.el8ev.noarch
ovirt-hosted-engine-ha-2.4.4-1.el8ev.noarch
ovirt-ansible-hosted-engine-setup-1.1.8-1.el8ev.noarch
python3-ovirt-engine-sdk4-4.4.4-1.el8ev.x86_64

[root@pan-rhv44rel1 ~]# rpm -qa | grep qemu
qemu-kvm-common-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
qemu-kvm-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
qemu-img-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
qemu-kvm-block-iscsi-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
qemu-kvm-block-gluster-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
ipxe-roms-qemu-20181214-5.git133f4c47.el8.noarch
qemu-kvm-block-curl-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
qemu-kvm-block-ssh-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
libvirt-daemon-driver-qemu-6.0.0-25.2.module+el8.2.1+7722+a9e38cf3.x86_64
qemu-kvm-block-rbd-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
qemu-kvm-core-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64

Also, for a more reliable reproduction, please generate 4 GB of data inside the guest. On a default CentOS installation it is hard to catch the moment when the disk download starts while still having enough time to reboot.
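
For reference, a minimal sketch of one way to generate that data (run inside the guest; the output path /var/tmp/fill.bin is only an example):

    # Write ~4 GiB of incompressible data so the disk download is long
    # enough to reboot while it is still running.
    import os

    CHUNK = 64 * 1024 * 1024          # 64 MiB per write
    with open("/var/tmp/fill.bin", "wb") as f:
        for _ in range(64):           # 64 * 64 MiB = 4 GiB
            f.write(os.urandom(CHUNK))
        f.flush()
        os.fsync(f.fileno())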

Comment 14 Yury.Panchenko 2020-11-18 15:04:41 UTC
Created attachment 1730584 [details]
Backup and restore logs 4.4.2

Backup started at 11/18/20 15:32

Comment 15 Nir Soffer 2020-11-18 15:32:05 UTC
(In reply to Yury.Panchenko from comment #13)
> Hello Tal.
> > Yuri, still waiting for the version numbers

We need version for these packages:

On engine host:

    rpm -q ovirt-engine ovirt-imageio-daemon

On hosts:

    rpm -q vdsm ovirt-imageio-daemon qemu-kvm libvirt-daemon

> [root@pan-rhv44rel1 ~]# rpm -qa | grep qemu
> qemu-kvm-common-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64

This is RHEL 8.2 (are you testing on CentOS?). In particular, qemu-4.2.0
is too old to test incremental backup. You must use qemu >= 5.1.

You should test using the RHEL 8.3 snapshot available to partners, or,
if you want the most recent version, the same version we build for QE.

These are the versions we currently use for development:

$ rpm -q vdsm ovirt-imageio-daemon qemu-kvm libvirt-daemon
vdsm-4.40.35.1-202010291125.gita43f00858.el8.x86_64
ovirt-imageio-daemon-2.2.0-0.202011031609.gitd364421.el8.x86_64
qemu-kvm-5.1.0-10.module+el8.3.0+8254+568ca30d.x86_64
libvirt-daemon-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64

> Also for more stable confirm, please generate 4GB of data inside guest. On
> default centos installation hard to catch time when disk download started
> and you have enough time to reboot.

So, the flow should be:

1. Start vm with qcow2 disk
2. Write at least 4G of data to disk
3. Start full backup
4. Start downloading disk
5. Reboot inside the vm
6. Finalize backup
7. Restore vm from downloaded disk

Other info we need to reproduce this:
- Where do you download the disk, on the host or from another host?
- If from another host, which network is used (1Gbit? 10Gbit?)
- How much time does the download take when you do not reboot?
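
For illustration, a minimal sketch of driving the flow above from another host with the backup_vm.py SDK example (the engine config name, VM id and downloaded file name are placeholders; writing the data and rebooting the guest are still manual steps):

    import subprocess

    VM_ID = "<vm-uuid>"                      # placeholder
    BACKUP_DIR = "/var/tmp/backups/raw"      # placeholder

    # Steps 3, 4 and 6: full backup, download and finalize via the SDK
    # example script. Reboot the guest manually (step 5) once the
    # download has started.
    subprocess.run(
        ["./backup_vm.py", "-c", "engine-dev", "full",
         "--backup-dir", BACKUP_DIR, VM_ID],
        check=True)

    # Before step 7, sanity-check the downloaded image.
    subprocess.run(
        ["qemu-img", "check", BACKUP_DIR + "/<downloaded-disk>.qcow2"],
        check=True)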

Comment 16 Yury.Panchenko 2020-11-18 17:21:05 UTC
> We need version for these packages:
> On engine host:
[root@pan-rhv44man ~]#  rpm -q ovirt-engine ovirt-imageio-daemon
ovirt-engine-4.4.1.10-0.1.el8ev.noarch
ovirt-imageio-daemon-2.0.10-1.el8.x86_64

> On hosts:
[root@pan-rhv44rel1 ~]# rpm -q vdsm ovirt-imageio-daemon qemu-kvm libvirt-daemon
vdsm-4.40.26.3-1.el8ev.x86_64
ovirt-imageio-daemon-2.0.10-1.el8ev.x86_64
qemu-kvm-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
libvirt-daemon-6.0.0-25.2.module+el8.2.1+7722+a9e38cf3.x86_64

> This is RHEL 8.2 (are you testing on CentOS?). In particular, qemu-4.2.0
> is too old to test incremental backup. You must use qemu >= 5.1.
I don't use RHEL or CentOS, only released builds of RHV.
Right now it's 4.4.2 on the hosts and 4.4.1 for the manager.

> You should test using RHEL 8.3 snapshot available to partners, or
> if you want most recent version, the same version we build for QE.
I can't use unstable builds, because I need to be sure that customers will see the same behavior in their environment when we release the product. And I can't track when these changes move from the dev branch to a release.

> So, the flow should be:
Yes, the steps are correct.

> - Where do you download the disk, on the host or from another host?
another host
> - If from another host, which network is used (1Gbit? 10Gbit?)
10G
> - How much time the download takes when you do not reboot?
I test on a VM with a 10 GB disk image; usually it takes about 2-5 minutes (depending on the stream count).

Comment 17 Pavan Chavva 2020-11-19 15:20:39 UTC
(In reply to Nir Soffer from comment #15)
> (In reply to Yury.Panchenko from comment #13)
> > Hello Tal.
> > > Yuri, still waiting for the version numbers
> 
> We need version for these packages:
> 
> On engine host:
> 
>     rpm -q ovirt-engine ovirt-imageio-daemon
> 
> On hosts:
> 
>     rpm -q vdsm ovirt-imageio-daemon qemu-kvm libvirt-daemon
> 
> > [root@pan-rhv44rel1 ~]# rpm -qa | grep qemu
> > qemu-kvm-common-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
> 
> This is RHEL 8.2 (are you testing on CentOS?). In particular, qemu-4.2.0
> is too old to test incremental backup. You must use qemu >= 5.1.
> 
> You should test using RHEL 8.3 snapshot available to partners, or
> if you want most recent version, the same version we build for QE.

-> RHEL 8.3 is now GA. Do you mean RHEL AV 8.3?

> This is the version we test we used for development now:
> 
> $ rpm -q vdsm ovirt-imageio-daemon qemu-kvm libvirt-daemon
> vdsm-4.40.35.1-202010291125.gita43f00858.el8.x86_64
> ovirt-imageio-daemon-2.2.0-0.202011031609.gitd364421.el8.x86_64
> qemu-kvm-5.1.0-10.module+el8.3.0+8254+568ca30d.x86_64
> libvirt-daemon-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64
> 
> > Also for more stable confirm, please generate 4GB of data inside guest. On
> > default centos installation hard to catch time when disk download started
> > and you have enough time to reboot.
> 
> So, the flow should be:
> 
> 1. Start vm with qcow2 disk
> 2. Write at least 4G of data to disk
> 3. Start full backup
> 4. Start downloading disk
> 5. Reboot inside the vm
> 6. Finalize backup
> 7. Restore vm from downloaded disk
> 
> Other info we need to reproduce this:
> - Where do you download the disk, on the host or from another host?
> - If from another host, which network is used (1Gbit? 10Gbit?)
> - How much time the download takes when you do not reboot?

Comment 18 Nir Soffer 2020-11-19 15:55:01 UTC
(In reply to Pavan Chavva from comment #17)
> > You should test using RHEL 8.3 snapshot available to partners, or
> > if you want most recent version, the same version we build for QE.
> 
> -> RHEL 8.3 is now GA,Do you mean RHEL AV 8.3?

Sure, we support only RHEL AV.

Comment 19 Nir Soffer 2020-11-19 15:59:24 UTC
(In reply to Yuriy Khokhlov (Veeam) from comment #3)
...
> >In such cases obviously, as the backup job doesn't survive QEMU termination, will fail and the backup will be incomplete.
> Yes, I understand this. As soon as RHEV blocks VM from powering off while it
> is in the backup state, 

RHV cannot block the VM from powering off. For this we need support
from libvirt and qemu, to keep qemu in "paused" mode after the guest
powered off.

When we have such a feature, we will be able to modify RHV to support
this, which is a lot of work. This is a feature we can consider for the
distant future.

Comment 20 Nir Soffer 2020-11-19 16:27:51 UTC
(In reply to Yury.Panchenko from comment #16)
> > We need version for these packages:
> > On engine host:
> [root@pan-rhv44man ~]#  rpm -q ovirt-engine ovirt-imageio-daemon
> ovirt-engine-4.4.1.10-0.1.el8ev.noarch

You should test with engine 4.4.2; your version may be missing
a lot of fixes and features.

> ovirt-imageio-daemon-2.0.10-1.el8.x86_64
> 
> > On hosts:
> [root@pan-rhv44rel1 ~]# rpm -q vdsm ovirt-imageio-daemon qemu-kvm
> libvirt-daemon
> vdsm-4.40.26.3-1.el8ev.x86_64
> ovirt-imageio-daemon-2.0.10-1.el8ev.x86_64
> qemu-kvm-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64
> libvirt-daemon-6.0.0-25.2.module+el8.2.1+7722+a9e38cf3.x86_64

So you are testing RHV 4.4.2 and RHEL AV 8.2.1. In this version
incremental backup is not supported in libvirt or RHV. The only
way to use this feature is to enable a flag in RHV, which enables
incremental backup in RHV and libvirt.

This version is good enough for developing your application but it 
is not a supported version that your customers may use.

> > This is RHEL 8.2 (are you testing on CentOS?). In particular, qemu-4.2.0
> > is too old to test incremental backup. You must use qemu >= 5.1.
> I don't use RHEL or Centos. Only released builds of RHV.
> Now its 4.4.2 on hosts and 4.4.1 for manager.
>
> > You should test using RHEL 8.3 snapshot available to partners, or
> > if you want most recent version, the same version we build for QE.
> I cant use unstable builds, because i need to be sure that customers will
> have same behavior on their environment when we release product. And i cant
> track when this changes move from dev branch to release.

Customers will never run RHV 4.4.2, libvirt 6.0 and qemu-kvm 4.2. They
may run RHEL AV 8.3.1 with RHV 4.4.4, or more likely RHEL AV 8.4 and RHV
4.4.6/7.

If you want to test only the version that customers may use, you will
have to wait (a long time) for this version. If you find bugs in this
version, we can fix them only in later versions, so I don't think
testing only on released versions is useful. It means your product will be
delayed, and bug fixes will be delayed.

RHV 4.4.3 should be released very soon. Please try to reproduce this
again with 4.4.3.

I'm trying to reproduce this now with RHV 4.4.4 development version on
RHEL AV 8.3.1 nightly build.

Comment 21 Yury.Panchenko 2020-11-19 19:19:39 UTC
> If you want to test only the version that customers may use, you will
> have to wait (lot of time) for this version. If you find bugs in this
> version, we can fix them only in later versions. So I don't think
> testing on release versions is useful. It means your product will be
> delayed, and bug fixes will be delayed.
Agreed. I will create an additional environment with the latest version.

Comment 22 Yury.Panchenko 2020-11-20 17:26:51 UTC
> RHV cannot block the VM from powering off. For this we need support
> from libvirt and qemu, to keep qemu in "paused" mode after guest
> powered off.

> When we have such feature, we will be able to modify RHV to support
> this, which is lot of work. This is a feature we can consider for 
> distant future.

I understand that a proper fix for that problem can take more time.
So, we can focus on handling this situation more correctly.
For example, if the VM is rebooted, the backup should fail with a proper message.
The main goal is that the customer does not get silent data corruption.
Can you implement this check?

Comment 23 Nir Soffer 2020-11-20 17:35:56 UTC
(In reply to Yury.Panchenko from comment #22)
> > RHV cannot block the VM from powering off. For this we need support
> > from libvirt and qemu, to keep qemu in "paused" mode after guest
> > powered off.
> 
> > When we have such feature, we will be able to modify RHV to support
> > this, which is lot of work. This is a feature we can consider for 
> > distant future.
> 
> I understood that normal fix of that problem can take more time.
> So, we can focus on handle this situation more correct.
> For example, if vm rebooted, backup should be failed with proper message.
> Main goal is not get customer silent data corruption.
> Can you implement this check ?

I started testing, and did not reproduce any corruption yet, so I don't
know if we have such an issue in the current code.

Assuming we do, we cannot fail the transfer in imageio, since it does
not know anything about the guest. If we want to fail the transfer, it
must be done in qemu, since it knows the guest state. Another option
is to fail the backup when you finalize it, if the system detected that
the guest was rebooted during the backup. But all these ideas are premature
at this point. We will first try to reproduce this issue.

If you have a system that reliably reproduces this, arranging access to
that system will help resolve this issue quickly.

Comment 24 Nir Soffer 2020-11-20 18:16:26 UTC
(In reply to Yuriy Khokhlov (Veeam) from comment #0)
> Steps to Reproduce:
> 1) Start VM with guest CentOS 8 system inside.
> 2) Start full backup.
> 3) When backup started processing of the VM's disk, issue a reboot from
> inside the guest OS.
> 4) Make a restore to a new location (create a new VM from OVF).

There are a few issues here:

1. We don't have any information on how you download the backup.

Theoretically the issue could be that your download code is failing
silently on a short read.

When downloading using a GET request, the server returns:

    200 OK HTTP/1.1
    Content-Length: N

and then sends the data to the client. If the server hits an error
while sending the body, it has no way to return an error to the
client, and it will close the connection.

The client is responsible for checking that it read Content-Length bytes
from the server, and for failing the operation if the server closed the
connection without sending all the data.

The imageio client handles this correctly, and will fail with this error
when the server fails to return the promised amount of data:

Traceback (most recent call last):
  File "./backup_vm.py", line 177, in cmd_full
    download_backup(connection, backup.id, args)
  File "./backup_vm.py", line 355, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "./backup_vm.py", line 390, in download_disk
    client.download(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download
    io.copy(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 213, in write_to
    raise RuntimeError(
RuntimeError: Expected 103219200 bytes, got 33554432 bytes

(This error happens when qemu crashes during backup)

If your client is not implemented correctly, e.g. it uses HTTP/1.0,
it may ignore the Content-Length and interpret the server closing the
connection as normal completion.
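
For illustration, a minimal sketch (standard library only; host, port, path and TLS/cafile setup are placeholders) of the check a correct client has to do, failing on a short read instead of treating it as a complete download:

    import http.client

    def download(host, port, path, out_path):
        # TLS/cafile setup for the imageio transfer URL is omitted here.
        conn = http.client.HTTPSConnection(host, port)
        conn.request("GET", path)
        resp = conn.getresponse()
        if resp.status != 200:
            raise RuntimeError("GET failed: %d %s" % (resp.status, resp.reason))

        expected = int(resp.getheader("Content-Length"))
        received = 0
        with open(out_path, "wb") as f:
            while True:
                chunk = resp.read(8 * 1024 * 1024)
                if not chunk:
                    break
                f.write(chunk)
                received += len(chunk)

        # The server closed the connection mid-body (e.g. qemu crashed):
        # this must be reported as a failed download, not ignored.
        if received != expected:
            raise RuntimeError(
                "Expected %d bytes, got %d bytes" % (expected, received))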

Please provide the exact command line that reproduces this issue, so
we can run it in our lab.

2. Is this reproducible with backup_vm.py example?

Please try:

    ./backup_vm.py ... full vm-id

And reboot the vm once the download was started.

I know that you use your own client code, but if the issue happens
only with your client code, the issue is likely in your code, and
not in the server.

3. To verify the backup, it is best to run a local vm.

If we prove that the backup is correct, a failure to restore means
that something is wrong with the restore, not the backup. This is
unlikely to be the issue since it would affect restore in other flows,
but it is faster to test backup issues this way.

The best way to test with a local VM is:

Create an overlay to keep the backup read only:

    qemu-img create -f qcow2 -b /path/to/backup.qcow2 -F qcow2 preview.qcow2

(you will have to change the format if you download to a raw file)

And start a vm with the preview disk:

    qemu-kvm -m 2048 preview.qcow2
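
A minimal sketch wrapping the two commands above (the downloaded backup file name is a placeholder; it assumes the backup was downloaded in qcow2 format):

    import subprocess

    backup = "/var/tmp/backups/raw/<downloaded-disk>.qcow2"   # placeholder

    # Create a throw-away overlay so the downloaded backup stays read-only.
    subprocess.run(
        ["qemu-img", "create", "-f", "qcow2",
         "-b", backup, "-F", "qcow2", "preview.qcow2"],
        check=True)

    # Boot the preview disk locally; if the guest comes up cleanly,
    # the backup itself is good.
    subprocess.run(["qemu-kvm", "-m", "2048", "preview.qcow2"], check=True)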

Comment 25 Nir Soffer 2020-11-22 09:57:36 UTC
I tested reboot flow, and could not reproduce silent data corruption.

I found that rebooting a guest during backup crashes qemu when using a
virtio disk, but works when using virtio-scsi (the default).

When qemu crashes, the imageio server fails when its connection to qemu
is closed. Correct HTTP clients, such as the imageio client, detect the
failure and fail the download as expected.

I filed qemu bug 1900326 for tracking the crash issue.

Lowering severity and priority to high since I cannot reproduce silent
data corruption, only a crash, which should be fixed in another component.


Here are the details of how I tested this:

## Tested versions

# rpm -q vdsm ovirt-imageio-daemon libvirt-daemon qemu-kvm
vdsm-4.40.35.1-202010291125.gita43f00858.el8.x86_64
ovirt-imageio-daemon-2.2.0-0.202011031609.gitd364421.el8.x86_64
libvirt-daemon-6.6.0-8.module+el8.3.1+8648+130818f2.x86_64
qemu-kvm-5.1.0-13.module+el8.3.0+8424+e82f331d.x86_64

## Guest file systems

# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        440M     0  440M   0% /dev
tmpfs           453M     0  453M   0% /dev/shm
tmpfs           453M  720K  453M   1% /run
/dev/sda4       4.4G  3.7G  733M  84% /
tmpfs           453M  4.0K  453M   1% /tmp
/dev/sda2      1014M  265M  750M  27% /boot
tmpfs            91M     0   91M   0% /run/user/0


## Backup script

Using a modified backup script:
https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/backup_vm.py

that downloads the same disk multiple times:

diff --git a/sdk/examples/backup_vm.py b/sdk/examples/backup_vm.py
index f00f4e6..494716b 100755
--- a/sdk/examples/backup_vm.py
+++ b/sdk/examples/backup_vm.py
@@ -349,9 +349,10 @@ def download_backup(connection, backup_uuid, args, incremental=False):
     backup_type = "incremental" if incremental else "full"
     timestamp = time.strftime("%Y%m%d%H%M")
     for disk in backup_disks:
-        file_name = "{}.{}.{}.qcow2".format(disk.id, timestamp, backup_type)
-        disk_path = os.path.join(args.backup_dir, file_name)
-        download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
+        for i in range(5):
+            file_name = "{}.{:02d}.qcow2".format(disk.id, i)
+            disk_path = os.path.join(args.backup_dir, file_name)
+            download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)


 def get_backup_service(connection, vm_uuid, backup_uuid):


This allows testing the image transfer across a guest reboot, which takes
a lot of time, even with smaller VMs.

The script generates multiple backup files, and we can compare them
later to make sure we get the same content.


## Test flow

For testing full backup:

1. Start full backup
2. Once the first download finishes, reboot the VM.
3. When the VM is up, reboot it again; repeat until the backups are done.

For testing incremental backup:

1. Create a full backup using the normal backup_vm.py script
2. Start an incremental backup with the modified script that downloads multiple
   times
3. Once the first incremental backup is finished, reboot the VM.


See the next comments for test results with 6 different VM configurations.

Comment 26 Nir Soffer 2020-11-22 09:59:17 UTC
VM 1 - raw disk, virtio-scsi

# virsh -r dumpxml backup-raw
<domain type='kvm' id='2' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>backup-raw</name>
  <uuid>62af983c-53d1-4e1e-b78b-0de57027c192</uuid>
  <metadata xmlns:ns1="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ns1:qos/>
    <ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ovirt-vm:balloonTarget type="int">1048576</ovirt-vm:balloonTarget>
    <ovirt-vm:clusterVersion>4.5</ovirt-vm:clusterVersion>
    <ovirt-vm:destroy_on_reboot type="bool">False</ovirt-vm:destroy_on_reboot>
    <ovirt-vm:launchPaused>false</ovirt-vm:launchPaused>
    <ovirt-vm:memGuaranteedSize type="int">1024</ovirt-vm:memGuaranteedSize>
    <ovirt-vm:minGuaranteedMemoryMb type="int">1024</ovirt-vm:minGuaranteedMemoryMb>
    <ovirt-vm:resumeBehavior>auto_resume</ovirt-vm:resumeBehavior>
    <ovirt-vm:startTime type="float">1605914483.0351462</ovirt-vm:startTime>
    <ovirt-vm:device mac_address="56:6f:7a:5a:00:00">
        <ovirt-vm:network>ovirtmgmt</ovirt-vm:network>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="sda">
        <ovirt-vm:domainID>839db77f-fde3-4e13-bfb6-56be604631ed</ovirt-vm:domainID>
        <ovirt-vm:guestName>/dev/sda</ovirt-vm:guestName>
        <ovirt-vm:imageID>5eb97d51-9da4-420c-8c88-6a3691bc4c67</ovirt-vm:imageID>
        <ovirt-vm:poolID>fb7797bd-3cca-4baa-bcb3-35e8851015f7</ovirt-vm:poolID>
        <ovirt-vm:volumeID>b14b04c3-5936-4963-8ee0-c5dbda70afa3</ovirt-vm:volumeID>
        <ovirt-vm:volumeChain>
            <ovirt-vm:volumeChainNode>
                <ovirt-vm:domainID>839db77f-fde3-4e13-bfb6-56be604631ed</ovirt-vm:domainID>
                <ovirt-vm:imageID>5eb97d51-9da4-420c-8c88-6a3691bc4c67</ovirt-vm:imageID>
                <ovirt-vm:leaseOffset type="int">0</ovirt-vm:leaseOffset>
                <ovirt-vm:leasePath>/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/5eb97d51-9da4-420c-8c88-6a3691bc4c67/b14b04c3-5936-4963-8ee0-c5dbda70afa3.lease</ovirt-vm:leasePath>
                <ovirt-vm:path>/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/5eb97d51-9da4-420c-8c88-6a3691bc4c67/b14b04c3-5936-4963-8ee0-c5dbda70afa3</ovirt-vm:path>
                <ovirt-vm:volumeID>b14b04c3-5936-4963-8ee0-c5dbda70afa3</ovirt-vm:volumeID>
            </ovirt-vm:volumeChainNode>
        </ovirt-vm:volumeChain>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="sdc"/>
</ovirt-vm:vm>
  </metadata>
  <maxMemory slots='16' unit='KiB'>4194304</maxMemory>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static' current='1'>16</vcpu>
  <iothreads>1</iothreads>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>oVirt</entry>
      <entry name='product'>RHEL</entry>
      <entry name='version'>8.3-1.0.el8</entry>
      <entry name='serial'>095bce26-9d5b-4ae2-b3bf-dddfc9cdf541</entry>
      <entry name='uuid'>62af983c-53d1-4e1e-b78b-0de57027c192</entry>
      <entry name='family'>oVirt</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel8.1.0'>hvm</type>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Skylake-Client</model>
    <topology sockets='16' dies='1' cores='1' threads='1'/>
    <feature policy='disable' name='hle'/>
    <feature policy='disable' name='rtm'/>
    <feature policy='disable' name='mpx'/>
    <feature policy='require' name='hypervisor'/>
    <numa>
      <cell id='0' cpus='0-15' memory='1048576' unit='KiB'/>
    </numa>
  </cpu>
  <clock offset='variable' adjustment='0' basis='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' error_policy='report'/>
      <source startupPolicy='optional'/>
      <target dev='sdc' bus='sata'/>
      <readonly/>
      <alias name='ua-f99274cc-c8aa-441f-82eb-a7bcf1d58789'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads' discard='unmap'/>
      <source file='/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/5eb97d51-9da4-420c-8c88-6a3691bc4c67/b14b04c3-5936-4963-8ee0-c5dbda70afa3' index='1'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <serial>5eb97d51-9da4-420c-8c88-6a3691bc4c67</serial>
      <boot order='1'/>
      <alias name='ua-5eb97d51-9da4-420c-8c88-6a3691bc4c67'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <alias name='pci.5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x15'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x16'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x17'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0x18'/>
      <alias name='pci.9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='10' port='0x19'/>
      <alias name='pci.10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x1'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='11' port='0x1a'/>
      <alias name='pci.11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x2'/>
    </controller>
    <controller type='pci' index='12' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='12' port='0x1b'/>
      <alias name='pci.12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x3'/>
    </controller>
    <controller type='pci' index='13' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='13' port='0x1c'/>
      <alias name='pci.13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x4'/>
    </controller>
    <controller type='pci' index='14' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='14' port='0x1d'/>
      <alias name='pci.14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x5'/>
    </controller>
    <controller type='pci' index='15' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='15' port='0x1e'/>
      <alias name='pci.15'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x6'/>
    </controller>
    <controller type='pci' index='16' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='16' port='0x1f'/>
      <alias name='pci.16'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x7'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='virtio-serial' index='0' ports='16'>
      <alias name='ua-20458a3e-cc4e-4eaf-bb21-93a61bd3bbe8'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='qemu-xhci' ports='8'>
      <alias name='ua-33d62616-9d86-4da5-a352-e2bf7a79d667'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <driver iothread='1'/>
      <alias name='ua-67f16eeb-cf51-4e45-8a9c-24240c59910c'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <interface type='bridge'>
      <mac address='56:6f:7a:5a:00:00'/>
      <source bridge='ovirtmgmt'/>
      <target dev='vnet1'/>
      <model type='virtio'/>
      <filterref filter='vdsm-no-mac-spoofing'/>
      <link state='up'/>
      <mtu size='1500'/>
      <alias name='ua-a181f7b1-2838-4980-af77-cbd508172fec'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/62af983c-53d1-4e1e-b78b-0de57027c192.ovirt-guest-agent.0'/>
      <target type='virtio' name='ovirt-guest-agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/62af983c-53d1-4e1e-b78b-0de57027c192.org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel1'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0' state='disconnected'/>
      <alias name='channel2'/>
      <address type='virtio-serial' controller='0' bus='0' port='3'/>
    </channel>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'>
      <alias name='input1'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input2'/>
    </input>
    <graphics type='spice' port='5900' tlsPort='5901' autoport='yes' listen='192.168.122.22' passwdValidTo='1970-01-01T00:00:01'>
      <listen type='network' address='192.168.122.22' network='vdsm-ovirtmgmt'/>
      <channel name='main' mode='secure'/>
      <channel name='display' mode='secure'/>
      <channel name='inputs' mode='secure'/>
      <channel name='cursor' mode='secure'/>
      <channel name='playback' mode='secure'/>
      <channel name='record' mode='secure'/>
      <channel name='smartcard' mode='secure'/>
      <channel name='usbredir' mode='secure'/>
    </graphics>
    <graphics type='vnc' port='5902' autoport='yes' listen='192.168.122.22' keymap='en-us' passwdValidTo='1970-01-01T00:00:01'>
      <listen type='network' address='192.168.122.22' network='vdsm-ovirtmgmt'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='8192' vgamem='16384' heads='1' primary='yes'/>
      <alias name='ua-b851f4bd-392b-448c-a08b-ab0643fb4def'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <stats period='5'/>
      <alias name='ua-c3938948-2dcb-45c0-86c5-cb7677bbe138'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <alias name='ua-0a608db4-1dea-47c2-9128-8fade85c3928'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </rng>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c69,c502</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c69,c502</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
  <qemu:capabilities>
    <qemu:add capability='blockdev'/>
    <qemu:add capability='incremental-backup'/>
  </qemu:capabilities>
</domain>


Backup log:

$ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/raw/ 62af983c-53d1-4e1e-b78b-0de57027c192
[   0.0 ] Starting full backup for VM 62af983c-53d1-4e1e-b78b-0de57027c192
[   0.2 ] Waiting until backup 0fd5d2a0-788f-41ed-9c03-00a7ebb174c7 is ready
[   1.2 ] Creating image transfer for disk 5eb97d51-9da4-420c-8c88-6a3691bc4c67
[   2.3 ] Image transfer 5025e8ef-0fe5-4919-a1c2-393afbc125a1 is ready
[ 100.00% ] 6.00 GiB, 23.54 seconds, 260.99 MiB/s                            
[  25.9 ] Finalizing image transfer
[  26.9 ] Creating image transfer for disk 5eb97d51-9da4-420c-8c88-6a3691bc4c67
[  28.0 ] Image transfer c9eaefdb-3ff8-4409-afbb-28c57dae1812 is ready
[ 100.00% ] 6.00 GiB, 37.62 seconds, 163.30 MiB/s                            
[  65.6 ] Finalizing image transfer
[  72.7 ] Creating image transfer for disk 5eb97d51-9da4-420c-8c88-6a3691bc4c67
[  73.9 ] Image transfer 2a59069d-a48b-44fb-8632-27e54e6ca2b6 is ready
[ 100.00% ] 6.00 GiB, 39.42 seconds, 155.84 MiB/s                            
[ 113.3 ] Finalizing image transfer
[ 117.3 ] Creating image transfer for disk 5eb97d51-9da4-420c-8c88-6a3691bc4c67
[ 118.5 ] Image transfer d63710ed-cc8a-4415-84a9-636197fa82f6 is ready
[ 100.00% ] 6.00 GiB, 48.71 seconds, 126.13 MiB/s                            
[ 167.2 ] Finalizing image transfer
[ 173.3 ] Creating image transfer for disk 5eb97d51-9da4-420c-8c88-6a3691bc4c67
[ 174.5 ] Image transfer c668076a-fd0e-42a0-951d-dd41aeecb9b1 is ready
[ 100.00% ] 6.00 GiB, 51.26 seconds, 119.85 MiB/s                            
[ 225.8 ] Finalizing image transfer
[ 228.9 ] Full backup completed successfully

All downloads succeeded.

Comparing backups:

$ ls -lhs
total 20G
3.9G -rw-r--r--. 1 nsoffer nsoffer 3.9G Nov 21 01:19 5eb97d51-9da4-420c-8c88-6a3691bc4c67.00.qcow2
3.9G -rw-r--r--. 1 nsoffer nsoffer 3.9G Nov 21 01:20 5eb97d51-9da4-420c-8c88-6a3691bc4c67.01.qcow2
3.9G -rw-r--r--. 1 nsoffer nsoffer 3.9G Nov 21 01:20 5eb97d51-9da4-420c-8c88-6a3691bc4c67.02.qcow2
3.9G -rw-r--r--. 1 nsoffer nsoffer 3.9G Nov 21 01:21 5eb97d51-9da4-420c-8c88-6a3691bc4c67.03.qcow2
3.9G -rw-r--r--. 1 nsoffer nsoffer 3.9G Nov 21 01:22 5eb97d51-9da4-420c-8c88-6a3691bc4c67.04.qcow2

$ for i in $(seq 4); do echo "compare backup 00 to 0$i"; qemu-img compare 5eb97d51-9da4-420c-8c88-6a3691bc4c67.00.qcow2 5eb97d51-9da4-420c-8c88-6a3691bc4c67.0$i.qcow2; done
compare backup 00 to 01
Images are identical.
compare backup 00 to 02
Images are identical.
compare backup 00 to 03
Images are identical.
compare backup 00 to 04
Images are identical.

Testing the backup using a local VM:

$ qemu-img create -f qcow2 -b 5eb97d51-9da4-420c-8c88-6a3691bc4c67.00.qcow2 -F qcow2 preview.qcow2
Formatting 'preview.qcow2', fmt=qcow2 cluster_size=65536 compression_type=zlib size=6442450944 backing_file=5eb97d51-9da4-420c-8c88-6a3691bc4c67.00.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16

$ qemu-kvm -m 2048 preview.qcow2

VM boots successfully.

Comment 27 Nir Soffer 2020-11-22 10:00:07 UTC
VM 2 - raw disk, virtio - crash during reboot

# virsh -r dumpxml backup-raw
<domain type='kvm' id='3' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>backup-raw</name>
  <uuid>62af983c-53d1-4e1e-b78b-0de57027c192</uuid>
  <metadata xmlns:ns1="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ns1:qos/>
    <ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ovirt-vm:balloonTarget type="int">1048576</ovirt-vm:balloonTarget>
    <ovirt-vm:clusterVersion>4.5</ovirt-vm:clusterVersion>
    <ovirt-vm:destroy_on_reboot type="bool">False</ovirt-vm:destroy_on_reboot>
    <ovirt-vm:launchPaused>false</ovirt-vm:launchPaused>
    <ovirt-vm:memGuaranteedSize type="int">1024</ovirt-vm:memGuaranteedSize>
    <ovirt-vm:minGuaranteedMemoryMb type="int">1024</ovirt-vm:minGuaranteedMemoryMb>
    <ovirt-vm:resumeBehavior>auto_resume</ovirt-vm:resumeBehavior>
    <ovirt-vm:startTime type="float">1605915811.6488793</ovirt-vm:startTime>
    <ovirt-vm:device mac_address="56:6f:7a:5a:00:00">
        <ovirt-vm:network>ovirtmgmt</ovirt-vm:network>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="vda">
        <ovirt-vm:domainID>839db77f-fde3-4e13-bfb6-56be604631ed</ovirt-vm:domainID>
        <ovirt-vm:guestName>/dev/vda</ovirt-vm:guestName>
        <ovirt-vm:imageID>5eb97d51-9da4-420c-8c88-6a3691bc4c67</ovirt-vm:imageID>
        <ovirt-vm:poolID>fb7797bd-3cca-4baa-bcb3-35e8851015f7</ovirt-vm:poolID>
        <ovirt-vm:volumeID>b14b04c3-5936-4963-8ee0-c5dbda70afa3</ovirt-vm:volumeID>
        <ovirt-vm:specParams>
            <ovirt-vm:pinToIoThread>1</ovirt-vm:pinToIoThread>
        </ovirt-vm:specParams>
        <ovirt-vm:volumeChain>
            <ovirt-vm:volumeChainNode>
                <ovirt-vm:domainID>839db77f-fde3-4e13-bfb6-56be604631ed</ovirt-vm:domainID>
                <ovirt-vm:imageID>5eb97d51-9da4-420c-8c88-6a3691bc4c67</ovirt-vm:imageID>
                <ovirt-vm:leaseOffset type="int">0</ovirt-vm:leaseOffset>
                <ovirt-vm:leasePath>/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/5eb97d51-9da4-420c-8c88-6a3691bc4c67/b14b04c3-5936-4963-8ee0-c5dbda70afa3.lease</ovirt-vm:leasePath>
                <ovirt-vm:path>/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/5eb97d51-9da4-420c-8c88-6a3691bc4c67/b14b04c3-5936-4963-8ee0-c5dbda70afa3</ovirt-vm:path>
                <ovirt-vm:volumeID>b14b04c3-5936-4963-8ee0-c5dbda70afa3</ovirt-vm:volumeID>
            </ovirt-vm:volumeChainNode>
        </ovirt-vm:volumeChain>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="sdc"/>
</ovirt-vm:vm>
  </metadata>
  <maxMemory slots='16' unit='KiB'>4194304</maxMemory>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static' current='1'>16</vcpu>
  <iothreads>1</iothreads>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>oVirt</entry>
      <entry name='product'>RHEL</entry>
      <entry name='version'>8.3-1.0.el8</entry>
      <entry name='serial'>095bce26-9d5b-4ae2-b3bf-dddfc9cdf541</entry>
      <entry name='uuid'>62af983c-53d1-4e1e-b78b-0de57027c192</entry>
      <entry name='family'>oVirt</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel8.1.0'>hvm</type>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Skylake-Client</model>
    <topology sockets='16' dies='1' cores='1' threads='1'/>
    <feature policy='disable' name='hle'/>
    <feature policy='disable' name='rtm'/>
    <feature policy='disable' name='mpx'/>
    <feature policy='require' name='hypervisor'/>
    <numa>
      <cell id='0' cpus='0-15' memory='1048576' unit='KiB'/>
    </numa>
  </cpu>
  <clock offset='variable' adjustment='0' basis='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' error_policy='report'/>
      <source startupPolicy='optional'/>
      <target dev='sdc' bus='sata'/>
      <readonly/>
      <alias name='ua-f99274cc-c8aa-441f-82eb-a7bcf1d58789'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads' iothread='1'/>
      <source file='/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/5eb97d51-9da4-420c-8c88-6a3691bc4c67/b14b04c3-5936-4963-8ee0-c5dbda70afa3' index='1'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <serial>5eb97d51-9da4-420c-8c88-6a3691bc4c67</serial>
      <boot order='1'/>
      <alias name='ua-5eb97d51-9da4-420c-8c88-6a3691bc4c67'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </disk>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <alias name='pci.5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x15'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x16'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x17'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0x18'/>
      <alias name='pci.9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='10' port='0x19'/>
      <alias name='pci.10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x1'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='11' port='0x1a'/>
      <alias name='pci.11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x2'/>
    </controller>
    <controller type='pci' index='12' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='12' port='0x1b'/>
      <alias name='pci.12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x3'/>
    </controller>
    <controller type='pci' index='13' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='13' port='0x1c'/>
      <alias name='pci.13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x4'/>
    </controller>
    <controller type='pci' index='14' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='14' port='0x1d'/>
      <alias name='pci.14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x5'/>
    </controller>
    <controller type='pci' index='15' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='15' port='0x1e'/>
      <alias name='pci.15'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x6'/>
    </controller>
    <controller type='pci' index='16' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='16' port='0x1f'/>
      <alias name='pci.16'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x7'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='virtio-serial' index='0' ports='16'>
      <alias name='ua-20458a3e-cc4e-4eaf-bb21-93a61bd3bbe8'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='qemu-xhci' ports='8'>
      <alias name='ua-33d62616-9d86-4da5-a352-e2bf7a79d667'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <driver iothread='1'/>
      <alias name='ua-c0757cfe-203b-43ce-8e76-508cfdf05e22'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='56:6f:7a:5a:00:00'/>
      <source bridge='ovirtmgmt'/>
      <target dev='vnet2'/>
      <model type='virtio'/>
      <filterref filter='vdsm-no-mac-spoofing'/>
      <link state='up'/>
      <mtu size='1500'/>
      <alias name='ua-a181f7b1-2838-4980-af77-cbd508172fec'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/62af983c-53d1-4e1e-b78b-0de57027c192.ovirt-guest-agent.0'/>
      <target type='virtio' name='ovirt-guest-agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/62af983c-53d1-4e1e-b78b-0de57027c192.org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel1'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0' state='disconnected'/>
      <alias name='channel2'/>
      <address type='virtio-serial' controller='0' bus='0' port='3'/>
    </channel>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'>
      <alias name='input1'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input2'/>
    </input>
    <graphics type='spice' port='5900' tlsPort='5901' autoport='yes' listen='192.168.122.22' passwdValidTo='1970-01-01T00:00:01'>
      <listen type='network' address='192.168.122.22' network='vdsm-ovirtmgmt'/>
      <channel name='main' mode='secure'/>
      <channel name='display' mode='secure'/>
      <channel name='inputs' mode='secure'/>
      <channel name='cursor' mode='secure'/>
      <channel name='playback' mode='secure'/>
      <channel name='record' mode='secure'/>
      <channel name='smartcard' mode='secure'/>
      <channel name='usbredir' mode='secure'/>
    </graphics>
    <graphics type='vnc' port='5902' autoport='yes' listen='192.168.122.22' keymap='en-us' passwdValidTo='1970-01-01T00:00:01'>
      <listen type='network' address='192.168.122.22' network='vdsm-ovirtmgmt'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='8192' vgamem='16384' heads='1' primary='yes'/>
      <alias name='ua-b851f4bd-392b-448c-a08b-ab0643fb4def'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <stats period='5'/>
      <alias name='ua-c3938948-2dcb-45c0-86c5-cb7677bbe138'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <alias name='ua-0a608db4-1dea-47c2-9128-8fade85c3928'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </rng>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c896,c1004</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c896,c1004</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
  <qemu:capabilities>
    <qemu:add capability='blockdev'/>
    <qemu:add capability='incremental-backup'/>
  </qemu:capabilities>
</domain>

Backup log:

$ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/raw/ 62af983c-53d1-4e1e-b78b-0de57027c192
[   0.0 ] Starting full backup for VM 62af983c-53d1-4e1e-b78b-0de57027c192
[   0.7 ] Waiting until backup 85a6b40b-17d3-4867-b06b-b7ac76a2db03 is ready
[   1.8 ] Creating image transfer for disk 5eb97d51-9da4-420c-8c88-6a3691bc4c67
[   3.1 ] Image transfer 69ec0ef5-b7f2-445f-9553-255ccb5a04ba is ready
[ 100.00% ] 6.00 GiB, 29.52 seconds, 208.14 MiB/s                            
[  32.7 ] Finalizing image transfer
[  37.8 ] Creating image transfer for disk 5eb97d51-9da4-420c-8c88-6a3691bc4c67
[  38.9 ] Image transfer 283935b3-31cb-442e-9731-caa22358b683 is ready
[  23.37% ] 1.40 GiB, 2.10 seconds, 684.47 MiB/s                            
[  41.0 ] Finalizing image transfer
Traceback (most recent call last):
  File "./backup_vm.py", line 177, in cmd_full
    download_backup(connection, backup.id, args)
  File "./backup_vm.py", line 355, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "./backup_vm.py", line 390, in download_disk
    client.download(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download
    io.copy(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 204, in write_to
    res = self._get(length)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 431, in _get
    self._reraise(res.status, res.read())
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise
    raise http.Error(status, msg)
ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./backup_vm.py", line 429, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 179, in cmd_full
    stop_backup(connection, backup.id, args)
  File "./backup_vm.py", line 327, in stop_backup
    backup_service.finalize()
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/services.py", line 33306, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Host not found.]". HTTP response code is 400.

The second download failed in the middle of the transfer. The error details are available in the server logs.

Server log:
/var/log/ovirt-imageio/daemon.log

2020-11-21 02:28:13,717 ERROR   (Thread-123) [http] Request aborted after starting response
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 774, in __call__
    self.dispatch(req, resp)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 819, in dispatch
    return method(req, resp, *match.groups())
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/cors.py", line 84, in wrapper
    return func(self, req, resp, *args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/images.py", line 148, in get
    ticket.run(op)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/auth.py", line 174, in run
    return operation.run()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/ops.py", line 57, in run
    res = self._run()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/ops.py", line 100, in _run
    self._read_chunk()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/ops.py", line 112, in _read_chunk
    count = self._src.readinto(view)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/nbd.py", line 110, in readinto
    self._client.readinto(self._position, view)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 416, in readinto
    self._recv_reply(cmd)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 958, in _recv_reply
    if self._recv_reply_chunk(cmd):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 1016, in _recv_reply_chunk
    self._handle_data_chunk(length, cmd)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 1152, in _handle_data_chunk
    self._recv_into(view)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 1213, in _recv_into
    .format(pos, length))
ovirt_imageio._internal.nbd.ProtocolError: Server closed the connection, read 7016420 bytes, expected 8388608 bytes

Connection to qemu closed during read.

Libvirt qemu log:
/var/log/libvirt/qemu/backup-raw.log

2020-11-21 00:26:03.762+0000: Domain id=6 is tainted: custom-hypervisor-feature
2020-11-21T00:26:04.033478Z qemu-kvm: -numa node,nodeid=0,cpus=0-15,mem=1024: warning: Parameter -numa node,mem is deprecated, use -numa node,memdev instead
2020-11-21 00:26:22.900+0000: Domain id=6 is tainted: custom-ga-command
aio_co_schedule: Co-routine was already scheduled in 'aio_co_schedule'
2020-11-21 00:28:13.932+0000: shutting down, reason=crashed

Looking at the coredump:

Core was generated by `/usr/libexec/qemu-kvm -name guest=backup-qcow2,debug-threads=on -S -object secr'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f37f011a7ff in ?? ()

Comment 28 Nir Soffer 2020-11-22 10:01:25 UTC
VM 3 - qcow2 disk, virtio-scsi

# virsh -r dumpxml backup-qcow2
<domain type='kvm' id='8' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>backup-qcow2</name>
  <uuid>4198a95d-c9be-413c-a01b-bc2fa3448e26</uuid>
  <metadata xmlns:ns1="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ns1:qos/>
    <ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ovirt-vm:balloonTarget type="int">1048576</ovirt-vm:balloonTarget>
    <ovirt-vm:clusterVersion>4.5</ovirt-vm:clusterVersion>
    <ovirt-vm:destroy_on_reboot type="bool">False</ovirt-vm:destroy_on_reboot>
    <ovirt-vm:launchPaused>false</ovirt-vm:launchPaused>
    <ovirt-vm:memGuaranteedSize type="int">1024</ovirt-vm:memGuaranteedSize>
    <ovirt-vm:minGuaranteedMemoryMb type="int">1024</ovirt-vm:minGuaranteedMemoryMb>
    <ovirt-vm:resumeBehavior>auto_resume</ovirt-vm:resumeBehavior>
    <ovirt-vm:startTime type="float">1605921023.322825</ovirt-vm:startTime>
    <ovirt-vm:device mac_address="56:6f:7a:5a:00:02">
        <ovirt-vm:network>ovirtmgmt</ovirt-vm:network>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="sda">
        <ovirt-vm:domainID>839db77f-fde3-4e13-bfb6-56be604631ed</ovirt-vm:domainID>
        <ovirt-vm:guestName>/dev/sda</ovirt-vm:guestName>
        <ovirt-vm:imageID>86569d03-189c-4893-b46e-286c2eba8857</ovirt-vm:imageID>
        <ovirt-vm:poolID>fb7797bd-3cca-4baa-bcb3-35e8851015f7</ovirt-vm:poolID>
        <ovirt-vm:volumeID>fa0ff024-932b-4515-8dab-d2a2ed5a7bf9</ovirt-vm:volumeID>
        <ovirt-vm:volumeChain>
            <ovirt-vm:volumeChainNode>
                <ovirt-vm:domainID>839db77f-fde3-4e13-bfb6-56be604631ed</ovirt-vm:domainID>
                <ovirt-vm:imageID>86569d03-189c-4893-b46e-286c2eba8857</ovirt-vm:imageID>
                <ovirt-vm:leaseOffset type="int">0</ovirt-vm:leaseOffset>
                <ovirt-vm:leasePath>/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/86569d03-189c-4893-b46e-286c2eba8857/fa0ff024-932b-4515-8dab-d2a2ed5a7bf9.lease</ovirt-vm:leasePath>
                <ovirt-vm:path>/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/86569d03-189c-4893-b46e-286c2eba8857/fa0ff024-932b-4515-8dab-d2a2ed5a7bf9</ovirt-vm:path>
                <ovirt-vm:volumeID>fa0ff024-932b-4515-8dab-d2a2ed5a7bf9</ovirt-vm:volumeID>
            </ovirt-vm:volumeChainNode>
        </ovirt-vm:volumeChain>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="sdc"/>
</ovirt-vm:vm>
  </metadata>
  <maxMemory slots='16' unit='KiB'>4194304</maxMemory>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static' current='1'>16</vcpu>
  <iothreads>1</iothreads>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>oVirt</entry>
      <entry name='product'>RHEL</entry>
      <entry name='version'>8.3-1.0.el8</entry>
      <entry name='serial'>095bce26-9d5b-4ae2-b3bf-dddfc9cdf541</entry>
      <entry name='uuid'>4198a95d-c9be-413c-a01b-bc2fa3448e26</entry>
      <entry name='family'>oVirt</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel8.1.0'>hvm</type>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Skylake-Client</model>
    <topology sockets='16' dies='1' cores='1' threads='1'/>
    <feature policy='disable' name='hle'/>
    <feature policy='disable' name='rtm'/>
    <feature policy='disable' name='mpx'/>
    <feature policy='require' name='hypervisor'/>
    <numa>
      <cell id='0' cpus='0-15' memory='1048576' unit='KiB'/>
    </numa>
  </cpu>
  <clock offset='variable' adjustment='0' basis='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' error_policy='report'/>
      <source startupPolicy='optional'/>
      <target dev='sdc' bus='sata'/>
      <readonly/>
      <alias name='ua-87de72de-d473-4b69-b5fc-bb8f2a65eda1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads' discard='unmap'/>
      <source file='/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/86569d03-189c-4893-b46e-286c2eba8857/fa0ff024-932b-4515-8dab-d2a2ed5a7bf9' index='1'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <serial>86569d03-189c-4893-b46e-286c2eba8857</serial>
      <boot order='1'/>
      <alias name='ua-86569d03-189c-4893-b46e-286c2eba8857'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <alias name='pci.5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x15'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x16'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x17'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0x18'/>
      <alias name='pci.9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='10' port='0x19'/>
      <alias name='pci.10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x1'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='11' port='0x1a'/>
      <alias name='pci.11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x2'/>
    </controller>
    <controller type='pci' index='12' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='12' port='0x1b'/>
      <alias name='pci.12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x3'/>
    </controller>
    <controller type='pci' index='13' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='13' port='0x1c'/>
      <alias name='pci.13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x4'/>
    </controller>
    <controller type='pci' index='14' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='14' port='0x1d'/>
      <alias name='pci.14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x5'/>
    </controller>
    <controller type='pci' index='15' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='15' port='0x1e'/>
      <alias name='pci.15'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x6'/>
    </controller>
    <controller type='pci' index='16' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='16' port='0x1f'/>
      <alias name='pci.16'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x7'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='virtio-serial' index='0' ports='16'>
      <alias name='ua-6a467db1-1ffc-4f77-9b58-b746a1b048be'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <driver iothread='1'/>
      <alias name='ua-cc2f22e8-9625-4dc0-8d70-3513b45975bd'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='qemu-xhci' ports='8'>
      <alias name='ua-ec25086f-5623-4d02-b596-5b5251ccce02'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='56:6f:7a:5a:00:02'/>
      <source bridge='ovirtmgmt'/>
      <target dev='vnet7'/>
      <model type='virtio'/>
      <filterref filter='vdsm-no-mac-spoofing'/>
      <link state='up'/>
      <mtu size='1500'/>
      <alias name='ua-d44c62e4-0a44-4a11-a3f1-31247c73f6a0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/4198a95d-c9be-413c-a01b-bc2fa3448e26.ovirt-guest-agent.0'/>
      <target type='virtio' name='ovirt-guest-agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/4198a95d-c9be-413c-a01b-bc2fa3448e26.org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel1'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0' state='disconnected'/>
      <alias name='channel2'/>
      <address type='virtio-serial' controller='0' bus='0' port='3'/>
    </channel>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'>
      <alias name='input1'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input2'/>
    </input>
    <graphics type='spice' port='5900' tlsPort='5901' autoport='yes' listen='192.168.122.22' passwdValidTo='1970-01-01T00:00:01'>
      <listen type='network' address='192.168.122.22' network='vdsm-ovirtmgmt'/>
      <channel name='main' mode='secure'/>
      <channel name='display' mode='secure'/>
      <channel name='inputs' mode='secure'/>
      <channel name='cursor' mode='secure'/>
      <channel name='playback' mode='secure'/>
      <channel name='record' mode='secure'/>
      <channel name='smartcard' mode='secure'/>
      <channel name='usbredir' mode='secure'/>
    </graphics>
    <graphics type='vnc' port='5902' autoport='yes' listen='192.168.122.22' keymap='en-us' passwdValidTo='1970-01-01T00:00:01'>
      <listen type='network' address='192.168.122.22' network='vdsm-ovirtmgmt'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='8192' vgamem='16384' heads='1' primary='yes'/>
      <alias name='ua-d7761ee1-a13e-4314-979b-92d4968e2e61'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <stats period='5'/>
      <alias name='ua-96960a0a-f56f-4c4c-8b8a-bb51410db7b8'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <alias name='ua-494fbdfa-a451-4a9d-9573-f821e4726847'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </rng>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c313,c558</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c313,c558</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
  <qemu:capabilities>
    <qemu:add capability='blockdev'/>
    <qemu:add capability='incremental-backup'/>
  </qemu:capabilities>
</domain>

Backup log:

$ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/qcow2 4198a95d-c9be-413c-a01b-bc2fa3448e26
[   0.0 ] Starting full backup for VM 4198a95d-c9be-413c-a01b-bc2fa3448e26
[   0.3 ] Waiting until backup 246c73cb-738c-4a08-9462-d88404968db8 is ready
[   1.3 ] Created checkpoint '9233b5a6-f68e-4c3b-82eb-698ddaa6fbbe' (to use in --from-checkpoint-uuid for the next incremental backup)
[   1.3 ] Creating image transfer for disk 86569d03-189c-4893-b46e-286c2eba8857
[   2.4 ] Image transfer 416669c6-86ef-4b88-8774-41aa33bb438a is ready
[ 100.00% ] 6.00 GiB, 21.14 seconds, 290.63 MiB/s                            
[  23.5 ] Finalizing image transfer
[  26.5 ] Creating image transfer for disk 86569d03-189c-4893-b46e-286c2eba8857
[  27.7 ] Image transfer 1f0306b7-842f-44f0-bdb6-287673963fef is ready
[ 100.00% ] 6.00 GiB, 32.03 seconds, 191.84 MiB/s                            
[  59.7 ] Finalizing image transfer
[  61.7 ] Creating image transfer for disk 86569d03-189c-4893-b46e-286c2eba8857
[  62.9 ] Image transfer 66d60bd1-5260-4d40-a542-a0e94435c32c is ready
[ 100.00% ] 6.00 GiB, 34.89 seconds, 176.09 MiB/s                            
[  97.8 ] Finalizing image transfer
[ 106.9 ] Creating image transfer for disk 86569d03-189c-4893-b46e-286c2eba8857
[ 108.0 ] Image transfer 8b3c5465-0147-47a1-93f2-3335c365710f is ready
[ 100.00% ] 6.00 GiB, 31.45 seconds, 195.36 MiB/s                            
[ 139.4 ] Finalizing image transfer
[ 142.5 ] Creating image transfer for disk 86569d03-189c-4893-b46e-286c2eba8857
[ 143.6 ] Image transfer 38b6a790-057d-40c0-b9dc-39bb7792c687 is ready
[ 100.00% ] 6.00 GiB, 31.13 seconds, 197.37 MiB/s                            
[ 174.8 ] Finalizing image transfer
[ 177.9 ] Full backup completed successfully

Inspecting downloads:

$ ls -lhs
total 20G
3.9G -rw-r--r--. 1 nsoffer nsoffer 3.9G Nov 21 04:42 86569d03-189c-4893-b46e-286c2eba8857.00.qcow2
3.9G -rw-r--r--. 1 nsoffer nsoffer 3.9G Nov 21 04:42 86569d03-189c-4893-b46e-286c2eba8857.01.qcow2
3.9G -rw-r--r--. 1 nsoffer nsoffer 3.9G Nov 21 04:43 86569d03-189c-4893-b46e-286c2eba8857.02.qcow2
3.9G -rw-r--r--. 1 nsoffer nsoffer 3.9G Nov 21 04:44 86569d03-189c-4893-b46e-286c2eba8857.03.qcow2
3.9G -rw-r--r--. 1 nsoffer nsoffer 3.9G Nov 21 04:44 86569d03-189c-4893-b46e-286c2eba8857.04.qcow2

$ for i in $(seq 4); do echo "compare backup 00 to 0$i"; qemu-img compare 86569d03-189c-4893-b46e-286c2eba8857.00.qcow2 86569d03-189c-4893-b46e-286c2eba8857.0$i.qcow2; done
compare backup 00 to 01
Images are identical.
compare backup 00 to 02
Images are identical.
compare backup 00 to 03
Images are identical.
compare backup 00 to 04
Images are identical.

All downloads are identical.

Testing downloads using local vm:

$ qemu-img create -f qcow2 -b 86569d03-189c-4893-b46e-286c2eba8857.00.qcow2 -F qcow2 preview.qcow2
Formatting 'preview.qcow2', fmt=qcow2 cluster_size=65536 compression_type=zlib size=6442450944 backing_file=86569d03-189c-4893-b46e-286c2eba8857.00.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16

$ qemu-kvm -m 2048 preview.qcow2

VM boots normally.

Comment 29 Nir Soffer 2020-11-22 10:02:24 UTC
VM 4 - qcow2 disk, virtio - crash during reboot

# virsh -r dumpxml backup-qcow2
<domain type='kvm' id='10' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>backup-qcow2</name>
  <uuid>4198a95d-c9be-413c-a01b-bc2fa3448e26</uuid>
  <metadata xmlns:ns1="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ns1:qos/>
    <ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ovirt-vm:balloonTarget type="int">1048576</ovirt-vm:balloonTarget>
    <ovirt-vm:clusterVersion>4.5</ovirt-vm:clusterVersion>
    <ovirt-vm:destroy_on_reboot type="bool">False</ovirt-vm:destroy_on_reboot>
    <ovirt-vm:launchPaused>false</ovirt-vm:launchPaused>
    <ovirt-vm:memGuaranteedSize type="int">1024</ovirt-vm:memGuaranteedSize>
    <ovirt-vm:minGuaranteedMemoryMb type="int">1024</ovirt-vm:minGuaranteedMemoryMb>
    <ovirt-vm:resumeBehavior>auto_resume</ovirt-vm:resumeBehavior>
    <ovirt-vm:startTime type="float">1605927260.501533</ovirt-vm:startTime>
    <ovirt-vm:device mac_address="56:6f:7a:5a:00:02">
        <ovirt-vm:network>ovirtmgmt</ovirt-vm:network>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="vda">
        <ovirt-vm:domainID>839db77f-fde3-4e13-bfb6-56be604631ed</ovirt-vm:domainID>
        <ovirt-vm:imageID>86569d03-189c-4893-b46e-286c2eba8857</ovirt-vm:imageID>
        <ovirt-vm:poolID>fb7797bd-3cca-4baa-bcb3-35e8851015f7</ovirt-vm:poolID>
        <ovirt-vm:volumeID>fa0ff024-932b-4515-8dab-d2a2ed5a7bf9</ovirt-vm:volumeID>
        <ovirt-vm:specParams>
            <ovirt-vm:pinToIoThread>1</ovirt-vm:pinToIoThread>
        </ovirt-vm:specParams>
        <ovirt-vm:volumeChain>
            <ovirt-vm:volumeChainNode>
                <ovirt-vm:domainID>839db77f-fde3-4e13-bfb6-56be604631ed</ovirt-vm:domainID>
                <ovirt-vm:imageID>86569d03-189c-4893-b46e-286c2eba8857</ovirt-vm:imageID>
                <ovirt-vm:leaseOffset type="int">0</ovirt-vm:leaseOffset>
                <ovirt-vm:leasePath>/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/86569d03-189c-4893-b46e-286c2eba8857/fa0ff024-932b-4515-8dab-d2a2ed5a7bf9.lease</ovirt-vm:leasePath>
                <ovirt-vm:path>/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/86569d03-189c-4893-b46e-286c2eba8857/fa0ff024-932b-4515-8dab-d2a2ed5a7bf9</ovirt-vm:path>
                <ovirt-vm:volumeID>fa0ff024-932b-4515-8dab-d2a2ed5a7bf9</ovirt-vm:volumeID>
            </ovirt-vm:volumeChainNode>
        </ovirt-vm:volumeChain>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="sdc"/>
</ovirt-vm:vm>
  </metadata>
  <maxMemory slots='16' unit='KiB'>4194304</maxMemory>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static' current='1'>16</vcpu>
  <iothreads>1</iothreads>
  <resource>
    <partition>/machine</partition>
  </resource>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>oVirt</entry>
      <entry name='product'>RHEL</entry>
      <entry name='version'>8.3-1.0.el8</entry>
      <entry name='serial'>095bce26-9d5b-4ae2-b3bf-dddfc9cdf541</entry>
      <entry name='uuid'>4198a95d-c9be-413c-a01b-bc2fa3448e26</entry>
      <entry name='family'>oVirt</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-q35-rhel8.1.0'>hvm</type>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Skylake-Client</model>
    <topology sockets='16' dies='1' cores='1' threads='1'/>
    <feature policy='disable' name='hle'/>
    <feature policy='disable' name='rtm'/>
    <feature policy='disable' name='mpx'/>
    <feature policy='require' name='hypervisor'/>
    <numa>
      <cell id='0' cpus='0-15' memory='1048576' unit='KiB'/>
    </numa>
  </cpu>
  <clock offset='variable' adjustment='0' basis='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' error_policy='report'/>
      <source startupPolicy='optional'/>
      <target dev='sdc' bus='sata'/>
      <readonly/>
      <alias name='ua-87de72de-d473-4b69-b5fc-bb8f2a65eda1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <disk type='file' device='disk' snapshot='no'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='threads' iothread='1'/>
      <source file='/rhev/data-center/mnt/storage:_export_1/839db77f-fde3-4e13-bfb6-56be604631ed/images/86569d03-189c-4893-b46e-286c2eba8857/fa0ff024-932b-4515-8dab-d2a2ed5a7bf9' index='1'>
        <seclabel model='dac' relabel='no'/>
      </source>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <serial>86569d03-189c-4893-b46e-286c2eba8857</serial>
      <boot order='1'/>
      <alias name='ua-86569d03-189c-4893-b46e-286c2eba8857'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </disk>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <alias name='pci.5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x15'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x16'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x17'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0x18'/>
      <alias name='pci.9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='10' port='0x19'/>
      <alias name='pci.10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x1'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='11' port='0x1a'/>
      <alias name='pci.11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x2'/>
    </controller>
    <controller type='pci' index='12' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='12' port='0x1b'/>
      <alias name='pci.12'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x3'/>
    </controller>
    <controller type='pci' index='13' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='13' port='0x1c'/>
      <alias name='pci.13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x4'/>
    </controller>
    <controller type='pci' index='14' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='14' port='0x1d'/>
      <alias name='pci.14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x5'/>
    </controller>
    <controller type='pci' index='15' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='15' port='0x1e'/>
      <alias name='pci.15'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x6'/>
    </controller>
    <controller type='pci' index='16' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='16' port='0x1f'/>
      <alias name='pci.16'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x7'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='virtio-serial' index='0' ports='16'>
      <alias name='ua-6a467db1-1ffc-4f77-9b58-b746a1b048be'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <driver iothread='1'/>
      <alias name='ua-cc2f22e8-9625-4dc0-8d70-3513b45975bd'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='qemu-xhci' ports='8'>
      <alias name='ua-ec25086f-5623-4d02-b596-5b5251ccce02'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='56:6f:7a:5a:00:02'/>
      <source bridge='ovirtmgmt'/>
      <target dev='vnet9'/>
      <model type='virtio'/>
      <filterref filter='vdsm-no-mac-spoofing'/>
      <link state='up'/>
      <mtu size='1500'/>
      <alias name='ua-d44c62e4-0a44-4a11-a3f1-31247c73f6a0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/4198a95d-c9be-413c-a01b-bc2fa3448e26.ovirt-guest-agent.0'/>
      <target type='virtio' name='ovirt-guest-agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channels/4198a95d-c9be-413c-a01b-bc2fa3448e26.org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel1'/>
      <address type='virtio-serial' controller='0' bus='0' port='2'/>
    </channel>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0' state='disconnected'/>
      <alias name='channel2'/>
      <address type='virtio-serial' controller='0' bus='0' port='3'/>
    </channel>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'>
      <alias name='input1'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input2'/>
    </input>
    <graphics type='spice' port='5900' tlsPort='5901' autoport='yes' listen='192.168.122.22' passwdValidTo='1970-01-01T00:00:01'>
      <listen type='network' address='192.168.122.22' network='vdsm-ovirtmgmt'/>
      <channel name='main' mode='secure'/>
      <channel name='display' mode='secure'/>
      <channel name='inputs' mode='secure'/>
      <channel name='cursor' mode='secure'/>
      <channel name='playback' mode='secure'/>
      <channel name='record' mode='secure'/>
      <channel name='smartcard' mode='secure'/>
      <channel name='usbredir' mode='secure'/>
    </graphics>
    <graphics type='vnc' port='5902' autoport='yes' listen='192.168.122.22' keymap='en-us' passwdValidTo='1970-01-01T00:00:01'>
      <listen type='network' address='192.168.122.22' network='vdsm-ovirtmgmt'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='8192' vgamem='16384' heads='1' primary='yes'/>
      <alias name='ua-d7761ee1-a13e-4314-979b-92d4968e2e61'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <stats period='5'/>
      <alias name='ua-96960a0a-f56f-4c4c-8b8a-bb51410db7b8'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <alias name='ua-494fbdfa-a451-4a9d-9573-f821e4726847'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </rng>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c50,c207</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c50,c207</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
  <qemu:capabilities>
    <qemu:add capability='blockdev'/>
    <qemu:add capability='incremental-backup'/>
  </qemu:capabilities>
</domain>


Backup log:

$ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/qcow2 4198a95d-c9be-413c-a01b-bc2fa3448e26
[   0.0 ] Starting full backup for VM 4198a95d-c9be-413c-a01b-bc2fa3448e26
[   0.6 ] Waiting until backup 3e8bae25-63da-4fc9-bddb-5dffaab67319 is ready
[   3.6 ] Created checkpoint '8d7530dd-deeb-41d3-81c4-76cafb71ec13' (to use in --from-checkpoint-uuid for the next incremental backup)
[   3.6 ] Creating image transfer for disk 86569d03-189c-4893-b46e-286c2eba8857
[   4.7 ] Image transfer e9e6b021-0410-467c-afb2-dc46445d895c is ready
[ 100.00% ] 6.00 GiB, 23.21 seconds, 264.73 MiB/s                              
[  27.9 ] Finalizing image transfer
[  29.0 ] Creating image transfer for disk 86569d03-189c-4893-b46e-286c2eba8857
[  30.1 ] Image transfer f554dd46-a58d-4193-8433-e4e1cf6a5784 is ready
[  29.78% ] 1.79 GiB, 6.37 seconds, 287.02 MiB/s                              
[  36.5 ] Finalizing image transfer
Traceback (most recent call last):
  File "./backup_vm.py", line 177, in cmd_full
    download_backup(connection, backup.id, args)
  File "./backup_vm.py", line 355, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "./backup_vm.py", line 390, in download_disk
    client.download(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download
    io.copy(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 213, in write_to
    raise RuntimeError(
RuntimeError: Expected 24182784 bytes, got 16777216 bytes

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./backup_vm.py", line 429, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 179, in cmd_full
    stop_backup(connection, backup.id, args)
  File "./backup_vm.py", line 327, in stop_backup
    backup_service.finalize()
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/services.py", line 33306, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Host not found.]". HTTP response code is 400.


Server log:
/var/log/ovirt-imageio/daemon.log

2020-11-21 04:53:29,238 ERROR   (Thread-263) [http] Request aborted after starting response
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 774, in __call__
    self.dispatch(req, resp)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 819, in dispatch
    return method(req, resp, *match.groups())
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/cors.py", line 84, in wrapper
    return func(self, req, resp, *args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/images.py", line 148, in get
    ticket.run(op)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/auth.py", line 174, in run
    return operation.run()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/ops.py", line 57, in run
    res = self._run()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/ops.py", line 100, in _run
    self._read_chunk()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/ops.py", line 112, in _read_chunk
    count = self._src.readinto(view)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/nbd.py", line 110, in readinto
    self._client.readinto(self._position, view)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 416, in readinto
    self._recv_reply(cmd)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 935, in _recv_reply
    magic = self._recv_fmt("!I")[0]
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 1190, in _recv_fmt
    data = self._recv(s.size)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 1200, in _recv
    self._recv_into(buf)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 1213, in _recv_into
    .format(pos, length))
ovirt_imageio._internal.nbd.ProtocolError: Server closed the connection, read 0 bytes, expected 4 bytes


Libvirt qemu log:

2020-11-21 02:51:57.306+0000: Domain id=9 is tainted: custom-hypervisor-feature
2020-11-21T02:51:57.488883Z qemu-kvm: -numa node,nodeid=0,cpus=0-15,mem=1024: warning: Parameter -numa node,mem is deprecated,
 use -numa node,memdev instead
2020-11-21 02:52:20.345+0000: Domain id=9 is tainted: custom-ga-command
aio_co_schedule: Co-routine was already scheduled in 'aio_co_schedule'
2020-11-21 02:53:29.451+0000: shutting down, reason=crashed

qemu coredump:

Program terminated with signal SIGABRT, Aborted.
#0  0x00007f37f011a7ff in ?? ()

Comment 30 Nir Soffer 2020-11-22 10:03:04 UTC
VM 5 - qcow2 disk, virtio-scsi, enable incremental backup

(same xml as vm 3)

Backup log:

Using the unmodified backup script:
$ ./backup_vm.py -c engine-dev full --backup-dir /var/tmp/backups/qcow2 6ba753b5-2a04-4a6b-9874-6288131cfe64
[   0.0 ] Starting full backup for VM 6ba753b5-2a04-4a6b-9874-6288131cfe64
[   0.2 ] Waiting until backup d741e566-1073-494b-8fc0-b5c7d87cb9a3 is ready
[   1.2 ] Created checkpoint '2842ca29-f527-48d2-b2b7-725ebb425260' (to use in --from-checkpoint-uuid for the next incremental backup)
[   1.2 ] Creating image transfer for disk e494e31d-dae9-4d40-8160-d01aa6755542
[   2.3 ] Image transfer d32d3aed-7b78-4882-b5e3-008ab356d383 is ready
[ 100.00% ] 6.00 GiB, 21.08 seconds, 291.41 MiB/s                              
[  23.4 ] Finalizing image transfer
[  26.5 ] Full backup completed successfully

Using the modified backup script:
$ ./backup_vm.py -c engine-dev incremental --from-checkpoint-uuid 2842ca29-f527-48d2-b2b7-725ebb425260 --backup-dir /var/tmp/backups/qcow2 6ba753b5-2a04-4a6b-9874-6288131cfe64
[   0.0 ] Starting incremental backup for VM 6ba753b5-2a04-4a6b-9874-6288131cfe64
[   0.3 ] Waiting until backup dd032904-8e77-4764-a5ce-f6a8a53de71d is ready
[   1.3 ] Created checkpoint 'f9fa62e0-237f-48b1-a3c4-242a72848733' (to use in --from-checkpoint-uuid for the next incremental backup)
[   1.3 ] Creating image transfer for disk e494e31d-dae9-4d40-8160-d01aa6755542
[   2.4 ] Image transfer a2b2c1c1-fc2c-4df3-9bcc-a1783d383d91 is ready
[ 100.00% ] 6.00 GiB, 1.75 seconds, 3.42 GiB/s                                
[   4.2 ] Finalizing image transfer
[   8.2 ] Creating image transfer for disk e494e31d-dae9-4d40-8160-d01aa6755542
[   9.3 ] Image transfer 26522267-0e90-41e4-936f-852cc72cdde8 is ready
[ 100.00% ] 6.00 GiB, 1.86 seconds, 3.22 GiB/s                                
[  11.2 ] Finalizing image transfer
[  15.2 ] Creating image transfer for disk e494e31d-dae9-4d40-8160-d01aa6755542
[  16.3 ] Image transfer 2aac5804-6997-4947-ab90-e3dd9a30e856 is ready
[ 100.00% ] 6.00 GiB, 1.58 seconds, 3.79 GiB/s                                
[  17.9 ] Finalizing image transfer
[  22.9 ] Creating image transfer for disk e494e31d-dae9-4d40-8160-d01aa6755542
[  24.1 ] Image transfer c025fb1b-a437-4fc7-a535-30ab1ab057d7 is ready
[ 100.00% ] 6.00 GiB, 1.74 seconds, 3.44 GiB/s                                
[  25.8 ] Finalizing image transfer
[  29.9 ] Creating image transfer for disk e494e31d-dae9-4d40-8160-d01aa6755542
[  31.0 ] Image transfer 799daffe-a9f8-4122-8dca-bf49db26efb5 is ready
[ 100.00% ] 6.00 GiB, 2.80 seconds, 2.14 GiB/s                                
[  33.8 ] Finalizing image transfer
[  37.0 ] Incremental backup completed successfully

Comparing backups:

$ ls -lhs
total 6.4G
506M -rw-r--r--. 1 nsoffer nsoffer 506M Nov 21 05:07 e494e31d-dae9-4d40-8160-d01aa6755542.00.qcow2
506M -rw-r--r--. 1 nsoffer nsoffer 506M Nov 21 05:07 e494e31d-dae9-4d40-8160-d01aa6755542.01.qcow2
506M -rw-r--r--. 1 nsoffer nsoffer 506M Nov 21 05:07 e494e31d-dae9-4d40-8160-d01aa6755542.02.qcow2
506M -rw-r--r--. 1 nsoffer nsoffer 506M Nov 21 05:07 e494e31d-dae9-4d40-8160-d01aa6755542.03.qcow2
506M -rw-r--r--. 1 nsoffer nsoffer 506M Nov 21 05:08 e494e31d-dae9-4d40-8160-d01aa6755542.04.qcow2
3.9G -rw-r--r--. 1 nsoffer nsoffer 3.9G Nov 21 05:04 e494e31d-dae9-4d40-8160-d01aa6755542.202011210504.full.qcow2

$ for i in $(seq 4); do echo "compare backup 00 to 0$i"; qemu-img compare e494e31d-dae9-4d40-8160-d01aa6755542.00.qcow2 e494e31d-dae9-4d40-8160-d01aa6755542.0$i.qcow2; done
compare backup 00 to 01
Images are identical.
compare backup 00 to 02
Images are identical.
compare backup 00 to 03
Images are identical.
compare backup 00 to 04
Images are identical.

Testing backup using local vm:

$ qemu-img rebase -u -b e494e31d-dae9-4d40-8160-d01aa6755542.202011210504.full.qcow2 -F qcow2 e494e31d-dae9-4d40-8160-d01aa6755542.00.qcow2

$ qemu-img create -f qcow2 -b e494e31d-dae9-4d40-8160-d01aa6755542.00.qcow2 -F qcow2 preview.qcow2
Formatting 'preview.qcow2', fmt=qcow2 cluster_size=65536 compression_type=zlib size=6442450944 backing_file=e494e31d-dae9-4d40-8160-d01aa6755542.00.qcow2 backing_fmt=qcow2 lazy_refcounts=off refcount_bits=16

$ qemu-kvm -m 2048 preview.qcow2

VM boots normally.

Comment 31 Nir Soffer 2020-11-22 10:04:02 UTC
VM 6 - qcow2 disk, virtio, enabled for incremental backup - crash during reboot

(same xml as vm 4)

Backup log:

$ ./backup_vm.py -c engine-dev incremental --from-checkpoint-uuid 2842ca29-f527-48d2-b2b7-725ebb425260 --backup-dir /var/tmp/backups/qcow2 6ba753b5-2a04-4a6b-9874-6288131cfe64
[   0.0 ] Starting incremental backup for VM 6ba753b5-2a04-4a6b-9874-6288131cfe64
[   0.3 ] Waiting until backup 310cce5b-9868-408b-a378-370f5c8c3282 is ready
[   1.3 ] Created checkpoint '3319c16d-9af8-427c-ba44-f6d567bd176f' (to use in --from-checkpoint-uuid for the next incremental backup)
[   1.3 ] Creating image transfer for disk e494e31d-dae9-4d40-8160-d01aa6755542
[   2.4 ] Image transfer b0306363-c2ef-41cc-9b14-2a8962491ebf is ready
[ 100.00% ] 6.00 GiB, 1.92 seconds, 3.13 GiB/s                                
[   4.3 ] Finalizing image transfer
[   8.4 ] Creating image transfer for disk e494e31d-dae9-4d40-8160-d01aa6755542
[   9.5 ] Image transfer 738b5421-d71a-478e-9ab3-6e97ed424c4e is ready
[ 100.00% ] 6.00 GiB, 2.12 seconds, 2.83 GiB/s                                
[  11.6 ] Finalizing image transfer
[  15.6 ] Creating image transfer for disk e494e31d-dae9-4d40-8160-d01aa6755542
[  16.7 ] Image transfer 589a6f9e-c762-4322-b2e0-cef821a8a9fb is ready
[ 100.00% ] 6.00 GiB, 1.89 seconds, 3.17 GiB/s                                
[  18.6 ] Finalizing image transfer
[  22.6 ] Creating image transfer for disk e494e31d-dae9-4d40-8160-d01aa6755542
[  23.7 ] Image transfer 3de30c0e-6fce-4d02-80bd-bd6d9de50a13 is ready
[  43.82% ] 2.63 GiB, 0.27 seconds, 9.65 GiB/s                                
[  24.0 ] Finalizing image transfer
Traceback (most recent call last):
  File "./backup_vm.py", line 194, in cmd_incremental
    download_backup(connection, backup.id, args, incremental=True)
  File "./backup_vm.py", line 355, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "./backup_vm.py", line 390, in download_disk
    client.download(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/client/_api.py", line 172, in download
    io.copy(
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 204, in write_to
    res = self._get(length)
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 431, in _get
    self._reraise(res.status, res.read())
  File "/usr/lib64/python3.8/site-packages/ovirt_imageio/_internal/backends/http.py", line 585, in _reraise
    raise http.Error(status, msg)
ovirt_imageio._internal.http.Error: Server failed to perform the request, check logs

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./backup_vm.py", line 429, in <module>
    main()
  File "./backup_vm.py", line 161, in main
    args.command(args)
  File "./backup_vm.py", line 196, in cmd_incremental
    stop_backup(connection, backup.id, args)
  File "./backup_vm.py", line 327, in stop_backup
    backup_service.finalize()
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/services.py", line 33306, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/home/nsoffer/.local/lib/python3.8/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Host not found.]". HTTP response code is 400.

Server log:
/var/log/ovirt-imageio/daemon.log

2020-11-21 05:17:42,480 INFO    (Thread-358) [backends.nbd] Open backend address='/run/vdsm/backup/310cce5b-9868-408b-a378-370f5c8c3282' export_name='vda' sparse=True max_connections=8
2020-11-21 05:17:42,637 ERROR   (Thread-355) [http] Server error
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 774, in __call__
    self.dispatch(req, resp)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/http.py", line 819, in dispatch
    return method(req, resp, *match.groups())
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/cors.py", line 84, in wrapper
    return func(self, req, resp, *args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/images.py", line 148, in get
    ticket.run(op)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/auth.py", line 174, in run
    return operation.run()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/ops.py", line 57, in run
    res = self._run()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/ops.py", line 100, in _run
    self._read_chunk()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/ops.py", line 112, in _read_chunk
    count = self._src.readinto(view)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/nbd.py", line 110, in readinto
    self._client.readinto(self._position, view)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 415, in readinto
    self._send_command(cmd)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 928, in _send_command
    self._send(cmd.to_bytes())
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/nbd.py", line 1196, in _send
    self._sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe

Libvirt qemu log:
/var/log/libvirt/qemu/backup-qcow2-incremental.log

2020-11-21 03:15:13.546+0000: Domain id=12 is tainted: custom-hypervisor-feature
2020-11-21T03:15:13.924968Z qemu-kvm: -numa node,nodeid=0,cpus=0-15,mem=1024: warning: Parameter -numa node,mem is deprecated, use -numa node,memdev instead
2020-11-21 03:15:31.751+0000: Domain id=12 is tainted: custom-ga-command
aio_co_schedule: Co-routine was already scheduled in 'aio_co_schedule'
2020-11-21 03:17:42.815+0000: shutting down, reason=crashed

qemu coredump:

Program terminated with signal SIGABRT, Aborted.
#0  0x00007f37f011a7ff in ?? ()

Comment 32 Yury.Panchenko 2020-11-24 13:19:13 UTC
Hello, Nir.
I retested the original case on the new build.
Looks like the problem is gone.
Package versions:
RHEL 8.3 AV
[root@KVMC-RHEL83 ~]# rpm -q vdsm ovirt-imageio-daemon libvirt-daemon qemu-kvm
vdsm-4.40.37-9.git2611c054f.el8.x86_64
ovirt-imageio-daemon-2.2.0-0.202011111341.git5931b13.el8.x86_64
libvirt-daemon-6.6.0-7.module+el8.3.0+8424+5ea525c5.x86_64
qemu-kvm-5.1.0-14.module+el8.3.0+8438+644aff69.x86_64

ovirt-manager version: 4.4-20201123175824.1.el8

Comment 33 Nir Soffer 2020-11-24 13:33:43 UTC
(In reply to Yury.Panchenko from comment #32)
> Looks like the problem is gone.
...
> qemu-kvm-5.1.0-14.module+el8.3.0+8438+644aff69.x86_64

Maybe this version fixed the issue; I tested with 5.1.0-13.

But according to my tests, the crash happens only with a virtio
disk. Our default is virtio-scsi. Which interface did you test?

You can change the interface when creating a disk, or when editing
a disk while the VM is down.
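For illustration, here is a minimal sketch (not part of the reproduction steps) of switching a disk attachment to virtio-scsi with the Python SDK while the VM is down; the engine URL, credentials and VM UUID are placeholders:

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    # Placeholder connection details - adjust for your engine.
    connection = sdk.Connection(
        url="https://engine.example.com/ovirt-engine/api",
        username="admin@internal",
        password="password",
        ca_file="ca.pem")

    attachments = (connection.system_service()
                   .vms_service()
                   .vm_service("VM-UUID")          # placeholder VM id
                   .disk_attachments_service())

    # Switch the first attached disk to virtio-scsi
    # (use types.DiskInterface.VIRTIO for a virtio disk).
    attachment = attachments.list()[0]
    attachments.attachment_service(attachment.id).update(
        types.DiskAttachment(interface=types.DiskInterface.VIRTIO_SCSI))

    connection.close()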

Please share the vm xml you tested with. The best way is using virsh:

    virsh -r dumpxml vm-name

Comment 34 Yury.Panchenko 2020-11-24 13:59:49 UTC
> But according to my tests, the crash happens only with a virtio
> disk. Our default is virtio-scsi. Which interface did you test?
I tested only the original case (VM corruption on reboot).
I didn't test a virtio disk. I think this crash still exists, but you already created another bug for that case.

Comment 35 Yuriy Khokhlov (Veeam) 2020-12-07 08:43:36 UTC
I've cleared the Need Info flag because Yuriy Panchenko has already provided the information.

Comment 36 Eyal Shenitzky 2021-04-19 06:42:37 UTC
The qemu version with the fix will now be used by VDSM after patch https://gerrit.ovirt.org/#/c/vdsm/+/114307/ was merged.
Moving to QA for testing.

Comment 37 Ilan Zuckerman 2021-04-19 07:03:52 UTC
Nir, from reading all your comments after comment #25, I understand that you could not reproduce the silent data corruption.
Is there currently a way to verify this bug?

Comment 38 Nir Soffer 2021-04-19 16:24:52 UTC
We could not reproduce the reported issue of silent data corruption.

Testing revealed:

- qemu bug 1900326 which is already fixed in RHEL AV 8.3.1
  qemu-kvm-5.1.0-19.module+el8.3.1+9795+4ce2a535

- we don't have automated tests for this flow - if we had them, this
  issue would not have been reported by partners.

There is a change to test on our side, but we need automated tests
for the case when the guest performs a reboot during backup. You can see
that this worked when I tested with a virtio-scsi disk and failed with
virtio disks. With the current qemu version this should work with both.

We need automated tests for both virtio and virtio-scsi disks, ideally
as separate tests, so that if there is a new regression only the relevant
test will fail.
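A rough sketch of such a test (not the project's actual test code): it drives the example backup script and reboots the guest over ssh a few seconds into the backup. The VM UUIDs, guest hostnames and timing are assumptions.

    import subprocess
    import threading
    import time

    import pytest

    BACKUP_VM = "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py"

    # Hypothetical fixtures: one prepared VM per disk interface.
    VMS = {
        "virtio": ("VM-UUID-VIRTIO", "virtio-vm.example.com"),
        "virtio-scsi": ("VM-UUID-VIRTIO-SCSI", "virtio-scsi-vm.example.com"),
    }

    @pytest.mark.parametrize("interface", ["virtio", "virtio-scsi"])
    def test_reboot_during_full_backup(tmp_path, interface):
        vm_id, guest = VMS[interface]

        def reboot_guest():
            # Reboot from inside the guest while the backup is running,
            # as in the manual reproduction.
            time.sleep(20)
            subprocess.run(["ssh", f"root@{guest}", "reboot"], check=False)

        rebooter = threading.Thread(target=reboot_guest)
        rebooter.start()
        try:
            # The backup must complete even though the guest rebooted.
            subprocess.run(
                ["python3", BACKUP_VM, "-c", "engine", "full",
                 "--backup-dir", str(tmp_path), vm_id],
                check=True, timeout=1800)
        finally:
            rebooter.join()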

Comment 39 Ilan Zuckerman 2021-05-02 07:35:54 UTC
Hi Nir, I think that the backup_vm.py script got broken somewhere along the way [1].
Is it possible to test this flow by breaking it into 3 steps (start, download, stop)?
Testing on rhv-4.4.6-7

[1]:
[root@lynx21 ~]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine full --backup-dir /var/tmp/ 0a2700c2-84d0-4c6a-b6b5-8f1a0412780b
[   0.0 ] Starting full backup for VM 0a2700c2-84d0-4c6a-b6b5-8f1a0412780b
[   1.5 ] Waiting until backup 78236bd4-bfcd-4428-a912-1d84893e7ff0 is ready
[  18.8 ] Created checkpoint 'f8f7c68c-d30b-460a-92ff-1ca2b364e62a' (to use in --from-checkpoint-uuid for the next incremental backup)
[  18.8 ] Creating image transfer for disk f2c917e3-050a-45ca-a6f4-01b8a8c9d210
[  18.8 ] Finalizing backup
[  18.9 ] Waiting until backup is finalized
Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 452, in <module>
    main()
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 167, in main
    args.command(args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 183, in cmd_full
    download_backup(connection, backup.id, args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 371, in download_backup
    connection, backup_uuid, disk, disk_path, args, incremental=download_incremental)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 388, in download_disk
    timeout_policy=types.ImageTransferTimeoutPolicy(args.timeout_policy))
AttributeError: 'Namespace' object has no attribute 'timeout_policy'

Comment 40 Eyal Shenitzky 2021-05-02 11:05:41 UTC
(In reply to Ilan Zuckerman from comment #39)
> Hi Nir, I think that the backup_vm.py script got broken somewhere along the
> way [1].
> Is it possible to test this flow by breaking it to 3 steps (start, download,
> stop)?
> Testing on rhv-4.4.6-7
> 
> [1]:
> [root@lynx21 ~]# python3
> /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine
> full --backup-dir /var/tmp/ 0a2700c2-84d0-4c6a-b6b5-8f1a0412780b
> [   0.0 ] Starting full backup for VM 0a2700c2-84d0-4c6a-b6b5-8f1a0412780b
> [   1.5 ] Waiting until backup 78236bd4-bfcd-4428-a912-1d84893e7ff0 is ready
> [  18.8 ] Created checkpoint 'f8f7c68c-d30b-460a-92ff-1ca2b364e62a' (to use
> in --from-checkpoint-uuid for the next incremental backup)
> [  18.8 ] Creating image transfer for disk
> f2c917e3-050a-45ca-a6f4-01b8a8c9d210
> [  18.8 ] Finalizing backup
> [  18.9 ] Waiting until backup is finalized
> Traceback (most recent call last):
>   File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py",
> line 452, in <module>
>     main()
>   File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py",
> line 167, in main
>     args.command(args)
>   File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py",
> line 183, in cmd_full
>     download_backup(connection, backup.id, args)
>   File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py",
> line 371, in download_backup
>     connection, backup_uuid, disk, disk_path, args,
> incremental=download_incremental)
>   File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py",
> line 388, in download_disk
>     timeout_policy=types.ImageTransferTimeoutPolicy(args.timeout_policy))
> AttributeError: 'Namespace' object has no attribute 'timeout_policy'

The timeout_policy was added to the SDK recently; maybe you are working with an outdated SDK version.
Please update the SDK version and try again.
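The AttributeError itself could also be avoided with a small defensive tweak in the script (illustrative only, not the submitted fix): build the timeout_policy argument only when the argument parser actually defined it.

    from ovirtsdk4 import types

    def transfer_timeout_args(args):
        """Build optional timeout_policy kwargs from an argparse.Namespace.

        On older script/SDK combinations the sub-parser may not define
        timeout_policy at all, which caused the AttributeError above.
        The result would be passed as **kwargs to the transfer-creation
        helper the script uses.
        """
        extra_args = {}
        timeout_policy = getattr(args, "timeout_policy", None)
        if timeout_policy:
            extra_args["timeout_policy"] = types.ImageTransferTimeoutPolicy(
                timeout_policy)
        return extra_args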

Comment 41 Eyal Shenitzky 2021-05-02 11:58:52 UTC
Found an issue with the SDK script; a fix was submitted - https://gerrit.ovirt.org/#/c/ovirt-engine-sdk/+/114584/.
Using start, download, stop should work.

Comment 42 Ilan Zuckerman 2021-05-03 10:07:31 UTC
Failed to verify on:

rhv-release-4.4.6-7

[root@lynx12 ~]# rpm -q vdsm ovirt-imageio-daemon libvirt-daemon qemu-kvm
vdsm-4.40.60.6-1.el8ev.x86_64
ovirt-imageio-daemon-2.1.1-1.el8ev.x86_64
libvirt-daemon-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64
qemu-kvm-5.2.0-15.module+el8.4.0+10650+50781ca0.x86_64


During the second VM flow (VM_2_raw_disk_virtio), the 'full' cycle of the VM backup failed.
The 'backup_vm.py' script was modified as Nir explained in his comment #25.
The verification steps were done as Nir explained in comments #26 and #27.

=====================================================================================


VM_1_raw_disk_virtio-scsi - SUCCEEDED

[root@lynx21 ~]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine full 13706768-250e-44bb-84fe-2a94b431e66e --backup-dir /rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com\:_nas01_ge__2__nfs__0/backups/raw
[   0.0 ] Starting full backup for VM 13706768-250e-44bb-84fe-2a94b431e66e
[   1.4 ] Waiting until backup 4a565d86-db5d-4c77-acf3-09234c2c71e6 is ready
[  20.7 ] Creating image transfer for disk d25c61ff-e473-4c11-868a-b69fd8f36dd7
[  21.9 ] Image transfer 206cf13a-677a-49c0-8128-5ada2ee1bd93 is ready
[ 100.00% ] 10.00 GiB, 121.96 seconds, 83.96 MiB/s                             
[ 143.9 ] Finalizing image transfer
[ 147.0 ] Creating image transfer for disk d25c61ff-e473-4c11-868a-b69fd8f36dd7
[ 148.1 ] Image transfer 41d20410-825c-4c8b-818e-918d3011af5e is ready
[ 100.00% ] 10.00 GiB, 125.75 seconds, 81.43 MiB/s                             
[ 273.9 ] Finalizing image transfer
[ 283.0 ] Creating image transfer for disk d25c61ff-e473-4c11-868a-b69fd8f36dd7
[ 284.2 ] Image transfer 8d9083c8-7a14-4aec-8d62-450e9bcefea5 is ready
[ 100.00% ] 10.00 GiB, 121.84 seconds, 84.05 MiB/s                             
[ 406.0 ] Finalizing image transfer
[ 409.1 ] Creating image transfer for disk d25c61ff-e473-4c11-868a-b69fd8f36dd7
[ 410.2 ] Image transfer f4c21a76-3018-4b40-a19f-eb46ab530702 is ready
[ 100.00% ] 10.00 GiB, 117.86 seconds, 86.88 MiB/s                             
[ 528.1 ] Finalizing image transfer
[ 535.2 ] Creating image transfer for disk d25c61ff-e473-4c11-868a-b69fd8f36dd7
[ 536.3 ] Image transfer 4f98ab8a-64a8-493e-9349-321bd6003c69 is ready
[ 100.00% ] 10.00 GiB, 124.78 seconds, 82.06 MiB/s                             
[ 661.1 ] Finalizing image transfer
[ 671.3 ] Finalizing backup
[ 671.4 ] Waiting until backup is finalized
[ 671.4 ] Full backup completed successfully

Dump xml : VM_1_raw_disk_virtio-scsi_dump.xml

[root@lynx21 raw]# for i in $(seq 4); do echo "compare backup 00 to 0$i"; qemu-img compare d25c61ff-e473-4c11-868a-b69fd8f36dd7.00.qcow2 d25c61ff-e473-4c11-868a-b69fd8f36dd7.0$i.qcow2; done
compare backup 00 to 01
Images are identical.
compare backup 00 to 02
Images are identical.
compare backup 00 to 03
Images are identical.
compare backup 00 to 04
Images are identical.

=======================================================================

VM_2_raw_disk_virtio - FAILED

Dump xml: VM_2_raw_disk_virtio_dump.xml

The VM became 'not responding' during the first reboot.
Logs attached: engine + vdsm + daemon + VM_2_raw_disk_virtio.log (libvirt)

[root@lynx12 ~]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine full 12e0c686-ab8d-4c41-b072-744205f2ef3e --backup-dir /rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com\:_nas01_ge__2__nfs__0/backups/raw2/
[   0.0 ] Starting full backup for VM 12e0c686-ab8d-4c41-b072-744205f2ef3e
[   1.4 ] Waiting until backup 2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32 is ready
[  17.6 ] Creating image transfer for disk b38a4912-c8a9-45dd-b42d-9b07c8b9dd24
[  18.8 ] Image transfer 7a852a9b-dd5b-4202-91ed-73ed889fb5bf is ready
[  10.00% ] 1.00 GiB, 73.15 seconds, 14.00 MiB/s                               
[  91.9 ] Finalizing image transfer
[ 152.6 ] Finalizing backup
Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 420, in download_disk
    **extra_args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/client/_api.py", line 186, in download
    name="download")
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 204, in write_to
    res = self._get(length)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 428, in _get
    res = self._con.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1361, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 311, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 272, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib64/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 177, in cmd_full
    download_backup(connection, backup.id, args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 376, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 423, in download_disk
    imagetransfer.finalize_transfer(connection, transfer, disk)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 332, in finalize_transfer
    .format(transfer.id))
RuntimeError: Timed out waiting for transfer 7a852a9b-dd5b-4202-91ed-73ed889fb5bf to finalize

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 457, in <module>
    main()
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 161, in main
    args.command(args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 180, in cmd_full
    stop_backup(connection, backup.id, args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 335, in stop_backup
    backup_service.finalize()
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 33793, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[]". HTTP response code is 400.

Comment 43 Ilan Zuckerman 2021-05-03 10:08:28 UTC
Created attachment 1778896 [details]
Failed to verify logs

Comment 44 Nir Soffer 2021-05-10 13:02:07 UTC
Based on comment 42 and the attached logs, there may be a new issue in qemu when rebooting
a VM during backup.

What we see in the logs

[root@lynx12 ~]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine full 12e0c686-ab8d-4c41-b072-744205f2ef3e --backup-dir /rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com\:_nas01_ge__2__nfs__0/backups/raw2/

You are testing full backup instead of incremental backup. While we want to test that the VM
can be rebooted during both full and incremental backup, testing that we can download the
same backup several times during a full backup is less interesting.

When testing full backup, you can download the backup once. When testing incremental
backup you can test multiple downloads of the backup.
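For example, an incremental run can be driven the same way as the full backups shown above, reusing the checkpoint UUID printed by an earlier full backup (a sketch; it only wraps the example script used throughout this bug):

    import subprocess

    BACKUP_VM = "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py"

    def incremental_backup(vm_id, checkpoint_id, backup_dir):
        # checkpoint_id is the value printed by the previous backup
        # ("to use in --from-checkpoint-uuid for the next incremental backup").
        subprocess.run(
            ["python3", BACKUP_VM, "-c", "engine", "incremental",
             "--from-checkpoint-uuid", checkpoint_id,
             "--backup-dir", backup_dir, vm_id],
            check=True)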

[   0.0 ] Starting full backup for VM 12e0c686-ab8d-4c41-b072-744205f2ef3e
[   1.4 ] Waiting until backup 2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32 is ready
[  17.6 ] Creating image transfer for disk b38a4912-c8a9-45dd-b42d-9b07c8b9dd24
[  18.8 ] Image transfer 7a852a9b-dd5b-4202-91ed-73ed889fb5bf is ready
[  10.00% ] 1.00 GiB, 73.15 seconds, 14.00 MiB/s    

Backup started and we downloaded 10% of the data...
                           
[  91.9 ] Finalizing image transfer
[ 152.6 ] Finalizing backup


But the download timed out, so we aborted the transfer.

Traceback (most recent call last):
...
    res = self._con.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1361, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 311, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 272, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib64/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out


This means that we had a timeout reading from the server: the daemon is blocked
for a long time waiting for a response from the qemu NBD server.

In the daemon log we see:

1. Ticket added for full backup

2021-05-03 12:23:04,045 INFO    (Thread-126) [tickets] [local] ADD ticket={'dirty': False, 'ops': ['read'], 'size': 10737418240, 'sparse': False, 'transfer_id': '7a852a9b-dd5b-4202-91ed-73ed889fb5bf', 'uuid': '08fa6023-80d2-4158-9247-bfbfed1a5584', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32:exportname=vda'}

2. OPTIONS request

2021-05-03 12:23:05,114 INFO    (Thread-128) [images] [::ffff:10.46.16.27] OPTIONS ticket=08fa6023-80d2-4158-9247-bfbfed1a5584

3. Opening nbd connection to qemu nbd server

2021-05-03 12:23:05,119 INFO    (Thread-129) [backends.nbd] Open backend address='/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32' export_name='vda' sparse=False max_connections=8

4. EXTENTS request

2021-05-03 12:23:05,119 INFO    (Thread-129) [extents] [local] EXTENTS ticket=08fa6023-80d2-4158-9247-bfbfed1a5584 context=zero

5. Opening more connections to qemu nbd server

2021-05-03 12:23:05,543 INFO    (Thread-130) [backends.nbd] Open backend address='/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32' export_name='vda' sparse=False max_connections=8
2021-05-03 12:23:05,549 INFO    (Thread-132) [backends.nbd] Open backend address='/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32' export_name='vda' sparse=False max_connections=8
2021-05-03 12:23:05,553 INFO    (Thread-133) [backends.nbd] Open backend address='/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32' export_name='vda' sparse=False max_connections=8
2021-05-03 12:23:05,556 INFO    (Thread-131) [backends.nbd] Open backend address='/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32' export_name='vda' sparse=False max_connections=8

6. EXTENTS request timed out and connection was closed

2021-05-03 12:24:05,155 WARNING (Thread-129) [http] Timeout reading or writing to socket: timed out
2021-05-03 12:24:05,156 INFO    (Thread-129) [backends.nbd] Close backend address='/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32'
2021-05-03 12:24:05,156 INFO    (Thread-129) [http] CLOSE connection=129 client=local [connection 1 ops, 60.037904 s] [dispatch 1 ops, 0.002877 s] [extents 1 ops, 0.001768 s]

7. REMOVE ticket

2021-05-03 12:24:18,739 INFO    (Thread-143) [tickets] [local] REMOVE ticket=08fa6023-80d2-4158-9247-bfbfed1a5584
2021-05-03 12:24:18,739 INFO    (Thread-143) [auth] Waiting until ticket 08fa6023-80d2-4158-9247-bfbfed1a5584 is unused
2021-05-03 12:25:18,740 ERROR   (Thread-143) [http] Client error: Timeout cancelling ticket 08fa6023-80d2-4158-9247-bfbfed1a5584

The operation timed out since the ticket has active connections, likely blocked on the qemu NBD server.

We don't have any logs from the other connections blocked on the qemu NBD server (Thread-130, Thread-132, Thread-133, Thread-131).

Based on this log I think we have a deadlock in qemu, and the qemu NBD server is not responsive.


Looking in vdsm log:

1. Adding ticket

2021-05-03 12:23:04,044+0300 INFO  (jsonrpc/7) [vdsm.api] START add_image_ticket(ticket={'dirty': False, 'ops': ['read'], 'size': 10737418240, 'sparse': False, 'transfer_id': '7a852a9b-dd5b-4202-91ed-73ed889fb5bf', 'uuid': '08fa6023-80d2-4158-9247-bfbfed1a5584', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32:exportname=vda'}) from=::ffff:10.46.16.248,41948, flow_id=d125f9fe-d70d-4136-b085-2bbd1485190e, task_id=5a623d85-9cd8-4bf7-9611-3a6b0b252f9a (api:48)

2. Monitoring ticket progress

2021-05-03 12:23:04,252+0300 INFO  (jsonrpc/1) [vdsm.api] FINISH get_image_ticket return={'result': {'active': False, 'canceled': False, 'connections': 0, 'expires': 4519002, 'idle_time': 0, 'ops': ['read'], 'size': 10737418240, 'sparse': False, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32:exportname=vda', 'uuid': '08fa6023-80d2-4158-9247-bfbfed1a5584', 'transfer_id': '7a852a9b-dd5b-4202-91ed-73ed889fb5bf', 'transferred': 0}} from=::ffff:10.46.16.248,41948, flow_id=d125f9fe-d70d-4136-b085-2bbd1485190e, task_id=ef1f2e59-2474-4656-a257-b00ef6d653e0 (api:54)

2021-05-03 12:23:06,283+0300 INFO  (jsonrpc/0) [vdsm.api] FINISH get_image_ticket return={'result': {'active': True, 'canceled': False, 'connections': 5, 'expires': 4519003, 'idle_time': 0, 'ops': ['read'], 'size': 10737418240, 'sparse': False, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32:exportname=vda', 'uuid': '08fa6023-80d2-4158-9247-bfbfed1a5584', 'transfer_id': '7a852a9b-dd5b-4202-91ed-73ed889fb5bf', 'transferred': 41943040}} from=::ffff:10.46.16.248,41948, flow_id=d125f9fe-d70d-4136-b085-2bbd1485190e, task_id=9bda64d8-945b-4f30-bb53-71d7ca2505fb (api:54)

2021-05-03 12:23:06,283+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='9bda64d8-945b-4f30-bb53-71d7ca2505fb') finished: {'result': {'active': True, 'canceled': False, 'connections': 5, 'expires': 4519003, 'idle_time': 0, 'ops': ['read'], 'size': 10737418240, 'sparse': False, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32:exportname=vda', 'uuid': '08fa6023-80d2-4158-9247-bfbfed1a5584', 'transfer_id': '7a852a9b-dd5b-4202-91ed-73ed889fb5bf', 'transferred': 41943040}} (task:1210)

2021-05-03 12:23:18,425+0300 INFO  (jsonrpc/4) [vdsm.api] FINISH get_image_ticket return={'result': {'active': True, 'canceled': False, 'connections': 5, 'expires': 4519016, 'idle_time': 0, 'ops': ['read'], 'size': 10737418240, 'sparse': False, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32:exportname=vda', 'uuid': '08fa6023-80d2-4158-9247-bfbfed1a5584', 'transfer_id': '7a852a9b-dd5b-4202-91ed-73ed889fb5bf', 'transferred': 1124073472}} from=::ffff:10.46.16.248,41948, flow_id=d125f9fe-d70d-4136-b085-2bbd1485190e, task_id=46dcf1b2-840e-44f2-8c10-feb6db0e6da6 (api:54)

2021-05-03 12:23:28,470+0300 INFO  (jsonrpc/2) [vdsm.api] FINISH get_image_ticket return={'result': {'active': True, 'canceled': False, 'connections': 5, 'expires': 4519016, 'idle_time': 0, 'ops': ['read'], 'size': 10737418240, 'sparse': False, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32:exportname=vda', 'uuid': '08fa6023-80d2-4158-9247-bfbfed1a5584', 'transfer_id': '7a852a9b-dd5b-4202-91ed-73ed889fb5bf', 'transferred': 1124073472}} from=::ffff:10.46.16.248,41948, flow_id=d125f9fe-d70d-4136-b085-2bbd1485190e, task_id=72fbb0ba-1261-415a-b5b6-9dcc063bdb8f (api:54)

2021-05-03 12:24:08,689+0300 INFO  (jsonrpc/4) [vdsm.api] FINISH get_image_ticket return={'result': {'active': True, 'canceled': False, 'connections': 4, 'expires': 4519066, 'idle_time': 0, 'ops': ['read'], 'size': 10737418240, 'sparse': False, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/backup/2b332ecf-08a1-48c8-8c58-1cfc1f8e9f32:exportname=vda', 'uuid': '08fa6023-80d2-4158-9247-bfbfed1a5584', 'transfer_id': '7a852a9b-dd5b-4202-91ed-73ed889fb5bf', 'transferred': 1124073472}} from=::ffff:10.46.16.248,41948, flow_id=d125f9fe-d70d-4136-b085-2bbd1485190e, task_id=39fcce04-b410-45fc-8869-ca855c64c37c (api:54)

We transferred 1124073472 bytes so far...

3. Ticket removed

2021-05-03 12:24:18,737+0300 INFO  (jsonrpc/7) [vdsm.api] START remove_image_ticket(uuid='08fa6023-80d2-4158-9247-bfbfed1a5584') from=::ffff:10.46.16.248,41948, flow_id=d125f9fe-d70d-4136-b085-2bbd1485190e, task_id=082a33df-b2c2-4299-92e5-7fc2d76fdc4b (api:48)

2021-05-03 12:25:18,742+0300 INFO  (jsonrpc/7) [vdsm.api] FINISH remove_image_ticket error=Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 08fa6023-80d2-4158-9247-bfbfed1a5584\n' from=::ffff:10.46.16.248,41948, flow_id=d125f9fe-d70d-4136-b085-2bbd1485190e, task_id=082a33df-b2c2-4299-92e5-7fc2d76fdc4b (api:52)
2021-05-03 12:25:18,742+0300 ERROR (jsonrpc/7) [storage.TaskManager.Task] (Task='082a33df-b2c2-4299-92e5-7fc2d76fdc4b') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-151>", line 2, in remove_image_ticket
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3093, in remove_image_ticket
    imagetickets.remove_ticket(uuid)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py", line 66, in wrapper
    return func(*args, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py", line 95, in remove_ticket
    _request("DELETE", uuid)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/imagetickets.py", line 119, in _request
    raise se.ImageDaemonError(res.status, res.reason, error)
vdsm.storage.exception.ImageDaemonError: Image daemon request failed: 'status=409, reason=Conflict, error=Timeout cancelling ticket 08fa6023-80d2-4158-9247-bfbfed1a5584\n'

The failure is expected when the ticket has active connections and cannot be removed.
Engine handles this case by retrying.

We had several retries and all of them failed. This means the daemon has active connections
but they are blocked in requests to the qemu NBD server.

In the qemu log we don't have any info:

2021-05-03 08:59:31.213+0000: Domain id=3 is tainted: custom-hypervisor-feature
2021-05-03 08:59:31.222+0000: 1817200: info : virObjectUnref:381 : OBJECT_UNREF: obj=0x7f5ddc15a120
2021-05-03 09:00:03.308+0000: Domain id=3 is tainted: custom-ga-command

But we know that qemu did not crash, so this did not reproduce the original issue; it is
a new issue in qemu.

We need to file a new qemu bug for this, but we need to extract a core dump of qemu in this
state.

Summary:
1. Test incremental backup instead of full backup
2. When the operation fails, get a core dump of qemu

   gcore {qemu-kvm-pid}

Compress the core dump and put it somewhere qemu folks can access.
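A small helper for that step (illustrative; it assumes libvirt's usual "-name guest=<vm>" on the qemu command line and that gdb's gcore is installed):

    import subprocess

    def dump_qemu_core(vm_name, out_prefix="/var/tmp/qemu-core"):
        # Find the qemu-kvm process serving this VM and dump its core;
        # gcore writes the dump to <out_prefix>.<pid>.
        pid = subprocess.run(
            ["pgrep", "-f", f"guest={vm_name}"],
            capture_output=True, text=True, check=True).stdout.split()[0]
        subprocess.run(["gcore", "-o", out_prefix, pid], check=True)
        return f"{out_prefix}.{pid}"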

Comment 45 Ilan Zuckerman 2021-05-12 06:15:14 UTC
Reproduced the issue with VM_1_raw_disk_virtio
The VM became 'not responding' after the SECOND reboot.

Adding the following:
VM dump xml: VM_1_raw_disk_virtio_dump.xml
Core dump of qemu-kvm pid (gcore {qemu-kvm-pid}): core.380155
Engine log + vdsm log + daemon.log from both engine and vdsm


[root@storage-ge13-vdsm1 examples]# python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py -c engine full 2fe28915-fe33-4f92-9612-7684bbb0ea98 --backup-dir /rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com\:_nas01_ge__storage13__nfs__1/backup/raw/
[   0.0 ] Starting full backup for VM 2fe28915-fe33-4f92-9612-7684bbb0ea98
[   6.7 ] Waiting until backup 2eb2ea4d-bf57-4eec-a6fe-42a04e02b4fc is ready
[  28.0 ] Creating image transfer for disk 82b0903c-4076-4e80-9fd4-d7b80783913f
[  30.1 ] Image transfer 18b8cc2e-3c66-4663-8d6f-27d4eef2ac4f is ready
[ 100.00% ] 6.00 GiB, 70.75 seconds, 86.85 MiB/s                               
[ 100.9 ] Finalizing image transfer
[ 104.0 ] Creating image transfer for disk 82b0903c-4076-4e80-9fd4-d7b80783913f
[ 105.1 ] Image transfer a80f0c3f-95ad-4ed1-8a60-9394ce5a770e is ready
[ 100.00% ] 6.00 GiB, 73.14 seconds, 84.00 MiB/s                               
[ 178.2 ] Finalizing image transfer
[ 181.4 ] Creating image transfer for disk 82b0903c-4076-4e80-9fd4-d7b80783913f
[ 183.1 ] Image transfer a43573b0-802b-47cb-9b75-523085ac9716 is ready
[ 100.00% ] 6.00 GiB, 71.93 seconds, 85.42 MiB/s                               
[ 255.0 ] Finalizing image transfer
[ 257.6 ] Creating image transfer for disk 82b0903c-4076-4e80-9fd4-d7b80783913f
[ 259.7 ] Image transfer ee69b649-6747-4a93-816c-275aed3ed301 is ready
[ 100.00% ] 6.00 GiB, 73.66 seconds, 83.41 MiB/s                               
[ 333.4 ] Finalizing image transfer
[ 343.6 ] Creating image transfer for disk 82b0903c-4076-4e80-9fd4-d7b80783913f
[ 344.7 ] Image transfer 9c46a2be-d5cd-4200-b120-747d47c2061a is ready
[   8.33% ] 512.00 MiB, 70.47 seconds, 7.27 MiB/s                              
[ 415.2 ] Finalizing image transfer
[ 476.0 ] Finalizing backup
Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 420, in download_disk
    **extra_args)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/client/_api.py", line 186, in download
    name="download")
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 69, in copy
    log.debug("Executor failed")
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 189, in __exit__
    self.stop()
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 166, in stop
    raise self._errors[0]
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 238, in _run
    handler.copy(req)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/io.py", line 282, in copy
    self._src.write_to(self._dst, req.length, self._buf)
  File "/usr/lib64/python3.6/site-packages/ovirt_imageio/_internal/backends/http.py", line 211, in write_to
    n = res.readinto(view[:step])
  File "/usr/lib64/python3.6/http/client.py", line 507, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib64/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 177, in cmd_full
    download_backup(connection, backup.id, args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 376, in download_backup
    download_disk(connection, backup_uuid, disk, disk_path, args, incremental=incremental)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 423, in download_disk
    imagetransfer.finalize_transfer(connection, transfer, disk)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 332, in finalize_transfer
    .format(transfer.id))
RuntimeError: Timed out waiting for transfer 9c46a2be-d5cd-4200-b120-747d47c2061a to finalize

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 457, in <module>
    main()
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 161, in main
    args.command(args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 180, in cmd_full
    stop_backup(connection, backup.id, args)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/backup_vm.py", line 335, in stop_backup
    backup_service.finalize()
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 33793, in finalize
    return self._internal_action(action, 'finalize', None, headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 299, in _internal_action
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 296, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[]". HTTP response code is 400.

Comment 46 Ilan Zuckerman 2021-05-12 06:19:23 UTC
Created attachment 1782240 [details]
Failed to verify 2 with VM_1_raw_disk_virtio

Comment 47 Nir Soffer 2021-05-12 13:43:58 UTC
Looking at attachment 1782240 [details], the non-responsive qemu has this backtrace:

[New LWP 380155]
[New LWP 380160]
[New LWP 380161]
[New LWP 380165]
[New LWP 380167]
[New LWP 380171]
[New LWP 380172]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/qemu-kvm'.
#0  0x00007f9daa00c65d in __lll_lock_wait () from /lib64/libpthread.so.0
[Current thread is 1 (Thread 0x7f9dad9def00 (LWP 380155))]

Thread 7 (Thread 0x7f9d43dff700 (LWP 380172)):
#0  0x00007f9daa0092fc in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00005641db21e04d in qemu_cond_wait_impl (cond=<optimized out>, mutex=0x5641de00be88, file=0x5641db2540d3 "../ui/vnc-jobs.c", line=215) at ../util/qemu-thread-posix.c:174
#2  0x00005641daec1d71 in vnc_worker_thread_loop (queue=queue@entry=0x5641de00be50) at ../ui/vnc-jobs.c:215
#3  0x00005641daec26a0 in vnc_worker_thread (arg=0x5641de00be50) at ../ui/vnc-jobs.c:325
#4  0x00005641db21dc94 in qemu_thread_start (args=0x5641de00bee0) at ../util/qemu-thread-posix.c:521
#5  0x00007f9daa00314a in start_thread () at /lib64/libpthread.so.0
#6  0x00007f9da9d32dc3 in clone () at /lib64/libc.so.6

Thread 6 (Thread 0x7f9d54fff700 (LWP 380171)):
#0  0x00007f9da9d27a41 in poll () at /lib64/libc.so.6
#1  0x00007f9dac839ab6 in g_main_context_iterate.isra () at /lib64/libglib-2.0.so.0
#2  0x00007f9dac839e72 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#3  0x00007f9d9f6c58db in red_worker_main () at /lib64/libspice-server.so.1
#4  0x00007f9daa00314a in start_thread () at /lib64/libpthread.so.0
#5  0x00007f9da9d32dc3 in clone () at /lib64/libc.so.6

Thread 5 (Thread 0x7f9d575fe700 (LWP 380167)):
#0  0x00007f9da9d27b36 in ppoll () at /lib64/libc.so.6
#1  0x00005641db21efa9 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x00005641db21efa9 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at ../util/qemu-timer.c:337
#3  0x00005641db21e909 in fdmon_poll_wait (ctx=0x5641dc16ba60, ready_list=0x7f9d575fd210, timeout=-1) at ../util/fdmon-poll.c:80
#4  0x00005641db21b511 in aio_poll (ctx=0x5641dc16ba60, blocking=blocking@entry=true) at ../util/aio-posix.c:607
#5  0x00005641db214d3a in aio_wait_bh_oneshot (ctx=0x5641dc1768e0, cb=cb@entry=0x5641db135170 <nbd_aio_detach_bh>, opaque=opaque@entry=0x5641de42b400) at ../util/aio-wait.c:71
#6  0x00005641db13414c in blk_aio_detach (opaque=0x5641de42b400) at ../nbd/server.c:1573
#7  0x00005641db18945b in bdrv_detach_aio_context (bs=0x5641dc1f0990) at ../block.c:6394
#8  0x00005641db18945b in bdrv_set_aio_context_ignore (bs=0x5641dc1f0990, new_context=new_context@entry=0x5641dc16ba60, ignore=ignore@entry=0x7f9d575fd350) at ../block.c:6502
#9  0x00005641db18938b in bdrv_set_aio_context_ignore (bs=bs@entry=0x5641de01a680, new_context=new_context@entry=0x5641dc16ba60, ignore=ignore@entry=0x7f9d575fd350) at ../block.c:6489
#10 0x00005641db189973 in bdrv_child_try_set_aio_context (bs=bs@entry=0x5641de01a680, ctx=0x5641dc16ba60, ignore_child=<optimized out>, errp=<optimized out>) at ../block.c:6606
#11 0x00005641db141f3b in blk_do_set_aio_context (blk=0x5641ddbe08f0, new_context=0x5641dc16ba60, update_root_node=update_root_node@entry=true, errp=errp@entry=0x0) at ../block/block-backend.c:2026
#12 0x00005641db144641 in blk_set_aio_context (blk=<optimized out>, new_context=<optimized out>, errp=errp@entry=0x0) at ../block/block-backend.c:2047
#13 0x00005641db087f34 in virtio_blk_data_plane_stop (vdev=<optimized out>) at ../hw/block/dataplane/virtio-blk.c:304
#14 0x00005641daf5b320 in virtio_bus_stop_ioeventfd (bus=bus@entry=0x5641ddbdcb48) at ../hw/virtio/virtio-bus.c:245
#15 0x00005641daf5ba6f in virtio_bus_stop_ioeventfd (bus=bus@entry=0x5641ddbdcb48) at ../hw/virtio/virtio-bus.c:237
#16 0x00005641daf40958 in virtio_pci_stop_ioeventfd (proxy=0x5641ddbd4a40) at ../hw/virtio/virtio-pci.c:1255
#17 0x00005641daf40958 in virtio_pci_common_write (opaque=0x5641ddbd4a40, addr=<optimized out>, val=<optimized out>, size=<optimized out>) at ../hw/virtio/virtio-pci.c:1255
#18 0x00005641db0cf508 in memory_region_write_accessor (mr=<optimized out>, addr=<optimized out>, value=<optimized out>, size=<optimized out>, shift=<optimized out>, mask=<optimized out>, attrs=...) at ../softmmu/memory.c:491
#19 0x00005641db0ce0fe in access_with_adjusted_size (addr=addr@entry=20, value=value@entry=0x7f9d575fd518, size=size@entry=1, access_size_min=<optimized out>, access_size_max=<optimized out>, access_fn=0x5641db0cf4b0 <memory_region_write_accessor>, mr=0x5641ddbd5420, attrs=...) at ../softmmu/memory.c:552
#20 0x00005641db0d14a3 in memory_region_dispatch_write (mr=mr@entry=0x5641ddbd5420, addr=20, data=<optimized out>, op=<optimized out>, attrs=attrs@entry=...) at ../softmmu/memory.c:1501
#21 0x00005641db07c957 in flatview_write_continue (fv=fv@entry=0x7f9d482a3ef0, addr=addr@entry=4263510036, attrs=..., ptr=ptr@entry=0x7f9dad9fd028, len=len@entry=1, addr1=<optimized out>, l=<optimized out>, mr=0x5641ddbd5420) at /usr/src/debug/qemu-kvm-5.2.0-16.module+el8.4.0+10806+b7d97207.x86_64/include/qemu/host-utils.h:164
#22 0x00005641db07cb76 in flatview_write (fv=0x7f9d482a3ef0, addr=4263510036, attrs=..., buf=0x7f9dad9fd028, len=1) at ../softmmu/physmem.c:2799
#23 0x00005641db08070f in address_space_write (as=<optimized out>, addr=<optimized out>, attrs=..., buf=<optimized out>, len=<optimized out>) at ../softmmu/physmem.c:2891
#24 0x00005641db0a41da in kvm_cpu_exec (cpu=cpu@entry=0x5641dc1f54c0) at ../accel/kvm/kvm-all.c:2553
#25 0x00005641db0dc8e5 in kvm_vcpu_thread_fn (arg=0x5641dc1f54c0) at ../accel/kvm/kvm-cpus.c:49
#26 0x00005641db21dc94 in qemu_thread_start (args=0x5641dc21cdd0) at ../util/qemu-thread-posix.c:521
#27 0x00007f9daa00314a in start_thread () at /lib64/libpthread.so.0
#28 0x00007f9da9d32dc3 in clone () at /lib64/libc.so.6

Thread 4 (Thread 0x7f9d57dff700 (LWP 380165)):
#0  0x00007f9da9d27a41 in poll () at /lib64/libc.so.6
#1  0x00007f9dac839ab6 in g_main_context_iterate.isra () at /lib64/libglib-2.0.so.0
#2  0x00007f9dac839e72 in g_main_loop_run () at /lib64/libglib-2.0.so.0
#3  0x00005641db12aa69 in iothread_run (opaque=0x5641dbff9b20) at ../iothread.c:80
#4  0x00005641db21dc94 in qemu_thread_start (args=0x5641dc1eac40) at ../util/qemu-thread-posix.c:521
#5  0x00007f9daa00314a in start_thread () at /lib64/libpthread.so.0
#6  0x00007f9da9d32dc3 in clone () at /lib64/libc.so.6

Thread 3 (Thread 0x7f9d9dbf0700 (LWP 380161)):
#0  0x00007f9da9d27b36 in ppoll () at /lib64/libc.so.6
#1  0x00005641db21ef65 in ppoll (__ss=0x0, __timeout=0x7f9d9dbef520, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  0x00005641db21ef65 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=599999036818) at ../util/qemu-timer.c:349
#3  0x00005641db21e909 in fdmon_poll_wait (ctx=0x5641dc1768e0, ready_list=0x7f9d9dbef5a0, timeout=599999036818) at ../util/fdmon-poll.c:80
#4  0x00005641db21b511 in aio_poll (ctx=ctx@entry=0x5641dc1768e0, blocking=blocking@entry=true) at ../util/aio-posix.c:607
#5  0x00005641db13528d in nbd_aio_detach_bh (opaque=0x5641de42b400) at ../nbd/server.c:1562
#6  0x00005641db214c6f in aio_wait_bh (opaque=0x7f9d575fd260) at ../util/aio-wait.c:55
#7  0x00005641db212c5d in aio_bh_call (bh=0x7f9d98003620) at ../util/async.c:164
#8  0x00005641db212c5d in aio_bh_poll (ctx=ctx@entry=0x5641dc1768e0) at ../util/async.c:164
#9  0x00005641db21b122 in aio_poll (ctx=0x5641dc1768e0, blocking=blocking@entry=true) at ../util/aio-posix.c:659
#10 0x00005641db12aa3a in iothread_run (opaque=0x5641dc112f20) at ../iothread.c:73
#11 0x00005641db21dc94 in qemu_thread_start (args=0x5641dc176df0) at ../util/qemu-thread-posix.c:521
#12 0x00007f9daa00314a in start_thread () at /lib64/libpthread.so.0
#13 0x00007f9da9d32dc3 in clone () at /lib64/libc.so.6

Thread 2 (Thread 0x7f9da52e4700 (LWP 380160)):
#0  0x00007f9da9d2d52d in syscall () at /lib64/libc.so.6
#1  0x00005641db21e55f in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /usr/src/debug/qemu-kvm-5.2.0-16.module+el8.4.0+10806+b7d97207.x86_64/include/qemu/futex.h:29
#2  0x00005641db21e55f in qemu_event_wait (ev=ev@entry=0x5641dbaa4a40 <rcu_gp_event>) at ../util/qemu-thread-posix.c:460
#3  0x00005641db23db9f in wait_for_readers () at ../util/rcu.c:135
#4  0x00005641db23db9f in synchronize_rcu () at ../util/rcu.c:171
#5  0x00005641db23df45 in call_rcu_thread (opaque=<optimized out>) at ../util/rcu.c:265
#6  0x00005641db21dc94 in qemu_thread_start (args=0x5641dbf863a0) at ../util/qemu-thread-posix.c:521
#7  0x00007f9daa00314a in start_thread () at /lib64/libpthread.so.0
#8  0x00007f9da9d32dc3 in clone () at /lib64/libc.so.6

Thread 1 (Thread 0x7f9dad9def00 (LWP 380155)):
#0  0x00007f9daa00c65d in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007f9daa005979 in pthread_mutex_lock () at /lib64/libpthread.so.0
#2  0x00005641db21dd7d in qemu_mutex_lock_impl (mutex=0x5641dbaa0740 <qemu_global_mutex>, file=0x5641db57ecc1 "../util/main-loop.c", line=242) at ../util/qemu-thread-posix.c:79
#3  0x00005641db08670e in qemu_mutex_lock_iothread_impl (file=file@entry=0x5641db57ecc1 "../util/main-loop.c", line=line@entry=242) at ../softmmu/cpus.c:485
#4  0x00005641db236a1d in os_host_main_loop_wait (timeout=<optimized out>) at ../util/main-loop.c:242
#5  0x00005641db236a1d in main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:520
#6  0x00005641db099eb1 in qemu_main_loop () at ../softmmu/vl.c:1679
#7  0x00005641dae77a02 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../softmmu/main.c:50

Comment 48 Stefan Hajnoczi 2021-05-12 13:58:29 UTC
Thread 3 vs Thread 5 looks suspicious:

Thread 4 (vcpu thread):
#5  0x00005641db214d3a in aio_wait_bh_oneshot (ctx=0x5641dc1768e0,

Thread 3 (IOThread)
#4  0x00005641db21b511 in aio_poll (ctx=ctx@entry=0x5641dc1768e0,

The vcpu thread is waiting for the IOThread to do something and then notify the vcpu thread (QEMU main loop).

I haven't tried following the code myself, but this looks like a clue regarding the deadlock.

Comment 49 Stefan Hajnoczi 2021-05-12 13:59:56 UTC
s/Thread 4 (vcpu thread)/Thread 5 (vcpu thread)/

Comment 50 Nir Soffer 2021-05-12 15:19:22 UTC
We need more info about the non-responsive VM to tell how severe this bug is:

- How reproducible is this issue? How many times did you test? How many times did
  the VM become non-responsive?

- When the VM becomes non-responsive, can we shut down the VM?

- If shutting down the VM fails, can we terminate it cleanly?
  (Is kill {qemu-kvm-pid} enough, or is kill -9 required?)

Please file a RHEL AV qemu bug with the info collected here and the answers to
these questions.

We can make this bug depend on the qemu bug. We will have to require the qemu
version fixing this issue.

Comment 51 Ilan Zuckerman 2021-05-13 06:47:32 UTC
(In reply to Nir Soffer from comment #50)
> - How reproducible is this issue? how many time you tested? how many times
> the vm became non-responsive?

I repeated the same flow (full backup and reboot with a virtio raw disk VM, as described in comment #25) four times.
Out of 4 reboots, the issue was seen in all 4 of them.
I also tried the same flow with a raw_disk_virtio-scsi VM, but in that configuration it all worked as expected.

> - When the vm becomes non-responsive, can we shutdown the vm?

No. It cannot be shut down, due to "Cannot stop VM. The VM is during a backup operation."

> - If shutting down the vms fail, can we terminate the vm cleanly?
>   (kill {qemu-kvm-pid} or kill -9 required?)

kill {pid}    doesn't have any effect.
kill -9 {pid} terminates the process.

> Please file rhel av qemu bug with the info collected here and the answers to
> these questions.

Bug opened:
https://bugzilla.redhat.com/show_bug.cgi?id=1960137

Comment 52 Nir Soffer 2021-05-13 10:01:54 UTC
We need to wait for a fix for bug 1960137.

Comment 53 John Ferlan 2021-06-18 21:27:52 UTC
bug 1960137 for RHEL-AV 8.5.0 has been cloned to bug 1973829 for RHEL-AV 8.4.z

Comment 55 Amit Sharir 2021-07-19 08:31:25 UTC
I just want to validate that I understand the correct verification for this complex bug.
Should I verify it using only the steps mentioned in comment #25?
Or are there additional aspects I need to verify in a different flow?

Comment 57 Nir Soffer 2021-07-25 11:44:17 UTC
(In reply to Amit Sharir from comment #55)
Yes, comment 25 seems to describe the way this should be verified.

Comment 58 Avihai 2021-08-08 11:05:02 UTC
(In reply to Nir Soffer from comment #57)
> (In reply to Amit Sharir from comment #55)
> Yes, comment 25 seems to describe the way this should be verified.

All dependent bugs are verified.

As of rhv-4.4.8-3 we have:
qemu-kvm-5.2.0-16.module+el8.4.0+11923+e8b883e4.4.x86_64.

Looking at the RHEL 8.4 dependent bug 1973829, fixed in version:
qemu-kvm-5.2.0-16.module+el8.4.0+11721+c8bbc1be.3

So rhv-4.4.8-3 looks like it's already consuming the fixed qemu-kvm build.

However, one of the patches [1] related to this bug is still in NEW; can you please check it out?


[1] https://gerrit.ovirt.org/#/c/vdsm/+/115865/

Comment 59 Eyal Shenitzky 2021-08-09 06:34:49 UTC
(In reply to Avihai from comment #58)
> (In reply to Nir Soffer from comment #57)
> > (In reply to Amit Sharir from comment #55)
> > Yes, comment 25 seems to describe the way this should be verified.
> 
> All dependent bugs are verified.
> 
> As of rhv-4.4.8-3 we have:
> qemu-kvm-5.2.0-16.module+el8.4.0+11923+e8b883e4.4.x86_64.
> 
> Looking at the RHEL 8.4 dependent bug 1973829, fixed in version:
> qemu-kvm-5.2.0-16.module+el8.4.0+11721+c8bbc1be.3
> 
> So rhv-4.4.8-3 looks like it's already consuming the fixed qemu-kvm build.
> 
> However, one of the patches [1] related to this bug is still in NEW; can you
> please check it out?
> 
> 
> [1] https://gerrit.ovirt.org/#/c/vdsm/+/115865/

You already have the right qemu-kvm version, so you can test this bug without the attached patch.

But to make sure that all the hosts will contain this fix, the patch must be merged.

Comment 60 Avihai 2021-08-09 11:25:58 UTC
(In reply to Eyal Shenitzky from comment #59)
> (In reply to Avihai from comment #58)
> > (In reply to Nir Soffer from comment #57)
> > > (In reply to Amit Sharir from comment #55)
> > > Yes, comment 25 seems to describe the way this should be verified.
> > 
> > All dependent bugs are verified.
> > 
> > As of rhv-4.4.8-3 we have:
> > qemu-kvm-5.2.0-16.module+el8.4.0+11923+e8b883e4.4.x86_64.
> > 
> > Looking at the RHEL 8.4 dependent bug 1973829, fixed in version:
> > qemu-kvm-5.2.0-16.module+el8.4.0+11721+c8bbc1be.3
> > 
> > So rhv-4.4.8-3 looks like it's already consuming the fixed qemu-kvm build.
> > 
> > However, one of the patches [1] related to this bug is still in NEW; can you
> > please check it out?
> > 
> > 
> > [1] https://gerrit.ovirt.org/#/c/vdsm/+/115865/
> 
> You already have the right qemu-kvm version so you can test this bug without
> the attached patch.
> 
> But in order to make sure that all the hosts will contain this fix the patch
> must be merged.

Got it.
Waiting for the patch to get merged and the bug to move to ON_QA as usual.

Comment 62 Amit Sharir 2021-08-19 12:42:30 UTC
Version: 
vdsm-4.40.80.5-1.el8ev.x86_64
ovirt-engine-4.4.8.4-0.7.el8ev.noarch

Verification steps:
I used the steps Nir mentioned in comment #25 with the modified backup script in a full backup scenario.
I ran the test 6 times with the following disk conditions on the VMs:

1. raw disk, virtio-scsi
2. raw disk, virtio
3. qcow2 disk, virtio-scsi
4. qcow2 disk, virtio
5. qcow2 disk, virtio-scsi, enable incremental backup
6. qcow2 disk, virtio, enable incremental backup

After each run I checked that all the downloaded backups are the same using:
for i in $(seq 4); do echo "compare backup 00 to 0$i"; qemu-img compare 5eb97d51-9da4-420c-8c88-6a3691bc4c67.00.qcow2 5eb97d51-9da4-420c-8c88-6a3691bc4c67.0$i.qcow2; done
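The same check expressed as a small Python helper (illustrative), relying on qemu-img compare exiting non-zero when images differ:

    import glob
    import subprocess

    def assert_backups_identical(backup_dir, disk_id):
        backups = sorted(glob.glob(f"{backup_dir}/{disk_id}.*.qcow2"))
        for other in backups[1:]:
            # check=True raises CalledProcessError on any mismatch.
            subprocess.run(
                ["qemu-img", "compare", backups[0], other], check=True)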


Verification conclusions:
The expected output matched the actual output.
The whole flow was completed with no errors; I was able to reboot inside the guest OS of the VM (via the console connection) without affecting the backup procedure.
In addition, all the downloaded backups were identical.

Bug verified.

