Bug 1554028 - "No space left on device" error when copying a disk based on template to a block domain in DC <= 4.0 when the disk was extended
Summary: "No space left on device" error when copying a disk based on template to a bl...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.2.2
Target Release: ---
Assignee: Benny Zlotnik
QA Contact: Kevin Alon Goldblatt
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-03-10 20:30 UTC by Elad
Modified: 2021-04-25 15:02 UTC
6 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-04-05 09:38:28 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: exception+


Attachments
logs (1022.49 KB, application/x-gzip)
2018-03-10 20:30 UTC, Elad
no flags
rhel-7.4 (1.44 MB, application/x-gzip)
2018-03-11 11:17 UTC, Elad
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1523614 0 unspecified CLOSED Copy image to a block storage destination does not work after disk extension in a snapshot in DC pre-4.0 2021-04-25 15:02:17 UTC
oVirt gerrit 88920 0 'None' MERGED core: block moving of an extended template disk in compat 0.10 2021-01-24 22:14:09 UTC
oVirt gerrit 89032 0 'None' MERGED core: block moving of an extended template disk in compat 0.10 2021-01-24 22:14:09 UTC
oVirt gerrit 112827 0 master MERGED core: remove childDiskWasExtended validation 2021-04-27 12:04:21 UTC

Internal Links: 1523614

Description Elad 2018-03-10 20:30:18 UTC
Created attachment 1406703 [details]
logs

Description of problem:
With a qcow2 image on RHEL 7.5, live storage migration of a VM disk based on a template fails with CopyImageError while another running VM based on the same template exists.

This seems to be caused by the new qemu image locking that also impacts snapshot merge operations, as described in BZ 1552059.

How reproducible:
Always

Steps to Reproduce:
- Created a 4.0 DC and cluster with a RHEL 7.5 host
- Created an NFS domain (created as v3)
- Created a template
- Created 2 VMs from the template as thin copies (the VM images are based on the template), created as qcow2
- Created a second storage domain (iSCSI)
- Copied the template disk to the second storage domain
- Started both VMs
- Tried to move one of the VMs' disks to the second domain


This scenario was also tested on a 4.2 DC (v4 domain), with RHEL 7.5 and the same qemu, vdsm and libvirt packages; there, LSM works fine.



Actual results:

Disk move fails:


2018-03-10 21:33:11,567+0200 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='a4b1c616-164e-43b9-af6c-77e8f44d5809') moving from state init -> state preparing (task:602)
2018-03-10 21:33:11,569+0200 INFO  (jsonrpc/0) [vdsm.api] START syncImageData(spUUID='79d2a575-75aa-40e2-8caf-1f768de486e3', sdUUID='34c01407-3634-41e7-96be-bd5cff15e9b9', imgUUID='5f618e7c-e724-475d-b8d5-c60439d04b68', dstSdUUID='8820dfa3-0dff-4874-b431-c70266d10cdb', syncType='INTERNAL') from=::ffff:10.35.161.118,49528, flow_id=13000ef2-ea5e-4bb6-a36e-820a193bbd89, task_id=a4b1c616-164e-43b9-af6c-77e8f44d5809 (api:46)




2018-03-10 21:33:39,986+0200 DEBUG (tasks/6) [storage.operation] FAILED: <err> = bytearray(b'qemu-img: error while writing sector 11534336: No space left on device\n'); <rc> = 1 (operation:169)
2018-03-10 21:33:39,988+0200 ERROR (tasks/6) [storage.Image] Copy image error: image=5f618e7c-e724-475d-b8d5-c60439d04b68, src domain=34c01407-3634-41e7-96be-bd5cff15e9b9, dst domain=8820dfa3-0dff-4874-b431-c70266d10cdb (image:494)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 485, in _interImagesCopy
    self._run_qemuimg_operation(operation)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 113, in _run_qemuimg_operation
    operation.run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/qemuimg.py", line 276, in run
    for data in self._operation.watch():
  File "/usr/lib/python2.7/site-packages/vdsm/storage/operation.py", line 104, in watch
    self._finalize(b"", err)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/operation.py", line 178, in _finalize
    raise cmdutils.Error(self._cmd, rc, out, err)
Error: Command ['/usr/bin/taskset', '--cpu-list', '0-0', '/usr/bin/nice', '-n', '19', '/usr/bin/ionice', '-c', '3', '/usr/bin/qemu-img', 'convert', '-p', '-t', 'none', '-T', 'none', '-f', 'qcow2', u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge4__nfs__3/34c01407-3634-41e7-96be-bd5cff15e9b9/images/5f618e7c-e724-475d-b8d5-c60439d04b68/9ed4dd18-42e9-461a-a18f-38c913ffac9b', '-O', 'qcow2', '-o', 'compat=0.10,backing_file=c6e70cde-f1b3-46e5-8867-da0427cb5c19,backing_fmt=qcow2', '/rhev/data-center/mnt/blockSD/8820dfa3-0dff-4874-b431-c70266d10cdb/images/5f618e7c-e724-475d-b8d5-c60439d04b68/9ed4dd18-42e9-461a-a18f-38c913ffac9b'] failed with rc=1 out='' err=bytearray(b'qemu-img: error while writing sector 11534336: No space left on device\n')
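
The key details in the failing command are the compat=0.10 output option and a source image whose virtual size was extended past its backing file's. For reference, a minimal local sketch of that chain (assumption: a host with qemu-img installed; the file names and sizes are illustrative, not taken from this bug):

import subprocess

def qemu_img(*args):
    # vdsm shells out to qemu-img in much the same way (see the command above).
    subprocess.check_call(["qemu-img"] + list(args))

# Template image: a 1 GiB qcow2 base.
qemu_img("create", "-f", "qcow2", "base.qcow2", "1G")

# Thin VM disk on top of the template, then extended to 2 GiB, so the
# child's virtual size now exceeds its backing file's.
qemu_img("create", "-f", "qcow2",
         "-o", "backing_file=base.qcow2,backing_fmt=qcow2",
         "top.qcow2", "1G")
qemu_img("resize", "top.qcow2", "2G")

# Copy the child while keeping compat=0.10, mirroring the failing convert.
# qcow2 v2 (compat 0.10) has no zero clusters, so the tail past the backing
# file's end cannot be recorded as "known zero" and is written out as real
# clusters.
qemu_img("convert", "-p", "-f", "qcow2", "top.qcow2", "-O", "qcow2",
         "-o", "compat=0.10,backing_file=base.qcow2,backing_fmt=qcow2",
         "copy.qcow2")

On a plain file the copy simply grows, but on a block domain the destination LV is sized before the convert runs, so the extra writes fail with ENOSPC, matching the "error while writing sector 11534336" above.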



2018-03-10 21:33:41,830+0200 ERROR (tasks/6) [storage.TaskManager.Task] (Task='a4b1c616-164e-43b9-af6c-77e8f44d5809') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1766, in syncImageData
    img.syncData(sdUUID, imgUUID, dstSdUUID, syncType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 673, in syncData
    {'srcChain': srcChain, 'dstChain': dstChain})
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 495, in _interImagesCopy
    raise se.CopyImageError()
CopyImageError: low level Image copy failed: ()



Expected results:
Live storage migration should succeed




Version-Release number of selected component (if applicable):

vdsm-hook-openstacknet-4.20.20-1.el7ev.noarch
libvirt-daemon-driver-nwfilter-3.9.0-13.el7.x86_64
ovirt-hosted-engine-ha-2.2.6-1.el7ev.noarch
sanlock-python-3.6.0-1.el7.x86_64
libvirt-daemon-driver-storage-logical-3.9.0-13.el7.x86_64
libselinux-utils-2.5-12.el7.x86_64
vdsm-yajsonrpc-4.20.20-1.el7ev.noarch
qemu-kvm-rhev-2.10.0-21.el7.x86_64
vdsm-jsonrpc-4.20.20-1.el7ev.noarch
libvirt-daemon-config-network-3.9.0-13.el7.x86_64
vdsm-hook-vmfex-dev-4.20.20-1.el7ev.noarch
libvirt-lock-sanlock-3.9.0-13.el7.x86_64
ovirt-hosted-engine-setup-2.2.12-1.el7ev.noarch
libvirt-daemon-driver-storage-mpath-3.9.0-13.el7.x86_64
ovirt-imageio-common-1.2.1-0.el7ev.noarch
qemu-img-rhev-2.10.0-21.el7.x86_64
vdsm-python-4.20.20-1.el7ev.noarch
selinux-policy-3.13.1-192.el7.noarch
sanlock-3.6.0-1.el7.x86_64
vdsm-4.20.20-1.el7ev.x86_64
vdsm-hook-fcoe-4.20.20-1.el7ev.noarch
ovirt-host-4.2.2-1.el7ev.x86_64
libnfsidmap-0.25-19.el7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
libselinux-python-2.5-12.el7.x86_64
vdsm-common-4.20.20-1.el7ev.noarch
libvirt-daemon-driver-network-3.9.0-13.el7.x86_64
libvirt-daemon-config-nwfilter-3.9.0-13.el7.x86_64
libvirt-daemon-driver-interface-3.9.0-13.el7.x86_64
libvirt-daemon-driver-lxc-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-iscsi-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-scsi-3.9.0-13.el7.x86_64
libvirt-daemon-kvm-3.9.0-13.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
python-ovirt-engine-sdk4-4.2.4-1.el7ev.x86_64
vdsm-client-4.20.20-1.el7ev.noarch
selinux-policy-targeted-3.13.1-192.el7.noarch
vdsm-hook-vhostmd-4.20.20-1.el7ev.noarch
sanlock-lib-3.6.0-1.el7.x86_64
ovirt-provider-ovn-driver-1.2.8-1.el7ev.noarch
vdsm-hook-ethtool-options-4.20.20-1.el7ev.noarch
libvirt-python-3.9.0-1.el7.x86_64
qemu-guest-agent-2.8.0-2.el7.x86_64
ovirt-imageio-daemon-1.2.1-0.el7ev.noarch
libvirt-daemon-3.9.0-13.el7.x86_64
libvirt-daemon-driver-nodedev-3.9.0-13.el7.x86_64
libvirt-daemon-driver-qemu-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-rbd-3.9.0-13.el7.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
libvirt-daemon-driver-storage-3.9.0-13.el7.x86_64
qemu-kvm-common-rhev-2.10.0-21.el7.x86_64
vdsm-http-4.20.20-1.el7ev.noarch
libvirt-libs-3.9.0-13.el7.x86_64
vdsm-hook-vfio-mdev-4.20.20-1.el7ev.noarch
libvirt-daemon-driver-secret-3.9.0-13.el7.x86_64
libselinux-2.5-12.el7.x86_64
cockpit-ovirt-dashboard-0.11.14-0.1.el7ev.noarch
ovirt-setup-lib-1.1.4-1.el7ev.noarch
libvirt-daemon-driver-storage-core-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-13.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
ovirt-host-deploy-1.7.2-1.el7ev.noarch
nfs-utils-1.3.0-0.54.el7.x86_64
vdsm-network-4.20.20-1.el7ev.x86_64
libvirt-client-3.9.0-13.el7.x86_64
ovirt-host-dependencies-4.2.2-1.el7ev.x86_64
libvirt-daemon-driver-storage-disk-3.9.0-13.el7.x86_64
vdsm-api-4.20.20-1.el7ev.noarch
kernel 3.10.0-851.el7.x86_64 
Red Hat Enterprise Linux Server 7.5 (Maipo)



Additional info:


# qemu-img info /rhev/data-center/79d2a575-75aa-40e2-8caf-1f768de486e3/34c01407-3634-41e7-96be-bd5cff15e9b9/images/5f618e7c-e724-475d-b8d5-c60439d04b68/096d936c-11f4-4d74-bcf4-73fc97422ced
image: /rhev/data-center/79d2a575-75aa-40e2-8caf-1f768de486e3/34c01407-3634-41e7-96be-bd5cff15e9b9/images/5f618e7c-e724-475d-b8d5-c60439d04b68/096d936c-11f4-4d74-bcf4-73fc97422ced
file format: qcow2
virtual size: 7.0G (7516192768 bytes)
disk size: 196K
cluster_size: 65536
backing file: 9ed4dd18-42e9-461a-a18f-38c913ffac9b (actual path: /rhev/data-center/79d2a575-75aa-40e2-8caf-1f768de486e3/34c01407-3634-41e7-96be-bd5cff15e9b9/images/5f618e7c-e724-475d-b8d5-c60439d04b68/9ed4dd18-42e9-461a-a18f-38c913ffac9b)
backing file format: qcow2
Format specific information:
    compat: 0.10
    refcount bits: 16
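
The compat: 0.10 line above is the relevant detail. It can also be checked programmatically via qemu-img's JSON output (a small helper sketch; assumes a qemu-img new enough to support --output=json, which the 2.10 build in this environment is):

import json
import subprocess

def qcow2_compat(path):
    info = json.loads(subprocess.check_output(
        ["qemu-img", "info", "--output=json", path]))
    # "format-specific" is only reported for qcow2 images.
    return info.get("format-specific", {}).get("data", {}).get("compat")

# For the image above this returns "0.10"; qcow2 v3 images report "1.1".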




engine.log:

2018-03-10 21:33:56,496+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-22) [1f845844] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz have failed to move disk test_Disk1 to domain iscsi_3.

Comment 1 Allon Mureinik 2018-03-11 07:51:25 UTC
This is definitely a bug, but offhand, it does not look related to qemu's new locking mechanism. Actually, it looks similar to bug 1523614, and probably has to do with some subtle difference between qcow2's compat levels.

Elad, let's try to isolate the problem.
Can you run the same scenario on RHEL 7.4.z (with qemu 2.9.something) and double-check whether it reproduces?

I'm tentatively targeting this to 4.2.2 under the assumption it really is a regression.
If the requested analysis proves otherwise, we can rethink the targeting.

Comment 2 Red Hat Bugzilla Rules Engine 2018-03-11 07:51:31 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 Elad 2018-03-11 11:17:26 UTC
Created attachment 1406852 [details]
rhel-7.4

The bug doesn't reproduce on RHEL7.4.


2018-03-11 13:09:57,665+0200 INFO  (jsonrpc/1) [vdsm.api] FINISH syncImageData return=None from=::ffff:10.35.161.181,36978, flow_id=78d95342-cee0-4114-af41-d2b429270026, task_id=1b966efd-bb1d-4b2f-a784-817a19ad3160 (api:52)



[root@storage-ge1-vdsm1 ~]# rpm -qa |egrep 'vdsm|libvirt|qemu' 
qemu-kvm-tools-rhev-2.10.0-21.el7.x86_64
qemu-guest-agent-2.8.0-2.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-daemon-driver-interface-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-iscsi-3.2.0-14.el7_4.9.x86_64
vdsm-yajsonrpc-4.19.48-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.19.48-1.el7ev.noarch
libvirt-libs-3.2.0-14.el7_4.9.x86_64
vdsm-xmlrpc-4.19.48-1.el7ev.noarch
libvirt-daemon-driver-nwfilter-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-disk-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-kvm-3.2.0-14.el7_4.9.x86_64
vdsm-cli-4.19.48-1.el7ev.noarch
libvirt-daemon-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-nodedev-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-logical-3.2.0-14.el7_4.9.x86_64
vdsm-hook-localdisk-4.19.48-1.el7ev.noarch
qemu-img-rhev-2.10.0-21.el7.x86_64
vdsm-api-4.19.48-1.el7ev.noarch
qemu-kvm-common-rhev-2.10.0-21.el7.x86_64
libvirt-daemon-driver-storage-core-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-lxc-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-rbd-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-scsi-3.2.0-14.el7_4.9.x86_64
vdsm-hook-ethtool-options-4.19.48-1.el7ev.noarch
libvirt-3.2.0-14.el7_4.9.x86_64
vdsm-python-4.19.48-1.el7ev.noarch
libvirt-daemon-driver-network-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-config-network-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-3.2.0-14.el7_4.9.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64
libvirt-client-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-secret-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.9.x86_64
vdsm-jsonrpc-4.19.48-1.el7ev.noarch
vdsm-4.19.48-1.el7ev.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
libvirt-daemon-config-nwfilter-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-mpath-3.2.0-14.el7_4.9.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.9.x86_64
[root@storage-ge1-vdsm1 ~]# cat /etc/os-release 
PRETTY_NAME="Red Hat Enterprise Linux Server 7.4 (Maipo)"
[root@storage-ge1-vdsm1 ~]# uname -a
Linux storage-ge1-vdsm1.scl.lab.tlv.redhat.com 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 4 Benny Zlotnik 2018-03-13 09:36:04 UTC
It appears to be a similar bug to https://bugzilla.redhat.com/show_bug.cgi?id=1523614

The case there is that we have a snapshot and then extend the disk, which results in a child snapshot bigger than its parent.

The case here is that we extend the VM disk, which results in a child image bigger than its parent (the template's image).
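
The linked gerrit patches block the copy up front in this situation. An illustrative sketch of that validation (this is not the actual engine code, which is Java; the helper names are hypothetical):

import json
import subprocess

def virtual_size(path):
    info = json.loads(subprocess.check_output(
        ["qemu-img", "info", "--output=json", path]))
    return info["virtual-size"]  # bytes

def copy_is_safe(child_path, parent_path, compat):
    # An extended child over a smaller parent is only safe when the copy
    # can use qcow2 v3 (compat 1.1) zero clusters for the tail; with
    # compat 0.10 the tail would be fully allocated and overflow the LV.
    return compat != "0.10" or virtual_size(child_path) <= virtual_size(parent_path)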

Comment 5 Red Hat Bugzilla Rules Engine 2018-03-15 10:54:50 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 6 Allon Mureinik 2018-03-15 13:08:01 UTC
Benny, can you please add some doc text explaining the situation and what a user can do to overcome it?

Comment 7 Kevin Alon Goldblatt 2018-04-02 13:13:26 UTC
Verified with the following code:
----------------------------------------
ovirt-engine-4.2.2.6-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64



Verified with the following scenario:
-------------------------------------------
Steps to Reproduce:
- Created a 4.0 DC and cluster with a RHEL 7.5 host
- Created an NFS domain (created as v3)
- Created a template
- Created 2 VMs from the template as thin copies (the VM images are based on the template), created as qcow2
- Created a second storage domain (iSCSI)
- Copied the template disk to the second storage domain
- Started both VMs
- Tried to move one of the VMs' disks to the second domain >>>>> the disk move operation was successful


Moving to VERIFIED

Comment 8 Kevin Alon Goldblatt 2018-04-02 14:04:56 UTC
Verified with the following code:
----------------------------------------
ovirt-engine-4.2.2.6-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64


CORRECTION TO SCENARIO

Verified with the following scenario:
-------------------------------------------
Steps to Reproduce:
- Created a 4.0 DC and cluster with a RHEL 7.5 host
- Created an NFS domain (created as v3)
- Created a template
- Created 2 VMs from the template as thin copies (the VM images are based on the template), created as qcow2
- Created a second storage domain (iSCSI)
- Copied the template disk to the second storage domain
- Started both VMs
- Extended the disk of one of the VMs (added this step)
- Tried to move the extended disk to the second domain >>>>> this operation failed as expected


Moving to VERIFIED

Comment 9 Sandro Bonazzola 2018-04-05 09:38:28 UTC
This bug is included in the oVirt 4.2.2 release, published on March 28th, 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

