Bug 1554028

Summary: "No space left on device" error when copying a disk based on template to a block domain in DC <= 4.0 when the disk was extended
Product: [oVirt] ovirt-engine
Reporter: Elad <ebenahar>
Component: BLL.Storage
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED CURRENTRELEASE
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.2.0
CC: amureini, apinnick, bugs, bzlotnik, ebenahar, tnisan
Target Milestone: ovirt-4.2.2
Flags: rule-engine: ovirt-4.2+
       rule-engine: exception+
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Previously, with storage domains of data centers with compatibility version 4.0 or earlier, if a virtual disk based on a template disk was extended and moved during live storage migration, the move operation failed because the child image was larger than the parent image (template disk). In the current release, the move operation of such disks displays an error message instead of failing. The resolution for this scenario is to upgrade the data center to compatibility version 4.1 or later.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-05 09:38:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  logs (flags: none)
  rhel-7.4 (flags: none)

Description Elad 2018-03-10 20:30:18 UTC
Created attachment 1406703 [details]
logs

Description of problem:
With a qcow2 image on RHEL 7.5, live storage migration of a VM disk based on a template fails with CopyImageError when another running VM is also based on that template.

This seems to be caused by the new qemu image locking, which also impacts snapshot merge operations, as described in BZ 1552059.

How reproducible:
Always

Steps to Reproduce:
- Created a 4.0 DC and cluster with RHEL7.5 host
- Created a NFS domain (created as v3)
- Created a template 
- Created 2 VMs from the template as thin copy (VM image is based on the template) - created as qcow2
- Created a second storage domain (iSCSI)
- Copied template disk to the second storage domain
- Started both VMs 
- Tried to move one of the VMs disk to the second domain


This scenario was also tested on a 4.2 DC (v4 domain), with RHEL 7.5 and the same qemu, vdsm and libvirt packages; there, LSM works fine.
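For reference, a minimal sketch of scripting the final step (triggering the disk move / LSM) with python-ovirt-engine-sdk4, which is installed on the host. This is only an illustration, not the procedure actually used: the engine URL, credentials and CA file path are placeholders, and the disk/domain names are taken from the engine.log below (test_Disk1, iscsi_3).

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details -- adjust to the actual engine.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

disks_service = connection.system_service().disks_service()

# Locate the thin-provisioned VM disk that is based on the template.
disk = disks_service.list(search='name=test_Disk1')[0]
disk_service = disks_service.disk_service(disk.id)

# Move the disk to the second (iSCSI) domain; with the VM running,
# the engine performs live storage migration.
disk_service.move(storage_domain=types.StorageDomain(name='iscsi_3'))

connection.close()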



Actual results:

Disk move fails:


2018-03-10 21:33:11,567+0200 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='a4b1c616-164e-43b9-af6c-77e8f44d5809') moving from state init -> state preparing (task:602)
2018-03-10 21:33:11,569+0200 INFO  (jsonrpc/0) [vdsm.api] START syncImageData(spUUID='79d2a575-75aa-40e2-8caf-1f768de486e3', sdUUID='34c01407-3634-41e7-96be-bd5cff15e9b9', imgUUID='5f618e7c-e724-475d-b8d5-c60439d04b68', dstSdUUID='8820dfa3-0dff-4874-b431-c70266d10cdb', syncType='INTERNAL') from=::ffff:10.35.161.118,49528, flow_id=13000ef2-ea5e-4bb6-a36e-820a193bbd89, task_id=a4b1c616-164e-43b9-af6c-77e8f44d5809 (api:46)




2018-03-10 21:33:39,986+0200 DEBUG (tasks/6) [storage.operation] FAILED: <err> = bytearray(b'qemu-img: error while writing sector 11534336: No space left on device\n'); <rc> = 1 (operation:169)
2018-03-10 21:33:39,988+0200 ERROR (tasks/6) [storage.Image] Copy image error: image=5f618e7c-e724-475d-b8d5-c60439d04b68, src domain=34c01407-3634-41e7-96be-bd5cff15e9b9, dst domain=8820dfa3-0dff-4874-b431-c70266d10cdb (image:494)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 485, in _interImagesCopy
    self._run_qemuimg_operation(operation)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 113, in _run_qemuimg_operation
    operation.run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/qemuimg.py", line 276, in run
    for data in self._operation.watch():
  File "/usr/lib/python2.7/site-packages/vdsm/storage/operation.py", line 104, in watch
    self._finalize(b"", err)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/operation.py", line 178, in _finalize
    raise cmdutils.Error(self._cmd, rc, out, err)
Error: Command ['/usr/bin/taskset', '--cpu-list', '0-0', '/usr/bin/nice', '-n', '19', '/usr/bin/ionice', '-c', '3', '/usr/bin/qemu-img', 'convert', '-p', '-t', 'none', '-T', 'none', '-f', 'qcow2', u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge4__nfs__3/34c01407-3634-41e7-96be-bd5cff15e9b9/images/5f618e7c-e724-475d-b8d5-c60439d04b68/9ed4dd18-42e9-461a-a18f-38c913ffac9b', '-O', 'qcow2', '-o', 'compat=0.10,backing_file=c6e70cde-f1b3-46e5-8867-da0427cb5c19,backing_fmt=qcow2', '/rhev/data-center/mnt/blockSD/8820dfa3-0dff-4874-b431-c70266d10cdb/images/5f618e7c-e724-475d-b8d5-c60439d04b68/9ed4dd18-42e9-461a-a18f-38c913ffac9b'] failed with rc=1 out='' err=bytearray(b'qemu-img: error while writing sector 11534336: No space left on device\n')



2018-03-10 21:33:41,830+0200 ERROR (tasks/6) [storage.TaskManager.Task] (Task='a4b1c616-164e-43b9-af6c-77e8f44d5809') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 336, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1766, in syncImageData
    img.syncData(sdUUID, imgUUID, dstSdUUID, syncType)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 673, in syncData
    {'srcChain': srcChain, 'dstChain': dstChain})
  File "/usr/lib/python2.7/site-packages/vdsm/storage/image.py", line 495, in _interImagesCopy
    raise se.CopyImageError()
CopyImageError: low level Image copy failed: ()



Expected results:
Live storage migration should succeed




Version-Release number of selected component (if applicable):

vdsm-hook-openstacknet-4.20.20-1.el7ev.noarch
libvirt-daemon-driver-nwfilter-3.9.0-13.el7.x86_64
ovirt-hosted-engine-ha-2.2.6-1.el7ev.noarch
sanlock-python-3.6.0-1.el7.x86_64
libvirt-daemon-driver-storage-logical-3.9.0-13.el7.x86_64
libselinux-utils-2.5-12.el7.x86_64
vdsm-yajsonrpc-4.20.20-1.el7ev.noarch
qemu-kvm-rhev-2.10.0-21.el7.x86_64
vdsm-jsonrpc-4.20.20-1.el7ev.noarch
libvirt-daemon-config-network-3.9.0-13.el7.x86_64
vdsm-hook-vmfex-dev-4.20.20-1.el7ev.noarch
libvirt-lock-sanlock-3.9.0-13.el7.x86_64
ovirt-hosted-engine-setup-2.2.12-1.el7ev.noarch
libvirt-daemon-driver-storage-mpath-3.9.0-13.el7.x86_64
ovirt-imageio-common-1.2.1-0.el7ev.noarch
qemu-img-rhev-2.10.0-21.el7.x86_64
vdsm-python-4.20.20-1.el7ev.noarch
selinux-policy-3.13.1-192.el7.noarch
sanlock-3.6.0-1.el7.x86_64
vdsm-4.20.20-1.el7ev.x86_64
vdsm-hook-fcoe-4.20.20-1.el7ev.noarch
ovirt-host-4.2.2-1.el7ev.x86_64
libnfsidmap-0.25-19.el7.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
ovirt-vmconsole-1.0.4-1.el7ev.noarch
libselinux-python-2.5-12.el7.x86_64
vdsm-common-4.20.20-1.el7ev.noarch
libvirt-daemon-driver-network-3.9.0-13.el7.x86_64
libvirt-daemon-config-nwfilter-3.9.0-13.el7.x86_64
libvirt-daemon-driver-interface-3.9.0-13.el7.x86_64
libvirt-daemon-driver-lxc-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-iscsi-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-scsi-3.9.0-13.el7.x86_64
libvirt-daemon-kvm-3.9.0-13.el7.x86_64
ovirt-vmconsole-host-1.0.4-1.el7ev.noarch
python-ovirt-engine-sdk4-4.2.4-1.el7ev.x86_64
vdsm-client-4.20.20-1.el7ev.noarch
selinux-policy-targeted-3.13.1-192.el7.noarch
vdsm-hook-vhostmd-4.20.20-1.el7ev.noarch
sanlock-lib-3.6.0-1.el7.x86_64
ovirt-provider-ovn-driver-1.2.8-1.el7ev.noarch
vdsm-hook-ethtool-options-4.20.20-1.el7ev.noarch
libvirt-python-3.9.0-1.el7.x86_64
qemu-guest-agent-2.8.0-2.el7.x86_64
ovirt-imageio-daemon-1.2.1-0.el7ev.noarch
libvirt-daemon-3.9.0-13.el7.x86_64
libvirt-daemon-driver-nodedev-3.9.0-13.el7.x86_64
libvirt-daemon-driver-qemu-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-rbd-3.9.0-13.el7.x86_64
ovirt-engine-sdk-python-3.6.9.1-1.el7ev.noarch
libvirt-daemon-driver-storage-3.9.0-13.el7.x86_64
qemu-kvm-common-rhev-2.10.0-21.el7.x86_64
vdsm-http-4.20.20-1.el7ev.noarch
libvirt-libs-3.9.0-13.el7.x86_64
vdsm-hook-vfio-mdev-4.20.20-1.el7ev.noarch
libvirt-daemon-driver-secret-3.9.0-13.el7.x86_64
libselinux-2.5-12.el7.x86_64
cockpit-ovirt-dashboard-0.11.14-0.1.el7ev.noarch
ovirt-setup-lib-1.1.4-1.el7ev.noarch
libvirt-daemon-driver-storage-core-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-13.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
ovirt-host-deploy-1.7.2-1.el7ev.noarch
nfs-utils-1.3.0-0.54.el7.x86_64
vdsm-network-4.20.20-1.el7ev.x86_64
libvirt-client-3.9.0-13.el7.x86_64
ovirt-host-dependencies-4.2.2-1.el7ev.x86_64
libvirt-daemon-driver-storage-disk-3.9.0-13.el7.x86_64
vdsm-api-4.20.20-1.el7ev.noarch
kernel 3.10.0-851.el7.x86_64 
Red Hat Enterprise Linux Server 7.5 (Maipo)



Additional info:


# qemu-img info /rhev/data-center/79d2a575-75aa-40e2-8caf-1f768de486e3/34c01407-3634-41e7-96be-bd5cff15e9b9/images/5f618e7c-e724-475d-b8d5-c60439d04b68/096d936c-11f4-4d74-bcf4-73fc97422ced
image: /rhev/data-center/79d2a575-75aa-40e2-8caf-1f768de486e3/34c01407-3634-41e7-96be-bd5cff15e9b9/images/5f618e7c-e724-475d-b8d5-c60439d04b68/096d936c-11f4-4d74-bcf4-73fc97422ced
file format: qcow2
virtual size: 7.0G (7516192768 bytes)
disk size: 196K
cluster_size: 65536
backing file: 9ed4dd18-42e9-461a-a18f-38c913ffac9b (actual path: /rhev/data-center/79d2a575-75aa-40e2-8caf-1f768de486e3/34c01407-3634-41e7-96be-bd5cff15e9b9/images/5f618e7c-e724-475d-b8d5-c60439d04b68/9ed4dd18-42e9-461a-a18f-38c913ffac9b)
backing file format: qcow2
Format specific information:
    compat: 0.10
    refcount bits: 16




engine.log:

2018-03-10 21:33:56,496+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-22) [1f845844] EVENT_ID: USER_MOVED_DISK_FINISHED_FAILURE(2,011), User admin@internal-authz have failed to move disk test_Disk1 to domain iscsi_3.

Comment 1 Allon Mureinik 2018-03-11 07:51:25 UTC
This is definitely a bug, but offhand, it does not look related to qemu's new locking mechanism. Actually, it looks similar to bug 1523614, and probably has to do with some subtle difference between qcow2's compat levels.

Elad, let's try to isolate the problem.
Can you run the same scenario on RHEL 7.4.z (with qemu 2.9.something) and double check whether it reproduces?

I'm tentatively targeting this to 4.2.2 under the assumption that it really is a regression.
If the aforementioned analysis proves otherwise, we can rethink the targeting.

Comment 2 Red Hat Bugzilla Rules Engine 2018-03-11 07:51:31 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 Elad 2018-03-11 11:17:26 UTC
Created attachment 1406852 [details]
rhel-7.4

The bug doesn't reproduce on RHEL7.4.


2018-03-11 13:09:57,665+0200 INFO  (jsonrpc/1) [vdsm.api] FINISH syncImageData return=None from=::ffff:10.35.161.181,36978, flow_id=78d95342-cee0-4114-af41-d2b429270026, task_id=1b966efd-bb1d-4b2f-a784-817a19ad3160 (api:52)



[root@storage-ge1-vdsm1 ~]# rpm -qa |egrep 'vdsm|libvirt|qemu' 
qemu-kvm-tools-rhev-2.10.0-21.el7.x86_64
qemu-guest-agent-2.8.0-2.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7.x86_64
libvirt-daemon-driver-interface-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-iscsi-3.2.0-14.el7_4.9.x86_64
vdsm-yajsonrpc-4.19.48-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.19.48-1.el7ev.noarch
libvirt-libs-3.2.0-14.el7_4.9.x86_64
vdsm-xmlrpc-4.19.48-1.el7ev.noarch
libvirt-daemon-driver-nwfilter-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-disk-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-kvm-3.2.0-14.el7_4.9.x86_64
vdsm-cli-4.19.48-1.el7ev.noarch
libvirt-daemon-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-nodedev-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-logical-3.2.0-14.el7_4.9.x86_64
vdsm-hook-localdisk-4.19.48-1.el7ev.noarch
qemu-img-rhev-2.10.0-21.el7.x86_64
vdsm-api-4.19.48-1.el7ev.noarch
qemu-kvm-common-rhev-2.10.0-21.el7.x86_64
libvirt-daemon-driver-storage-core-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-lxc-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-rbd-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-scsi-3.2.0-14.el7_4.9.x86_64
vdsm-hook-ethtool-options-4.19.48-1.el7ev.noarch
libvirt-3.2.0-14.el7_4.9.x86_64
vdsm-python-4.19.48-1.el7ev.noarch
libvirt-daemon-driver-network-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-config-network-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-3.2.0-14.el7_4.9.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64
libvirt-client-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-secret-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.9.x86_64
vdsm-jsonrpc-4.19.48-1.el7ev.noarch
vdsm-4.19.48-1.el7ev.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
libvirt-daemon-config-nwfilter-3.2.0-14.el7_4.9.x86_64
libvirt-daemon-driver-storage-mpath-3.2.0-14.el7_4.9.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.9.x86_64
[root@storage-ge1-vdsm1 ~]# cat /etc/os-release 
PRETTY_NAME="Red Hat Enterprise Linux Server 7.4 (Maipo)"
[root@storage-ge1-vdsm1 ~]# uname -a
Linux storage-ge1-vdsm1.scl.lab.tlv.redhat.com 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Comment 4 Benny Zlotnik 2018-03-13 09:36:04 UTC
It appears to be a similar bug to https://bugzilla.redhat.com/show_bug.cgi?id=1523614

The case there is that we have a snapshot and then we extend the disk, which results in a child snapshot bigger than its parent.

The case here is that we extend the VM disk, which results in a child image bigger than its parent (the template's image).
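A quick way to confirm this condition is to compare the virtual sizes reported by qemu-img info --output=json for the child image and its backing file. A minimal sketch, assuming placeholder volume paths (the real ones live under /rhev/data-center as in the logs above):

import json
import subprocess

def virtual_size(path):
    # 'virtual-size' is reported in bytes; on qemu 2.10+ add '-U' if the
    # image is currently in use by a running VM (image locking).
    out = subprocess.check_output(['qemu-img', 'info', '--output=json', path])
    return json.loads(out)['virtual-size']

# Placeholder paths for the extended VM volume and the template volume it is based on.
child = '/rhev/data-center/<pool>/<domain>/images/<image>/<vm-volume>'
parent = '/rhev/data-center/<pool>/<domain>/images/<image>/<template-volume>'

if virtual_size(child) > virtual_size(parent):
    print('child image is larger than its parent -- copying it to a block '
          'domain in a <= 4.0 DC is expected to fail with ENOSPC')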

Comment 5 Red Hat Bugzilla Rules Engine 2018-03-15 10:54:50 UTC
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 6 Allon Mureinik 2018-03-15 13:08:01 UTC
Benny, can you please add some doc text explaining the situation and what a user can do to overcome it?

Comment 7 Kevin Alon Goldblatt 2018-04-02 13:13:26 UTC
Verified with the following code:
----------------------------------------
ovirt-engine-4.2.2.6-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64



Verified with the following scenario:
-------------------------------------------
Steps to Reproduce:
- Created a 4.0 DC and cluster with RHEL7.5 host
- Created a NFS domain (created as v3)
- Created a template 
- Created 2 VMs from the template as thin copy (VM image is based on the template) - created as qcow2
- Created a second storage domain (iSCSI)
- Copied template disk to the second storage domain
- Started both VMs 
- Tried to move one of the VMs disk to the second domain >>>>> disk move operation was successful


Moving to VERIFIED

Comment 8 Kevin Alon Goldblatt 2018-04-02 14:04:56 UTC
Verified with the following code:
----------------------------------------
ovirt-engine-4.2.2.6-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64


CORRECTION TO SCENARIO

Verified with the following scenario:
-------------------------------------------
Steps to Reproduce:
- Created a 4.0 DC and cluster with RHEL7.5 host
- Created a NFS domain (created as v3)
- Created a template 
- Created 2 VMs from the template as thin copy (VM image is based on the template) - created as qcow2
- Created a second storage domain (iSCSI)
- Copied template disk to the second storage domain
- Started both VMs 
- Extended the disk of one of the VMs (added this step)
- Tried to move the extended disk to the second domain >>>>> This operation fails as expected
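For completeness, a minimal sketch of scripting the added "extend the disk" step with python-ovirt-engine-sdk4, under the same placeholder connection details as the earlier sketch; the VM name is a placeholder, and it assumes the VM has a single disk attachment that is resized by growing provisioned_size:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=vm1')[0]  # placeholder VM name

# Grow the VM's disk by 1 GiB via its disk attachment.
attachments_service = vms_service.vm_service(vm.id).disk_attachments_service()
attachment = attachments_service.list()[0]           # assumes a single disk
disk = connection.follow_link(attachment.disk)       # fetch full disk details
attachments_service.attachment_service(attachment.id).update(
    types.DiskAttachment(
        disk=types.Disk(provisioned_size=disk.provisioned_size + 1024**3),
    ),
)

connection.close()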


Moving to VERIFIED

Comment 9 Sandro Bonazzola 2018-04-05 09:38:28 UTC
This bugzilla is included in the oVirt 4.2.2 release, published on March 28th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.