Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1639244

Summary: [downstream clone - 4.2.7] Extending VM disk fails with message "disk was successfully updated to 0 GB"
Product: Red Hat Enterprise Virtualization Manager
Reporter: RHV bug bot <rhv-bugzilla-bot>
Component: ovirt-engine
Assignee: Eyal Shenitzky <eshenitz>
Status: CLOSED ERRATA
QA Contact: Shir Fishbain <sfishbai>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.2.3
CC: apinnick, eedri, eshenitz, gveitmic, kwolf, lsurette, lveyde, nashok, nsoffer, ratamir, Rhev-m-bugs, sfishbai, srevivo, tnisan, ycui
Target Milestone: ovirt-4.2.7
Keywords: ZStream
Target Release: ---
Flags: lsvaty: testing_plan_complete-
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: vdsm v4.20.43
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1612978
Environment:
Last Closed: 2018-11-05 15:03:18 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1612978    
Bug Blocks:    
Attachments: none

Description RHV bug bot 2018-10-15 11:31:51 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1612978 +++
======================================================================

Description of problem:

If a VM with a qcow2 disk went into paused mode because of a lack of space in the storage domain, and the user then tries to extend the disk, the operation fails during the "qemu-img resize" step while the engine reports "disk was successfully updated to 0 GB".

When the disk's allocation got close to the size of its logical volume, VDSM tried to extend the volume, which failed with the error below because the storage domain did not have enough free space.

===
2018-08-06 20:00:30,256+0530 ERROR (mailbox-spm/0) [storage.SPM.Messages.Extend] processRequest: Exception caught while trying to extend volume: 5083c58f-2f08-478e-9874-8761ac2d51c3 in domain: 0c975612-ce0a-49b7-9927-41d4c46538e7 (mailbox:172)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/mailbox.py", line 166, in processRequest
    pool.extendVolume(volume['domainID'], volume['volumeID'], size)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 1420, in extendVolume
    sdCache.produce(sdUUID).extendVolume(volumeUUID, size, isShuttingDown)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockSD.py", line 1564, in extendVolume
    lvm.extendLV(self.sdUUID, volumeUUID, size)  # , isShuttingDown)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/lvm.py", line 1232, in extendLV
    % (vgName, lvName, free_extents, needed_extents))
VolumeGroupSizeError: Volume Group not big enough: ('Not enough free extents for extending LV 0c975612-ce0a-49b7-9927-41d4c46538e7/5083c58f-2f08-478e-9874-8761ac2d51c3 (free=4, needed=8)',)
===
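
For reference, the error is raised when the requested extension, converted into VG extents, exceeds the free extents left in the storage domain's volume group. A minimal sketch of that kind of check (this is not the actual vdsm/storage/lvm.py code; the helper names and the 128 MiB extent size are assumptions for illustration):

===
EXTENT_SIZE_MB = 128  # extent size assumed for an RHV block storage domain VG

class VolumeGroupSizeError(Exception):
    pass

def check_extend(vg_free_extents, current_size_mb, new_size_mb):
    # Convert the requested growth into whole extents (rounding up) and
    # compare it with what the VG still has available.
    needed = -(-(new_size_mb - current_size_mb) // EXTENT_SIZE_MB)
    if needed > vg_free_extents:
        raise VolumeGroupSizeError(
            "Not enough free extents for extending LV (free=%d, needed=%d)"
            % (vg_free_extents, needed))

# With 4 free extents (512 MiB) left, a 1 GiB extension request fails
# exactly like the log above (free=4, needed=8).
check_extend(vg_free_extents=4, current_size_mb=1024, new_size_mb=2048)
===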

The VM went into paused mode after this with ENOSPC.

===
2018-08-06 20:00:13,825+0530 INFO  (libvirt/events) [virt.vm] (vmId='0f7ccc40-35b3-4535-8845-d9a2641e51ba') abnormal vm stop device ua-96c03262-118e-4ea9-a277-054b07965f9d error enospc (vm:5075)
===

The VM was shut down and some disks were removed to free up space in the storage domain. After that, an attempt to extend the disk by 20 GB failed with the error below.

===
2018-08-06 20:15:39,806+0530 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='f6c6b7b7-a627-4c26-81b7-97ae6b5fd9b7') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in updateVolumeSize
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 758, in updateVolumeSize
    qemuimg.resize(volPath, newSizeBytes, qemuImgFormat)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/qemuimg.py", line 333, in resize
    _run_cmd(cmd)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/qemuimg.py", line 409, in _run_cmd
    raise cmdutils.Error(cmd, rc, out, err)
Error: Command ['/usr/bin/qemu-img', 'resize', '-f', 'qcow2', u'/rhev/data-center/mnt/blockSD/0c975612-ce0a-49b7-9927-41d4c46538e7/images/96c03262-118e-4ea9-a277-054b07965f9d/5083c58f-2f08-478e-9874-8761ac2d51c3', '21474836480'] failed with rc=1 out='' err='qemu-img: Failed to grow the L1 table: No space left on device\n'
===


Since it's a qcow2 disk, the LV is not extended as part of the resize. However, the image may need more space to allocate new L1/refcount tables.
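
To make that metadata cost concrete: with the default 64 KiB qcow2 cluster size, each L1 entry points to an L2 table that maps 512 MiB of guest-visible space, and growing the virtual size makes qemu-img write a grown L1 table (plus refcount updates) into newly allocated clusters. A back-of-the-envelope sketch, assuming the default cluster size (this is not VDSM code):

===
CLUSTER_SIZE = 64 * 1024                        # default qcow2 cluster size
L2_ENTRIES = CLUSTER_SIZE // 8                  # 8-byte entries per L2 table
BYTES_PER_L1_ENTRY = CLUSTER_SIZE * L2_ENTRIES  # 512 MiB of guest space per L1 entry

def l1_table_clusters(virtual_size_bytes):
    """Clusters needed to hold the L1 table for a given virtual size."""
    l1_entries = -(-virtual_size_bytes // BYTES_PER_L1_ENTRY)  # ceil division
    return -(-(l1_entries * 8) // CLUSTER_SIZE)

# Growing the disk to 20 GiB (the resize in the traceback above) still fits
# the L1 table in a single 64 KiB cluster, but qemu-img has to allocate that
# cluster (and refcount metadata) on the device, and that allocation is what
# fails with ENOSPC when the LV is already completely full.
print(l1_table_clusters(20 * 1024**3))
===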

Both the engine database and the on-storage volume metadata are updated with a size of 0 for the disk.

===
engine=# select  log_time,message from audit_log where message like '%GlanceDisk%' order by log_time desc;
          log_time          |                                             message                                              
----------------------------+--------------------------------------------------------------------------------------------------
 2018-08-06 10:56:19.048-04 | VM rhel_vm GlanceDisk-9a3800a disk was updated by admin@internal-authz.
 2018-08-06 10:56:18.999-04 | Size of the disk 'GlanceDisk-9a3800a' was successfully updated to 0 GB by admin@internal-authz.



engine=# select image_guid,size from images where image_guid = '5083c58f-2f08-478e-9874-8761ac2d51c3';
              image_guid              | size 
--------------------------------------+------
 5083c58f-2f08-478e-9874-8761ac2d51c3 |    0
(1 row)


dd if=/dev/0c975612-ce0a-49b7-9927-41d4c46538e7/metadata bs=512 skip=5 count=1 
DOMAIN=0c975612-ce0a-49b7-9927-41d4c46538e7
CTIME=1533565674
FORMAT=COW
DISKTYPE=DATA
LEGALITY=LEGAL
SIZE=0
VOLTYPE=LEAF
DESCRIPTION=
IMAGE=96c03262-118e-4ea9-a277-054b07965f9d
PUUID=99d38ffe-806a-4631-a28f-61196d6b5147
MTIME=0
POOL_UUID=
TYPE=SPARSE
GEN=0
EOF
===

This happens because the size is set to 0 before "qemuimg.resize" is called, and it is set back to the real image size (obtained from qemuimg.info) only if "qemuimg.resize" succeeds.

===
lib/vdsm/storage/hsm.py

 756             # Uncommit the current size
 757             volToExtend.setSize(0)
 758             qemuimg.resize(volPath, newSizeBytes, qemuImgFormat)
 759             roundedSizeBytes = qemuimg.info(volPath,
 760                                             qemuImgFormat)['virtualsize']
 761         finally:
 762             volToExtend.teardown(sdUUID, volUUID)
 763 
 764         volToExtend.setSize(
 765             (roundedSizeBytes + SECTOR_SIZE - 1) / SECTOR_SIZE)
 766 
 767         return dict(size=str(roundedSizeBytes))
====


The portal also shows the disk size as 0 GB.


Version-Release number of selected component (if applicable):

rhvm-4.2.3.5-0.1.el7.noarch
vdsm-4.20.27.1-1.el7ev.x86_64

How reproducible:

Observed in a customer environment running 4.1. I was able to reproduce this in my lab environment running 4.2 multiple times, although it's not always reproducible. I think it depends on how much of the qcow2 disk is allocated versus the size of the LV: extending the image apparently needs to create new L1/refcount tables, and this fails if the LV doesn't have enough free space.

Steps to Reproduce:

1. Start a VM with a qcow2 disk.

2. Decrease the free space in the storage domain to about 1 GB by adding other preallocated disks.

3. Start writing data from inside the VM using the dd command. The VM will go into paused mode.

4. Shut down the VM, free up space on the storage domain, and try to extend the disk. It will probably fail with the above error.


Actual results:

Extending the disk failed, and the size was set to 0 GB in the database and in the volume metadata.

Expected results:

I think we should detect that the image needs a larger backing LV and extend it accordingly. Alternatively, we should capture the error correctly, reflect it in the Administration Portal, and not set the size to 0.

Additional info:

(Originally by Nijin Ashok)

Comment 3 RHV bug bot 2018-10-15 11:32:04 UTC
Nijin, 

Can you please supply also the engine log?

(Originally by Eyal Shenitzky)

Comment 4 RHV bug bot 2018-10-15 11:32:10 UTC
(In reply to Eyal Shenitzky from comment #2)
> Nijin, 
> 
> Can you please supply also the engine log?

I was unable to reproduce the bug with the following steps:

1) Create a VM with 2 disks (OS, empty thin-provisioned block-based disk)
2) Create a large preallocated block-based disk that filled the storage domain (only 1 GB remained)
3) Run the VM
4) Start dd operation (4 GB) to the empty thin-provisioned block-based disk
5) When the VM entered pause mode, shut down the VM
6) Remove the large preallocated disk
7) Select to extend the empty thin-provisioned block-based disk

Actual results:
Managed to extend the disk, VM started and works well.

Am I missing something?

(Originally by Eyal Shenitzky)

Comment 5 RHV bug bot 2018-10-15 11:32:15 UTC
(In reply to Eyal Shenitzky from comment #3)
> (In reply to Eyal Shenitzky from comment #2)
> > Nijin, 
> > 
> > Can you please supply also the engine log?
> 
> I was unable to reproduce the bug with the following steps:
> 
> 1) Create a VM with 2 disks (OS, empty thin-provisioned block-based disk)
> 2) Create a large preallocated block-based disk that filled the storage
> domain (only 1 GB remained)
> 3) Run the VM
> 4) Start dd operation (4 GB) to the empty thin-provisioned block-based disk
> 5) When the VM entered to pause mode, shut down the VM
> 6) Remove the large preallocated disk
> 7) Select to extend the empty thin-provisioned block-based disk
> 
> Actual results:
> Managed to extend the disk, VM started and works well.
> 
> Am I missing something?

I tried again and I was able to reproduce it, following the same steps as yours.

Created a 3 GB thin-provisioned disk and attached it to the VM. The actual size is 1 GB and the virtual size is 3 GB.
Added a large preallocated disk to the storage domain, leaving it with less than 1 GB of free space.
Started writing data to the disk using the command "dd if=/dev/urandom bs=1M of=/dev/sdb" from the VM.
The VM went into paused mode and I powered it down after that.
Reclaimed the space in the storage domain by deleting the large preallocated disk.
Extended the thin-provisioned disk by 10 GB, which failed with the error below.

===
2018-09-05 11:54:37,908+0530 ERROR (jsonrpc/4) [storage.Dispatcher] FINISH updateVolumeSize error=Command ['/usr/bin/qemu-img', 'resize', '-f', 'qcow2', u'/rhev/data-center/mnt/blockSD/c21260fa-bc27-4bd3-9841-99da4e2d0402/images/81209358-1712-4fd9-a2ff-d41c7f19b7c8/2849de7f-fc48-40bc-af2f-a62ed008e9ba', '13958643712'] failed with rc=1 out='' err='qemu-img: Failed to grow the L1 table: No space left on device\n' (dispatcher:86)

2018-09-05 04:52:41,629-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-3355) [] EVENT_ID: USER_EXTEND_DISK_SIZE_SUCCESS(371), Size of the disk 'nijin_bz1612978_Disk1' was successfully updated to 0 GB by admin@internal-authz.
===

Tried 4 times with the steps above and was able to reproduce the issue every time.

This particular VM is available in my environment, and extending this disk always fails with the mentioned error. Please let me know if you want the credentials to my test environment to have a look.

Attaching both engine and vdsm log.

(Originally by Nijin Ashok)

Comment 7 RHV bug bot 2018-10-15 11:32:26 UTC
Thanks, I managed to reproduce it also.

When you try to extend the disk a second time, does it work for you?

(Originally by Eyal Shenitzky)

Comment 8 RHV bug bot 2018-10-15 11:32:30 UTC
We need to understand why resize behaves this way. It looks like the author of this
code wanted to keep the disk in an invalid state if qemu-img resize failed. I think
this behavior exists in several places in vdsm.

This behavior looks useless when the resize fails because we don't have enough space,
but I'm not sure we get a useful error code from qemu-img about the failure, so
we may not be able to detect this case.

Kevin, can we use an image safely after qemu-img resize fails? Can we distinguish errors
meaning that the disk is corrupted and should not be used from errors like ENOSPC,
which should be safe?

(Originally by Nir Soffer)

Comment 9 RHV bug bot 2018-10-15 11:32:35 UTC
To be honest, I don't understand the whole scenario to begin with. If you get ENOSPC on the host and the VM is stopped, then you need to make space for the image file to grow, not extend the virtual disk size. Obviously, if you already got ENOSPC, there is no space left to store new data in the image, so increasing the virtual disk size will fail as well. What you really need to do is extend the LV the image is stored on.

Second, maybe even more importantly, the 'qemu-img resize' operation in this context isn't only misguided, but actively dangerous: The image is already opened read-write by the VM, and doing a 'qemu-img resize' on it sounds like a great way to corrupt it. For local files, image locking should prevent this in current QEMU versions, but for everything else we have to rely on the user (or RHV) not to do stupid things.

Finally, the size being reset to 0 seems to be a RHV-internal thing. I am pretty sure that 'qemu-img info' will still show the real size. As long as you didn't resume the VM afterwards (which has old metadata cached and would continue to use it), there are even chances that it's not completely corrupted. However, let me repeat that the possible corruption is not from the failed 'qemu-img resize' (this is always safe), but from opening the image read-write from two processes.

When QEMU detects image corruption in a qcow2 image, it flags it as such and refuses to open it again until you repair it. This will be visible in the log files with messages like "Marking image as corrupt: ..." (when first detecting the corruption) or "qcow2: Image is corrupt; cannot be opened read/write" (when trying to open an image that is marked as corrupt). So QEMU already takes care that an image that shouldn't be used really isn't used.

(Originally by Kevin Wolf)

Comment 10 RHV bug bot 2018-10-15 11:32:40 UTC
Thanks for the quick reply Kevin!

We do not use qemu-img resize on a live image. When resizing a live image we use
the proper libvirt API.

The flow is:

- set image size in RHV metadata to 0
- call qemu-img resize
- update image size in RHV metadata to actual size after resize

Based on comment 8, qemu-img resize is always safe, and QEMU already handles bad
images, so I think we should change the resize flow to:

- call qemu-img resize
- update image size in RHV metadata to actual size

The same flow exists when resizing a live image, where we also set the image size to 0 before
the operation. We should fix that flow as well.
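
A minimal sketch of that reordered flow, reusing the names from the hsm.py excerpt in the description (qemuimg, SECTOR_SIZE, the volume object); this only illustrates the proposed ordering and is not the actual patch:

===
def update_volume_size(vol_to_extend, vol_path, new_size_bytes, img_format):
    # Resize first; if qemu-img fails, the RHV metadata keeps the old
    # size instead of being "uncommitted" to 0 beforehand.
    qemuimg.resize(vol_path, new_size_bytes, img_format)
    rounded_size = qemuimg.info(vol_path, img_format)['virtualsize']
    vol_to_extend.setSize((rounded_size + SECTOR_SIZE - 1) // SECTOR_SIZE)
    return dict(size=str(rounded_size))
===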

(Originally by Nir Soffer)

Comment 11 RHV bug bot 2018-10-15 11:32:45 UTC
(In reply to Kevin Wolf from comment #8)
> To be honest, I don't understand the whole scenario to begin with. If you
> get ENOSPC on the host and the VM is stopped, then you need to make space
> for the image file to grow, not extend the virtual disk size. Obviously, if
> you already got ENOSPC, there is no space left to store new data in the
> image, so increasing the virtual disk size will fail as well. What you
> really need to do is extend the LV the image is stored on.
> 

The scenario is that 'qemu-img resize' failed after performing the following steps:

1) Create a VM with a thin-provisioned block-based disk and run it
2) Create a new preallocated disk (not related to the VM) that fills the block-based storage domain
4) Start a dd operation to the empty thin-provisioned block-based disk -> this causes the VM to enter 'pause' mode
5) When the VM enters pause mode, shut down the VM
6) Remove the large preallocated disk -> this makes space for the disk resize
7) Extend the thin-provisioned block-based disk

The question here is: why is the qemu-img resize failing?

> Second, maybe even more importantly, the 'qemu-img resize' operation in this
> context isn't only misguided, but actively dangerous: The image is already
> opened read-write by the VM, and doing a 'qemu-img resize' on it sounds like
> a great way to corrupt it. For local files, image locking should prevent
> this in current QEMU versions, but for everything else we have to rely on
> the user (or RHV) not to do stupid things.
> 
> Finally, the size being reset to 0 seems to be a RHV-internal thing. I am
> pretty sure that 'qemu-img info' will still show the real size. As long as
> you didn't resume the VM afterwards (which has old metadata cached and would
> continue to use it), there are even chances that it's not completely
> corrupted. However, let me repeat that the possible corruption is not from
> the failed 'qemu-img resize' (this is always safe), but from opening the
> image read-write from two processes.
> 
> When QEMU detects image corruption in a qcow2 image, it flags it as such and
> refuses to open it again until you repair it. This will be visible in the
> log files with messages like "Marking image as corrupt: ..." (when first
> detection corruption) or "qcow2: Image is corrupt; cannot be opened
> read/write" (when trying to open an image that is marked as corrupt). So
> QEMU already takes care that an image that shouldn't be used really isn't
> used.

(Originally by Eyal Shenitzky)

Comment 12 RHV bug bot 2018-10-15 11:32:53 UTC
(In reply to Eyal Shenitzky from comment #10)
> The question here is why the qemu-img resize failing?

If I understand the scenario correctly, it is failing because the block device on which the qcow2 image is stored is full, and you didn't resize that block device.

(Originally by Kevin Wolf)

Comment 13 RHV bug bot 2018-10-15 11:33:03 UTC
> If I understand the scenario correctly, it is failing because the block
> device on which the qcow2 image is stored is full, and you didn't resize
> that block device.

But in step 6 the preallocated disk that filled the storage was removed, so the resize should succeed

(Originally by Eyal Shenitzky)

Comment 14 RHV bug bot 2018-10-15 11:33:08 UTC
Where did you resize the LV identified by /rhev/data-center/mnt/blockSD/0c975612-ce0a-49b7-9927-41d4c46538e7/images/96c03262-118e-4ea9-a277-054b07965f9d/5083c58f-2f08-478e-9874-8761ac2d51c3? I can't see this in any of your steps. You can remove as many other things as you want, as long as the LV is full, it doesn't help.

(Originally by Kevin Wolf)

Comment 15 RHV bug bot 2018-10-15 11:33:14 UTC
Eyal, can you create a reproduce script that will make your question more clear?

You can create a block device using losetup, then create a vg and some lvs.

Then simulate the flow using qemu-img on the lvs.
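
Something along these lines could work (a rough sketch, to be run as root on a disposable test host; the device names, sizes, and the use of qemu-io are assumptions, not a supported script):

===
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    return subprocess.check_output(list(cmd))

# Back a small VG with a loop device.
backing = "/var/tmp/extend-repro.img"
run("truncate", "-s", "5G", backing)
loop = run("losetup", "-f", "--show", backing).decode().strip()
run("pvcreate", loop)
run("vgcreate", "testvg", loop)

# Make the LV much smaller than the qcow2 virtual size, like a thin disk
# on a nearly full block storage domain.
run("lvcreate", "-n", "testlv", "-L", "1G", "testvg")
lv = "/dev/testvg/testlv"
run("qemu-img", "create", "-f", "qcow2", lv, "3G")

# Fill the image until the LV is completely full, mimicking the guest
# writing with dd until the VM pauses with ENOSPC (failed writes are
# ignored on purpose).
for offset_mb in range(0, 2048, 256):
    subprocess.call(["qemu-io", "-f", "qcow2",
                     "-c", "write %dM 256M" % offset_mb, lv])

# Now try to grow the virtual size without extending the LV first; this
# is expected to fail like the bug: "Failed to grow the L1 table: ...".
subprocess.call(["qemu-img", "resize", "-f", "qcow2", lv, "13G"])
===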

(Originally by Nir Soffer)

Comment 16 RHV bug bot 2018-10-15 11:33:19 UTC
I am not sure how that will help; maybe I am missing the meaning of resize.
If the LV resides on a storage domain that has free space (e.g. a 10 GB storage domain and an 8 GB LV), the resize should succeed, and if the storage doesn't have the required space the resize will fail.

So in our case, if I understand correctly, after we 'cleaned' the storage domain and reclaimed the space we should be able to resize the disk, and there is no reason for 'qemu-img resize' to fail.

(Originally by Eyal Shenitzky)

Comment 17 RHV bug bot 2018-10-15 11:33:25 UTC
Please take Nir's advice and create a reproducer script that can clarify the exact steps you're making on the QEMU level.

Other than that, I can only repeat that 'qemu-img resize' resizes the virtual disk size as seen by the guest, not anything related to host storage. So if you got an ENOSPC from the host, 'qemu-img resize' is not a solution for the ENOSPC condition. Instead, you need to make sure that QEMU can write more data to the host storage.

In the typical case, this means that the LV needs to be extended to allow the image file to grow. It's also possible that you get ENOSPC from an (overcommitted) thin-provisioned LV, in which case you need to make space for that thin-provisioned LV to allocate new blocks. What is common to both cases is that the storage is freed at the LVM layer in the host; QEMU is not involved and can only resume the VM after you have made more space available and the failed request can be retried.
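
In terms of the reproducer sketch in comment 15, the recovery would therefore be to grow the LV backing the image first and only then retry the virtual-size change; a short sketch (the device path and sizes follow that reproducer and are assumptions, not RHV code):

===
import subprocess

# Give the image room to grow at the LVM layer, then retry qemu-img resize.
subprocess.check_call(["lvextend", "-L", "+2G", "/dev/testvg/testlv"])
subprocess.check_call(["qemu-img", "resize", "-f", "qcow2",
                       "/dev/testvg/testlv", "13G"])
===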

(Originally by Kevin Wolf)

Comment 18 RHV bug bot 2018-10-18 11:39:31 UTC
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Project 'ovirt-engine'/Component 'vdsm' mismatch]

For more info please contact: rhv-devops

Comment 19 Eyal Edri 2018-10-22 19:42:08 UTC
Another bug with mismatched info: 'Fixed in version' points at VDSM,
but the attached patches are from ovirt-engine.
I'm moving this to ON_QA and changing the component so as not to block QE, but the fixed_in_version might need to be corrected.

Comment 20 Shir Fishbain 2018-10-23 13:37:42 UTC
The bug was reproduced by the following steps:

1) Create a VM with a thin-provisioned block-based disk and run the VM.
2) Create a new preallocated disk (not related to the VM) that fills the block-based storage domain.
4) Start a dd operation to the empty thin-provisioned block-based disk.
For steps 1-4 I ran the automated test case TestNoSpaceLeftOnStorageDomain.
5) When the VM entered pause mode, shut down the VM.
6) Remove the large preallocated disk.
7) Extend the thin-provisioned block-based disk.
* A video of the UI is attached.

Engine log:
2018-10-22 19:53:55,580+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-60719) [] EVENT_ID: USER_EXTEND_DISK_SIZE_SUCCESS(371), Size of the disk 'disk_TestNoSpaceLeftOnStorag_2218253120' was successfully updated to 0 GB by admin@internal-authz.
2018-10-22 19:54:45,542+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-60738) [] EVENT_ID: USER_EXTEND_DISK_SIZE_UPDATE_VM_FAILURE(372), Failed to update VM 'vm_TestNoSpaceLeftOnStorag_2218243336' with the new volume size. VM restart is recommended.
2018-10-22 19:54:45,616+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-60738) [] EVENT_ID: USER_EXTEND_DISK_SIZE_SUCCESS(371), Size of the disk 'disk_TestNoSpaceLeftOnStorag_2218253120' was successfully updated to 2 GB by admin@internal-authz.
2018-10-22 19:55:25,539+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-60757) [] EVENT_ID: USER_EXTEND_DISK_SIZE_UPDATE_VM_FAILURE(372), Failed to update VM 'vm_TestNoSpaceLeftOnStorag_2218243336' with the new volume size. VM restart is recommended.
2018-10-22 19:55:25,609+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-60757) [] EVENT_ID: USER_EXTEND_DISK_SIZE_SUCCESS(371), Size of the disk 'disk_TestNoSpaceLeftOnStorag_2218253120' was successfully updated to 0 GB by admin@internal-authz.

ERRORS from engine log:
2018-10-22 19:53:55,482+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendVmDiskSizeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-60719) [] Failed in 'ExtendVmDiskSizeVDS' method
2018-10-22 19:53:55,492+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-60719) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command ExtendVmDiskSizeVDS failed: Command ['/usr/bin/qemu-img', 'resize', '-f', 'qcow2', u'/rhev/data-center/mnt/blockSD/91f13c4f-1f9a-4a86-bad0-951d2f4ac415/images/89673ac8-85a7-439a-9a01-6d7bc01270cc/ca8b1f4b-eda8-49db-b945-c16acd47a92e', '3221225472'] failed with rc=1 out='' err='qemu-img: Failed to grow the L1 table: No space left on device\n'
2018-10-22 19:53:55,492+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ExtendVmDiskSizeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-60719) [] Command 'ExtendVmDiskSizeVDSCommand(HostName = host_mixed_1, ExtendVmDiskSizeVDSCommandParameters:{hostId='bdfb22d5-43fc-4170-b674-b63deb8a60d6', vmId='00000000-0000-0000-0000-000000000000'})' execution failed: VDSGenericException: VDSErrorException: Failed to ExtendVmDiskSizeVDS, error = Command ['/usr/bin/qemu-img', 'resize', '-f', 'qcow2', u'/rhev/data-center/mnt/blockSD/91f13c4f-1f9a-4a86-bad0-951d2f4ac415/images/89673ac8-85a7-439a-9a01-6d7bc01270cc/ca8b1f4b-eda8-49db-b945-c16acd47a92e', '3221225472'] failed with rc=1 out='' err='qemu-img: Failed to grow the L1 table: No space left on device\n', code = 100
2018-10-22 19:53:55,648+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-60719) [] EVENT_ID: USER_FAILED_UPDATE_VM_DISK(89), Failed to update VM vm_TestNoSpaceLeftOnStorag_2218243336 disk disk_TestNoSpaceLeftOnStorag_2218253120 (User: admin@internal-authz).

vdsm version : vdsm-4.20.43-1.el7ev.x86_64
ovirt-engine : ovirt-engine-4.2.7.3-0.1.el7ev.noarch

Comment 21 Shir Fishbain 2018-10-23 13:42:02 UTC
Created attachment 1496699 [details]
Attachments

Comment 22 Eyal Shenitzky 2018-10-24 05:05:31 UTC
This is the expected result.
The disk size is set to 0 because the engine/VDSM cannot determine the current disk size after the QEMU failure - this is by design.

This fix does not prevent the extend operation from failing, but it handles the failure and reflects it to the user.

2018-10-22 19:53:55,648+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-60719) [] EVENT_ID: USER_FAILED_UPDATE_VM_DISK(89), Failed to update VM vm_TestNoSpaceLeftOnStorag_2218243336 disk disk_TestNoSpaceLeftOnStorag_2218253120 (User: admin@internal-authz).

Comment 23 Shir Fishbain 2018-10-24 15:36:49 UTC
Is there an open bug on this issue?
(In reply to Eyal Shenitzky from comment #22)
> This is the expected result.
> The disk size set to 0 because the engine / VDSM cannot determine what is
> the current disk size due to the QEMU failure - this is done by design.
> 
> This fix does not prevent from the extend operation to failed but it handles
> the failure and reflect it to the user.
> 
> 2018-10-22 19:53:55,648+03 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (EE-ManagedThreadFactory-engine-Thread-60719) [] EVENT_ID:
> USER_FAILED_UPDATE_VM_DISK(89), Failed to update VM
> vm_TestNoSpaceLeftOnStorag_2218243336 disk
> disk_TestNoSpaceLeftOnStorag_2218253120 (User: admin@internal-authz).

After speaking with Eyal over the phone, I understand that the fix for this bug is verified by checking whether the "USER_FAILED_UPDATE_VM_DISK" error appears in the engine log.
The error does appear, and therefore it's possible to return the bug to status "ON_QA".

Comment 24 Raz Tamir 2018-10-24 17:23:19 UTC
(In reply to Eyal Shenitzky from comment #22)
> This is the expected result.
> The disk size set to 0 because the engine / VDSM cannot determine what is
> the current disk size due to the QEMU failure - this is done by design.
> 
> This fix does not prevent from the extend operation to failed but it handles
> the failure and reflect it to the user.
> 
> 2018-10-22 19:53:55,648+03 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (EE-ManagedThreadFactory-engine-Thread-60719) [] EVENT_ID:
> USER_FAILED_UPDATE_VM_DISK(89), Failed to update VM
> vm_TestNoSpaceLeftOnStorag_2218243336 disk
> disk_TestNoSpaceLeftOnStorag_2218253120 (User: admin@internal-authz).

Just to ensure I understand the design correctly,

1) I have a VM with 2 disks (1st OS, 2nd thin prov.)
2) I create a new floating preallocated big disk (almost the size of the SD)
Let's say I have 3 GB I can use before the threshold
3) dd from the OS disk to the 2nd thin disk --> 3 GB is written to the disk and the VM is paused --> power off the VM
4) Remove the big preallocated disk
5) Trying to extend the disk fails - why? (#1)
6) The thin disk size is reset to 0 - why? (#2)

#1 If I already removed the big disk from the SD, why is the freed space not reclaimed by the VG? That is the point of LVM - what am I missing here?

#2 If the extend failed, and I already filled the disk with 3 GB of data, why is the disk size reset back to 0? VDSM should have this information.

Comment 25 Raz Tamir 2018-10-24 17:43:37 UTC
(In reply to RHV Bugzilla Automation and Verification Bot from comment #17)
> Please take Nir's advice and create a reproducer script that can clarify the
> exact steps you're making on the QEMU level.
> 
> Other than that, I can only repeat that 'qemu-img resize' resizes the
> virtual disk size as seen by the guest, not anything related to host
> storage. So if you got an ENOSPC from the host, 'qemu-img resize' is not a
> solution for the ENOSPC condition. Instead, you need to make sure that QEMU
> can write more data to the host storage.
> 
> In the typical case, this means that the LV needs to be extended to allow
> the image file to grow. It's also thinkable that you get ENOSPC from an
> (overcommited) thin provisioned LV, then you need to make space for that
> thin provisioned LV to allocate new blocks. What is common to both cases is
> that the storage is freed at the LVM layer in the host, QEMU is not involved
> and can only resume the VM after you made more space available and the
> failed request can be retried.
> 
> (Originally by Kevin Wolf)

And to try to clear up the misunderstanding here and ensure we are all synced:
the problematic situation starts AFTER we are at step #4 in comment #24.

4) Remove the big preallocated disk

After this step, the VG should reclaim the removed disk space.

Step (5), extending the disk, should call **lvextend** and not **qemu-img resize**, as the operation is to extend the disk from RHVM and not part of continuous I/O using dd; basically, we are requesting to extend the virtual size.
Now that the host is involved, everything should work fine.

I'm moving this bug back to ASSIGNED because clearly there is still a problem here.

Comment 26 Eyal Shenitzky 2018-10-25 04:26:52 UTC
> #1 If I already removed the big disk from the SD, why the cleared size is
> not reclaimed by the VG? this is the point of LVM - what am I missing here?
> 
> #2 If the extend failed, and I already filled the disk with 3GB of data, why
> the disk size is resetting back to 0? the vdsm should have this information.

This fix captures the error correctly and reflects it in the Administration Portal.
Before that, the disk size was set to 0 and the operation was reported as successful.

We do have another issue here - why did the extend operation fail?
This is under investigation (a separate bug has not been opened yet).

The disk size is set to 0 because this is VDSM's mechanism for announcing that the disk size is unknown after a failure.
VDSM should detect the unknown size and correct it by itself.

Comment 27 Shir Fishbain 2018-10-25 14:19:05 UTC
I opened a new bug on the extend operation failure:

Bug 1643112 - The disk size return to 0 GB after resizing in the second time and the extend operation failed
https://bugzilla.redhat.com/show_bug.cgi?id=1643112

Comment 28 Sandro Bonazzola 2018-10-26 08:01:33 UTC
This bug is not marked as blocker for 4.2.7. Can you please check if we need to block the release on this or if it can be postponed to 4.2.8?

Comment 30 Eyal Shenitzky 2018-10-28 09:27:13 UTC
Shir,

Can you please update the bug status according to your validation results?
Is this bug verified or did it fail QA?

Comment 31 Shir Fishbain 2018-10-28 09:45:43 UTC
This bug was verified.
Following this bug, a new bug was opened:
https://bugzilla.redhat.com/show_bug.cgi?id=1643112
Bug 1643112 - The disk size return to 0 GB after resizing in the second time and the extend operation failed

Comment 33 errata-xmlrpc 2018-11-05 15:03:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3480

Comment 34 Daniel Gur 2019-08-28 13:11:49 UTC
sync2jira

Comment 35 Daniel Gur 2019-08-28 13:16:01 UTC
sync2jira