Bug 1958032
Summary: Live Storage Migration fails because replication filled the destination volume before extension.

Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 4.4.6
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Target Milestone: ovirt-4.5.1
Keywords: ZStream
Fixed In Version: ovirt-engine-4.5.1.2
oVirt Team: Storage
Type: Bug
Last Closed: 2022-07-14 12:54:30 UTC

Reporter: Germano Veit Michel <gveitmic>
Assignee: Pavel Bar <pbar>
QA Contact: sshmulev
CC: aefrat, ahadas, apinnick, bzlotnik, dfodor, eshames, lsurette, mavital, nsoffer, pbar, sfishbai, srevivo, ycui

Doc Type: Bug Fix
Doc Text: Previously, live storage migration could fail if the destination volume filled up before it was extended. In the current release, the initial size of the destination volume is larger and the extension is no longer required.
Description
Germano Veit Michel
2021-05-07 04:53:51 UTC
Nir, this should be fixed in 4.4.5 right?

(In reply to Eyal Shenitzky from comment #2)
> Nir, this should be fixed in 4.4.5 right?

The "reproducer" above with the sleep(30) was done on 4.4.6; the extension request and the replication run in parallel. I think you might be talking about BZ1796415, which is a similar problem in the merge command; here it is the replicate command.

(In reply to Eyal Shenitzky from comment #2)
> Nir, this should be fixed in 4.4.5 right?

No, this is not fixed yet. I agree with Germano that this is similar to the issue we had in live merge. We are doing the initial extend after the operation has started, and the operation can fail with ENOSPC before the volume is extended.

But in this flow we have a much simpler solution - engine should create the target volume with initial_size=3g so we don't need the initial extend. Vdsm should be changed to extend only if the volume was not extended yet (e.g. when working with an older engine).

(In reply to Avihai from comment #7)
> After discussion with Arik this bug might be resolved by
> https://bugzilla.redhat.com/show_bug.cgi?id=2051997.
> Can you confirm?

The changes we did for bug 2051997 are to use a bigger chunk size and extend earlier. This will help to extend the volumes during the mirror, but unfortunately it will not help with the first extend.

According to Germano's report, the mirror failed in 2.5 seconds:

2021-05-07 14:29:38,501+1000 Disk replication started
2021-05-07 14:29:41,101+1000 Disk replication failed

Extend takes 2-6 seconds in normal conditions, or more if the host is overloaded with I/O.
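The race described above can be illustrated with back-of-the-envelope arithmetic. This is a hypothetical sketch: the free-space and write-rate figures below are illustrative assumptions; only the 2-6 second extend latency comes from the observations above.

```python
# Rough sketch: how long a guest can write before a small destination
# volume hits ENOSPC, compared with the time an extend request takes.
# The free-space value and write rates are illustrative assumptions.

GiB = 1024 ** 3
MiB = 1024 ** 2

def seconds_until_enospc(free_bytes, write_rate_bytes_per_s):
    """Time until the volume fills at a sustained write rate."""
    return free_bytes / write_rate_bytes_per_s

# Assume ~0.5 GiB of free space left in the freshly created volume.
free = 0.5 * GiB

for rate_mib in (100, 200, 400):
    t = seconds_until_enospc(free, rate_mib * MiB)
    print(f"{rate_mib} MiB/s -> ENOSPC after {t:.2f}s")

# At 400 MiB/s the volume fills in ~1.3s - well inside the 2-6 second
# window an extend normally needs, so the mirror fails before the
# extend can complete.
```

This matches the report: the mirror failed 2.5 seconds after starting, which is at the low end of normal extend latency.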
The issue seems to be that when the mirror fails with ENOSPC, the job fails instead of pausing the VM:

"event": "BLOCK_JOB_COMPLETED", "data": {"device": "copy-sda-libvirt-3-format", "len": 842399744, "offset": 267714560, "speed": 0, "type": "mirror", "error": "No space left on device"}

Maybe it is possible to change libvirt/qemu so the VM is paused and the mirror job continues when the guest is unpaused, or maybe a new API can resume the mirror job. But getting such fixes into RHEL 8 is unlikely.

I think the best way to avoid this issue is to do the first extend on the engine side - engine already measures the top volume, and creates the disk with an initial size:

    requiredSize += SizeConverter.BYTES_IN_GB;
    return Math.min(requiredSize, sourceImage.getSize());

When the source volume is a new snapshot, the required size is ~0 and the actual initial size is 1g - like a new snapshot created by vdsm. This is of course wrong for current vdsm, and should be 2.5g.

When vdsm is doing live storage migration, it doubles the chunk size, so the first extend will extend the volume to:

    2.5g + 2 * 2.5g -> 7.5g

To avoid the need to do the first extend in vdsm, engine will allocate this size for the new volume. To avoid a double extend in vdsm, engine will send a new argument to vdsm:

    extend: false

Old vdsm will ignore this flag and try to extend the volume before the mirror. New vdsm will skip the initial extension since the volumes are already extended. The new behavior will be used only for cluster version >= 4.7.

(In reply to Shir Fishbain from comment #6)
> Nir, please provide a clear verification scenario

Germano "reproduced" the issue by adding a 30 second sleep in vdsm. This is not a valid way to reproduce the real issue.

To reproduce, you need:
- fast storage (e.g. flash based iSCSI or FC storage)
- if using iSCSI, a fast network to storage (e.g. 10g)
- a VM that writes a lot of data to storage during the merge; we can use this script from bug 2051997:
  https://bugzilla-attachments.redhat.com/attachment.cgi?id=1862576

Steps to reproduce:
1. Create a new 50g thin disk
2. Start the writing script in the VM, writing to the new disk
3. Start live storage migration to another storage domain on the same server

The live merge is likely to fail when starting to mirror the disk to the new storage (see comment 0 for example failure logs). You can repeat the live storage migration several times.

With the current system, creating the 1g destination volume, writing more than 100 MiB/s during live storage migration is likely to fail. I think you need to repeat the test multiple times, increasing the write rate, until you find the limits of the current system. When this is fixed, the limits should be similar to the limits of writing to a thin disk without pausing.

> Is it possible to reproduce this bug?

It should be reproducible based on the code.

Vdsm PR posted; this is the easy part, but it must be done first so vdsm in oVirt 4.5 always supports the option to skip the initial extend. The engine changes may be harder, and if we cannot get them into 4.5.0, they can be added in 4.5.1.

On the engine side, we need:

1. Add a new configuration for volume chunk size, duplicating vdsm's irs:volume_utilization_chunk_mb. Duplicating the configuration is bad, but until the configuration is moved to engine, we have no other way to use the right size for creating new volumes.

2. When creating the snapshot during live storage migration, create the snapshot with an initial size of 3 * irs:volume_utilization_chunk_mb. Currently the snapshot is created with the default size of 1 * irs:volume_utilization_chunk_mb.

3. When creating the destination volume as part of cloning the image structure, create the active volume at the same initial size as the active volume in the source (3 * chunk size).

4. Pass needExtend=false when calling VM.diskReplicateStart.
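The sizing rule in steps 2 and 3 can be sketched as follows. This is an illustrative sketch only, not engine code: the function name is hypothetical, and the 2560 MiB default chunk is an assumption based on the 2.5g chunk size discussed above.

```python
# Sketch of the engine-side initial-size rule described above:
# allocate 3 * volume_utilization_chunk_mb up front, capped at the
# disk's virtual size. Names and defaults are assumptions, not the
# actual engine implementation.

MiB = 1024 ** 2

def initial_cow_size_bytes(virtual_size_bytes, chunk_mb=2560):
    """Initial allocation for the destination COW volume.

    3 chunks = the 1-chunk snapshot size plus the doubled chunk that
    vdsm would otherwise add with a first extend (2.5g + 2 * 2.5g).
    """
    initial = 3 * chunk_mb * MiB
    # Never allocate more than the disk's virtual size.
    return min(initial, virtual_size_bytes)

# A 30 GiB thin disk gets a 7.5 GiB initial allocation, matching the
# imageInitialSizeInBytes='8053063680' seen in the verification logs.
print(initial_cow_size_bytes(30 * 1024 ** 3))
```

With this allocation already in place, vdsm has no first extend to perform, which is why needExtend=false can be passed in step 4.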
(In reply to Nir Soffer from comment #8)
> Old vdsm will ignore this flag and try to extend the volume before the
> mirror.
> New vdsm will skip the initial extension since the volumes are already
> extended.
>
> The new behavior will be used only for cluster version >= 4.7.

Makes sense, but if that new flow only happens on cluster version >= 4.7, then we don't care about old VDSM.

(In reply to Nir Soffer from comment #10)
> On engine side, we need:
>
> 1. Add new configuration for volume chunk size, duplicating vdsm
> irs:volume_utilization_chunk_mb. Duplicating the configuration
> is bad but until the configuration is moved to engine, we have
> no other way to use the right size for creating new volumes.
>
> 2. When creating the snapshot during live storage migration, create
> the snapshot with initial size of 3 * irs:volume_utilization_chunk_mb.
> Currently the snapshot is created with default size of
> 1 * irs:volume_utilization_chunk_mb.
>
> 3. When creating the destination volume as part of cloning the image
> structure, create the active volume at the same initial size as the
> active volume in the source (3 * chunk size).
>
> 4. Pass needExtend=false when calling VM.diskReplicateStart.

IIRC the initial size of snapshots is only set on block storage, but that's fine because this issue can only happen on block storage, right?

(In reply to Arik from comment #11)
> IIRC the initial size of snapshots is only set on block storage, but
> that's fine because this issue can only happen on block storage, right?

Oh wait, can we migrate a disk from file-storage to block-storage?

(In reply to Arik from comment #12)
> Oh wait, can we migrate a disk from file-storage to block-storage?

Sure, but in this case the initial size of the snapshot does not matter, since the initial size is used only on block storage. The same applies to migrating from block to file - in this case the source snapshot needs to use the bigger initial size, but the destination snapshot size does not matter. On file storage the snapshot can be created without an initial size, and even if engine sends the argument, it is ignored by vdsm.

*** Bug 1993839 has been marked as a duplicate of this bug. ***

Verified.

Versions:
vdsm-4.50.1.3-1.el8ev
ovirt-engine-4.5.1.2-0.11.el8ev

Could reproduce it on a non-fixed version (engine-4.4.10.7-0.4.el8ev):

# VDSM:
2022-06-21 09:22:41,577+0000 ERROR (jsonrpc/1) [virt.vm] (vmId='9bf54187-017a-4dc5-91cd-43c514b20e1c') Replication job unfinished (drive: 'sda', srcDisk: {'imageID': '121514a0-6e33-4a8a-9285-76c0bd78434f', 'poolID': '98214c50-85b4-4c14-bd90-83c668aa553a', 'volumeID': 'a1146f5d-269e-4ada-96f1-7f3b4580f42c', 'device': 'disk', 'domainID': '7e30eb41-8018-4ea6-b02c-37326beb89b1'}, job: {'type': 2, 'bandwidth': 0, 'cur': 1794113536, 'end': 2778726400}) (vm:4499)

# Engine:
2022-06-21 09:22:41,580Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-93) [2f7fecf9-fd37-44d9-829b-1525dd54e22a] Failed in 'VmReplicateDiskFinishVDS' method
2022-06-21 09:22:41,594Z ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-93) [2f7fecf9-fd37-44d9-829b-1525dd54e22a] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command VmReplicateDiskFinishVDS failed: Resource unavailable
2022-06-21 09:22:41,601Z ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-93) [2f7fecf9-fd37-44d9-829b-1525dd54e22a] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{hostId='1934590a-e3da-4c30-9fcf-e21223baa468', vmId='9bf54187-017a-4dc5-91cd-43c514b20e1c', storagePoolId='98214c50-85b4-4c14-bd90-83c668aa553a', srcStorageDomainId='7e30eb41-8018-4ea6-b02c-37326beb89b1', targetStorageDomainId='3f2141c0-b836-4922-8d35-ca2152184708', imageGroupId='121514a0-6e33-4a8a-9285-76c0bd78434f', imageId='a1146f5d-269e-4ada-96f1-7f3b4580f42c'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Resource unavailable, code = 40

# 'needExtend' does not exist in this version, in either vdsm or engine.

On the fixed version there were no errors in the VDSM or engine logs.

Engine:

# imageInitialSizeInBytes='8053063680':
2022-06-21 12:45:04,239+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (default task-49) [1326c67f-9b5e-4665-9574-5a5be8860229] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='02f48b00-bb6e-42af-885e-19f0675db775', ignoreFailoverLimit='false', storageDomainId='8c939c62-7d98-4e65-bbd8-d52f7ea469c1', imageGroupId='c1a1f9ca-1c15-48a3-8382-76be9e638d65', imageSizeInBytes='32212254720', volumeFormat='COW', newImageId='d1bfbb37-f05f-455d-baf2-25586313c420', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='8053063680', imageId='23aa16b7-30c2-4796-b42c-5f1de8250b9f', sourceImageGroupId='c1a1f9ca-1c15-48a3-8382-76be9e638d65', shouldAddBitmaps='false', legal='true', sequenceNumber='2', bitmap='null'}), log id: 727da77c

# needExtend='false':
2022-06-21 12:45:40,982+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-72) [1326c67f-9b5e-4665-9574-5a5be8860229] START, VmReplicateDiskStartVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{hostId='06b8c39e-ad71-47c5-baae-0898cdfe2ed1', vmId='c38cd2be-6d12-46e0-9688-d7e40d6fb8e6', storagePoolId='02f48b00-bb6e-42af-885e-19f0675db775', srcStorageDomainId='8c939c62-7d98-4e65-bbd8-d52f7ea469c1', targetStorageDomainId='947e4aa9-85a6-4b28-8f10-dfbcb08e665c', imageGroupId='c1a1f9ca-1c15-48a3-8382-76be9e638d65', imageId='d1bfbb37-f05f-455d-baf2-25586313c420', diskType='null', needExtend='false'}), log id: 779863e7

VDSM:

# initialSize='8053063680':
2022-06-21 12:45:04,244+0300 INFO (jsonrpc/6) [vdsm.api] START createVolume(sdUUID='8c939c62-7d98-4e65-bbd8-d52f7ea469c1', spUUID='02f48b00-bb6e-42af-885e-19f0675db775', imgUUID='c1a1f9ca-1c15-48a3-8382-76be9e638d65', size='32212254720', volFormat=4, preallocate=2, diskType='DATA', volUUID='d1bfbb37-f05f-455d-baf2-25586313c420', desc='', srcImgUUID='c1a1f9ca-1c15-48a3-8382-76be9e638d65', srcVolUUID='23aa16b7-30c2-4796-b42c-5f1de8250b9f', initialSize='8053063680', addBitmaps=False, legal=True, sequence=2, bitmap=None) from=::ffff:10.46.12.196,36674, flow_id=1326c67f-9b5e-4665-9574-5a5be8860229, task_id=b3ea4d03-64a2-4ef4-9b78-8eee6307f7cf (api:48)

# needExtend=False:
2022-06-21 12:45:40,987+0300 INFO (jsonrpc/7) [api.virt] START diskReplicateStart(srcDisk={'imageID': 'c1a1f9ca-1c15-48a3-8382-76be9e638d65', 'poolID': '02f48b00-bb6e-42af-885e-19f0675db775', 'volumeID': 'd1bfbb37-f05f-455d-baf2-25586313c420', 'device': 'disk', 'domainID': '8c939c62-7d98-4e65-bbd8-d52f7ea469c1'}, dstDisk={'imageID': 'c1a1f9ca-1c15-48a3-8382-76be9e638d65', 'poolID': '02f48b00-bb6e-42af-885e-19f0675db775', 'volumeID': 'd1bfbb37-f05f-455d-baf2-25586313c420', 'device': 'disk', 'domainID': '947e4aa9-85a6-4b28-8f10-dfbcb08e665c'}, needExtend=False) from=::ffff:10.46.12.196,39640, flow_id=1326c67f-9b5e-4665-9574-5a5be8860229, vmId=c38cd2be-6d12-46e0-9688-d7e40d6fb8e6 (api:48)

Checked flows for:
- ISCSI -> ISCSI
- NFS -> ISCSI

Thin provisioned disk (30G) and also a small one (1G).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.1] security, bug fix and update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5555

Due to QE capacity, we are not going to cover this issue in our automation.