Bug 1151374
| Summary: | A VM import failure is not properly rolled back. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Roman Hodain <rhodain> |
| Component: | ovirt-engine | Assignee: | Moti Asayag <masayag> |
| Status: | CLOSED WONTFIX | QA Contact: | Pavel Stehlik <pstehlik> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.4.1-1 | CC: | amureini, laravot, lpeer, lsurette, masayag, mgoldboi, mperina, pbrilla, ppostler, rbalakri, Rhev-m-bugs, rhodain, scohen, srevivo, tnisan, ykaul |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-31 12:36:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1082498 | | |
| Bug Blocks: | | | |
Roman, can you please attach the complete log from which you copied to the description? Thanks.

Thanks Roman. As VDSM currently generates the task IDs (not the engine), if a task ID is not returned to the engine within 180 seconds, the engine cannot track the task (this is relevant to all the flows). Once the engine passes the task ID to VDSM instead, it will be able to track the task and act accordingly.

Oved, afaik we already have a bug for that on the infra team. This one should be marked as a duplicate of it or as depending on it.

thanks,
Liron.

(In reply to Liron Aravot from comment #6)
> Oved, afaik we already have a bug for that on the infra team. This one
> should be marked as a duplicate of it or as depending on it.

Oved, any insight?

(In reply to Allon Mureinik from comment #7)
> (In reply to Liron Aravot from comment #6)
> > Oved, afaik we already have a bug for that on the infra team. This one
> > should be marked as a duplicate of it or as depending on it.
> Oved, any insight?

Yair, are you familiar with this one?

(In reply to Allon Mureinik from comment #7)
> (In reply to Liron Aravot from comment #6)
> > Oved, afaik we already have a bug for that on the infra team. This one
> > should be marked as a duplicate of it or as depending on it.
> Oved, any insight?

btw, even if we do find the bug number, I don't expect it to be addressed in 3.5.

(In reply to Oved Ourfali from comment #9)
> (In reply to Allon Mureinik from comment #7)
> > (In reply to Liron Aravot from comment #6)
> > > Oved, afaik we already have a bug for that on the infra team. This one
> > > should be marked as a duplicate of it or as depending on it.
> > Oved, any insight?

Even if there isn't such a BZ, it seems like there should be one (i.e., this one). Moving to infra to either handle it according to their priorities or provide instructions on how the current storage flow should change.

This will require infrastructural changes that won't be in scope for 3.5. Targeting it to 3.6.

Hello, is there any update?

Per a discussion with Liron, in order to solve this bug the responsibility of allocating a task ID should be moved from VDSM to the engine. By doing so, the task can be identified by the engine whenever it is reported by VDSM, eliminating the case of orphaned/zombie tasks. For that purpose, the VDSM API should be extended to also include the task ID.
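A minimal sketch of that proposed contract, using hypothetical names rather than the real oVirt/VDSM interfaces: the engine allocates the task ID up front and passes it along with the verb call, so it can always name the task later, even if the synchronous reply is lost.

```python
# Minimal sketch of the proposed contract (hypothetical names, not the real
# oVirt/VDSM interfaces): the engine allocates the task ID up front and
# passes it to VDSM, so it can recognize the task later even if the
# synchronous reply never arrives.
import uuid


class VdsmStub:
    """Stands in for VDSM: registers tasks under a caller-supplied ID."""

    def __init__(self):
        self.tasks = {}  # task_id -> state

    def copy_image(self, task_id, src_vol, dst_vol):
        # The task is keyed by the engine-chosen ID rather than one
        # generated here, so there is no window in which the engine
        # cannot name the task it started.
        self.tasks[task_id] = "running"


class EngineStub:
    """Stands in for the engine: allocates the ID before calling VDSM."""

    def __init__(self, vdsm):
        self.vdsm = vdsm

    def start_copy(self, src_vol, dst_vol):
        task_id = str(uuid.uuid4())  # allocated on the engine side
        # Even if this call times out, the engine already knows task_id
        # and can later find the task in VDSM's task list and clean it up.
        self.vdsm.copy_image(task_id, src_vol, dst_vol)
        return task_id


vdsm = VdsmStub()
engine = EngineStub(vdsm)
tid = engine.start_copy("src-volume-uuid", "dst-volume-uuid")
assert tid in vdsm.tasks  # the engine can always correlate the task
```

With this contract, the 180-second window stops mattering for tracking: whether or not the reply arrives in time, the engine can look the task up by the ID it chose.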
Description of problem:

The VM import operation fails due to storage delays which cause the "CopyImageVDS" command to time out. This leads to a rollback on the engine side, but the operation succeeds on the VDSM side. The result is orphaned disk images on the storage, which prevent repeating the import.

Version-Release number of selected component (if applicable):
vdsm-4.14.11-5.el6ev.x86_64
rhevm-3.4.1-0.30.el6ev.noarch

How reproducible:
Sometimes, when the storage does not respond promptly. This happens when the check on the images takes longer than 180 seconds.

Actual results:
The import fails and the disks remain on the storage.

Expected results:
The disks are removed from the storage, or the operation finishes successfully.

Additional info:

From the engine log we can see that copyImage is called and fails after 180 seconds due to the timeout. We can also see the storage domain reported as being in problem:

```
2014-10-05 12:02:52,085 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CopyImageVDSCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] -- executeIrsBrokerCommand: calling 'copyImage' with two new parameters: description and UUID
2014-10-05 12:02:52,085 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CopyImageVDSCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] -- copyImage parameters: sdUUID=57692261-0d26-4e9d-a8fa-749f2a55789b spUUID=00000002-0002-0002-0002-000000000308 vmGUID=578bd1cf-7bde-45d7-be3c-528545b9c94b srcImageGUID=a962c650-701a-4c3b-ac69-bcf2f5483785 srcVolUUID=654e315c-4540-4105-92cb-5b02983c3105 dstImageGUID=a962c650-701a-4c3b-ac69-bcf2f5483785 dstVolUUID=654e315c-4540-4105-92cb-5b02983c3105 descr= dstSdUUID=f6e24add-a529-4515-b011-6fa2293f16cb
...
2014-10-05 12:03:11,282 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-12) domain 57692261-0d26-4e9d-a8fa-749f2a55789b:SD01 in problem. vds: rhev-h01
2014-10-05 12:03:31,207 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-36) Domain 57692261-0d26-4e9d-a8fa-749f2a55789b:SD01 recovered from problem. vds: rhev-h01
2014-10-05 12:03:31,207 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-36) Domain 57692261-0d26-4e9d-a8fa-749f2a55789b:SD01 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
...
2014-10-05 12:05:52,087 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] IrsBroker::Failed::CopyImageVDS due to: TimeoutException:
...
2014-10-05 12:05:52,349 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] Removed task 4eb52586-2808-4479-954e-b5991a2d5088 from DataBase
2014-10-05 12:05:52,353 ERROR [org.ovirt.engine.core.bll.CopyImageGroupCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] Transaction rolled-back for command: org.ovirt.engine.core.bll.CopyImageGroupCommand.
2014-10-05 12:05:52,353 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] transaction rolled back
```
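The essence of the failure can be reproduced in miniature. The sketch below is illustrative only (the real actors are the engine's VDS broker and VDSM's task manager): the client gives up after its timeout while the server-side operation is still in its slow pre-task phase, so at rollback time there is no task to clean up, yet the work completes afterwards.

```python
# Toy reproduction of the race (illustrative only): the client times out and
# rolls back while the server-side operation is still in its slow pre-task
# phase, so the task registry is empty at rollback time and the finished
# copy is never cleaned up.
import threading
import time

tasks = {}  # server-side task registry (the SPM's task list)


def server_copy_image():
    time.sleep(2.0)                 # slow validation before the task exists
    tasks["copy-task"] = "running"  # the task is created only after that


worker = threading.Thread(target=server_copy_image)
worker.start()
worker.join(timeout=0.5)            # client-side timeout (180 s in the bug)

# The client now rolls back, but the registry is still empty, so a
# "clean up all known tasks" pass finds nothing to undo.
print("known tasks at rollback:", dict(tasks))   # {}

worker.join()                       # yet the server-side work completes,
print("tasks after the fact:", dict(tasks))      # {'copy-task': 'running'}
```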
This is not a networking issue: the task on the SPM was not created within 180 seconds, which is why we got the timeout. Here is the flow on the SPM. Keep in mind that the VDSM timestamps are two hours behind the engine timestamps.

```
Thread-34645::DEBUG::2014-10-05 10:02:51,490::BindingXMLRPC::251::vds::(wrapper) client [192.168.1.2] flowID [4c92a7de]
Thread-34645::DEBUG::2014-10-05 10:02:51,492::task::595::TaskManager.Task::(_updateState) Task=`244ead22-ec64-4e7c-85a4-9ef863ea0eff`::moving from state init -> state preparing
Thread-34645::INFO::2014-10-05 10:02:51,493::logUtils::44::dispatcher::(wrapper) Run and protect: copyImage(sdUUID='57692261-0d26-4e9d-a8fa-749f2a55789b', spUUID='00000002-0002-0002-0002-000000000308',...
Thread-34645::DEBUG::2014-10-05 10:03:08,420::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 654e315c-4540-4105-92cb-5b02983c3105
...
Thread-34645::INFO::2014-10-05 10:03:36,613::image::285::Storage.Image::(isLegal) image a962c650-701a-4c3b-ac69-bcf2f5483785 in domain 57692261-0d26-4e9d-a8fa-749f2a55789b has vollist ['654e315c-4540-4105-92cb-5b02983c3105']
Thread-34645::DEBUG::2014-10-05 10:03:36,616::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 654e315c-4540-4105-92cb-5b02983c3105
...
Thread-34645::DEBUG::2014-10-05 10:06:14,880::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 654e315c-4540-4105-92cb-5b02983c3105
...
Thread-34645::DEBUG::2014-10-05 10:07:04,552::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 654e315c-4540-4105-92cb-5b02983c3105
...
Thread-34645::DEBUG::2014-10-05 10:07:39,811::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 654e315c-4540-4105-92cb-5b02983c3105
...
Thread-34645::DEBUG::2014-10-05 10:09:22,622::resourceManager::198::ResourceManager.Request::(__init__) ResName=`Storage.57692261-0d26-4e9d-a8fa-749f2a55789b`ReqID=`32c881e9-807f-49de-b7da-0754df36b1b1`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '1792' at 'copyImage'
...
Thread-34645::DEBUG::2014-10-05 10:09:22,893::task::595::TaskManager.Task::(_updateState) Task=`244ead22-ec64-4e7c-85a4-9ef863ea0eff`::moving from state acquiring -> state queued
```

You can see that most of the time was spent in "validateVolumePath". We perform the check four times because the code is not optimal (a sketch of deduplicating the check follows below). As a result the task was not created in time, the result was not reported back to the engine, and the engine rolled back the operation.
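One way to avoid paying that cost repeatedly, sketched here as a hypothetical optimization (this is not VDSM code) under the assumption that the validation result is stable within a single flow, is to memoize the check per volume:

```python
# Sketch of deduplicating the expensive check (not actual VDSM code),
# assuming the validation result is stable within a single flow: memoize
# per (domain, volume) so the flow pays the storage round-trip once.
import functools


@functools.lru_cache(maxsize=None)
def validate_volume_path(sd_uuid, vol_uuid):
    # Stands in for the slow storage round-trip seen in the log above.
    print(f"validating {vol_uuid} on {sd_uuid}")
    return True


for _ in range(4):  # the flow above hit validateVolumePath four times
    validate_volume_path("57692261-0d26", "654e315c-4540")
# Only one "validating ..." line is printed; the other calls hit the cache.
```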
The timeout also triggered SpmStopVDSCommand and a reload of the list of tasks on the master SD. Unfortunately, this too happened before the task was created, so the task was not reported to the engine:

```
2014-10-05 12:05:52,272 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] START, SpmStopVDSCommand(HostName = mh-rhev-1000-1.ad.molecularhealth.com, HostId = 5b31617e-6ce4-4538-b92a-83a670ba157f, storagePoolId = 00000002-0002-0002-0002-000000000308), log id: 51617d80
2014-10-05 12:05:52,328 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] FINISH, SpmStopVDSCommand, log id: 51617d80
2014-10-05 12:05:52,328 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] IRS failover failed - cant allocate vds server
...
2014-10-05 12:05:52,407 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-14) [286fabd4] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 00000002-0002-0002-0002-000000000308 Type: StoragePool
...
2014-10-05 12:05:52,605 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-14) [286fabd4] Found spm host 5b31617e-6ce4-4538-b92a-83a670ba157f, host name: mh-rhev-1000-1.ad.molecularhealth.com, according to spmId: 2.
2014-10-05 12:05:52,654 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-14) [286fabd4] SpmStatus on vds 5b31617e-6ce4-4538-b92a-83a670ba157f: SPM
2014-10-05 12:05:52,654 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-14) [286fabd4] Using old spm server: mh-rhev-1000-1.ad.molecularhealth.com, no start needed
...
2014-10-05 12:05:52,749 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMGetAllTasksInfoVDSCommand] (org.ovirt.thread.pool-4-thread-22) [286fabd4] START, SPMGetAllTasksInfoVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000308, ignoreFailoverLimit = false), log id: 476b6853
...
2014-10-05 12:05:53,164 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (org.ovirt.thread.pool-4-thread-22) [286fabd4] Discovered 4 tasks on Storage Pool DC1, 0 added to manager.
```

Here we can see that the list of tasks was returned about four minutes before the task for copyImage was created.
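To make the gap concrete, here is a sketch (not the engine's actual AsyncTaskManager logic) of why a one-shot reconciliation misses the task: the recovery scan compares what VDSM reports at that instant against what the engine knows, and a task that only appears minutes later is never attributed to any command, i.e. it becomes a zombie.

```python
# Sketch of the reconciliation gap (not the engine's actual AsyncTaskManager
# logic): a one-shot scan at recovery time cannot see a task that is only
# created minutes later, so that task ends up unattributed (a zombie).
engine_known = {"t1", "t2", "t3", "t4"}  # the "Discovered 4 tasks" above

report_at_recovery = {"t1", "t2", "t3", "t4"}            # taken at 12:05:52
report_minutes_later = {"t1", "t2", "t3", "t4", "copy"}  # copy task exists now


def unattributed(reported, known):
    """Tasks VDSM tracks that the engine cannot map to any command."""
    return reported - known


print(unattributed(report_at_recovery, engine_known))    # set(): nothing found
print(unattributed(report_minutes_later, engine_known))  # {'copy'}: a zombie
```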