Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1078095

Summary: [engine-backend] snapshot creation task (createVolume) is not polled by engine, which causes live storage migration flow to keep running in case of a failure of vdsm to complete the createVolume task
Product: Red Hat Enterprise Virtualization Manager
Reporter: Elad <ebenahar>
Component: ovirt-engine
Assignee: Daniel Erez <derez>
Status: CLOSED CURRENTRELEASE
QA Contact: lkuchlan <lkuchlan>
Severity: high
Priority: urgent
Version: 3.4.0
CC: acanan, acathrow, amureini, derez, ebenahar, gklein, iheim, lpeer, Rhev-m-bugs, rnori, scohen, yeylon
Target Milestone: ---
Keywords: Reopened, Triaged
Target Release: 3.5.0
Flags: amureini: Triaged+
Hardware: x86_64
OS: Unspecified
Whiteboard: storage
Fixed In Version: ovirt-engine-3.5.0_vt9
Doc Type: Bug Fix
Last Closed: 2015-02-16 19:11:06 UTC
Type: Bug
oVirt Team: Storage
Attachments:
- engine and vdsm logs (flags: none)
- logs-23.10 (flags: none)

Description Elad 2014-03-19 07:24:17 UTC
Description of problem:
If the SPM loses its connectivity to the target storage domain while a createVolume task (for the auto-generated snapshot created for LSM) is running, the engine does not fail the live storage migration flow; the createVolume task keeps running and is never stopped, and the disk remains in 'LOCKED' state.

Version-Release number of selected component (if applicable):
RHEV3.4 - AV3
rhevm-3.4.0-0.5.master.el6ev.noarch
vdsm-4.14.2-0.4.el6ev.x86_64


How reproducible:
Always

Steps to Reproduce:
On a shared DC with 2 storage domains (NFS in my case) and 1 host 
1. Create a VM with disk attached and start it
2. While VM is running, move its disk to another storage domain 
3. Block connectivity from the SPM to the storage server where the target domain is located, while the createVolume task is running on vdsm

Actual results:

Create snapshot command is sent to vdsm:

2014-03-18 23:00:46,041 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (org.ovirt.thread.pool-4-thread-43) [7862a7f6] START, CreateSnapshotVDSCommand( storagePoolId = 1398e2c7-e653-49df-bb69-ba49eb14f931, ignoreFailoverLimit = false, storageDomainId = 1d9718f4-1e04-405f-90f0-75f33d140bf2, imageGroupId = 5f187fd8-9839-4d18-9ccd-b3ba9fc9747f, imageSizeInBytes = 3221225472, volumeFormat = COW, newImageId = d3da1752-9820-4159-9f4f-fd24b2720251, newImageDescription = , imageId = 1f7839b6-3489-4630-b824-4f4667cd36f3, sourceImageGroupId = 5f187fd8-9839-4d18-9ccd-b3ba9fc9747f), log id: 2018b2e2


vdsm starts the createVolume task while the connection to the target domain is blocked. On vdsm, the task remains running until manual intervention.

Thread-1514::INFO::2014-03-18 23:00:48,695::logUtils::47::dispatcher::(wrapper) Run and protect: getAllTasksInfo, Return response: {'allTasksInfo': {'9727dbaa-459b-4118-844f-62da2643710e': {'verb': 'createVolume', 'id': '9727dbaa-459b-4118-844f-62da2643710e'}}}

Thread-2873::INFO::2014-03-18 23:25:48,653::logUtils::47::dispatcher::(wrapper) Run and protect: getAllTasksInfo, Return response: {'allTasksInfo': {'9727dbaa-459b-4118-844f-62da2643710e': {'verb': 'createVolume', 'id': '9727dbaa-459b-4118-844f-62da2643710e'}}}


The task is reported as added to polling:

2014-03-18 23:00:46,148 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (org.ovirt.thread.pool-4-thread-43) [7862a7f6] Adding task 9727dbaa-459b-4118-844f-62da2643710e (Parent Command LiveMigrateVmDisks, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
2014-03-18 23:00:46,213 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-4-thread-43) Correlation ID: 64e196de, Job ID: 1c80889c-d943-4086-a3c3-a0453a3da865, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'Auto-generated for Live Storage Migration' creation for VM 'elad1-1' was initiated by admin.
2014-03-18 23:00:46,221 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (org.ovirt.thread.pool-4-thread-43) BaseAsyncTask::startPollingTask: Starting to poll task 9727dbaa-459b-4118-844f-62da2643710e.
2014-03-18 23:00:53,851 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (DefaultQuartzScheduler_Worker-96) Polling and updating Async Tasks: 1 tasks, 1 tasks to poll now
2014-03-18 23:00:53,857 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-96) SPMAsyncTask::PollTask: Polling task 9727dbaa-459b-4118-844f-62da2643710e (Parent Command LiveMigrateVmDisks, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status running.
2014-03-18 23:00:53,857 INFO  [org.ovirt.engine.core.bll.AsyncTaskManager] (DefaultQuartzScheduler_Worker-96) Finished polling Tasks, will poll again in 10 seconds


After that, the task doesn't appear in engine.log anymore. 

The engine doesn't revert the LSM flow if there is a failure to create the auto-generated live snapshot. Hence, the live storage migration is not rolled back, which leaves the createVolume task running on vdsm and the disk in 'LOCKED' state.


Expected results:
There should be a timeout for the createVolume task as part of the live snapshot creation for live storage migration. If the task gets stuck on vdsm, the engine should request vdsm to stop and clean it, and then a roll-back of the whole LSM flow should be executed.
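The expected behavior could be sketched roughly as follows. This is only an illustration of the requested poll-with-timeout pattern, not engine code; `get_status` and `stop_task` are hypothetical callables standing in for the engine's task-polling and task-cleanup verbs.

```python
import time

def poll_until_done(get_status, stop_task, timeout=60.0, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll a task until it leaves the 'running' state, or abort on timeout.

    get_status() -> 'running' | 'finished' | 'failed'  (hypothetical states)
    stop_task()  -> asks the host to stop and clean the task (hypothetical)
    Returns the final status, or 'aborted' if the timeout fired; on
    'aborted' the caller would be expected to roll back the whole LSM flow.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        status = get_status()
        if status != 'running':
            return status
        sleep(interval)
    # Timeout: the task is stuck; request a stop-and-clean instead of
    # polling forever and leaving the disk in 'LOCKED' state.
    stop_task()
    return 'aborted'
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting; a production version would also need to handle transient polling errors rather than treating them as fatal.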

Additional info: engine and vdsm logs

Comment 1 Elad 2014-03-19 07:24:57 UTC
Created attachment 876239 [details]
engine and vdsm logs

Comment 2 Allon Mureinik 2014-03-19 10:20:53 UTC
Daniel, this seems like a general issue in live snapshot, unrelated to LSM.
Can you confirm please?

Comment 3 Elad 2014-03-20 09:52:28 UTC
Allon,
I checked the same scenario for live snapshot only. The engine polls the createVolume tasks and orders vdsm to stop them in case of a failure due to connectivity loss with the storage:

2014-03-20 11:36:04,646 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-65) SPMAsyncTask::PollTask: Polling task 881b1a2a-6807-4cb1-8d0d-0e8fbdd2d6c0 (Parent Command CreateAllSnapshotsFromVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'success'.
2014-03-20 11:36:04,651 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-65) BaseAsyncTask::OnTaskEndSuccess: Task 881b1a2a-6807-4cb1-8d0d-0e8fbdd2d6c0 (Parent Command CreateAllSnapshotsFromVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.


 2014-03-20 11:37:04,903 ERROR [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (org.ovirt.thread.pool-4-thread-24) Ending command with failure: org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand
2014-03-20 11:37:04,907 INFO  [org.ovirt.engine.core.bll.RemoveMemoryVolumesCommand] (org.ovirt.thread.pool-4-thread-24) [736dc00b] Running command: RemoveMemoryVolumesCommand internal: true.

Comment 4 Daniel Erez 2014-08-21 15:17:40 UTC
The described scenario can't be reproduced on the latest 3.5 build (in neither QE nor development environments). Please re-open if reproduced.

Comment 5 Elad 2014-10-23 15:06:19 UTC
Created attachment 949970 [details]
logs-23.10

I was able to reproduce the issue; it seems to reproduce after performing the steps twice.

Thread-2483::DEBUG::2014-10-23 18:01:01,212::task::1191::Storage.TaskManager.Task::(prepare) Task=`7382f113-3421-4512-9566-f6bf49511c2f`::finished: {'tasks': {'5ab731f6-ca3d-4179-a475-dd3ea37a48ce': {'code': 0, 'tag': u'spm', 'state': 'running', 'verb': 'createVolume', 'result': '', 'message': 'running job 1 of 1', 'id': '5ab731f6-ca3d-4179-a475-dd3ea37a48ce'}}}

The disk and snapshot remain in status LOCKED, and the createVolume task is not cleaned from the SPM.

Uploading the logs

Comment 6 Daniel Erez 2014-10-28 14:20:29 UTC
Hi Elad,

On which engine build have you reproduced it?

Comment 7 Daniel Erez 2014-10-28 14:24:34 UTC
Hi Ravi,

It seems that the task isn't being cleaned in the described scenario.
Could it be related to async tasks infrastructure changes?

If the AsyncTaskManager does not clear the task, there's nothing to trigger the storage code there, so I'm not sure how we can do anything useful here.

Comment 8 Elad 2014-10-28 14:25:04 UTC
(In reply to Daniel Erez from comment #6)
> Hi Elad,
> 
> On which engine build have you reproduced it?

vt7
rhevm-3.5.0-0.17.beta.el6ev.noarch

Comment 9 Daniel Erez 2014-10-28 14:28:08 UTC
(In reply to Daniel Erez from comment #7)
> Hi Ravi,
> 
> It seems that the task isn't being cleaned in the described scenario.
> Could it be related to async tasks infrastructure changes?
> 
> If the AsyncTaskManager does not clear the task, there's nothing to trigger
> the storage code there, so I'm not sure how we can do anything useful here.

Adding back needinfo on Ravi.

Comment 10 Ravi Nori 2014-10-28 21:40:02 UTC
I was able to reproduce this. Here is what I did

1. Start VM with preallocated disk of 10GB
2. Move disk from Storage domain Data1 to Data2 while VM is running
3. Block connectivity from SPM to storage server where the target domain is located on while createVolume task is running on vdsm

Now the VM's disk remains locked until I fix the connectivity between the SPM and target storage domain. Once the connectivity is fixed the disk is unlocked, the operation is marked as failure and the async tasks table is cleared.

IMO this is the correct behavior, I don't think having a timeout for storage operations is something we need to introduce for 3.5

While the storage domain is down there is no polling or activity in the logs because the getAllTasksStatuses fails with the following vdcBLLException which is logged as debug messages

org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Cannot allocate IRS server (Failed with error IRS_REPOSITORY_NOT_FOUND and code 5009)
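The reason nothing shows up in engine.log here is ordinary log-level filtering: an exception recorded at DEBUG is invisible when the logger runs at the default INFO level. A minimal Python illustration of that effect (the logger name is made up; this is not engine code):

```python
import io
import logging

def emitted_output(record_level, logger_level):
    """Return what a logger set to `logger_level` actually writes
    for a message logged at `record_level`."""
    stream = io.StringIO()
    logger = logging.getLogger('irs-poll-demo')  # hypothetical logger name
    logger.handlers = [logging.StreamHandler(stream)]
    logger.setLevel(logger_level)
    logger.propagate = False
    # Stand-in for the exception message from the polling failure:
    logger.log(record_level,
               'Cannot allocate IRS server (IRS_REPOSITORY_NOT_FOUND, code 5009)')
    return stream.getvalue()
```

A DEBUG-level record against an INFO-level logger produces no output at all, which matches the "no polling or activity in the logs" observation above; logging the same failure at ERROR would have made it visible.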

Comment 11 Elad 2014-10-29 07:48:37 UTC
(In reply to Ravi Nori from comment #10)
> I was able to reproduce this. Here is what I did
> 
> 1. Start VM with preallocated disk of 10GB
> 2. Move disk from Storage domain Data1 to Data2 while VM is running
> 3. Block connectivity from SPM to storage server where the target domain is
> located on while createVolume task is running on vdsm
> 
> Now the VM's disk remains locked until I fix the connectivity between the
> SPM and target storage domain. Once the connectivity is fixed the disk is
> unlocked, the operation is marked as failure and the async tasks table is
> cleared.
>  
> IMO this is the correct behavior, I don't think having a timeout for storage
> operations is something we need to introduce for 3.5
> 
> While the storage domain is down there is no polling or activity in the logs
> because the getAllTasksStatuses fails with the following vdcBLLException
> which is logged as debug messages
> 
> org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Cannot
> allocate IRS server (Failed with error IRS_REPOSITORY_NOT_FOUND and code
> 5009)

Ravi, that's not the case. Please try the steps written in comment #5. The connectivity between the SPM and the storage was resumed, yet the task is not cleared from vdsm because the engine doesn't poll it.

Comment 12 Daniel Erez 2014-10-29 08:58:48 UTC
(In reply to Elad from comment #11)
> (In reply to Ravi Nori from comment #10)
> > I was able to reproduce this. Here is what I did
> > 
> > 1. Start VM with preallocated disk of 10GB
> > 2. Move disk from Storage domain Data1 to Data2 while VM is running
> > 3. Block connectivity from SPM to storage server where the target domain is
> > located on while createVolume task is running on vdsm
> > 
> > Now the VM's disk remains locked until I fix the connectivity between the
> > SPM and target storage domain. Once the connectivity is fixed the disk is
> > unlocked, the operation is marked as failure and the async tasks table is
> > cleared.
> >  
> > IMO this is the correct behavior, I don't think having a timeout for storage
> > operations is something we need to introduce for 3.5
> > 
> > While the storage domain is down there is no polling or activity in the logs
> > because the getAllTasksStatuses fails with the following vdcBLLException
> > which is logged as debug messages
> > 
> > org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Cannot
> > allocate IRS server (Failed with error IRS_REPOSITORY_NOT_FOUND and code
> > 5009)
> 
> Ravi, it's not the case. please try the steps written in comment #5. The
> connectivity between SPM and the storage was resumed. The task is not
> cleared from vdsm because engine doesn't poll it.

Ravi, if I understand correctly, the task was properly cleared once you restored the connectivity (similarly to the behavior in my tests as well, though it took quite a while for the tasks to be cleared). What do you think? Any reason for the tasks not being cleared after the connectivity is resumed?

Comment 13 lkuchlan 2014-11-26 15:40:14 UTC
No tasks remained stuck on the vdsm, and the disk status was set back to "OK"
Verified on VDSM vt11

Comment 14 Allon Mureinik 2015-02-16 19:11:06 UTC
RHEV-M 3.5.0 has been released, closing this bug.