Bug 1078095
| Summary: | [engine-backend] snapshot creation task (createVolume) is not polled by engine, which causes live storage migration flow to keep running in case of a failure of vdsm to complete the createVolume task | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Elad <ebenahar> |
| Component: | ovirt-engine | Assignee: | Daniel Erez <derez> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | lkuchlan <lkuchlan> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.4.0 | CC: | acanan, acathrow, amureini, derez, ebenahar, gklein, iheim, lpeer, Rhev-m-bugs, rnori, scohen, yeylon |
| Target Milestone: | --- | Keywords: | Reopened, Triaged |
| Target Release: | 3.5.0 | Flags: | amureini: Triaged+ |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | storage | | |
| Fixed In Version: | ovirt-engine-3.5.0_vt9 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-02-16 19:11:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 876239 [details]
engine and vdsm logs
Daniel, this seems like a general issue in live snapshot, unrelated to LSM. Can you confirm please?

Allon, I checked the same scenario for live snapshot only. The engine polls the createVolume tasks and orders vdsm to stop them in case of a failure due to connectivity loss with the storage:

```
2014-03-20 11:36:04,646 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-65) SPMAsyncTask::PollTask: Polling task 881b1a2a-6807-4cb1-8d0d-0e8fbdd2d6c0 (Parent Command CreateAllSnapshotsFromVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status finished, result 'success'.
2014-03-20 11:36:04,651 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-65) BaseAsyncTask::OnTaskEndSuccess: Task 881b1a2a-6807-4cb1-8d0d-0e8fbdd2d6c0 (Parent Command CreateAllSnapshotsFromVm, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
2014-03-20 11:37:04,903 ERROR [org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand] (org.ovirt.thread.pool-4-thread-24) Ending command with failure: org.ovirt.engine.core.bll.CreateAllSnapshotsFromVmCommand
2014-03-20 11:37:04,907 INFO [org.ovirt.engine.core.bll.RemoveMemoryVolumesCommand] (org.ovirt.thread.pool-4-thread-24) [736dc00b] Running command: RemoveMemoryVolumesCommand internal: true.
```

The described scenario can't be reproduced on the latest 3.5 build (neither in QE nor development environments). Please re-open if reproduced.

Created attachment 949970 [details]
logs-23.10
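The engine-side success path quoted above (SPMAsyncTask::PollTask returning a terminal status, then BaseAsyncTask::OnTaskEndSuccess firing) can be sketched as a minimal poll loop. This is an illustrative sketch only; the function and status names here are hypothetical stand-ins, not the actual ovirt-engine API:

```python
import time

# Statuses after which the engine stops polling an SPM task
# (illustrative; the real engine has a richer state model).
TERMINAL_STATUSES = {"finished", "aborted"}

def poll_task(get_status, task_id, interval=10, on_end=None):
    """Poll a single SPM task until it reaches a terminal status.

    get_status -- callable mapping a task id to its current status string
    on_end     -- optional callback invoked with (task_id, final_status),
                  mirroring the OnTaskEndSuccess hook in the log above
    """
    while True:
        status = get_status(task_id)
        if status in TERMINAL_STATUSES:
            if on_end is not None:
                on_end(task_id, status)
            return status
        time.sleep(interval)
```

The bug in this report is precisely that the loop above effectively stops running for the stuck createVolume task, so no terminal status is ever observed and no end-of-task handling fires.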
I was able to reproduce the issue. The bug seems to reproduce after performing the steps twice.

```
Thread-2483::DEBUG::2014-10-23 18:01:01,212::task::1191::Storage.TaskManager.Task::(prepare) Task=`7382f113-3421-4512-9566-f6bf49511c2f`::finished: {'tasks': {'5ab731f6-ca3d-4179-a475-dd3ea37a48ce': {'code': 0, 'tag': u'spm', 'state': 'running', 'verb': 'createVolume', 'result': '', 'message': 'running job 1 of 1', 'id': '5ab731f6-ca3d-4179-a475-dd3ea37a48ce'}}}
```

The disk and snapshot remain in status LOCKED, and the createVolume task is not cleaned from the SPM.

Uploading the logs.
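The vdsm task payload quoted above is a plain dict, so a stuck task like this one can be spotted by scanning for createVolume entries that never leave the 'running' state. This is an illustrative sketch, not part of vdsm or the engine:

```python
def find_stuck_tasks(tasks, verb="createVolume", state="running"):
    """Return ids of tasks matching `verb` that are still in `state` --
    e.g. createVolume jobs that never left 'running'."""
    return [
        task_id
        for task_id, info in tasks.items()
        if info.get("verb") == verb and info.get("state") == state
    ]

# Shape mirrors the 'tasks' dict in the vdsm log line quoted above.
tasks = {
    "5ab731f6-ca3d-4179-a475-dd3ea37a48ce": {
        "code": 0, "tag": "spm", "state": "running",
        "verb": "createVolume", "result": "",
        "message": "running job 1 of 1",
        "id": "5ab731f6-ca3d-4179-a475-dd3ea37a48ce",
    }
}
```

In this reproduction, `find_stuck_tasks(tasks)` would keep returning the same task id across polls, which is the symptom being reported.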
Hi Elad,

On which engine build have you reproduced it?

Hi Ravi,

It seems that the task isn't being cleaned in the described scenario. Could it be related to async tasks infrastructure changes?

If the AsyncTaskManager does not clear the task, there's nothing to trigger the storage code there, so I'm not sure how we can do anything useful here.

(In reply to Daniel Erez from comment #6)
> Hi Elad,
>
> On which engine build have you reproduced it?

vt7
rhevm-3.5.0-0.17.beta.el6ev.noarch

(In reply to Daniel Erez from comment #7)
> Hi Ravi,
>
> It seems that the task isn't being cleaned in the described scenario.
> Could it be related to async tasks infrastructure changes?
>
> If the AsyncTaskManager does not clear the task, there's nothing to trigger
> the storage code there, so I'm not sure how we can do anything useful here.

Adding back needinfo on Ravi.

I was able to reproduce this. Here is what I did:

1. Start VM with preallocated disk of 10GB
2. Move disk from storage domain Data1 to Data2 while VM is running
3. Block connectivity from the SPM to the storage server where the target domain is located while the createVolume task is running on vdsm

Now the VM's disk remains locked until I fix the connectivity between the SPM and the target storage domain. Once the connectivity is fixed, the disk is unlocked, the operation is marked as a failure and the async tasks table is cleared.

IMO this is the correct behavior; I don't think having a timeout for storage operations is something we need to introduce for 3.5.

While the storage domain is down there is no polling or activity in the logs, because getAllTasksStatuses fails with the following VdcBLLException, which is logged as debug messages:

org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: Cannot allocate IRS server (Failed with error IRS_REPOSITORY_NOT_FOUND and code 5009)

(In reply to Ravi Nori from comment #10)
> I was able to reproduce this. Here is what I did
>
> 1. Start VM with preallocated disk of 10GB
> 2. Move disk from Storage domain Data1 to Data2 while VM is running
> 3. Block connectivity from SPM to storage server where the target domain is
> located on while createVolume task is running on vdsm
>
> Now the VM's disk remains locked until I fix the connectivity between the
> SPM and target storage domain. Once the connectivity is fixed the disk is
> unlocked, the operation is marked as failure and the async tasks table is
> cleared.

Ravi, it's not the case. Please try the steps written in comment #5. The connectivity between the SPM and the storage was resumed. The task is not cleared from vdsm because the engine doesn't poll it.

(In reply to Elad from comment #11)
> Ravi, it's not the case. please try the steps written in comment #5. The
> connectivity between SPM and the storage was resumed. The task is not
> cleared from vdsm because engine doesn't poll it.

Ravi, iiuc, the task has been properly cleared once you've restored the connectivity (btw, similarly to the behavior in my tests as well, though it took quite a while for the tasks to be cleared). What do you think? Any reason for the tasks not being cleared after the connectivity is resumed?

No tasks remained stuck on the vdsm, and the disk status was set back to "OK".

Verified on VDSM vt11.

RHEV-M 3.5.0 has been released, closing this bug.
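The behavior Ravi describes (polling silently stops while getAllTasksStatuses throws "Cannot allocate IRS server", then resumes once connectivity returns) suggests the poller should treat that error as transient rather than dropping the task. A minimal sketch under that assumption; the exception class and callables here are hypothetical, not the engine's actual types:

```python
import time

class IrsUnavailableError(Exception):
    """Stand-in for the engine's 'Cannot allocate IRS server' failure."""

def poll_all_tasks(get_all_statuses, tracked, interval=10):
    """Poll the statuses of all tracked task ids until none remain.

    A transient IrsUnavailableError (storage connectivity lost) skips the
    cycle instead of dropping the tasks, so polling resumes -- and the
    tasks are eventually cleared -- once connectivity is restored.
    """
    while tracked:
        try:
            statuses = get_all_statuses()
        except IrsUnavailableError:
            time.sleep(interval)  # storage is down; try again later
            continue
        for task_id in list(tracked):
            if statuses.get(task_id) == "finished":
                tracked.discard(task_id)
        if tracked:
            time.sleep(interval)
    return tracked
```

The open question in the thread maps onto this sketch: in the reported failure, the stuck createVolume task behaves as if it had been removed from `tracked` during the outage, so no later poll ever clears it.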
Description of problem:
If the SPM loses its connectivity to the target storage domain while the createVolume task (for the auto-generated snapshot created for LSM) is running, the live storage migration flow is not failed by the engine; the createVolume task keeps running and is never stopped. The disk remains in 'LOCKED' state.

Version-Release number of selected component (if applicable):
RHEV 3.4 - AV3
rhevm-3.4.0-0.5.master.el6ev.noarch
vdsm-4.14.2-0.4.el6ev.x86_64

How reproducible:
Always

Steps to Reproduce:
On a shared DC with 2 storage domains (NFS in my case) and 1 host:
1. Create a VM with a disk attached and start it
2. While the VM is running, move its disk to another storage domain
3. Block connectivity from the SPM to the storage server where the target domain is located while the createVolume task is running on vdsm

Actual results:
The create snapshot command is sent to vdsm:

```
2014-03-18 23:00:46,041 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateSnapshotVDSCommand] (org.ovirt.thread.pool-4-thread-43) [7862a7f6] START, CreateSnapshotVDSCommand( storagePoolId = 1398e2c7-e653-49df-bb69-ba49eb14f931, ignoreFailoverLimit = false, storageDomainId = 1d9718f4-1e04-405f-90f0-75f33d140bf2, imageGroupId = 5f187fd8-9839-4d18-9ccd-b3ba9fc9747f, imageSizeInBytes = 3221225472, volumeFormat = COW, newImageId = d3da1752-9820-4159-9f4f-fd24b2720251, newImageDescription = , imageId = 1f7839b6-3489-4630-b824-4f4667cd36f3, sourceImageGroupId = 5f187fd8-9839-4d18-9ccd-b3ba9fc9747f), log id: 2018b2e2
```

vdsm starts the createVolume task while the connection to the target domain is blocked. On vdsm, the task remains running until manual intervention:

```
Thread-1514::INFO::2014-03-18 23:00:48,695::logUtils::47::dispatcher::(wrapper) Run and protect: getAllTasksInfo, Return response: {'allTasksInfo': {'9727dbaa-459b-4118-844f-62da2643710e': {'verb': 'createVolume', 'id': '9727dbaa-459b-4118-844f-62da2643710e'}}}
Thread-2873::INFO::2014-03-18 23:25:48,653::logUtils::47::dispatcher::(wrapper) Run and protect: getAllTasksInfo, Return response: {'allTasksInfo': {'9727dbaa-459b-4118-844f-62da2643710e': {'verb': 'createVolume', 'id': '9727dbaa-459b-4118-844f-62da2643710e'}}}
```

The task is reported as added to polling:

```
2014-03-18 23:00:46,148 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (org.ovirt.thread.pool-4-thread-43) [7862a7f6] Adding task 9727dbaa-459b-4118-844f-62da2643710e (Parent Command LiveMigrateVmDisks, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters), polling hasn't started yet..
2014-03-18 23:00:46,213 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-4-thread-43) Correlation ID: 64e196de, Job ID: 1c80889c-d943-4086-a3c3-a0453a3da865, Call Stack: null, Custom Event ID: -1, Message: Snapshot 'Auto-generated for Live Storage Migration' creation for VM 'elad1-1' was initiated by admin.
2014-03-18 23:00:46,221 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (org.ovirt.thread.pool-4-thread-43) BaseAsyncTask::startPollingTask: Starting to poll task 9727dbaa-459b-4118-844f-62da2643710e.
2014-03-18 23:00:53,851 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (DefaultQuartzScheduler_Worker-96) Polling and updating Async Tasks: 1 tasks, 1 tasks to poll now
2014-03-18 23:00:53,857 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-96) SPMAsyncTask::PollTask: Polling task 9727dbaa-459b-4118-844f-62da2643710e (Parent Command LiveMigrateVmDisks, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) returned status running.
2014-03-18 23:00:53,857 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (DefaultQuartzScheduler_Worker-96) Finished polling Tasks, will poll again in 10 seconds
```

After that, the task no longer appears in engine.log. The engine does not revert the LSM flow when there is a failure to create the auto-generated live snapshot. Hence, a rollback of the live storage migration is not executed, which leaves a createVolume task running on vdsm and the disk in 'LOCKED' state.

Expected results:
There should be a timeout for the createVolume task that runs as part of the live snapshot creation for live storage migration. If the task gets stuck on vdsm, the engine should request vdsm to stop and clean it, and then a rollback of the whole LSM flow should be executed.

Additional info:
engine and vdsm logs
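The expected behavior described above (time out a stuck task, ask vdsm to stop and clean it, then roll back LSM) can be sketched as a small watchdog. All of the callables here are hypothetical placeholders for engine/vdsm operations, not real API:

```python
import time

def watch_task(get_status, stop_task, clean_task, rollback, task_id,
               timeout=600, interval=10, clock=time.monotonic):
    """If the task is still 'running' when `timeout` seconds elapse, ask
    vdsm to stop and clean it, then roll back the LSM flow.

    get_status / stop_task / clean_task / rollback are placeholder
    callables; returns the task's final status, or 'aborted' on timeout.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        status = get_status(task_id)
        if status != "running":
            return status  # task ended on its own
        time.sleep(interval)
    # Timeout elapsed: stop and clean the vdsm task, then roll back LSM.
    stop_task(task_id)
    clean_task(task_id)
    rollback(task_id)
    return "aborted"
```

The thread's disagreement is visible in this sketch's `timeout` parameter: the reporter wants a finite deadline, while comment #10 argues the correct behavior is an effectively unbounded wait that resolves once connectivity is restored.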