Bug 1151374
| Summary: | A VM import failure is not properly rolled back. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Roman Hodain <rhodain> |
| Component: | ovirt-engine | Assignee: | Moti Asayag <masayag> |
| Status: | CLOSED WONTFIX | QA Contact: | Pavel Stehlik <pstehlik> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.4.1-1 | CC: | amureini, laravot, lpeer, lsurette, masayag, mgoldboi, mperina, pbrilla, ppostler, rbalakri, Rhev-m-bugs, rhodain, scohen, srevivo, tnisan, ykaul |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-31 12:36:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1082498 | | |
| Bug Blocks: | | | |
Roman, can you please attach the complete log from which you copied to the description? Thanks.

Thanks Roman. As VDSM currently generates the task IDs (not the engine), if a task ID is not returned to the engine within 180 seconds, the engine cannot track the task (this is relevant to all the flows). Once the engine passes the task ID to VDSM instead, it will be able to track the task and act accordingly.

Oved, afaik we already have a bug for that on the infra team. This one should be marked as a duplicate of it or as depending on it.

thanks,
Liron.

(In reply to Liron Aravot from comment #6)
> Oved, afaik we already have a bug for that on the infra team. This one
> should be marked as a duplicate of it or as depending on it.

Oved, any insight?

(In reply to Allon Mureinik from comment #7)
> (In reply to Liron Aravot from comment #6)
> > Oved, afaik we already have a bug for that on the infra team. This one
> > should be marked as a duplicate of it or as depending on it.
> Oved, any insight?

Yair, are you familiar with this one?

(In reply to Allon Mureinik from comment #7)
> (In reply to Liron Aravot from comment #6)
> > Oved, afaik we already have a bug for that on the infra team. This one
> > should be marked as a duplicate of it or as depending on it.
> Oved, any insight?

btw, even if we do find the bug number, I don't expect it to be addressed in 3.5.

(In reply to Oved Ourfali from comment #9)
> (In reply to Allon Mureinik from comment #7)
> > (In reply to Liron Aravot from comment #6)
> > > Oved, afaik we already have a bug for that on the infra team. This one
> > > should be marked as a duplicate of it or as depending on it.
> > Oved, any insight?

Even if there isn't such a BZ, it seems like there should be one (i.e., this one). Moving to infra to either handle it according to their priorities or provide instructions on how the current storage flow should change.

This will require infrastructural changes that won't be in scope for 3.5. Targeting it to 3.6.

Hello, is there any update?

Per a discussion with Liron, in order to solve this bug the responsibility of allocating a task ID should be moved from VDSM to the engine. By doing so, the task can be identified by the engine whenever it is reported by VDSM, eliminating the case of orphaned/zombie tasks. For that purpose, the VDSM API should be extended to also include the task ID.
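A minimal sketch of that proposed contract, using hypothetical names rather than the real oVirt/VDSM interfaces: the engine allocates the task ID up front and passes it along with the verb call, so it can always name the task later, even if the synchronous reply is lost.

```python
# Minimal sketch of the proposed contract (hypothetical names, not the real
# oVirt/VDSM interfaces): the engine allocates the task ID up front and
# passes it to VDSM, so it can recognize the task later even if the
# synchronous reply never arrives.
import uuid


class VdsmStub:
    """Stands in for VDSM: registers tasks under a caller-supplied ID."""

    def __init__(self):
        self.tasks = {}  # task_id -> state

    def copy_image(self, task_id, src_vol, dst_vol):
        # The task is keyed by the engine-chosen ID rather than one
        # generated here, so there is no window in which the engine
        # cannot name the task it started.
        self.tasks[task_id] = "running"


class EngineStub:
    """Stands in for the engine: allocates the ID before calling VDSM."""

    def __init__(self, vdsm):
        self.vdsm = vdsm

    def start_copy(self, src_vol, dst_vol):
        task_id = str(uuid.uuid4())  # allocated on the engine side
        # Even if this call times out, the engine already knows task_id
        # and can later find the task in VDSM's task list and clean it up.
        self.vdsm.copy_image(task_id, src_vol, dst_vol)
        return task_id


vdsm = VdsmStub()
engine = EngineStub(vdsm)
tid = engine.start_copy("src-volume-uuid", "dst-volume-uuid")
assert tid in vdsm.tasks  # the engine can always correlate the task
```

With this contract, the 180-second window stops mattering for tracking: whether or not the reply arrives in time, the engine can look the task up by the ID it chose.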
Description of problem:

The VM import operation fails due to storage delays which cause the "CopyImageVDS" command to time out. This leads to a rollback on the engine side, but the operation succeeds on the VDSM side. The result is orphaned disk images on the storage, which prevent repeating the import.

Version-Release number of selected component (if applicable):
vdsm-4.14.11-5.el6ev.x86_64
rhevm-3.4.1-0.30.el6ev.noarch

How reproducible:
Sometimes, when the storage does not respond promptly. This happens when the check on the images takes longer than 180 seconds.

Actual results:
The import fails and the disks remain on the storage.

Expected results:
The disks are removed from the storage, or the operation finishes successfully.

Additional info:

From the engine log we can see that copyImage is called and fails after 180 seconds due to the timeout. We can also see the storage domain reported as being in problem:

```
2014-10-05 12:02:52,085 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CopyImageVDSCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] -- executeIrsBrokerCommand: calling 'copyImage' with two new parameters: description and UUID
2014-10-05 12:02:52,085 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CopyImageVDSCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] -- copyImage parameters: sdUUID=57692261-0d26-4e9d-a8fa-749f2a55789b spUUID=00000002-0002-0002-0002-000000000308 vmGUID=578bd1cf-7bde-45d7-be3c-528545b9c94b srcImageGUID=a962c650-701a-4c3b-ac69-bcf2f5483785 srcVolUUID=654e315c-4540-4105-92cb-5b02983c3105 dstImageGUID=a962c650-701a-4c3b-ac69-bcf2f5483785 dstVolUUID=654e315c-4540-4105-92cb-5b02983c3105 descr= dstSdUUID=f6e24add-a529-4515-b011-6fa2293f16cb
...
2014-10-05 12:03:11,282 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-12) domain 57692261-0d26-4e9d-a8fa-749f2a55789b:SD01 in problem. vds: rhev-h01
2014-10-05 12:03:31,207 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-36) Domain 57692261-0d26-4e9d-a8fa-749f2a55789b:SD01 recovered from problem. vds: rhev-h01
2014-10-05 12:03:31,207 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-36) Domain 57692261-0d26-4e9d-a8fa-749f2a55789b:SD01 has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
...
2014-10-05 12:05:52,087 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] IrsBroker::Failed::CopyImageVDS due to: TimeoutException:
...
2014-10-05 12:05:52,349 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] Removed task 4eb52586-2808-4479-954e-b5991a2d5088 from DataBase
2014-10-05 12:05:52,353 ERROR [org.ovirt.engine.core.bll.CopyImageGroupCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] Transaction rolled-back for command: org.ovirt.engine.core.bll.CopyImageGroupCommand.
2014-10-05 12:05:52,353 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] transaction rolled back
```
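The essence of the failure can be reproduced in miniature. The sketch below is illustrative only (the real actors are the engine's VDS broker and VDSM's task manager): the client gives up after its timeout while the server-side operation is still in its slow pre-task phase, so at rollback time there is no task to clean up, yet the work completes afterwards.

```python
# Toy reproduction of the race (illustrative only): the client times out and
# rolls back while the server-side operation is still in its slow pre-task
# phase, so the task registry is empty at rollback time and the finished
# copy is never cleaned up.
import threading
import time

tasks = {}  # server-side task registry (the SPM's task list)


def server_copy_image():
    time.sleep(2.0)                 # slow validation before the task exists
    tasks["copy-task"] = "running"  # the task is created only after that


worker = threading.Thread(target=server_copy_image)
worker.start()
worker.join(timeout=0.5)            # client-side timeout (180 s in the bug)

# The client now rolls back, but the registry is still empty, so a
# "clean up all known tasks" pass finds nothing to undo.
print("known tasks at rollback:", dict(tasks))   # {}

worker.join()                       # yet the server-side work completes,
print("tasks after the fact:", dict(tasks))      # {'copy-task': 'running'}
```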
This is not a networking issue: the task on the SPM was not created within 180 seconds, which is why we got the timeout. Here is the flow on the SPM. Keep in mind that the VDSM timestamps are two hours behind the engine timestamps.

```
Thread-34645::DEBUG::2014-10-05 10:02:51,490::BindingXMLRPC::251::vds::(wrapper) client [192.168.1.2] flowID [4c92a7de]
Thread-34645::DEBUG::2014-10-05 10:02:51,492::task::595::TaskManager.Task::(_updateState) Task=`244ead22-ec64-4e7c-85a4-9ef863ea0eff`::moving from state init -> state preparing
Thread-34645::INFO::2014-10-05 10:02:51,493::logUtils::44::dispatcher::(wrapper) Run and protect: copyImage(sdUUID='57692261-0d26-4e9d-a8fa-749f2a55789b', spUUID='00000002-0002-0002-0002-000000000308',...
Thread-34645::DEBUG::2014-10-05 10:03:08,420::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 654e315c-4540-4105-92cb-5b02983c3105
...
Thread-34645::INFO::2014-10-05 10:03:36,613::image::285::Storage.Image::(isLegal) image a962c650-701a-4c3b-ac69-bcf2f5483785 in domain 57692261-0d26-4e9d-a8fa-749f2a55789b has vollist ['654e315c-4540-4105-92cb-5b02983c3105']
Thread-34645::DEBUG::2014-10-05 10:03:36,616::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 654e315c-4540-4105-92cb-5b02983c3105
...
Thread-34645::DEBUG::2014-10-05 10:06:14,880::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 654e315c-4540-4105-92cb-5b02983c3105
...
Thread-34645::DEBUG::2014-10-05 10:07:04,552::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 654e315c-4540-4105-92cb-5b02983c3105
...
Thread-34645::DEBUG::2014-10-05 10:07:39,811::fileVolume::529::Storage.Volume::(validateVolumePath) validate path for 654e315c-4540-4105-92cb-5b02983c3105
...
Thread-34645::DEBUG::2014-10-05 10:09:22,622::resourceManager::198::ResourceManager.Request::(__init__) ResName=`Storage.57692261-0d26-4e9d-a8fa-749f2a55789b`ReqID=`32c881e9-807f-49de-b7da-0754df36b1b1`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '1792' at 'copyImage'
...
Thread-34645::DEBUG::2014-10-05 10:09:22,893::task::595::TaskManager.Task::(_updateState) Task=`244ead22-ec64-4e7c-85a4-9ef863ea0eff`::moving from state acquiring -> state queued
```

You can see that most of the time was spent in "validateVolumePath". We perform the check four times because the code is not optimal (a sketch of deduplicating the check follows below). As a result the task was not created in time, the result was not reported back to the engine, and the engine rolled back the operation.
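One way to avoid paying that cost repeatedly, sketched here as a hypothetical optimization (this is not VDSM code) under the assumption that the validation result is stable within a single flow, is to memoize the check per volume:

```python
# Sketch of deduplicating the expensive check (not actual VDSM code),
# assuming the validation result is stable within a single flow: memoize
# per (domain, volume) so the flow pays the storage round-trip once.
import functools


@functools.lru_cache(maxsize=None)
def validate_volume_path(sd_uuid, vol_uuid):
    # Stands in for the slow storage round-trip seen in the log above.
    print(f"validating {vol_uuid} on {sd_uuid}")
    return True


for _ in range(4):  # the flow above hit validateVolumePath four times
    validate_volume_path("57692261-0d26", "654e315c-4540")
# Only one "validating ..." line is printed; the other calls hit the cache.
```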
The timeout also triggered SpmStopVDSCommand and a reload of the list of tasks on the master SD. Unfortunately, this too happened before the task was created, so the task was not reported to the engine:

```
2014-10-05 12:05:52,272 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] START, SpmStopVDSCommand(HostName = mh-rhev-1000-1.ad.molecularhealth.com, HostId = 5b31617e-6ce4-4538-b92a-83a670ba157f, storagePoolId = 00000002-0002-0002-0002-000000000308), log id: 51617d80
2014-10-05 12:05:52,328 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] FINISH, SpmStopVDSCommand, log id: 51617d80
2014-10-05 12:05:52,328 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-4-thread-45) [4c92a7de] IRS failover failed - cant allocate vds server
...
2014-10-05 12:05:52,407 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-14) [286fabd4] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 00000002-0002-0002-0002-000000000308 Type: StoragePool
...
2014-10-05 12:05:52,605 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-14) [286fabd4] Found spm host 5b31617e-6ce4-4538-b92a-83a670ba157f, host name: mh-rhev-1000-1.ad.molecularhealth.com, according to spmId: 2.
2014-10-05 12:05:52,654 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-14) [286fabd4] SpmStatus on vds 5b31617e-6ce4-4538-b92a-83a670ba157f: SPM
2014-10-05 12:05:52,654 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-14) [286fabd4] Using old spm server: mh-rhev-1000-1.ad.molecularhealth.com, no start needed
...
2014-10-05 12:05:52,749 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.SPMGetAllTasksInfoVDSCommand] (org.ovirt.thread.pool-4-thread-22) [286fabd4] START, SPMGetAllTasksInfoVDSCommand( storagePoolId = 00000002-0002-0002-0002-000000000308, ignoreFailoverLimit = false), log id: 476b6853
...
2014-10-05 12:05:53,164 INFO [org.ovirt.engine.core.bll.AsyncTaskManager] (org.ovirt.thread.pool-4-thread-22) [286fabd4] Discovered 4 tasks on Storage Pool DC1, 0 added to manager.
```

Here we can see that the list of tasks was returned about four minutes before the task for copyImage was created.
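To make the gap concrete, here is a sketch (not the engine's actual AsyncTaskManager logic) of why a one-shot reconciliation misses the task: the recovery scan compares what VDSM reports at that instant against what the engine knows, and a task that only appears minutes later is never attributed to any command, i.e. it becomes a zombie.

```python
# Sketch of the reconciliation gap (not the engine's actual AsyncTaskManager
# logic): a one-shot scan at recovery time cannot see a task that is only
# created minutes later, so that task ends up unattributed (a zombie).
engine_known = {"t1", "t2", "t3", "t4"}  # the "Discovered 4 tasks" above

report_at_recovery = {"t1", "t2", "t3", "t4"}            # taken at 12:05:52
report_minutes_later = {"t1", "t2", "t3", "t4", "copy"}  # copy task exists now


def unattributed(reported, known):
    """Tasks VDSM tracks that the engine cannot map to any command."""
    return reported - known


print(unattributed(report_at_recovery, engine_known))    # set(): nothing found
print(unattributed(report_minutes_later, engine_known))  # {'copy'}: a zombie
```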