Bug 1730766
Summary: [v2v] All/Most of the VMs fail to migrate on "Replace Logical Volume tag error" and image daemon communication
Product: Red Hat Enterprise Virtualization Manager
Reporter: Ilanit Stein <istein>
Component: vdsm
Assignee: Dan Kenigsberg <danken>
Status: CLOSED WORKSFORME
QA Contact: Ilanit Stein <istein>
Severity: high
Priority: medium
Version: 4.3.5
CC: bthurber, dagur, frolland, lsurette, nsoffer, srevivo, tnisan, ycui
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2019-09-09 14:35:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: Bug
oVirt Team: Storage
Cloudforms Team: V2V
Description
Ilanit Stein
2019-07-17 14:50:30 UTC
Created attachment 1591460 [details]
evm.log
VM migrations start time: July 17 5:20, failed one VM @5:59, and ended @7:30
Created attachment 1591462 [details]
automation.log
VM migrations start time: July 17 5:20, failed one VM @5:59, and ended @7:30
Created attachment 1591463 [details]
single vm failing first virt-v2v-wrapper.log
Created attachment 1591464 [details]
single vm failing first virt-v2v.log
Created attachment 1591465 [details]
One out of 19 failing VMs virt-v2v-wrapper.log
Created attachment 1591466 [details]
One out of 19 failing VMs virt-v2v.log
I did a second run on the same system (20 VMs/100 GB, host max concurrent VM migrations: 10, 2 conversion hosts). An hour after migration started, migration was cancelled for 2 VMs. For the other 18 VMs, migration was "restarted" (the disk was removed from the destination datastore, and the disk upload started from scratch). Eventually the migration plan completed successfully (apart from the 2 cancelled VM migrations). Overall migration time was 3 hours (not the expected 2 hours), due to the repeated VM migration round. On one host there were 11 v2v import logs (migrations); on the second host, 7.

Time details (July 21): migration start time 18:22, end time 21:20 (Israel time). This is the time set on the RHV machine and RHV hosts. On the CFME machine, the clock is set 7 hours behind; that is, there the migration ran between 11:22 and 14:20.

The cause of the failures seems to be related to image daemon errors, displayed in RHV events:

Jul 21, 2019, 7:20:42 PM VDSM host_mixed_2 command ExtendImageTicketVDS failed: Image daemon request failed: 'status=404, explanation=Invalid JSON, reason=Not Found, detail=No JSON object could be decoded'
Jul 21, 2019, 7:20:42 PM Transfer was stopped by system. Reason: missing transfer image ticket.
Jul 21, 2019, 7:20:52 PM VDSM host_mixed_1 command GetImageTicketVDS failed: Cannot communicate with image daemon: 'reason=Error communicating with ovirt-imageio-daemon: [Errno 111] Connection refused'
Jul 21, 2019, 7:21:14 PM Disk v2v_migration_vm_big_11-000 was successfully removed from domain v2v-fc (User admin@internal-authz).
Jul 21, 2019, 7:27:05 PM The disk 'v2v_migration_vm_big_10-000' was successfully added.

In the CFME evm.log, around the same time, errors such as these were repeated:

[----] E, [2019-07-21T12:20:54.393884 #54225:892f60] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: Disks transformation failed.
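The "No JSON object could be decoded" detail in the ExtendImageTicketVDS event is the classic symptom of handing a non-JSON body (for example, a plain 404 error page from a daemon that is down or restarting) to Python's JSON decoder. A minimal sketch of that failure mode, with an illustrative helper name (this is not vdsm's actual code):

```python
import json

# A 404 from a restarting or unreachable daemon often carries a
# non-JSON body, such as an HTML error page.
body = "<html><body><h1>404 Not Found</h1></body></html>"

def parse_ticket_response(body):
    """Illustrative helper: decode a JSON response, surfacing a
    clearer error. On Python 2 the underlying ValueError message was
    "No JSON object could be decoded"; Python 3 reports
    "Expecting value: ..." instead.
    """
    try:
        return json.loads(body)
    except ValueError as e:
        raise RuntimeError("Invalid JSON: {}".format(e))

try:
    parse_ticket_response(body)
except RuntimeError as e:
    print(e)  # prints "Invalid JSON: ..." rather than crashing
```

This matches the event text, where the client reported both the HTTP status (404) and the decode failure together.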
Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-07-21T12:20:54.394110 #54225:892f60] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/service_template_transformation_plan_task.rb:225:in `get_conversion_state'
[----] E, [2019-07-21T12:21:01.801183 #54225:892f60] ERROR -- : Q-task_id([r48_service_template_transformation_plan_task_168]) MIQ(MiqAeEngine.deliver) Error delivering {"request"=>"transformation_plan", :service_action=>"Provision"} for object [ServiceTemplateTransformationPlanTask.168] with state [State11] to Automate:
[----] E, [2019-07-21T12:21:59.647022 #43255:892f60] ERROR -- : Q-task_id([job_dispatcher]) MIQ(InfraConversionJob#process_abort) job aborting, Conversion error: Disks transformation failed.

In addition, there were errors which, based on the previous 20 VMs run, do not seem to affect the VM migrations. Opened Bug 1731853 - [v2v] Harmless "start is not permitted at state running" errors in the evm.log:

[----] E, [2019-07-21T12:24:25.696326 #4580:892f60] ERROR -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [118032], Error: [start is not permitted at state running]
[----] E, [2019-07-21T12:24:25.696555 #4580:892f60] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: start is not permitted at state running Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-07-21T12:24:25.696754 #4580:892f60] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/job/state_machine.rb:36:in `signal'

For the 2 VMs whose migration was cancelled, it was not possible to download the v2v import log, since it was removed when the second migration "round" started:

[----] E, [2019-07-21T13:29:50.891249 #5531:892f60] ERROR -- : MIQ(MiqQueue#deliver) Message id: [121548], Error: [Could not get conversion log '/var/log/vdsm/import/v2v-import-20190721T182433-12648.log' from 'host_mixed_1' with [RuntimeError: open /var/log/vdsm/import/v2v-import-20190721T182433-12648.log: no such file (2)]
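When correlating the RHV events above with the CFME evm.log timestamps, the 7-hour clock offset noted in the report has to be applied by hand. A quick sanity check of that arithmetic (the times are taken from the report itself):

```python
from datetime import datetime, timedelta

# From the report: RHV machine and hosts are on Israel time,
# the CFME machine's clock is set 7 hours behind.
OFFSET = timedelta(hours=7)

rhv_start = datetime(2019, 7, 21, 18, 22)
rhv_end = datetime(2019, 7, 21, 21, 20)

cfme_start = rhv_start - OFFSET
cfme_end = rhv_end - OFFSET

# Matches the report: migration ran 11:22 - 14:20 in CFME time.
print(cfme_start.strftime("%H:%M"), cfme_end.strftime("%H:%M"))  # 11:22 14:20
```

So the 12:20-12:21 evm.log errors line up with the 7:20-7:21 PM RHV events.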
Created attachment 1592549 [details]
evm.log
Errors appear @July 21, 12:20 (an hour after VM migration started).
Created attachment 1592550 [details]
automation.log
This bug was reproduced on July 24 2019.

Versions: CFME-5.10.7.1 / RHV-4.3.5 (RHEL 7.7 hosts).
Flow: I ran 20 VMs / 100 GB disk, vddk, twice.
Reproduce rate: ~50% of the runs.
Result: the first run was successful, and in the second run ALL VM migrations failed on this bug. The second migration started @23:55 July 24 (TLV time) and ended an hour later (the CFME machine is 7 hours behind).

Judging by the RHV events below, the causes are an HSMGetAllTasksStatusesVDS failure and, later, an ovirt-imageio-daemon communication failure. All v2v import logs were removed automatically.

RHV events:

This event showed up a few minutes after the VM migrations started:

Jul 25, 2019, 12:02:01 AM VDSM host_mixed_2 command HSMGetAllTasksStatusesVDS failed: Replace Logical Volume tag error: (u'lv: `3dbaae67-88f9-44f9-87dd-bde515e3720c/b2dab0a0-7d46-4821-919b-c5e7422628a1` add: `PU_00000000-0000-0000-0000-000000000000, MD_19, IU_a91bbd42-d51f-41dd-871a-2b8df278ef85` del: `OVIRT_VOL_INITIALIZING` ( Cannot process volume group 3dbaae67-88f9-44f9-87dd-bde515e3720c)',)

It appeared for 4 VMs, whose migration was cancelled right after this event:

Jul 25, 2019, 12:02:09 AM Add-Disk operation failed to complete.

An hour later this event appeared:

Jul 25, 2019, 12:55:57 AM VDSM host_mixed_2 command GetImageTicketVDS failed: Cannot communicate with image daemon: 'reason=Error communicating with ovirt-imageio-daemon: [Errno 111] Connection refused'

And right afterwards, all the remaining VM migrations (16 VMs) failed:

Jul 25, 2019, 12:55:57 AM Transfer was stopped by system. Reason: missing transfer image ticket.
Jul 25, 2019, 12:56:48 AM Failed to remove disk v2v_migration_vm_big_10-000 from storage domain v2v-fc (User: admin@internal-authz).
Jul 25, 2019, 12:56:49 AM VDSM host_mixed_1 command GetVolumeInfoVDS failed: Storage domain does not exist: (u'3dbaae67-88f9-44f9-87dd-bde515e3720c',)

Created attachment 1595075 [details]
engine.log
See from July 24 23:55
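The "Replace Logical Volume tag error" above comes from vdsm atomically swapping LVM tags on the new volume's LV: the placeholder OVIRT_VOL_INITIALIZING tag is removed and the volume-metadata tags (PU_* parent UUID, MD_* metadata slot, IU_* image UUID, as I understand vdsm's tag conventions) are added, roughly corresponding to a single `lvchange --deltag ... --addtag ...` call. A toy model of the operation's intent, using the values from the event text (this is an illustration, not vdsm's implementation):

```python
def replace_lv_tags(tags, add, delete):
    """Toy model of an atomic LV tag replace: drop `delete`, then
    add `add`. In the failing case, the underlying LVM call was
    rejected with "Cannot process volume group".
    """
    return (set(tags) - set(delete)) | set(add)

# Tags taken from the HSMGetAllTasksStatusesVDS event text.
initial = {"OVIRT_VOL_INITIALIZING"}
final = replace_lv_tags(
    initial,
    add={
        "PU_00000000-0000-0000-0000-000000000000",  # parent volume (none)
        "MD_19",                                    # metadata slot
        "IU_a91bbd42-d51f-41dd-871a-2b8df278ef85",  # image UUID
    },
    delete={"OVIRT_VOL_INITIALIZING"},
)
print(sorted(final))
```

Because the tag swap failed, the volume was left in its initializing state, which is consistent with the Add-Disk operation failing immediately afterwards.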
Created attachment 1595076 [details]
evm.log
Created attachment 1595079 [details]
automation.log
Further to comment #11, regarding the HSMGetAllTasksStatusesVDS and GetVolumeInfoVDS errors, nsoffer: "It might be https://bugzilla.redhat.com/1553133"

We are now in a transition to the RDU2 (US) lab, which will take several days, therefore I cannot run v2v tests right now. Once possible, I'll try to reproduce both issues mentioned in comment #11:
1. HSMGetAllTasksStatusesVDS, GetVolumeInfoVDS errors
2. imageio errors

Hi Ilanit, did you reproduce the reported issue?

In reply to comment #24: So far it has not been reproduced, since 2 other bugs were encountered when moving to the RDU lab:
ovirt-engine: Bug 1740021 - [v2v][Scale][RHV] 20 VMs migration fail on "timed out waiting for disk to become unlocked"
virt-v2v: Bug 1740098 - [v2v][Scale][RHV] Single VM migration failed, but related virt-v2v error is not logged.

Closing as QE could not reproduce; feel free to reopen if the issue is encountered again.

I tried to reproduce it on a scale RHV environment, RHV-4.3.6 with CFME-5.11.0.24; however, it has not reproduced so far. If/once I see it again, I'll update this bug.