Bug 1730766

Summary:

[v2v] All/Most of the VMs fail to migrate on "Replace Logical Volume tag error" and image daemon communication

Product:

Red Hat Enterprise Virtualization Manager

Reporter:

Ilanit Stein <istein>

Component:

vdsm

Assignee:

Dan Kenigsberg <danken>

Status:

CLOSED WORKSFORME

QA Contact:

Ilanit Stein <istein>

Severity:

high

Docs Contact:

Priority:

medium

Version:

4.3.5

CC:

bthurber, dagur, frolland, lsurette, nsoffer, srevivo, tnisan, ycui

Target Milestone:

---

Keywords:

Regression

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2019-09-09 14:35:39 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

Bug

oVirt Team:

Storage

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

V2V

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
evm.log	none
automation.log	none
single vm failing first virt-v2v-wrapper.log	none
single vm failing first virt-v2v.log	none
One out of 19 failing VMs virt-v2v-wrapper.log	none
One out of 19 failing VMs virt-v2v.log	none
evm.log	none
automation.log	none
engine.log	none
evm.log	none
automation.log	none

Description Ilanit Stein 2019-07-17 14:50:30 UTC

Description of problem:
On a VMware->RHV 20 VMs/100GB disk migration, VDDK, with Provider max concurrent VM migrations: 20, and max Conversion host max concurrent migrations: 10,
ISCSI->FC.


After ~40min a single VM failed on https://bugzilla.redhat.com/show_bug.cgi?id=1726439#c4
(I am mentioning this first error since I am not sure if it related to the current bug or not)

Then after ~2 hours, when it was time for all VMs to end the migration,
All the remaining 19 VMs failed to migrate.
The VMs were not created on RHV. There disks were removed form the RHV storage domain.

This is the first related error in evm.log:
[----] I, [2019-07-17T07:22:15.778059 #613:106af58]  INFO -- : Q-task_id([job_dispatcher]) MIQ(ServiceTemplateTransformationPlanTask#get_conversion_state) InfraConversionJob get_conversion_state to update_options: {:virtv2v_pid=>6718, :virtv2v_finished_on=>"2019-07-17 11:22:15", :virtv2v_status=>"failed"}
[----] E, [2019-07-17T07:22:15.852976 #613:106af58] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: Disks transformation failed.  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-07-17T07:22:15.853212 #613:106af58] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/service_template_transformation_plan_task.rb:225:in `get_conversion_state'
/var/www/miq/vmdb/app/models/infra_conversion_job.rb:94:in `poll_conversion'
/var/www/miq/vmdb/app/models/job/state_machine.rb:34:in `signal'
/var/www/miq/vmdb/app/models/miq_queue.rb:455:in `block in dispatch_method'
/usr/share/ruby/timeout.rb:93:in `block in timeout'
/usr/share/ruby/timeout.rb:33:in `block in catch'
/usr/share/ruby/timeout.rb:33:in `catch'
/usr/share/ruby/timeout.rb:33:in `catch'
/usr/share/ruby/timeout.rb:108:in `timeout'
/var/www/miq/vmdb/app/models/miq_queue.rb:453:in `dispatch_method'
/var/www/miq/vmdb/app/models/miq_queue.rb:430:in `block in deliver'
/var/www/miq/vmdb/app/models/user.rb:275:in `with_user_group'
/var/www/miq/vmdb/app/models/miq_queue.rb:430:in `deliver'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:104:in `deliver_queue_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:137:in `deliver_message'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:155:in `block in do_work'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:149:in `loop'
/var/www/miq/vmdb/app/models/miq_queue_worker_base/runner.rb:149:in `do_work'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:329:in `block in do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:326:in `do_work_loop'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:153:in `run'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:127:in `start'
/var/www/miq/vmdb/app/models/miq_worker/runner.rb:22:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:408:in `block in start_runner_via_fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.4/lib/nakayoshi_fork.rb:23:in `fork'
/opt/rh/cfme-gemset/gems/nakayoshi_fork-0.0.4/lib/nakayoshi_fork.rb:23:in `fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:406:in `start_runner_via_fork'
/var/www/miq/vmdb/app/models/miq_worker.rb:396:in `start_runner'
/var/www/miq/vmdb/app/models/miq_worker.rb:447:in `start'
/var/www/miq/vmdb/app/models/miq_worker.rb:277:in `start_worker'
/var/www/miq/vmdb/app/models/miq_worker.rb:154:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_worker.rb:154:in `times'
/var/www/miq/vmdb/app/models/miq_worker.rb:154:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:53:in `block in sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `each'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:50:in `sync_workers'
/var/www/miq/vmdb/app/models/miq_server/worker_management/monitor.rb:22:in `monitor_workers'
/var/www/miq/vmdb/app/models/miq_server.rb:339:in `block in monitor'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:28:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:339:in `monitor'
/var/www/miq/vmdb/app/models/miq_server.rb:380:in `block (2 levels) in monitor_loop'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:11:in `realtime_store'
/opt/rh/cfme-gemset/bundler/gems/cfme-gems-pending-7300bf53e177/lib/gems/pending/util/extensions/miq-benchmark.rb:35:in `realtime_block'
/var/www/miq/vmdb/app/models/miq_server.rb:380:in `block in monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:379:in `loop'
/var/www/miq/vmdb/app/models/miq_server.rb:379:in `monitor_loop'
/var/www/miq/vmdb/app/models/miq_server.rb:241:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:27:in `start'
/var/www/miq/vmdb/lib/workers/evm_server.rb:48:in `start'
/var/www/miq/vmdb/lib/workers/bin/evm_server.rb:4:in `<main>'
[----] I, [2019-07-17T07:22:15.884382 #613:106af58]  INFO -- : Q-task_id([job_dispatcher]) MIQ(InfraConversionJob#process_cancel) job canceling, Job canceled on 2019-07-17 07:22:15 -0400
[----] I, [2019-07-17T07:22:15.898464 #613:106af58]  INFO -- : Q-task_id([job_dispatcher]) MIQ(InfraConversionJob#process_finished) job finished, Job canceled on 2019-07-17 07:22:15 -0400
[----] I, [2019-07-17T07:22:15.908643 #613:106af58]  INFO -- : Q-task_id([job_dispatcher]) MIQ(InfraConversionJob#dispatch_finish) Dispatch Status is 'finished'
[----] I, [2019-07-17T07:22:15.923230 #613:106af58]  INFO -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue.put) Message id: [42586],  id: [], Zone: [], Role: [ems_operations], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [68], Task id: [94905f18-fe83-47b1-970f-cda031542845], Command: [InfraConversionJob.signal], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [:abort_job, "Conversion error: Disks transformation failed.", "error"]

  
Version-Release number of selected component (if applicable):
cfme-5.10.7.1
RHV-4.3.5.4-0.1.el7
RHv hosts: RHEL-7.7, virt-v2v-1.40.2-5.el7.x86_64

How reproducible:
Failed on 1 20VMs v2v migration, after 2 similar migration runs passed, on the same environment.

Comment 2 Ilanit Stein 2019-07-17 15:22:22 UTC

Created attachment 1591460 [details]
evm.log

VM migrations start time: July 17 5:20, failed one VM @5:59, and ended @7:30

Comment 3 Ilanit Stein 2019-07-17 15:23:16 UTC

Created attachment 1591462 [details]
automation.log

VM migrations start time: July 17 5:20, failed one VM @5:59, and ended @7:30

Comment 4 Ilanit Stein 2019-07-17 15:24:35 UTC

Created attachment 1591463 [details]
single vm failing first virt-v2v-wrapper.log

Comment 5 Ilanit Stein 2019-07-17 15:25:25 UTC

Created attachment 1591464 [details]
single vm failing first virt-v2v.log

Comment 6 Ilanit Stein 2019-07-17 15:27:43 UTC

Created attachment 1591465 [details]
One out of 19 failing VMs virt-v2v-wrapper.log

Comment 7 Ilanit Stein 2019-07-17 15:29:20 UTC

Created attachment 1591466 [details]
One out of 19 failing VMs virt-v2v.log

Comment 8 Ilanit Stein 2019-07-22 08:48:03 UTC

I did a second run, on the same system.
(20VMs/100GB, Host max concurrent VM migrations: 10, 2 Conversion hosts).

An hour after migration started, migration was cancelled for 2 VMs.
For the other 18 VMs - migration was "restarted" (the disk was removed from destination datastore, and disk upload started from new).
Eventually the migration plan ended up successfully (though the 2 VMs migration cancellation).
Migration overall time was: 3 hours (and not 2 hours, as expected), due to the repeated VMs migration round.
On one host, there were 11 v2v import logs (migrations), and in the second host 7. 

Time details (on July 21):
Migration start time: 18:22. end time: 21:20 (Israel time).
This is the time, set on the RHV machine, and RHV hosts.
In CFME machine, time is set 7 hours behind. That is, migration was in between: 11:22 - 14:20.

The cause of the failures seem to be related to Image daemon errors, displayed in RHV events:

Jul 21, 2019, 7:20:42 PM
VDSM host_mixed_2 command ExtendImageTicketVDS failed: Image daemon request failed: 'status=404, explanation=Invalid JSON, reason=Not Found, detail=No JSON object could be decoded'

Jul 21, 2019, 7:20:42 PM
Transfer was stopped by system. Reason: missing transfer image ticket.

Jul 21, 2019, 7:20:52 PM
VDSM host_mixed_1 command GetImageTicketVDS failed: Cannot communicate with image daemon: 'reason=Error communicating with ovirt-imageio-daemon: [Errno 111] Connection refused'

Jul 21, 2019, 7:21:14 PM
Disk v2v_migration_vm_big_11-000 was successfully removed from domain v2v-fc (User admin@internal-authz).

Jul 21, 2019, 7:27:05 PM
The disk 'v2v_migration_vm_big_10-000' was successfully added.

In CFME evm.log, around the same time, such errors were repeated: 

[----] E, [2019-07-21T12:20:54.393884 #54225:892f60] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: Disks transformation failed.  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-07-21T12:20:54.394110 #54225:892f60] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/service_template_transformation_plan_task.rb:225:in `get_conversion_state'
[----] E, [2019-07-21T12:21:01.801183 #54225:892f60] ERROR -- : Q-task_id([r48_service_template_transformation_plan_task_168]) MIQ(MiqAeEngine.deliver) Error delivering {"request"=>"transformation_plan", :service_action=>"Provision"} for object [ServiceTemplateTransformationPlanTask.168] with state [State11] to Automate:

[----] E, [2019-07-21T12:21:59.647022 #43255:892f60] ERROR -- : Q-task_id([job_dispatcher]) MIQ(InfraConversionJob#process_abort) job aborting, Conversion error: Disks transformation failed.

In addition there were such errors, which, based on the previous 20 VMs run, seem to be not affecting the VMs migration:
Opened this Bug 1731853 - [v2v] Harmless "start is not permitted at state running" errors in the evm.log

[----] E, [2019-07-21T12:24:25.696326 #4580:892f60] ERROR -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [118032], Error: [start is not permitted at state running]
[----] E, [2019-07-21T12:24:25.696555 #4580:892f60] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: start is not permitted at state running  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-07-21T12:24:25.696754 #4580:892f60] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/job/state_machine.rb:36:in `signal'

For the 2 VMs cancelled migration, it was not possible to download the v2v import log, since it was removed, when the second VMs migration "round" started:

[----] E, [2019-07-21T13:29:50.891249 #5531:892f60] ERROR -- : MIQ(MiqQueue#deliver) Message id: [121548], Error: [Could not get conversion log '/var/log/vdsm/import/v2v-import-20190721T182433-12648.log' from 'h
ost_mixed_1' with [RuntimeError: open /var/log/vdsm/import/v2v-import-20190721T182433-12648.log: no such file (2)]

Comment 9 Ilanit Stein 2019-07-22 08:49:19 UTC

Created attachment 1592549 [details]
evm.log

Errors appear @July 21, 12:20 (hour after VM migration started).

Comment 10 Ilanit Stein 2019-07-22 08:49:56 UTC

Created attachment 1592550 [details]
automation.log

Comment 11 Ilanit Stein 2019-07-31 12:57:10 UTC

This bug was reproduced on July 24 2019

Versions:
CFME-5.10.7.1/RHV-4.3.5(RHEL 7.7 hosts).

Flow:
I run 20 VMs/100GB disk, vddk twice.

Reproduce rate:
~50% of the runs

Result:
First run was successful,
and in the second run, ALL VM migrations failed, on this bug.
This second migration started @23:55 July 24 (tlv time), and ended an hour later. (CFME machine is 7 hours behind):

It seems by RHV events (bellow) the causes are HSMGetAllTasksStatusesVDS failure and later ovirt-imageio-daemon communication failure.
All v2v import logs were removed automatically.

RHV Events:
This event showed up few minutes after the VM migrations started:
Jul 25, 2019, 12:02:01 AM
VDSM host_mixed_2 command HSMGetAllTasksStatusesVDS failed: Replace Logical Volume tag error: (u'lv: `3dbaae67-88f9-44f9-87dd-bde515e3720c/b2dab0a0-7d46-4821-919b-c5e7422628a1` add: `PU_00000000-0000-0000-0000-000000000000, MD_19, IU_a91bbd42-d51f-41dd-871a-2b8df278ef85` del: `OVIRT_VOL_INITIALIZING` ( Cannot process volume group 3dbaae67-88f9-44f9-87dd-bde515e3720c)',)

This is appeared for 4 VMs, that their migration was cancelled, right after this event:
Jul 25, 2019, 12:02:09 AM
Add-Disk operation failed to complete.

An hour later this event appeared:
Jul 25, 2019, 12:55:57 AM
VDSM host_mixed_2 command GetImageTicketVDS failed: Cannot communicate with image daemon: 'reason=Error communicating with ovirt-imageio-daemon: [Errno 111] Connection refused'

And right afterwards, all the rest VM migrations (16 VMs) failed:
Jul 25, 2019, 12:55:57 AM

Transfer was stopped by system. Reason: missing transfer image ticket.
Jul 25, 2019, 12:56:48 AM
Failed to remove disk v2v_migration_vm_big_10-000 from storage domain v2v-fc (User: admin@internal-authz).
Jul 25, 2019, 12:56:49 AM
VDSM host_mixed_1 command GetVolumeInfoVDS failed: Storage domain does not exist: (u'3dbaae67-88f9-44f9-87dd-bde515e3720c',)

Comment 12 Ilanit Stein 2019-07-31 12:59:51 UTC

Created attachment 1595075 [details]
engine.log

See from July 24 23:55

Comment 13 Ilanit Stein 2019-07-31 13:01:53 UTC

Created attachment 1595076 [details]
evm.log

Comment 14 Ilanit Stein 2019-07-31 13:04:53 UTC

Created attachment 1595079 [details]
automation.log

Comment 15 Ilanit Stein 2019-08-05 05:58:33 UTC

Further to comment #11,

Regarding HSMGetAllTasksStatusesVDS , GetVolumeInfoVDS Errors -
nsoffer: "It might be https://bugzilla.redhat.com/1553133"

We are now in a transition to RDU2 (US) lab, that will take several days,
therefore, I cannot run v2v tests right now.

Once possible, I'll try to reproduce both issues, mentioned in comment #11: 
1. HSMGetAllTasksStatusesVDS , GetVolumeInfoVDS Errors 
2. imageio Errors.

Comment 24 Fred Rolland 2019-08-19 14:05:40 UTC

Hi Ilanit,
Did you reproduced the reported issue?

Comment 27 Ilanit Stein 2019-08-19 18:32:40 UTC

In reply to comment #24:

So far it was not reproduced, since 2 other bugs were encountered, when moving to the RDU lab:
ovirt-engine: Bug 1740021 - [v2v][Scale][RHV] 20 VMs migration fail on "timed out waiting for disk to become unlocked" 
virt-v2v: Bug 1740098 - [v2v][Scale][RHV] Single VM migration failed, but related virt-v2v error is not logged.

Comment 30 Tal Nisan 2019-09-09 14:35:39 UTC

Closing as QE could not reproduce, feel free to reopen if the issue is encountered again

Comment 31 Ilanit Stein 2019-09-18 13:56:28 UTC

I tried to reproduce it on scale RHV environment RHV-4.3.6, with CFME-5.11.0.24. however it did not reproduce so far.
if/once I'll see it again, I'll update this bug.