Bug 1716283 - [v2v] [VDDK][RHV][OSP] Concurrently migrating VMs are not correctly balanced among conversion hosts
Summary: [v2v] [VDDK][RHV][OSP] Concurrently migrating VMs are not correctly balanced ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: V2V
Version: 5.10.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.11.0
Assignee: Daniel Berger
QA Contact: Ilanit Stein
Docs Contact: Red Hat CloudForms Documentation
URL:
Whiteboard:
Duplicates: 1719700 (view as bug list)
Depends On: 1698761
Blocks: 1721117
 
Reported: 2019-06-03 07:13 UTC by Ilanit Stein
Modified: 2019-12-13 14:57 UTC (History)
12 users (show)

Fixed In Version: 5.11.0.23
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1721117 (view as bug list)
Environment:
Last Closed: 2019-12-13 14:57:15 UTC
Category: Bug
Cloudforms Team: V2V
Target Upstream Version:
Embargoed:
ytale: needinfo-


Attachments (Terms of Use)
evm.log (4.52 MB, application/gzip)
2019-06-03 07:22 UTC, Ilanit Stein
no flags Details
automation.log (3.05 MB, application/gzip)
2019-06-03 07:24 UTC, Ilanit Stein
no flags Details
v2v import log (2.04 MB, application/octet-stream)
2019-06-03 07:33 UTC, Ilanit Stein
no flags Details
v2v import wrapper log (29.07 KB, application/octet-stream)
2019-06-03 07:36 UTC, Ilanit Stein
no flags Details
evm.log1.tgz (16.27 MB, application/gzip)
2019-08-12 13:14 UTC, Ilanit Stein
no flags Details
evm.log2.tgz (3.31 MB, application/gzip)
2019-08-12 13:15 UTC, Ilanit Stein
no flags Details

Description Ilanit Stein 2019-06-03 07:13:44 UTC
Description of problem:
Migrating 20 VMs from VMware to RHV.

Environment:
On a CFME appliance that has 2 conversion hosts, added via the rails console
and configured with max concurrent tasks=10, VDDK transport.
Provider max concurrent migrations was set to 20 via the REST API (custom attributes).
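
For reference, a minimal rails console sketch of that kind of setup (the host name, lookup, and exact attributes are illustrative, not the precise commands that were run here):

# Illustrative sketch only - register an existing RHV host as a conversion host
# and cap it at 10 concurrent tasks with VDDK transport.
host = Host.find_by(:name => "host_mixed_1")   # hypothetical host name
ConversionHost.create!(
  :name                     => host.name,
  :resource                 => host,
  :max_concurrent_tasks     => 10,
  :vddk_transport_supported => true
)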

In the UI, migration settings (though I am not sure these have an actual effect):
Provider max concurrent migrations=20
Host max concurrent migrations=10

All 20 VMs were directed to migrate through a single conversion host,
though there are 2 valid conversion hosts in the RHV cluster.
 
Version-Release number of selected component (if applicable):
CFME-5.10.5.1
RHV-4.3.4

Additional info:
The migration itself succeeded for only 10 out of the 20 VMs,
and failed for the remaining 10 VMs.
I shall open another bug for the failure of those 10 VM migrations.

Comment 2 Ilanit Stein 2019-06-03 07:22:17 UTC
Created attachment 1576506 [details]
evm.log

Comment 3 Ilanit Stein 2019-06-03 07:24:36 UTC
Created attachment 1576507 [details]
automation.log

Comment 4 Ilanit Stein 2019-06-03 07:33:08 UTC
Created attachment 1576509 [details]
v2v import log

Comment 5 Ilanit Stein 2019-06-03 07:34:18 UTC
Comment on attachment 1576509 [details]
v2v import log

v2v import log for one of the 10 VMs that failed migration (VM "v2v_migration_vm_1").

Comment 6 Ilanit Stein 2019-06-03 07:36:21 UTC
Created attachment 1576510 [details]
v2v import wrapper log

v2v import wrapper log for one of the 10 VMs that failed migration (VM "v2v_migration_vm_1").

Comment 7 Ilanit Stein 2019-06-03 07:55:42 UTC
Forgot to mention that the migrated VMs have a 100GB disk each,
and that the migration is from VMware (iSCSI) to RHV (iSCSI).

Comment 8 Ilanit Stein 2019-06-03 08:00:27 UTC
On conversion host:
[root@lynx18 import]# rpm -qa | grep v2v
v2v-conversion-host-wrapper-1.13.1-1.el7ev.noarch
virt-v2v-1.38.2-12.29.lp.el7ev.x86_64
v2v-conversion-host-ansible-1.13.1-1.el7ev.noarch

Comment 9 Ilanit Stein 2019-06-03 11:59:55 UTC
Here's a read of the ConversionHost table in the rails console. It shows that the conversion hosts are configured with max concurrent migrations = 10 (max_concurrent_tasks: 10):

irb(main):004:0> ConversionHost.all.each { |ch| puts "[#{ch.id}] #{ch.name}" }
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
[1] host_mixed_2
[2] host_mixed_1
=> [#<ConversionHost id: 1, name: "host_mixed_2", address: nil, type: nil, resource_type: "Host", resource_id: 3, version: nil, max_concurrent_tasks: 10, vddk_transport_supported: true, ssh_transport_supported: nil, created_at: "2019-06-02 14:57:58", updated_at: "2019-06-02 14:58:30", concurrent_transformation_limit: nil, cpu_limit: nil, memory_limit: nil, network_limit: nil, blockio_limit: nil>, #<ConversionHost id: 2, name: "host_mixed_1", address: nil, type: nil, resource_type: "Host", resource_id: 2, version: nil, max_concurrent_tasks: 10, vddk_transport_supported: true, ssh_transport_supported: nil, created_at: "2019-06-02 18:23:25", updated_at: "2019-06-02 18:24:01", concurrent_transformation_limit: nil, cpu_limit: nil, memory_limit: nil, network_limit: nil, blockio_limit: nil>]



Here's a read of the provider's max concurrent VM migrations value ("Max Transformation Runners") in the rails console, showing it is configured to 20:

root@acanan-rhevm vmdb]# rails c
Loading production environment (Rails 5.0.7.2)
irb(main):001:0> $evm = MiqAeMethodService::MiqAeService.new(MiqAeEngine::MiqAeWorkspaceRuntime.new)
=> #<MiqAeMethodService::MiqAeService:0x0000000002a27b80 @tracking_label=nil, @drb_server_references=[], @inputs={}, @workspace=#<MiqAeEngine::MiqAeWorkspaceRuntime:0x0000000002a2eea8 @readonly=false, @nodes=[], @current=[], @datastore_cache={}, @class_methods={}, @dom_search=#<MiqAeEngine::MiqAeDomainSearch:0x0000000002a2dbc0 @fqns_id_cache={}, @fqns_id_class_cache={}, @partial_ns=[], @prepend_namespace=nil>, @persist_state_hash={}, @current_state_info={}, @state_machine_objects=[], @ae_user=nil, @rbac=false, @lookup_hash={}>, @persist_state_hash={}, @logger=#<VMDBLogger:0x00000000027ac310 @level=1, @progname=nil, @default_formatter=#<Logger::Formatter:0x00000000027ac1d0 @datetime_format=nil>, @formatter=#<VMDBLogger::Formatter:0x00000000027ac018 @datetime_format=nil>, @logdev=#<Logger::LogDevice:0x00000000027ac130 @shift_period_suffix="%Y%m%d", @shift_size=1048576, @shift_age=0, @filename=#<Pathname:/var/www/miq/vmdb/log/automation.log>, @dev=#<File:/var/www/miq/vmdb/log/automation.log>, @mon_owner=nil, @mon_count=0, @mon_mutex=#<Thread::Mutex:0x00000000027ac0b8>>, @write_lock=#<Thread::Mutex:0x00000000027a7f68>, @local_levels={}, @thread_hash_level_key=:"ThreadSafeLogger#20799880@level">>

irb(main):003:0> $evm.vmdb(:ext_management_system).find_by(:name => "RHV").custom_get("Max Transformation Runners")
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
PostgreSQLAdapter#log_after_checkin, connection_pool: size: 5, connections: 1, in use: 0, waiting_in_queue: 0
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
PostgreSQLAdapter#log_after_checkin, connection_pool: size: 5, connections: 1, in use: 0, waiting_in_queue: 0
PostgreSQLAdapter#log_after_checkout, connection_pool: size: 5, connections: 1, in use: 1, waiting_in_queue: 0
PostgreSQLAdapter#log_after_checkin, connection_pool: size: 5, connections: 1, in use: 0, waiting_in_queue: 0
=> "20"

Comment 11 Ilanit Stein 2019-06-09 19:15:18 UTC
Here's another test, in which the conversion host max concurrent tasks value is not honored:
 
VMware->RHV VM migration of 10 VMs, 100GB disk (66% usage), with 2 conversion hosts, VDDK.
The migration took 3 hours.

Each conversion host is set to max_concurrent_tasks = 5.

However:
7 VMs were migrated to conversion host #1
3 VMs were migrated to conversion host #2

While each conversion host should have migrated at most 5 VMs, conversion host #1 had 7 VMs migrating in parallel.

Comment 12 Ilanit Stein 2019-06-09 19:30:00 UTC
Adding versions to comment #11: CFME-5.10.5.1/RHV-4.3.4

Comment 14 Ilanit Stein 2019-06-10 10:30:56 UTC
Adding the regression keyword, since a 20-VM migration used to be divided well between the 2 available conversion hosts,
whereas now all 20 VMs were directed to only one conversion host, though there are 2.

Comment 16 Ilanit Stein 2019-06-12 14:01:06 UTC
Based on the results of testing 20-VM and 10-VM migrations on CFME-5.10.5.1/5.10.6.0,
we see:

1. Poor balancing of the number of migrated VMs between the conversion hosts.

2. max_concurrent_tasks per conversion host is not honored.
For example, with 10 VMs migrated and 2 conversion hosts with max_concurrent_tasks=5,
the migration result was: 7 VMs migrated to one conversion host,
and 3 VMs migrated to the second conversion host.

Comment 17 Daniel Berger 2019-06-13 14:52:04 UTC
In progress: https://github.com/ManageIQ/manageiq/pull/18860

Comment 21 Fabien Dupont 2019-06-19 15:13:49 UTC
*** Bug 1719700 has been marked as a duplicate of this bug. ***

Comment 26 Ilanit Stein 2019-08-07 09:50:12 UTC
Avital,

I reviewed the 1.2 docs.

Comments:

1. Regarding the concurrent migrations - I think it would be better to mention this part
at the beginning of CHAPTER 3. MIGRATING THE VIRTUAL MACHINES,
because usually this is set (if desired, as it is optional, of course) before the migration plan is created & started
(though of course you can change it on the fly too, as mentioned in the doc).

2. In the known issues section, the cancel migration bug appears twice:
BZ#1666799 - correct bug.
BZ#666799 - redundant & incorrect bug id.

Comment 27 Ilanit Stein 2019-08-12 13:10:07 UTC
Tested on these versions:

CFME-5.11.0.18.20190806180636_1dd6378
RHV-4.3.5.3-0.1.el7
RHV hosts (2, serving as conversion hosts):
* Special packages of libguestfs, libguestfs-tools-c, virt-v2v, python-libguestfs: 1.40.2-5.el7.1.bz1680361.v3.1.x86_64
* OS Version: RHEL - 7.7 - 9.el7
* OS Description: Red Hat Enterprise Linux Server 7.7 Beta (Maipo)
* Kernel Version: 3.10.0 - 957.21.3.el7.x86_64
* KVM Version: 2.12.0 - 33.el7
* LIBVIRT Version: libvirt-4.5.0-23.el7
* VDSM Version: vdsm-4.30.19-1.el7ev

Did 2 runs of 20 VMs, once with 100GB disks and once with 20GB disks.
Conversion host max concurrent tasks = 10.
Provider max concurrent migrations = 20.

In both runs, all 20 VMs failed to migrate due to this new ovirt-engine bug:
Bug 1740021 - [v2v][Scale][RHV] 20 VMs migration fail on "timed out waiting for disk to become unlocked"

Regarding the VM distribution:
In the first run, the distribution was 8 and 12 VMs.
In the second run, the distribution was 12 and 8 VMs.

The CFME log is set to 'debug' mode.

Attached evm.log of the 2 runs.

Fabien/Dan,
Can you please advise why the distribution is not even (10:10), as expected and as we saw in past versions?

Comment 28 Ilanit Stein 2019-08-12 13:14:43 UTC
Created attachment 1602934 [details]
evm.log1.tgz

Comment 29 Ilanit Stein 2019-08-12 13:15:28 UTC
Created attachment 1602935 [details]
evm.log2.tgz

Comment 30 Ilanit Stein 2019-08-20 15:13:46 UTC
I checked it on CFME-5.11.0.19/RHV-4.3.5:
I ran a VDDK migration of 20 VMs with 20GB disks on the new RHV environment I got, and the migration passed for all 20 VMs.
Though each of the conversion hosts is set to max_concurrent_tasks=10,
one host got 15 VMs, and the second only 5 VMs.

@Fabien,
Maybe I am missing something about how the conversion hosts are evaluated
(maybe other considerations are taken into account here that I am not aware of).
My understanding is that the VMs should be distributed evenly.
In ALL my runs on the RDU lab RHV systems, the distribution is not even,
unlike what was seen in previous versions of CFME (tested with CFME-5.11.0.18, CFME-5.11.0.19).

Comment 31 Fabien Dupont 2019-08-20 20:28:37 UTC
@ilanit, can we get access to the appliance and run the migration plans on our own?
From the logs, we see that the number of running tasks is not updated, so the least utilized host is not always the same, until the value gets updated.

@dan, can you look into this please?
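
For context, the balancing being discussed here amounts to picking, among the eligible conversion hosts, the one with the fewest active tasks. A simplified sketch of that selection (not the actual ManageIQ code; it assumes an active_tasks helper that returns a host's in-flight migration tasks):

# Simplified illustration, not the actual ManageIQ implementation:
# choose the eligible conversion host with the fewest active tasks,
# skipping hosts that are already at their max_concurrent_tasks limit.
# If the active-task count is stale, the same host keeps being chosen.
def least_utilized_conversion_host(conversion_hosts)
  eligible = conversion_hosts.select { |ch| ch.active_tasks.count < ch.max_concurrent_tasks }
  eligible.min_by { |ch| ch.active_tasks.count }
end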

Comment 32 Adam Grare 2019-08-30 12:50:43 UTC
https://github.com/ManageIQ/manageiq/pull/19213

Comment 33 CFME Bot 2019-08-30 23:45:57 UTC
New commit detected on ManageIQ/manageiq/ivanchuk:

https://github.com/ManageIQ/manageiq/commit/a4010f7a3817ebb25f0d770e1d60700650a4120c
commit a4010f7a3817ebb25f0d770e1d60700650a4120c
Author:     Adam Grare <agrare>
AuthorDate: Fri Aug 30 08:29:27 2019 -0400
Commit:     Adam Grare <agrare>
CommitDate: Fri Aug 30 08:29:27 2019 -0400

    Merge pull request #19213 from djberg96/conversion_host_pending_state

    [V2V] Add pending state as a valid active task.

    (cherry picked from commit c5e268341c65f8c33899a411e46b74d71dc86ffc)

    https://bugzilla.redhat.com/show_bug.cgi?id=1716283

 app/models/conversion_host.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
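
Conceptually, the one-line change in app/models/conversion_host.rb widens the set of task states counted as active so that freshly queued ("pending") tasks are included. An illustrative sketch of the idea (not the literal diff; migration_tasks is an assumed helper name):

# Illustrative sketch only, not the literal change from the commit above.
# Counting 'pending' tasks as active keeps a conversion host from looking idle
# while work is already queued on it, which is what let VMs pile up on one host.
def active_tasks
  # 'migrate' stands in for the pre-existing active state; migration_tasks is
  # an assumed helper returning this host's transformation plan tasks.
  migration_tasks.where(:state => %w[pending migrate])
end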

Comment 34 Ilanit Stein 2019-09-04 05:58:37 UTC
Tested with:
cfme-5.11.0.22 + a fix for this bug
rhv-4.3.5.4-0.1.el7 (small scale)

The v2v migration ended successfully for all of the following tests.
VMs were distributed evenly between the 2 conversion hosts, as expected:

test1: 20 VMs, 16GB disk, 2 conversion hosts, VDDK, provider max concurrent migrations=20, conversion host max concurrent tasks=10
test2: 20 VMs, 16GB disk, 2 conversion hosts, VDDK, provider max concurrent migrations=20, conversion host max concurrent tasks=5
test3: 20 VMs, 100GB disk, 2 conversion hosts, VDDK, provider max concurrent migrations=20, conversion host max concurrent tasks=10

Logs can be found here: https://drive.google.com/drive/u/0/folders/1hO3pvxLMP4SKznVDOTJrWA70_lCudxSw
evm_log1.log - last 10:10 and 5:5 runs, 16GB disk, 20-VM migrations
evm_log2.log - last 10:10 run, 100GB disk, 20-VM migration

Comment 35 Sudhir Mallamprabhakara 2019-09-12 02:15:35 UTC
Ilanit, can this be marked as verified based on comment 34?

Comment 36 Ilanit Stein 2019-09-12 06:20:08 UTC
(In reply to comment #35)

No, because I still need to check that it is working on CFME-5.11.0.23.

Comment 38 Ilanit Stein 2019-09-16 10:23:28 UTC
Verified on CFME-5.11.0.24/RHV-4.3.5.4-0.1.el7.

Tested with 20 VMs, 20GB disk each.
2 conversion hosts.
Conversion host max concurrent tasks = 10
Provider max concurrent migrations = 20

VMs were distributed evenly between the 2 conversion hosts, 10:10.

