1724040 – [v2v][Scale][RHV] VM migration fail on "start is not permitted at state running"

Bug 1724040 - [v2v][Scale][RHV] VM migration fail on "start is not permitted at state running"

Summary: [v2v][Scale][RHV] VM migration fail on "start is not permitted at state running"

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	V2V
Sub Component:
Version:	unspecified
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	GA
Target Release:	5.11.z
Assignee:	Fabien Dupont
QA Contact:	Ilanit Stein
Docs Contact:	Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2019-06-26 06:44 UTC by Ilanit Stein
Modified:	2020-04-28 17:21 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-04-28 17:21:15 UTC
Category:	Bug
Cloudforms Team:	V2V
Target Upstream Version:
Embargoed:
Flags:	mfeifer: mirror+

Attachments	(Terms of Use)
evm.log (8.59 MB, application/gzip) 2019-06-26 06:52 UTC, Ilanit Stein	no flags	Details
v2v import wrapper log (21.38 KB, text/plain) 2019-06-26 07:38 UTC, Ilanit Stein	no flags	Details
v2v import log (991.96 KB, application/gzip) 2019-06-26 07:38 UTC, Ilanit Stein	no flags	Details
evm.log.2 (4.43 MB, application/gzip) 2019-06-26 15:26 UTC, Ilanit Stein	no flags	Details
View All

Description Ilanit Stein 2019-06-26 06:44:06 UTC

Description of problem:
On VMware->RHV migration, VDDK, of 20 VMs/100GB disk, and Provider max concurrent VM migrations: 20, and Max concurrent VM migrations, 2 conversion hosts, per conversion host: 5.
Storage type: VMware- ISCSI, RHV- FC.

Migration result:
_2_ VMs failed
18 VMs succeeded.

VMs distribution between the 2 conversion hosts:
On first round the distribution was 5:5, 
then VM "v2v_migration_vm_06" failed on a known bug (See Additional Info bellow).
So 1 more VM stated to be migrated to the conversion host, where this VM failed.
Therefore, distribution was again 5:5.
On the second round (when there were 9 VMs left to migrate),
the distribution was 4:5 (the host with the 4, was the host where VM "v2v_migration_vm_06" failed on).
Then VM "v2v_migration_vm_17" failed.

VM "v2v_migration_vm_17" failed on:
 "Error: [start is not permitted at state running]", in evm.log.
The vm migration ended up after 5:44 (hh:mm).
The successful Migration of the VMs in the first round took ~ 1hour, 
and the successful migration of the VMs in the second, took 1 hour.
That is if VM "v2v_migration_vm_17" was not failing, the overall time was suppose to be 2 hours.

Version-Release number of selected component (if applicable):
CFME-5.10.6.1
RHV-4.3.4

How reproducible:
It happen once, and I did not try to reproduce it.

Additional Info:
The second VM, "v2v_migration_vm_06",  failed on this known Bug 1719689 - [v2v][RHV][vddk] 1 out of 10 VMs failed migration, on CFME failed to read the state file.

Comment 2 Ilanit Stein 2019-06-26 06:52:03 UTC

Created attachment 1584633 [details]
evm.log

Migration was on June 25 2019. From 16:45 till 22:30.

[----] E, [2019-06-25T17:42:41.532697 #49653:38af54] ERROR -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [25009], Error: [start is not permitted at state running]
[----] E, [2019-06-25T17:42:41.533066 #49653:38af54] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: start is not permitted at state running  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-25T17:42:41.533380 #49653:38af54] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/job/state_machine.rb:36:in `signal'


[----] E, [2019-06-25T18:25:15.132812 #34735:902310] ERROR -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::EventCatcher::Runner#start_event_monitor) EMS [izuckerm-hosted-engine.qa.lab.tlv.redhat.com] as [admin@internal] Event Monitor Thread aborted because [HTTP response code is 500. The response content type 'text/html; charset=iso-8859-1' isn't XML.]
[----] E, [2019-06-25T18:25:15.132900 #34735:902310] ERROR -- : [OvirtSDK4::Error]: HTTP response code is 500. The response content type 'text/html; charset=iso-8859-1' isn't XML.  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-25T18:25:15.132983 #34735:902310] ERROR -- : /usr/share/gems/gems/ovirt-engine-sdk-4.2.4/lib/ovirtsdk4/connection.rb:385:in `raise_error'


[----] E, [2019-06-25T22:28:38.139415 #13174:38af54] ERROR -- : Q-task_id([r46_service_template_transformation_plan_task_135]) MIQ(MiqAeEngine.deliver) Error delivering {"request"=>"transformation_plan", :service_action=>"Provision"} for object [ServiceTemplateTransformationPlanTask.135] with state [State11] to Automate:

Comment 3 Ilanit Stein 2019-06-26 07:38:17 UTC

Created attachment 1584641 [details]
v2v import wrapper log

Comment 4 Ilanit Stein 2019-06-26 07:38:53 UTC

Created attachment 1584642 [details]
v2v import log

Comment 5 Ilanit Stein 2019-06-26 09:01:32 UTC

Correction to the tested RHV version. It is: RHV-4.3.4.3-0.1.el7 and not RHV-4.3.4

Comment 6 Fabien Dupont 2019-06-26 09:24:21 UTC

@Daniel, it sounds like a weird transition in the job workflow. You should be able to follow the transitions in evm.log. If not, it probably deserves additional logging.

Comment 7 Fabien Dupont 2019-06-26 09:31:29 UTC

@Ilanit, how many times did you hit this error ?

Comment 8 Ilanit Stein 2019-06-26 10:52:49 UTC

I shall try to reproduce, and report here back.

Adding that in evm.log, while the migration was running, there were also such errors:
[----] E, [2019-06-25T17:45:35.955350 #12387:38af54] ERROR -- net.ssh.authentication.agent[a90031c]: Q-task_id([job_dispatcher]) could not connect to ssh-agent: Agent not configured

I think it is for the various migrated VMs.

Comment 9 Ilanit Stein 2019-06-26 15:18:33 UTC

Regarding reproduce of Bug 1724040 - [v2v][Scale][RHV] VM migration fail on "start is not permitted at state running"  (https://bugzilla.redhat.com/show_bug.cgi?id=1724040)

I run migration, of 20 vms, with max concurrent VM migrations per host 5, with 2 conversion hosts.
(same env as reported in the bug, and even exact Vms imported form VMware.

10 VMs were migrated successfully on the first round, distributed evenly between the 2 conversion hosts 5:5.

Around the time that the first round was done, and the second round, of the next 10 VMs started,
several such errors appeared in the evm.log:
[----] E, [2019-06-26T08:39:17.273782 #28421:38af54] ERROR -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [47099], Error: [start is not permitted at state running]

and later also there was this error:
[----] E, [2019-06-26T08:54:00.183367 #25018:38af54] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: Could not parse conversion state data from file '/tmp/v2v-import-20190626T153959-4071.state':   Method:[block (2 levels) in <class:LogProxy>]
that was reported in Bug 1719689 - [v2v][RHV][vddk] 1 out of 10 VMs failed migration, on CFME failed to read the state file.

Few minutes later, assuming it is not related to the v2v run, the connection to the RHV engine went down. 
and eventually, all 10 VMs failed migration.

evm.log.2 attached

Comment 10 Ilanit Stein 2019-06-26 15:26:26 UTC

Created attachment 1584843 [details]
evm.log.2

VM migration from Jun2 26 2019 7:32 till ~10:00

Comment 11 Daniel Berger 2019-06-26 15:53:56 UTC

The state file parsing error ties back to https://github.com/ManageIQ/manageiq-gems-pending/pull/437

@Fabien, regarding the ssh-agent errors, you said that setting use_agent to false got rid of those didn't it?

Comment 12 Fabien Dupont 2019-06-26 15:57:19 UTC

@Dan, I confirm that it got rid of those messages. However, I don't think they are related to the current issue.
In all my tests, I've seen them while the migration was successful.
But I do agree that we should get rid of them :) Maybe another BZ/PR to track it.

Comment 13 Ilanit Stein 2019-06-27 07:50:44 UTC

@Fabien, @Dan,
Bug 1724487 - [v2v] While VMs are migrated tens of ssh-agent errors appear in evm.log

Comment 14 Ilanit Stein 2019-06-30 17:21:27 UTC

The "start is not permitted at state running" error is not appearing on each VMs migration.
For example, on same env, for which bug is reported, I managed to migrate successfully 20 VMs, with max concurrent VMs per conversion host: 10, and 2 conversion hosts.

@Fabien, @Daniel B., 
In addition to the reproduce trial, Is there anything required from QE side to assist?

Comment 15 Fabien Dupont 2019-07-01 06:28:29 UTC

@Ilanit, could you give us access to the reproducer environment ?

Comment 17 Daniel Berger 2019-07-01 12:56:35 UTC

In evm.log.2 that was attached, I can see this, don't know if it's relevant:

a Disks transformation failed.  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-26T09:55:01.972658 #23290:38af54] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/service_template_transformation_plan_task.rb:225:in `get_conversion_state'

Also, can you try applying this patch to the MiqSshUtil.rb file? It should give us a better idea of what's actually failing instead of those parse errors.

https://github.com/ManageIQ/manageiq-gems-pending/pull/437/files

Comment 19 Fabien Dupont 2019-07-02 09:38:45 UTC

This error doesn't fail the migration. It looks like a race condition where the job's state in already 'running', while the throttler tries to send the 'start' message.
Moving it to 5.11 and lowering priority to medium. This will give us more time to investigate the root cause.

Comment 20 Fabien Dupont 2019-07-29 15:38:26 UTC

@ilanit, have you been able to reproduce this BZ on latest 5.11 builds ?

Comment 21 Ilanit Stein 2019-07-30 12:42:12 UTC

@Fabien,
I did not run yet v2v tests on 5.11.

Comment 22 Fabien Dupont 2019-07-31 09:58:25 UTC

We're going to evolve the state machine for warm migration, so I propose to revisit this once it's done.
Moving to 5.11.z.

Comment 23 Ilanit Stein 2019-08-13 06:31:56 UTC

In reply to comment #20:

For CFME-5.11.0.18,
there are several occurrences of this error in evm.log,
during 20 VMs of 20GB disk each parallel migration.

For example:
----] E, [2019-08-12T07:48:11.775892 #60725:2ac513a565c4] ERROR -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [89973], Error: [start is not permitted at state running]
[----] E, [2019-08-12T07:48:11.776047 #60725:2ac513a565c4] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: start is not permitted at state running  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-08-12T07:48:11.776192 #60725:2ac513a565c4] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/job/state_machine.rb:36:in `signal'

Comment 25 Daniel Berger 2019-09-06 12:45:38 UTC

https://github.com/ManageIQ/manageiq/pull/19255

Comment 26 Ilanit Stein 2019-09-18 14:08:44 UTC

Last patch was tested recently on CFME-5.11.0.23/RHV-4.3.6
Bug reproduced when a single VM, with 100GB disk (70% usage) was migrated.
The Job ID was displayed in the evm.log.

Comment 27 Fabien Dupont 2020-04-10 10:03:43 UTC

Was it reproduced in latest CFME 5.11?

Comment 28 Ilanit Stein 2020-04-28 10:59:55 UTC

I did not encounter this error recently on CFME-5.11.5.1 or recent ManageIQ nighties (last used: master.20200423003043_87de914b8b).
Bug can be closed, and in case we'll encounter it again, we'll update the bug.

Note You need to log in before you can comment on or make changes to this bug.