Description of problem:
On a VMware->RHV migration (VDDK) of 20 VMs with a 100GB disk each, with "Provider max concurrent VM migrations" set to 20, "Max concurrent VM migrations per conversion host" set to 5, and 2 conversion hosts. Storage type: VMware - iSCSI, RHV - FC.

Migration result: 2 VMs failed, 18 VMs succeeded.

VM distribution between the 2 conversion hosts: on the first round the distribution was 5:5. Then VM "v2v_migration_vm_06" failed on a known bug (see Additional Info below), so one more VM started migrating on the conversion host where that VM had failed, and the distribution was again 5:5. On the second round (when 9 VMs were left to migrate), the distribution was 4:5 (the host with 4 was the host on which "v2v_migration_vm_06" had failed). Then VM "v2v_migration_vm_17" failed with "Error: [start is not permitted at state running]" in evm.log.

The VM migration ended after 5:44 (hh:mm). The successful migrations of the first round took ~1 hour, and those of the second round took ~1 hour. That is, had VM "v2v_migration_vm_17" not failed, the overall time should have been ~2 hours.

Version-Release number of selected component (if applicable):
CFME-5.10.6.1
RHV-4.3.4

How reproducible:
It happened once, and I did not try to reproduce it.

Additional Info:
The second VM, "v2v_migration_vm_06", failed on the known Bug 1719689 - [v2v][RHV][vddk] 1 out of 10 VMs failed migration, on CFME failed to read the state file.
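For illustration, the 5:5 distribution described above is what a simple least-loaded scheduler produces under the per-host concurrency limit. The sketch below is hypothetical, not the actual CFME throttler; the names (pick_host, PER_HOST_LIMIT) and the tie-breaking behavior are assumptions:

```ruby
# Hypothetical sketch of per-conversion-host throttling: each new VM goes
# to the eligible host with the fewest active migrations, capped at the
# "Max concurrent VM migrations per conversion host" limit (5 here).
PER_HOST_LIMIT = 5

def pick_host(hosts)
  # hosts: { host_name => active_migration_count }
  name, active = hosts.min_by { |_n, count| count }
  active < PER_HOST_LIMIT ? name : nil  # nil => the VM waits for a free slot
end

hosts = { "conv_host_1" => 0, "conv_host_2" => 0 }
10.times do
  h = pick_host(hosts)
  hosts[h] += 1 if h
end
puts hosts.inspect  # the first 10 VMs split 5:5 across the two hosts
```

With both hosts saturated (5:5), pick_host returns nil and the remaining VMs queue up, which matches the "rounds" behavior observed in the migration.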
Created attachment 1584633 [details]
evm.log

Migration was on June 25 2019, from 16:45 till 22:30.

[----] E, [2019-06-25T17:42:41.532697 #49653:38af54] ERROR -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [25009], Error: [start is not permitted at state running]
[----] E, [2019-06-25T17:42:41.533066 #49653:38af54] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: start is not permitted at state running  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-25T17:42:41.533380 #49653:38af54] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/job/state_machine.rb:36:in `signal'
[----] E, [2019-06-25T18:25:15.132812 #34735:902310] ERROR -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::EventCatcher::Runner#start_event_monitor) EMS [izuckerm-hosted-engine.qa.lab.tlv.redhat.com] as [admin@internal] Event Monitor Thread aborted because [HTTP response code is 500. The response content type 'text/html; charset=iso-8859-1' isn't XML.]
[----] E, [2019-06-25T18:25:15.132900 #34735:902310] ERROR -- : [OvirtSDK4::Error]: HTTP response code is 500. The response content type 'text/html; charset=iso-8859-1' isn't XML.  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-25T18:25:15.132983 #34735:902310] ERROR -- : /usr/share/gems/gems/ovirt-engine-sdk-4.2.4/lib/ovirtsdk4/connection.rb:385:in `raise_error'
[----] E, [2019-06-25T22:28:38.139415 #13174:38af54] ERROR -- : Q-task_id([r46_service_template_transformation_plan_task_135]) MIQ(MiqAeEngine.deliver) Error delivering {"request"=>"transformation_plan", :service_action=>"Provision"} for object [ServiceTemplateTransformationPlanTask.135] with state [State11] to Automate:
Created attachment 1584641 [details] v2v import wrapper log
Created attachment 1584642 [details] v2v import log
Correction to the tested RHV version: it is RHV-4.3.4.3-0.1.el7, not RHV-4.3.4.
@Daniel, it sounds like a weird transition in the job workflow. You should be able to follow the transitions in evm.log. If not, it probably deserves additional logging.
@Ilanit, how many times did you hit this error?
I shall try to reproduce, and will report back here.

Adding that in evm.log, while the migration was running, there were also errors like this:

[----] E, [2019-06-25T17:45:35.955350 #12387:38af54] ERROR -- net.ssh.authentication.agent[a90031c]: Q-task_id([job_dispatcher]) could not connect to ssh-agent: Agent not configured

I think it appears for the various migrated VMs.
Regarding reproducing Bug 1724040 - [v2v][Scale][RHV] VM migration fail on "start is not permitted at state running" (https://bugzilla.redhat.com/show_bug.cgi?id=1724040):

I ran a migration of 20 VMs, with max concurrent VM migrations per host set to 5 and 2 conversion hosts (the same environment as reported in the bug, and even the exact same VMs imported from VMware). 10 VMs were migrated successfully on the first round, distributed evenly between the 2 conversion hosts (5:5).

Around the time the first round finished and the second round, of the next 10 VMs, started, several errors like this appeared in evm.log:

[----] E, [2019-06-26T08:39:17.273782 #28421:38af54] ERROR -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [47099], Error: [start is not permitted at state running]

and later there was also this error:

[----] E, [2019-06-26T08:54:00.183367 #25018:38af54] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: Could not parse conversion state data from file '/tmp/v2v-import-20190626T153959-4071.state':  Method:[block (2 levels) in <class:LogProxy>]

which was reported in Bug 1719689 - [v2v][RHV][vddk] 1 out of 10 VMs failed migration, on CFME failed to read the state file.

A few minutes later, presumably unrelated to the v2v run, the connection to the RHV engine went down, and eventually all 10 VMs failed migration. evm.log.2 attached.
Created attachment 1584843 [details]
evm.log.2

VM migration from June 26 2019, 7:32 till ~10:00.
The state file parsing error ties back to https://github.com/ManageIQ/manageiq-gems-pending/pull/437

@Fabien, regarding the ssh-agent errors, you said that setting use_agent to false got rid of those, didn't it?
@Dan, I confirm that it got rid of those messages. However, I don't think they are related to the current issue. In all my tests, I've seen them while the migration was successful. But I do agree that we should get rid of them :) Maybe another BZ/PR to track it.
@Fabien, @Dan, Bug 1724487 - [v2v] While VMs are migrated tens of ssh-agent errors appear in evm.log
The "start is not permitted at state running" error does not appear on every VM migration. For example, on the same env for which this bug is reported, I managed to migrate 20 VMs successfully with max concurrent VMs per conversion host set to 10 and 2 conversion hosts. @Fabien, @Daniel B., in addition to the reproduction attempt, is there anything required from the QE side to assist?
@Ilanit, could you give us access to the reproducer environment?
In evm.log.2 that was attached, I can see this, don't know if it's relevant:

Disks transformation failed.  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-26T09:55:01.972658 #23290:38af54] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/service_template_transformation_plan_task.rb:225:in `get_conversion_state'

Also, can you try applying this patch to the MiqSshUtil.rb file? It should give us a better idea of what's actually failing instead of those parse errors.
https://github.com/ManageIQ/manageiq-gems-pending/pull/437/files
This error doesn't fail the migration. It looks like a race condition where the job's state is already 'running' while the throttler tries to send the 'start' message. Moving it to 5.11 and lowering priority to medium; this will give us more time to investigate the root cause.
@Ilanit, have you been able to reproduce this BZ on the latest 5.11 builds?
@Fabien, I did not run yet v2v tests on 5.11.
We're going to evolve the state machine for warm migration, so I propose to revisit this once it's done. Moving to 5.11.z.
In reply to comment #20:
For CFME-5.11.0.18, there are several occurrences of this error in evm.log during a parallel migration of 20 VMs with a 20GB disk each. For example:

[----] E, [2019-08-12T07:48:11.775892 #60725:2ac513a565c4] ERROR -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [89973], Error: [start is not permitted at state running]
[----] E, [2019-08-12T07:48:11.776047 #60725:2ac513a565c4] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: start is not permitted at state running  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-08-12T07:48:11.776192 #60725:2ac513a565c4] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/job/state_machine.rb:36:in `signal'
https://github.com/ManageIQ/manageiq/pull/19255
The last patch was tested recently on CFME-5.11.0.23/RHV-4.3.6. The bug reproduced when a single VM with a 100GB disk (70% usage) was migrated. The job ID was displayed in evm.log.
Was it reproduced in the latest CFME 5.11?
I did not encounter this error recently on CFME-5.11.5.1 or on recent ManageIQ nightlies (last used: master.20200423003043_87de914b8b). The bug can be closed; in case we encounter it again, we'll update the bug.