Bug 1724040
Summary: | [v2v][Scale][RHV] VM migration fail on "start is not permitted at state running" | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat CloudForms Management Engine | Reporter: | Ilanit Stein <istein> | ||||||||||
Component: | V2V | Assignee: | Fabien Dupont <fdupont> | ||||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Ilanit Stein <istein> | ||||||||||
Severity: | medium | Docs Contact: | Red Hat CloudForms Documentation <cloudforms-docs> | ||||||||||
Priority: | medium | ||||||||||||
Version: | unspecified | CC: | bthurber, fdupont | ||||||||||
Target Milestone: | GA | ||||||||||||
Target Release: | 5.11.z | ||||||||||||
Hardware: | Unspecified | ||||||||||||
OS: | Unspecified | ||||||||||||
Whiteboard: | |||||||||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||||
Doc Text: | Story Points: | --- | |||||||||||
Clone Of: | Environment: | ||||||||||||
Last Closed: | 2020-04-28 17:21:15 UTC | Type: | Bug | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | Bug | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | V2V | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Attachments: |
|
Description
Ilanit Stein
2019-06-26 06:44:06 UTC
Created attachment 1584633 [details]
evm.log
Migration was on June 25 2019. From 16:45 till 22:30.
[----] E, [2019-06-25T17:42:41.532697 #49653:38af54] ERROR -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [25009], Error: [start is not permitted at state running]
[----] E, [2019-06-25T17:42:41.533066 #49653:38af54] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: start is not permitted at state running Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-25T17:42:41.533380 #49653:38af54] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/job/state_machine.rb:36:in `signal'
[----] E, [2019-06-25T18:25:15.132812 #34735:902310] ERROR -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::EventCatcher::Runner#start_event_monitor) EMS [izuckerm-hosted-engine.qa.lab.tlv.redhat.com] as [admin@internal] Event Monitor Thread aborted because [HTTP response code is 500. The response content type 'text/html; charset=iso-8859-1' isn't XML.]
[----] E, [2019-06-25T18:25:15.132900 #34735:902310] ERROR -- : [OvirtSDK4::Error]: HTTP response code is 500. The response content type 'text/html; charset=iso-8859-1' isn't XML. Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2019-06-25T18:25:15.132983 #34735:902310] ERROR -- : /usr/share/gems/gems/ovirt-engine-sdk-4.2.4/lib/ovirtsdk4/connection.rb:385:in `raise_error'
[----] E, [2019-06-25T22:28:38.139415 #13174:38af54] ERROR -- : Q-task_id([r46_service_template_transformation_plan_task_135]) MIQ(MiqAeEngine.deliver) Error delivering {"request"=>"transformation_plan", :service_action=>"Provision"} for object [ServiceTemplateTransformationPlanTask.135] with state [State11] to Automate:
Created attachment 1584641 [details]
v2v import wrapper log
Created attachment 1584642 [details]
v2v import log
Correction to the tested RHV version. It is: RHV-4.3.4.3-0.1.el7 and not RHV-4.3.4 @Daniel, it sounds like a weird transition in the job workflow. You should be able to follow the transitions in evm.log. If not, it probably deserves additional logging. @Ilanit, how many times did you hit this error ? I shall try to reproduce, and report here back. Adding that in evm.log, while the migration was running, there were also such errors: [----] E, [2019-06-25T17:45:35.955350 #12387:38af54] ERROR -- net.ssh.authentication.agent[a90031c]: Q-task_id([job_dispatcher]) could not connect to ssh-agent: Agent not configured I think it is for the various migrated VMs. Regarding reproduce of Bug 1724040 - [v2v][Scale][RHV] VM migration fail on "start is not permitted at state running" (https://bugzilla.redhat.com/show_bug.cgi?id=1724040) I run migration, of 20 vms, with max concurrent VM migrations per host 5, with 2 conversion hosts. (same env as reported in the bug, and even exact Vms imported form VMware. 10 VMs were migrated successfully on the first round, distributed evenly between the 2 conversion hosts 5:5. Around the time that the first round was done, and the second round, of the next 10 VMs started, several such errors appeared in the evm.log: [----] E, [2019-06-26T08:39:17.273782 #28421:38af54] ERROR -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [47099], Error: [start is not permitted at state running] and later also there was this error: [----] E, [2019-06-26T08:54:00.183367 #25018:38af54] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: Could not parse conversion state data from file '/tmp/v2v-import-20190626T153959-4071.state': Method:[block (2 levels) in <class:LogProxy>] that was reported in Bug 1719689 - [v2v][RHV][vddk] 1 out of 10 VMs failed migration, on CFME failed to read the state file. Few minutes later, assuming it is not related to the v2v run, the connection to the RHV engine went down. and eventually, all 10 VMs failed migration. evm.log.2 attached Created attachment 1584843 [details]
evm.log.2
VM migration from Jun2 26 2019 7:32 till ~10:00
The state file parsing error ties back to https://github.com/ManageIQ/manageiq-gems-pending/pull/437 @Fabien, regarding the ssh-agent errors, you said that setting use_agent to false got rid of those didn't it? @Dan, I confirm that it got rid of those messages. However, I don't think they are related to the current issue. In all my tests, I've seen them while the migration was successful. But I do agree that we should get rid of them :) Maybe another BZ/PR to track it. @Fabien, @Dan, Bug 1724487 - [v2v] While VMs are migrated tens of ssh-agent errors appear in evm.log The "start is not permitted at state running" error is not appearing on each VMs migration. For example, on same env, for which bug is reported, I managed to migrate successfully 20 VMs, with max concurrent VMs per conversion host: 10, and 2 conversion hosts. @Fabien, @Daniel B., In addition to the reproduce trial, Is there anything required from QE side to assist? @Ilanit, could you give us access to the reproducer environment ? In evm.log.2 that was attached, I can see this, don't know if it's relevant: a Disks transformation failed. Method:[block (2 levels) in <class:LogProxy>] [----] E, [2019-06-26T09:55:01.972658 #23290:38af54] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/service_template_transformation_plan_task.rb:225:in `get_conversion_state' Also, can you try applying this patch to the MiqSshUtil.rb file? It should give us a better idea of what's actually failing instead of those parse errors. https://github.com/ManageIQ/manageiq-gems-pending/pull/437/files This error doesn't fail the migration. It looks like a race condition where the job's state in already 'running', while the throttler tries to send the 'start' message. Moving it to 5.11 and lowering priority to medium. This will give us more time to investigate the root cause. @ilanit, have you been able to reproduce this BZ on latest 5.11 builds ? @Fabien, I did not run yet v2v tests on 5.11. We're going to evolve the state machine for warm migration, so I propose to revisit this once it's done. Moving to 5.11.z. In reply to comment #20: For CFME-5.11.0.18, there are several occurrences of this error in evm.log, during 20 VMs of 20GB disk each parallel migration. For example: ----] E, [2019-08-12T07:48:11.775892 #60725:2ac513a565c4] ERROR -- : Q-task_id([job_dispatcher]) MIQ(MiqQueue#deliver) Message id: [89973], Error: [start is not permitted at state running] [----] E, [2019-08-12T07:48:11.776047 #60725:2ac513a565c4] ERROR -- : Q-task_id([job_dispatcher]) [RuntimeError]: start is not permitted at state running Method:[block (2 levels) in <class:LogProxy>] [----] E, [2019-08-12T07:48:11.776192 #60725:2ac513a565c4] ERROR -- : Q-task_id([job_dispatcher]) /var/www/miq/vmdb/app/models/job/state_machine.rb:36:in `signal' Last patch was tested recently on CFME-5.11.0.23/RHV-4.3.6 Bug reproduced when a single VM, with 100GB disk (70% usage) was migrated. The Job ID was displayed in the evm.log. Was it reproduced in latest CFME 5.11? I did not encounter this error recently on CFME-5.11.5.1 or recent ManageIQ nighties (last used: master.20200423003043_87de914b8b). Bug can be closed, and in case we'll encounter it again, we'll update the bug. |