Bug 1755632 - Retry interval is not set in VMCheckTransformed
Summary: Retry interval is not set in VMCheckTransformed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: V2V
Version: 5.10.8
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: GA
: 5.10.12
Assignee: Fabien Dupont
QA Contact: Shveta
Red Hat CloudForms Documentation
URL:
Whiteboard:
: 1755030 1755460 1771043 (view as bug list)
Depends On:
Blocks: 1755030
TreeView+ depends on / blocked
 
Reported: 2019-09-25 21:51 UTC by Fabien Dupont
Modified: 2023-03-24 15:31 UTC (History)
11 users (show)

Fixed In Version: 5.10.12.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-06 08:58:33 UTC
Category: Bug
Cloudforms Team: V2V
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
CFME_UI_fix.png (94.18 KB, image/png)
2019-10-03 10:34 UTC, Ilanit Stein
no flags Details

Description Fabien Dupont 2019-09-25 21:51:30 UTC
Description of problem:
So far, we believed that the default retry interval in Automate was 1 minute, and experimentation tended to confirm this. Apparently, it's not the case, as we saw retries every ~4 seconds when no retry interval was specified. This leads to having a TTL for VMCheckTransformed state of about 1 hour and 40 minutes, instead of 25 hours. The consequence is that long migrations fail.


Version-Release number of selected component (if applicable): CFME 5.10.8

Comment 3 Ilanit Stein 2019-10-03 10:07:20 UTC
*** Bug 1755460 has been marked as a duplicate of this bug. ***

Comment 4 Ilanit Stein 2019-10-03 10:34:06 UTC
This bug was reproduced by v2v migration of a RHEL-7.7 VM, with 1 TB 66% usage.
(See also bug 1755460, that was marked as duplicate, on this current bug).
Transport method: VDDK.
The disk transfer failed after 94% of it was copied.
CFME-5.11.0.24
RHV-4.3.5.4-0.1.el7

Logs:  https://drive.google.com/drive/u/0/folders/1gX4wwdi79pBz356Q6xbr3pQMP6Lu4dC5

journal.log analysis by 
Richard W.M. Jones 2019-10-02 15:37:39 UTC:
"
This is the same as the cezdata bug, because in the journal we can
see that CFME is logging in and sending a kill signal to virt-v2v (or
perhaps to a related process):

Sep 25 02:04:49 leopard03.qa.lab.tlv.redhat.com sshd[36961]: Accepted publickey 
for root from 10.35.161.44 port 44844 ssh2: RSA SHA256:g5Bu+bCcBH/JcIKTqU/bwBNOk
ydRI04Stgrxk2CygOY
Sep 25 02:04:50 leopard03.qa.lab.tlv.redhat.com systemd-logind[3281]: New sessio
n 1977 of user root.
Sep 25 02:04:50 leopard03.qa.lab.tlv.redhat.com systemd[1]: Started Session 1977
 of user root.
Sep 25 02:04:50 leopard03.qa.lab.tlv.redhat.com sshd[36961]: pam_unix(sshd:session): session opened for user root by (uid=0)
Sep 25 02:04:50 leopard03.qa.lab.tlv.redhat.com sudo[36981]:     root : TTY=unknown ; PWD=/root ; USER=root ; COMMAND=/bin/kill -s TERM 14115

and we can see from the wrapper log that virt-v2v is reported dead about a second later:

2019-09-25 02:04:51,424:INFO: virt-v2v terminated with return code 15 (virt-v2v-wrapper:1914)
"

evm.log analysis by 
Nenad Peric, Oct 03 2019:
"
you have a retry time of 25 -12 seconds => 13 seconds
here's the error which finally kills it (evm.log):
[----] I, [2019-09-24T19:04:28.727377 #6882:2ae53265c5c4]  INFO -- : Q-task_id([r9_service_template_transformation_plan_task_28]) MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>28, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-09-24 19:45:55 UTC", :ae_state_retries=>1501, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n  ae_state: State11\n  ae_state_retries: 1501\n  ae_state_started: 2019-09-24 19:45:55 UTC\n  ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n  ae_state: State5\n  ae_state_retries: 1501\n  ae_state_started: 2019-09-24 19:46:00 UTC\n  ae_state_max_retries: 1500\n"} for object [ServiceTemplateTransformationPlanTask.28] with state [State11] to Automate for delivery in [0] seconds


The important part is
ae_state: State11\n  ae_state_retries: 1501\n  ae_state_started: 2019-09-24 19:45:55 UTC\n  ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n  ae_state: State5\n  ae_state_retries: 1501\n  ae_state_started: 2019-09-24 19:46:00 UTC\n  ae_state_max_retries: 1500\n"}
yes, evm.log
the point is these retries were much further apart in the past. In your log I can see that:

1st:   2019-09-24T19:04:18.634285
2nd:  2019-09-24T19:04:25.176280
meaning you have around 7 seconds between retries.. Which is 10 times faster (almost) than presumed 60 seconds.

It hit this part:
ae_state_retries: 1501\n  ae_state_started: 2019-09-24 19:46:00 UTC\n  ae_state_max_retries: 1500\n"
that is it hit the retry number 1501, and the max is 1500

It then initiates:
virtv2v_graceful_kill
and finally:
[----] I, [2019-09-24T19:04:49.536282 #6882:2ae541ca1b78]  INFO -- : MIQ(ServiceTemplateTransformationPlanTask#get_conversion_state) InfraConversionJob get_conversion_state to update_options: {:virtv2v_pid=>14115, :virtv2v_disks=>[{:path=
>"[HP-iscsi-1] vm_1tb/vm_1tb.vmdk", :size=>1099511627776, :percent=>94.02, :weight=>100.0}]}
[----] I, [2019-09-24T19:04:49.574309 #6882:2ae541ca1b78]  INFO -- : MIQ(ServiceTemplateTransformationPlanTask#kill_virtv2v) Killing virt-v2v (PID: 14115) with TERM signal.
so we see here that percent is 94.02
but the next command at that time is Killing virt-v2v (PID: 14115) with TERM signal.

In essence it can be fixed manually in CFME UI here: see CFME_UI_fix.png
"

Comment 5 Ilanit Stein 2019-10-03 10:34:39 UTC
Created attachment 1622259 [details]
CFME_UI_fix.png

Comment 6 Shveta 2019-10-03 18:33:24 UTC
Hi Fabien , 

Just to confirm , To verify this BZ we need to check that retries are 1 min apart right ?
(Without making any change to automate method )

Thanks,
Shveta

Comment 7 Fabien Dupont 2019-10-04 17:41:05 UTC
Exactly.

Comment 9 Shveta 2019-10-08 20:16:30 UTC
EVm Logs show retries are minimum 1 min apart : 

[----] I, [2019-10-08T14:10:39.860569 #16378:1040f50]  INFO -- : Q-task_id([r1_service_template_transformation_plan_task_1]) MIQ(MiqQueue.put) Message id: [1534],  id: [], Zone: [default], Role: [automate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [100], State: [ready], Deliver On: [2019-10-08 18:11:39 UTC], Data: [], Args: [{:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>1, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-10-08 18:10:34 UTC", :ae_state_retries=>1, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n  ae_state: State11\n  ae_state_retries: 1\n  ae_state_started: 2019-10-08 18:10:34 UTC\n  ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n  ae_state: State5\n  ae_state_retries: 1\n  ae_state_started: 2019-10-08 18:10:37 UTC\n  ae_state_max_retries: 1500\n"}]


[----] I, [2019-10-08T14:11:49.202259 #16386:1040f50]  INFO -- : Q-task_id([r1_service_template_transformation_plan_task_1]) MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>1, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-10-08 18:10:34 UTC", :ae_state_retries=>2, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n  ae_state: State11\n  ae_state_retries: 2\n  ae_state_started: 2019-10-08 18:10:34 UTC\n  ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n  ae_state: State5\n  ae_state_retries: 2\n  ae_state_started: 2019-10-08 18:10:37 UTC\n  ae_state_max_retries: 1500\n"} for object [ServiceTemplateTransformationPlanTask.1] with state [State11] to Automate for delivery in [60] seconds


[----] I, [2019-10-08T14:12:57.822583 #16378:1040f50]  INFO -- : Q-task_id([r1_service_template_transformation_plan_task_1]) MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>1, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-10-08 18:10:34 UTC", :ae_state_retries=>3, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n  ae_state: State11\n  ae_state_retries: 3\n  ae_state_started: 2019-10-08 18:10:34 UTC\n  ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n  ae_state: State5\n  ae_state_retries: 3\n  ae_state_started: 2019-10-08 18:10:37 UTC\n  ae_state_max_retries: 1500\n"} for object [ServiceTemplateTransformationPlanTask.1] with state [State11] to Automate for delivery in [60] seconds


[----] I, [2019-10-08T14:14:02.625858 #16386:1040f50]  INFO -- : Q-task_id([r1_service_template_transformation_plan_task_1]) MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>1, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-10-08 18:10:34 UTC", :ae_state_retries=>4, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n  ae_state: State11\n  ae_state_retries: 4\n  ae_state_started: 2019-10-08 18:10:34 UTC\n  ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n  ae_state: State5\n  ae_state_retries: 4\n  ae_state_started: 2019-10-08 18:10:37 UTC\n  ae_state_max_retries: 1500\n"} for object [ServiceTemplateTransformationPlanTask.1] with state [State11] to Automate for delivery in [60] seconds


Please check if they need to be exactly 1 min apart or this looks good ?

Comment 10 Fabien Dupont 2019-10-08 20:20:55 UTC
There can be some slight delay for the retry to be scheduled, so it's normal that it's not exactly 1 minute.
What is important is that it says "[State11] to Automate for delivery in [60] seconds".

Comment 11 Shveta 2019-10-09 00:45:07 UTC
Verified in 5.10.12.0.20191007204014_0d55c84

Comment 12 Shveta 2019-10-23 18:21:22 UTC
Hi Fabien , 

I am seeing 'Automate for delivery in [0] seconds'
in  build  5.11.0.27.20191001204742_d9cf38b

This bug needs to be cloned for 5.11 ?

[----] I, [2019-10-23T13:55:43.169808 #13785:2b1af97445d4]  INFO -- : Q-task_id([r11_service_template_transformation_plan_task_11]) MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>11, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-10-23 17:55:16 UTC", :ae_state_retries=>3, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n  ae_state: State11\n  ae_state_retries: 3\n  ae_state_started: 2019-10-23 17:55:16 UTC\n  ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n  ae_state: State5\n  ae_state_retries: 3\n  ae_state_started: 2019-10-23 17:55:21 UTC\n  ae_state_max_retries: 1500\n"} for object [ServiceTemplateTransformationPlanTask.11] with state [State11] to Automate for delivery in [0] seconds

Comment 13 Fabien Dupont 2019-10-23 18:48:19 UTC
It's already fixed upstream, and the associated BZ are targeted for 5.11.1 because they remove Automate from the equation.

Comment 17 errata-xmlrpc 2019-11-06 08:58:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3268

Comment 18 Fabien Dupont 2019-11-27 07:16:38 UTC
*** Bug 1771043 has been marked as a duplicate of this bug. ***

Comment 19 Fabien Dupont 2020-04-10 13:46:40 UTC
*** Bug 1755030 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.