Description of problem: So far, we believed that the default retry interval in Automate was 1 minute, and experimentation tended to confirm this. Apparently, it's not the case, as we saw retries every ~4 seconds when no retry interval was specified. This leads to having a TTL for VMCheckTransformed state of about 1 hour and 40 minutes, instead of 25 hours. The consequence is that long migrations fail. Version-Release number of selected component (if applicable): CFME 5.10.8
https://github.com/ManageIQ/manageiq-content/pull/585
*** Bug 1755460 has been marked as a duplicate of this bug. ***
This bug was reproduced by v2v migration of a RHEL-7.7 VM, with 1 TB 66% usage. (See also bug 1755460, that was marked as duplicate, on this current bug). Transport method: VDDK. The disk transfer failed after 94% of it was copied. CFME-5.11.0.24 RHV-4.3.5.4-0.1.el7 Logs: https://drive.google.com/drive/u/0/folders/1gX4wwdi79pBz356Q6xbr3pQMP6Lu4dC5 journal.log analysis by Richard W.M. Jones 2019-10-02 15:37:39 UTC: " This is the same as the cezdata bug, because in the journal we can see that CFME is logging in and sending a kill signal to virt-v2v (or perhaps to a related process): Sep 25 02:04:49 leopard03.qa.lab.tlv.redhat.com sshd[36961]: Accepted publickey for root from 10.35.161.44 port 44844 ssh2: RSA SHA256:g5Bu+bCcBH/JcIKTqU/bwBNOk ydRI04Stgrxk2CygOY Sep 25 02:04:50 leopard03.qa.lab.tlv.redhat.com systemd-logind[3281]: New sessio n 1977 of user root. Sep 25 02:04:50 leopard03.qa.lab.tlv.redhat.com systemd[1]: Started Session 1977 of user root. Sep 25 02:04:50 leopard03.qa.lab.tlv.redhat.com sshd[36961]: pam_unix(sshd:session): session opened for user root by (uid=0) Sep 25 02:04:50 leopard03.qa.lab.tlv.redhat.com sudo[36981]: root : TTY=unknown ; PWD=/root ; USER=root ; COMMAND=/bin/kill -s TERM 14115 and we can see from the wrapper log that virt-v2v is reported dead about a second later: 2019-09-25 02:04:51,424:INFO: virt-v2v terminated with return code 15 (virt-v2v-wrapper:1914) " evm.log analysis by Nenad Peric, Oct 03 2019: " you have a retry time of 25 -12 seconds => 13 seconds here's the error which finally kills it (evm.log): [----] I, [2019-09-24T19:04:28.727377 #6882:2ae53265c5c4] INFO -- : Q-task_id([r9_service_template_transformation_plan_task_28]) MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>28, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-09-24 19:45:55 UTC", :ae_state_retries=>1501, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n ae_state: State11\n ae_state_retries: 1501\n ae_state_started: 2019-09-24 19:45:55 UTC\n ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n ae_state: State5\n ae_state_retries: 1501\n ae_state_started: 2019-09-24 19:46:00 UTC\n ae_state_max_retries: 1500\n"} for object [ServiceTemplateTransformationPlanTask.28] with state [State11] to Automate for delivery in [0] seconds The important part is ae_state: State11\n ae_state_retries: 1501\n ae_state_started: 2019-09-24 19:45:55 UTC\n ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n ae_state: State5\n ae_state_retries: 1501\n ae_state_started: 2019-09-24 19:46:00 UTC\n ae_state_max_retries: 1500\n"} yes, evm.log the point is these retries were much further apart in the past. In your log I can see that: 1st: 2019-09-24T19:04:18.634285 2nd: 2019-09-24T19:04:25.176280 meaning you have around 7 seconds between retries.. Which is 10 times faster (almost) than presumed 60 seconds. It hit this part: ae_state_retries: 1501\n ae_state_started: 2019-09-24 19:46:00 UTC\n ae_state_max_retries: 1500\n" that is it hit the retry number 1501, and the max is 1500 It then initiates: virtv2v_graceful_kill and finally: [----] I, [2019-09-24T19:04:49.536282 #6882:2ae541ca1b78] INFO -- : MIQ(ServiceTemplateTransformationPlanTask#get_conversion_state) InfraConversionJob get_conversion_state to update_options: {:virtv2v_pid=>14115, :virtv2v_disks=>[{:path= >"[HP-iscsi-1] vm_1tb/vm_1tb.vmdk", :size=>1099511627776, :percent=>94.02, :weight=>100.0}]} [----] I, [2019-09-24T19:04:49.574309 #6882:2ae541ca1b78] INFO -- : MIQ(ServiceTemplateTransformationPlanTask#kill_virtv2v) Killing virt-v2v (PID: 14115) with TERM signal. so we see here that percent is 94.02 but the next command at that time is Killing virt-v2v (PID: 14115) with TERM signal. In essence it can be fixed manually in CFME UI here: see CFME_UI_fix.png "
Created attachment 1622259 [details] CFME_UI_fix.png
Hi Fabien , Just to confirm , To verify this BZ we need to check that retries are 1 min apart right ? (Without making any change to automate method ) Thanks, Shveta
Exactly.
EVm Logs show retries are minimum 1 min apart : [----] I, [2019-10-08T14:10:39.860569 #16378:1040f50] INFO -- : Q-task_id([r1_service_template_transformation_plan_task_1]) MIQ(MiqQueue.put) Message id: [1534], id: [], Zone: [default], Role: [automate], Server: [], MiqTask id: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqAeEngine.deliver], Timeout: [3600], Priority: [100], State: [ready], Deliver On: [2019-10-08 18:11:39 UTC], Data: [], Args: [{:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>1, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-10-08 18:10:34 UTC", :ae_state_retries=>1, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n ae_state: State11\n ae_state_retries: 1\n ae_state_started: 2019-10-08 18:10:34 UTC\n ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n ae_state: State5\n ae_state_retries: 1\n ae_state_started: 2019-10-08 18:10:37 UTC\n ae_state_max_retries: 1500\n"}] [----] I, [2019-10-08T14:11:49.202259 #16386:1040f50] INFO -- : Q-task_id([r1_service_template_transformation_plan_task_1]) MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>1, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-10-08 18:10:34 UTC", :ae_state_retries=>2, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n ae_state: State11\n ae_state_retries: 2\n ae_state_started: 2019-10-08 18:10:34 UTC\n ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n ae_state: State5\n ae_state_retries: 2\n ae_state_started: 2019-10-08 18:10:37 UTC\n ae_state_max_retries: 1500\n"} for object [ServiceTemplateTransformationPlanTask.1] with state [State11] to Automate for delivery in [60] seconds [----] I, [2019-10-08T14:12:57.822583 #16378:1040f50] INFO -- : Q-task_id([r1_service_template_transformation_plan_task_1]) MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>1, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-10-08 18:10:34 UTC", :ae_state_retries=>3, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n ae_state: State11\n ae_state_retries: 3\n ae_state_started: 2019-10-08 18:10:34 UTC\n ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n ae_state: State5\n ae_state_retries: 3\n ae_state_started: 2019-10-08 18:10:37 UTC\n ae_state_max_retries: 1500\n"} for object [ServiceTemplateTransformationPlanTask.1] with state [State11] to Automate for delivery in [60] seconds [----] I, [2019-10-08T14:14:02.625858 #16386:1040f50] INFO -- : Q-task_id([r1_service_template_transformation_plan_task_1]) MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>1, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-10-08 18:10:34 UTC", :ae_state_retries=>4, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n ae_state: State11\n ae_state_retries: 4\n ae_state_started: 2019-10-08 18:10:34 UTC\n ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n ae_state: State5\n ae_state_retries: 4\n ae_state_started: 2019-10-08 18:10:37 UTC\n ae_state_max_retries: 1500\n"} for object [ServiceTemplateTransformationPlanTask.1] with state [State11] to Automate for delivery in [60] seconds Please check if they need to be exactly 1 min apart or this looks good ?
There can be some slight delay for the retry to be scheduled, so it's normal that it's not exactly 1 minute. What is important is that it says "[State11] to Automate for delivery in [60] seconds".
Verified in 5.10.12.0.20191007204014_0d55c84
Hi Fabien , I am seeing 'Automate for delivery in [0] seconds' in build 5.11.0.27.20191001204742_d9cf38b This bug needs to be cloned for 5.11 ? [----] I, [2019-10-23T13:55:43.169808 #13785:2b1af97445d4] INFO -- : Q-task_id([r11_service_template_transformation_plan_task_11]) MIQ(MiqAeEngine.deliver) Requeuing {:object_type=>"ServiceTemplateTransformationPlanTask", :object_id=>11, :namespace=>"Transformation/StateMachines", :class_name=>"VMTransformation", :instance_name=>"Transformation", :automate_message=>"create", :attrs=>{"request"=>"transformation_plan", :service_action=>"Provision"}, :user_id=>1, :miq_group_id=>2, :tenant_id=>1, :miq_task_id=>nil, :state=>"State11", :ae_fsm_started=>nil, :ae_state_started=>"2019-10-23 17:55:16 UTC", :ae_state_retries=>3, :ae_state_data=>"--- !ruby/hash:ActiveSupport::HashWithIndifferentAccess\nae_state_progress: \nvm_shutdown_in_progress: true\n", :ae_state_previous=>"---\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transformation\":\n ae_state: State11\n ae_state_retries: 3\n ae_state_started: 2019-10-23 17:55:16 UTC\n ae_state_max_retries: 0\n\"/ManageIQ/Transformation/StateMachines/VMTransformation/Transform\":\n ae_state: State5\n ae_state_retries: 3\n ae_state_started: 2019-10-23 17:55:21 UTC\n ae_state_max_retries: 1500\n"} for object [ServiceTemplateTransformationPlanTask.11] with state [State11] to Automate for delivery in [0] seconds
It's already fixed upstream, and the associated BZ are targeted for 5.11.1 because they remove Automate from the equation.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3268
*** Bug 1771043 has been marked as a duplicate of this bug. ***
*** Bug 1755030 has been marked as a duplicate of this bug. ***