1510069 – Service Template Provision Task Failing When Picked Up by Appliance in Wrong Zone

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Automate
Sub Component:
Version:	5.8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	GA
Target Release:	5.10.0
Assignee:	Greg McCullough
QA Contact:	Shveta
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1530674 1565248

Reported:	2017-11-06 16:26 UTC by myoder
Modified:	2021-06-10 13:29 UTC
CC List:	10 users
Fixed In Version:	5.10.0.0
Doc Type:
Doc Text:
Clone Of:
Clones:	1530674 1565248
Environment:
Last Closed:	2018-06-21 20:36:31 UTC
Category:	Bug
Cloudforms Team:	CFME Core
Target Upstream Version:
Embargoed:

Attachments
provision test script (3.68 KB, text/plain) 2017-12-11 21:28 UTC, Lucy Fu
new messages in miq_provision_virt_workflow.rb from 5.8.2.3 (39.95 KB, text/plain) 2017-12-20 14:43 UTC, Lucy Fu

Comment 12 Lucy Fu 2017-12-11 21:28:42 UTC

Created attachment 1366312
provision test script

Comment 18 Lucy Fu 2017-12-20 14:43:51 UTC

Created attachment 1370497
new messages in miq_provision_virt_workflow.rb from 5.8.2.3

Comment 19 CFME Bot 2017-12-20 15:46:47 UTC

https://github.com/ManageIQ/manageiq/pull/16702

Comment 20 CFME Bot 2017-12-21 15:56:43 UTC

New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/9c94f30b232d09be49b6d0952bf25f0243573bb1

commit 9c94f30b232d09be49b6d0952bf25f0243573bb1
Author:     Lucy Fu <lufu>
AuthorDate: Wed Dec 20 09:50:58 2017 -0500
Commit:     Lucy Fu <lufu>
CommitDate: Thu Dec 21 10:29:51 2017 -0500

    Fix allowed_vlans to call preload correctly.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1510069

 app/models/miq_provision_virt_workflow.rb       | 2 +-
 spec/models/miq_provision_virt_workflow_spec.rb | 6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

Comment 29 mkanoor 2018-02-20 15:02:37 UTC

Because of the network latency between the two zones, the service
provisioning process is not getting started within the 10-minute window allocated
for a task to complete. Since CFME is highly asynchronous in task management, it has
strict constraints on how long a task can run, and that timeout is 10 minutes.
Longer-running tasks can be made asynchronous by exiting at appropriate points,
called states in the Automate model. The states can be restarted at different times.

So the suggestion is to force a retry in pre4/pre5 before we start the service
provisioning.


In the following instance:

CBTS-Public/Service/Provisioning/StateMachines/ServiceProvision_Template/VMware_Build_VMProvisionRequest

find an open slot before the provision state:
provision        /Service/Provisioning/StateMachines/Methods/Provision

I am guessing it would be pre4 or pre5.

Set its value to:
METHOD::set_retry_once

This calls a method that triggers a restart of the state, which keeps the task
controller from terminating the task as non-responsive.

Add a new Ruby method called set_retry_once in the class
CBTS-Public/Service/Provisioning/StateMachines/ServiceProvision_Template

The Ruby method set_retry_once looks like this:

#
# Description: This method sets the retry once to force a break in the 
#              processing of long running tasks.
#
$evm.log(:info, "Checking if retry needs to be set")
if $evm.state_var_exist?('retry_once')
  ae_result = 'ok'
else
  $evm.set_state_var('retry_once', '1')
  ae_result = 'retry'
  $evm.log(:info, 'setting a retry once in the beginning')
end
$evm.root['ae_result'] = ae_result
$evm.root['retry_interval'] = 1.minute
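
The behavior of this method across two invocations can be checked outside of CFME with a small stand-in for $evm. The following is only a minimal sketch: FakeEvm is a hypothetical stub, not part of CFME, that mimics just the log, state_var_exist?, set_state_var and root calls used above. The retry interval is written as 60 seconds because 1.minute relies on ActiveSupport, which plain Ruby does not load.

```ruby
# Hypothetical stand-in for the Automate $evm object (not the real CFME service).
class FakeEvm
  attr_reader :root, :messages

  def initialize
    @root = {}        # mimics $evm.root
    @state_vars = {}  # assumed to persist across retries of the same state
    @messages = []
  end

  def log(level, msg)
    @messages << [level, msg]
  end

  def state_var_exist?(name)
    @state_vars.key?(name)
  end

  def set_state_var(name, value)
    @state_vars[name] = value
  end
end

# Same logic as set_retry_once, wrapped in a method so it can be called twice.
def set_retry_once(evm)
  evm.log(:info, 'Checking if retry needs to be set')
  if evm.state_var_exist?('retry_once')
    ae_result = 'ok'
  else
    evm.set_state_var('retry_once', '1')
    ae_result = 'retry'
    evm.log(:info, 'setting a retry once in the beginning')
  end
  evm.root['ae_result'] = ae_result
  evm.root['retry_interval'] = 60 # seconds; stands in for 1.minute
  ae_result
end

evm = FakeEvm.new
first_result  = set_retry_once(evm) # state var missing, so a retry is requested
second_result = set_retry_once(evm) # state var now present, so the state proceeds
puts "first: #{first_result}, second: #{second_result}"
```

The state variable is what makes the retry happen only once: it is set on the first pass and found on the second, so the state machine moves on to the provision state after a single break.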

Comment 31 mkanoor 2018-02-27 14:52:54 UTC

Once the change has been made in the Automate database, you don't have to restart the servers; Automate model changes are picked up automatically at runtime.

Comment 33 mkanoor 2018-02-27 15:57:35 UTC

I am not sure what the recommended way of doing this is.
Can you check with Josh?

One way I can think of is to turn off the Automate role in the CIN zone, which will end up routing all the Automate work to CAR. It might overwhelm the CAR zone, so it might have to be done during off-peak hours.

How was this tested with Lucy's fix? What was done to route the work to the CAR zone during that testing?

Comment 34 Greg McCullough 2018-03-14 16:20:23 UTC

Since the work starts as a generic service, it is not tied to any zone, just to the Automate role. As suggested above, one possibility would be to disable the Automate role in the other zone to force the work to the CAR zone.
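
The routing described here can be pictured with a toy model. This is only a sketch under stated assumptions: Server, eligible_servers and the zone and roles fields are hypothetical names chosen for illustration and are not CFME's queue implementation. It shows that a work item with no zone can be picked up by any server holding the automate role, so removing that role from the CIN servers leaves only CAR servers eligible.

```ruby
# Hypothetical toy model; not CFME's actual queue code.
Server = Struct.new(:name, :zone, :roles)

# A work item with zone nil is not tied to a zone; only the role has to match.
def eligible_servers(servers, role:, zone: nil)
  servers.select do |s|
    s.roles.include?(role) && (zone.nil? || s.zone == zone)
  end
end

servers = [
  Server.new('cin-1', 'CIN', ['automate']),
  Server.new('car-1', 'CAR', ['automate'])
]

# Generic service work: any automate server, in either zone, may pick it up.
before = eligible_servers(servers, role: 'automate').map(&:name)

# Disable the Automate role in the CIN zone.
servers.each { |s| s.roles.delete('automate') if s.zone == 'CIN' }
after = eligible_servers(servers, role: 'automate').map(&:name)

puts "before: #{before.inspect}"
puts "after:  #{after.inspect}"
```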

Maybe this is something the customer could attempt during off-hours to avoid performance issues.

Comment 36 mkanoor 2018-05-07 15:37:14 UTC

Hi Michael,
Any updates on this one from the customer? Were they able to test the suggestions?
Thanks,
Madhu

Comment 38 Greg McCullough 2018-06-08 21:11:35 UTC

Moving to POST based on performance changes from Comment #19.
