Bug 1750370

Summary: Ordered Ansible service times out and is not restarted after 3600 seconds.
Product: Red Hat CloudForms Management Engine Reporter: Satyajit Bulage <sbulage>
Component: Automate Assignee: Lucy Fu <lufu>
Status: CLOSED CURRENTRELEASE QA Contact: Satyajit Bulage <sbulage>
Severity: high Docs Contact: Red Hat CloudForms Documentation <cloudforms-docs>
Priority: high    
Version: 5.11.0 CC: bmidwood, dmetzger, gmccullo, lavenel, mkanoor, mshriver, obarenbo, simaishi, tfitzger
Target Milestone: GA Keywords: Regression
Target Release: 5.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 5.11.0.24 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-12-13 14:54:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: Bug
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: CFME Core Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1660803    

Description Satyajit Bulage 2019-09-09 13:20:03 UTC
Description of problem: Create or use an Ansible playbook that sleeps for 60 minutes or more. After the service is ordered, it eventually fails and EVM.log shows the error "restarting worker after [3600] job aborting ..."


Version-Release number of selected component (if applicable): 5.11.0.23.20190904213640_d113674


How reproducible: 100%


Steps to Reproduce:
1. Enable Ansible Server role.
2. Create or use attached playbook
3. Create Ansible service
4. Wait for 60 minutes.

Actual results: job aborting, ansible playbook has been running longer than timeout


Expected results: The service should not fail. The worker should also be restarted if the timeout is exceeded.


Additional info:
Found this BZ while verifying https://bugzilla.redhat.com/show_bug.cgi?id=1660803

Comment 7 Tina Fitzgerald 2019-09-09 15:21:24 UTC
Billy is going to look into this. We looked at the reproducer environment, and max_ttl is not specified.
max_ttl should be specified when playbooks are expected to run for a long time. The max_ttl (maximum time to live) value is used to calculate a retry interval for the playbook service state machine. The retry interval defaults to 1 minute when this value is not specified.

Comment 8 William Fitzgerald 2019-09-09 19:26:21 UTC
Works on the latest 5.10 but fails on 5.11.0.23.

The job ran for about an hour, and then after 58 retries this error was encountered:
"job aborting, ansible playbook has been running longer than timeout"

Error is coming from ansible_runner_workflow.rb line 61.

Lucy is looking into this now
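
As a rough cross-check of the numbers above: assuming each retry waited the default 1-minute interval described in Comment 7 (max_ttl was not set in the reproducer), 58 retries add up to roughly 58 minutes, just under the 3600-second limit quoted in the description. The helper name below is hypothetical and is not part of ManageIQ.

```python
from datetime import timedelta

# Assumption: every retry waits the default 1-minute interval from Comment 7.
DEFAULT_RETRY_INTERVAL = timedelta(minutes=1)
# Timeout value quoted in the error message in the description.
REPORTED_TIMEOUT = timedelta(seconds=3600)


def elapsed_after_retries(retries, interval=DEFAULT_RETRY_INTERVAL):
    """Approximate time spent waiting across `retries` retries (hypothetical helper)."""
    return retries * interval


elapsed = elapsed_after_retries(58)
print(elapsed)                     # 0:58:00
print(REPORTED_TIMEOUT - elapsed)  # 0:02:00
```

This only accounts for the wait between retries, so the real elapsed time would be somewhat longer, which is consistent with the job being aborted at about the one-hour mark.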

Comment 10 CFME Bot 2019-09-11 16:26:20 UTC
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/861cc9990fac39fd0edf3841c9accde4710fcf35
commit 861cc9990fac39fd0edf3841c9accde4710fcf35
Author:     Lucy Fu <lufu>
AuthorDate: Tue Sep 10 10:00:17 2019 -0400
Commit:     Lucy Fu <lufu>
CommitDate: Tue Sep 10 10:00:17 2019 -0400

    Set default timeout to 100 minutes for playbook service and playbook method.

    https://bugzilla.redhat.com/show_bug.cgi?id=1750370

 app/models/manageiq/providers/embedded_ansible/automation_manager/configuration_script.rb | 5 +-
 app/models/manageiq/providers/embedded_ansible/automation_manager/playbook_runner.rb | 2 +-
 spec/models/manageiq/providers/embedded_ansible/automation_manager/configuration_script_spec.rb | 2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)
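
The effect of the new default can be illustrated with a minimal sketch. It is not the ManageIQ implementation: the function name and the comparison are assumptions, the 100-minute default comes from the commit message above, and the 3600-second value is the limit quoted in the original error message.

```python
# Default from the commit message: 100 minutes, expressed in seconds.
DEFAULT_TIMEOUT_SECONDS = 100 * 60
# Limit quoted in the error message from the bug description.
REPORTED_OLD_LIMIT_SECONDS = 3600


def has_timed_out(started_at, now, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Return True when the job has run longer than `timeout` seconds (hypothetical helper)."""
    return (now - started_at) > timeout


started = 0
after_61_minutes = started + 61 * 60

# A playbook that has been running for 61 minutes exceeds the old limit...
print(has_timed_out(started, after_61_minutes, timeout=REPORTED_OLD_LIMIT_SECONDS))  # True
# ...but stays within the 100-minute default.
print(has_timed_out(started, after_61_minutes))  # False
```

Under these assumptions, the 60-minute sleep playbook from the reproducer would finish well inside the new default.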

Comment 11 CFME Bot 2019-09-11 16:41:06 UTC
New commit detected on ManageIQ/manageiq/ivanchuk:

https://github.com/ManageIQ/manageiq/commit/55bc9be7150764c2fa69037d781308bedc7c8327
commit 55bc9be7150764c2fa69037d781308bedc7c8327
Author:     Jason Frey <jfrey>
AuthorDate: Wed Sep 11 12:25:56 2019 -0400
Commit:     Jason Frey <jfrey>
CommitDate: Wed Sep 11 12:25:56 2019 -0400

    Merge pull request #19279 from lfu/ansible_runner_timeout_1750370

    Set default playbook service timeout to 100 minutes

    (cherry picked from commit e1e730fb136f2702a02b04f815eb93f94934a426)

    https://bugzilla.redhat.com/show_bug.cgi?id=1750370

 app/models/manageiq/providers/embedded_ansible/automation_manager/configuration_script.rb | 5 +-
 app/models/manageiq/providers/embedded_ansible/automation_manager/playbook_runner.rb | 2 +-
 spec/models/manageiq/providers/embedded_ansible/automation_manager/configuration_script_spec.rb | 2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)

Comment 12 Satyajit Bulage 2019-09-12 18:19:11 UTC
A playbook with a 60-minute sleep executed without failure. No errors occurred.

Verified Version: 5.11.0.24.20190911182429_55bc9be