Bug 1601538 - Ansible Jobs Causing State Machine to Fail due to Inactivity Threshold Exceeding 0
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Automate
Version: 5.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.10.0
Assignee: mkanoor
QA Contact: Satyajit Bulage
URL:
Whiteboard:
Depends On:
Blocks: 1608368
Reported: 2018-07-16 15:08 UTC by myoder
Modified: 2022-03-13 15:14 UTC

Fixed In Version: 5.10.0.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1608368
Environment:
Last Closed: 2019-02-11 14:06:26 UTC
Category: Bug
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:



Comment 2 mkanoor 2018-07-16 20:42:32 UTC
This bug was introduced by the fix for
https://bugzilla.redhat.com/show_bug.cgi?id=1583851
Before that change there was a default timeout of 300 seconds (5 minutes).
When execution_ttl is not set, it is stored as an empty string.
We convert that empty string to an integer to get the timeout, which yields a timeout of 0 seconds.
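
For illustration, this is how the conversion behaves in plain Ruby. It is a minimal sketch, not the code from playbook_runner.rb, and the variable names are made up for the example.

```ruby
# Illustrative only: a blank execution_ttl as it would be stored.
execution_ttl = ""

# String#to_i returns 0 when the string has no leading digits.
timeout = execution_ttl.to_i
puts timeout # prints 0
```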

When the Scheduler wakes up, it looks for stuck jobs and calls timeout! on them. If the timeout is 0 seconds, the job is terminated right away.
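
A rough sketch of why a 0-second timeout is fatal. The helper below is hypothetical and only assumes that a job counts as stuck once it has been idle longer than its timeout; the real check in the scheduler may differ.

```ruby
# Hypothetical helper, not the ManageIQ implementation: a job counts as
# stuck once it has been idle for longer than its timeout (in seconds).
def timed_out?(last_update, timeout_seconds, now)
  (now - last_update) > timeout_seconds
end

now = Time.now
last_update = now - 1 # job reported activity one second ago

zero_ttl    = timed_out?(last_update, 0, now)   # true: terminated right away
default_ttl = timed_out?(last_update, 600, now) # false: still within 600 seconds
puts zero_ttl, default_ttl
```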

As a workaround, set execution_ttl to 600 whenever you add a new method or update an existing one.

We are working on a fix so that an empty string falls back to the default_timeout, which is 600 seconds.
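
The intended behaviour could look roughly like this. It is a sketch under assumptions: effective_timeout is a made-up name, the 600 default is taken from the comment above, and any unit conversion the real code performs is left out.

```ruby
# Hypothetical helper: fall back to the default when execution_ttl is
# blank (nil, empty, or whitespace only); otherwise use the given value.
def effective_timeout(execution_ttl, default = 600)
  value = execution_ttl.to_s.strip
  value.empty? ? default : value.to_i
end

puts effective_timeout("")    # prints 600
puts effective_timeout(nil)   # prints 600
puts effective_timeout("900") # prints 900
```

Checking for a blank value before calling to_i keeps an explicit setting intact while a missing one no longer collapses to 0.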

Comment 4 CFME Bot 2018-07-17 13:25:56 UTC
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/40abd08a42b1d4aacf4ba027df9d2ec997708e4b
commit 40abd08a42b1d4aacf4ba027df9d2ec997708e4b
Author:     Madhu Kanoor <mkanoor>
AuthorDate: Mon Jul 16 16:42:59 2018 -0400
Commit:     Madhu Kanoor <mkanoor>
CommitDate: Mon Jul 16 16:42:59 2018 -0400

    Allow for empty strings in the execution_ttl field

    https://bugzilla.redhat.com/show_bug.cgi?id=1601538

    An empty string yields a 0 timeout value causing jobs to be
    terminated right away.

 app/models/manageiq/providers/embedded_ansible/automation_manager/playbook_runner.rb | 2 +-
 spec/models/manageiq/providers/embedded_ansible/automation_manager/playbook_runner_spec.rb | 8 +
 2 files changed, 9 insertions(+), 1 deletion(-)

Comment 7 mkanoor 2018-10-02 13:48:17 UTC
You need an Ansible playbook that can sleep for a set amount of time.
There is a sample playbook here:
https://github.com/mkanoor/playbook/blob/master/pkg_info.yaml

It takes three parameters:
user
sleep
pkg

Set sleep to different values (in seconds) to see the timeout behaviour.

This problem manifests itself when the scheduler is looking for stuck jobs that are not responding and tries to terminate them.

Comment 8 Satyajit Bulage 2018-11-16 12:21:45 UTC
Verification Steps:

1. Added the repository for https://github.com/mkanoor/playbook/blob/master/pkg_info.yaml
2. Created a Generic service for Ansible.
3. Tried different sleep times: 200, 450 and 750 seconds.
4. The service finished with an error for 750.
5. The service did not fail for values within 600 seconds.

Verified Version: 5.10.0.23.20181106165157_92dd189

