Bug 1581465 - Ansible Job Times out at 300 seconds causing Automate State Machine to Fail
Summary: Ansible Job Times out at 300 seconds causing Automate State Machine to Fail
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Automate
Version: 5.9.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.10.0
Assignee: Bill Wei
QA Contact: Dmitry Misharov
URL:
Whiteboard:
Depends On:
Blocks: 1583851
 
Reported: 2018-05-22 20:32 UTC by myoder
Modified: 2022-03-13 15:01 UTC
CC List: 6 users

Fixed In Version: 5.10.0.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1583851
Environment:
Last Closed: 2019-02-11 14:01:33 UTC
Category: Bug
Cloudforms Team: Ansible
Target Upstream Version:
Embargoed:



Description myoder 2018-05-22 20:32:54 UTC
Description of problem:

Have seen this issue with 2 customers.

Recreated this locally, but not consistently.  I was using embedded ansible with a playbook containing a sleep command that exceeds 300 seconds.  The job times out at 300 seconds, even though the automate task shouldn't time out until 600 seconds.  The ansible job completes even with two 600-second sleep commands, yet the automate state machine fails after 300 seconds.


State Machine is failing with this error:

~~~
[----] E, [2018-05-15T18:39:27.030107 #8075:1261118] ERROR -- : State=<execute_state> running  raised exception: <job timed out after 346.181547958 seconds of inactivity.  Inactivity threshold [300 seconds]>
~~~



I've tracked down the queue message that reports the ansible job exceeding 300 seconds.  However, the role is 'smartstate', which seems incorrect:

~~~
[----] I, [2018-05-15T18:38:59.282029 #13841:1261118]  INFO -- : MIQ(MiqQueue.put) Message id: [1000000003242],  id: [], Zone: [default], Role: [smartstate], Server: [], Ident: [generic], Target id: [], Instance id: [1000000000053], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["job timed out after 346.181547958 seconds of inactivity.  Inactivity threshold [300 seconds]", "error"]
[----] I, [2018-05-15T18:39:05.321065 #13841:1261118]  INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [1000000003242], MiqWorker id: [1000000000022], Zone: [default], Role: [smartstate], Server: [], Ident: [generic], Target id: [], Instance id: [1000000000053], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: ["job timed out after 346.181547958 seconds of inactivity.  Inactivity threshold [300 seconds]", "error"], Dequeued in: [6.045622154] seconds
[----] I, [2018-05-15T18:39:05.321329 #13841:1261118]  INFO -- : MIQ(MiqQueue#deliver) Message id: [1000000003242], Delivering...
[----] I, [2018-05-15T18:39:05.441823 #13841:1261118]  INFO -- : MIQ(MiqQueue#delivered) Message id: [1000000003242], State: [ok], Delivered in [0.120482425] seconds
~~~



This appears to be the ansible job being picked up from the queue:

~~~
[----] I, [2018-05-15T18:38:55.436187 #8075:1261118]  INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [1000000003232], MiqWorker id: [1000000000021], Zone: [], Role: [embedded_ansible], Server: [], Ident: [generic], Target id: [], Instance id: [1000000000053], Task id: [], Command: [ManageIQ::Providers::EmbeddedAnsible::AutomationManager::PlaybookRunner.signal], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [2018-05-15 22:38:52 UTC], Data: [], Args: [:poll_ansible_tower_job_status, 120], Dequeued in: [62.5220199] seconds
[----] I, [2018-05-15T18:38:55.872985 #8075:1261118]  INFO -- : MIQ(MiqQueue.put) Message id: [1000000003237],  id: [], Zone: [], Role: [embedded_ansible], Server: [], Ident: [generic], Target id: [], Instance id: [1000000000053], Task id: [], Command: [ManageIQ::Providers::EmbeddedAnsible::AutomationManager::PlaybookRunner.signal], Timeout: [600], Priority: [100], State: [ready], Deliver On: [2018-05-15 22:39:55 UTC], Data: [], Args: [:poll_ansible_tower_job_status, 120]
~~~


From what I can tell, the job is instantiated from this file /app/models/job.rb and is given a default timeout of 300 seconds:

~~~
  DEFAULT_TIMEOUT = 300

...

  def self.current_job_timeout(_timeout_adjustment = 1)
    DEFAULT_TIMEOUT
  end
~~~
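
For illustration only, here is a minimal Ruby sketch of how a watchdog built on current_job_timeout could produce the Job.signal_abort message seen in the queue log above.  This is not the ManageIQ implementation: the check_inactivity method, its argument, and the local signal_abort stand-in are hypothetical; only DEFAULT_TIMEOUT, current_job_timeout, and the error text come from the excerpts in this bug.

~~~
# Hedged sketch, not the ManageIQ code: a watchdog that compares a job's
# idle time against Job.current_job_timeout and aborts it when exceeded.
class Job
  DEFAULT_TIMEOUT = 300 # seconds of inactivity before the job is aborted

  # The adjustment argument is ignored, so every job falls back to the
  # 300-second default regardless of any longer automate retry window.
  def self.current_job_timeout(_timeout_adjustment = 1)
    DEFAULT_TIMEOUT
  end

  # Hypothetical watchdog check: abort the job once it has been idle
  # longer than the threshold.
  def check_inactivity(last_activity_at)
    inactive_for = Time.now.utc - last_activity_at
    threshold    = self.class.current_job_timeout
    return if inactive_for <= threshold

    signal_abort("job timed out after #{inactive_for} seconds of inactivity.  " \
                 "Inactivity threshold [#{threshold} seconds]", "error")
  end

  # Stand-in for the real signal_abort delivered via MiqQueue; it simply
  # raises so the sketch is self-contained.
  def signal_abort(message, _severity)
    raise message
  end
end

# A job idle for 346 seconds trips the 300-second threshold, matching the
# error in the automate log.
begin
  Job.new.check_inactivity(Time.now.utc - 346)
rescue RuntimeError => e
  puts e.message
end
~~~

If that is the mechanism, nothing in this path consults the state machine's longer retry window, which would explain why the automate step fails after roughly 300 seconds even though the playbook itself finishes.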


Version-Release number of selected component (if applicable):
CFME 5.9.1

How reproducible:
Customers seem to hit this consistently with service provisioning.


Steps to Reproduce:
1. Turn on embedded ansible and create a playbook that will exceed the 300 second timeout
2. Use the ansible playbook as a method in an automate state machine
3. Run the automate state machine

Actual results:
The state machine stops because the job times out, even though the ansible playbook finishes.

Expected results:
The state machine should not fail before the 600-second (10-minute) timeout has passed.

Additional info:
I have recreated this locally on appliance 10.13.145.95.  The logs evm.log-20180516 and automation.log-20180516 contain the errors shown in this bug.

Comment 4 CFME Bot 2018-05-29 18:51:31 UTC
New commit detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/bfb3c7a65c1e7126a7caf7257a18e4fbd23123e3
commit bfb3c7a65c1e7126a7caf7257a18e4fbd23123e3
Author:     Bill Wei <bilwei>
AuthorDate: Thu May 24 14:06:14 2018 -0400
Commit:     Bill Wei <bilwei>
CommitDate: Thu May 24 14:06:14 2018 -0400

    Honor user provided execution_ttl option

    fixes https://bugzilla.redhat.com/show_bug.cgi?id=1581465

 app/models/manageiq/providers/embedded_ansible/automation_manager/playbook_runner.rb | 7 +
 spec/models/manageiq/providers/embedded_ansible/automation_manager/playbook_runner_spec.rb | 18 +
 2 files changed, 25 insertions(+)
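
Based on the commit title, the runner now prefers a user-supplied execution_ttl over the Job default.  The following is a rough Ruby sketch of that idea, not the merged patch: the simplified PlaybookRunner stand-in, the options handling, and the minutes-to-seconds conversion are assumptions.

~~~
# Hedged sketch of the idea behind the fix, not the actual
# playbook_runner.rb change.  Minimal Job stand-in so the example runs
# on its own:
class Job
  DEFAULT_TIMEOUT = 300

  def self.current_job_timeout(_timeout_adjustment = 1)
    DEFAULT_TIMEOUT
  end
end

# The runner prefers a caller-supplied execution_ttl (assumed here to be
# given in minutes) over the 300-second default.
class PlaybookRunner < Job
  def initialize(options = {})
    @options = options
  end

  def current_job_timeout(_timeout_adjustment = 1)
    ttl_minutes = @options[:execution_ttl].to_i
    ttl_minutes.positive? ? ttl_minutes * 60 : DEFAULT_TIMEOUT
  end
end

puts PlaybookRunner.new(:execution_ttl => 20).current_job_timeout  # => 1200
puts PlaybookRunner.new.current_job_timeout                        # => 300
~~~

With a timeout like this, a playbook service ordered with a 20-minute execution_ttl would no longer be aborted at the 300-second default.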

Comment 7 Dmitry Misharov 2018-07-24 11:41:04 UTC
Fixed and verified in 5.10.0.4.20180712211305_e6e4542. A state machine with a long-running ansible playbook works and no longer fails.

