Description of problem: Have seen this issue with 2 customers. Recreated this locally but not consistently. I was using embedded ansible and had a playbook that had a sleep command to exceed the 300 seconds. The job is timing out at 300 seconds, however the automate task shouldn't timeout until 600 seconds. The ansible job is completing even with 2 600 second sleep commands, yet the automate state machine fails after 300 seconds. State Machine is failing with this error: ~~~ [----] E, [2018-05-15T18:39:27.030107 #8075:1261118] ERROR -- : State=<execute_state> running raised exception: <job timed out after 346.181547958 seconds of inactivity. Inactivity threshold [300 seconds]> ~~~ I've tracked the message that is reporting the ansible job to be exceeding 300 seconds. However the role is 'smartstate' which seems incorrect: ~~~ [----] I, [2018-05-15T18:38:59.282029 #13841:1261118] INFO -- : MIQ(MiqQueue.put) Message id: [1000000003242], id: [], Zone: [default], Role: [smartstate], Server: [], Ident: [generic], Target id: [], Instance id: [1000000000053], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["job timed out after 346.181547958 seconds of inactivity. Inactivity threshold [300 seconds]", "error"] [----] I, [2018-05-15T18:39:05.321065 #13841:1261118] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [1000000003242], MiqWorker id: [1000000000022], Zone: [default], Role: [smartstate], Server: [], Ident: [generic], Target id: [], Instance id: [1000000000053], Task id: [], Command: [Job.signal_abort], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [], Data: [], Args: ["job timed out after 346.181547958 seconds of inactivity. Inactivity threshold [300 seconds]", "error"], Dequeued in: [6.045622154] seconds [----] I, [2018-05-15T18:39:05.321329 #13841:1261118] INFO -- : MIQ(MiqQueue#deliver) Message id: [1000000003242], Delivering... [----] I, [2018-05-15T18:39:05.441823 #13841:1261118] INFO -- : MIQ(MiqQueue#delivered) Message id: [1000000003242], State: [ok], Delivered in [0.120482425] seconds ~~~ This appears to be the ansible job being picked up from the queue: ~~~ [----] I, [2018-05-15T18:38:55.436187 #8075:1261118] INFO -- : MIQ(MiqGenericWorker::Runner#get_message_via_drb) Message id: [1000000003232], MiqWorker id: [1000000000021], Zone: [], Role: [embedded_ansible], Server: [], Ident: [generic], Target id: [], Instance id: [1000000000053], Task id: [], Command: [ManageIQ::Providers::EmbeddedAnsible::AutomationManager::PlaybookRunner.signal], Timeout: [600], Priority: [100], State: [dequeue], Deliver On: [2018-05-15 22:38:52 UTC], Data: [], Args: [:poll_ansible_tower_job_status, 120], Dequeued in: [62.5220199] seconds [----] I, [2018-05-15T18:38:55.872985 #8075:1261118] INFO -- : MIQ(MiqQueue.put) Message id: [1000000003237], id: [], Zone: [], Role: [embedded_ansible], Server: [], Ident: [generic], Target id: [], Instance id: [1000000000053], Task id: [], Command: [ManageIQ::Providers::EmbeddedAnsible::AutomationManager::PlaybookRunner.signal], Timeout: [600], Priority: [100], State: [ready], Deliver On: [2018-05-15 22:39:55 UTC], Data: [], Args: [:poll_ansible_tower_job_status, 120] ~~~ From what I can tell, the job is instantiated from this file /app/models/job.rb and is given a default timeout of 300 seconds: ===============================> DEFAULT_TIMEOUT = 300 ... def self.current_job_timeout(_timeout_adjustment = 1) DEFAULT_TIMEOUT end ===============================> Version-Release number of selected component (if applicable): CFME 5.9.1 How reproducible: Customers seem to hit this consistently with service provisioning. Steps to Reproduce: 1. Turn on embedded ansible and create a playbook that will exceed the 300 second timeout 2. Use the ansible playbook as a method in an automate state machine 3. Run the automate state machine Actual results: State machine is stopping due to the job timing out, even thought the ansible playbook is finishing Expected results: State machine should not fail if 10 minutes have not passed Additional info: I have recreated this locally on appliance 10.13.145.95. The logs evm.log-20180516 and automation.log-20180516 have the errors shown from this bug.
https://github.com/ManageIQ/manageiq/pull/17476
New commit detected on ManageIQ/manageiq/master: https://github.com/ManageIQ/manageiq/commit/bfb3c7a65c1e7126a7caf7257a18e4fbd23123e3 commit bfb3c7a65c1e7126a7caf7257a18e4fbd23123e3 Author: Bill Wei <bilwei> AuthorDate: Thu May 24 14:06:14 2018 -0400 Commit: Bill Wei <bilwei> CommitDate: Thu May 24 14:06:14 2018 -0400 Honor user provided execution_ttl option fixes https://bugzilla.redhat.com/show_bug.cgi?id=1581465 app/models/manageiq/providers/embedded_ansible/automation_manager/playbook_runner.rb | 7 + spec/models/manageiq/providers/embedded_ansible/automation_manager/playbook_runner_spec.rb | 18 + 2 files changed, 25 insertions(+)
Fixed and verified in 5.10.0.4.20180712211305_e6e4542. A state machine with log running ansible playbook works doesn't fail.