Bug 1835226 - ansible-runner playbooks used in provisioning failing due to connectivity issues to SCM cannot recover on task retry
Summary: ansible-runner playbooks used in provisioning failing due to connectivity iss...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Embedded Ansible
Version: 5.11.5
Hardware: All
OS: All
high
high
Target Milestone: GA
: 5.11.10
Assignee: Tina Fitzgerald
QA Contact: Gaurav Talreja
Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-05-13 12:43 UTC by Felix Dewaleyne
Modified: 2020-12-18 16:45 UTC (History)
12 users (show)

Fixed In Version: 5.11.10.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-15 21:17:35 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:
simaishi: cfme-5.11.z+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github ManageIQ manageiq issues 20088 0 None open Embedded Ansible dependent on SCM for every playbook being run 2021-02-17 18:13:47 UTC
Red Hat Product Errata RHSA-2020:5554 0 None None None 2020-12-15 21:17:43 UTC

Description Felix Dewaleyne 2020-05-13 12:43:25 UTC
Description of problem:
When using a playbook as part of an automate provision, if ansible-runner is used to run a playbook and the initial connection to the SCM fails, retries of the task will never succeed, even if connectivity is resumed before all attempts are made.

Version-Release number of selected component (if applicable):
5.11.5

How reproducible:
all the time

Steps to Reproduce:
1.set up an autommate that calls a playbook as part of provisioning (using ansible-runner and a SCM)
2. break connectivity to the SCM
3. initiate a provisioning
4. restore connectivity before all retries are exhausted

Actual results:
failure to connect will cause the task to fail, then be retried.

Expected results:
when connectivity is restored, attempts should not fail anymore.

Additional info:
this issue was also raised upstream : 
https://github.com/ManageIQ/manageiq/issues/20088
https://gitter.im/ManageIQ/manageiq-providers-embedded_ansible?at=5e9f45d5f8b37f6dac6446d2

Comment 4 Nick LaMuro 2020-05-18 15:50:09 UTC
So I gave this some more thought this last week/weekend, and remembered some rational as to why we didn't implement this in the first pass of the EmbeddedAnsible rework.

The gist is an issue with consistency with running playbooks with EmbeddedAnsible with multiple appliances, and that each appliance is in charge of it's own playbook repo cloning.  The reason we decided to force "Update on Launch" as the only option is that if we didn't sync with the source repo first and allow a playbook to run without syncing first, it is possible that one appliance has a stale version of the playbook repo, and another has an updated version.

While this maybe isn't ideal when the SCM server goes down, it ensures consistency to some degree by always assuming that at playbook runtime we are using the latest source for a given run.

In the previous "Tower implementation" of EmbeddedAnsible, this wasn't an issue for two reasons:

- Tower was maintaining the git repos (so, it wasn't our problem)
- There could only be one Tower instance for a given deployment

This means that if contact to the SCM server was cut off, subsequent runs would still be consistent across the board until connection was restored.  A different solution to the same problem.


* * *


There was a discussion about how we might handle this, but it was tabled for a eventual "phase 2" EmbeddedAnsible effort.  Basically it would introduce a Federated git server role that would handle all repo cloning, and there could only be one per installation.  Then any EmbeddedAnsible appliance could talk to that server internally to fetch the git repos from there, and would avoid the need for any of them to reach out externally, which might also be more preferred from a security standpoint.

That said, this implementation is more complex, and the details of "how" we would do that weren't completely ironed out, so this is only an idea at best.  That said, I wouldn't expect a quick solution/patch for this issue to be implemented based on the previous rational described above.

Comment 5 Nick LaMuro 2020-05-18 16:55:16 UTC
Was made aware of a key distinction in the original BZ:

> ... retries of the task will never succeed

Which might be a short term fix we can solve.  I will look into what is done for this and report back.

-Nick

Comment 10 dmetzger 2020-06-22 18:52:35 UTC
A Github Issue has been open to track resolution of the reported defect. Moving to a Github Issue provides access to the larger ManageIQ community for assistance in resolving this defect.

Please refer to https://github.com/ManageIQ/ansible-manageiq-automate/issues/13 for future updates regarding this issue.

Lifecycle page: https://access.redhat.com/support/policy/updates/cloudforms
Directional Update: https://access.redhat.com/articles/4639821
FAQ: https://access.redhat.com/articles/4647061

If you have any concerns about this, please let us know.

Thanks and regards!

Comment 12 Tina Fitzgerald 2020-08-18 16:54:43 UTC
Hi Felix,

Thanks for the customer update. 

We're investigating the issue and will send an update once we have more information.

Thanks,
Tina

Comment 13 Tina Fitzgerald 2020-08-24 19:16:40 UTC
Hi Felix,

Drew created the following PR to resolve the initial connection issue: 
https://github.com/ManageIQ/manageiq/pull/20232

Our initial testing looks good.

Thanks,
Tina

Comment 16 Satoe Imaishi 2020-09-08 15:50:14 UTC
Opened #1876974 for better error handling (Comment 13), putting this back to ON_DEV for the actual issue for clone retries.

Comment 31 CFME Bot 2020-11-09 18:55:12 UTC
New commits detected on ManageIQ/manageiq/ivanchuk:

https://github.com/ManageIQ/manageiq/commit/3ae0f417dccf908881e1922117f6911f06a48f73
commit 3ae0f417dccf908881e1922117f6911f06a48f73
Author:     Keenan Brock <keenan>
AuthorDate: Wed Nov  4 16:36:54 2020 +0000
Commit:     Satoe Imaishi <simaishi>
CommitDate: Mon Nov  9 18:13:25 2020 +0000

    Merge pull request #20738 from djberg96/ansible_playbook_log_stdout

    Update playbook_log_stdout to handle possible nil job

    (cherry picked from commit 349300251dfc90b0933daf3a35faf6abfe9653b2)

    https://bugzilla.redhat.com/show_bug.cgi?id=1835226

 app/models/mixins/ansible_playbook_mixin.rb | 7 +-
 app/models/service_ansible_playbook.rb | 7 +-
 spec/models/mixins/ansible_playbook_mixin_spec.rb | 37 +
 3 files changed, 49 insertions(+), 2 deletions(-)


https://github.com/ManageIQ/manageiq/commit/558a27fd00b895eea52138ae57dfc6c39b769fd0
commit 558a27fd00b895eea52138ae57dfc6c39b769fd0
Author:     Keenan Brock <keenan>
AuthorDate: Wed Nov  4 16:37:35 2020 +0000
Commit:     Satoe Imaishi <simaishi>
CommitDate: Mon Nov  9 18:15:45 2020 +0000

    Merge pull request #20773 from d-m-u/no_postprocess_without_job

    don't call postprocess without job in playbook on_error

    (cherry picked from commit 19f3728f682f98cf77237c0828ec96998516cc46)

    https://bugzilla.redhat.com/show_bug.cgi?id=1835226

 app/models/service_ansible_playbook.rb | 8 +-
 1 file changed, 6 insertions(+), 2 deletions(-)


https://github.com/ManageIQ/manageiq/commit/0ce905f5f9515848587f6cc3225ef9f89881f86c
commit 0ce905f5f9515848587f6cc3225ef9f89881f86c
Author:     Keenan Brock <keenan>
AuthorDate: Wed Nov  4 16:39:58 2020 +0000
Commit:     Satoe Imaishi <simaishi>
CommitDate: Mon Nov  9 18:16:28 2020 +0000

    Merge pull request #20759 from djberg96/better_check_connection

    Add git remote connection check code

    (cherry picked from commit 41f40f81f619562e5334f94543dba8e0da42adb3)

    https://bugzilla.redhat.com/show_bug.cgi?id=1835226

 app/models/git_repository.rb | 23 +
 app/models/service_ansible_playbook.rb | 6 +-
 app/models/service_template_ansible_playbook.rb | 4 +
 spec/models/git_repository_spec.rb | 44 +
 4 files changed, 76 insertions(+), 1 deletion(-)

Comment 33 CFME Bot 2020-11-12 15:50:26 UTC
New commit detected on ManageIQ/manageiq-automation_engine/ivanchuk:

https://github.com/ManageIQ/manageiq-automation_engine/commit/9211d19af417a131df8de43dab18141d5d2af70f
commit 9211d19af417a131df8de43dab18141d5d2af70f
Author:     tina <tfitzger>
AuthorDate: Thu Nov 12 15:06:13 2020 +0000
Commit:     Satoe Imaishi <simaishi>
CommitDate: Thu Nov 12 15:47:31 2020 +0000

    Merge pull request #465 from d-m-u/adding_check_connection_expose

    add expose check_connection to generic service

    (cherry picked from commit e0aa23f340d4fb2e689205697a10a4a251f13311)

    https://bugzilla.redhat.com/show_bug.cgi?id=1835226

 lib/miq_automation_engine/service_models/miq_ae_service_service_generic.rb | 1 +
 1 file changed, 1 insertion(+)

Comment 40 Gaurav Talreja 2020-11-27 16:13:32 UTC
Verified in Version 5.11.10.0.20201116161314_cdd3548

Comment 45 errata-xmlrpc 2020-12-15 21:17:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: CloudForms 5.0.10 security, bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5554


Note You need to log in before you can comment on or make changes to this bug.