Description of problem: When using a playbook as part of an automate provision, if ansible-runner is used to run a playbook and the initial connection to the SCM fails, retries of the task will never succeed, even if connectivity is resumed before all attempts are made. Version-Release number of selected component (if applicable): 5.11.5 How reproducible: all the time Steps to Reproduce: 1.set up an autommate that calls a playbook as part of provisioning (using ansible-runner and a SCM) 2. break connectivity to the SCM 3. initiate a provisioning 4. restore connectivity before all retries are exhausted Actual results: failure to connect will cause the task to fail, then be retried. Expected results: when connectivity is restored, attempts should not fail anymore. Additional info: this issue was also raised upstream : https://github.com/ManageIQ/manageiq/issues/20088 https://gitter.im/ManageIQ/manageiq-providers-embedded_ansible?at=5e9f45d5f8b37f6dac6446d2
So I gave this some more thought this last week/weekend, and remembered some rational as to why we didn't implement this in the first pass of the EmbeddedAnsible rework. The gist is an issue with consistency with running playbooks with EmbeddedAnsible with multiple appliances, and that each appliance is in charge of it's own playbook repo cloning. The reason we decided to force "Update on Launch" as the only option is that if we didn't sync with the source repo first and allow a playbook to run without syncing first, it is possible that one appliance has a stale version of the playbook repo, and another has an updated version. While this maybe isn't ideal when the SCM server goes down, it ensures consistency to some degree by always assuming that at playbook runtime we are using the latest source for a given run. In the previous "Tower implementation" of EmbeddedAnsible, this wasn't an issue for two reasons: - Tower was maintaining the git repos (so, it wasn't our problem) - There could only be one Tower instance for a given deployment This means that if contact to the SCM server was cut off, subsequent runs would still be consistent across the board until connection was restored. A different solution to the same problem. * * * There was a discussion about how we might handle this, but it was tabled for a eventual "phase 2" EmbeddedAnsible effort. Basically it would introduce a Federated git server role that would handle all repo cloning, and there could only be one per installation. Then any EmbeddedAnsible appliance could talk to that server internally to fetch the git repos from there, and would avoid the need for any of them to reach out externally, which might also be more preferred from a security standpoint. That said, this implementation is more complex, and the details of "how" we would do that weren't completely ironed out, so this is only an idea at best. That said, I wouldn't expect a quick solution/patch for this issue to be implemented based on the previous rational described above.
Was made aware of a key distinction in the original BZ: > ... retries of the task will never succeed Which might be a short term fix we can solve. I will look into what is done for this and report back. -Nick
A Github Issue has been open to track resolution of the reported defect. Moving to a Github Issue provides access to the larger ManageIQ community for assistance in resolving this defect. Please refer to https://github.com/ManageIQ/ansible-manageiq-automate/issues/13 for future updates regarding this issue. Lifecycle page: https://access.redhat.com/support/policy/updates/cloudforms Directional Update: https://access.redhat.com/articles/4639821 FAQ: https://access.redhat.com/articles/4647061 If you have any concerns about this, please let us know. Thanks and regards!
Hi Felix, Thanks for the customer update. We're investigating the issue and will send an update once we have more information. Thanks, Tina
Hi Felix, Drew created the following PR to resolve the initial connection issue: https://github.com/ManageIQ/manageiq/pull/20232 Our initial testing looks good. Thanks, Tina
Opened #1876974 for better error handling (Comment 13), putting this back to ON_DEV for the actual issue for clone retries.
New commits detected on ManageIQ/manageiq/ivanchuk: https://github.com/ManageIQ/manageiq/commit/3ae0f417dccf908881e1922117f6911f06a48f73 commit 3ae0f417dccf908881e1922117f6911f06a48f73 Author: Keenan Brock <keenan> AuthorDate: Wed Nov 4 16:36:54 2020 +0000 Commit: Satoe Imaishi <simaishi> CommitDate: Mon Nov 9 18:13:25 2020 +0000 Merge pull request #20738 from djberg96/ansible_playbook_log_stdout Update playbook_log_stdout to handle possible nil job (cherry picked from commit 349300251dfc90b0933daf3a35faf6abfe9653b2) https://bugzilla.redhat.com/show_bug.cgi?id=1835226 app/models/mixins/ansible_playbook_mixin.rb | 7 +- app/models/service_ansible_playbook.rb | 7 +- spec/models/mixins/ansible_playbook_mixin_spec.rb | 37 + 3 files changed, 49 insertions(+), 2 deletions(-) https://github.com/ManageIQ/manageiq/commit/558a27fd00b895eea52138ae57dfc6c39b769fd0 commit 558a27fd00b895eea52138ae57dfc6c39b769fd0 Author: Keenan Brock <keenan> AuthorDate: Wed Nov 4 16:37:35 2020 +0000 Commit: Satoe Imaishi <simaishi> CommitDate: Mon Nov 9 18:15:45 2020 +0000 Merge pull request #20773 from d-m-u/no_postprocess_without_job don't call postprocess without job in playbook on_error (cherry picked from commit 19f3728f682f98cf77237c0828ec96998516cc46) https://bugzilla.redhat.com/show_bug.cgi?id=1835226 app/models/service_ansible_playbook.rb | 8 +- 1 file changed, 6 insertions(+), 2 deletions(-) https://github.com/ManageIQ/manageiq/commit/0ce905f5f9515848587f6cc3225ef9f89881f86c commit 0ce905f5f9515848587f6cc3225ef9f89881f86c Author: Keenan Brock <keenan> AuthorDate: Wed Nov 4 16:39:58 2020 +0000 Commit: Satoe Imaishi <simaishi> CommitDate: Mon Nov 9 18:16:28 2020 +0000 Merge pull request #20759 from djberg96/better_check_connection Add git remote connection check code (cherry picked from commit 41f40f81f619562e5334f94543dba8e0da42adb3) https://bugzilla.redhat.com/show_bug.cgi?id=1835226 app/models/git_repository.rb | 23 + app/models/service_ansible_playbook.rb | 6 +- app/models/service_template_ansible_playbook.rb | 4 + spec/models/git_repository_spec.rb | 44 + 4 files changed, 76 insertions(+), 1 deletion(-)
https://github.com/ManageIQ/manageiq-automation_engine/pull/465
New commit detected on ManageIQ/manageiq-automation_engine/ivanchuk: https://github.com/ManageIQ/manageiq-automation_engine/commit/9211d19af417a131df8de43dab18141d5d2af70f commit 9211d19af417a131df8de43dab18141d5d2af70f Author: tina <tfitzger> AuthorDate: Thu Nov 12 15:06:13 2020 +0000 Commit: Satoe Imaishi <simaishi> CommitDate: Thu Nov 12 15:47:31 2020 +0000 Merge pull request #465 from d-m-u/adding_check_connection_expose add expose check_connection to generic service (cherry picked from commit e0aa23f340d4fb2e689205697a10a4a251f13311) https://bugzilla.redhat.com/show_bug.cgi?id=1835226 lib/miq_automation_engine/service_models/miq_ae_service_service_generic.rb | 1 + 1 file changed, 1 insertion(+)
Verified in Version 5.11.10.0.20201116161314_cdd3548
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: CloudForms 5.0.10 security, bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5554