Created attachment 1434299 [details]
pod logs

Description of problem:
When embedded ansible is turned on in the podified appliance, the embedded ansible pod gets deployed but the setup scripts never start in it; the setup log file is also absent. The pod hangs in this state until it is restarted by OpenShift (readiness/liveness probes) or by the app pod. This issue appears to have been introduced by the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1575071

Version-Release number of selected component (if applicable):
5.9.2.4

How reproducible:
100%, but only on the first deployment

Steps to Reproduce:
1. Deploy the podified appliance.
2. Open the appliance UI.
3. Open Configuration -> Server -> Server Control, turn on Embedded Ansible, and save the changes.
4. Go to the OpenShift project where the podified appliance is deployed and watch the ansible pod deployment.

Actual results:
The ansible pod deployment starts, but the pod does not become ready for a long time. Once the pod gets redeployed after the timeout, it becomes ready. The whole process takes 10-15 minutes before ansible is ready. The embedded ansible pod deploys correctly and on time if embedded ansible is subsequently turned off and back on.

Expected results:
Correct deployment of the ansible pod.

Additional info:
App pod logs are attached.
I'm not actually sure this is an issue. At startup the embedded ansible pod has to migrate its database, which can take a long time. The timing is heavily dependent on the environment and on latency to the database pod, so I could see this behavior happening if the database were deployed on a separate node from the embedded ansible pod. How long is it before the initial pod is rescheduled? Maybe we can change that timeout value.
Nick, I am using an empty OpenShift environment, so there is no heavy load that could influence pod deployment. The DB is deployed on the same node as the rest of the appliance pods, and the pod deployment is restarted anyway. I do agree that the issue is in the readiness/liveness probes if there is no timeout check from the app pod, so I believe we have to correct the timeout values.
As a fix for this I'm going to remove the health checks from the embedded ansible deployment config and introduce a timeout into the setup step that we can increase. We already do health checking in the EmbeddedAnsibleWorker, so adding the checks in OpenShift was really overkill in this case. Unfortunately, this won't be something we can backport to 5.9, because we actually rely on the OpenShift health checks in that version. If this "fix" needs to be backported, we'll have to investigate whether we can increase the readiness/liveness initial timeouts.
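The approach described above, waiting for the service with a bounded, configurable timeout instead of relying on OpenShift probes, can be sketched roughly as follows. This is a hedged illustration, not the actual ManageIQ code: `ansible_alive?`, the default timeout, and the poll interval are stand-ins for the real readiness check against the embedded ansible API and the value added to config/settings.yml.

```ruby
require "timeout"

# Stand-in for the real readiness check, which would hit the embedded
# ansible (AWX) API; here it simply reports ready after one second.
def ansible_alive?
  @alive_after ||= Time.now + 1
  Time.now >= @alive_after
end

# Poll until the service is alive, giving up after `timeout` seconds
# instead of waiting forever. Returns true on success, false on timeout.
def wait_for_ansible_ready(timeout: 600, interval: 0.1)
  Timeout.timeout(timeout) do
    sleep(interval) until ansible_alive?
  end
  true
rescue Timeout::Error
  false
end
```

With a scheme like this, a slow first-time database migration just makes startup take longer, rather than tripping a short probe timeout and forcing OpenShift to kill and reschedule the pod.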
https://github.com/ManageIQ/manageiq/pull/17603
The issue is still present in 5.10.0.1: the ansible deploy pod fails because the ansible pod doesn't become ready in time. As a result, I have to redeploy the pod manually.
New commits detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/2a766b2999141a7ce4a4972d7ea2fefb754bbb07

commit 2a766b2999141a7ce4a4972d7ea2fefb754bbb07
Author:     Nick Carboni <ncarboni>
AuthorDate: Mon Jun 18 15:12:46 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Mon Jun 18 15:12:46 2018 -0400

    Add a timeout to docker and container Embedded Ansible startup

    Previously we would wait forever. That's never good.

    https://bugzilla.redhat.com/show_bug.cgi?id=1576744

 config/settings.yml                                | 2 +
 lib/embedded_ansible/container_embedded_ansible.rb | 2 +-
 lib/embedded_ansible/docker_embedded_ansible.rb    | 2 +-
 3 files changed, 4 insertions(+), 2 deletions(-)

https://github.com/ManageIQ/manageiq/commit/572f295173cbd69d6c2a6a4d062ebc0a08197d12

commit 572f295173cbd69d6c2a6a4d062ebc0a08197d12
Author:     Nick Carboni <ncarboni>
AuthorDate: Mon Jun 18 15:13:31 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Mon Jun 18 15:13:31 2018 -0400

    Remove the health checks from ContainerEmbeddedAnsible

    We already do health checks from the EmbeddedAnsibleWorker so we
    don't really need OpenShift to do more, especially if they have
    small built-in timeouts.

    https://bugzilla.redhat.com/show_bug.cgi?id=1576744

 lib/embedded_ansible/container_embedded_ansible.rb | 27 +-
 1 file changed, 5 insertions(+), 22 deletions(-)
Verified in 5.10.0.4.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0213