Created attachment 1434299 [details]
pod logs

Description of problem:
When embedded ansible is turned on in the podified appliance, the embedded ansible pod gets deployed but the setup scripts never start in it; the setup log file is also absent. The pod hangs in this state until it is restarted by OpenShift (readiness/liveness probes) or by the app pod. This issue appears to have been introduced by the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1575071

Version-Release number of selected component (if applicable):
5.9.2.4

How reproducible:
100%, but only on the first deployment

Steps to Reproduce:
1. Deploy the podified appliance.
2. Open the appliance UI.
3. Open Configuration -> Server -> Server Control, turn on Embedded Ansible, and save the changes.
4. Go to the OpenShift project where the podified appliance is deployed and watch the ansible pod deployment.

Actual results:
The ansible pod deployment starts, but the pod does not become ready for a long time. Once the pod gets redeployed after the timeout, it becomes ready. The whole process takes 10-15 minutes before ansible is ready. The embedded ansible pod deploys correctly and on time if embedded ansible is subsequently turned off and back on.

Expected results:
Correct deployment of the ansible pod.

Additional info:
App pod logs are attached.
I'm not actually sure this is an issue. At startup the embedded ansible pod has to migrate its database, which can take a long time. The timing is heavily dependent on the environment and on latency to the database pod, so I could see this behavior happening if the database were deployed on a separate node from the embedded ansible pod. How long is it before the initial pod is rescheduled? Maybe we can change that timeout value.
Nick, I am using an empty OpenShift environment, so there is no heavy load that could influence pod deployment. The DB is deployed on the same node as the rest of the appliance pods, and the pod deployment is restarted anyway. I do agree that the issue is in the readiness/liveness probes if there is no timeout check from the app pod, so I believe we have to correct the timeout values.
As a fix for this I'm going to remove the health checks from the embedded ansible deployment config and introduce a timeout into the setup step that we can increase. We already do health checking in the EmbeddedAnsibleWorker, so adding the checks in OpenShift was really overkill in this case. Unfortunately, this won't be something we can backport to 5.9, because we actually rely on the OpenShift health checks in that version. If this "fix" needs to be backported, we'll have to investigate whether we can increase the readiness/liveness initial timeouts.
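The approach described above, waiting for the service with a bounded, configurable timeout instead of relying on OpenShift probes, can be sketched roughly as follows. This is a hedged illustration, not the actual ManageIQ code: `ansible_alive?`, the default timeout, and the poll interval are stand-ins for the real readiness check against the embedded ansible API and the value added to config/settings.yml.

```ruby
require "timeout"

# Stand-in for the real readiness check, which would hit the embedded
# ansible (AWX) API; here it simply reports ready after one second.
def ansible_alive?
  @alive_after ||= Time.now + 1
  Time.now >= @alive_after
end

# Poll until the service is alive, giving up after `timeout` seconds
# instead of waiting forever. Returns true on success, false on timeout.
def wait_for_ansible_ready(timeout: 600, interval: 0.1)
  Timeout.timeout(timeout) do
    sleep(interval) until ansible_alive?
  end
  true
rescue Timeout::Error
  false
end
```

With a scheme like this, a slow first-time database migration just makes startup take longer, rather than tripping a short probe timeout and forcing OpenShift to kill and reschedule the pod.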
https://github.com/ManageIQ/manageiq/pull/17603
The issue is still present in 5.10.0.1: the ansible deploy pod fails because the ansible pod doesn't become ready in time. As a result, I have to redeploy the pod manually.
New commits detected on ManageIQ/manageiq/master:

https://github.com/ManageIQ/manageiq/commit/2a766b2999141a7ce4a4972d7ea2fefb754bbb07

commit 2a766b2999141a7ce4a4972d7ea2fefb754bbb07
Author:     Nick Carboni <ncarboni>
AuthorDate: Mon Jun 18 15:12:46 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Mon Jun 18 15:12:46 2018 -0400

    Add a timeout to docker and container Embedded Ansible startup

    Previously we would wait forever. That's never good.

    https://bugzilla.redhat.com/show_bug.cgi?id=1576744

 config/settings.yml                                | 2 +
 lib/embedded_ansible/container_embedded_ansible.rb | 2 +-
 lib/embedded_ansible/docker_embedded_ansible.rb    | 2 +-
 3 files changed, 4 insertions(+), 2 deletions(-)

https://github.com/ManageIQ/manageiq/commit/572f295173cbd69d6c2a6a4d062ebc0a08197d12

commit 572f295173cbd69d6c2a6a4d062ebc0a08197d12
Author:     Nick Carboni <ncarboni>
AuthorDate: Mon Jun 18 15:13:31 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Mon Jun 18 15:13:31 2018 -0400

    Remove the health checks from ContainerEmbeddedAnsible

    We already do health checks from the EmbeddedAnsibleWorker so we
    don't really need OpenShift to do more, especially if they have
    small built-in timeouts.

    https://bugzilla.redhat.com/show_bug.cgi?id=1576744

 lib/embedded_ansible/container_embedded_ansible.rb | 27 +-
 1 file changed, 5 insertions(+), 22 deletions(-)
Verified in 5.10.0.4.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0213