Bug 860561

Summary: Audrey agent uses default systemd timeout (90s). Needs to be much longer, or have timeout disabled
Product: [Fedora] Fedora
Component: aeolus-audrey-agent
Version: 1
Status: CLOSED WONTFIX
Severity: medium
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Justin Clift <jclift>
Assignee: Joe Vlcek <jvlcek>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: dradez, gblomqui, jvlcek, kwade, morazi
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-01-04 13:41:34 UTC
Attachments: audrey.service file with timeout disabled

Description Justin Clift 2012-09-26 07:27:16 UTC
Description of problem:

  I've been pulling my hair out trying to get OpenShift Origin working in a multi-node configuration on EC2, through Aeolus.

  The Audrey agent has been exhibiting bizarre failures throughout this time, which turned out to be caused by systemd killing the agent at the default 90-second timeout.

  *Please* disable the timeout for audrey.service, or at least extend it massively (e.g. 10 mins).

  On OpenShift Origin at least, the post-boot configuration scripting can take several minutes.


Version-Release number of selected component (if applicable):

  aeolus-audrey-agent-0.4.10-1.fc16.noarch


How reproducible:

  Every time. :(


Steps to Reproduce:
1. Boot a deployable that uses Audrey on EC2, providing Audrey with any startup script that takes several minutes.
2. Log into the instance a few minutes later.
3. You'll notice the Audrey Agent is marked as failed, with the startup script just seeming to have stopped for no real reason. :(

Comment 2 Mike Orazi 2012-09-26 14:46:51 UTC
Moving this to the fedora product to keep it distinct.

Comment 3 Joe Vlcek 2012-09-26 15:52:41 UTC
(In reply to comment #0)


This is not a use case the Audrey Agent was designed for.

The Audrey Agent was not designed to run for very long. Its basic
design is to provide a mechanism to start user-provided tooling and
then exit. Starting user-provided tooling should only take a very
brief amount of time.

The danger in expecting the Audrey Agent to run for an extended
amount of time (e.g. 10 mins) is that launch failures could be painfully
slow to surface.

Could you perhaps restructure the tooling being executed by the
Audrey Agent to start a user-provided service or daemon and allow
the Audrey Agent to exit cleanly much sooner?

Can you please provide a more precise description of the use case
you envision, including the Deployable XML being used? It would also
be valuable to know the version of the Config Server.

Thank you!

Comment 4 Justin Clift 2012-09-27 00:20:05 UTC
Hmmm, it's probably a good idea to keep in mind that the length of time needed by user-provided tooling depends on how much grunt the VM has (CPU, I/O, etc.).

So, with EC2 having everything from very small instances to very large ones, it can vary wildly.

A deployment whose startup tooling takes less than 90 seconds in VMware/oVirt here takes ~5 minutes on EC2 using a c1.medium instance type, and ~10 minutes using an m1.small instance type.

So the "should only take a very brief amount of time" is very environment dependent.

Using a much larger instance type than desired, just to make the Audrey agent fit in its time window, wouldn't be great.  Especially when the main application (OpenShift node or broker) can run happily (after the initial setup) on the smaller instance sizes.

It _may_ be possible with the OpenShift deployment I'm creating, to have Audrey kick off the tooling as a separate daemon or something.  Haven't tried it yet.

What _did_ work was to provide a new audrey.service file with the timeout disabled, overriding the default one.
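
The attached file is not reproduced in this report, so the fragment below is only a sketch of what disabling the timeout in a systemd unit can look like. `TimeoutSec=` is a real `[Service]` option whose default is 90s; systemd releases of that era documented `0` as disabling the timeout logic, while newer releases document `infinity`. The rest of the unit (its `ExecStart=` line and so on) and the location of the overriding file are not known from this report and would have to be copied from the shipped audrey.service.

```ini
# Sketch only: this is NOT the content of attachment 618070.
# Only the timeout-related lines are shown; all other settings
# would be carried over unchanged from the packaged unit.
[Service]
# 0 disables the start/stop timeout on systemd of that era.
# A bare number is read as seconds, so 600 would mean 10 minutes.
TimeoutSec=0
```

After changing a unit file, `systemctl daemon-reload` makes systemd pick up the new definition.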

For reference, the deployable template, image templates, and other files are here:

  https://github.com/justinclift/templates/tree/openshift/openshift/2_node_cluster/fedora-16

That's a temporary development location, and will be moved here when ready:

https://github.com/aeolus-incubator/templates/tree/master/openshift/2_node_cluster/fedora-16

Comment 5 Joe Vlcek 2012-09-27 12:58:45 UTC
(In reply to comment #4)

Justin

Very insightful reply! Thank you.

One more thing: could you please attach the audrey.service file
that you have used successfully?

Thanks!
   Joe

Comment 7 Justin Clift 2012-09-27 13:39:53 UTC
Created attachment 618070 [details]
audrey.service file with timeout disabled

Attaching, as per request. :)

Comment 8 Joe Vlcek 2013-01-04 13:41:34 UTC
Audrey is not designed to let the user run everything in series under
Audrey's control.

Audrey is designed to be a lightweight tool that can be used to
kick off other tooling.

For instance: if a long-ish running series of operations is
required, put that sequence in a batch job, a service, or a bash
script, and use the Audrey agent to kick it off.
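
One way to read the suggestion above is for the script that Audrey runs to do nothing except ask systemd to start a separate unit, then exit. The Python sketch below illustrates that idea under stated assumptions: the unit name `openshift-setup.service` is hypothetical, and the report does not say how the long-running tooling would actually be packaged. `systemctl start --no-block` queues the start job and returns without waiting for it to finish. Handing the work to its own unit also sidesteps a possible pitfall: a child that is merely backgrounded may still be treated by systemd as part of audrey.service and be cleaned up along with it.

```python
import subprocess

# Hypothetical unit name, used for illustration only; the bug report
# does not name any such service.
SETUP_UNIT = "openshift-setup.service"


def build_start_command(unit):
    """Return a systemctl command that queues a start job and returns at once."""
    # --no-block tells systemctl not to wait for the start job to complete.
    return ["systemctl", "start", "--no-block", unit]


def kick_off(unit, runner=subprocess.run):
    """Ask systemd to start `unit` in the background and return the exit status.

    `runner` is injectable so the function can be exercised without systemd.
    """
    result = runner(build_start_command(unit), check=False)
    return result.returncode
```

A startup script handed to Audrey could then consist of little more than `kick_off(SETUP_UNIT)`, which returns as soon as systemd has accepted the job, well inside the 90-second window.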

Closing as will not fix.