Description of problem: I've been pulling my hair out trying to get OpenShift Origin working in a multi-node configuration on EC2, through Aeolus. The Audrey agent has been exhibiting bizarre failures throughout this time, which have turned out to be caused by systemd killing the agent at the 90 second default timeout mark. *Please* disable the timeout for audrey.service, or at least extend it massively (e.g. 10 mins). On OpenShift Origin at least, the post-boot configuration scripting can take at least a few minutes. Version-Release number of selected component (if applicable): aeolus-audrey-agent-0.4.10-1.fc16.noarch How reproducible: Every time. :( Steps to Reproduce: 1. Boot a deployable that uses Audrey on EC2, providing Audrey with any startup script that takes several minutes. 2. Log into the instance a few minutes later. 3. You'll notice the Audrey Agent is marked as failed, with the startup script just seeming to have stopped for no real reason. :(
Moving this to the fedora product to keep it distinct.
(In reply to comment #0) The Audrey Agent was not designed for this use case, in which it would need to run for a long time. The basic design of the Audrey Agent is to provide a mechanism to start user-provided tooling and then exit. Starting user-provided tooling should take only a very brief amount of time. There is a danger in expecting the Audrey Agent to run for an extended amount of time (e.g. 10 mins): launch failures could be painfully slow to surface. Could you perhaps restructure the tooling being executed by the Audrey Agent so that it starts a user-provided service or daemon, allowing the Audrey Agent to exit cleanly much sooner? Can you please provide a more precise description of the use case you envision, including the Deployable.XML being used? It would also be valuable to know the version of the Config Server. Thank you!
Hmmm, it's probably a good idea to keep in mind that the length of time needed by user-provided tooling depends on how much grunt the VM has (CPU, IO, etc.). So, with EC2 having everything from very small instances to very large ones, it can vary wildly. A deployment whose startup tooling takes less than 90 seconds in VMware/oVirt here takes ~5 minutes on EC2 using a c1.medium instance type, and ~10 minutes using an m1.small instance type. So the "should take only a very brief amount of time" is very environment dependent. Using a much larger instance type than desired, just to make the Audrey agent fit in its time window, wouldn't be great, especially when the main application (OpenShift node or broker) can run happily (after the initial setup) on the smaller instance sizes. It _may_ be possible, with the OpenShift deployment I'm creating, to have Audrey kick off the tooling as a separate daemon or something. Haven't tried it yet. What _did_ work was to provide a new audrey.service file with the timeout disabled, overriding the default one. For reference, the deployable template, image templates, and other files are here: https://github.com/justinclift/templates/tree/openshift/openshift/2_node_cluster/fedora-16 That's a temporary development location, and will be moved here when ready: https://github.com/aeolus-incubator/templates/tree/master/openshift/2_node_cluster/fedora-16
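As a rough illustration of the override described above, a systemd unit with its timeout disabled might look something like the sketch below. This is not the file from the repository: the description, service type, ordering, and ExecStart path are all assumptions made for the example, and the real values should be taken from the packaged audrey.service. The only setting that matters here is TimeoutSec=0, which asks systemd not to enforce a start/stop timeout for the unit.

```ini
# Hypothetical /etc/systemd/system/audrey.service override (illustrative only).
[Unit]
Description=Audrey agent (example override with timeout disabled)
After=network.target

[Service]
# Assumed service type; keep whatever the packaged unit uses.
Type=oneshot
# Placeholder command; keep the ExecStart line from the packaged unit.
ExecStart=/usr/bin/audrey
# A value of 0 disables systemd's timeout logic for this unit.
TimeoutSec=0

[Install]
WantedBy=multi-user.target
```

A unit file placed under /etc/systemd/system takes precedence over one of the same name shipped under /lib/systemd/system or /usr/lib/systemd/system, which is what makes this kind of override possible without editing the packaged file.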
(In reply to comment #4) Justin, very insightful reply! Thank you. One more thing: could you please attach the audrey.service file that you have used successfully? Thanks! Joe
It's already in the repo above: https://github.com/justinclift/templates/blob/openshift/openshift/2_node_cluster/fedora-16/supporting/audrey.service
Created attachment 618070 audrey.service file with timeout disabled Attaching, as per request. :)
Audrey is not designed to let the user run everything in series under Audrey's control. Audrey is designed to be a lightweight tool that can be used to kick off other tooling. For instance, if a long-ish running series of operations is required, put that sequence in a batch job, service, or bash script, and use the Audrey agent to kick it off. Closing as will not fix.
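The suggestion above, handing the long-running sequence to a separate job so the agent can exit quickly, could be sketched roughly as follows. This is a minimal illustration rather than anything Audrey provides: the function name launch_detached and the paths in the usage comment are hypothetical.

```shell
#!/bin/bash
# Sketch: start a long-running setup job in the background and return at once.
# launch_detached is a hypothetical helper, not part of Audrey.

launch_detached() {
    local log_file="$1"
    shift
    # nohup makes the job ignore SIGHUP; stdin, stdout, and stderr are
    # redirected so the caller is not held open waiting on the job.
    nohup "$@" >"$log_file" 2>&1 </dev/null &
    # Print the PID of the background job so the caller can record it.
    echo "$!"
}

# Hypothetical usage from a script started by the agent:
#   launch_detached /var/log/openshift-setup.log /usr/local/bin/openshift-setup.sh
```

One caveat with this approach under systemd: a job backgrounded this way normally stays in the calling unit's control group, so it may still be killed when that unit stops or times out. Putting the long-running sequence in its own unit and starting it with `systemctl --no-block start` is likely to be more robust, at the cost of shipping an extra unit file.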