Some node operations can take longer than 120s, which is the hardcoded timeout at https://github.com/openshift/origin-server/blob/master/node/lib/openshift-origin-node/model/v1_cart_model.rb#L280. The v2 timeout appears to be 3600s (https://github.com/openshift/origin-server/blob/master/node/lib/openshift-origin-node/utils/shell_exec.rb#L83), assuming I have found the corresponding/equivalent operation.
I'd recommend raising the v1 cart model timeout to 3600s as well, since some customers are still on v1 for now.
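As a rough sketch of what that change amounts to (not the actual patch; the constant name and helper below are illustrative, using only the stdlib Timeout and Open3 modules):

```ruby
require 'timeout'
require 'open3'

# Illustrative constant; the real value lives in v1_cart_model.rb.
V1_HOOK_TIMEOUT = 3600  # raised from the hardcoded 120

# Run a command, raising Timeout::Error if it exceeds the limit.
# Caveat: Timeout.timeout only raises in the calling thread; it does
# NOT kill the spawned process, which is exactly the orphaning
# problem discussed elsewhere in this bug.
def run_hook(cmd, timeout_s = V1_HOOK_TIMEOUT)
  Timeout.timeout(timeout_s) do
    out, status = Open3.capture2e(cmd)
    [out, status.exitstatus]
  end
end
```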
Created attachment 760202 [details]
patch to increase timeout in v1 cart model
There are three relevant timeouts in the system:
1. Mcollective terminates the thread managing the operation: 400 seconds.
2. Broker considers an operation to have failed: 300 seconds (240 now?).
3. The oo_spawn/shellExec timeout.
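Assuming orphaned hooks are avoided only when the process-killing timeout fires first, the ordering can be sanity-checked with a small sketch (values taken from this report; the constant names are mine, and note that a 3600s oo_spawn default does not satisfy the ordering unless overridden per call):

```ruby
# Timeout values from this report (seconds); names are illustrative.
MCOLLECTIVE_TIMEOUT = 400   # mcollective kills its managing thread
BROKER_TIMEOUT      = 300   # broker declares the operation failed
OO_SPAWN_TIMEOUT    = 3600  # v2 oo_spawn default per shell_exec.rb

# Safe only if the process-killing timeout fires before anyone gives up.
def timeouts_ordered?(spawn_t, broker_t, mco_t)
  spawn_t < broker_t && broker_t < mco_t
end

timeouts_ordered?(OO_SPAWN_TIMEOUT, BROKER_TIMEOUT, MCOLLECTIVE_TIMEOUT)  # false
timeouts_ordered?(120, BROKER_TIMEOUT, MCOLLECTIVE_TIMEOUT)               # true
```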
The first two timeouts leave spawned processes (ex: cartridge hooks) running, orphaned and forgotten.
A common failure mode: the configure hook takes too long, the broker starts running destroy in response to the timeout, and configure and destroy then run concurrently, leaving half-removed gears behind.
We have observed that git can deadlock and stay running, causing processes to accumulate on long running systems.
Only the third timeout actually terminates spawned processes (ex: hooks). Whatever the script-execution timeout is, if it does not fire before the broker or mcollective timeouts, there is a risk of indeterminate results.
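The difference is that a process-level timeout can signal the whole process group, taking down the hook and anything it forked. A minimal sketch of that approach (assumptions: POSIX, stdlib only; names are mine, and output is read only after exit, so it assumes modest output volume):

```ruby
# Sketch: hard timeout that terminates the spawned process group,
# rather than merely abandoning it the way the broker/mcollective
# timeouts do.
def spawn_with_hard_timeout(cmd, timeout_s)
  r, w = IO.pipe
  # Start the command in its own process group so we can signal the
  # whole group (the hook plus any children it forked).
  pid = Process.spawn(cmd, pgroup: true, out: w, err: w)
  w.close
  deadline = Time.now + timeout_s
  until Process.waitpid(pid, Process::WNOHANG)
    if Time.now > deadline
      Process.kill("TERM", -pid)  # negative pid signals the process group
      Process.waitpid(pid)        # reap so no zombie is left behind
      raise "command timed out after #{timeout_s}s: #{cmd}"
    end
    sleep 0.05
  end
  [r.read, $?.exitstatus]
ensure
  r.close if r && !r.closed?
end
```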
Also, it has been observed that the oo_spawn timeout does not always fire on time.
Checked the related code in v1_cart_model.rb on puddle 2013-07-12:

    while (line = stdout.gets)
      output << line
    end
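A read loop like that blocks in stdout.gets, and a surrounding Timeout only interrupts the reading thread, not the child producing the output. One way to put a hard deadline on the loop itself is IO.select with the remaining time as its wait argument (a sketch under those assumptions; the helper name is mine):

```ruby
require 'open3'
require 'timeout'

# Sketch: read a child's stdout under a hard deadline, so the loop
# cannot outlive the timeout waiting in a blocking read.
def read_with_deadline(cmd, timeout_s)
  output = ""
  Open3.popen2(cmd) do |_stdin, stdout, wait_thr|
    deadline = Time.now + timeout_s
    loop do
      remaining = deadline - Time.now
      raise Timeout::Error, "deadline exceeded: #{cmd}" if remaining <= 0
      # Wait at most `remaining` seconds for data; nil means we timed out.
      ready = IO.select([stdout], nil, nil, remaining)
      raise Timeout::Error, "deadline exceeded: #{cmd}" unless ready
      begin
        output << stdout.read_nonblock(4096)
      rescue EOFError
        break  # child closed its end; all output collected
      end
    end
    wait_thr.value  # reap the child
  end
  output
end
```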
The timeout has been changed to 3600s, so marking this bug verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.