Bug 973717 - Increase v1 cart model timeout from 120s to 3600s
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 2.2.0
Hardware: Unspecified, OS: Unspecified
Target Milestone: ---
Assignee: chris alfonso
QA Contact: libra bugs
Depends On:
Reported: 2013-06-12 14:40 UTC by Andy Goldstein
Modified: 2017-03-08 17:35 UTC

Fixed In Version: rubygem-openshift-origin-node-1.9.14-1.2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2013-08-05 17:16:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments
patch to increase timeout in v1 cart model (872 bytes, patch)
2013-06-12 14:56 UTC, chris alfonso

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2013:1138 normal SHIPPED_LIVE OpenShift Enterprise 1.2.1 bug fix and enhancement update 2013-08-05 21:14:54 UTC

Description Andy Goldstein 2013-06-12 14:40:38 UTC
Some node operations may take longer than 120s, which is the hardcoded timeout at https://github.com/openshift/origin-server/blob/master/node/lib/openshift-origin-node/model/v1_cart_model.rb#L280. The v2 timeout appears to be 3600s (https://github.com/openshift/origin-server/blob/master/node/lib/openshift-origin-node/utils/shell_exec.rb#L83), if I found the corresponding/equivalent operation.

I'd recommend increasing the v1 cart model timeout to 3600s, as there are still some customers using v1 for now.

Comment 2 chris alfonso 2013-06-12 14:56:12 UTC
Created attachment 760202
patch to increase timeout in v1 cart model

Comment 3 Rob Millner 2013-06-12 16:52:32 UTC
There are three relevant timeouts in the system:
1. MCollective terminates the thread managing the operation: 400 seconds.

2. Broker considers an operation to have failed: 300 seconds (240 now?).

3. The oo_spawn/shellExec timeout.

When either of the first two timeouts fires, spawned processes (e.g. cartridge hooks) are not terminated; they simply keep running with nothing tracking them.

A common failure is for the configure hook to take too long, for the broker to start running destroy in response to the timeout, and for configure and destroy to then run at the same time, leaving half-removed gears behind.

We have observed that git can deadlock and stay running, causing processes to accumulate on long running systems.

Only the third timeout terminates spawned processes (e.g. hooks). Whatever the timeout is for script execution, if it does not fire ahead of the broker and MCollective timeouts, there is a risk of indeterminate results.

Also, it has been observed that the oo_spawn timeout does not always fire on time.
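
The ordering constraint described in this comment can be checked mechanically. The following is only a minimal sketch: the timeout values are the ones quoted in this report, while the hash layout and the helper name script_timeout_fires_first? are hypothetical and not part of origin-server.

```ruby
# Timeout values as quoted in this bug report (seconds).
# The hash layout and helper name are hypothetical, for illustration only.
TIMEOUTS = {
  mcollective: 400,  # MCollective terminates the thread managing the operation
  broker: 300,       # broker considers the operation to have failed
  script: 3600       # v1 cart model timeout proposed in this bug
}.freeze

# Returns true only when the script-execution timeout is shorter than both
# outer timeouts, i.e. it would fire (and terminate the spawned process)
# before the broker or MCollective gives up on the operation.
def script_timeout_fires_first?(timeouts)
  timeouts[:script] < [timeouts[:broker], timeouts[:mcollective]].min
end

puts script_timeout_fires_first?(TIMEOUTS)                     # false
puts script_timeout_fires_first?(TIMEOUTS.merge(script: 120))  # true
```

With the values above, a 3600s script timeout no longer fires ahead of the broker or MCollective timeouts, which is the trade-off this comment points out.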

Comment 6 Gaoyun Pei 2013-07-16 06:07:55 UTC
Checked the related code in v1_cart_model.rb on puddle 2013-07-12:
        Timeout::timeout(3600) do
          while (line = stdout.gets)
            output << line

The timeout has been changed to 3600s, so marking this bug as verified.
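
For readers unfamiliar with the pattern in the excerpt above, here is a self-contained sketch of reading a child process's output under Timeout.timeout and killing the child if the limit is hit. It is not the actual v1_cart_model.rb code: the helper name capture_with_timeout, the use of Open3.popen2e, and the KILL signal are assumptions made for illustration.

```ruby
require 'open3'
require 'timeout'

# Hypothetical helper; NOT the actual v1_cart_model.rb implementation.
# Runs command, collects combined stdout/stderr line by line, and kills the
# child process if reading takes longer than limit_seconds.
def capture_with_timeout(command, limit_seconds)
  output = String.new
  Open3.popen2e(command) do |stdin, stdout_and_stderr, wait_thr|
    stdin.close
    begin
      Timeout.timeout(limit_seconds) do
        # Same read loop shape as the excerpt quoted in this comment.
        while (line = stdout_and_stderr.gets)
          output << line
        end
      end
    rescue Timeout::Error
      # Terminate the child so it does not keep running after the timeout.
      Process.kill('KILL', wait_thr.pid)
      raise
    end
    [output, wait_thr.value.exitstatus]
  end
end
```

A call such as capture_with_timeout('echo hello', 3600) returns the captured output together with the exit status, while a command that outlives the limit raises Timeout::Error after the child has been sent KILL. Note that this only signals the direct child; hooks that fork further processes would need process-group handling, which is outside this sketch.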

Comment 9 errata-xmlrpc 2013-08-05 17:16:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

