Some node operations can take longer than 120s, which is the hardcoded timeout at https://github.com/openshift/origin-server/blob/master/node/lib/openshift-origin-node/model/v1_cart_model.rb#L280. The v2 timeout appears to be 3600s (https://github.com/openshift/origin-server/blob/master/node/lib/openshift-origin-node/utils/shell_exec.rb#L83), assuming I have found the corresponding/equivalent operation.
I'd recommend raising the v1 cart model timeout to 3600s as well, since some customers are still on v1 for now.
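As a rough sketch of what that change amounts to (not the actual patch; the constant name and helper below are illustrative, using only the stdlib Timeout and Open3 modules):

```ruby
require 'timeout'
require 'open3'

# Illustrative constant; the real value lives in v1_cart_model.rb.
V1_HOOK_TIMEOUT = 3600  # raised from the hardcoded 120

# Run a command, raising Timeout::Error if it exceeds the limit.
# Caveat: Timeout.timeout only raises in the calling thread; it does
# NOT kill the spawned process, which is exactly the orphaning
# problem discussed elsewhere in this bug.
def run_hook(cmd, timeout_s = V1_HOOK_TIMEOUT)
  Timeout.timeout(timeout_s) do
    out, status = Open3.capture2e(cmd)
    [out, status.exitstatus]
  end
end
```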
Created attachment 760202 [details]
patch to increase timeout in v1 cart model
There are three relevant timeouts in the system:
1. Mcollective terminates the thread managing the operation: 400 seconds.
2. Broker considers an operation to have failed: 300 seconds (240 now?).
3. The oo_spawn/shellExec timeout.
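Assuming orphaned hooks are avoided only when the process-killing timeout fires first, the ordering can be sanity-checked with a small sketch (values taken from this report; the constant names are mine, and note that a 3600s oo_spawn default does not satisfy the ordering unless overridden per call):

```ruby
# Timeout values from this report (seconds); names are illustrative.
MCOLLECTIVE_TIMEOUT = 400   # mcollective kills its managing thread
BROKER_TIMEOUT      = 300   # broker declares the operation failed
OO_SPAWN_TIMEOUT    = 3600  # v2 oo_spawn default per shell_exec.rb

# Safe only if the process-killing timeout fires before anyone gives up.
def timeouts_ordered?(spawn_t, broker_t, mco_t)
  spawn_t < broker_t && broker_t < mco_t
end

timeouts_ordered?(OO_SPAWN_TIMEOUT, BROKER_TIMEOUT, MCOLLECTIVE_TIMEOUT)  # false
timeouts_ordered?(120, BROKER_TIMEOUT, MCOLLECTIVE_TIMEOUT)               # true
```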
The first two timeouts leave spawned processes (ex: cartridge hooks) running, orphaned and forgotten.
A common failure mode: the configure hook takes too long, the broker starts running destroy in response to the timeout, and configure and destroy then run concurrently, leaving half-removed gears behind.
We have observed that git can deadlock and stay running, causing processes to accumulate on long running systems.
Only the third timeout actually terminates spawned processes (ex: hooks). Whatever the script-execution timeout is, if it does not fire before the broker or mcollective timeouts, there is a risk of indeterminate results.
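The difference is that a process-level timeout can signal the whole process group, taking down the hook and anything it forked. A minimal sketch of that approach (assumptions: POSIX, stdlib only; names are mine, and output is read only after exit, so it assumes modest output volume):

```ruby
# Sketch: hard timeout that terminates the spawned process group,
# rather than merely abandoning it the way the broker/mcollective
# timeouts do.
def spawn_with_hard_timeout(cmd, timeout_s)
  r, w = IO.pipe
  # Start the command in its own process group so we can signal the
  # whole group (the hook plus any children it forked).
  pid = Process.spawn(cmd, pgroup: true, out: w, err: w)
  w.close
  deadline = Time.now + timeout_s
  until Process.waitpid(pid, Process::WNOHANG)
    if Time.now > deadline
      Process.kill("TERM", -pid)  # negative pid signals the process group
      Process.waitpid(pid)        # reap so no zombie is left behind
      raise "command timed out after #{timeout_s}s: #{cmd}"
    end
    sleep 0.05
  end
  [r.read, $?.exitstatus]
ensure
  r.close if r && !r.closed?
end
```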
Also, it has been observed that the oo_spawn timeout does not always fire on time.
Checked the related code in v1_cart_model.rb on puddle 2013-07-12:

    while (line = stdout.gets)
      output << line
    end
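A read loop like that blocks in stdout.gets, and a surrounding Timeout only interrupts the reading thread, not the child producing the output. One way to put a hard deadline on the loop itself is IO.select with the remaining time as its wait argument (a sketch under those assumptions; the helper name is mine):

```ruby
require 'open3'
require 'timeout'

# Sketch: read a child's stdout under a hard deadline, so the loop
# cannot outlive the timeout waiting in a blocking read.
def read_with_deadline(cmd, timeout_s)
  output = ""
  Open3.popen2(cmd) do |_stdin, stdout, wait_thr|
    deadline = Time.now + timeout_s
    loop do
      remaining = deadline - Time.now
      raise Timeout::Error, "deadline exceeded: #{cmd}" if remaining <= 0
      # Wait at most `remaining` seconds for data; nil means we timed out.
      ready = IO.select([stdout], nil, nil, remaining)
      raise Timeout::Error, "deadline exceeded: #{cmd}" unless ready
      begin
        output << stdout.read_nonblock(4096)
      rescue EOFError
        break  # child closed its end; all output collected
      end
    end
    wait_thr.value  # reap the child
  end
  output
end
```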
The timeout has been changed to 3600s, so marking this bug verified.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.