Bug 1366381 - [ocp3.2.1] deployments and scale up/down are very, very slow
Summary: [ocp3.2.1] deployments and scale up/down are very, very slow
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-controller-manager
Version: 3.2.1
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Dan Mace
QA Contact: zhou ying
URL:
Whiteboard:
Depends On: 1365146
Blocks:
 
Reported: 2016-08-11 19:51 UTC by Scott Dodson
Modified: 2016-09-12 17:36 UTC
CC: 5 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
The deployment controller's resync interval can now be configured. The previously hard-coded 2-minute default is the likely cause of performance regressions when thousands of deploymentconfigs are present in the system. Increase the resync interval by setting 'deploymentControllerResyncMinute' in /etc/origin/master/master-config.yaml.
Clone Of: 1365146
Environment:
Last Closed: 2016-09-12 17:36:17 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:1853 0 normal SHIPPED_LIVE Important: Red Hat OpenShift Enterprise 3.2 security update and bug fix update 2016-09-12 21:33:16 UTC

Description Scott Dodson 2016-08-11 19:51:04 UTC
+++ This bug was initially created as a clone of Bug #1365146 +++


I have a number of pods (6) running for my Node.js application.
I scaled the deployment config up to 6 replicas and then back down to 1 replica:

$ oc scale dc/nodejs-mongodb-example --replicas=6

and then

$ oc scale dc/nodejs-mongodb-example --replicas=1

on my project sspeiche-test1


$ oc get dc
NAME                     REVISION   REPLICAS   TRIGGERED BY
mongodb                  2          1          config,image(mongodb:3.2)
nodejs-mongodb-example   2          6          config,image(nodejs-mongodb-example:latest)

$ oc get pods
NAME                             READY     STATUS      RESTARTS   AGE
mongodb-2-f41yo                  1/1       Running     0          1h
nodejs-mongodb-example-1-build   0/1       Completed   0          1h
nodejs-mongodb-example-2-mzy8s   1/1       Running     0          54m


$ oc describe dc/nodejs-mongodb-example
Name:            nodejs-mongodb-example
Namespace:       sspeiche-nodejs
Created:         About an hour ago
Labels:          template=nodejs-mongodb-example
Description:     Defines how to deploy the application server
Annotations:     <none>
Latest Version:  2
Selector:        name=nodejs-mongodb-example
Replicas:        6
Triggers:        Image(nodejs-mongodb-example@latest, auto=true), Config
Strategy:        Rolling
Template:
  Labels:        name=nodejs-mongodb-example
  Containers:
  nodejs-mongodb-example:
    Image:       172.30.47.227:5000/sspeiche-nodejs/nodejs-mongodb-example@sha256:eef9f4d331711021384e51927ee33fb2a147bcb152b88f78a0d6fb9064e0de8d
    Port:        8080/TCP
    QoS Tier:
      cpu:       BestEffort
      memory:    BestEffort
    Limits:
      memory:    256Mi
    Liveness:    http-get http://:8080/pagecount delay=30s timeout=3s period=10s #success=1 #failure=3
    Readiness:   http-get http://:8080/pagecount delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment Variables:
      DATABASE_SERVICE_NAME:   mongodb
      MONGODB_USER:            userL3V
      MONGODB_PASSWORD:        vuJwtB66XanYHxFO
      MONGODB_DATABASE:        sampledb
      MONGODB_ADMIN_PASSWORD:  qXUjWUjH8YGJCDD3
  No volumes.


Deployment #2 (latest):
  Name:         nodejs-mongodb-example-2
  Created:      55 minutes ago
  Status:       Complete
  Replicas:     1 current / 1 desired
  Selector:     deployment=nodejs-mongodb-example-2,deploymentconfig=nodejs-mongodb-example,name=nodejs-mongodb-example
  Labels:       openshift.io/deployment-config.name=nodejs-mongodb-example,template=nodejs-mongodb-example
  Pods Status:  1 Running / 0 Waiting / 0 Succeeded / 0 Failed

Deployment #1:
  Created:      about an hour ago
  Status:       Complete
  Replicas:     0 current / 0 desired

Events:
  FirstSeen  LastSeen  Count  From                           SubobjectPath  Type     Reason             Message
  ---------  --------  -----  ----                           -------------  ----     ------             -------
  1h         1h        1      {deploymentconfig-controller}                 Normal   DeploymentCreated  Created new deployment "nodejs-mongodb-example-1" for version 1
  1h         1h        1      {deployment-controller}                       Warning  FailedUpdate       Cannot update deployment sspeiche-nodejs/nodejs-mongodb-example-1 status to Pending: replicationcontrollers "nodejs-mongodb-example-1" cannot be updated: the object has been modified; please apply your changes to the latest version and try again
  55m        55m       1      {deploymentconfig-controller}                 Normal   DeploymentCreated  Created new deployment "nodejs-mongodb-example-2" for version 2
  55m        55m       1      {deployment-controller}                       Warning  FailedUpdate       Cannot update deployment sspeiche-nodejs/nodejs-mongodb-example-2 status to Pending: replicationcontrollers "nodejs-mongodb-example-2" cannot be updated: the object has been modified; please apply your changes to the latest version and try again
  45m        45m       1      {deploymentconfig-controller}                 Normal   DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 4 to 1
  50m        42m       2      {deploymentconfig-controller}                 Normal   DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 1 to 4
  40m        40m       1      {deploymentconfig-controller}                 Normal   DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 4 to 6
  12m        12m       1      {deploymentconfig-controller}                 Normal   DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 1 to 6
  17m        6m        2      {deploymentconfig-controller}                 Normal   DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 6 to 1

--- Additional comment from Michal Fojtik on 2016-08-09 08:05:02 EDT ---

I think this should be fixed in >3.3, but I'll let Michalis prove me wrong. I think the DC will retry after the conflict, and that retry period seems long.

--- Additional comment from Steve Speicher on 2016-08-09 11:36:09 EDT ---

@Michal, but this is an urgent outage on prod preview right now. I only see the conflict sometimes and think it is due to the deployment / scale performance slowness. So if the slowness is fixed, the conflict won't surface, right?

--- Additional comment from Michal Fojtik on 2016-08-09 12:07:52 EDT ---

(In reply to Steve Speicher from comment #2)
> @Michal, but this is an urgent outage on prod preview right now. I only
> see the conflict sometimes and think it is due to the deployment / scale
> performance slowness. So if the slowness is fixed, the conflict won't
> surface, right?

Yeah, I think Michalis told me once that the conflict is not critical and it is retried. I'm not sure whether the retrying is causing the slowness or not. I'm going to investigate this tomorrow, as it might be related to other deployment flakes we have been seeing lately.

--- Additional comment from Michal Fojtik on 2016-08-09 13:32:20 EDT ---

Dan: It seems you are the right person to look at this (it does not seem like a DC problem, but more like an RC problem). Also, Andy told me you were chasing something similar yesterday.

--- Additional comment from Dan Mace on 2016-08-09 13:43:15 EDT ---

The current issue I'm chasing is a minutes-long delay between DC creation and initial RC creation. This *could* be related, since RC scaling via a DC is also handled by the controllers: if processing is somehow slowed down for creating the RCs, it could also be slow for scaling the RCs in response to DC scale changes.

--- Additional comment from Dan Mace on 2016-08-09 14:58:51 EDT ---

My current theory is that deployment config controller processing time is now exceeding the reflector resync period. Each resync interval, the entire work queue is replaced and randomized, which would account for the wild variance. I'm going to do some load testing to verify. If the resync interval is indeed the problem, we'll need to introduce configuration to allow the resync interval to be modified beyond its current hard-coded 2 minute default.
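The theory above can be illustrated with a toy simulation (a hypothetical model written for this comment, not OpenShift code): one work item is processed per tick, and every resync interval the whole queue is replaced and reshuffled. When the controller cannot drain the queue within one resync period, random items are starved across cycles, which would account for the wild variance:

```python
import random

def simulate_queue(num_items, resync_interval, total_ticks, seed=0):
    """Toy model of a controller work queue under reflector resync.

    One item is processed per tick. Every `resync_interval` ticks the
    entire queue is replaced and shuffled, so a slow controller keeps
    restarting from a random ordering and never drains the queue.
    Returns the set of item IDs processed at least once.
    """
    rng = random.Random(seed)
    queue = list(range(num_items))
    rng.shuffle(queue)
    processed, pos = set(), 0
    for tick in range(total_ticks):
        if tick > 0 and tick % resync_interval == 0:
            queue = list(range(num_items))  # resync: full queue replacement
            rng.shuffle(queue)              # ...in randomized order
            pos = 0
        if pos < len(queue):
            processed.add(queue[pos])
            pos += 1
    return processed

# 500 DCs, a 120-tick resync, 600 ticks of wall time:
starved = simulate_queue(500, resync_interval=120, total_ticks=600)
# Same workload with the resync pushed beyond the window:
drained = simulate_queue(500, resync_interval=10_000, total_ticks=600)
# `drained` covers all 500 items; `starved` misses many of them,
# and which ones are missed depends on the shuffle.
```

With a long enough interval the controller drains the whole queue once; with the short interval it keeps reprocessing a random prefix, so individual DCs see wildly different latencies.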

--- Additional comment from Steve Speicher on 2016-08-10 14:27:30 EDT ---

Is there any workaround or fix planned for preview (prod)? I have been observing the same behavior since Thursday (Aug 4th).

--- Additional comment from Dan Mace on 2016-08-10 14:32:40 EDT ---

https://github.com/openshift/ose/pull/336 is our first attempt at a fix.
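The doc text on this bug describes the setting this fix introduces. A minimal sketch of the resulting configuration, assuming the key sits at the top level of /etc/origin/master/master-config.yaml — the key name is taken from the doc text; the placement and value here are illustrative, not taken from the PR:

```yaml
# /etc/origin/master/master-config.yaml (fragment; placement is illustrative)
# Raise the deployment controller resync interval above the old
# hard-coded 2-minute default, e.g. to 10 minutes:
deploymentControllerResyncMinute: 10
```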

Comment 3 zhou ying 2016-08-16 08:41:33 UTC
Will verify this bug, since it is a clone of Bug #1365146.

Comment 5 errata-xmlrpc 2016-09-12 17:36:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1853

