1365146 – [preview] deployments and scale up/down are very, very slow

Summary: [preview] deployments and scale up/down are very, very slow

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Online
Classification:	Red Hat
Component:	Deployments
Sub Component:
Version:	3.x
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Dan Mace
QA Contact:	Mike Fiedler
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1366381

Reported:	2016-08-08 13:20 UTC by Steve Speicher
Modified:	2018-07-25 08:11 UTC
CC List:	6 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Clones:	1366381
Environment:
Last Closed:	2016-10-04 13:08:04 UTC
Target Upstream Version:
Embargoed:

Description Steve Speicher 2016-08-08 13:20:42 UTC

Copied / created from ServiceNow ticket INC0433522

I have 6 pods of my nodejs application running. On my project sspeiche-test1, I scaled the deployment config up to 6 replicas:

$ oc scale dc/nodejs-mongodb-example --replicas=6

and then back down to 1 replica:

$ oc scale dc/nodejs-mongodb-example --replicas=1


$ oc get dc
NAME                     REVISION   REPLICAS   TRIGGERED BY
mongodb                  2          1          config,image(mongodb:3.2)
nodejs-mongodb-example   2          6          config,image(nodejs-mongodb-example:latest)

$ oc get pods
NAME                             READY     STATUS      RESTARTS   AGE
mongodb-2-f41yo                  1/1       Running     0          1h
nodejs-mongodb-example-1-build   0/1       Completed   0          1h
nodejs-mongodb-example-2-mzy8s   1/1       Running     0          54m


$ oc describe dc/nodejs-mongodb-example
Name: nodejs-mongodb-example
Namespace: sspeiche-nodejs
Created: About an hour ago
Labels: template=nodejs-mongodb-example
Description: Defines how to deploy the application server
Annotations: <none>
Latest Version: 2
Selector: name=nodejs-mongodb-example
Replicas: 6
Triggers: Image(nodejs-mongodb-example@latest, auto=true), Config
Strategy: Rolling
Template:
  Labels: name=nodejs-mongodb-example
  Containers:
  nodejs-mongodb-example:
    Image: 172.30.47.227:5000/sspeiche-nodejs/nodejs-mongodb-example@sha256:eef9f4d331711021384e51927ee33fb2a147bcb152b88f78a0d6fb9064e0de8d
    Port: 8080/TCP
    QoS Tier:
      cpu: BestEffort
      memory: BestEffort
    Limits:
      memory: 256Mi
    Liveness: http-get http://:8080/pagecount delay=30s timeout=3s period=10s #success=1 #failure=3
    Readiness: http-get http://:8080/pagecount delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment Variables:
      DATABASE_SERVICE_NAME: mongodb
      MONGODB_USER: userL3V
      MONGODB_PASSWORD: vuJwtB66XanYHxFO
      MONGODB_DATABASE: sampledb
      MONGODB_ADMIN_PASSWORD: qXUjWUjH8YGJCDD3
  No volumes.


Deployment #2 (latest):
  Name: nodejs-mongodb-example-2
  Created: 55 minutes ago
  Status: Complete
  Replicas: 1 current / 1 desired
  Selector: deployment=nodejs-mongodb-example-2,deploymentconfig=nodejs-mongodb-example,name=nodejs-mongodb-example
  Labels: openshift.io/deployment-config.name=nodejs-mongodb-example,template=nodejs-mongodb-example
  Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Deployment #1:
  Created: about an hour ago
  Status: Complete
  Replicas: 0 current / 0 desired


Events:
FirstSeen  LastSeen  Count  From                            SubobjectPath  Type      Reason             Message
---------  --------  -----  ----                            -------------  --------  ------             -------
1h         1h        1      {deploymentconfig-controller }                 Normal    DeploymentCreated  Created new deployment "nodejs-mongodb-example-1" for version 1
1h         1h        1      {deployment-controller }                       Warning   FailedUpdate       Cannot update deployment sspeiche-nodejs/nodejs-mongodb-example-1 status to Pending: replicationcontrollers "nodejs-mongodb-example-1" cannot be updated: the object has been modified; please apply your changes to the latest version and try again
55m        55m       1      {deploymentconfig-controller }                 Normal    DeploymentCreated  Created new deployment "nodejs-mongodb-example-2" for version 2
55m        55m       1      {deployment-controller }                       Warning   FailedUpdate       Cannot update deployment sspeiche-nodejs/nodejs-mongodb-example-2 status to Pending: replicationcontrollers "nodejs-mongodb-example-2" cannot be updated: the object has been modified; please apply your changes to the latest version and try again
45m        45m       1      {deploymentconfig-controller }                 Normal    DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 4 to 1
50m        42m       2      {deploymentconfig-controller }                 Normal    DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 1 to 4
40m        40m       1      {deploymentconfig-controller }                 Normal    DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 4 to 6
12m        12m       1      {deploymentconfig-controller }                 Normal    DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 1 to 6
17m        6m        2      {deploymentconfig-controller }                 Normal    DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 6 to 1

Comment 1 Michal Fojtik 2016-08-09 12:05:02 UTC

I think this should be fixed in >3.3, but will let Michalis prove me wrong. I think the DC will retry after the conflict, and that retry period seems long.

Comment 2 Steve Speicher 2016-08-09 15:36:09 UTC

@Michal but this is an urgent outage on prod preview close now. I only see the conflict sometimes and think it is due to the deployment / scale performance slowness. So if the slowness is fixed, the conflict won't surface, right?

Comment 3 Michal Fojtik 2016-08-09 16:07:52 UTC

(In reply to Steve Speicher from comment #2)
> @Michal but this is an urgent outage on prod preview close now. I only see
> the conflict sometimes and think it is due to the deployment / scale
> performance slowness. So if the slowness is fixed, the conflict won't
> surface, right?

Yeah, I think Michalis told me once that the conflict is not critical and it is retried. I'm not sure if retrying is causing the slowness or not. I'm going to investigate this tomorrow as this might be related to other deployment flakes we are seeing nowadays.
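
The FailedUpdate warnings in the description are optimistic-concurrency conflicts: the update was computed from an older version of the replication controller than the one currently stored. As a rough illustration of why such a conflict is usually harmless once it is retried, here is a minimal Python sketch. Store, Conflict, and update_with_retry are hypothetical stand-ins, not OpenShift or Kubernetes APIs, and the sketch retries immediately, whereas the real controllers may wait before trying again.

```python
class Conflict(Exception):
    """Raised when an update is based on a stale version of the object."""


class Store:
    """Toy in-memory stand-in for an API server holding a single object."""

    def __init__(self, status):
        self.status = status
        self.version = 1

    def get(self):
        return self.version, self.status

    def update(self, seen_version, status):
        # Reject writes that were computed from an older version.
        if seen_version != self.version:
            raise Conflict("the object has been modified")
        self.status = status
        self.version += 1


def update_with_retry(store, new_status, max_attempts=5, before_write=None):
    """Re-read and retry on conflict; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        seen_version, _ = store.get()
        if before_write is not None:
            before_write(attempt)  # hook used to simulate a concurrent writer
        try:
            store.update(seen_version, new_status)
            return attempt
        except Conflict:
            continue  # someone else wrote first: fetch the latest and try again
    raise Conflict("gave up after %d attempts" % max_attempts)


if __name__ == "__main__":
    store = Store("New")

    def concurrent_writer(attempt):
        # Another client modifies the object just before our first write.
        if attempt == 1:
            store.update(store.version, "Touched")

    attempts = update_with_retry(store, "Pending", before_write=concurrent_writer)
    print(attempts, store.status)
```

In this toy model the conflict costs one extra round trip. If each retry instead has to wait for the object to come back around in a slow work queue, the same conflict could add a long delay, which is the open question in this comment.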

Comment 4 Michal Fojtik 2016-08-09 17:32:20 UTC

Dan: It seems you will be the right person to look at this (it does not seem like a DC problem, but more like an RC problem). Also Andy told me you were chasing something similar yesterday.

Comment 5 Dan Mace 2016-08-09 17:43:15 UTC

The current issue I'm chasing is a minutes-long delay between DC and initial RC creation. This *could* be related since RC scaling via DC is also handled by the controllers: if processing is somehow slowed down for creating the RCs, it could also be slow for scaling the RCs in response to DC scale changes.

Comment 6 Dan Mace 2016-08-09 18:58:51 UTC

My current theory is that deployment config controller processing time is now exceeding the reflector resync period. At each resync interval, the entire work queue is replaced and randomized, which would account for the wild variance. I'm going to do some load testing to verify. If the resync interval is indeed the problem, we'll need to introduce configuration to allow the resync interval to be modified beyond its current hard-coded 2-minute default.
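
To make the theory above concrete, here is a small Python simulation of a work queue that is replaced with a randomly ordered copy of all items at every resync. It is only a sketch of the stated hypothesis, not the actual controller code: the function name, the item counts, and the one-second per-item processing time are assumptions chosen for illustration. Only the 120-second resync period comes from this comment.

```python
import random


def seconds_until_processed(num_items, seconds_per_item, resync_seconds,
                            rng, target=0, max_periods=10000):
    """Simulate how long one queued item waits before it is first handled.

    Model: at the start of every resync period the whole queue is replaced
    with all items in random order, and the worker handles items from the
    front until the next resync replaces the queue again.
    """
    # Items the worker can finish before the queue is replaced again.
    per_period = int(resync_seconds // seconds_per_item)
    for period in range(max_periods):
        queue = list(range(num_items))
        rng.shuffle(queue)
        handled = queue[:per_period]
        if target in handled:
            position = handled.index(target) + 1
            return period * resync_seconds + position * seconds_per_item
    return None  # not reached within max_periods


if __name__ == "__main__":
    rng = random.Random()
    for num_items in (10, 1000):  # assumed queue sizes
        waits = sorted(
            seconds_until_processed(num_items, 1.0, 120.0, rng)
            for _ in range(200)
        )
        print(num_items, "items: min", waits[0],
              "median", waits[len(waits) // 2], "max", waits[-1])
```

When the whole queue fits inside one resync period, every item is handled within that period. Once it no longer fits, whether a given item is reached before the next reshuffle is down to chance, so the wait can range from seconds to many resync periods.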

Comment 7 Steve Speicher 2016-08-10 18:27:30 UTC

Is there any workaround or fix planned for preview (prod)? I've been observing the same behavior since Thursday (Aug 4th).

Comment 8 Dan Mace 2016-08-10 18:32:40 UTC

https://github.com/openshift/ose/pull/336 is our first attempt at a fix.

Comment 11 mdong 2016-08-15 06:05:36 UTC

Verified against atomic-openshift-3.2.1.13-1.git.4.8365dd3.el7.x86_64

With the release of atomic-openshift-3.2.1.13-1.git.4.8365dd3.el7.x86_64, the root-level configuration key (deploymentControllerResyncMinutes: 15) was added to master-config.yaml.
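
For anyone who wants to confirm the setting on a master, a small check along these lines might help. It is only a sketch: the function name resync_minutes is hypothetical, the sample text contains just the key from this comment plus the usual apiVersion and kind header lines, and it uses a regular expression rather than a real YAML parser, so it recognizes the key only when it is unindented (root level) and has a plain integer value.

```python
import re


def resync_minutes(config_text):
    """Return the root-level deploymentControllerResyncMinutes value, or None."""
    match = re.search(
        r"^deploymentControllerResyncMinutes:\s*(\d+)\s*$",
        config_text,
        re.MULTILINE,  # '^' must match at the start of a line, i.e. no indentation
    )
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = (
        "apiVersion: v1\n"
        "kind: MasterConfig\n"
        "deploymentControllerResyncMinutes: 15\n"
    )
    print(resync_minutes(sample))
```

To check a real file, read its contents and pass the text to resync_minutes; a result of None means the key is absent or not at the root level.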

Comment 13 yasun 2018-07-19 06:55:36 UTC

Hi Mike,

Since this is a performance bug and needs a test case, has the SVT team covered it? If so, please add the Polarion case number. Thanks.

Comment 15 yasun 2018-07-25 08:11:54 UTC

Thanks, Mike.
