+++ This bug was initially created as a clone of Bug #1365146 +++

I have a number of pods (6) of my nodejs application running. I set the replicas for the deployment config down to 1:

$ oc scale dc/nodejs-mongodb-example --replicas=6

and then

$ oc scale dc/nodejs-mongodb-example --replicas=1

on my project sspeiche-test1.

$ oc get dc
NAME                     REVISION   REPLICAS   TRIGGERED BY
mongodb                  2          1          config,image(mongodb:3.2)
nodejs-mongodb-example   2          6          config,image(nodejs-mongodb-example:latest)

$ oc get pods
NAME                             READY     STATUS      RESTARTS   AGE
mongodb-2-f41yo                  1/1       Running     0          1h
nodejs-mongodb-example-1-build   0/1       Completed   0          1h
nodejs-mongodb-example-2-mzy8s   1/1       Running     0          54m

$ oc describe dc/nodejs-mongodb-example
Name:            nodejs-mongodb-example
Namespace:       sspeiche-nodejs
Created:         About an hour ago
Labels:          template=nodejs-mongodb-example
Description:     Defines how to deploy the application server
Annotations:     <none>
Latest Version:  2
Selector:        name=nodejs-mongodb-example
Replicas:        6
Triggers:        Image(nodejs-mongodb-example@latest, auto=true), Config
Strategy:        Rolling
Template:
  Labels:        name=nodejs-mongodb-example
  Containers:
   nodejs-mongodb-example:
    Image:       172.30.47.227:5000/sspeiche-nodejs/nodejs-mongodb-example@sha256:eef9f4d331711021384e51927ee33fb2a147bcb152b88f78a0d6fb9064e0de8d
    Port:        8080/TCP
    QoS Tier:
      cpu:       BestEffort
      memory:    BestEffort
    Limits:
      memory:    256Mi
    Liveness:    http-get http://:8080/pagecount delay=30s timeout=3s period=10s #success=1 #failure=3
    Readiness:   http-get http://:8080/pagecount delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment Variables:
      DATABASE_SERVICE_NAME:    mongodb
      MONGODB_USER:             userL3V
      MONGODB_PASSWORD:         vuJwtB66XanYHxFO
      MONGODB_DATABASE:         sampledb
      MONGODB_ADMIN_PASSWORD:   qXUjWUjH8YGJCDD3
  No volumes.

Deployment #2 (latest):
  Name:          nodejs-mongodb-example-2
  Created:       55 minutes ago
  Status:        Complete
  Replicas:      1 current / 1 desired
  Selector:      deployment=nodejs-mongodb-example-2,deploymentconfig=nodejs-mongodb-example,name=nodejs-mongodb-example
  Labels:        openshift.io/deployment-config.name=nodejs-mongodb-example,template=nodejs-mongodb-example
  Pods Status:   1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Deployment #1:
  Created:       about an hour ago
  Status:        Complete
  Replicas:      0 current / 0 desired

Events:
  FirstSeen  LastSeen  Count  From                            SubobjectPath  Type     Reason             Message
  ---------  --------  -----  ----                            -------------  ----     ------             -------
  1h         1h        1      {deploymentconfig-controller }                 Normal   DeploymentCreated  Created new deployment "nodejs-mongodb-example-1" for version 1
  1h         1h        1      {deployment-controller }                       Warning  FailedUpdate       Cannot update deployment sspeiche-nodejs/nodejs-mongodb-example-1 status to Pending: replicationcontrollers "nodejs-mongodb-example-1" cannot be updated: the object has been modified; please apply your changes to the latest version and try again
  55m        55m       1      {deploymentconfig-controller }                 Normal   DeploymentCreated  Created new deployment "nodejs-mongodb-example-2" for version 2
  55m        55m       1      {deployment-controller }                       Warning  FailedUpdate       Cannot update deployment sspeiche-nodejs/nodejs-mongodb-example-2 status to Pending: replicationcontrollers "nodejs-mongodb-example-2" cannot be updated: the object has been modified; please apply your changes to the latest version and try again
  45m        45m       1      {deploymentconfig-controller }                 Normal   DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 4 to 1
  50m        42m       2      {deploymentconfig-controller }                 Normal   DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 1 to 4
  40m        40m       1      {deploymentconfig-controller }                 Normal   DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 4 to 6
  12m        12m       1      {deploymentconfig-controller }                 Normal   DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 1 to 6
  17m        6m        2      {deploymentconfig-controller }                 Normal   DeploymentScaled   Scaled deployment "nodejs-mongodb-example-2" from 6 to 1
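The "object has been modified" warnings above are the API server's standard optimistic-concurrency conflicts: the controller read a replication controller, something else updated it first, and the write was rejected because the resourceVersion was stale. A client normally resolves this by re-reading the latest object and reapplying its change, for example with client-go's retry.RetryOnConflict helper. The sketch below only illustrates that pattern with a recent client-go; the kubeconfig path, namespace, and RC name are taken from this report purely for illustration, and this is not the deployment controller's actual code path.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/client-go/util/retry"
)

func main() {
    // Load a kubeconfig the way oc/kubectl would (path is illustrative).
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    rcs := client.CoreV1().ReplicationControllers("sspeiche-nodejs")

    // RetryOnConflict re-runs the closure whenever the update is rejected with
    // "the object has been modified; please apply your changes to the latest
    // version and try again", re-reading the object each time so the change is
    // applied on top of the latest resourceVersion.
    err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
        rc, err := rcs.Get(context.TODO(), "nodejs-mongodb-example-2", metav1.GetOptions{})
        if err != nil {
            return err
        }
        replicas := int32(1)
        rc.Spec.Replicas = &replicas // the scale change, reapplied to the fresh object
        _, err = rcs.Update(context.TODO(), rc, metav1.UpdateOptions{})
        return err
    })
    if err != nil {
        fmt.Println("scale failed after retries:", err)
    }
}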
deployment "nodejs-mongodb-example-2" from 4 to 6 12m 12m 1 {deploymentconfig-controller } Normal DeploymentScaled Scaled deployment "nodejs-mongodb-example-2" from 1 to 6 17m 6m 2 {deploymentconfig-controller } Normal DeploymentScaled Scaled deployment "nodejs-mongodb-example-2" from 6 to 1 --- Additional comment from Michal Fojtik on 2016-08-09 08:05:02 EDT --- I think this should be fixed in >3.3, but will let Michalis prove me wrong. I think DC will retry after the conflict and that retry period seems long. --- Additional comment from Steve Speicher on 2016-08-09 11:36:09 EDT --- @Michal but this is an urgent outage on prod preview close now. I only see the conflict some times and thing it is due to the deployment / scale performance slowness. So if the slowness is fixed, the conflict won't surface, right? --- Additional comment from Michal Fojtik on 2016-08-09 12:07:52 EDT --- (In reply to Steve Speicher from comment #2) > @Michal but this is an urgent outage on prod preview close now. I only see > the conflict some times and thing it is due to the deployment / scale > performance slowness. So if the slowness is fixed, the conflict won't > surface, right? Yeah, I think Michalis told me once that the conflict is not critical and it is retried. I'm not sure if retrying is causing the slowness or not. I'm going to investigate this tomorrow as this might be related to other deployment flakes we are seeing nowadays. --- Additional comment from Michal Fojtik on 2016-08-09 13:32:20 EDT --- Dan: It seems you will be the right person to look at this (it does not seem like a DC problem, but more as RC problem). Also Andy told me you were chasing something similar yesterday. --- Additional comment from Dan Mace on 2016-08-09 13:43:15 EDT --- The current issue I'm chasing is a minutes-long delay between DC and initial RC creation. This *could* be related since RC scaling via DC is also handled by the controllers: if processing is somehow slowed down for creating the RCs, it could also be slow for scaling the RCs in response to DC scale changes. --- Additional comment from Dan Mace on 2016-08-09 14:58:51 EDT --- My current theory is that deployment config controller processing time is now exceeding the reflector resync period. Each resync interval, the entire work queue is replaced and randomized, which would account for the wild variance. I'm going to do some load testing to verify. If the resync interval is indeed the problem, we'll need to introduce configuration to allow the resync interval to be modified beyond its current hard-coded 2 minute default. --- Additional comment from Steve Speicher on 2016-08-10 14:27:30 EDT --- There any workaround or fix planned for preview (prod)? I'm still observing the same behavior since Thursday (Aug 4th) --- Additional comment from Dan Mace on 2016-08-10 14:32:40 EDT --- https://github.com/openshift/ose/pull/336 is our first attempt at a fix.
Will verify this bug, as it is a clone of Bug #1365146.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1853