1306720 – First attempt to scale up docker-registry fails after scale down and config for S3


Keywords:
Status:	CLOSED DUPLICATE of bug 1333129
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Image Registry
Sub Component:
Version:	3.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Michail Kargakis
QA Contact:	zhou ying
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-02-11 16:26 UTC by Mike Fiedler
Modified:	2016-07-20 08:53 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-07-20 08:53:29 UTC
Target Upstream Version:
Embargoed:


Description Mike Fiedler 2016-02-11 16:26:45 UTC

Description of problem:

After scaling the docker-registry to 0 instances, configuring it for S3 and then scaling it back up to 1 instance, the first scale up attempt fails.  

oc describe dc/docker-registry shows desired pods as 0.

The second scale up attempt succeeds.


Version-Release number of selected component (if applicable): 3.1.1.901.   Build is enterprise-3.2/2016-02-10.1.


How reproducible:

Always, on the first scale-up attempt after configuring for S3. Unable to reproduce if the reconfiguration is not performed.


Steps to Reproduce:

1. oc scale --replicas=0 dc/docker-registry

2. cat > /root/config.yml <<EOF
version: 0.1
log:
  level: debug
http:
  addr: :5000
storage:
  cache:
    layerinfo: inmemory
  s3:
    accesskey: ${AWS_CONF[AWSAccessKeyId]}
    secretkey: ${AWS_CONF[AWSSecretKey]}
    region: $REGION
    bucket: $BUCKET
    encrypt: true
    secure: true
    v4auth: true
    rootdirectory: /registry
middleware:
  repository:
    - name: openshift
EOF

3. oc secrets new dockerregistry /root/config.yml

4. oc volume dc/docker-registry --add --name=dockersecrets -m /etc/registryconfig --type=secret --secret-name=dockerregistry

5. oc env dc/docker-registry REGISTRY_CONFIGURATION_PATH=/etc/registryconfig/config.yml

6. oc scale --replicas=1 dc/docker-registry

7. oc get pods
8. oc describe dc/docker-registry

Actual results:

No docker-registry running, describe shows 0 desired.

Expected results:

docker-registry pod is running with 1 replica actual and desired

Additional info:

second oc scale --replicas=1 succeeds.
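
Since a second attempt succeeds, the retry can be scripted. The following is only a sketch: the helper name `scale_and_verify`, the attempt count, and the pause length are illustrative assumptions, and it assumes the desired replica count can be read back from `.spec.replicas` with `-o jsonpath`.

```shell
# Hypothetical helper: scale, then confirm the dc recorded the request; retry if not.
# Usage: scale_and_verify <dc-name> <replicas> [max-attempts] [pause-seconds]
scale_and_verify() {
  dc_name="$1"
  want="$2"
  attempts="${3:-3}"   # assumed default
  pause="${4:-10}"     # assumed default, in seconds

  i=1
  while [ "$i" -le "$attempts" ]; do
    oc scale --replicas="$want" "dc/$dc_name"
    sleep "$pause"
    # Read back the desired replica count stored on the deployment config.
    have="$(oc get "dc/$dc_name" -o jsonpath='{.spec.replicas}')"
    if [ "$have" = "$want" ]; then
      return 0
    fi
    i=$((i + 1))
  done
  echo "dc/$dc_name still reports $have desired replicas after $attempts attempts" >&2
  return 1
}
```

For the case reported here it would be called as `scale_and_verify docker-registry 1`.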

Comment 1 Mike Fiedler 2016-02-11 16:28:29 UTC

In my test, I am supplying real values for the AWS access key, secret, etc.

Comment 2 Maciej Szulik 2016-02-12 16:00:59 UTC

After digging deeper, here's what I found: after step 4 (updating the volume), the deployment controller fires the configchange trigger and starts a new deployment, but since replicas=0 no pods are created (observed with oc get pods -w). When you then invoke oc scale, it is somehow ignored (replicas is still 0 after this step). The problem therefore seems to be specific to the deployment controller. I'm digging deeper into it...

Comment 3 Maciej Szulik 2016-02-12 16:12:56 UTC

My bad, it was the 5th operation (oc env) that triggered the config change.

Comment 4 Maciej Szulik 2016-02-15 12:33:37 UTC

After thorough investigation, this doesn't look like a bug in the deployment controller; it looks more like a timing issue.

Here's the flow of operations:
1. scale dc down to 0
2. create secret
3. update dc's volume, which triggers 'configchange' on a dc which is handled by the controller
4. update dc's envs, which again triggers 'configchange' on a dc, again handled by the controller
5. scale up <- if this operation is executed immediately after step 4, the deployment controller does not record the intention, since it is still processing the changes from step 4. If, on the other hand, one waits ~20-30 seconds before executing the scale operation, the intention is properly recognized and applied to the dc.

The above is the only reasonable explanation for the oddities described in this issue.
I didn't find any errors, events, or other problems in the logs. I've spent quite a lot of time digging into this issue, and I additionally found that I couldn't reproduce the problem with any other custom deployment. Only docker-registry configured with S3 is vulnerable to this problem.
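
The wait-before-scaling workaround from the flow above could be scripted roughly like this. It is only a sketch: the helper name `scale_after_settle` is made up, and the 30-second default is an assumption taken from the ~20-30 second estimate, not a tested threshold.

```shell
# Hypothetical helper: give the deployment controller time to settle, then scale.
# Usage: scale_after_settle <dc-name> <replicas> [delay-seconds]
scale_after_settle() {
  dc_name="$1"
  replicas="$2"
  delay="${3:-30}"   # assumed default, based on the ~20-30 s estimate above

  # Let the controller finish processing the previous config change.
  sleep "$delay"
  oc scale --replicas="$replicas" "dc/$dc_name"
}
```

For example, `scale_after_settle docker-registry 1` would replace the plain `oc scale` in step 5.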

Comment 5 Maciej Szulik 2016-02-15 15:11:20 UTC

The other viable option is to pass the --timeout=60s option to the scale operation. This makes the scale operation wait up to the given timeout for the replicas to become available. The problem seems to be inside the code that supports old clients, which used to update the RC's replicas directly instead of using the Scale subresource.
See https://github.com/openshift/origin/blob/f556adc0ec9f3fe62d0860c21be6f5fcc96a95f1/pkg/deploy/controller/deploymentconfig/controller.go#L226
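
As a minimal sketch of the timeout workaround, the scale call can be wrapped so the option is never forgotten. The wrapper name `scale_with_timeout` is hypothetical; the 60s default simply mirrors the value suggested in this comment.

```shell
# Hypothetical wrapper around the --timeout workaround.
# Usage: scale_with_timeout <dc-name> <replicas> [timeout]
scale_with_timeout() {
  # Wait up to the given timeout (default 60s, as suggested above) for the scale to apply.
  oc scale --replicas="$2" --timeout="${3:-60s}" "dc/$1"
}
```

For example: `scale_with_timeout docker-registry 1`.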

Comment 7 zhou ying 2016-02-19 09:41:03 UTC

Hi Mike:
  Since my S3 account is still not available, could you please help review this bug first? Thanks.

Comment 8 zhou ying 2016-02-19 09:41:44 UTC

Hi Mike:
  Since my S3 account is still not available, could you please help verify this bug first? Thanks.

Comment 9 Mike Fiedler 2016-02-19 14:23:49 UTC

This problem is more prevalent than just for docker-registry on S3.  I can recreate it for any application like this:

1. oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-hello-world.git
2. wait for app to build
3. oc scale --replicas=0 dc/ruby-hello-world
4. oc env dc/ruby-hello-world MYVAR=anything
5. oc scale --replicas=1 dc/ruby-hello-world

If you do step 5 right after step 4, the pod fails to be created.

If you wait 60 seconds between 4 and 5 it works and the pod is created, so I can confirm it is a timing issue (see comment 4 and comment 5).    

Waiting or specifying a timeout is a workaround, but since this seems to affect any application, my opinion is that the problem should be addressed.

My test dev-env:

openshift v1.1.3-89-gdd651fa
kubernetes v1.2.0-alpha.7-703-gbc4550d
etcd 2.2.5

Comment 10 Xiaoli Tian 2016-02-22 08:42:18 UTC

QE suggests addressing the issue mentioned in comment 9; assigning it back to devel.

Comment 11 Maciej Szulik 2016-02-22 13:32:09 UTC

This should be easily addressable once https://github.com/openshift/origin/pull/7149 merges. Until then I'm leaving it as is.

Comment 12 Clayton Coleman 2016-02-22 14:04:12 UTC

Marked upcoming release due to scope of fix.

Comment 13 Maciej Szulik 2016-06-28 09:34:22 UTC

With the PR mentioned in comment 11 I'm moving this to QA.

Comment 14 zhou ying 2016-06-28 10:22:26 UTC

Confirmed with the latest openshift image: the issue is still not fixed.
openshift v1.3.0-alpha.2-250-g61eba05
kubernetes v1.3.0-alpha.3-599-g2746284
etcd 2.3.0+git

Steps:
[root@ip-172-18-6-229 amd64]# oc scale --replicas=0 dc/ruby-ex
deploymentconfig "ruby-ex" scaled
[root@ip-172-18-6-229 amd64]# oc get po
NAME              READY     STATUS      RESTARTS   AGE
ruby-ex-1-build   0/1       Completed   0          3m
[root@ip-172-18-6-229 amd64]# oc env dc/ruby-ex MYVAR=anything; oc scale --replicas=1 dc/ruby-ex
deploymentconfig "ruby-ex" updated
deploymentconfig "ruby-ex" scaled
[root@ip-172-18-6-229 amd64]# oc get po
NAME               READY     STATUS      RESTARTS   AGE
ruby-ex-1-build    0/1       Completed   0          4m
ruby-ex-2-deploy   1/1       Running     0          4s

[root@ip-172-18-6-229 amd64]# oc get po
NAME              READY     STATUS      RESTARTS   AGE
ruby-ex-1-build   0/1       Completed   0          13m

Comment 15 Maciej Szulik 2016-07-01 14:14:38 UTC

Michalis, I guess this deserves a closer look from you.

Comment 16 Michail Kargakis 2016-07-04 09:27:01 UTC

Scaling immediately after deploying is a known issue in the current deployment system due to the support we provide in order to be backwards-compatible with older clients that use `oc scale`. If you want `oc scale` to succeed, you will need to run it after the latest deployment is Complete.
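
One way to wait for the latest deployment to be Complete before scaling is sketched below. Treat it as an assumption-heavy example: the helper name `wait_for_complete` and its defaults are made up, and it assumes that the latest deployment is the replication controller named `<dc-name>-<latestVersion>` and that its phase is exposed in the `openshift.io/deployment.phase` annotation.

```shell
# Hypothetical helper: block until the latest deployment of a dc is Complete.
# Usage: wait_for_complete <dc-name> [max-polls] [poll-seconds]
wait_for_complete() {
  dc_name="$1"
  polls="${2:-30}"     # assumed default
  interval="${3:-2}"   # assumed default, in seconds

  version="$(oc get "dc/$dc_name" -o jsonpath='{.status.latestVersion}')"
  # Assumption: the phase is stored in this annotation on the deployment's RC.
  tmpl='{{index .metadata.annotations "openshift.io/deployment.phase"}}'

  while [ "$polls" -gt 0 ]; do
    phase="$(oc get "rc/${dc_name}-${version}" -o go-template="$tmpl")"
    if [ "$phase" = "Complete" ]; then
      return 0
    fi
    if [ "$phase" = "Failed" ]; then
      return 1
    fi
    sleep "$interval"
    polls=$((polls - 1))
  done
  return 1   # gave up before the deployment completed
}
```

It could then be chained with the scale call, for example `wait_for_complete ruby-ex && oc scale --replicas=1 dc/ruby-ex`.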

Is there any reason for scaling a deployment down to zero? In the case of the docker-registry deployment, you should pause the deployment, edit the config, and then resume the deployment.

oc rollout pause dc/docker-registry
# update envs
# probes
# etc
oc rollout resume dc/docker-registry

There is an issue about updating the docs to reflect the latest pause functionality: https://github.com/openshift/openshift-docs/issues/2068
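
Applied to the steps from the description, the pause/edit/resume sequence might look like the sketch below. The function name `reconfigure_registry_paused` is hypothetical, and it assumes the `dockerregistry` secret from step 3 has already been created.

```shell
# Hypothetical helper: apply the S3 registry config while the dc is paused.
# Assumes the dockerregistry secret from step 3 already exists.
reconfigure_registry_paused() {
  dc="dc/docker-registry"
  oc rollout pause "$dc" || return 1

  # Make both edits while paused (volume from step 4, env from step 5).
  oc volume "$dc" --add --name=dockersecrets -m /etc/registryconfig \
    --type=secret --secret-name=dockerregistry &&
  oc env "$dc" REGISTRY_CONFIGURATION_PATH=/etc/registryconfig/config.yml
  status=$?

  # Always resume, even if an edit failed, so the dc is not left paused.
  oc rollout resume "$dc" || return 1
  return "$status"
}
```

With this approach the scale down to 0 and the later scale up are not needed at all.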

Comment 17 Mike Fiedler 2016-07-05 11:51:16 UTC

You're correct that, for this registry example, the scale down is not mandatory.

There are times, though, when you do want to scale down to 0 to prevent activity by a deployment while a change is made.   Example - this script from Clayton likely won't work correctly:  https://lists.openshift.redhat.com/openshift-archives/dev/2016-May/msg00037.html

Comment 18 Michail Kargakis 2016-07-13 08:25:49 UTC

Hm, I replied via e-mail not long ago but I don't see my answer anywhere ...
Once more, if you want to scale down to zero and then scale up then you probably need to use the Recreate strategy.

Comment 19 Michail Kargakis 2016-07-20 08:53:29 UTC

Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1333129.

Mike, this bug (scaling right after starting a new deployment) is not going to be fixed anytime soon[1] - in the meantime the right thing to do here is to change the strategy to Recreate (scales down to zero old pods, scales up new pods)

[1] As I said in a previous post, because of backwards compatibility. We may want to output an appropriate error message when somebody tries to scale a running deployment.
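
Switching the strategy could be done with a patch along these lines. This is a sketch rather than a verified procedure: the helper name `use_recreate_strategy` is made up, and it assumes that setting `spec.strategy.type` is sufficient and that any existing strategy parameters can be left in place.

```shell
# Hypothetical helper: switch a deployment config to the Recreate strategy.
# Usage: use_recreate_strategy <dc-name>
use_recreate_strategy() {
  # Merge-patch only the strategy type; other strategy fields are left untouched.
  oc patch "dc/$1" -p '{"spec":{"strategy":{"type":"Recreate"}}}'
}
```

For example: `use_recreate_strategy docker-registry`.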

*** This bug has been marked as a duplicate of bug 1333129 ***
