Bug 1306720
| Summary: | First attempt to scale up docker-registry fails after scale down and config for S3 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Image Registry | Assignee: | Michail Kargakis <mkargaki> |
| Status: | CLOSED DUPLICATE | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.2.0 | CC: | aos-bugs, ccoleman, jokerman, mifiedle, mmccomas, xtian |
| Target Milestone: | --- | Keywords: | NeedsTestCase |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-07-20 08:53:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Mike Fiedler
2016-02-11 16:26:45 UTC
In my test, I am supplying real values for the AWS access key, secret, etc.

After digging deeper, here is what I found: after step 4 (updating the volume) the deployment controller fires a configchange trigger and starts a new deployment, but since replicas=0 no pod is started (observed with oc get pods -w). When you then invoke oc scale, it somehow gets ignored (replicas is still 0 after this step). The problem therefore seems specific to the deployment controller. I'm digging deeper into it...

My bad, it was the 5th operation (oc env) that triggered the config change. After thorough investigation this does not look like a bug in the deployment controller; it looks like a timing issue. Here is the flow of operations:

1. Scale the dc down to 0.
2. Create the secret.
3. Update the dc's volume, which triggers 'configchange' on the dc, handled by the controller.
4. Update the dc's envs, which again triggers 'configchange' on the dc, again handled by the controller.
5. Scale up. If this operation is executed immediately after step 4, the deployment controller does not record the intention, since it is still processing the changes from step 4. If, on the other hand, you wait and execute the scale operation after ~20-30 seconds, the intention is properly recognized and applied to the dc.

The above is the only reasonable explanation for the oddities described in this issue. I didn't find any errors, events, or other problems in the logs. I've spent quite a lot of time digging into this issue, and what I additionally found is that I couldn't reproduce the problem with any other custom deployment; only docker-registry in the configuration with S3 is vulnerable to this problem.

The other viable option is to pass the --timeout=60s option to the scale operation. This forces the scale operation to wait up to the given timeout for the replicas to become available. The problem seems to be inside the code that supports old clients, which used to update the RC's replicas directly instead of using the Scale subresource.
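The five-step flow described above, with the --timeout workaround from this comment, can be sketched roughly as follows. The secret name, mount path, and registry environment variable are illustrative assumptions, not taken from the report:

```shell
# Sketch of the reported flow; names below are illustrative, not from the report.

# 1. Scale the registry down to zero
oc scale --replicas=0 dc/docker-registry

# 2. Create a secret holding the S3 registry config (assumed name/file)
oc secrets new registry-s3-config config.yml=./config.yml

# 3. Mount the secret into the dc -- fires a configchange trigger
oc volume dc/docker-registry --add --name=s3-config \
    --mount-path=/etc/registry --secret-name=registry-s3-config

# 4. Update the dc's envs -- fires another configchange trigger
oc env dc/docker-registry REGISTRY_CONFIGURATION_PATH=/etc/registry/config.yml

# 5. Scale back up; without --timeout this request can be silently dropped
#    if it lands while the controller is still processing step 4
oc scale --replicas=1 --timeout=60s dc/docker-registry
```

These commands require a running cluster and are shown only to make the timing window concrete.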
See https://github.com/openshift/origin/blob/f556adc0ec9f3fe62d0860c21be6f5fcc96a95f1/pkg/deploy/controller/deploymentconfig/controller.go#L226

Hi Mike: Since my S3 account is still not available, could you please help to review this bug first? Thanks.

Hi Mike: Since my S3 account is still not available, could you please help to verify this bug first? Thanks.

This problem is more prevalent than just docker-registry on S3. I can recreate it for any application like this:

1. oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-hello-world.git
2. Wait for the app to build.
3. oc scale --replicas=0 dc/ruby-hello-world
4. oc env dc/ruby-hello-world MYVAR=anything
5. oc scale --replicas=1 dc/ruby-hello-world

If you do step 5 right after step 4, the pod fails to be created. If you wait 60 seconds between 4 and 5, it works and the pod is created, so I can confirm it is a timing issue (see comment 4 and comment 5). Waiting or specifying a timeout is a workaround, but since this seems to affect any application, my opinion is that the problem should be addressed.

My test dev-env:
openshift v1.1.3-89-gdd651fa
kubernetes v1.2.0-alpha.7-703-gbc4550d
etcd 2.2.5

QE suggests addressing the issue mentioned in comment 9; assigning back to devel.

This should be easily addressable once https://github.com/openshift/origin/pull/7149 merges. Until then I'm leaving it as is.

Marked upcoming release due to scope of fix.

With the PR mentioned in comment 11 merged, I'm moving this to QA.

Confirmed with the latest openshift image; the issue is still not fixed.
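For the generic reproducer above, the two workarounds mentioned in this report (waiting between steps, or passing a timeout to the scale) look like this; the 60-second value is the one quoted in the report:

```shell
# Reproduce the race, then work around it (requires a running cluster)
oc scale --replicas=0 dc/ruby-hello-world
oc env dc/ruby-hello-world MYVAR=anything

# Workaround A: wait for the configchange deployment to finish first
sleep 60
oc scale --replicas=1 dc/ruby-hello-world

# Workaround B: let the scale operation itself wait for the replicas
oc scale --replicas=1 --timeout=60s dc/ruby-hello-world
```

Either form avoids the window in which the controller silently drops the scale request.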
openshift v1.3.0-alpha.2-250-g61eba05
kubernetes v1.3.0-alpha.3-599-g2746284
etcd 2.3.0+git

Steps:

[root@ip-172-18-6-229 amd64]# oc scale --replicas=0 dc/ruby-ex
deploymentconfig "ruby-ex" scaled
[root@ip-172-18-6-229 amd64]# oc get po
NAME              READY   STATUS      RESTARTS   AGE
ruby-ex-1-build   0/1     Completed   0          3m
[root@ip-172-18-6-229 amd64]# oc env dc/ruby-ex MYVAR=anything; oc scale --replicas=1 dc/ruby-ex
deploymentconfig "ruby-ex" updated
deploymentconfig "ruby-ex" scaled
[root@ip-172-18-6-229 amd64]# oc get po
NAME               READY   STATUS      RESTARTS   AGE
ruby-ex-1-build    0/1     Completed   0          4m
ruby-ex-2-deploy   1/1     Running     0          4s
[root@ip-172-18-6-229 amd64]# oc get po
NAME              READY   STATUS      RESTARTS   AGE
ruby-ex-1-build   0/1     Completed   0          13m

Michalis, I guess it deserves your closer look.

Scaling immediately after deploying is a known issue in the current deployment system due to the support we provide in order to be backwards-compatible with older clients that use `oc scale`. If you want `oc scale` to succeed, you will need to run it after the latest deployment is Complete. Is there any reason for scaling a deployment down to zero? In the case of the docker-registry deployment, you should pause the deployment, edit the config, and resume the deployment:

oc rollout pause dc/docker-registry
# update envs
# probes
# etc
oc rollout resume dc/docker-registry

There is an issue about updating the docs to reflect the latest pause functionality: https://github.com/openshift/openshift-docs/issues/2068

You're correct that for this registry example the scale down is not mandatory. There are times, though, when you do want to scale down to 0 to prevent activity by a deployment while a change is made. For example, this script from Clayton likely won't work correctly: https://lists.openshift.redhat.com/openshift-archives/dev/2016-May/msg00037.html

Hm, I replied via e-mail not long ago but I don't see my answer anywhere ...
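The advice above, to run `oc scale` only after the latest deployment is Complete, can be scripted instead of eyeballed. A minimal sketch, assuming an `oc rollout status` subcommand is available in the client (it is not mentioned in this report):

```shell
# Block until the latest deployment of the dc finishes, then scale.
# Assumes `oc rollout status` exists in this client version.
oc rollout status dc/ruby-ex
oc scale --replicas=1 dc/ruby-ex
```

If the client lacks that subcommand, polling the dc's latest replication controller until it reports a Complete deployment achieves the same thing.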
Once more: if you want to scale down to zero and then scale up, you probably need to use the Recreate strategy.

Closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1333129.

Mike, this bug (scaling right after starting a new deployment) is not going to be fixed anytime soon [1]. In the meantime, the right thing to do here is to change the strategy to Recreate (it scales the old pods down to zero, then scales the new pods up).

[1] As I said in a previous post, because of backwards compatibility. We may want to output an appropriate error message when somebody tries to scale a running deployment.

*** This bug has been marked as a duplicate of bug 1333129 ***
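Switching a DeploymentConfig to the Recreate strategy, as recommended above, can be done with a strategic merge patch; a sketch, assuming the standard `spec.strategy.type` field of a DeploymentConfig:

```shell
# Change the deployment strategy to Recreate so old pods are scaled
# down to zero before the new pods are scaled up
oc patch dc/docker-registry -p '{"spec":{"strategy":{"type":"Recreate"}}}'
```

With Recreate in place, the explicit scale-down/scale-up dance from the reproduction steps is no longer needed.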