Created attachment 1132617 [details]
hooks-1-deploy.yaml

Description of problem:
When updating a deployment config with changed components, the deployment always fails with "timed out waiting for any update progress to be made" once about 2 minutes have elapsed. Only reproduced in OSE/AEP; it works in Origin.

Version-Release number of selected component (if applicable):
openshift/oc v3.1.1.908
kubernetes v1.2.0-alpha.7-703-gbc4550d
etcd 2.2.5

How reproducible:
Always

Steps to Reproduce:
1.
$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/deployment/deployment1.json
deploymentconfig "hooks" created
$ oc get pod hooks-1-deploy -o yaml > hooks-1-deploy.yaml
$ oc get pod
NAME            READY     STATUS    RESTARTS   AGE
hooks-1-wl8j2   1/1       Running   0          1m

2. Update:
$ oc replace -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/deployment/updatev1.json
deploymentconfig "hooks" replaced
$ oc get pod hooks-2-deploy -o yaml > hooks-2-deploy.yaml
$ oc get pod
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-wl8j2    1/1       Running   0          3m
hooks-2-5cdy7    1/1       Running   0          1m
hooks-2-deploy   1/1       Running   0          2m
$ oc get pod
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-wl8j2    1/1       Running   0          3m
hooks-2-5cdy7    1/1       Running   0          2m
hooks-2-deploy   0/1       Error     0          2m

3. Check logs:
$ oc logs pod/hooks-2-deploy
I0302 21:25:57.711799 1 deployer.go:201] Deploying from xxia-proj/hooks-1 to xxia-proj/hooks-2 (replicas: 2)
I0302 21:25:58.788683 1 rolling.go:228] RollingUpdater: Continuing update with existing controller hooks-2.
I0302 21:25:58.810908 1 rolling.go:228] RollingUpdater: Scaling up hooks-2 from 0 to 2, scaling down hooks-1 from 1 to 0 (keep 2 pods available, don't exceed 2 pods)
I0302 21:25:58.810933 1 rolling.go:228] RollingUpdater: Scaling hooks-2 up to 1
F0302 21:28:02.181326 1 deployer.go:69] timed out waiting for any update progress to be made

4.
$ oc get pod hooks-2-deploy -o yaml > hooks-2-deploy_Error.yaml

Actual results:
2. hooks-2-deploy changes from Running to Error once 2 minutes have elapsed.
3. Error message: "timed out waiting for any update progress to be made". Note: 21:28:02 - 21:25:57 is about 2 min.

Expected results:
2 and 3: hooks-2-deploy should eventually succeed, as on the Origin instance.

Additional info:
See the attached YAML files. Not sure if openshift3/ose-deployer:v3.1.1.908 has a problem.
Created attachment 1132618 [details]
hooks-2-deploy.yaml
Usually a deployment failing to make progress can be attributed to the new deployment's pods failing in some way. For instance, this can happen if you have an invalid probe, the pod never becomes ready, and your DC is configured to ensure you do not have downtime. Can you take a look at the pods the second deployment is creating and post any logs from them as well?
(In reply to Paul Weil from comment #3)
> configured to ensure you do not have down time. Can you take a look at the
> pods the second deployment is creating and post any logs from them as well?

The 2nd deployment's "replicas" is 2. From the above result, one of the replicas runs well (i.e. hooks-2-5cdy7 1/1 Running 0 2m), but the other replica never comes out.

More interesting: download updatev1.json and note that spec.strategy.rollingParams.timeoutSeconds is 120 (i.e. 2 min). Change it to a longer value, 300, and test again:

$ wget https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/deployment/deployment1.json https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/deployment/updatev1.json
$ # Change updatev1.json timeoutSeconds value to 300
$ diff deployment1.json updatev1.json
13c13
< "type": "Recreate",
---
> "type": "Rolling",
45a46,50
> "rollingParams": {
> "updatePeriodSeconds": 1,
> "intervalSeconds": 1,
> "timeoutSeconds": 300
> },
50c55,65
< "type": "ConfigChange"
---
> "type": "ImageChange",
> "imageChangeParams": {
> "automatic": true,
> "containerNames": [
> "mysql-container"
> ],
> "from": {
> "kind": "ImageStreamTag",
> "name": "mysql-55-centos7:latest"
> }
> }
53c68
< "replicas": 1,
---
> "replicas": 2,
82c97
< "value": "Plqe5Wev"
---
> "value": "Plqe5Wevchange"
103c118,120
< "status": {}
---
> "status": {
> "latestVersion": 2
> }

$ oc create -f deployment1.json
# Then wait for deployment to complete
$ oc get pod
NAME            READY     STATUS    RESTARTS   AGE
hooks-1-b87q6   1/1       Running   0          57s
$ oc replace -f updatev1.json
deploymentconfig "hooks" replaced
$ oc get pod
# *Note*: One replica of the 2nd deployment is Running within a short time (30s)
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-b87q6    1/1       Running   0          1m
hooks-2-deploy   1/1       Running   0          41s
hooks-2-u7duf    1/1       Running   0          30s

# Continue checking pods several times.
# *Note*: For the 2nd deployment, one replica comes out shortly but the other never comes out
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-b87q6    1/1       Running   0          5m
hooks-2-deploy   1/1       Running   0          4m
hooks-2-u7duf    1/1       Running   0          4m

# Wait about 300 sec
# *Note*: hooks-2-u7duf disappears! But hooks-1-5ex89 comes out!
$ oc get pod
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-5ex89    1/1       Running   0          23s   <-- newborn pod
hooks-1-b87q6    1/1       Running   0          7m
hooks-2-deploy   0/1       Error     0          6m

$ oc logs pod/hooks-2-deploy
I0303 22:04:48.424121 1 deployer.go:201] Deploying from xxia-proj/hooks-1 to xxia-proj/hooks-2 (replicas: 2)
I0303 22:04:49.603613 1 rolling.go:228] RollingUpdater: Continuing update with existing controller hooks-2.
I0303 22:04:49.640004 1 rolling.go:228] RollingUpdater: Scaling up hooks-2 from 0 to 2, scaling down hooks-1 from 1 to 0 (keep 2 pods available, don't exceed 2 pods)
I0303 22:04:49.640024 1 rolling.go:228] RollingUpdater: Scaling hooks-2 up to 1
F0303 22:09:53.752335 1 deployer.go:69] timed out waiting for any update progress to be made

*Note*: The bug is not avoided by making timeoutSeconds longer.

$ oc get pod hooks-1-5ex89 hooks-1-b87q6 -o yaml | grep MYSQL_PASSWORD -A 1
- name: MYSQL_PASSWORD
  value: Plqe5Wev
--
- name: MYSQL_PASSWORD
  value: Plqe5Wev

*Note*: Unexpectedly, pod hooks-1-5ex89, born during the 2nd deployment, uses the pod template of the 1st deployment! (See the above diff)

I made several "*Note*" marks deserving attention. In sum: the bug is reproduced, and a new issue is found (the unexpected newborn pod).
You can reproduce it and get the logs you need.
"RollingUpdater: Scaling up hooks-2 from 0 to 2, scaling down hooks-1 from 1 to 0 (keep 2 pods available, don't exceed 2 pods)"

This is the reason why your deployment never proceeds: first it scales up the new rc: 0/2 -> 1/2. The old rc is 1/1, so the total is 2 pods. The deployment process cannot scale up or down anymore. Blocked.

Actually, why did you decide to change from a Recreate strategy to Rolling and also bump the replica count at the same time? Is there any specific reason?

maxSurge=1 and maxUnavailable=1 are set in the second deployment (you haven't specified them, so they are set by default). Your new desired size is 2, so this should allow scaling within a range of at least 1 pod available and at most 3 pods. Looking into it...
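The deadlock above can be checked with a small arithmetic sketch (illustrative only, not the actual RollingUpdater code): with a floor of 2 available pods and a cap of 2 total pods, neither direction of scaling is legal.

```python
# Illustrative sketch, not Kubernetes code: the state after the first scale-up.
old_ready, new_ready = 1, 1               # hooks-1 at 1/1, hooks-2 at 1/2
total = old_ready + new_ready             # 2 pods running
min_available, dont_exceed = 2, 2         # "keep 2 pods available, don't exceed 2 pods"

can_scale_up = total < dont_exceed            # False: already at the cap
can_scale_down = total - 1 >= min_available   # False: would drop below the floor
print(can_scale_up, can_scale_down)           # neither move is legal -> blocked
```

Since no move is ever legal, the updater waits until timeoutSeconds expires and fails.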
Ok, I verified this is due to https://github.com/openshift/origin/pull/7626

https://github.com/kubernetes/kubernetes/pull/22516 fixes it.
> https://github.com/kubernetes/kubernetes/pull/22516 fixes it.

Disregard this. It fixes deployments where maxSurge is zero.

I think this bug stems from the fact that the old desired replicas annotation is used in the rolling update. In "keep 2[a] pods available, don't exceed 2[b] pods":

[a] minAvailable = desired - maxUnavailable => 2 = 2 - 0
[b] dontexceed = original + maxSurge => 2 = 1 + 1

I believe [b] should be: dontexceed = desired + maxSurge => 3 = 2 + 1 (as it is estimated in upstream deployments[1])

[1] https://github.com/kubernetes/kubernetes/blob/09313b084704cb6198d034246a7438e8fa198dc7/pkg/util/deployment/deployment.go#L418
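The two formulas can be compared directly with a sketch (a hypothetical helper, not the actual updater code): computing the cap from the old rc's size deadlocks, while computing it from the new desired size leaves room for a surge pod.

```python
def scaling_window(original, desired, max_surge, max_unavailable, cap_from_original):
    """Compute (min_available, dont_exceed) for a rolling update.

    cap_from_original=True models the buggy behaviour (cap based on the old
    rc's size); False models the proposed fix (cap based on the new size).
    """
    min_available = desired - max_unavailable
    base = original if cap_from_original else desired
    dont_exceed = base + max_surge
    return min_available, dont_exceed

# Buggy: keep 2 available, don't exceed 2 -> no legal move.
print(scaling_window(1, 2, 1, 0, cap_from_original=True))   # (2, 2)
# Proposed: keep 2 available, don't exceed 3 -> one surge pod allowed.
print(scaling_window(1, 2, 1, 0, cap_from_original=False))  # (2, 3)
```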
Yeah, "keep 2 pods available, don't exceed 2 pods" is the reason. But testing against an Origin instance (openshift/oc v1.1.3-483-g28cba69), the result is "keep 1 pods available", and thus everything works well:

$ oc logs pod/hooks-2-deploy
I0307 05:17:28.149138 1 deployer.go:201] Deploying from xxia-proj/hooks-1 to xxia-proj/hooks-2 (replicas: 2)
I0307 05:17:29.205347 1 rolling.go:228] RollingUpdater: Continuing update with existing controller hooks-2.
I0307 05:17:29.219796 1 rolling.go:228] RollingUpdater: Scaling up hooks-2 from 0 to 2, scaling down hooks-1 from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
I0307 05:17:29.219811 1 rolling.go:228] RollingUpdater: Scaling hooks-2 up to 1
I0307 05:17:33.263548 1 rolling.go:228] RollingUpdater: Scaling hooks-1 down to 0
I0307 05:17:35.296335 1 rolling.go:228] RollingUpdater: Scaling hooks-2 up to 2

In both tests, against Origin and OSE, `oc get dc hooks -o yaml > hooks-2.yaml` gets:
maxSurge: 25%
maxUnavailable: 25%

It seems Origin treats (replicas) 2 * 25% as 1, while OSE treats 2 * 25% as 0?

BTW: there is another issue in comment 4, "hooks-2-u7duf disappears! But hooks-1-5ex89 comes out!"; isn't that also a problem?
Since https://github.com/openshift/origin/pull/7626 we started rounding down maxUnavailable instead of rounding up, so that we won't accidentally treat a non-hundred maxUnavailability as 100%, which was a serious bug. That exposed a couple of other old bugs in the rolling updater, one being the present bug, which is fixed in https://github.com/kubernetes/kubernetes/pull/22526 and will soon be merged in Origin (https://github.com/openshift/origin/pull/7828).

The last issue you mention is not an issue but a feature (automatic rollback). If a deployment fails, it is scaled down ("hooks-2-u7duf disappears") and the old succeeded deployment is scaled up ("But hooks-1-5ex89 comes out!").
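The rounding change can be illustrated with a sketch (a hypothetical helper, not the actual resolver in Kubernetes): rounding up turns 25% of 2 replicas into 1 unavailable pod, while rounding down turns it into 0, which matches the Origin/OSE difference observed above.

```python
import math

def resolve_percent(replicas, percent, round_up):
    """Resolve a percentage-based maxSurge/maxUnavailable to a pod count."""
    raw = replicas * percent / 100.0
    return math.ceil(raw) if round_up else math.floor(raw)

# Rounding up (pre-#7626 behaviour): 2 * 25% -> 1, one pod may be unavailable.
print(resolve_percent(2, 25, round_up=True))   # 1
# Rounding down (post-#7626 behaviour): 2 * 25% -> 0, exposing the cap bug.
print(resolve_percent(2, 25, round_up=False))  # 0
```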
The code is not merged in OSE yet. Will verify using OSE when merged.

Michail, in fact, the bug was not reproduced in Origin; it was only reproduced in OSE/AEP. Did you notice this? Even using the latest Origin version (openshift/oc v1.1.3-500-g115660e, kubernetes v1.2.0-alpha.7-703-gbc4550d) that includes the fix https://github.com/openshift/origin/pull/7828 (see PS), it seems the issue still exists. See below (this time I modified replicas to 4 in updatev1.json for an easy check):

$ oc create -f deployment1.json
# and wait for deployment to complete
$ oc replace -f updatev1.json
# replicas is modified to 4
$ oc logs pod/hooks-2-deploy
I0308 03:14:37.391134 1 deployer.go:201] Deploying from xxia-proj/hooks-1 to xxia-proj/hooks-2 (replicas: 4)
I0308 03:14:38.420383 1 rolling.go:228] RollingUpdater: Continuing update with existing controller hooks-2.
I0308 03:14:38.432619 1 rolling.go:228] RollingUpdater: Scaling up hooks-2 from 0 to 4, scaling down hooks-1 from 1 to 0 (keep 3 pods available, don't exceed 2 pods)
I0308 03:14:38.432635 1 rolling.go:228] RollingUpdater: Scaling hooks-2 up to 1
F0308 03:16:43.139926 1 deployer.go:69] timed out waiting for any update progress to be made

When testing with the Origin version before the fix (v1.1.3-483-g28cba69, see PS), it also gets "keep 3 pods available, don't exceed 2 pods". Does https://github.com/openshift/origin/pull/7828 not take effect?

PS: v1.1.3-500-g115660e includes #7828:
$ git log --pretty="%h %cd %cn %s" --date=local 115660e | grep '#7828'
115660e Tue Mar 8 00:53:28 2016 OpenShift Bot Merge pull request #7828 from kargakis/another-rolling-updater-fix

v1.1.3-483-g28cba69 does not include #7828:
$ git log --pretty="%h %cd %cn %s" --date=local 28cba69 | grep '#7828'
$
Did you rebuild the deployer image? It doesn't seem that you are running with my code changes...

"keep 3 pods available, don't exceed 2 pods"

You should see "keep 3 pods available, don't exceed 5 pods".
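The expected message for the replicas=4 case can be derived with a short arithmetic sketch (illustrative only, assuming the post-#7626 rounding of the default 25% parameters, not the actual code):

```python
import math

desired = 4
max_unavailable = math.floor(desired * 0.25)  # 25% rounded down -> 1
max_surge = math.ceil(desired * 0.25)         # 25% rounded up -> 1

min_available = desired - max_unavailable     # 3 -> "keep 3 pods available"
dont_exceed = desired + max_surge             # 5 -> "don't exceed 5 pods" (with the fix)
print(min_available, dont_exceed)
```

A deployer still printing "don't exceed 2 pods" is therefore still using the old original+maxSurge cap, i.e. an image built without the fix.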
Oh, I was not aware of needing a rebuilt deployer image. Now in Origin, verified with:
openshift/origin-deployer latest 5440034cfef3 15 hours ago 522.5 MB

It works well. Thank you! Will verify OSE when #7828 is merged.
Should be in v3.2.0.1 built today.
Verified with openshift/oc v3.2.0.1, kubernetes v1.2.0-alpha.7-703-gbc4550d, and image:
openshift3/ose-deployer v3.2.0.1 e5070cbcc689 12 hours ago

The bug is fixed; the 2nd deployment can complete successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064