Bug 1314142 - Updating deployment config gets error "timed out waiting for any update progress to be made" [NEEDINFO]
Updating deployment config gets error "timed out waiting for any update progr...
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Deployments (Show other bugs)
3.2.0
Unspecified Unspecified
medium Severity medium
: ---
: ---
Assigned To: Michail Kargakis
zhou ying
: Regression
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-03-02 22:26 EST by Xingxing Xia
Modified: 2016-09-05 08:58 EDT (History)
6 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-12 12:31:14 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
mkargaki: needinfo? (bvincell)


Attachments (Terms of Use)

  None (edit)
Description Xingxing Xia 2016-03-02 22:26:01 EST
Created attachment 1132617 [details]
hooks-1-deploy.yaml

Description of problem:
When updating deployment config with changed components, it always get error "timed out waiting for any update progress to be made" when time arrives beyond 2 min.
Only reproduced in OSE/AEP. Works in Origin

Version-Release number of selected component (if applicable):
openshift/oc v3.1.1.908
kubernetes v1.2.0-alpha.7-703-gbc4550d
etcd 2.2.5

How reproducible:
Always

Steps to Reproduce:
1.
$ oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/deployment/deployment1.json
deploymentconfig "hooks" created
$ oc get pod hooks-1-deploy -o yaml > hooks-1-deploy.yaml
$ oc get pod
NAME            READY     STATUS    RESTARTS   AGE
hooks-1-wl8j2   1/1       Running   0          1m

2. Update
$ oc replace -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/deployment/updatev1.json
deploymentconfig "hooks" replaced
$ oc get pod hooks-2-deploy -o yaml > hooks-2-deploy.yaml
$ oc get pod
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-wl8j2    1/1       Running   0          3m
hooks-2-5cdy7    1/1       Running   0          1m
hooks-2-deploy   1/1       Running   0          2m
$ oc get pod
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-wl8j2    1/1       Running   0          3m
hooks-2-5cdy7    1/1       Running   0          2m
hooks-2-deploy   0/1       Error     0          2m

3. Check logs
$ oc logs pod/hooks-2-deploy
I0302 21:25:57.711799       1 deployer.go:201] Deploying from xxia-proj/hooks-1 to xxia-proj/hooks-2 (replicas: 2)
I0302 21:25:58.788683       1 rolling.go:228] RollingUpdater: Continuing update with existing controller hooks-2.
I0302 21:25:58.810908       1 rolling.go:228] RollingUpdater: Scaling up hooks-2 from 0 to 2, scaling down hooks-1 from 1 to 0 (keep 2 pods available, don't exceed 2 pods)
I0302 21:25:58.810933       1 rolling.go:228] RollingUpdater: Scaling hooks-2 up to 1
F0302 21:28:02.181326       1 deployer.go:69] timed out waiting for any update progress to be made

4.
$ oc get pod hooks-2-deploy -o yaml > hooks-2-deploy_Error.yaml

Actual results:
2. hooks-2-deploy changes from Running to Error when time arrives to 2 min.
3. Error message: timed out waiting ...
Note: 21:28:02 - 21:25:57 is about 2 min

Expected results:
2 and 3: hooks-2-deploy should succeed finally, as in Origin instance.

Additional info:
See the attached YAML files. Not sure if openshift3/ose-deployer:v3.1.1.908 has problem.
Comment 1 Xingxing Xia 2016-03-02 22:27 EST
Created attachment 1132618 [details]
hooks-2-deploy.yaml
Comment 3 Paul Weil 2016-03-03 08:55:47 EST
Usually a deployment failing to make progress can be attributed to the new deployment's pods failing in some way.  For instance this can happen if you have an invalid probe and the pod never becomes ready and your DC is configured to ensure you do not have down time.  Can you take a look at the pods the second deployment is creating and post any logs from them as well?
Comment 4 Xingxing Xia 2016-03-03 23:41:20 EST
(In reply to Paul Weil from comment #3)
> configured to ensure you do not have down time.  Can you take a look at the
> pods the second deployment is creating and post any logs from them as well?

The 2nd deployment's "replicas" is 2. From the above result, one of the replica runs well (i.e. hooks-2-5cdy7    1/1       Running   0          2m). But the other replica never come out.

More interesting:
Download updatev1.json, we find spec.strategy.rollingParams.timeoutSeconds is 120 (i.e. 2 min). Change it to longer value 300, test again, we get:
$ wget https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/deployment/deployment1.json   https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/deployment/updatev1.json

$ # Change updatev1.json timeoutSeconds value to 300

$ diff deployment1.json updatev1.json
13c13
<             "type": "Recreate",
---
>             "type": "Rolling",
45a46,50
>             "rollingParams": {
>                 "updatePeriodSeconds": 1,
>                 "intervalSeconds": 1,
>                 "timeoutSeconds": 300
>             },
50c55,65
<                 "type": "ConfigChange"
---
>                 "type": "ImageChange",
>                 "imageChangeParams": {
>                     "automatic": true,
>                     "containerNames": [
>                         "mysql-container"
>                     ],
>                     "from": {
>                         "kind": "ImageStreamTag",
>                         "name": "mysql-55-centos7:latest"
>                     }
>                 }
53c68
<         "replicas": 1,
---
>         "replicas": 2,
82c97
<                                 "value": "Plqe5Wev"
---
>                                 "value": "Plqe5Wevchange"
103c118,120
<     "status": {}
---
>     "status": {
>         "latestVersion": 2
>    }

$ oc create -f deployment1.json # Then wait for deployment to complete

$ $ oc get pod
NAME            READY     STATUS    RESTARTS   AGE
hooks-1-b87q6   1/1       Running   0          57s

$ oc replace -f updatev1.json 
deploymentconfig "hooks" replaced

$ oc get pod # *Note*: One replica of the 2nd deployment is Running in short time 30s
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-b87q6    1/1       Running   0          1m
hooks-2-deploy   1/1       Running   0          41s
hooks-2-u7duf    1/1       Running   0          30s

# Continue checking pod several times. *Note*: For the 2nd deployment, one replica comes out shortly but the other never come out
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-b87q6    1/1       Running   0          5m
hooks-2-deploy   1/1       Running   0          4m
hooks-2-u7duf    1/1       Running   0          4m

# Wait about 300 sec  # *Note*: hooks-2-u7duf disappears! But hooks-1-5ex89 comes out!
$ oc get pod
NAME             READY     STATUS    RESTARTS   AGE
hooks-1-5ex89    1/1       Running   0          23s   <-- new born pod
hooks-1-b87q6    1/1       Running   0          7m
hooks-2-deploy   0/1       Error     0          6m

$ oc logs pod/hooks-2-deploy
I0303 22:04:48.424121       1 deployer.go:201] Deploying from xxia-proj/hooks-1 to xxia-proj/hooks-2 (replicas: 2)
I0303 22:04:49.603613       1 rolling.go:228] RollingUpdater: Continuing update with existing controller hooks-2.
I0303 22:04:49.640004       1 rolling.go:228] RollingUpdater: Scaling up hooks-2 from 0 to 2, scaling down hooks-1 from 1 to 0 (keep 2 pods available, don't exceed 2 pods)
I0303 22:04:49.640024       1 rolling.go:228] RollingUpdater: Scaling hooks-2 up to 1
F0303 22:09:53.752335       1 deployer.go:69] timed out waiting for any update progress to be made
*Note*: The bug is not avoided by making timeoutSeconds longer

$ oc get pod hooks-1-5ex89 hooks-1-b87q6 -o yaml | grep MYSQL_PASSWORD -A 1
      - name: MYSQL_PASSWORD
        value: Plqe5Wev
--
      - name: MYSQL_PASSWORD
        value: Plqe5Wev
*Note*: Unexpectedly pod hooks-1-5ex89 born in the 2nd deployment uses pod template of the 1st deployment! (See the above diff)

I made several "*Note*" deserving attention. In sum: bug is reproduced; find new issue (unexpected born pod).
Comment 5 Xingxing Xia 2016-03-03 23:45:22 EST
You can reproduce it and get the logs you need.
Comment 6 Michail Kargakis 2016-03-04 06:29:18 EST
"RollingUpdater: Scaling up hooks-2 from 0 to 2, scaling down hooks-1 from 1 to 0 (keep 2 pods available, don't exceed 2 pods)"

This is the reason why your deployment never proceeeds:
First it scales up the new rc: 0/2 -> 1/2
Old rc is 1/1. Total pods are 2. The deployment process cannot scale up or down anymore. Blocked.

Actually, why did you decide to change from a recreate strategy to a rolling and also bump the replica size at the same time? Is there any specific reason?

maxSurge=1 and maxUnavailable=1 are set in the second deployment (you haven't specified them so they are set by default). Your new desired size is 2, so this should allow you to scale from a range of 1 pod available up to not exceeding 3 pods.

Looking into it...
Comment 7 Michail Kargakis 2016-03-04 06:59:37 EST
Ok, I verified this is due to https://github.com/openshift/origin/pull/7626

https://github.com/kubernetes/kubernetes/pull/22516 fixes it.
Comment 8 Michail Kargakis 2016-03-04 08:25:01 EST
> https://github.com/kubernetes/kubernetes/pull/22516 fixes it.

Disregard this. It fixes deployments where maxSurge is zero. I think this bug stems from the fact that the old desired replicas annotation is used in the rolling update. In "keep 2[a] pods available, don't exceed 2[b] pods":

[a] minAvailable=desired-maxUnavailable => 2=2-0
[b] dontexceed=original+maxSurge => 2=1+1

I believe [b] should be:
dontexceed=desired+maxSurge => 3=2+1 
(as it is estimated in upstream deployments[1])


[1] https://github.com/kubernetes/kubernetes/blob/09313b084704cb6198d034246a7438e8fa198dc7/pkg/util/deployment/deployment.go#L418
Comment 9 Xingxing Xia 2016-03-07 00:33:54 EST
Yeah, "keep 2 pods available, don't exceed 2 pods" is the reason.
But testing against Origin instance (openshift/oc v1.1.3-483-g28cba69), the result is "keep 1 pods available", thus everything works well:
$ oc logs pod/hooks-2-deploy
I0307 05:17:28.149138       1 deployer.go:201] Deploying from xxia-proj/hooks-1 to xxia-proj/hooks-2 (replicas: 2)
I0307 05:17:29.205347       1 rolling.go:228] RollingUpdater: Continuing update with existing controller hooks-2.
I0307 05:17:29.219796       1 rolling.go:228] RollingUpdater: Scaling up hooks-2 from 0 to 2, scaling down hooks-1 from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
I0307 05:17:29.219811       1 rolling.go:228] RollingUpdater: Scaling hooks-2 up to 1
I0307 05:17:33.263548       1 rolling.go:228] RollingUpdater: Scaling hooks-1 down to 0
I0307 05:17:35.296335       1 rolling.go:228] RollingUpdater: Scaling hooks-2 up to 2

In both tests against Origin and OSE, `oc get dc hooks -o yaml > hooks-2.yaml` gets:
      maxSurge: 25%
      maxUnavailable: 25%
Seems Origin treats (replicas) 2 * 25% as 1, while OSE treats 2 * 25% as 0 ?

BTW: There is another issue, "hooks-2-u7duf disappears! But hooks-1-5ex89 comes out!" in comment 4, isn't it also a problem?
Comment 11 Michail Kargakis 2016-03-07 06:51:14 EST
Since https://github.com/openshift/origin/pull/7626 we started rounding down maxUnavailable instead of rounding up so we won't accidentally treat non-hundred maxUnavailability as 100% which was a serious bug. That exposed a couple of other old bugs in the rolling updater, one being the present bug which is fixed in https://github.com/kubernetes/kubernetes/pull/22526 and will soon be merged in Origin (https://github.com/openshift/origin/pull/7828).

The last issue you mention is not an issue but a feature (automatic rollback). If a deployment fails, it is scaled down ("hooks-2-u7duf disappears") and the old succeeded deployment is scaled up ("But hooks-1-5ex89 comes out!").
Comment 12 Xingxing Xia 2016-03-07 23:49:02 EST
The code is not merged in OSE yet. Will verify using OSE when merged.

Michail, in fact, the bug was not reproduced in Origin. It was only reproduced in OSE/AEP. Did you notice this?

Even use the latest Origin version (openshift/oc v1.1.3-500-g115660e
kubernetes v1.2.0-alpha.7-703-gbc4550d) that includes the fix (see PS) https://github.com/openshift/origin/pull/7828, it seems issue still exists. See below:
(This time I modify replicas to 4 in updatev1.json for easy check)
$ oc create -f deployment1.json # and wait for deployment to complete
$ oc replace -f updatev1.json   # replicas is modified to 4
$ oc logs pod/hooks-2-deploy
I0308 03:14:37.391134       1 deployer.go:201] Deploying from xxia-proj/hooks-1 to xxia-proj/hooks-2 (replicas: 4)
I0308 03:14:38.420383       1 rolling.go:228] RollingUpdater: Continuing update with existing controller hooks-2.
I0308 03:14:38.432619       1 rolling.go:228] RollingUpdater: Scaling up hooks-2 from 0 to 4, scaling down hooks-1 from 1 to 0 (keep 3 pods available, don't exceed 2 pods)
I0308 03:14:38.432635       1 rolling.go:228] RollingUpdater: Scaling hooks-2 up to 1
F0308 03:16:43.139926       1 deployer.go:69] timed out waiting for any update progress to be made

When testing with Origin version before the fix (v1.1.3-483-g28cba69, see PS), it also gets "keep 3 pods available, don't exceed 2 pods".
https://github.com/openshift/origin/pull/7828 does not take effect?

PS:
v1.1.3-500-g115660e includes #7828.
$ git log --pretty="%h %cd %cn %s" --date=local 115660e | grep '#7828'
115660e Tue Mar 8 00:53:28 2016 OpenShift Bot Merge pull request #7828 from kargakis/another-rolling-updater-fix

v1.1.3-483-g28cba69 does not includes #7828
$ git log --pretty="%h %cd %cn %s" --date=local 28cba69 | grep '#7828'
$
Comment 13 Michail Kargakis 2016-03-08 04:46:39 EST
Did you rebuilt the deployer image? It doesn't seem that you are running with my code changes....

"keep 3 pods available, don't exceed 2 pods"

You should see "keep 3 pods available, don't exceed 5 pods"
Comment 14 Xingxing Xia 2016-03-08 05:23:20 EST
Oh, was not aware of needing rebuilt deployer image.
Now in Origin, verified with:
openshift/origin-deployer                         latest              5440034cfef3        15 hours ago        522.5 MB

It works well. Thank you!
Will verify OSE when #7828 is merged
Comment 15 Troy Dawson 2016-03-09 15:39:05 EST
Should be in v3.2.0.1 built today.
Comment 16 Xingxing Xia 2016-03-10 02:14:33 EST
Verified with openshift/oc v3.2.0.1, kubernetes v1.2.0-alpha.7-703-gbc4550d, and image: openshift3/ose-deployer   v3.2.0.1  e5070cbcc689   12 hours ago

The bug is fixed, the 2nd deployment can complete successfully.
Comment 18 errata-xmlrpc 2016-05-12 12:31:14 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064

Note You need to log in before you can comment on or make changes to this bug.