Bug 1307004 - Rolling strategy scaling down pods before new pods pass ready check
Summary: Rolling strategy scaling down pods before new pods pass ready check
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-controller-manager
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Michail Kargakis
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks: 1267746
 
Reported: 2016-02-12 12:27 UTC by Alexander Koksharov
Modified: 2019-11-14 07:26 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The rolling updater was not ignoring pods marked for deletion and was counting them as ready. With this fix, terminating pods no longer count toward the ready minimum during a rolling deployment.
Clone Of:
Environment:
Last Closed: 2017-04-12 19:04:52 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2017:0884 (SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.5 RPM Release Advisory, last updated 2017-04-12 22:50:07 UTC

Description Alexander Koksharov 2016-02-12 12:27:02 UTC
Description of problem:
Rolling update does not behave as described in the documentation.

1. It was discovered that the system does not wait for pods to terminate; as a result, the update finishes with fewer active pods than required.
When pods take a long time to terminate, the rolling update scenario finished like this:

NAME           READY     STATUS        RESTARTS   AGE
pret-1-build   0/1       Completed     0          19m
pret-1-nwz64   1/1       Terminating   0          18m
pret-1-oeakn   1/1       Terminating   0          17m
pret-1-ov9j0   1/1       Terminating   0          19m
pret-1-wauee   1/1       Terminating   0          18m
pret-2-0hukg   0/1       Running       0          27s
pret-2-2qk9o   0/1       Running       0          31s
pret-2-build   0/1       Completed     0          1m
pret-2-ie41n   1/1       Running       0          46s
pret-2-j6gxm   1/1       Running       0          21s

In this example only two pods are in the Running and Ready state.

2. Pods that were requested to terminate are still marked as Ready, whereas [0] says that a pod is removed from the endpoints list when it is shown as Terminating.
[0] https://github.com/kubernetes/kubernetes/blob/release-1.1/docs/user-guide/pods.md#termination-of-pods
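
For illustration, a minimal Go sketch of the mismatch described above, assuming the current k8s.io/api and k8s.io/apimachinery import paths (which postdate the 3.1-era code): a pod can report a True PodReady condition while already carrying a deletionTimestamp, i.e. be Ready and Terminating at the same time.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isReady reports whether the pod's PodReady condition is True.
func isReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// isTerminating reports whether the pod has been marked for deletion.
func isTerminating(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil
}

func main() {
	// A pod that has been asked to terminate but whose readiness probe
	// still passes: Ready and Terminating at the same time.
	now := metav1.NewTime(time.Now())
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "pret-1-nwz64", DeletionTimestamp: &now},
		Status: corev1.PodStatus{
			Conditions: []corev1.PodCondition{
				{Type: corev1.PodReady, Status: corev1.ConditionTrue},
			},
		},
	}
	fmt.Printf("%s ready=%t terminating=%t\n", pod.Name, isReady(pod), isTerminating(pod))
	// Output: pret-1-nwz64 ready=true terminating=true
}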



Comment 1 Michail Kargakis 2016-02-15 10:22:25 UTC
Probably related to https://bugzilla.redhat.com/show_bug.cgi?id=1281286

Comment 2 Dan Mace 2016-02-15 16:27:53 UTC
Agree with Michail that it's related. The deployer has to trust the ready state of pods reported by the Kubelet, and if Terminating pods are considered "Ready", then the deployer has no real choice but to consider them ready. If Terminating pods shouldn't be ready, the issue is with the Kubelet (which is what sets the ready state for the pods). We need to follow up with Kubernetes to sort out the relationship between readiness and Terminating.

I wouldn't say this is a bug with deployments directly, but deployments are certainly affected in a way that seems surprising, so let's leave this bug open for now.

Comment 3 Dan Mace 2016-02-15 19:52:20 UTC
Spoke with Clayton, and decided that the updater should not count terminating pods towards the minimum even though they're "Ready". Ignoring them is also consistent with how RCs handle pods. This will require an upstream fix. Since the issue has existed since the introduction of the rolling updater, I'm marking the issue UpcomingRelease.
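
For illustration, a hypothetical Go sketch of the policy described here, not the actual upstream patch: a pod counts toward the rolling update's ready minimum only if its PodReady condition is True and it carries no deletionTimestamp, mirroring how RCs treat pods (current k8s.io/api types assumed).

package rolling

import corev1 "k8s.io/api/core/v1"

// countAvailable returns the number of pods that should count toward the
// rolling update's ready minimum: Ready pods that are not marked for deletion.
func countAvailable(pods []corev1.Pod) int {
	available := 0
	for i := range pods {
		pod := &pods[i]
		// Terminating pods no longer count, even if the kubelet still
		// reports them as Ready.
		if pod.DeletionTimestamp != nil {
			continue
		}
		for _, c := range pod.Status.Conditions {
			if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
				available++
				break
			}
		}
	}
	return available
}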

Comment 7 Michail Kargakis 2016-06-10 09:29:04 UTC
Can you post the output of the following command?

oc get dc/pret -o yaml

Also note that there aren't only 2 pods running and in a ready state: you have 4 more pods that may be in a Terminating state and no longer behind the service, but they still serve live connections.

Comment 8 Michal Fojtik 2016-07-20 11:06:03 UTC
Alexander bump.

Comment 9 Michal Fojtik 2016-08-01 11:35:29 UTC
Bump #2 :-)

Comment 13 Michail Kargakis 2016-08-24 19:13:48 UTC
Still not fixed. I will get to this once 1.3 is out.

Comment 14 Michail Kargakis 2016-09-16 15:26:35 UTC
Michal, can you take a look at this and move the discussion upstream? If we end up doing it, we will need to do it for both the rolling updater and Deployments.

Comment 17 Michail Kargakis 2016-12-22 13:33:20 UTC
I have a fix for this upstream: https://github.com/kubernetes/kubernetes/pull/39150

Comment 18 Michail Kargakis 2017-02-01 18:00:29 UTC
Needs testing

Comment 19 zhou ying 2017-02-04 03:29:58 UTC
Hi Michail Kargakis:
   When I tested with the rolling strategy, replicas was 4 with maxSurge: 25% and maxUnavailable: 25%. During the deployment at least 3 pods were kept available and there were never more than 5 pods, but those 5 pods include the pods marked for deletion, like this:

[root@zhouy ~]# oc get po 
NAME               READY     STATUS              RESTARTS   AGE
ruby-ex-5-fb443    1/1       Terminating         0          33s
ruby-ex-5-lkv19    1/1       Terminating         0          16s
ruby-ex-5-sb61z    1/1       Running             0          34s
ruby-ex-6-czdmp    0/1       ContainerCreating   0          <invalid>
ruby-ex-6-deploy   1/1       Running             0          <invalid>
ruby-ex-6-fkjjr    1/1       Running             0          <invalid>
ruby-ex-6-mv810    0/1       ContainerCreating   0          <invalid>
ruby-ex-6-r3c8v    1/1       Running             0          <invalid>

So, is this behavior correct for the fix described as "the rolling updater wasn't ignoring pods marked for deletion and was counting them as ready"?

Comment 20 Michail Kargakis 2017-02-04 12:09:31 UTC
You have three pods running, which is the minimum allowed by the deployment, so that is fine. Did you observe fewer than 3 pods running at any point in time for the deployment?
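
For illustration, a small worked example of the bounds discussed in comments 19 and 20, assuming the rounding behavior of Kubernetes Deployments (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down); the snippet is illustrative, not the deployer's code.

package main

import (
	"fmt"
	"math"
)

func main() {
	replicas := 4.0
	surge := 0.25       // maxSurge: 25%
	unavailable := 0.25 // maxUnavailable: 25%

	maxSurge := int(math.Ceil(replicas * surge))              // 25% of 4 rounds up to 1
	maxUnavailable := int(math.Floor(replicas * unavailable)) // 25% of 4 rounds down to 1

	minAvailable := int(replicas) - maxUnavailable // 4 - 1 = 3 pods must stay ready
	maxTotal := int(replicas) + maxSurge           // 4 + 1 = 5 pods may exist at once

	fmt.Printf("minAvailable=%d maxTotal=%d\n", minAvailable, maxTotal) // minAvailable=3 maxTotal=5
}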

Comment 21 zhou ying 2017-02-06 01:20:43 UTC
No, there were never fewer than 3 pods running for the deployment.

Comment 22 zhou ying 2017-02-07 06:15:58 UTC
Can't reproduce this issue with the latest OCP 3.5:
openshift version
openshift v3.5.0.17+c55cf2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

[root@zhouy testjson]# oc get po 
NAME                  READY     STATUS              RESTARTS   AGE
database-1-7gld8      1/1       Running             0          12m
frontend-3-2ghjv      1/1       Terminating         0          12s
frontend-3-7c7j3      1/1       Running             0          16s
frontend-4-0lnbj      1/1       Running             0          <invalid>
frontend-4-deploy     1/1       Running             0          <invalid>
frontend-4-gbjd9      1/1       Running             0          <invalid>
frontend-4-hook-pre   0/1       Completed           0          <invalid>
frontend-4-m2jk4      1/1       Running             0          <invalid>
frontend-4-n47ng      0/1       ContainerCreating   0          <invalid>

Comment 24 errata-xmlrpc 2017-04-12 19:04:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

