Bug 1307004 - Rolling strategy scaling down pods before new pods pass ready check
Summary: Rolling strategy scaling down pods before new pods pass ready check
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-controller-manager
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Michail Kargakis
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks: 1267746
 
Reported: 2016-02-12 12:27 UTC by Alexander Koksharov
Modified: 2019-11-14 07:26 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The rolling updater was not ignoring pods marked for deletion and was counting them as ready. With this fix, terminating pods no longer count toward the ready minimum during a rolling deployment.
Clone Of:
Environment:
Last Closed: 2017-04-12 19:04:52 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2017:0884 (SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.5 RPM Release Advisory, last updated 2017-04-12 22:50:07 UTC

Description Alexander Koksharov 2016-02-12 12:27:02 UTC
Description of problem:
Rolling update does not behave as described in the documentation.

1. It was discovered that the system does not wait for pods to terminate; as a result, the update finishes with fewer active pods than required.
When pods take a long time to terminate, the rolling update scenario finished like this:

NAME           READY     STATUS        RESTARTS   AGE
pret-1-build   0/1       Completed     0          19m
pret-1-nwz64   1/1       Terminating   0          18m
pret-1-oeakn   1/1       Terminating   0          17m
pret-1-ov9j0   1/1       Terminating   0          19m
pret-1-wauee   1/1       Terminating   0          18m
pret-2-0hukg   0/1       Running       0          27s
pret-2-2qk9o   0/1       Running       0          31s
pret-2-build   0/1       Completed     0          1m
pret-2-ie41n   1/1       Running       0          46s
pret-2-j6gxm   1/1       Running       0          21s

In this example only two pods are in the Running and Ready state.

2. Pods that were requested to terminate are still marked as Ready, whereas [0] says that a pod is removed from the endpoints list when it is shown as Terminating.
[0] https://github.com/kubernetes/kubernetes/blob/release-1.1/docs/user-guide/pods.md#termination-of-pods
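
For illustration, a minimal Go sketch of the mismatch described above, assuming the current k8s.io/api and k8s.io/apimachinery import paths (which postdate the 3.1-era code): a pod can report a True PodReady condition while already carrying a deletionTimestamp, i.e. be Ready and Terminating at the same time.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isReady reports whether the pod's PodReady condition is True.
func isReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// isTerminating reports whether the pod has been marked for deletion.
func isTerminating(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil
}

func main() {
	// A pod that has been asked to terminate but whose readiness probe
	// still passes: Ready and Terminating at the same time.
	now := metav1.NewTime(time.Now())
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "pret-1-nwz64", DeletionTimestamp: &now},
		Status: corev1.PodStatus{
			Conditions: []corev1.PodCondition{
				{Type: corev1.PodReady, Status: corev1.ConditionTrue},
			},
		},
	}
	fmt.Printf("%s ready=%t terminating=%t\n", pod.Name, isReady(pod), isTerminating(pod))
	// Output: pret-1-nwz64 ready=true terminating=true
}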



Comment 1 Michail Kargakis 2016-02-15 10:22:25 UTC
Probably related to https://bugzilla.redhat.com/show_bug.cgi?id=1281286

Comment 2 Dan Mace 2016-02-15 16:27:53 UTC
Agree with Michail that it's related. The deployer has to trust the ready state of pods reported by the Kubelet, and if Terminating pods are considered "Ready", then the deployer has no real choice but to consider them ready. If Terminating pods shouldn't be ready, the issue is with the Kubelet (which is what sets the ready state for the pods). We need to follow up with Kubernetes to sort out the relationship between readiness and Terminating.

I wouldn't say this is a bug with deployments directly, but deployments are certainly affected in a way that seems surprising, so let's leave this bug open for now.

Comment 3 Dan Mace 2016-02-15 19:52:20 UTC
Spoke with Clayton, and decided that the updater should not count terminating pods towards the minimum even though they're "Ready". Ignoring them is also consistent with how RCs handle pods. This will require an upstream fix. Since the issue has existed since the introduction of the rolling updater, I'm marking the issue UpcomingRelease.
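
For illustration, a hypothetical Go sketch of the policy described here, not the actual upstream patch: a pod counts toward the rolling update's ready minimum only if its PodReady condition is True and it carries no deletionTimestamp, mirroring how RCs treat pods (current k8s.io/api types assumed).

package rolling

import corev1 "k8s.io/api/core/v1"

// countAvailable returns the number of pods that should count toward the
// rolling update's ready minimum: Ready pods that are not marked for deletion.
func countAvailable(pods []corev1.Pod) int {
	available := 0
	for i := range pods {
		pod := &pods[i]
		// Terminating pods no longer count, even if the kubelet still
		// reports them as Ready.
		if pod.DeletionTimestamp != nil {
			continue
		}
		for _, c := range pod.Status.Conditions {
			if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
				available++
				break
			}
		}
	}
	return available
}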

Comment 7 Michail Kargakis 2016-06-10 09:29:04 UTC
Can you post the output of the following command?

oc get dc/pret -o yaml

Also note that there aren't only 2 pods running and in a ready state: you have 4 more pods that may be in a Terminating state and no longer behind the service, but they still serve live connections.

Comment 8 Michal Fojtik 2016-07-20 11:06:03 UTC
Alexander bump.

Comment 9 Michal Fojtik 2016-08-01 11:35:29 UTC
Bump #2 :-)

Comment 13 Michail Kargakis 2016-08-24 19:13:48 UTC
Still not fixed. I will get to this once 1.3 is out.

Comment 14 Michail Kargakis 2016-09-16 15:26:35 UTC
Michal, can you take a look at this and move the discussion upstream? If we end up doing it, we will need to do it for both the rolling updater and Deployments.

Comment 17 Michail Kargakis 2016-12-22 13:33:20 UTC
I have a fix for this upstream: https://github.com/kubernetes/kubernetes/pull/39150

Comment 18 Michail Kargakis 2017-02-01 18:00:29 UTC
Needs testing

Comment 19 zhou ying 2017-02-04 03:29:58 UTC
Hi Michail Kargakis:
   When I tested with the rolling strategy, replicas was 4 with maxSurge: 25% and maxUnavailable: 25%. During the deployment at least 3 pods were kept available and there were never more than 5 pods, but those 5 pods include the pods marked for deletion, like this:

[root@zhouy ~]# oc get po 
NAME               READY     STATUS              RESTARTS   AGE
ruby-ex-5-fb443    1/1       Terminating         0          33s
ruby-ex-5-lkv19    1/1       Terminating         0          16s
ruby-ex-5-sb61z    1/1       Running             0          34s
ruby-ex-6-czdmp    0/1       ContainerCreating   0          <invalid>
ruby-ex-6-deploy   1/1       Running             0          <invalid>
ruby-ex-6-fkjjr    1/1       Running             0          <invalid>
ruby-ex-6-mv810    0/1       ContainerCreating   0          <invalid>
ruby-ex-6-r3c8v    1/1       Running             0          <invalid>

So, is this behavior correct for the fix described as "the rolling updater wasn't ignoring pods marked for deletion and was counting them as ready"?

Comment 20 Michail Kargakis 2017-02-04 12:09:31 UTC
You have three pods running, which is the minimum allowed by the deployment, so that is fine. Did you observe fewer than 3 pods running at any point in time for the deployment?
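
For illustration, a small worked example of the bounds discussed in comments 19 and 20, assuming the rounding behavior of Kubernetes Deployments (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down); the snippet is illustrative, not the deployer's code.

package main

import (
	"fmt"
	"math"
)

func main() {
	replicas := 4.0
	surge := 0.25       // maxSurge: 25%
	unavailable := 0.25 // maxUnavailable: 25%

	maxSurge := int(math.Ceil(replicas * surge))              // 25% of 4 rounds up to 1
	maxUnavailable := int(math.Floor(replicas * unavailable)) // 25% of 4 rounds down to 1

	minAvailable := int(replicas) - maxUnavailable // 4 - 1 = 3 pods must stay ready
	maxTotal := int(replicas) + maxSurge           // 4 + 1 = 5 pods may exist at once

	fmt.Printf("minAvailable=%d maxTotal=%d\n", minAvailable, maxTotal) // minAvailable=3 maxTotal=5
}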

Comment 21 zhou ying 2017-02-06 01:20:43 UTC
No, there were never fewer than 3 pods running for the deployment.

Comment 22 zhou ying 2017-02-07 06:15:58 UTC
Can't reproduce this issue with the latest OCP 3.5:
openshift version
openshift v3.5.0.17+c55cf2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

[root@zhouy testjson]# oc get po 
NAME                  READY     STATUS              RESTARTS   AGE
database-1-7gld8      1/1       Running             0          12m
frontend-3-2ghjv      1/1       Terminating         0          12s
frontend-3-7c7j3      1/1       Running             0          16s
frontend-4-0lnbj      1/1       Running             0          <invalid>
frontend-4-deploy     1/1       Running             0          <invalid>
frontend-4-gbjd9      1/1       Running             0          <invalid>
frontend-4-hook-pre   0/1       Completed           0          <invalid>
frontend-4-m2jk4      1/1       Running             0          <invalid>
frontend-4-n47ng      0/1       ContainerCreating   0          <invalid>

Comment 24 errata-xmlrpc 2017-04-12 19:04:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

