Bug 1678164 - Not all replicated image-registry pods obey the nodeSelector rule when nodeSelector and replicas are changed at the same time
Summary: Not all replicated image-registry pods obey the nodeSelector rule when nodeSelector and replicas are changed at the same time
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.4.0
Assignee: Tomáš Nožička
QA Contact: Wenjing Zheng
Depends On:
Reported: 2019-02-18 08:55 UTC by Wenjing Zheng
Modified: 2020-05-04 11:13 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2020-05-04 11:12:48 UTC
Target Upstream Version:


System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:13:13 UTC

Description Wenjing Zheng 2019-02-18 08:55:48 UTC
Description of problem:
If nodeSelector and replicas are changed at the same time, only one of the replicated image-registry pods obeys the nodeSelector rule:
$ oc get pods -n openshift-image-registry -o wide
NAME                                               READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE
cluster-image-registry-operator-54ff44b885-dk6j9   1/1     Running   0          3h21m   ip-10-0-12-181.us-east-2.compute.internal    <none>
image-registry-5dd88dd48b-kjvqj                    0/1     Pending   0          9m53s   <none>        <none>                                       <none>
image-registry-78b4d6b48f-4pxcz                    1/1     Running   0          9m53s   ip-10-0-12-181.us-east-2.compute.internal    <none>
image-registry-78b4d6b48f-znbnb                    1/1     Running   0          3h21m   ip-10-0-24-186.us-east-2.compute.internal    <none>

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. oc edit configs.imageregistry.operator.openshift.io
  nodeSelector:
    node-role.kubernetes.io/master: abc
  proxy: {}
  replicas: 2

2. oc get pods -n openshift-image-registry
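
(For reference, a one-shot equivalent of steps 1-2, as a hedged sketch; the field paths spec.nodeSelector and spec.replicas are assumed from the config fragment in step 1:)

$ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge \
    -p '{"spec":{"nodeSelector":{"node-role.kubernetes.io/master":"abc"},"replicas":2}}'
$ oc get pods -n openshift-image-registry -o wide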

Actual results:
Only one pod obeys the nodeSelector rule.

Expected results:
All replicated pods should obey it.

Additional info:

Comment 1 Ben Parees 2019-02-18 14:54:22 UTC
I think this is expected. The rollout of the changes cannot continue until the first new pod goes from Pending to Available, and your pod is Pending because no node matches the nodeSelector.

You can confirm that the right settings were applied by looking at the image-registry Deployment object. Assuming you see the correct replica count and nodeSelector value there, this is working as expected. To fully test it, you'll need to set the nodeSelector to match a node that pods can actually be scheduled to.
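
For example, to check the applied settings (a hedged sketch; the field paths are the standard Deployment spec paths):

$ oc get deployment image-registry -n openshift-image-registry \
    -o jsonpath='{.spec.replicas}{"\n"}{.spec.template.spec.nodeSelector}{"\n"}'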

Comment 2 Wenjing Zheng 2019-02-19 03:46:41 UTC
I thought the first newly created pod should also be Pending for the same reason as the pending pod, since they are created at the same time, after the invalid nodeSelector is applied.

Comment 3 Ben Parees 2019-02-19 05:04:03 UTC
I think this is a bug in deployments (or at least deployment behavior that needs to be explained).

I'm able to see similar behavior by simply:

1) oc create deployment --image=someimage mydeployment  (results in 1 running pod, as expected)
2) oc edit deployment mydeployment
  - set replicas: 2
  - set the nodeSelector as you did (i.e. set it to something that can't be scheduled)
3) Result: I see 3 pods: the original pod from the deployment (running) and 2 new pods, one with the nodeSelector and the other without it.

My theory is that the deployment controller first scales the deployment up to 2 replicas (hence the second running pod), then starts rolling out the nodeSelector change, at which point it gets stuck because the first new pod with the nodeSelector is stuck in Pending. If the rollout had proceeded, you'd have seen the oldest pod and the new running pod both removed and replaced with pods that have the nodeSelector set. But I don't know for sure.
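
(A hedged one-command version of steps 1 and 2 above; names and images are the illustrative ones from this comment. Changing both fields in a single request makes the "at the same time" condition explicit:)

$ oc create deployment --image=someimage mydeployment
$ oc patch deployment mydeployment -p \
    '{"spec":{"replicas":2,"template":{"spec":{"nodeSelector":{"node-role.kubernetes.io/master":"abc"}}}}}'
$ oc get pods   # 3 pods: 2 running without the selector, 1 Pending with it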

Assigning to master team to confirm/explain this behavior.

Master team, if you can confirm this is working as expected/designed, please set it back to ON_QA so the QE team can update their test case expectations/procedure.

Comment 4 Tomáš Nožička 2019-03-07 11:54:28 UTC
Ben is right. This seems like a cosmetic bug in the upstream Deployment controller: the scaling check should also verify that the pod template didn't change. https://github.com/kubernetes/kubernetes/blob/954996e231074dc7429f7be1256a579bedd8344c/pkg/controller/deployment/deployment_controller.go#L632-L638
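
(The scale-first behavior is also visible from the ReplicaSets rather than the pods; a hedged sketch against the deployment from comment 3:)

$ oc get rs -l app=mydeployment
    # per the theory above: the old ReplicaSet scaled up to 2 (no selector),
    # plus a new ReplicaSet carrying the nodeSelector with 0 ready
$ oc rollout status deployment/mydeployment --timeout=30s
    # expected to time out while the new pod stays Pending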

Comment 7 Tomáš Nožička 2020-01-30 10:17:53 UTC
Looking at this again, I now think scaling first is the correct approach.

It feels better to first create the pods that are easy, or scale the old ones down, before starting the update on the rest. Consider the case where you switch from, say, 2 replicas to 1 with the Recreate strategy and don't have leader election: by applying the config, your expectation is that the old pods are scaled down to 1 first, and only then is the remaining pod recreated.
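
(A hedged sketch of that case; the deployment and image names are hypothetical, and the deployment is assumed to already use strategy: Recreate with 2 replicas:)

$ oc patch deployment myapp -p \
    '{"spec":{"replicas":1,"template":{"spec":{"containers":[{"name":"myapp","image":"myapp:v2"}]}}}}'
    # scaling first: the controller deletes one old pod (2 -> 1) before
    # Recreate tears down the last old pod and starts the new one, so at
    # most one pod runs at any time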

Comment 8 Wenjing Zheng 2020-02-07 10:06:08 UTC
I can see that all scaled-up pods obey the nodeSelector rule with 4.4.0-0.nightly-2020-02-06-230833 now:
$ oc get pods
NAME                                              READY   STATUS    RESTARTS   AGE
cluster-image-registry-operator-df4ccc5c9-jzhnj   2/2     Running   0          7h39m
image-registry-94bbd669f-5zcvl                    0/1     Pending   0          5m1s
image-registry-94bbd669f-7w95z                    0/1     Pending   0          5m1s

Comment 9 Wenjing Zheng 2020-02-13 02:07:47 UTC
Verified on 4.4.0-0.nightly-2020-02-12-191550.

Comment 11 errata-xmlrpc 2020-05-04 11:12:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHBA-2020:0581), and where to find the
updated files, follow the link below:

https://access.redhat.com/errata/RHBA-2020:0581

If the solution does not work for you, open a new bug report.

