Description of problem:
OpenShift Master continues to try to schedule on disabled nodes.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Upgraded masters to 3.2
2. Attempted to upgrade the registry on the new nodes: oc deploy --latest dc/docker-registry

Actual results:
Masters attempt to deploy to the old, disabled nodes rather than the new, active nodes.

Expected results:
The registry is deployed to the active nodes.

Additional info:
oc get events -w

FIRSTSEEN   LASTSEEN   COUNT   NAME   KIND   SUBOBJECT   TYPE   REASON   SOURCE   MESSAGE
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:56 -0400 EDT   1   docker-registry-12-t1p6w   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:56 -0400 EDT   1   docker-registry-12-gemjq   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-112.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:56 -0400 EDT   1   docker-registry-12   ReplicationController   Normal   SuccessfulCreate   {replication-controller }   Created pod: docker-registry-12-gemjq
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:56 -0400 EDT   1   docker-registry-12-nnj5q   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-112.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:56 -0400 EDT   1   docker-registry-12   ReplicationController   Normal   SuccessfulCreate   {replication-controller }   Created pod: docker-registry-12-t1p6w
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:56 -0400 EDT   1   docker-registry-12-j9dxr   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:56 -0400 EDT   1   docker-registry-12   ReplicationController   Normal   SuccessfulCreate   {replication-controller }   Created pod: docker-registry-12-nnj5q
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:56 -0400 EDT   1   docker-registry-12   ReplicationController   Normal   SuccessfulCreate   {replication-controller }   Created pod: docker-registry-12-j9dxr
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:57 -0400 EDT   2   docker-registry-12-t1p6w   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:57 -0400 EDT   2016-05-17 15:17:57 -0400 EDT   1   docker-registry-12-gemjq   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:57 -0400 EDT   2016-05-17 15:17:57 -0400 EDT   1   docker-registry-12-nnj5q   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:57 -0400 EDT   2   docker-registry-12-j9dxr   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:59 -0400 EDT   3   docker-registry-12-t1p6w   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:59 -0400 EDT   2   docker-registry-12-gemjq   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-112.ec2.internal' is not in cache
2016-05-17 15:17:57 -0400 EDT   2016-05-17 15:17:59 -0400 EDT   2   docker-registry-12-nnj5q   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:17:59 -0400 EDT   3   docker-registry-12-j9dxr   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:47 -0400 EDT   2016-05-17 15:17:59 -0400 EDT   4   docker-registry-9-5g6ga   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:59 -0400 EDT   2016-05-17 15:17:59 -0400 EDT   1   docker-registry-9   ReplicationController   Normal   SuccessfulDelete   {replication-controller }   Deleted pod: docker-registry-9-5g6ga
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:18:03 -0400 EDT   4   docker-registry-12-t1p6w   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:57 -0400 EDT   2016-05-17 15:18:03 -0400 EDT   2   docker-registry-12-gemjq   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:57 -0400 EDT   2016-05-17 15:18:03 -0400 EDT   3   docker-registry-12-nnj5q   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:18:03 -0400 EDT   4   docker-registry-12-j9dxr   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 10:44:50 -0400 EDT   2016-05-17 15:18:05 -0400 EDT   672   docker-registry-9-xkbzq   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:18:11 -0400 EDT   2016-05-17 15:18:11 -0400 EDT   1   docker-registry-12-t1p6w   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-112.ec2.internal' is not in cache
2016-05-17 15:17:57 -0400 EDT   2016-05-17 15:18:11 -0400 EDT   3   docker-registry-12-gemjq   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:57 -0400 EDT   2016-05-17 15:18:11 -0400 EDT   4   docker-registry-12-nnj5q   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:18:11 -0400 EDT   2016-05-17 15:18:11 -0400 EDT   1   docker-registry-12-j9dxr   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-112.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:18:27 -0400 EDT   5   docker-registry-12-t1p6w   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:18:27 -0400 EDT   3   docker-registry-12-gemjq   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-112.ec2.internal' is not in cache
2016-05-17 15:17:56 -0400 EDT   2016-05-17 15:18:27 -0400 EDT   5   docker-registry-12-j9dxr   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
2016-05-17 15:17:57 -0400 EDT   2016-05-17 15:18:27 -0400 EDT   5   docker-registry-12-nnj5q   Pod   Warning   FailedScheduling   {default-scheduler }   node 'ip-172-31-9-113.ec2.internal' is not in cache
Related and possible fix: https://github.com/kubernetes/kubernetes/pull/22568
I can't reproduce the issue. How do you upgrade the masters? Are you following https://docs.openshift.com/enterprise/latest/install_config/upgrading/index.html ?
(In reply to Anping Li from comment #2)
> I can't reproduce the issue. How do you upgrade masters? Are you following
> https://docs.openshift.com/enterprise/latest/install_config/upgrading/index.html

Yes, manually. In our case we performed the yum upgrade followed by a reboot, but we were still following: https://docs.openshift.com/enterprise/latest/install_config/upgrading/manual_upgrades.html#upgrading-masters
This is a bit of a corner case, so I am listing the steps to reproduce the issue. I would suggest running these steps on a cluster that does NOT have this bug fix yet, to ensure that they work and successfully reproduce the issue.

1. Start with a multi-node cluster with 2 user/compute nodes
2. Create an RC with replica count 1
3. Create a corresponding service
4. Mark the node that has the pod as Unschedulable
5. Scale the RC to 2

The pod created in step 5 should fail to schedule, because the predicate does a lookup of the node on which the pod belonging to the same service landed. This lookup reads that node's labels in order to determine service affinity. Since the node is Unschedulable and therefore missing from the predicate cache, the lookup fails and the scheduler returns an error.
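To make the failure mode above concrete, here is a minimal Python sketch (not the actual Kubernetes Go source; all names are hypothetical) of a scheduler whose node cache holds only schedulable nodes, and a service-affinity predicate that looks up the node of an existing peer pod. When that peer's node has been marked Unschedulable, the lookup fails exactly as described:

```python
class SchedulerCache:
    """Hypothetical cache that, like the behavior described in this bug,
    only holds schedulable nodes."""
    def __init__(self, nodes):
        # nodes: dict of name -> {"unschedulable": bool, "labels": {...}}
        self.nodes = {name: info for name, info in nodes.items()
                      if not info["unschedulable"]}

    def lookup(self, name):
        if name not in self.nodes:
            # Mirrors the "node '...' is not in cache" scheduler error
            raise LookupError("node '%s' is not in cache" % name)
        return self.nodes[name]


def service_affinity_fits(cache, candidate_node, peer_pod_node, labels):
    """The candidate fits if it matches the peer pod's node on the
    affinity labels. The peer-node lookup is the step that breaks when
    the peer's node was disabled and dropped from the cache."""
    peer = cache.lookup(peer_pod_node)
    cand = cache.lookup(candidate_node)
    return all(cand["labels"].get(l) == peer["labels"].get(l) for l in labels)


nodes = {
    "node-a": {"unschedulable": True,  "labels": {"region": "east"}},  # cordoned
    "node-b": {"unschedulable": False, "labels": {"region": "east"}},
}
cache = SchedulerCache(nodes)

try:
    # The existing service pod sits on node-a, which is now unschedulable.
    service_affinity_fits(cache, "node-b", "node-a", ["region"])
except LookupError as e:
    print(e)  # the lookup of the peer pod's node fails
```

Under this sketch, the fix amounts to letting the predicate resolve the peer pod's node even when that node is unschedulable (i.e. not restricting the lookup to the schedulable-nodes cache).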
Verified on oc v3.2.1.2. Fixed.

2016-06-17 14:46:25 +0800 CST   2016-06-17 14:46:26 +0800 CST   2   database-1-5hb31   Pod   Warning   FailedScheduling   {default-scheduler }   pod (database-1-5hb31) failed to fit in any node; fit failure on node (ip-172-18-1-30.ec2.internal): Region
2016-06-17 14:46:25 +0800 CST   2016-06-17 14:46:28 +0800 CST   3   database-1-5hb31   Pod   Warning   FailedScheduling   {default-scheduler }   pod (database-1-5hb31) failed to fit in any node; fit failure on node (ip-172-18-1-30.ec2.internal): Region
wmeng: Were you able to verify that after the fix, the pod in step 5 was created successfully on the node that was schedulable?
The result depends on the node configuration and the scheduler configuration.

If the scheduler is configured with the argument {"argument": {"serviceAffinity": {"labels": ["region"]}}, "name": "Region"}, replicas is 1, the running pod is on NodeA, and no other schedulable node has the same region label as NodeA: when we mark NodeA unschedulable and scale the RC up, the new pods stay Pending. This is expected.

If there are schedulable nodes with the same region label as NodeA (the node on which the running pod sits), the new pods will be created on them when scaling up.

If the service affinity argument is not configured on the scheduler, the new pods will be created on the other schedulable nodes.
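For reference, the serviceAffinity predicate mentioned above would sit inside a scheduler policy file roughly like the following sketch. The predicate stanza is quoted from the comment above; the surrounding kind/apiVersion envelope is the standard scheduler policy wrapper and is shown only for context:

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {
      "name": "Region",
      "argument": {
        "serviceAffinity": {
          "labels": ["region"]
        }
      }
    }
  ]
}
```

With this predicate in place, all pods of a service are constrained to nodes sharing the same "region" label value, which is why the Pending outcome above is expected when no schedulable node matches the region of the existing pod's node.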
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1383