Description of problem:

Following up https://github.com/openshift/installer/issues/1020

Version-Release number of selected component (if applicable):

$ oc get clusterversions.config.openshift.io
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-10-182914   True        False         124m    Cluster version is 4.0.0-0.nightly-2019-04-10-182914

How reproducible:

Steps to Reproduce:

$ oc get scheduler
NAME      AGE
cluster   13m

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2019-04-12T18:35:04Z"
  generation: 1
  name: cluster
  resourceVersion: "67588"
  selfLink: /apis/config.openshift.io/v1/schedulers/cluster
  uid: b069e611-5d51-11e9-9957-065019240360
spec:
  defaultNodeSelector: aaa=bbb

$ oc new-project aaa

$ oc create -f https://raw.githubusercontent.com/hongkailiu/svt-case-doc/master/files/pod_test.yaml
pod/web created

$ oc get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
web    1/1     Running   0          9s    10.128.2.58   ip-10-0-142-150.us-east-2.compute.internal   <none>           <none>

$ oc get node --show-labels ip-10-0-142-150.us-east-2.compute.internal
NAME                                         STATUS   ROLES    AGE    VERSION             LABELS
ip-10-0-142-150.us-east-2.compute.internal   Ready    worker   130m   v1.12.4+509916ce1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/hostname=ip-10-0-142-150,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1

Actual results:

The test pod is scheduled and Running on a node that does not have the label `aaa=bbb`.

Expected results:

There is no node with label `aaa=bbb`, so the test pod should be `Pending` instead of Running.

Additional info:

More questions:

1. Why does the `metadata.name` have to be `cluster`?

$ oc create -f abc.yaml
The Scheduler "" is invalid: metadata.name: Invalid value: "my-scheduler": must be cluster

2. How to delete the CR?

$ oc delete scheduler cluster
Error from server (Forbidden): schedulers.config.openshift.io "cluster" is forbidden: deleting required schedulers.config.openshift.io resource, named cluster, is not allowed
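For reference, the linked pod_test.yaml resolves to roughly the following minimal manifest. This is a sketch reconstructed from the `oc describe pod web` output later in this bug, not the exact file contents:

apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  containers:
  - name: test-go
    image: quay.io/hongkailiu/test-go:testctl-0.0.6-83ce61e2
    command: ["/testctl"]
    args: ["http", "start", "-v"]
    env:
    - name: GIN_MODE
      value: release
    ports:
    - containerPort: 8080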
https://github.com/openshift/cluster-kube-apiserver-operator/pull/394

> Error from server (Forbidden): schedulers.config.openshift.io "cluster" is forbidden: deleting required schedulers.config.openshift.io resource, named cluster, is not allowed

It's validation code that prohibits deletion.

> why does the `metadata.name` have to be `cluster`?

Because we want the operator to be a singleton, meaning we want only the scheduler CR named `cluster` and nothing else.
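Since the CR itself cannot be deleted, one way to revert its effect is to clear the selector instead. A hedged example using standard `oc patch` usage (not taken from this bug):

$ oc patch scheduler cluster --type=merge -p '{"spec":{"defaultNodeSelector":""}}'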
Add annotation PRs:

https://github.com/openshift/cluster-authentication-operator/pull/117
https://github.com/operator-framework/operator-marketplace/pull/173
https://github.com/openshift/cluster-kube-apiserver-operator/pull/439
https://github.com/openshift/cluster-image-registry-operator/pull/266
https://github.com/openshift/cluster-version-operator/pull/174
https://github.com/openshift/service-ca-operator/pull/50
https://github.com/operator-framework/operator-lifecycle-manager/pull/824
https://github.com/openshift/machine-config-operator/pull/667
https://github.com/openshift/cluster-network-operator/pull/157
https://github.com/openshift/cluster-ingress-operator/pull/217
https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/93
https://github.com/openshift/console-operator/pull/214
https://github.com/openshift/cluster-dns-operator/pull/100
https://github.com/openshift/machine-api-operator/pull/300
https://github.com/openshift/cluster-kube-scheduler-operator/pull/115
https://github.com/openshift/cluster-samples-operator/pull/139
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/237
https://github.com/openshift/cluster-openshift-apiserver-operator/pull/195
https://github.com/openshift/cloud-credential-operator/pull/61
https://github.com/openshift/cluster-etcd-operator/pull/10
https://github.com/openshift/cluster-storage-operator/pull/30
https://github.com/openshift/cluster-machine-approver/pull/20

The PR that enables defaultNodeSelector:

https://github.com/openshift/cluster-kube-apiserver-operator/pull/394
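For context, these PRs add the `release.openshift.io/create-only: "true"` annotation (visible in the retest output below). My understanding is that this tells the cluster-version operator to create the resource if it is missing but never overwrite it afterwards, so user edits such as defaultNodeSelector survive upgrades and reconciliation. As it appears in the CR metadata:

metadata:
  annotations:
    release.openshift.io/create-only: "true"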
Retest 4/30

$ oc get clusterversions.config.openshift.io
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-04-28-064010   True        False         88m     Cluster version is 4.1.0-0.nightly-2019-04-28-064010

$ oc get schedulers.config.openshift.io cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: "2019-04-30T13:09:32Z"
  generation: 2
  name: cluster
  resourceVersion: "39437"
  selfLink: /apis/config.openshift.io/v1/schedulers/cluster
  uid: 31c2c846-6b49-11e9-983d-023db2b3209c
spec:
  defaultNodeSelector: aaa=bbb

$ oc new-project aaa

$ oc create -f https://raw.githubusercontent.com/hongkailiu/svt-case-doc/master/files/pod_test.yaml

$ oc get pods
NAME   READY   STATUS    RESTARTS   AGE
web    1/1     Running   0          35m

$ oc get node $(oc get pods -o wide --no-headers | awk {'print $7'}) --show-labels
NAME                                         STATUS   ROLES    AGE    VERSION             LABELS
ip-10-0-173-225.us-east-2.compute.internal   Ready    worker   116m   v1.13.4+27a00af64   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/hostname=ip-10-0-173-225,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1

No node with the aaa=bbb label. Pod status: Running.

Retest FAILED!
Simon, I made the same mistake a few minutes before :)

Edit scheduler cluster and change "defaultNodeSelector: aaa=bbb" to "DefaultNodeSelector: aaa=bbb" -- the D of default in caps.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-04-28-064010   True        False         11h     Cluster version is 4.1.0-0.nightly-2019-04-28-064010

$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: "2019-04-30T03:49:30Z"
  generation: 3
  name: cluster
  resourceVersion: "223489"
  selfLink: /apis/config.openshift.io/v1/schedulers/cluster
  uid: f57d14ab-6afa-11e9-b805-0673e8707602
spec:
  DefaultNodeSelector: aaa=bbb

$ oc describe pod web
Name:               web
Namespace:          sunilc
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app=web
Annotations:        openshift.io/scc: anyuid
Status:             Pending
IP:
Containers:
  test-go:
    Image:      quay.io/hongkailiu/test-go:testctl-0.0.6-83ce61e2
    Port:       8080/TCP
    Host Port:  0/TCP
    Command:
      /testctl
    Args:
      http
      start
      -v
    Environment:
      GIN_MODE:  release
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mw988 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-mw988:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-mw988
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  aaa=bbb
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  5s (x5 over 93s)  default-scheduler  0/5 nodes are available: 5 node(s) didn't match node selector.
Retest 2

$ oc get clusterversions.config.openshift.io
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-04-28-064010   True        False         130m    Cluster version is 4.1.0-0.nightly-2019-04-28-064010

Using the key name with a capital first letter:

DefaultNodeSelector: aaa=bbb

$ oc get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE     IP       NODE     NOMINATED NODE   READINESS GATES
web    0/1     Pending   0          2m29s   <none>   <none>   <none>           <none>

$ oc describe pod web
Name:               web
Namespace:          aaa
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app=web
Annotations:        openshift.io/scc: anyuid
Status:             Pending
IP:
Containers:
  test-go:
    Image:      quay.io/hongkailiu/test-go:testctl-0.0.6-83ce61e2
    Port:       8080/TCP
    Host Port:  0/TCP
    Command:
      /testctl
    Args:
      http
      start
      -v
    Environment:
      GIN_MODE:  release
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rwtgm (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-rwtgm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rwtgm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  aaa=bbb
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  22s (x8 over 2m49s)  default-scheduler  0/6 nodes are available: 6 node(s) didn't match node selector.

Retest PASS!
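For completeness, a quick way to confirm that no node carries the selector label. This is standard `oc` label-selector usage, not a command from the original comment, and the output shown is what one would expect on this cluster:

$ oc get nodes -l aaa=bbb
No resources found.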
This looks odd then:

https://github.com/openshift/api/blob/master/config/v1/types_scheduling.go#L50

DefaultNodeSelector string `json:"defaultNodeSelector,omitempty"`

Probably I am looking at the wrong source file.
Shouldn't this be `defaultNodeSelector`? Can you provide me with access to the cluster? I specifically need the kube-apiserver logs.
Created attachment 1574322 [details]
api-server log
Sunil, be advised: after changing the defaultNodeSelector in the cluster Scheduler config resource, the kube-apiserver pods must redeploy, which can take several minutes. Until the kube-apiserver pods redeploy, the defaultNodeSelector will have no effect.
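A hedged way to watch for that rollout; the resource and namespace names are assumed from a standard 4.x install, not taken from this bug:

# Watch the kube-apiserver operator report Progressing while the new revision rolls out:
$ oc get clusteroperator kube-apiserver

# Watch the static kube-apiserver pods restart one master at a time:
$ oc get pods -n openshift-kube-apiserver -w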
Thanks ravig & Seth for clarification. After applying the change, I was monitoring kube-scheduler pods for restart which is not the case as ravig advised. After waiting for several minutes, I do see all 3 kube-apiserver pods restarted after which new pods have defaultNodeSelector label. Also I see `DefaultNodeSelector` does not work which I assumed earlier, only `defaultNodeSelector` works.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758