Bug 1699460
| Summary: | defaultNodeSelector does not work in CRD Scheduler | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Hongkai Liu <hongkliu> |
| Component: | Node | Assignee: | ravig <rgudimet> |
| Status: | CLOSED ERRATA | QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | medium | Priority: | unspecified |
| Version: | 4.1.0 | Target Release: | 4.1.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | If docs needed, set a value | Type: | Bug |
| Last Closed: | 2019-06-04 10:47:31 UTC | CC: | aos-bugs, gblomqui, hongkliu, jokerman, mmccomas, schoudha, sjenning, skordas |
| Attachments: | api-server log (attachment 1574322) | | |
Description
Hongkai Liu 2019-04-12 19:04:02 UTC
https://github.com/openshift/cluster-kube-apiserver-operator/pull/394

> Error from server (Forbidden): schedulers.config.openshift.io "cluster" is forbidden: deleting required schedulers.config.openshift.io resource, named cluster, is not allowed

It is validation code that prohibits the deletion.

> why does the `metadata.name` have to be `cluster`?

Because we want the operator configuration to be a singleton, meaning we want only the one scheduler CR named `cluster` and nothing else.

Add-annotation PRs:

- https://github.com/openshift/cluster-authentication-operator/pull/117
- https://github.com/operator-framework/operator-marketplace/pull/173
- https://github.com/openshift/cluster-kube-apiserver-operator/pull/439
- https://github.com/openshift/cluster-image-registry-operator/pull/266
- https://github.com/openshift/cluster-version-operator/pull/174
- https://github.com/openshift/service-ca-operator/pull/50
- https://github.com/operator-framework/operator-lifecycle-manager/pull/824
- https://github.com/openshift/machine-config-operator/pull/667
- https://github.com/openshift/cluster-network-operator/pull/157
- https://github.com/openshift/cluster-ingress-operator/pull/217
- https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/93
- https://github.com/openshift/console-operator/pull/214
- https://github.com/openshift/cluster-dns-operator/pull/100
- https://github.com/openshift/machine-api-operator/pull/300
- https://github.com/openshift/cluster-kube-scheduler-operator/pull/115
- https://github.com/openshift/cluster-samples-operator/pull/139
- https://github.com/openshift/cluster-kube-controller-manager-operator/pull/237
- https://github.com/openshift/cluster-openshift-apiserver-operator/pull/195
- https://github.com/openshift/cloud-credential-operator/pull/61
- https://github.com/openshift/cluster-etcd-operator/pull/10
- https://github.com/openshift/cluster-storage-operator/pull/30
- https://github.com/openshift/cluster-machine-approver/pull/20

The PR that enables defaultNodeSelector:
https://github.com/openshift/cluster-kube-apiserver-operator/pull/394
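For illustration, a minimal sketch of the kind of check described above: a validation that refuses to delete the required singleton resource. The names here are hypothetical, not the actual cluster-kube-apiserver-operator code:

```go
package validation

import "fmt"

// requiredSingletonName is the only allowed name for the scheduler
// config resource; the operator treats it as a required singleton.
const requiredSingletonName = "cluster"

// validateSchedulerDelete rejects deletion of the required "cluster"
// resource, producing the kind of Forbidden message quoted above.
// Hypothetical helper, not the real operator code.
func validateSchedulerDelete(name string) error {
	if name == requiredSingletonName {
		return fmt.Errorf(
			"deleting required schedulers.config.openshift.io resource, named %s, is not allowed",
			name)
	}
	return nil
}
```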
Retest 4/30

$ oc get clusterversions.config.openshift.io
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.0-0.nightly-2019-04-28-064010 True False 88m Cluster version is 4.1.0-0.nightly-2019-04-28-064010
$ oc get schedulers.config.openshift.io cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
annotations:
release.openshift.io/create-only: "true"
creationTimestamp: "2019-04-30T13:09:32Z"
generation: 2
name: cluster
resourceVersion: "39437"
selfLink: /apis/config.openshift.io/v1/schedulers/cluster
uid: 31c2c846-6b49-11e9-983d-023db2b3209c
spec:
defaultNodeSelector: aaa=bbb
$ oc new-project aaa
$ oc create -f https://raw.githubusercontent.com/hongkailiu/svt-case-doc/master/files/pod_test.yaml
$ oc get pods
NAME READY STATUS RESTARTS AGE
web 1/1 Running 0 35m
$ oc get node $(oc get pods -o wide --no-headers | awk '{print $7}') --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ip-10-0-173-225.us-east-2.compute.internal Ready worker 116m v1.13.4+27a00af64 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/hostname=ip-10-0-173-225,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhcos,node.openshift.io/os_version=4.1
No node has the aaa=bbb label.
Pod status: Running
Retest FAILED!
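In hindsight (see the later comments: the kube-apiserver pods must redeploy before a defaultNodeSelector change takes effect), a way to wait for that rollout before re-testing. This is a suggested check, not a step from the original report; the namespace is the standard one on an OCP 4 install:

```console
# Watch the kube-apiserver static pods restart onto the new revision;
# the defaultNodeSelector only takes effect after they redeploy.
$ oc get pods -n openshift-kube-apiserver --watch
```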
Simon, I made the same mistake a few minutes before :)

Edit the scheduler `cluster` resource and change "defaultNodeSelector: aaa=bbb" to "DefaultNodeSelector: aaa=bbb", i.e., capitalize the D in default.
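For reference, a sketch of applying such a spec edit with standard `oc` commands (aaa=bbb is the test label from this bug; note that the later comments conclude only the lowercase `defaultNodeSelector` key actually works):

```console
# Interactive edit of the cluster-scoped scheduler config
$ oc edit schedulers.config.openshift.io cluster

# Or a non-interactive merge patch against the spec
$ oc patch schedulers.config.openshift.io cluster --type=merge \
    -p '{"spec":{"defaultNodeSelector":"aaa=bbb"}}'
```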
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.0-0.nightly-2019-04-28-064010 True False 11h Cluster version is 4.1.0-0.nightly-2019-04-28-064010
$ oc get scheduler cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
annotations:
release.openshift.io/create-only: "true"
creationTimestamp: "2019-04-30T03:49:30Z"
generation: 3
name: cluster
resourceVersion: "223489"
selfLink: /apis/config.openshift.io/v1/schedulers/cluster
uid: f57d14ab-6afa-11e9-b805-0673e8707602
spec:
DefaultNodeSelector: aaa=bbb
$ oc describe pod web
Name: web
Namespace: sunilc
Priority: 0
PriorityClassName: <none>
Node: <none>
Labels: app=web
Annotations: openshift.io/scc: anyuid
Status: Pending
IP:
Containers:
test-go:
Image: quay.io/hongkailiu/test-go:testctl-0.0.6-83ce61e2
Port: 8080/TCP
Host Port: 0/TCP
Command:
/testctl
Args:
http
start
-v
Environment:
GIN_MODE: release
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-mw988 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-mw988:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-mw988
Optional: false
QoS Class: BestEffort
Node-Selectors: aaa=bbb
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5s (x5 over 93s) default-scheduler 0/5 nodes are available: 5 node(s) didn't match node selector.
Retest 2
$ oc get clusterversions.config.openshift.io
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.0-0.nightly-2019-04-28-064010 True False 130m Cluster version is 4.1.0-0.nightly-2019-04-28-064010
Using the correct key name (capital first letter):
DefaultNodeSelector: aaa=bbb
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
web 0/1 Pending 0 2m29s <none> <none> <none> <none>
$ oc describe pod web
Name: web
Namespace: aaa
Priority: 0
PriorityClassName: <none>
Node: <none>
Labels: app=web
Annotations: openshift.io/scc: anyuid
Status: Pending
IP:
Containers:
test-go:
Image: quay.io/hongkailiu/test-go:testctl-0.0.6-83ce61e2
Port: 8080/TCP
Host Port: 0/TCP
Command:
/testctl
Args:
http
start
-v
Environment:
GIN_MODE: release
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-rwtgm (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-rwtgm:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-rwtgm
Optional: false
QoS Class: BestEffort
Node-Selectors: aaa=bbb
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 22s (x8 over 2m49s) default-scheduler 0/6 nodes are available: 6 node(s) didn't match node selector.
Retest PASS!
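A quick way to confirm both halves of this result: that admission injected the selector into the pod, and that no node carries the matching label (standard oc usage, not commands from the original report):

```console
# The node selector injected into the pod by admission
$ oc get pod web -o jsonpath='{.spec.nodeSelector}'

# Nodes carrying the label; an empty list explains the Pending state
$ oc get nodes -l aaa=bbb
```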
This looks odd then: https://github.com/openshift/api/blob/master/config/v1/types_scheduling.go#L50

    DefaultNodeSelector string `json:"defaultNodeSelector,omitempty"`

Probably I am looking at the wrong source file. Shouldn't the key be `defaultNodeSelector`?

Can you provide me with access to the cluster? I specifically need the kube-apiserver logs.

Created attachment 1574322 [details]
api-server log
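One plausible source of the capitalization confusion: Go's encoding/json falls back to a case-insensitive match when no struct field tag matches an object key exactly, so a `DefaultNodeSelector` key can still decode into the field tagged `defaultNodeSelector`. A self-contained demonstration with plain encoding/json (the OpenShift API machinery may decode more strictly; this only shows the standard-library behavior):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SchedulerSpec mirrors the single field quoted above from types_scheduling.go.
type SchedulerSpec struct {
	DefaultNodeSelector string `json:"defaultNodeSelector,omitempty"`
}

func main() {
	var exact, wrongCase SchedulerSpec

	// Exact match against the JSON tag.
	json.Unmarshal([]byte(`{"defaultNodeSelector":"aaa=bbb"}`), &exact)

	// Wrong capitalization: encoding/json accepts a case-insensitive
	// match as a fallback, so this decodes into the same field.
	json.Unmarshal([]byte(`{"DefaultNodeSelector":"aaa=bbb"}`), &wrongCase)

	fmt.Println(exact.DefaultNodeSelector, wrongCase.DefaultNodeSelector)
	// Output: aaa=bbb aaa=bbb
}
```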
Sunil, be advised: after making the change to the defaultNodeSelector in the cluster Scheduler config resource, the kube-apiserver pods must redeploy. This can take several minutes. Until the kube-apiserver pods redeploy, the defaultNodeSelector will not take effect.

Thanks ravig & Seth for the clarification. After applying the change, I was monitoring the kube-scheduler pods for a restart, which, as ravig advised, is not what happens. After waiting several minutes, I do see all 3 kube-apiserver pods restarted, after which newly created pods have the defaultNodeSelector applied. Also, I see that `DefaultNodeSelector` does not work, contrary to what I assumed earlier; only `defaultNodeSelector` works.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758