Description of problem:

The olm-operators pod should, like all CVO-managed components without a special exception, tolerate masters, but it does not.

Version-Release number of selected component (if applicable):

$ oc adm release info --commits | grep lifecycle
operator-lifecycle-manager  https://github.com/operator-framework/operator-lifecycle-manager  1f312481ae3641eea471abb792f9b056206e4cf4

How reproducible:

Every time.

Steps to Reproduce:
1. Break your Machine API provider, e.g. by running libvirt with a non-standard volume pool before [1] lands.
2. Launch a cluster.
3. Wait for things to stabilize.

Then:

$ oc get pods --all-namespaces | grep Pending
openshift-ingress                      router-default-7688479d99-nbnj8        0/1   Pending   0   31m
openshift-monitoring                   prometheus-operator-647d84b5c6-rsplb   0/1   Pending   0   31m
openshift-operator-lifecycle-manager   olm-operators-sf5sm                    0/1   Pending   0   36m
$ oc get pods -o yaml -n openshift-operator-lifecycle-manager olm-operators-sf5sm
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2019-01-30T19:55:21Z
  generateName: olm-operators-
  labels:
    olm.catalogSource: olm-operators
    olm.configMapResourceVersion: "2424"
  name: olm-operators-sf5sm
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: CatalogSource
    name: olm-operators
    uid: eeccfb46-24c8-11e9-a9d2-664f163f5f0f
  resourceVersion: "2437"
  selfLink: /api/v1/namespaces/openshift-operator-lifecycle-manager/pods/olm-operators-sf5sm
  uid: fa08ea5e-24c8-11e9-8d1a-52fdfc072182
spec:
  containers:
  - command:
    - configmap-server
    - -c
    - olm-operators
    - -n
    - openshift-operator-lifecycle-manager
    image: quay.io/operatorframework/configmap-operator-registry:latest
    imagePullPolicy: Always
    livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=localhost:50051
      failureThreshold: 3
      initialDelaySeconds: 2
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: configmap-registry-server
    ports:
    - containerPort: 50051
      name: grpc
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=localhost:50051
      failureThreshold: 3
      initialDelaySeconds: 1
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: olm-operators-configmap-server-token-ndd9k
      readOnly: true
  dnsPolicy: ClusterFirst
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: olm-operators-configmap-server
  serviceAccountName: olm-operators-configmap-server
  terminationGracePeriodSeconds: 30
  volumes:
  - name: olm-operators-configmap-server-token-ndd9k
    secret:
      defaultMode: 420
      secretName: olm-operators-configmap-server-token-ndd9k
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-01-30T19:55:21Z
    message: '0/1 nodes are available: 1 node(s) had taints that the pod didn''t
      tolerate.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort

Actual results:

Pending pod with "0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.".

Expected results:

A running pod.

Additional info:

"high" severity is based on Clayton's request [2].

[1]: https://github.com/openshift/cluster-api-provider-libvirt/pull/45
[2]: https://github.com/openshift/installer/pull/1146#issuecomment-459037176
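Note that the pod spec above carries no tolerations stanza at all. As a minimal sketch only (the stanza OLM actually ships may differ; see the PR discussed below), a blanket toleration that would let the pod schedule onto a tainted master looks like:

# Hypothetical addition to the olm-operators pod spec; a key-less
# "Exists" operator tolerates every taint, including the master taint.
spec:
  tolerations:
  - operator: "Exists"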
I think that unless you have a technical reason why your component won't currently work on masters (e.g. [1]), you should *tolerate* them. This is different from *restricting* to masters; you can certainly continue to tolerate compute nodes as well. One use-case is all-in-one libvirt clusters (one master, zero compute nodes). [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1671136#c1
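To make the tolerate-vs-restrict distinction concrete, a sketch using the standard Kubernetes master role label (not anything OLM-specific): tolerating masters is just the toleration stanza on its own (as sketched above), while restricting to masters additionally pins the pod there via a nodeSelector.

# Restricting to masters: the pod may only land on master nodes.
# It still needs a master toleration, since masters are tainted.
nodeSelector:
  node-role.kubernetes.io/master: ""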
(In reply to W. Trevor King from comment #3)
> I think that unless you have a technical reason why your component won't
> currently work on masters (e.g. [1]), you should *tolerate* them. This is
> different from *restricting* to masters; you can certainly continue to
> tolerate compute nodes as well. One use-case is all-in-one libvirt clusters
> (one master, zero compute nodes).
>
> [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1671136#c1

Is tolerating everything acceptable, or is there a specific taint key for masters?

tolerations:
- operator: "Exists"

https://github.com/operator-framework/operator-lifecycle-manager/pull/708
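For reference on the question above: masters in these clusters are tainted with the key node-role.kubernetes.io/master and effect NoSchedule, so a narrower alternative to tolerating everything would be a toleration keyed to that taint, roughly:

# Narrower alternative to the blanket toleration in the PR above;
# tolerates only the master taint rather than all taints.
tolerations:
- key: "node-role.kubernetes.io/master"
  operator: "Exists"
  effect: "NoSchedule"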
From looking at OLM's other deployments and the cluster-ingress-operator manifests, it seems like this works. I'll let you know once merged.
The toleration change has been merged: https://github.com/operator-framework/operator-lifecycle-manager/pull/708 https://jira.coreos.com/browse/ALM-908
Now that the nodeSelector label has been added to the deployments, all of the OLM pods are running on the master nodes. Verified.

OLM version id: cce4af21efb662527a8f71d22f7f2c37007ea4bf

[jzhang@dhcp-140-18 payload]$ oc get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
catalog-operator   1/1     1            1           6h26m
olm-operator       1/1     1            1           6h26m
packageserver      2/2     2            2           36m

[jzhang@dhcp-140-18 payload]$ oc get deployment -o yaml | grep nodeSelector -A 3
      nodeSelector:
        beta.kubernetes.io/os: linux
        node-role.kubernetes.io/master: ""
      restartPolicy: Always
--
      nodeSelector:
        beta.kubernetes.io/os: linux
        node-role.kubernetes.io/master: ""
      restartPolicy: Always
--
      nodeSelector:
        beta.kubernetes.io/os: linux
        node-role.kubernetes.io/master: ""
      restartPolicy: Always

[jzhang@dhcp-140-18 payload]$ oc get pods -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE
catalog-operator-7fc5d98dbd-vf5rc   1/1     Running   0          6h26m   10.130.0.11   ip-10-0-141-17.us-east-2.compute.internal    <none>
olm-operator-75558c6d7-s7mrt        1/1     Running   0          40m     10.130.0.47   ip-10-0-141-17.us-east-2.compute.internal    <none>
olm-operators-ldt4l                 1/1     Running   0          6h26m   10.130.0.12   ip-10-0-141-17.us-east-2.compute.internal    <none>
packageserver-54d858d7c6-jwkmf      1/1     Running   0          23m     10.128.0.72   ip-10-0-175-83.us-east-2.compute.internal    <none>
packageserver-54d858d7c6-kx478      1/1     Running   0          23m     10.129.0.54   ip-10-0-154-197.us-east-2.compute.internal   <none>

[jzhang@dhcp-140-18 payload]$ oc get nodes --show-labels
NAME                                         STATUS   ROLES    AGE     VERSION              LABELS
ip-10-0-141-17.us-east-2.compute.internal    Ready    master   6h49m   v1.12.4+ec459b84aa   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/hostname=ip-10-0-141-17,node-role.kubernetes.io/master=
ip-10-0-142-252.us-east-2.compute.internal   Ready    worker   6h34m   v1.12.4+ec459b84aa   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/hostname=ip-10-0-142-252,node-role.kubernetes.io/worker=
ip-10-0-152-115.us-east-2.compute.internal   Ready    worker   6h34m   v1.12.4+ec459b84aa   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/hostname=ip-10-0-152-115,node-role.kubernetes.io/worker=
ip-10-0-154-197.us-east-2.compute.internal   Ready    master   6h49m   v1.12.4+ec459b84aa   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/hostname=ip-10-0-154-197,node-role.kubernetes.io/master=
ip-10-0-168-71.us-east-2.compute.internal    Ready    worker   6h34m   v1.12.4+ec459b84aa   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.large,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/hostname=ip-10-0-168-71,node-role.kubernetes.io/worker=
ip-10-0-175-83.us-east-2.compute.internal    Ready    master   6h49m   v1.12.4+ec459b84aa   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m4.xlarge,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/hostname=ip-10-0-175-83,node-role.kubernetes.io/master=

PS: Operators installed by OLM are not themselves deployed on the master nodes; that depends on each operator component's own manifests, which is as expected. Correct me if I'm wrong.

[jzhang@dhcp-140-18 payload]$ oc get pods -n openshift-operators -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE
etcd-operator-755449645b-ljfkk   3/3     Running   0          2m37s   10.128.2.17   ip-10-0-152-115.us-east-2.compute.internal   <none>
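To illustrate the PS (purely hypothetical, not something OLM sets on installed operators): an operator's own Deployment would need to carry its own scheduling constraints to land on masters, for example:

# Hypothetical Deployment fragment for an OLM-installed operator that
# chooses to run on masters; the fields here are illustrative only.
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"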
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758