Bug 1900322

Summary: metal3 pod's toleration for key: node-role.kubernetes.io/master currently matches on exact value matches but should match on Exists
Product: OpenShift Container Platform Reporter: Andreas Karis <akaris>
Component: Bare Metal Hardware Provisioning Assignee: Robin Cernin <rcernin>
Bare Metal Hardware Provisioning sub component: cluster-baremetal-operator QA Contact: Amit Ugol <augol>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs, omichael, rcernin
Version: 4.7   
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Previously, the metal3 pod's NoSchedule toleration matched only on the exact value "true". Now the NoSchedule toleration uses the Exists operator, so it matches on any value. This removes confusion for the operator who needs to set the NoSchedule toleration.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:35:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Andreas Karis 2020-11-22 08:42:45 UTC
Description of problem:

> This may be fixed in a later version. I ran across this on a customer environment in 4.4 and do not have a test system with metal3 available, at the moment.

The metal3 pod's toleration for node-role.kubernetes.io/master currently matches only on an exact value, because it does not set "operator: Exists".

https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
~~~
A toleration "matches" a taint if the keys are the same and the effects are the same, and:

    the operator is Exists (in which case no value should be specified), or
    the operator is Equal and the values are equal.
~~~
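
The matching rule quoted above can be sketched in a few lines of Python. This is an illustration only, not the scheduler's actual code: the function name `toleration_matches` and the plain-dict representation of taints and tolerations are assumptions made for this example. It relies on two documented Kubernetes behaviours: an omitted operator defaults to Equal, and a toleration with an empty effect matches every effect. Only the Exists and Equal operators are covered.

```python
MASTER_KEY = "node-role.kubernetes.io/master"


def toleration_matches(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration tolerates the taint (sketch only)."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect", ""):
        return False
    # An empty key (only meaningful with Exists) matches every taint key.
    key = toleration.get("key", "")
    if key and key != taint.get("key", ""):
        return False
    # Kubernetes treats a missing operator as Equal.
    operator = toleration.get("operator", "") or "Equal"
    if operator == "Exists":
        return True
    if operator == "Equal":
        # A missing value is compared as the empty string.
        return toleration.get("value", "") == taint.get("value", "")
    return False


# Hypothetical taints: one without a value, one carrying value "true".
plain_taint = {"key": MASTER_KEY, "effect": "NoSchedule"}
valued_taint = {"key": MASTER_KEY, "effect": "NoSchedule", "value": "true"}

# A toleration with no operator (Equal, empty value) versus one with Exists.
equal_toleration = {"key": MASTER_KEY, "effect": "NoSchedule"}
exists_toleration = {"key": MASTER_KEY, "effect": "NoSchedule", "operator": "Exists"}

for name, tol in (("Equal", equal_toleration), ("Exists", exists_toleration)):
    print(name, toleration_matches(tol, plain_taint), toleration_matches(tol, valued_taint))
```

With these inputs the Exists toleration tolerates both taints, while the toleration without an operator tolerates only the taint that has no value.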

However, it should match on "operator: Exists", the same as the vast majority of our pods which are allowed to run on unschedulable masters. 
~~~
[kni@provisioner ~]$ oc get pod -n openshift-machine-api metal3-7d8bdb796d-wpt4h  -o yaml | grep -i tolera -A20
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
~~~

Compare this with output from a lab cluster, which shows that our pods usually match on "operator: Exists":
~~~
[akaris@linux sriov-network-operator]$ oc get pods -A -o wide | grep ip-10-0-133-15.eu-west-1.compute.internal | grep Running | awk '{print $1 " " $2}' | while read a b ; do echo === $a/$b === ; oc get pod -n $a $b -o yaml | grep 'key: node-role.kubernetes.io/master' -C1; done
=== openshift-apiserver-operator/openshift-apiserver-operator-7546b84744-b55ms ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-apiserver/apiserver-7c85b978fd-n8h8d ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-authentication-operator/authentication-operator-849d6b8888-lgn5h ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-authentication/oauth-openshift-56cd58fcbf-drxgv ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-cluster-machine-approver/machine-approver-58fc6999c-lmqdp ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-cluster-node-tuning-operator/cluster-node-tuning-operator-7cf7b68cff-7jxf9 ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-cluster-node-tuning-operator/tuned-xfmlt ===
=== openshift-cluster-version/cluster-version-operator-5f4d94dcd9-vpv6d ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-console/console-5c7fd94d5d-gzb4s ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-controller-manager-operator/openshift-controller-manager-operator-6f95cb6dff-cx2s6 ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-controller-manager/controller-manager-hnz5h ===
=== openshift-dns-operator/dns-operator-69b6698b4c-x4sqq ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-dns/dns-default-4fhhx ===
=== openshift-etcd-operator/etcd-operator-7f5bcbf444-nd5zj ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-etcd/etcd-ip-10-0-133-15.eu-west-1.compute.internal ===
=== openshift-image-registry/node-ca-n8jzl ===
=== openshift-kube-apiserver-operator/kube-apiserver-operator-7bb7f6c9db-h7h57 ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-kube-apiserver/kube-apiserver-ip-10-0-133-15.eu-west-1.compute.internal ===
=== openshift-kube-controller-manager-operator/kube-controller-manager-operator-66c98959c7-4d928 ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-kube-controller-manager/kube-controller-manager-ip-10-0-133-15.eu-west-1.compute.internal ===
=== openshift-kube-scheduler-operator/openshift-kube-scheduler-operator-6c7f76d7b4-l9hxf ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-133-15.eu-west-1.compute.internal ===
=== openshift-kube-storage-version-migrator-operator/kube-storage-version-migrator-operator-88df9db45-f5mcg ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-machine-config-operator/etcd-quorum-guard-798955868-jwvfl ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-machine-config-operator/machine-config-daemon-f2xrp ===
=== openshift-machine-config-operator/machine-config-operator-5cdf6fdfdf-l6hp7 ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-machine-config-operator/machine-config-server-rzk76 ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-monitoring/node-exporter-62hh2 ===
=== openshift-multus/multus-admission-controller-4stbg ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-multus/multus-x8ms7 ===
=== openshift-network-operator/network-operator-7c67d58b9b-nrvt7 ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-operator-lifecycle-manager/catalog-operator-7fdbcccd94-8fbp9 ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-operator-lifecycle-manager/olm-operator-69bc9b8675-stjkv ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-ovn-kubernetes/ovnkube-master-b2hft ===
  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-ovn-kubernetes/ovnkube-node-jwrdt ===
=== openshift-ovn-kubernetes/ovs-node-zsrfw ===
=== openshift-service-ca-operator/service-ca-operator-648466c4f4-7w6rm ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator-5bfc4645f5-kxcc7 ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator-89488bqc8 ===
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
=== openshift-storage/csi-cephfsplugin-f9zht ===
=== openshift-storage/csi-rbdplugin-c5h8p ===
~~~
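
The same check can be done without grep by working on JSON output. The following is a minimal sketch, assuming the pod list comes from `oc get pods -A -o json`; the function name `pods_without_exists` and the sample data are hypothetical and only illustrate the idea. It flags every pod that has a toleration for the master key whose operator is not Exists.

```python
from typing import List

MASTER_KEY = "node-role.kubernetes.io/master"


def pods_without_exists(pod_list: dict, key: str = MASTER_KEY) -> List[str]:
    """List namespace/name of pods whose toleration for `key` is not Exists."""
    flagged = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        tolerations = pod.get("spec", {}).get("tolerations") or []
        for tol in tolerations:
            # A missing operator means Equal, so it is flagged as well.
            if tol.get("key") == key and tol.get("operator", "Equal") != "Exists":
                flagged.append(f"{meta.get('namespace')}/{meta.get('name')}")
                break
    return flagged


# Hypothetical sample shaped like the output of `oc get pods -A -o json`.
sample = {
    "items": [
        {
            "metadata": {"namespace": "openshift-machine-api", "name": "metal3-example"},
            "spec": {"tolerations": [{"effect": "NoSchedule", "key": MASTER_KEY}]},
        },
        {
            "metadata": {"namespace": "openshift-apiserver", "name": "apiserver-example"},
            "spec": {
                "tolerations": [
                    {"effect": "NoSchedule", "key": MASTER_KEY, "operator": "Exists"}
                ]
            },
        },
    ]
}

print(pods_without_exists(sample))
```

Against a real cluster, the dict passed to the function would be the result of `json.load()` on the command's output instead of the sample.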

We have a customer who modified their master node taint to:
~~~
  name: openshift-master-1
  resourceVersion: "24746231"
  selfLink: /api/v1/nodes/openshift-master-1
  uid: efec3896-1250-4b42-be13-dadcd0493479
spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    value: "true"
status:
  addresses:
~~~

It's subtle, but the default is:
~~~
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
~~~

After the customer added `value: true` to the taint of their 3 master nodes, metal3 could not be scheduled on the masters. I agree that administrators should not change the taint, but the vast majority of our pods have a toleration for key existence, not for exact value match, and metal3 should have the same behavior.

Without "operator: Exists", the toleration matches only on the exact value of the node-role.kubernetes.io/master taint. That's why "value: true" stopped the metal3 pod from being scheduled:
~~~
77m         Warning   FailedScheduling         pod/machine-api-controllers-7f794c7b-stlf6         0/8 nodes are available: 1 node(s) were unschedulable, 3 node(s) had taints that the pod didn't tolerate, 4 node(s) didn't match node selector
~~~

Comment 6 errata-xmlrpc 2021-02-24 15:35:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633