The ensureTolerations logic is very old [1], and only matches by 'key' value [2]. That doesn't work well with entries like the CVO's own Deployment, which has had multiple node.kubernetes.io/not-ready entries for years [3,4]. In at least one 4.7.2 cluster, and probably most clusters, that leads to an in-cluster Deployment with:

$ yaml2json <deployment.yaml | jq -cS '.items[].spec.template.spec.tolerations[]'
{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/network-unavailable","operator":"Exists"}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}

The manifest's final not-ready NoExecute entry falsely matches, and thus clobbers, the manifest's original not-ready NoSchedule entry. We should probably be matching on (key, effect) tuples; toleration docs are in [5]. Poking at the manifests in a recent nightly:

$ oc adm release extract --to manifests registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-22-104536
$ grep -rhA100 tolerations: manifests/ | grep -o 'effect:.*' | sort | uniq -c | sort -n
      9 effect:
     10 effect: NoSchedule
     12 effect: NoExecute
     24 effect: "NoSchedule"
     46 effect: "NoExecute"
$ grep -rhA100 tolerations: manifests/ | grep -o 'operator:.*' | sort | uniq -c | sort -n
     14 operator:
     31 operator: Exists
     61 operator: "Exists"
$ grep -rhA100 tolerations: manifests/ | grep -o 'key:.*' | sort | uniq -c | sort -n
      1 key: "node.kubernetes.io/memory-pressure"
      1 key: node-role.kubernetes.io/master
      1 key: node.kubernetes.io/network-unavailable
      4 key: node-role.kubernetes.io/master # Just tolerate NoSchedule taint on master node. If there are other conditions like disk-pressure etc, let's not schedule the control-plane pods onto that node.
      5 key: ca-bundle.crt
      6 key: node.kubernetes.io/not-ready
      6 key: node.kubernetes.io/unreachable
     10 key: node-role.kubernetes.io/master
     14 key:
     16 key: "node-role.kubernetes.io/master"
     23 key: "node.kubernetes.io/unreachable"
     24 key: "node.kubernetes.io/not-ready"

Using grep instead of a YAML parser is a bit sloppy, so that may well include some non-toleration entries. Turns out the "no-effect", "no-operator", and "no-key" entries are all from CRDs:

$ grep -rA100 tolerations: manifests/ | grep 'effect:$'
manifests/0000_50_olm_00-subscriptions.crd.yaml- effect:
manifests/0000_50_olm_00-clusterserviceversions.crd.yaml- effect:
...
$ grep -rA100 tolerations: manifests/ | grep 'operator:$'
manifests/0000_50_olm_00-subscriptions.crd.yaml- operator:
manifests/0000_50_olm_00-clusterserviceversions.crd.yaml- operator:
...
$ grep -rA100 tolerations: manifests/ | grep 'key:$'
manifests/0000_50_olm_00-subscriptions.crd.yaml- key:
manifests/0000_50_olm_00-clusterserviceversions.crd.yaml- key:
manifests/0000_50_olm_00-clusterserviceversions.crd.yaml- key:
...

So we're all Exists at the moment, and logic that looks for (key, effect) tuples should be fine. Or maybe (operator, key, effect) tuples, because admins can add arbitrary tolerations locally. Hrm... A rough sketch of that tuple matching follows the references below.
[1]: https://github.com/openshift/cluster-version-operator/commit/d9f6718de071cd886851b68e8b3d72fbe6618f6f#diff-c4f3683148029d6da95ececa09e0625f5e860ff5c37c306c19cfc1f27d218911R282-R300
[2]: https://github.com/openshift/cluster-version-operator/blame/63471e9191b6c342d2bf037d233ef98c5e4c8468/lib/resourcemerge/core.go#L390
[3]: https://github.com/openshift/cluster-version-operator/blame/63471e9191b6c342d2bf037d233ef98c5e4c8468/install/0000_00_cluster-version-operator_03_deployment.yaml#L64-L84
[4]: https://github.com/openshift/cluster-version-operator/pull/182/files#diff-d545314649981da131e9fa5eec88fd1fc37172eb51389689aa227464191bb133R62-R72
[5]: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
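To make the (key, effect) idea concrete, here's a minimal Go sketch. The stand-alone mergeTolerations function and its signature are hypothetical, not the actual ensureTolerations code in lib/resourcemerge/core.go [2]; extending the match to (operator, key, effect) tuples would just be one more comparison in the inner loop.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// mergeTolerations applies the required tolerations over the existing ones,
// treating (key, effect) as the identity of an entry, so a NoExecute
// not-ready toleration can no longer clobber a NoSchedule not-ready one.
func mergeTolerations(existing, required []corev1.Toleration) []corev1.Toleration {
	merged := append([]corev1.Toleration(nil), existing...)
	for _, req := range required {
		matched := false
		for i := range merged {
			if merged[i].Key == req.Key && merged[i].Effect == req.Effect {
				merged[i] = req // same (key, effect): take the required entry
				matched = true
				break
			}
		}
		if !matched {
			merged = append(merged, req)
		}
	}
	return merged
}

func main() {
	seconds := int64(120)
	existing := []corev1.Toleration{
		{Key: "node.kubernetes.io/not-ready", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoSchedule},
	}
	required := []corev1.Toleration{
		{Key: "node.kubernetes.io/not-ready", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoSchedule},
		{Key: "node.kubernetes.io/not-ready", Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoExecute, TolerationSeconds: &seconds},
	}
	for _, t := range mergeTolerations(existing, required) {
		fmt.Println(t.Key, t.Effect)
	}
	// Both the NoSchedule and the NoExecute not-ready entries survive; key-only
	// matching would have replaced the NoSchedule entry with the NoExecute one.
}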
I've cloned off bug 1942271 to get the insights gathering we want in place before we make drastic changes to how the CVO merges tolerations.
I also took a pass through the other CVO-managed manifests from a recent 4.8 nightly:

$ oc adm release extract --to manifests registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-26-054333
$ for X in manifests/*.yaml; do yaml2json < "${X}" | jq -r '.[] | select(.kind == "Deployment" or .kind == "DaemonSet") | .metadata as $m | .spec.template.spec | select(.tolerations != null and (.tolerations | length) > (.tolerations | unique_by(.key) | length)) | $m.namespace + " " + $m.name + " " + ([ .tolerations[].key] | tostring)'; done
...no hits...

That doesn't limit what admins might have injected, but it does mean that the CVO's own Deployment is the only self-inflicted conflict.
We still need more data from the insights gathering in the bug 1942271 series, so this is unlikely to get fixed before 4.8 GAs.
Reproducing with 4.8.0-fc.3:

# oc -n openshift-cluster-version get deployments -ojson | jq -cS '.items[].spec.template.spec.tolerations[]'
{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/network-unavailable","operator":"Exists"}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}

We can see there are two NoExecute not-ready tolerations; the first of them should have been the NoSchedule not-ready toleration, but it has been overridden.

Verifying with 4.8.0-0.nightly-2021-06-06-164529:

# oc -n openshift-cluster-version get deployments -ojson | jq -cS '.items[].spec.template.spec.tolerations[]'
{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/network-unavailable","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/not-ready","operator":"Exists"}
{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}

The NoSchedule not-ready toleration is now kept around. Moving this to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438