Bug 1941901 - Toleration merge logic does not account for multiple entries with the same key
Summary: Toleration merge logic does not account for multiple entries with the same key
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 4.8.0
Assignee: W. Trevor King
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On: 1942271
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-03-23 04:24 UTC by W. Trevor King
Modified: 2021-07-27 22:55 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator matched tolerations by key when reconciling pod templates. Consequence: Manifests with multiple tolerations for the same key, such as the cluster-version operator's own deployment, would have the later entry clobber the earlier entry, causing the in-cluster tolerations to diverge from the manifest author's desired tolerations. Fix: The cluster-version operator now considers tolerations matching only when they are completely equal. Result: The cluster-version operator will work to keep all tolerations present in the manifest in the in-cluster resource.
Clone Of:
: 1942271 (view as bug list)
Environment:
Last Closed: 2021-07-27 22:55:03 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 578 0 None open Bug 1941901: lib/resourcemerge/core: Fix toleration matching logic 2021-05-27 23:09:16 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:55:16 UTC

Description W. Trevor King 2021-03-23 04:24:23 UTC
The ensureTolerations logic is very old [1], and only matches by 'key' value [2].  That doesn't work well with entries like the CVO's own Deployment, which has had multiple node.kubernetes.io/not-ready entries for years [3,4].  In at least one 4.7.2 cluster, and probably most clusters, that leads to an in-cluster Deployment with:

$ yaml2json <deployment.yaml | jq -cS '.items[].spec.template.spec.tolerations[]'
{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/network-unavailable","operator":"Exists"}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}

As the manifest's final not-ready NoExecute falsely matches and thus clobbers the manifests original not-ready NoSchedule entry.  We should probably be matching on (key,effect) tuples.  Toleration docs in [5].

Poking at the manifests in a recent nightly:

$ oc adm release extract --to manifests registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-22-104536
$ grep -rhA100 tolerations: manifests/ | grep -o 'effect:.*' | sort | uniq -c | sort -n
      9 effect:
     10 effect: NoSchedule
     12 effect: NoExecute
     24 effect: "NoSchedule"
     46 effect: "NoExecute"
$ grep -rhA100 tolerations: manifests/ | grep -o 'operator:.*' | sort | uniq -c | sort -n
     14 operator:
     31 operator: Exists
     61 operator: "Exists"
$ grep -rhA100 tolerations: manifests/ | grep -o 'key:.*' | sort | uniq -c | sort -n
      1 key: "node.kubernetes.io/memory-pressure"
      1 key: node-role.kubernetes.io/master  
      1 key: node.kubernetes.io/network-unavailable
      4 key: node-role.kubernetes.io/master  # Just tolerate NoSchedule taint on master node. If there are other conditions like disk-pressure etc, let's not schedule the control-plane pods onto that node.
      5 key: ca-bundle.crt
      6 key: node.kubernetes.io/not-ready
      6 key: node.kubernetes.io/unreachable
     10 key: node-role.kubernetes.io/master
     14 key:
     16 key: "node-role.kubernetes.io/master"
     23 key: "node.kubernetes.io/unreachable"
     24 key: "node.kubernetes.io/not-ready"

Using grep instead of a YAML parser is a bit sloppy, so that may certainly include some non-toleration entries.  Turns out the "no-effect", "no-operator", and "no-key" entries are all from CRDs:

$ grep -rA100 tolerations: manifests/ | grep 'effect:$'
manifests/0000_50_olm_00-subscriptions.crd.yaml-                          effect:
manifests/0000_50_olm_00-clusterserviceversions.crd.yaml-                                                effect:
...
$ grep -rA100 tolerations: manifests/ | grep 'operator:$'                                            
manifests/0000_50_olm_00-subscriptions.crd.yaml-                          operator:
manifests/0000_50_olm_00-clusterserviceversions.crd.yaml-                                                operator:
...
$ grep -rA100 tolerations: manifests/ | grep 'key:$'
manifests/0000_50_olm_00-subscriptions.crd.yaml-                          key:
manifests/0000_50_olm_00-clusterserviceversions.crd.yaml-                                                key:
manifests/0000_50_olm_00-clusterserviceversions.crd.yaml-                                                          key:
...

So we're all Exists at the moment, and logic that looks for (key,effect) tuples should be fine.  Or maybe (operator,key,effect) tuples, because admins can add arbitrary tolerations locally.  Hrm...

[1]: https://github.com/openshift/cluster-version-operator/commit/d9f6718de071cd886851b68e8b3d72fbe6618f6f#diff-c4f3683148029d6da95ececa09e0625f5e860ff5c37c306c19cfc1f27d218911R282-R300
[2]: https://github.com/openshift/cluster-version-operator/blame/63471e9191b6c342d2bf037d233ef98c5e4c8468/lib/resourcemerge/core.go#L390
[3]: https://github.com/openshift/cluster-version-operator/blame/63471e9191b6c342d2bf037d233ef98c5e4c8468/install/0000_00_cluster-version-operator_03_deployment.yaml#L64-L84
[4]: https://github.com/openshift/cluster-version-operator/pull/182/files#diff-d545314649981da131e9fa5eec88fd1fc37172eb51389689aa227464191bb133R62-R72
[5]: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/

Comment 1 W. Trevor King 2021-03-24 02:40:33 UTC
I've cloned off bug 1942271 to get the insights gathering in place we want before we make drastic changes to how the CVO merges tolerations.

Comment 2 W. Trevor King 2021-03-27 23:14:09 UTC
I also took a pass through the other CVO-managed manifests from a recent 4.8 nightly:

$ oc adm release extract --to manifests registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-26-054333
$ for X in manifests/*.yaml; do yaml2json < "${X}" | jq -r '.[] | select(.kind == "Deployment" or .kind == "DaemonSet") | .metadata as $m | .spec.template.spec | select(.tolerations != null and (.tolerations | length) > (.tolerations | unique_by(.key) | length)) | $m.namespace + " " + $m.name + " " + ([
.tolerations[].key] | tostring)'; done
...no hits...

That doesn't limit what admins might have injected, but does mean that the CVO's own Deployment is the only self-inflicted conflict.

Comment 3 W. Trevor King 2021-05-18 15:58:59 UTC
We still need more data from the insights gathering in the bug 1942271 series, so this is unlikely to get fixed before 4.8 GAs.

Comment 5 Yang Yang 2021-06-07 09:57:16 UTC
Reproducing with 4.8.0-fc.3

# oc -n openshift-cluster-version get deployments -ojson |  jq -cS '.items[].spec.template.spec.tolerations[]'
{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/network-unavailable","operator":"Exists"}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}

We can see there are 2 NoExecute not-ready tolerations. The former one should be NoSchedule not-ready which is overridden.

Verifying with 4.8.0-0.nightly-2021-06-06-164529

# oc -n openshift-cluster-version get deployments -ojson |  jq -cS '.items[].spec.template.spec.tolerations[]'
{"effect":"NoSchedule","key":"node-role.kubernetes.io/master","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/unschedulable","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/network-unavailable","operator":"Exists"}
{"effect":"NoSchedule","key":"node.kubernetes.io/not-ready","operator":"Exists"}
{"effect":"NoExecute","key":"node.kubernetes.io/unreachable","operator":"Exists","tolerationSeconds":120}
{"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":120}

The NoSchedule not-ready toleration is kept around. Moving it to verified state.

Comment 8 errata-xmlrpc 2021-07-27 22:55:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438


Note You need to log in before you can comment on or make changes to this bug.