Description of problem:
No global tolerations are set for the NodeCA DaemonSet in openshift-image-registry. When taints are set on specific nodes, the DaemonSet no longer controls the pods on those nodes, and the pods are deleted the next time the DaemonSet is redeployed.

Version-Release number of selected component (if applicable):
4.2.10

How reproducible:
Every time.

Steps to Reproduce:
1. Check the NodeCA DaemonSet:

   oc get ds
   NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
   node-ca   5         5         5       5            5           kubernetes.io/os=linux   7d18h

2. Taint a node:

   oc adm taint nodes infra0 infra='true':NoSchedule

3. Check the DaemonSet again. The desired count decreases by 1:

   oc get ds
   NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
   node-ca   4         4         4       4            4           kubernetes.io/os=linux   7d18h

4. At the next DaemonSet redeployment, the pod is no longer deployed on the tainted node.

Expected results:
After tainting the nodes, the number of pods the NodeCA DaemonSet controls should stay the same.

Additional info:
After tainting the nodes, the number of pods NodeCA DaemonSet controls is the same with 4.4.0-0.nightly-2020-02-03-005212:

spec:
  providerID: aws:///us-east-2b/i-097409042f4872c6c
  taints:
  - effect: NoSchedule
    key: infra
    value: "true"
Any chance of backporting to 4.2/4.3, or alternative workarounds? The documented process for setting up dedicated OCS nodes [1] has the user taint the storage nodes, which will break nodeCA.

[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.2/html-single/deploying_openshift_container_storage/index#creating-an-openshift-container-storage-service_rhocs
*** Bug 1801474 has been marked as a duplicate of this bug. ***
@Chet - backport PRs are now up [1][2]. There will be separate BZs to track the release process for 4.3.z and 4.2.z. It may take some time for the 4.2.z fix to go out, since it must be released in 4.3.z first. Currently (02/28/2020) there is a large backlog of 4.3.z fixes, and we are gating patches based on our QE team's capacity.

[1] https://github.com/openshift/cluster-image-registry-operator/pull/472
[2] https://github.com/openshift/cluster-image-registry-operator/pull/473
I think these PRs are not correct for NoExecute. Should this be:

tolerations:
- operator: Exists

as seen in this PR?
https://github.com/openshift/cluster-image-registry-operator/pull/457
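To illustrate the NoExecute concern raised above, here is a minimal Python sketch of how Kubernetes matches tolerations against taints, as I understand the documented rules. It is not the scheduler's code and says nothing about what the backport PRs actually contain. The node names other than infra0 are hypothetical, and the toleration limited to the NoSchedule effect is only an assumed example of a narrower toleration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key + operator Exists matches every key
    operator: str = "Equal"  # Kubernetes default operator
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches this single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if not tol.key and tol.operator != "Exists":
        return False  # an empty key is only meaningful with Exists
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


def eligible_count(tolerations: List[Toleration],
                   nodes: Dict[str, List[Taint]]) -> int:
    """Count nodes whose hard taints are all tolerated."""
    hard = ("NoSchedule", "NoExecute")
    return sum(
        all(any(tolerates(tol, t) for tol in tolerations)
            for t in taints if t.effect in hard)
        for taints in nodes.values()
    )


# Five nodes as in the report; only infra0 is named there, the rest are made up.
nodes = {
    "infra0": [Taint("infra", "NoSchedule", "true")],
    "node1": [], "node2": [], "node3": [], "node4": [],
}

no_tolerations: List[Toleration] = []
no_schedule_only = [Toleration(operator="Exists", effect="NoSchedule")]  # assumed example
tolerate_all = [Toleration(operator="Exists")]

print(eligible_count(no_tolerations, nodes))    # infra0 is excluded
print(eligible_count(no_schedule_only, nodes))  # infra0 is included again

# Same node, but tainted with NoExecute instead of NoSchedule.
nodes_noexec = dict(nodes, infra0=[Taint("infra", "NoExecute", "true")])
print(eligible_count(no_schedule_only, nodes_noexec))
print(eligible_count(tolerate_all, nodes_noexec))
```

Under these rules, a toleration scoped to the NoSchedule effect restores the pod on a node tainted with NoSchedule but not on one tainted with NoExecute, whereas a bare `operator: Exists` toleration matches every taint regardless of key or effect.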
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581