Bug 1785115 - No global tolerations for NodeCA DaemonSet
Summary: No global tolerations for NodeCA DaemonSet
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Oleg Bulatov
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks: 1808431
 
Reported: 2019-12-19 07:49 UTC by rdomnu
Modified: 2023-09-07 21:18 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The nodeca daemonset didn't tolerate NoSchedule taints.
Consequence: Its pods were missing on such nodes.
Fix: Add a toleration.
Result: Tainted nodes received updates from the nodeca daemonset.
Clone Of:
Clones: 1808431
Environment:
Last Closed: 2020-05-04 11:20:43 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-image-registry-operator pull 421 | 0 | None | closed | Bug 1785115: tolerate all NoSchedule taints | 2020-10-14 20:31:20 UTC
Red Hat Product Errata | RHBA-2020:0581 | 0 | None | None | None | 2020-05-04 11:21:18 UTC

Internal Links: 1780318

Description rdomnu 2019-12-19 07:49:10 UTC
Description of problem:
No global tolerations are set for the NodeCA DaemonSet (node-ca) in openshift-image-registry. When taints are set on specific nodes, the DaemonSet no longer controls the pods on those nodes, and when it is redeployed, those pods are deleted.

Version-Release number of selected component (if applicable):
4.2.10


How reproducible:
Every time. 


Steps to Reproduce:
1. Check the NodeCA DaemonSet:
oc get ds
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-ca   5         5         5       5            5           kubernetes.io/os=linux   7d18h

2. Taint a node:
oc adm taint nodes infra0 infra='true':NoSchedule
3. Check the DaemonSet again. The desired count decreases by 1:
oc get ds
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-ca   4         4         4       4            4           kubernetes.io/os=linux   7d18h
4. On the next DaemonSet redeployment, the pod is no longer deployed on the tainted node.


Expected results:
After tainting the nodes, the number of pods that the NodeCA DaemonSet controls should stay the same.
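
The drop in the desired count follows from how taints and tolerations are matched. The sketch below is a simplified Python model of that matching, not the actual scheduler or DaemonSet controller code. The node names `node1` to `node4`, the helper names, and the example toleration are assumptions made for illustration; only `infra0` and its taint come from the steps above.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint (simplified model)."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    # An empty key matches every taint key (used together with Exists).
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def desired_count(nodes, tolerations):
    """Count nodes whose blocking taints are all tolerated."""
    count = 0
    for taints in nodes.values():
        # PreferNoSchedule is only a preference, so it does not block.
        blocking = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
        if all(any(tolerates(tol, t) for tol in tolerations) for t in blocking):
            count += 1
    return count


# Hypothetical five-node cluster; only infra0 is tainted, as in step 2.
nodes = {
    "node1": [],
    "node2": [],
    "node3": [],
    "node4": [],
    "infra0": [{"key": "infra", "value": "true", "effect": "NoSchedule"}],
}

without_toleration = desired_count(nodes, [])
with_toleration = desired_count(nodes, [{"operator": "Exists", "effect": "NoSchedule"}])
print(without_toleration, with_toleration)
```

In this model, a DaemonSet with no matching toleration skips `infra0`, while one that tolerates every NoSchedule taint covers all five nodes.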


Additional info:

Comment 2 Wenjing Zheng 2020-02-03 08:19:31 UTC
After tainting the nodes, the number of pods that the NodeCA DaemonSet controls stays the same with 4.4.0-0.nightly-2020-02-03-005212:
spec:
  providerID: aws:///us-east-2b/i-097409042f4872c6c
  taints:
  - effect: NoSchedule
    key: infra
    value: "true"

Comment 3 Chet Hosey 2020-02-10 18:35:44 UTC
Any chance of backporting to 4.2/4.3, or alternative workarounds?

The documented process for setting up dedicated OCS nodes [1] has the user taint the storage nodes, which will break nodeCA.

[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.2/html-single/deploying_openshift_container_storage/index#creating-an-openshift-container-storage-service_rhocs

Comment 4 Adam Kaplan 2020-02-11 13:47:21 UTC
*** Bug 1801474 has been marked as a duplicate of this bug. ***

Comment 6 Adam Kaplan 2020-02-28 14:11:48 UTC
@Chet - backport PRs are now up [1][2]; separate BZs will track the release process for 4.3.z and 4.2.z.

It may take some time for the 4.2.z fix to go out since it must be released in 4.3.z first. Currently (02/28/2020) there is a large backlog of 4.3.z fixes, and we are gating patches based on our QE team's capacity.

[1] https://github.com/openshift/cluster-image-registry-operator/pull/472
[2] https://github.com/openshift/cluster-image-registry-operator/pull/473

Comment 7 David Dreeggors 2020-04-02 14:30:29 UTC
I think these PRs are not correct for NoExecute. Should this be:
tolerations:
      - operator: Exists

As seen in this PR?

https://github.com/openshift/cluster-image-registry-operator/pull/457
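
To illustrate the difference raised here, the following Python sketch compares a toleration limited to the NoSchedule effect with the bare `operator: Exists` toleration quoted above. The exact toleration added by the backport PRs is an assumption inferred from the fix title "tolerate all NoSchedule taints", and the `matches` helper is a simplified model of the matching rules, not code from the operator.

```python
def matches(toleration, taint):
    """Simplified taint/toleration match; empty key or effect acts as a wildcard."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


# Assumed shape of the NoSchedule-only fix (not copied from the PRs).
noschedule_only = {"operator": "Exists", "effect": "NoSchedule"}
# The toleration proposed in this comment: no key, no effect.
tolerate_everything = {"operator": "Exists"}

no_schedule_taint = {"key": "infra", "value": "true", "effect": "NoSchedule"}
no_execute_taint = {"key": "infra", "value": "true", "effect": "NoExecute"}

for name, tol in [("noschedule_only", noschedule_only),
                  ("tolerate_everything", tolerate_everything)]:
    print(name,
          matches(tol, no_schedule_taint),
          matches(tol, no_execute_taint))
```

Under this model the NoSchedule-only toleration does not match a NoExecute taint, so pods on such a node could still be evicted, whereas the bare `operator: Exists` toleration matches taints of any key and effect.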

Comment 9 errata-xmlrpc 2020-05-04 11:20:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

