Bug 1785115

Summary: No global tolerations for NodeCA DaemonSet
Product: OpenShift Container Platform
Reporter: rdomnu
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: Wenjing Zheng <wzheng>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.2.z
CC: adam.kaplan, adeshpan, andcosta, aos-bugs, ChetRHosey, ddreggor, hcisneir, mharri, susuresh, wewang
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The nodeca daemonset didn't tolerate NoSchedule taints.
Consequence: Its pods were missing on such nodes.
Fix: Add a toleration.
Result: Tainted nodes receive updates from the nodeca daemonset.
Story Points: ---
Clone Of:
Clones: 1808431
Environment:
Last Closed: 2020-05-04 11:20:43 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1808431    

Description rdomnu 2019-12-19 07:49:10 UTC
Description of problem:
No global tolerations are set for the NodeCA DaemonSet in openshift-image-registry. When taints are set on specific nodes, the DaemonSet no longer controls the pods on those nodes, and when it is redeployed, those pods are deleted.

Version-Release number of selected component (if applicable):
4.2.10


How reproducible:
Every time. 


Steps to Reproduce:
1. Check the NodeCA DaemonSet:
oc get ds
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-ca   5         5         5       5            5           kubernetes.io/os=linux   7d18h

2. Taint a node: oc adm taint nodes infra0 infra='true':NoSchedule
3. Check the DaemonSet again. The desired count decreases by 1:
oc get ds
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
node-ca   4         4         4       4            4           kubernetes.io/os=linux   7d18h
4. On the next DaemonSet redeployment, the pod is no longer deployed on the tainted node.


Expected results:
After tainting the nodes, the number of pods the NodeCA DaemonSet controls should stay the same.


Additional info:

Comment 2 Wenjing Zheng 2020-02-03 08:19:31 UTC
After tainting the nodes, the number of pods the NodeCA DaemonSet controls stays the same with 4.4.0-0.nightly-2020-02-03-005212:
spec:
  providerID: aws:///us-east-2b/i-097409042f4872c6c
  taints:
  - effect: NoSchedule
    key: infra
    value: "true"

Comment 3 Chet Hosey 2020-02-10 18:35:44 UTC
Any chance of backporting to 4.2/4.3, or alternative workarounds?

The documented process for setting up dedicated OCS nodes [1] has the user taint the storage nodes, which will break nodeCA.

[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.2/html-single/deploying_openshift_container_storage/index#creating-an-openshift-container-storage-service_rhocs

Comment 4 Adam Kaplan 2020-02-11 13:47:21 UTC
*** Bug 1801474 has been marked as a duplicate of this bug. ***

Comment 6 Adam Kaplan 2020-02-28 14:11:48 UTC
@Chet - backport PRs are now up [1][2]. There will be separate BZs to track the release process for 4.3.z and 4.2.z.

It may take some time for the 4.2.z fix to go out since it must be released in 4.3.z first. Currently (02/28/2020) there is a large backlog of 4.3.z fixes, and we are gating patches based on our QE team's capacity.

[1] https://github.com/openshift/cluster-image-registry-operator/pull/472
[2] https://github.com/openshift/cluster-image-registry-operator/pull/473

Comment 7 David Dreeggors 2020-04-02 14:30:29 UTC
I don't think these PRs are correct for NoExecute. Should this be:
tolerations:
      - operator: Exists

as seen in this PR?

https://github.com/openshift/cluster-image-registry-operator/pull/457
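
For context on the question above, here is a minimal Python sketch of how Kubernetes matches tolerations against taints. It is an illustration only, not code from the operator or from any of the linked PRs. The `no_schedule_only` toleration is an assumption about what a NoSchedule-only fix might look like, and the `example.com/maintenance` NoExecute taint is hypothetical; only the `infra=true:NoSchedule` taint comes from this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator Exists matches every key
    operator: str = "Equal"  # Kubernetes treats an unset operator as Equal
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration covers a single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator == "Equal":
        return tol.value == taint.value
    return False


def can_run_on(tolerations, taints) -> bool:
    """A pod may run on a node only if every hard taint is tolerated.

    PreferNoSchedule is a soft preference, so it never blocks placement.
    """
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Taint from the reproduction steps in this report.
infra = Taint("infra", "true", "NoSchedule")
# Hypothetical NoExecute taint, for illustration only.
maintenance = Taint("example.com/maintenance", "true", "NoExecute")

# Assumed shape of a toleration limited to NoSchedule taints.
no_schedule_only = [Toleration(operator="Exists", effect="NoSchedule")]
# Shape suggested in the comment above: tolerate every taint.
tolerate_all = [Toleration(operator="Exists")]

print(can_run_on([], [infra]))
print(can_run_on(no_schedule_only, [infra]))
print(can_run_on(no_schedule_only, [infra, maintenance]))
print(can_run_on(tolerate_all, [infra, maintenance]))
```

Under these assumptions, a toleration restricted to NoSchedule covers the taint from the reproduction steps but not a NoExecute taint, while a bare `operator: Exists` toleration covers both.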

Comment 9 errata-xmlrpc 2020-05-04 11:20:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581