Bug 1801474

Summary: node-ca daemonset toleration conflicts with clusterlogging CR
Product: OpenShift Container Platform
Reporter: Hugo Cisneiros (Eitch) <hcisneir>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA
QA Contact: Wenjing Zheng <wzheng>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.2.z
CC: adam.kaplan, aos-bugs, ChetRHosey, ddreggor, wewang
Target Milestone: ---
Keywords: Reopened
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The node-ca daemon did not tolerate the NoExecute taint, but the ClusterLogging documentation recommends using NoExecute.
Consequence: The node-ca daemon does not manage certificates on such nodes.
Fix: Tolerate all taints.
Result: additionalTrustedCA certificates are synced to all nodes, whatever their taints.
Story Points: ---
Clone Of:
: 1820242 (view as bug list)
Environment:
Last Closed: 2020-05-04 11:35:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1820242    

Description Hugo Cisneiros (Eitch) 2020-02-10 23:11:24 UTC
Description of problem:

When following the documentation for deploying ClusterLogging and tainting nodes so that they run only Logging components, the image registry 'node-ca' daemonset lacks a matching toleration, so the tainted nodes do not run 'node-ca' pods.

node-ca daemonset has this toleration:

     tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists

To run on all nodes, regardless of their taints, this could be:

     tolerations:
      - operator: Exists
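
To illustrate why the current toleration misses the logging taint while a key-less `operator: Exists` covers it, here is a minimal Python sketch of the toleration matching rules as I understand them from the Kubernetes documentation. The `Taint` and `Toleration` classes and the `tolerates` function are hypothetical names for this example only, not code from the registry operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key (with Exists) matches every taint key
    operator: str = "Equal"  # "Equal" is the default operator
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (illustrative sketch)."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator in ("", "Equal"):
        return tol.value == taint.value
    return False


# Taint applied by: oc adm taint nodes <node> logging=true:NoExecute
logging_taint = Taint(key="logging", value="true", effect="NoExecute")

# Toleration currently present on the node-ca daemonset
original = Toleration(key="node-role.kubernetes.io/master",
                      operator="Exists", effect="NoSchedule")

# Proposed toleration: no key, no effect, operator Exists
wildcard = Toleration(operator="Exists")

print(tolerates(original, logging_taint))  # False
print(tolerates(wildcard, logging_taint))  # True
```

With the original toleration both the key and the effect differ from the logging taint, so the node-ca pod is evicted; the key-less form matches any key and any effect.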

Version-Release number of selected component (if applicable):

4.2.16

How reproducible:

1. Deploy a ClusterLogging instance, customized to use tolerations and taints:

https://docs.openshift.com/container-platform/4.2/logging/config/cluster-logging-tolerations.html

Toleration customization:

      tolerations:
      - effect: NoExecute
        key: logging
        operator: Exists


2. Taint nodes with:

$ oc adm taint nodes <node1|node2|node3> logging=true:NoExecute

Actual results:

After the taint, 'node-ca' pods were deleted from tainted nodes:

$ oc get events -n openshift-image-registry
[...]
53m         Normal    SuccessfulDelete            daemonset/node-ca                                       Deleted pod: node-ca-pjxfn
53m         Normal    SuccessfulDelete            daemonset/node-ca                                       Deleted pod: node-ca-2kclr
53m         Normal    SuccessfulDelete            daemonset/node-ca                                       Deleted pod: node-ca-c5nn9

Expected results:

Pods are not deleted.

Additional info:

Comment 1 Adam Kaplan 2020-02-11 13:47:21 UTC
Duplicate of Bug 1785115 - this will be fixed in v4.4.0.

*** This bug has been marked as a duplicate of bug 1785115 ***

Comment 2 Oleg Bulatov 2020-02-11 16:59:39 UTC
Bug 1785115 was about NoSchedule, but this one is about NoExecute. I agree we need to tolerate all effects.

Comment 4 Wenjing Zheng 2020-02-18 10:08:27 UTC
The following toleration has been added to node-ca in 4.4.0-0.nightly-2020-02-17-192940:
     tolerations:
      - operator: Exists

Comment 5 Chet Hosey 2020-02-27 15:00:05 UTC
Any chance of backporting to 4.2/4.3, or alternative workarounds?

The documented process for setting up dedicated OCS nodes [1] has the user taint the storage nodes, which will break nodeCA.

[1] https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.2/html-single/deploying_openshift_container_storage/index#creating-an-openshift-container-storage-service_rhocs

Comment 6 David Dreeggors 2020-04-02 14:25:22 UTC
From PR 457, it looks like what you actually have is:

tolerations:
      - effect: NoSchedule
        operator: Exists

not this:

tolerations:
      - operator: Exists


Line 38 (- effect: NoSchedule) is not actually removed, correct? That would not allow for NoExecute taints.
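
For reference, a small Python sketch of the difference between the two snippets above, assuming the usual Kubernetes rule that an empty effect on a toleration matches every taint effect. The function `tolerated_effects` and the variable names are hypothetical, for illustration only.

```python
# The three taint effects defined by Kubernetes
EFFECTS = ("NoSchedule", "PreferNoSchedule", "NoExecute")


def tolerated_effects(toleration: dict) -> list:
    """List the taint effects covered by a key-less `operator: Exists` toleration."""
    if toleration.get("operator") != "Exists" or toleration.get("key"):
        raise ValueError("this sketch only covers key-less Exists tolerations")
    wanted = toleration.get("effect", "")
    # An empty effect matches all effects; otherwise only the named one.
    return [e for e in EFFECTS if not wanted or e == wanted]


no_schedule_only = {"effect": "NoSchedule", "operator": "Exists"}
wildcard = {"operator": "Exists"}

print(tolerated_effects(no_schedule_only))  # ['NoSchedule']
print(tolerated_effects(wildcard))          # ['NoSchedule', 'PreferNoSchedule', 'NoExecute']
```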

Comment 7 David Dreeggors 2020-04-02 14:28:27 UTC
Sorry, that was PR 421 I was looking at, from a linked BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=1785115

Comment 9 errata-xmlrpc 2020-05-04 11:35:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581