Bug 1813479 - openshift-dns daemonset doesn't include toleration to run on nodes with taints
Summary: openshift-dns daemonset doesn't include toleration to run on nodes with taints
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: DNS
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: Arvind iyengar
URL:
Whiteboard:
Depends On:
Blocks: 1847197
Reported: 2020-03-13 23:19 UTC by Vagner Farias
Modified: 2020-08-26 14:27 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The DNS operator was changed in OpenShift 4.2.13 to remove a blanket toleration for all taints from the operator's operand. This change was made to prevent the operand from being scheduled to a node before the node's networking was ready.
Consequence: Adding arbitrary taints to nodes could cause problems related to the DNS operator's operand. For one, adding a NoSchedule taint to nodes could lead to alerts being raised for operand pods that were already running on the newly tainted nodes. For another, taints could prevent the operand from running on a node. The operand needs to run on every node in order to add the cluster image registry's host name and address to the node host's /etc/hosts file. Without this entry in /etc/hosts, the node's container runtime could fail to pull images from the image registry, breaking upgrades and user workloads.
Fix: The toleration for all taints has been restored for the DNS operator's operand. The operand also has a node selector to ensure that it runs only on Linux nodes.
Result: The operand runs on all Linux node hosts and updates /etc/hosts on each of them. "Missing CNI default network" events may be observed when the operand starts on a node that is still initializing, but such errors are transient and can be ignored.
Clone Of:
Clones: 1847197
Environment:
Last Closed: 2020-07-13 17:20:05 UTC
Target Upstream Version:



Links
System | ID | Priority | Status | Summary | Last Updated
Github | openshift cluster-dns-operator pull 171 | None | closed | Bug 1813479: Tolerate all taints | 2020-09-18 21:21:57 UTC
Red Hat Product Errata | RHBA-2020:2409 | None | None | None | 2020-07-13 17:20:31 UTC

Internal Links: 1822211

Description Vagner Farias 2020-03-13 23:19:54 UTC
Description of problem:
The openshift-dns daemonset doesn't include a toleration to run on nodes with taints. After a NoSchedule taint is configured on a node, the daemonset stops managing the pods on that node and two things happen:

- alerts are shown in the OCP dashboard: Pods of DaemonSet openshift-dns/dns-default are running where they are not supposed to run.

- if the pods on tainted nodes are deleted, they are not recreated.

Version-Release number of selected component (if applicable):
OCP 4.2.20

How reproducible:
Whenever taints are applied to nodes.

Steps to Reproduce:
1. "oc -n openshift-dns get ds" to check the desired number of nodes for the ds.
2. Apply a NoSchedule taint to a node.
3. "oc -n openshift-dns get ds" to check that the desired count has dropped by one node.
4. Observe alerts on the OCP dashboard.
5. "oc -n openshift-dns get pods -o wide" to verify that pods are still running on the tainted node.


Actual results:
openshift-dns pods stop being managed by the daemonset on nodes with a taint.


Expected results:
openshift-dns pods should continue to be managed by the daemonset and run on every node.

Additional info:

This change[1] might be related to the issue.

[1] https://github.com/openshift/cluster-dns-operator/commit/6be3d017118b89203f00b9a915ffdfdb9975f145

Comment 1 Wolfgang Kulhanek 2020-03-16 21:13:10 UTC
How do we work around that? For Machine Config Daemon DS and Node CA DS I can patch the DS definition to add the required tolerations. But the DNS operator seems to immediately "fix" the tolerations if I try that.

Comment 2 Ryan Howe 2020-03-17 16:35:25 UTC
https://github.com/openshift/cluster-dns-operator/commit/6be3d017118b89203f00b9a915ffdfdb9975f145

It looks like in 4.3 any taint on a worker node will prevent the DNS pods from running on that worker node.

  https://github.com/openshift/cluster-dns-operator/blob/release-4.3/assets/dns/daemonset.yaml#L151-L152

Also, if a taint that does not have the key "node-role.kubernetes.io/master" is added to a master, DNS pods will not run on that node.

I created a documentation bug asking for our docs to state the requirements for taints on masters. 


In my 4.2 cluster the dns has the following tolerations: 

  tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists

If I set any NoSchedule taint on a node, DNS will not schedule to that node. If I set a NoSchedule taint on a master and the taint's key is not "node-role.kubernetes.io/master", then DNS will not schedule to that master.
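
The behavior described above follows from the standard Kubernetes taint/toleration matching rule. Here is a minimal Python sketch of that rule, applied to the 4.2 tolerations listed above. The class names, the helper names, and the "example.com/dedicated" taint are hypothetical and chosen only for illustration; this is not the scheduler's actual code.

```python
from typing import NamedTuple


class Taint(NamedTuple):
    key: str
    value: str
    effect: str


class Toleration(NamedTuple):
    key: str = ""
    operator: str = "Equal"
    value: str = ""
    effect: str = ""


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if one toleration matches one taint."""
    # An empty effect on the toleration matches every effect.
    if tol.effect and tol.effect != taint.effect:
        return False
    # An empty key (only meaningful with Exists) matches every key.
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator == "Equal":
        return tol.value == taint.value
    return False


def is_tolerated(taint: Taint, tolerations: list) -> bool:
    """A taint is tolerated if at least one toleration matches it."""
    return any(tolerates(tol, taint) for tol in tolerations)


# Tolerations copied from the 4.2 listing above.
TOLERATIONS_4_2 = [
    Toleration(key="node-role.kubernetes.io/master", operator="Exists"),
    Toleration(key="node.kubernetes.io/not-ready", operator="Exists", effect="NoExecute"),
    Toleration(key="node.kubernetes.io/unreachable", operator="Exists", effect="NoExecute"),
    Toleration(key="node.kubernetes.io/disk-pressure", operator="Exists", effect="NoSchedule"),
    Toleration(key="node.kubernetes.io/pid-pressure", operator="Exists", effect="NoSchedule"),
    Toleration(key="node.kubernetes.io/unschedulable", operator="Exists", effect="NoSchedule"),
    Toleration(key="node.kubernetes.io/memory-pressure", operator="Exists", effect="NoSchedule"),
]

# Hypothetical taints used only for illustration.
master_taint = Taint("node-role.kubernetes.io/master", "", "NoSchedule")
custom_taint = Taint("example.com/dedicated", "storage", "NoSchedule")

print(is_tolerated(master_taint, TOLERATIONS_4_2))  # True
print(is_tolerated(custom_taint, TOLERATIONS_4_2))  # False
```

Because no toleration in the list matches the custom key, the DNS pod is not scheduled to a node carrying that taint, whether the node is a worker or a master.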


I created a doc bug about adding information on the requirements for NoSchedule taints set on the masters:

https://bugzilla.redhat.com/show_bug.cgi?id=1814325

Comment 3 Miciah Dashiel Butler Masters 2020-03-19 21:39:50 UTC
The entire purpose of a taint is to prevent pods from being scheduled to or executed on a node[1].  Per your report, a taint with the "NoSchedule" effect prevents DNS pods from being scheduled to the tainted node.  In that respect, the taint feature is behaving correctly.  Why do you expect pods to be scheduled to tainted nodes?  Why are you tainting nodes if you do not want the taint to have the usual effect?

It might make sense to change the alerting rule to suppress the reported alerts.

It is surprising that pods on tainted nodes "stop being managed".  Can you clarify what you mean by "stop being managed"?  Does deleting the DaemonSet or performing a rolling update fail to affect the pods?

Please note that a taint with effect "NoSchedule" does not evict pods, so pods that are already scheduled and running will continue running; if you want the pods to be evicted, add a taint with the "NoExecute" effect.
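
The difference between the two effects can be summarized in a small Python sketch. The function name and the returned strings are hypothetical, and the sketch deliberately ignores "PreferNoSchedule" and tolerationSeconds.

```python
def pod_outcome(effect: str, tolerated: bool, already_running: bool) -> str:
    """Describe what happens to a pod on a node that carries one taint."""
    if effect not in ("NoSchedule", "NoExecute"):
        raise ValueError(f"effect not covered by this sketch: {effect}")
    if tolerated:
        # A tolerated taint has no effect on the pod.
        return "keeps running" if already_running else "may be scheduled"
    if already_running:
        # Only NoExecute evicts pods that are already on the node.
        return "evicted" if effect == "NoExecute" else "keeps running"
    # Both effects keep new pods off the node.
    return "not scheduled"


print(pod_outcome("NoSchedule", tolerated=False, already_running=True))   # keeps running
print(pod_outcome("NoSchedule", tolerated=False, already_running=False))  # not scheduled
print(pod_outcome("NoExecute", tolerated=False, already_running=True))    # evicted
```

The first case corresponds to the report: the DNS pod that was already on the node keeps running after a NoSchedule taint is added, but a replacement pod would not be scheduled there.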

Please note also that DaemonSet pods automatically get the following tolerations:

node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unschedulable:NoSchedule

1. https://docs.openshift.com/container-platform/4.2/nodes/scheduling/nodes-scheduler-taints-tolerations.html

Comment 4 Vagner Farias 2020-03-23 22:11:48 UTC
Hi Miciah,

The purpose of taints is understood and isn't being questioned. What is being questioned is that, after a taint is configured on a worker node, DNS pods no longer run on that node. The expectation was that DNS pods should still run on nodes with taints because DNS is a required OCP component, similarly to what happens with the internal registry and the Machine Config Daemon.

As the openshift-dns daemonset is managed by an operator, it likely isn't possible to change its tolerations (I couldn't find a way). If it's acceptable to have DNS pods running only on master nodes, in the extreme situation that all worker nodes have taints, then perhaps this isn't a bug. But I'd still consider this a bit odd, as "by default" DNS pods are running everywhere, hence the expectation that they should run everywhere. I'd say this is the main issue to be clarified or discussed.

Answering your specific question, by "stop being managed" I meant that rogue dns pods are left behind. I didn't try to delete the daemonset or perform a rolling update, but I manually deleted the pods on the nodes with taints and the warnings disappeared. I see that this was only required because the taint was set with effect "NoSchedule", as this was the configuration specified by OpenShift Container Storage documentation.

Comment 5 Sushil 2020-04-07 06:58:41 UTC
DaemonSets that are part of the cluster should always have the tolerations, so that a customer adding a taint does not break things.

We have had many breakages just because we have daemonsets that do not have tolerations.
Another example: https://bugzilla.redhat.com/show_bug.cgi?id=1780318

Comment 10 Miciah Dashiel Butler Masters 2020-05-08 20:08:20 UTC
Derek, can you respond to the following?

https://github.com/openshift/cluster-dns-operator/pull/171#issuecomment-622110777

> @derekwaynecarr did we get agreement to allow tolerating all taints
> 
> Is there an enhancement / update that clarifies official stance of project on this?

My takeaway from an earlier E-mail discussion was that we could go ahead and add the toleration to fix this issue, and someone (who?) in sig-node or sig-network could follow up with an enhancement proposal to add a softAdmitHandler or backpressure to the kubelet to prevent scheduling a pod before networking is ready.  Can we merge #171 while that discussion happens separately?

Comment 12 Miciah Dashiel Butler Masters 2020-05-21 15:52:16 UTC
The fix is waiting on https://github.com/openshift/enhancements/pull/335#issuecomment-632165965.

Comment 13 Andrew McDermott 2020-05-27 16:17:33 UTC
The PR is in flight and once it merges we will consider how far to backport.

Moving to 4.6 as not a release blocker or an upgrade blocker for 4.5.

Comment 14 Andrew McDermott 2020-06-04 16:22:06 UTC
https://github.com/openshift/cluster-dns-operator/pull/171 actually merged.

Should we move this BZ back to 4.5?

@Eric - looking for guidance here.

Comment 15 Eric Paris 2020-06-08 14:39:52 UTC
Given I see numerous customer cases I think backporting is a good idea, yes.

Comment 16 Miciah Dashiel Butler Masters 2020-06-08 15:10:42 UTC
Because this fix is already merged in 4.5 and comment 15 advises that it should be backported, I am moving the target release back to 4.5.0, and we can discuss how far to backport the fix.

Vagner, this is fixed in 4.5; are you requesting a backport to earlier releases?  If so, what is the earliest release where a backport is requested?

Comment 17 Remington Santos 2020-06-08 17:25:03 UTC
Hi Miciah, responding in place of Vagner: our customer is on release 4.2 and has no plans to upgrade from this version.

Comment 18 Vagner Farias 2020-06-08 17:35:36 UTC
Clearing needinfo per Remington's comment.

Comment 23 Arvind iyengar 2020-06-10 10:41:04 UTC
As of this writing, the fix has been tested and verified in the "4.5.0-0.nightly-2020-06-09-223121" release. The toleration for all taints now works properly.

Some excerpts from the test: 

* Version: 
----
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-06-09-223121   True        False         5h32m   Cluster version is 4.5.0-0.nightly-2020-06-09-223121
----

* The daemonset has no additional "node-role.kubernetes.io/master" keys:
----
tolerations:
- operator: Exists
----

* Adding a "NoSchedule" taint to a worker node no longer has any effect on the daemonset, unlike in the previous release:
-----
$ oc -n openshift-dns get ds 
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   6         6         6       6            6           kubernetes.io/os=linux   5h23m

$ oc get nodes                                         
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-131-0.us-east-2.compute.internal     Ready    worker   4h55m   v1.18.3+a637491 <--
ip-10-0-144-108.us-east-2.compute.internal   Ready    master   5h7m    v1.18.3+a637491
ip-10-0-172-188.us-east-2.compute.internal   Ready    worker   4h55m   v1.18.3+a637491
ip-10-0-183-160.us-east-2.compute.internal   Ready    master   5h7m    v1.18.3+a637491
ip-10-0-203-65.us-east-2.compute.internal    Ready    worker   4h56m   v1.18.3+a637491
ip-10-0-211-202.us-east-2.compute.internal   Ready    master   5h8m    v1.18.3+a637491

$ oc adm taint node ip-10-0-131-0.us-east-2.compute.internal dns-taint=test:NoSchedule    
node/ip-10-0-131-0.us-east-2.compute.internal tainted

$ oc describe node ip-10-0-131-0.us-east-2.compute.internal
CreationTimestamp:  Wed, 10 Jun 2020 10:04:47 +0530
Taints:             dns-taint=test:NoSchedule
Unschedulable:      false

$ oc -n openshift-dns get ds
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   6         6         6       6            6           kubernetes.io/os=linux   5h23m

$ oc -n openshift-dns describe  ds  dns-default 
Name:           dns-default
Selector:       dns.operator.openshift.io/daemonset-dns=default
Node-Selector:  kubernetes.io/os=linux
Labels:         dns.operator.openshift.io/owning-dns=default
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 6
Current Number of Nodes Scheduled: 6
Number of Nodes Scheduled with Up-to-date Pods: 6
Number of Nodes Scheduled with Available Pods: 6
Number of Nodes Misscheduled: 0
-----


* For reference, the same test on a v4.4 release, where the DNS pod on the tainted worker node is reported as misscheduled:
-----
$ oc -n openshift-dns get ds 
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   6         6         6       6            6           kubernetes.io/os=linux   147m

$ oc adm taint node ip-10-0-149-87.us-east-2.compute.internal dns-taint=test:NoSchedule
node/ip-10-0-149-87.us-east-2.compute.internal tainted

$ oc -n openshift-dns describe  ds  dns-default 
Name:           dns-default
Selector:       dns.operator.openshift.io/daemonset-dns=default
Node-Selector:  kubernetes.io/os=linux
Labels:         dns.operator.openshift.io/owning-dns=default
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 5
Current Number of Nodes Scheduled: 5
Number of Nodes Scheduled with Up-to-date Pods: 5
Number of Nodes Scheduled with Available Pods: 5
Number of Nodes Misscheduled: 1

$ oc -n openshift-dns   get ds  dns-default         
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
dns-default   5         5         5       5            5           kubernetes.io/os=linux   160m
-----
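
The two sets of counts above can be reproduced with a small Python model. This is a sketch under stated assumptions, not the DaemonSet controller's code: the function and variable names are hypothetical, each master is assumed to carry the usual "node-role.kubernetes.io/master:NoSchedule" taint, the pre-fix tolerations are reduced to the master-key toleration shown in comment 2, and every node is assumed to already have a running DNS pod when the taint is added.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Kubernetes-style match of one toleration against one taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True
    return operator == "Equal" and toleration.get("value", "") == taint.get("value", "")


def eligible(node: dict, tolerations: list) -> bool:
    """A node is eligible if every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in node["taints"] if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


def daemonset_counts(nodes: list, tolerations: list) -> tuple:
    """Return (desired, misscheduled) for the given nodes and tolerations."""
    desired = sum(1 for n in nodes if eligible(n, tolerations))
    misscheduled = sum(1 for n in nodes if n["has_pod"] and not eligible(n, tolerations))
    return desired, misscheduled


MASTER_TAINT = {"key": "node-role.kubernetes.io/master", "value": "", "effect": "NoSchedule"}
TEST_TAINT = {"key": "dns-taint", "value": "test", "effect": "NoSchedule"}

# Three masters, one tainted worker, two untainted workers.
nodes = (
    [{"taints": [MASTER_TAINT], "has_pod": True} for _ in range(3)]
    + [{"taints": [TEST_TAINT], "has_pod": True}]
    + [{"taints": [], "has_pod": True} for _ in range(2)]
)

PRE_FIX = [{"key": "node-role.kubernetes.io/master", "operator": "Exists"}]
BLANKET = [{"operator": "Exists"}]  # the toleration shown in the 4.5 daemonset

print(daemonset_counts(nodes, PRE_FIX))  # (5, 1)
print(daemonset_counts(nodes, BLANKET))  # (6, 0)
```

With the pre-fix tolerations the tainted worker drops out of the desired count and its existing pod is counted as misscheduled; with the blanket toleration all six nodes remain desired.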

Comment 25 errata-xmlrpc 2020-07-13 17:20:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 26 Daniel Del Ciancio 2020-07-20 16:27:52 UTC
Hello,
I realize that this bug has been fixed in 4.5 and it blocks another bug https://bugzilla.redhat.com/show_bug.cgi?id=1847197 which tracks it being fixed in 4.4.  However, it does not seem to block a 4.3 bug. 

My customer has asked if there is a plan to backport this to 4.3?

Thanks!

Comment 29 John McMeeking 2020-08-09 01:10:26 UTC
I'm working on a customer problem for Red Hat OpenShift on IBM Cloud v4.3.  I'll second the question regarding plans to backport to 4.3.

I will also inform them this bug has been fixed in 4.4.

Comment 30 Miciah Dashiel Butler Masters 2020-08-10 04:03:32 UTC
The 4.3 backport is covered by bug 1859685.  For completeness, we have the following Bugzilla reports for this issue:

• For 4.5.0, Bug 1813479 (this report).
• For 4.4.z, Bug 1847197.
• For 4.3.z, Bug 1859685.

The way the backport process works is that the backport to version 4.y depends on the fix for version 4.(y+1), so Bug 1859685 (the 4.3.z backport) depends on Bug 1847197 (the 4.4.z backport), which depends on this report.
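
The dependency chain described above can be written out as a short Python sketch. The bug numbers and target releases are taken from this comment; the dictionary and function names are hypothetical.

```python
# Each backport bug depends on the bug for the next-higher release.
DEPENDS_ON = {
    "1859685": "1847197",  # 4.3.z depends on 4.4.z
    "1847197": "1813479",  # 4.4.z depends on 4.5.0
}

TARGET_RELEASE = {
    "1813479": "4.5.0",
    "1847197": "4.4.z",
    "1859685": "4.3.z",
}


def fix_order(bug: str) -> list:
    """Return the bugs that must be fixed, in order, ending with `bug`."""
    chain = [bug]
    while chain[-1] in DEPENDS_ON:
        chain.append(DEPENDS_ON[chain[-1]])
    return list(reversed(chain))


print(fix_order("1859685"))
# ['1813479', '1847197', '1859685']
print([TARGET_RELEASE[b] for b in fix_order("1859685")])
# ['4.5.0', '4.4.z', '4.3.z']
```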

