Bug 1847197 - openshift-dns daemonset doesn't include toleration to run on nodes with taints
Summary: openshift-dns daemonset doesn't include toleration to run on nodes with taints
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: DNS
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.4.z
Assignee: Miciah Dashiel Butler Masters
QA Contact: Hongan Li
URL:
Whiteboard:
Duplicates: 1850464
Depends On: 1813479
Blocks: 1723620 1859685
Reported: 2020-06-15 21:18 UTC by Miciah Dashiel Butler Masters
Modified: 2020-11-21 03:17 UTC
CC List: 23 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The DNS operator was changed in OpenShift 4.2.13 to remove a blanket toleration for all taints from the operator's operand. This change was made to prevent the operand from being scheduled to a node before the node's networking was ready.

Consequence: Adding arbitrary taints to nodes could cause problems related to the DNS operator's operand. First, adding a NoSchedule taint to nodes could lead to alerts being raised for operand pods that were already running on the newly tainted nodes. Second, taints could prevent the operand from running on a node. The operand needs to run on every node in order to add the cluster image registry's host name and address to the node host's /etc/hosts file. Without this entry in /etc/hosts, the node's container runtime could fail to pull images from the image registry, breaking upgrades and user workloads.

Fix: The toleration for all taints has been restored for the DNS operator's operand. The operand also has a node selector to ensure that it runs only on Linux nodes.

Result: The operand runs on all Linux node hosts and updates /etc/hosts on each of them. "Missing CNI default network" events may be observed when the operand starts on a node that is still initializing, but such errors are transient and can be ignored.
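
The scheduling effect of the fix can be illustrated with a small model of Kubernetes toleration matching. This is only a sketch, not the operator's or the scheduler's code: the class and function names are hypothetical, the "dedicated" taint is an invented example, and the "kubernetes.io/os" label key is an assumption about which node label the operand's selector uses. It relies on the documented rule that a toleration with operator "Exists" and an empty key tolerates every taint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str      # "NoSchedule", "PreferNoSchedule", or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    operator: str = "Equal"  # "Equal" or "Exists"
    key: str = ""            # empty key with "Exists" matches every key
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol, taint):
    """Return True if a single toleration matches a single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def can_schedule(node_labels, node_taints, node_selector, tolerations):
    """Simplified check: selector must match and every hard taint must be tolerated."""
    selected = all(node_labels.get(k) == v for k, v in node_selector.items())
    hard = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    tolerated = all(any(tolerates(tol, t) for tol in tolerations) for t in hard)
    return selected and tolerated


# Hypothetical constants standing in for the operand's pod spec after the fix.
TOLERATE_ALL = [Toleration(operator="Exists")]
LINUX_ONLY = {"kubernetes.io/os": "linux"}  # assumed label key

if __name__ == "__main__":
    taint = Taint("dedicated", "NoSchedule", "infra")  # invented example taint
    linux = {"kubernetes.io/os": "linux"}
    print(can_schedule(linux, [taint], LINUX_ONLY, []))            # no tolerations
    print(can_schedule(linux, [taint], LINUX_ONLY, TOLERATE_ALL))  # tolerate all
```

Under this model, a tainted Linux node is rejected when the pod has no tolerations and accepted once the blanket toleration is present, while a non-Linux node is rejected by the selector regardless of tolerations.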
Clone Of: 1813479
Environment:
Last Closed: 2020-08-04 14:16:01 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-dns-operator pull 179 | 0 | None | closed | [release-4.4] Bug 1847197: Tolerate all taints | 2021-02-15 04:15:12 UTC
Red Hat Product Errata | RHBA-2020:3128 | 0 | None | None | None | 2020-08-04 14:16:34 UTC

Description Miciah Dashiel Butler Masters 2020-06-15 21:18:56 UTC
+++ This bug was initially created as a clone of Bug #1813479 +++

Description of problem:
The openshift-dns daemonset does not include a toleration that allows it to run on nodes with taints. After a NoSchedule taint is applied to a node, the daemonset stops managing the pods on that node, and two things happen:

- Alerts are shown in the OCP dashboard: Pods of DaemonSet openshift-dns/dns-default are running where they are not supposed to run.

- If the pods on a tainted node are deleted, they are not recreated.
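
The two symptoms above can be reproduced in a simplified model of how a daemonset's status counters react to a NoSchedule taint. This is a sketch under stated assumptions, not the DaemonSet controller's logic: the function name, node names, and input layout are hypothetical, and only NoSchedule taints are considered. The returned keys mirror the real DaemonSet status fields desiredNumberScheduled and numberMisscheduled.

```python
def daemonset_status(nodes, tolerated_keys=frozenset(), tolerate_all=False):
    """Compute simplified daemonset counters.

    nodes maps a node name to a dict with:
      "noschedule_taints": list of taint keys with effect NoSchedule
      "has_pod": whether a daemonset pod is currently running there
    """
    desired = 0
    misscheduled = 0
    for node in nodes.values():
        eligible = tolerate_all or all(
            key in tolerated_keys for key in node["noschedule_taints"]
        )
        if eligible:
            desired += 1
        elif node["has_pod"]:
            # NoSchedule does not evict a running pod, so it stays on the
            # node but is no longer wanted there by the daemonset.
            misscheduled += 1
    return {"desiredNumberScheduled": desired, "numberMisscheduled": misscheduled}


if __name__ == "__main__":
    # Hypothetical two-node cluster; worker-1 was tainted after its pod started.
    nodes = {
        "worker-0": {"noschedule_taints": [], "has_pod": True},
        "worker-1": {"noschedule_taints": ["dedicated"], "has_pod": True},
    }
    print(daemonset_status(nodes))                     # without a toleration
    print(daemonset_status(nodes, tolerate_all=True))  # with a blanket toleration
```

Without a matching toleration, the model reports one desired node and one misscheduled pod, which corresponds to the dashboard alert; if that pod is then deleted, nothing recreates it because the node no longer counts toward the desired total. With a blanket toleration, both nodes are desired and no pod is misscheduled.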

Version-Release number of selected component (if applicable):
OCP 4.2.20

How reproducible:
Whenever taints are applied to nodes.

Steps to Reproduce:
1. Run "oc -n openshift-dns get ds" to check the desired node count for the daemonset.
2. Apply a NoSchedule taint to a node.
3. Run "oc -n openshift-dns get ds" again and check that the desired count has dropped by one.
4. Observe the alerts on the OCP dashboard.
5. Run "oc -n openshift-dns get pods -o wide" to verify that pods are still running on the tainted node.


Actual results:
The daemonset stops managing openshift-dns pods on nodes with a taint.


Expected results:
openshift-dns pods should continue to be managed by the daemonset and run on every node.

Additional info:

This change[1] might be related to the issue.

[1] https://github.com/openshift/cluster-dns-operator/commit/6be3d017118b89203f00b9a915ffdfdb9975f145

Comment 1 Miciah Dashiel Butler Masters 2020-06-18 19:21:35 UTC
A PR is posted and awaiting review.  We'll try to get it merged next sprint.

Comment 2 Andrew McDermott 2020-06-25 16:12:59 UTC
*** Bug 1850464 has been marked as a duplicate of this bug. ***

Comment 3 Miciah Dashiel Butler Masters 2020-07-09 05:11:32 UTC
The fix to the master branch has merged.  We'll work on the 4.4 backport in the upcoming sprint.

Comment 6 Hongan Li 2020-07-27 02:40:08 UTC
Verified with 4.4.0-0.nightly-2020-07-24-031753 and the issue has been fixed.

The DNS pod can now run on nodes with a taint.

Comment 8 errata-xmlrpc 2020-08-04 14:16:01 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.4.15 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3128

