Bug 1805177 - Connectivity issues through vxlan affect local DNS resolving
Summary: Connectivity issues through vxlan affect local DNS resolving
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Dan Mace
QA Contact: Hongan Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-20 12:39 UTC by Pablo Alonso Rodriguez
Modified: 2023-09-07 21:58 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-11 13:04:22 UTC
Target Upstream Version:
Embargoed:



Description Pablo Alonso Rodriguez 2020-02-20 12:39:09 UTC
Description of problem:

In OCP4, pods get the cluster IP of the dns-default service in the openshift-dns namespace as the nameserver in resolv.conf, so any DNS request may be sent to any healthy dns pod on any node.
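
For illustration, a pod's /etc/resolv.conf in such a cluster typically looks like the sketch below (172.30.0.10 is the dns-default service cluster IP in a default install; the search domains, the <pod-namespace> placeholder, and the options line are typical defaults and may vary per cluster):

search <pod-namespace>.svc.cluster.local svc.cluster.local cluster.local
nameserver 172.30.0.10
options ndots:5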

The problem comes when a dns pod on a certain node is healthy, but a networking problem prevents node-to-node communication through vxlan. In such a case, if ${FAILED_NODES} nodes are unreachable, any DNS request has a ${FAILED_NODES}/${TOTAL_NODES} chance of failing (for example, with 2 unreachable nodes out of 6, roughly one in three DNS queries would fail).

If one or more nodes cannot communicate with the rest through vxlan, one could of course expect pod-to-pod communication involving those nodes to fail, but partial, cluster-wide unavailability of DNS is too severe a side effect for a failure on even a small subset of nodes.

Version-Release number of selected component (if applicable):

4.3

How reproducible:

Always, as long as there are vxlan communication issues between nodes

Steps to Reproduce:
1. Prevent communication through the vxlan port to a subset of nodes in order to isolate them, either with iptables/nftables or an external firewall (a sample rule is shown after these steps).
2. Make sure that the isolated nodes can still communicate with the master so they are not marked as NotReady, i.e. close only the vxlan port.
3. Try several DNS requests from a pod of any kind.
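
For step 1, a minimal sketch assuming the default openshift-sdn VXLAN port (4789/UDP), run on the nodes to be isolated; the port number is an assumption and may differ in a given environment:

# Drop inbound and outbound VXLAN traffic so the overlay is cut while
# node --> apiserver traffic stays intact.
iptables -I INPUT -p udp --dport 4789 -j DROP
iptables -I OUTPUT -p udp --dport 4789 -j DROP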


Actual results:

Partial unavailability of DNS even on nodes that have not been touched.

Expected results:

No partial unavailability.

Additional info:

A possible solution would be to make the local openshift-dns pod the first nameserver, followed by the rest.
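
For illustration only, the suggested ordering could look roughly like this in a pod's resolv.conf (the node-local address is purely hypothetical; 172.30.0.10 is the dns-default service cluster IP; today pods only get the single service IP):

# dns pod running on the local node (illustrative address)
nameserver 10.128.2.5
# dns-default service cluster IP as a fallback
nameserver 172.30.0.10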

Comment 1 Dan Mace 2020-02-20 15:52:39 UTC
You seem to have described a generalized silent networking failure which would impact all pod->pod traffic from the afflicted node, not just DNS. I'm not sure that's a bug with DNS, per se. Maybe it's more of a resiliency and/or performance improvement. In any case, defining, measuring, and improving cluster DNS resiliency in the face of specific failure modes does seem worth exploring, so I appreciate you bringing the concrete use case.

For now, I'd like to continue the conversation, and so I'm going to move the bug to 4.5 so that we can do that without setting an expectation of any action in the 4.4 release, which is imminent.

Some other related thoughts:

  * We're actively exploring NodeLocal DNS[1], which may help in this scenario
  * In k8s 1.17+ there's a new topologyKeys Service API which may be relevant for implementations[2] (a sketch is shown after the references below)

[1] https://issues.redhat.com/browse/NE-270
[2] https://kubernetes.io/docs/concepts/services-networking/service-topology/#prefer-node-local-endpoints
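
For reference, a minimal sketch of what the alpha topologyKeys field from [2] could look like on the dns-default Service, preferring endpoints on the same node before falling back to any endpoint (field names per the upstream 1.17 alpha API; this is not something the operator sets today):

apiVersion: v1
kind: Service
metadata:
  name: dns-default
  namespace: openshift-dns
spec:
  clusterIP: 172.30.0.10
  selector:
    dns.operator.openshift.io/daemonset-dns: default
  topologyKeys:
  - "kubernetes.io/hostname"
  - "*"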

Comment 2 Pablo Alonso Rodriguez 2020-02-20 16:09:07 UTC
Hello,

One minor clarification: yes, I am considering the case where a node gets isolated with regard to pod-to-pod communication. However, what is worrying is that DNS queries are affected on nodes that are not facing issues, even if they have nothing to do with the failed nodes or any application deployed on them.

A sample use case is someone installing nodes on different sites (i.e. different networks) for HA purposes, with a firewall in between that can be faulty or misconfigured. In such multi-site scenarios, situations where node-->master communication works but node-->node communication fails are feasible.

Another use case would be a strange kernel bug that causes UDP packets to be dropped unnecessarily. This kind of bug usually affects DNS, but can also affect VXLAN (as it is UDP too).

Anyway, preferring local endpoints looks like the best solution to me.

Comment 3 Dan Mace 2020-02-20 16:14:36 UTC
Thanks for the extra details. Those are very useful justifications. I don't know if apiserver connectivity from the node is a possibility in this scenario. If so, I wonder if the local CoreDNS cache may become stale. But reaching a degraded CoreDNS may be better than not being able to reach it at all...

Comment 4 Pablo Alonso Rodriguez 2020-02-20 16:19:17 UTC
In the scenario I have in mind, failed node --> master API communication would need to work. If it didn't, the failed node would be marked NotReady (by default after 40s, if I recall correctly) and then its pods would be evicted after the pod eviction timeout (5 minutes by default). If I am not wrong, this would mean that the coredns pods on the failed node would be evicted and removed from the service's endpoints list, so the issue would stop reproducing.
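
For reference, the timers described here map to the upstream kube-controller-manager defaults (flag names from upstream Kubernetes; how they are surfaced or overridden in OCP may differ):

--node-monitor-grace-period=40s   # node marked NotReady after 40s without status updates
--pod-eviction-timeout=5m0s       # pods on a NotReady node evicted after 5 minutes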

Comment 7 Hongan Li 2020-03-19 08:11:09 UTC
Tested with 4.5.0-0.nightly-2020-03-18-115438, but it seems `topologyKeys` is not added to the service.

$ oc -n openshift-dns get svc -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  spec:
    clusterIP: 172.30.0.10
    ports:
    - name: dns
      port: 53
      protocol: UDP
      targetPort: dns
    - name: dns-tcp
      port: 53
      protocol: TCP
      targetPort: dns-tcp
    - name: metrics
      port: 9153
      protocol: TCP
      targetPort: metrics
    selector:
      dns.operator.openshift.io/daemonset-dns: default
    sessionAffinity: None
    type: ClusterIP
  status:
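
A quicker way to check just that field is with jsonpath (an empty result means the field is not set):

$ oc -n openshift-dns get svc dns-default -o jsonpath='{.spec.topologyKeys}'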

Comment 8 Dan Mace 2020-03-30 15:30:59 UTC
https://github.com/openshift/cluster-dns-operator/pull/156 can't fix this, because Service Topology is alpha in 1.17 and won't be enabled in 4.5 (or any foreseeable future release).

NodeLocal DNS could perhaps help here, but that's also not happening in 4.5. I doubt we'll have any sort of solution here any time soon. I'm going to reduce the priority for now.

Comment 9 Ben Bennett 2020-05-11 13:04:22 UTC
This is not exactly a bug, just a side effect of the current design.

We understand that it causes problems when a node can reach the API server but the networking is not working. We are tracking the change to address this as a feature in https://issues.redhat.com/browse/NE-270

