Bug 2059639

Summary: [OVN] openshift-dns service is created with internal traffic policy Cluster and OVN uses the DNS service instead of the local endpoint
Product: OpenShift Container Platform
Reporter: Andre Costa <andcosta>
Component: Networking
Assignee: Miciah Dashiel Butler Masters <mmasters>
Networking sub component: DNS
QA Contact: Hongan Li <hongli>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: medium
Priority: medium
CC: aos-bugs, mmasters
Version: 4.9
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-01 16:45:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Andre Costa 2022-03-01 15:30:28 UTC
Description of problem:
On OpenShift 4.9, the openshift-dns service is created with internalTrafficPolicy set to Cluster, and OVN doesn't seem to have the same fix we introduced in OpenShift SDN to have SDN pods query the local DNS pod endpoint instead of the dns-default service.

https://bugzilla.redhat.com/show_bug.cgi?id=1919737

The code seems to mention that this change will be removed once internalTrafficPolicy is implemented. Is the Cluster setting correct, or should the service have internalTrafficPolicy set to Local?

If this setting on the service is not supposed to be changed because of possible side effects elsewhere in the cluster, can we have a similar implementation in OVN, since it looks like that work was already planned in the Bugzilla mentioned above?
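For reference, the effective policy on a running cluster can be checked with a standard JSONPath query (this is an illustrative one-liner, not part of the original report; it requires oc access to the openshift-dns namespace):

```shell
# Print the effective internal traffic policy of the dns-default service.
# On a cluster exhibiting this bug, this prints "Cluster" (the API default).
oc get svc dns-default -n openshift-dns \
  -o jsonpath='{.spec.internalTrafficPolicy}'
```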

OpenShift release version:
OCP 4.9 with OVNKubernetes

Cluster Platform:
All

How reproducible:
Unknown

Steps to Reproduce (in detail):
Unknown


Impact of the problem:
Sporadic DNS failures.

Additional info:

 $ oc get svc dns-default -o yaml -n openshift-dns

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1641922101
    service.beta.openshift.io/serving-cert-secret-name: dns-default-metrics-tls
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1641922101
  creationTimestamp: "2022-01-11T17:32:01Z"
  labels:
    dns.operator.openshift.io/owning-dns: default
  name: dns-default
  namespace: openshift-dns
  ownerReferences:
  - apiVersion: operator.openshift.io/v1
    controller: true
    kind: DNS
    name: default
    uid: 8397e280-d4f3-44a5-8a8f-a978bbdbaa7e
  resourceVersion: "10030"
  uid: ce8dcef3-7ba6-45c4-9ae2-c00f494e155c
spec:
  clusterIP: 172.32.0.10
  clusterIPs:
  - 172.32.0.10
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: dns
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: dns-tcp
  - name: metrics
    port: 9154
    protocol: TCP
    targetPort: metrics
  selector:
    dns.operator.openshift.io/daemonset-dns: default
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

Comment 1 Miciah Dashiel Butler Masters 2022-03-01 16:24:33 UTC
Setting blocker- as this doesn't appear to be a regression, upgrade issue, or otherwise something that should block a release.  

This issue appears to be related to bug 1919737, which we fixed with a patch to openshift-sdn.  This new BZ is about addressing the same issue in OVN-Kubernetes.

The spec.internalTrafficPolicy API field is relatively new; "internalTrafficPolicy: Cluster" is the default the API sets.  The DNS operator isn't explicitly setting internalTrafficPolicy.  The Kubernetes documentation is contradictory as to when "internalTrafficPolicy" was enabled by default (<https://kubernetes.io/docs/concepts/services-networking/service-traffic-policy/> says Kubernetes 1.23, and <https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.22.md> says Kubernetes 1.22).  The field seems to be present in Kubernetes 1.22 (as evidenced by bug 2002461), so we can set "internalTrafficPolicy: Local" in OpenShift 4.9 (which is based on Kubernetes 1.22; see <https://access.redhat.com/solutions/4870701>) and later.  

I'll check with the SDN team to see whether specifying "internalTrafficPolicy: Local" works or could break anything with openshift-sdn and OVN-Kubernetes.
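As a sketch of what that change would look like, here is the relevant portion of the dns-default Service with the field set explicitly (values copied from the manifest above; the internalTrafficPolicy: Local line is the hypothetical change, which the DNS operator does not currently set):

```yaml
# Hypothetical: dns-default Service with an explicit internal traffic policy.
# With Local, kube-proxy/OVN only routes to endpoints on the querying node,
# and traffic is dropped if the node has no ready local endpoint.
apiVersion: v1
kind: Service
metadata:
  name: dns-default
  namespace: openshift-dns
spec:
  type: ClusterIP
  internalTrafficPolicy: Local   # assumption under discussion; default is Cluster
  selector:
    dns.operator.openshift.io/daemonset-dns: default
  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: dns
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: dns-tcp
```

Note the drop-when-no-local-endpoint behavior of Local: a node whose DNS pod is down or being rescheduled would lose cluster DNS entirely, which is a relevant consideration for this service.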

Comment 2 Miciah Dashiel Butler Masters 2022-03-01 16:45:23 UTC
Surya from the SDN team reminded me about bug 2039698 (and 4.9.z backport bug 2055317), which adds a fix in OVN-Kubernetes similar to the one in openshift-sdn.  Surya also reminded me that "internalTrafficPolicy: Local" is not really what we need for the DNS service; we need the service to *prefer* a local endpoint and fall back to any available endpoint if no local endpoint is available.  There is work upstream to add "internalTrafficPolicy: PreferLocal" (see <https://github.com/kubernetes/enhancements/pull/3016>), but right now, "internalTrafficPolicy" does not fit our needs.  

I'm closing this report as a duplicate of bug 2055317; please let me know if I have misunderstood the request in this BZ.

*** This bug has been marked as a duplicate of bug 2055317 ***