Bug 1921797

Summary: [OCP4.6 on Azure] packets dropped between master and worker
Product: OpenShift Container Platform
Reporter: Angelo Gabrieli <agabriel>
Component: Networking
Assignee: mcambria <mcambria>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: high
Priority: high
CC: aconstan, anbhat, anton, aos-bugs, atenart, bbennett, bjarolim, dmoessne, jligon, jminter, mcambria, namato, nmalik, openshift-bugs-escalate, pweil, scuppett, sukulkar
Version: 4.6
Keywords: Reopened, ServiceDeliveryImpact
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1941753
Environment:
Last Closed: 2021-05-10 17:59:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1941753
Bug Blocks:

Description Angelo Gabrieli 2021-01-28 15:58:06 UTC
Description of problem:

OCP version: 4.6.12
Cloud provider: Azure
RHCOS kernel version: 4.18.0-193.40.1.el8_2.x86_64
Network plugin: SDN

- Create a new test namespace
- Create a new dummy (sleep) pod on a worker node and access it with `oc rsh`
- Ping a DNS pod's IP
- Run dig against the same DNS pod's IP


# oc rsh sleep
sh-4.2#
sh-4.2# ping 10.130.0.38
PING 10.130.0.38 (10.130.0.38) 56(84) bytes of data.
^C
--- 10.130.0.38 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4117ms

sh-4.2#
sh-4.2# dig @10.130.0.38 -p 5353 <some DNS name>

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.3 <<>> @10.130.0.38 -p 5353 <some DNS name>
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
sh-4.2#


No NetworkPolicy is in place:


# oc get networkpolicy -A
No resources found


Version-Release number of selected component (if applicable):
OCP version: 4.6.12
Cloud provider: Azure
RHCOS kernel version: 4.18.0-193.40.1.el8_2.x86_64


How reproducible:


Steps to Reproduce:
1.
2.
3.


Actual results:
VXLAN network traffic is blocked


Expected results:
VXLAN network traffic is allowed


Additional info:

Comment 17 Ben Bennett 2021-03-04 14:56:45 UTC

*** This bug has been marked as a duplicate of bug 1933761 ***

Comment 18 Ben Bennett 2021-03-04 14:57:41 UTC

*** This bug has been marked as a duplicate of bug 1928773 ***

Comment 19 Ben Bennett 2021-03-04 15:02:23 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1928773 will change the DNS service to prefer the resolver on the local node, if one is present. This should prevent a lot of cross-node DNS traffic.
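
For illustration only, the "prefer the local resolver" idea can be sketched as below. This is not the code from bug 1928773; the `Endpoint` type, the `pick_dns_endpoint` function, the node names, and the second IP address are all hypothetical, and a real proxy would load-balance across remote endpoints rather than take the first one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Endpoint:
    """A DNS pod endpoint: its pod IP and the node it runs on (hypothetical type)."""
    ip: str
    node: str


def pick_dns_endpoint(endpoints: List[Endpoint], client_node: str) -> Optional[Endpoint]:
    """Prefer an endpoint on the client's own node; otherwise fall back to any endpoint."""
    for ep in endpoints:
        if ep.node == client_node:
            # Local resolver present: the query never leaves the node.
            return ep
    # No local resolver: fall back to a remote one (simplified to "the first").
    return endpoints[0] if endpoints else None


# Assumed example data: one DNS pod on a master, one on a worker.
endpoints = [
    Endpoint(ip="10.130.0.38", node="master-0"),
    Endpoint(ip="10.131.0.5", node="worker-1"),
]
local_choice = pick_dns_endpoint(endpoints, "worker-1")
remote_choice = pick_dns_endpoint(endpoints, "worker-2")
```

With this selection rule, a pod on `worker-1` resolves through the DNS pod on its own node, so its queries would not depend on the cross-node path that is dropping packets in this report.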

https://bugzilla.redhat.com/show_bug.cgi?id=1933761 is also related (and will be backported to 4.6), since it raises the max TTL that CoreDNS will return to clients from 30s to 900s. Today, if an upstream resolver sets a high TTL, we cap it at 30s; after the change we will cap it at 15m. That allows the pods' resolvers to cache responses for much longer, and should avoid repeated DNS requests.
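
A rough sketch of the arithmetic behind the TTL change, assuming a client that caches each answer for exactly the TTL it receives and re-queries as soon as it expires. The function names and the one-hour window are illustrative assumptions; only the 30s and 900s caps come from the comment above.

```python
import math

OLD_MAX_TTL = 30   # seconds, cap before the change
NEW_MAX_TTL = 900  # seconds (15m), cap after the change


def capped_ttl(upstream_ttl: int, max_ttl: int) -> int:
    """TTL handed to the client: the upstream TTL, limited by the cap."""
    return min(upstream_ttl, max_ttl)


def lookups_needed(upstream_ttl: int, max_ttl: int, window: int) -> int:
    """Queries a caching client sends to keep one name resolved for `window` seconds."""
    return math.ceil(window / capped_ttl(upstream_ttl, max_ttl))


# A record whose upstream TTL is one hour, observed over one hour.
before = lookups_needed(upstream_ttl=3600, max_ttl=OLD_MAX_TTL, window=3600)
after = lookups_needed(upstream_ttl=3600, max_ttl=NEW_MAX_TTL, window=3600)
```

Under these assumptions the client sends 120 queries per hour with the 30s cap and 4 with the 900s cap. Records whose upstream TTL is already below 30s are unaffected, because the cap only ever lowers a TTL.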

Comment 26 Dan Winship 2021-03-22 17:09:24 UTC
It seems like this is a bug in OVS. I have cloned this bug to bug 1941753 for the OVS team to investigate. (Assuming it is an OVS bug, that bug will track fixing it in OVS and then this bug will track getting the fixed OVS package into OCP.)

Comment 27 zhaozhanqi 2021-03-26 05:08:57 UTC
Is this the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1825219?

Comment 33 Ben Bennett 2021-05-10 17:59:23 UTC

*** This bug has been marked as a duplicate of bug 1825219 ***