Bug 1853198 - OVS connection timeout leading to un-programmed OVS flows by the SDN
Summary: OVS connection timeout leading to un-programmed OVS flows by the SDN
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ---
: 4.3.z
Assignee: Alexander Constantinescu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1853197
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-07-02 07:00 UTC by Alexander Constantinescu
Modified: 2020-08-19 11:10 UTC (History)
32 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1853197
Environment:
Last Closed: 2020-08-19 11:10:15 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift sdn pull 156 None closed Bug 1853198: [release-4.3] Backport OVS-retries 2020-11-02 10:11:03 UTC
Red Hat Product Errata RHBA-2020:3259 None None None 2020-08-19 11:10:50 UTC

Description Alexander Constantinescu 2020-07-02 07:00:57 UTC
+++ This bug was initially created as a clone of Bug #1853197 +++

+++ This bug was initially created as a clone of Bug #1853193 +++

+++ This bug was initially created as a clone of Bug #1851182 +++

+++ This bug was initially created as a clone of Bug #1838007 +++

Description of problem:

This is a placeholder bug intended to track all networking related investigations concerning the quay.io outage on 2020-05-19

Please attach a detailed description of what happened, ex:

 - Pod X to Pod Y communication did not work
 - Pod X to Service Y did not work 
 - IP X to IP Y did not work 
 - Node X to node Y could not communicate

attach logs / must-gather / sos-reports with any networking related information from the time of the events

Version-Release number of selected component (if applicable):

This concerns 4.3.18 - 4.3.19 upgrade

How reproducible:

To be determinded

Steps to Reproduce:
1.
2.
3.

--- Additional comment from Alexander Constantinescu on 2020-06-25 17:18:27 UTC ---

The reason why the bad DNS pod cannot be reached is because the SDN failed to update the VNID information in table 80 on the node hosting the bad DNS pod. The packets are subsequently dropped once they reach this table. 

This is the ofproto trace command from the sre-pod to the bad DNS pod:


======================================================================================================

ovs-appctl ofproto/trace  br0 "in_port=1,udp,nw_dst=10.130.2.3,nw_src=10.131.2.13,tun_src=10.0.142.181,udp_dst=5353,tun_dst=10.0.139.42"
Flow: udp,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.2.13,nw_dst=10.130.2.3,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 300
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.
Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,tun_id=0,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_tos=0,tun_flags=-df-csum-key,in_port=1,nw_frag=no
Datapath actions: ct,recirc(0x32ba52)
===============================================================================
recirc(0x32ba52) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================
Flow: recirc_id=0x32ba52,ct_state=new|trk,eth,udp,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.2.13,nw_dst=10.130.2.3,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
bridge("br0")
-------------
    thaw
        Resuming from table 0
 0. ip,in_port=1,nw_src=10.128.0.0/14, priority 200
    move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[]
     -> NXM_NX_REG0[] is now 0
    goto_table:10
10. tun_src=10.0.142.181, priority 100, cookie 0x5cd1398a
    goto_table:30
30. ip,nw_dst=10.130.2.0/23, priority 200
    goto_table:70
70. ip,nw_dst=10.130.2.3, priority 100
    load:0x790ec8->NXM_NX_REG1[]
    load:0x4->NXM_NX_REG2[]
    goto_table:80
80. priority 0
    drop
Final flow: recirc_id=0x32ba52,ct_state=new|trk,eth,udp,reg1=0x790ec8,reg2=0x4,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.2.13,nw_dst=10.130.2.3,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
Megaflow: recirc_id=0x32ba52,ct_state=-rpl+trk,eth,ip,tun_id=0,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_tos=0,tun_flags=-df-csum-key,in_port=1,nw_src=10.131.0.0/16,nw_dst=10.130.2.3,nw_frag=no
Datapath actions: drop

======================================================================================================


If we check the SDN logs for the DNS pod we find the following:

I0619 14:37:30.867774    2207 pod.go:503] CNI_ADD openshift-dns/dns-default-ksr8f got IP 10.130.2.3, ofport 4
I0619 14:37:30.872105    2207 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: br0: failed to connect to socket (Broken pipe)
E0619 14:37:30.872127    2207 pod.go:232] Error updating OVS multicast flows for VNID 7933640: exit status 1
I0619 14:37:30.872105    2207 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: br0: failed to connect to socket (Broken pipe)
E0619 14:37:30.872198    2207 networkpolicy.go:308] Error syncing OVS flows for VNID: exit status 1

And subsenquently the OVS logs;

2020-06-19T14:37:30.829Z|00053|bridge|INFO|bridge br0: added interface veth4199d3c2 on port 4
2020-06-19T14:37:30.854Z|00054|connmgr|INFO|br0<->unix#66: 5 flow_mods in the last 0 s (5 adds)
2020-06-19T14:37:30.871Z|00055|rconn|INFO|br0<->unix#67: connection timed out
2020-06-19T14:37:30.871Z|00056|rconn|INFO|br0<->unix#68: connection timed out

So it seems OVS timed-out talking to its unix socket, which is why the OVS flows could not be programmed for this particular pod by the SDN. 

I will add a retry mechanism to SDN to circumvent this issue.

I will clone this bug and track it there, in case investigations into this BZ continues. 

There are still no indications that this caused the OSD problems on the 19th of May.

--- Additional comment from Clayton Coleman on 2020-06-25 21:09:54 UTC ---

Are the metrics we need to add (like iptables) that ensure we eventually converge?  Everything should retry everywhere at some interval, but if we add a retry we also need to meauser how often it happens.

--- Additional comment from Clayton Coleman on 2020-06-25 21:13:22 UTC ---

https://search.apps.build01.ci.devcluster.openshift.com/?search=ovs-ofctl%3A+br0%3A+failed+to+connect+to+socket&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Not sure all of these are the same (could be races during shutdown), but this is showing up in ~0.64% of failing CI runs in last 2 days, 1.7% of failing CI runs in last 14 days

--- Additional comment from Alexander Constantinescu on 2020-06-26 13:10:34 UTC ---

In response to #comment 1: Added a commit with metric for failed OVS commits by openshift-sdn (see referenced PR)

--- Additional comment from Alexander Constantinescu on 2020-07-02 06:57:56 UTC ---

Hmm, the PR merged...the bot should have update this to MODIFIED, so I am doing that manually

Comment 3 zhaozhanqi 2020-08-10 10:09:17 UTC
Verified this bug on 4.3.0-0.nightly-2020-08-07-045359

Did some regression, not reproduced this issue.

Comment 5 errata-xmlrpc 2020-08-19 11:10:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.3.33 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3259


Note You need to log in before you can comment on or make changes to this bug.