Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1853198

Summary:	OVS connection timeout leading to un-programmed OVS flows by the SDN
Product:	OpenShift Container Platform	Reporter:	Alexander Constantinescu <aconstan>
Component:	Networking	Assignee:	Alexander Constantinescu <aconstan>
Networking sub component:	openshift-sdn	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	aaleman, aarapov, akonarde, aos-bugs, apahim, asegundo, aweiteka, bbennett, cattias, cblecker, ccoleman, dcbw, dhansen, jaharrin, jbeakley, jchevret, jeder, kbsingh, lmohanty, markmc, marobrie, mcambria, nmalik, pbergene, scuppett, sdodson, tparikh, trankin, tsmetana, vrutkovs, wking, zzhao
Version:	4.3.0	Keywords:	ServiceDeliveryBlocker, Upgrades
Target Milestone:	---
Target Release:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1853197	Environment:
Last Closed:	2020-08-19 11:10:15 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1853197
Bug Blocks:

Description Alexander Constantinescu 2020-07-02 07:00:57 UTC

+++ This bug was initially created as a clone of Bug #1853197 +++

+++ This bug was initially created as a clone of Bug #1853193 +++

+++ This bug was initially created as a clone of Bug #1851182 +++

+++ This bug was initially created as a clone of Bug #1838007 +++

Description of problem:

This is a placeholder bug intended to track all networking-related investigations concerning the quay.io outage on 2020-05-19.

Please attach a detailed description of what happened, e.g.:

 - Pod X to Pod Y communication did not work
 - Pod X to Service Y did not work 
 - IP X to IP Y did not work 
 - Node X to node Y could not communicate

Attach logs, must-gather output, and sos-reports with any networking-related information from the time of the events.

Version-Release number of selected component (if applicable):

This concerns the 4.3.18 to 4.3.19 upgrade.

How reproducible:

To be determined

Steps to Reproduce:
1.
2.
3.

--- Additional comment from Alexander Constantinescu on 2020-06-25 17:18:27 UTC ---

The bad DNS pod cannot be reached because the SDN failed to update the VNID information in table 80 on the node hosting it. Packets are therefore dropped once they reach this table.

This is the ofproto/trace command and its output for traffic from the sre-pod to the bad DNS pod:


======================================================================================================

ovs-appctl ofproto/trace  br0 "in_port=1,udp,nw_dst=10.130.2.3,nw_src=10.131.2.13,tun_src=10.0.142.181,udp_dst=5353,tun_dst=10.0.139.42"
Flow: udp,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.2.13,nw_dst=10.130.2.3,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 300
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.
Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,tun_id=0,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_tos=0,tun_flags=-df-csum-key,in_port=1,nw_frag=no
Datapath actions: ct,recirc(0x32ba52)
===============================================================================
recirc(0x32ba52) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================
Flow: recirc_id=0x32ba52,ct_state=new|trk,eth,udp,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.2.13,nw_dst=10.130.2.3,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
bridge("br0")
-------------
    thaw
        Resuming from table 0
 0. ip,in_port=1,nw_src=10.128.0.0/14, priority 200
    move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[]
     -> NXM_NX_REG0[] is now 0
    goto_table:10
10. tun_src=10.0.142.181, priority 100, cookie 0x5cd1398a
    goto_table:30
30. ip,nw_dst=10.130.2.0/23, priority 200
    goto_table:70
70. ip,nw_dst=10.130.2.3, priority 100
    load:0x790ec8->NXM_NX_REG1[]
    load:0x4->NXM_NX_REG2[]
    goto_table:80
80. priority 0
    drop
Final flow: recirc_id=0x32ba52,ct_state=new|trk,eth,udp,reg1=0x790ec8,reg2=0x4,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.2.13,nw_dst=10.130.2.3,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
Megaflow: recirc_id=0x32ba52,ct_state=-rpl+trk,eth,ip,tun_id=0,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_tos=0,tun_flags=-df-csum-key,in_port=1,nw_src=10.131.0.0/16,nw_dst=10.130.2.3,nw_frag=no
Datapath actions: drop

======================================================================================================


If we check the SDN logs for the DNS pod we find the following:

I0619 14:37:30.867774    2207 pod.go:503] CNI_ADD openshift-dns/dns-default-ksr8f got IP 10.130.2.3, ofport 4
I0619 14:37:30.872105    2207 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: br0: failed to connect to socket (Broken pipe)
E0619 14:37:30.872127    2207 pod.go:232] Error updating OVS multicast flows for VNID 7933640: exit status 1
I0619 14:37:30.872105    2207 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: br0: failed to connect to socket (Broken pipe)
E0619 14:37:30.872198    2207 networkpolicy.go:308] Error syncing OVS flows for VNID: exit status 1
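
To correlate the trace with these log lines, the hex value loaded into REG1 in table 70 (0x790ec8) can be converted to decimal and compared with the VNID in the SDN error. A minimal sketch in Go follows; the helper name parseRegValue is hypothetical, and reading REG1 as the destination VNID is an assumption about the openshift-sdn flow layout rather than something stated in this report.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseRegValue (hypothetical helper) converts a hex register value as
// printed by ovs-appctl ofproto/trace, e.g. "0x790ec8", to decimal.
func parseRegValue(hex string) (uint32, error) {
	// Base 0 lets ParseUint honour the "0x" prefix.
	v, err := strconv.ParseUint(hex, 0, 32)
	if err != nil {
		return 0, fmt.Errorf("parsing %q: %w", hex, err)
	}
	return uint32(v), nil
}

func main() {
	vnid, err := parseRegValue("0x790ec8")
	if err != nil {
		panic(err)
	}
	fmt.Println(vnid) // prints 7933640, the VNID named in the SDN error above
}
```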

And subsequently the OVS logs:

2020-06-19T14:37:30.829Z|00053|bridge|INFO|bridge br0: added interface veth4199d3c2 on port 4
2020-06-19T14:37:30.854Z|00054|connmgr|INFO|br0<->unix#66: 5 flow_mods in the last 0 s (5 adds)
2020-06-19T14:37:30.871Z|00055|rconn|INFO|br0<->unix#67: connection timed out
2020-06-19T14:37:30.871Z|00056|rconn|INFO|br0<->unix#68: connection timed out

So it seems the OVS connection on its Unix socket timed out, which is why the SDN could not program the OVS flows for this particular pod.

I will add a retry mechanism to SDN to circumvent this issue.
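
A retry of this kind could look roughly like the sketch below. It is not the code from the referenced PR: the function retry, the sentinel errTransient, the attempt count, and the doubling delay are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical stand-in for the ovs-ofctl socket failure.
var errTransient = errors.New("br0: failed to connect to socket (Broken pipe)")

// retry calls fn up to attempts times, doubling the delay after each failure.
// It returns nil on the first success, or the last error once attempts run out.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated flow update that fails twice before succeeding.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	err := retry(5, 10*time.Millisecond, flaky)
	fmt.Println(calls, err) // prints: 3 <nil>
}
```

Bounding the number of attempts keeps a persistently broken socket from blocking pod setup indefinitely; the final error still wraps the original cause for logging.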

I will clone this bug and track it there, in case investigations into this BZ continue.

There are still no indications that this caused the OSD problems on the 19th of May.

--- Additional comment from Clayton Coleman on 2020-06-25 21:09:54 UTC ---

Are there metrics we need to add (like iptables) that ensure we eventually converge?  Everything should retry everywhere at some interval, but if we add a retry we also need to measure how often it happens.
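
As a rough illustration of measuring how often a retried operation fails, the sketch below wraps a call so that every error increments a counter. The names counted, failures, and errSocket are hypothetical, and a plain in-process counter is an assumption made to keep the example self-contained; a real component would more likely export the value through its existing metrics endpoint.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// failedCalls counts how many wrapped calls returned an error.
var failedCalls int64

// errSocket is a hypothetical stand-in for the ovs-ofctl socket failure.
var errSocket = errors.New("failed to connect to socket")

// counted wraps fn so that every error it returns increments failedCalls.
func counted(fn func() error) func() error {
	return func() error {
		err := fn()
		if err != nil {
			atomic.AddInt64(&failedCalls, 1)
		}
		return err
	}
}

// failures returns the current value of the counter.
func failures() int64 {
	return atomic.LoadInt64(&failedCalls)
}

func main() {
	n := 0
	op := counted(func() error {
		n++
		if n <= 2 {
			return errSocket
		}
		return nil
	})
	// Keep calling until the operation succeeds.
	for op() != nil {
	}
	fmt.Println(failures()) // prints 2
}
```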

--- Additional comment from Clayton Coleman on 2020-06-25 21:13:22 UTC ---

https://search.apps.build01.ci.devcluster.openshift.com/?search=ovs-ofctl%3A+br0%3A+failed+to+connect+to+socket&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Not sure all of these are the same (could be races during shutdown), but this is showing up in ~0.64% of failing CI runs in the last 2 days, 1.7% of failing CI runs in the last 14 days

--- Additional comment from Alexander Constantinescu on 2020-06-26 13:10:34 UTC ---

In response to #comment 1: Added a commit with a metric for failed OVS commits by openshift-sdn (see referenced PR)

--- Additional comment from Alexander Constantinescu on 2020-07-02 06:57:56 UTC ---

Hmm, the PR merged...the bot should have updated this to MODIFIED, so I am doing that manually

Comment 3 zhaozhanqi 2020-08-10 10:09:17 UTC

Verified this bug on 4.3.0-0.nightly-2020-08-07-045359

Ran some regression tests; the issue was not reproduced.

Comment 5 errata-xmlrpc 2020-08-19 11:10:15 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.3.33 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3259