Bug 1853193 - OVS connection timeout leading to un-programmed OVS flows by the SDN
Summary: OVS connection timeout leading to un-programmed OVS flows by the SDN
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.5.z
Assignee: Alexander Constantinescu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1851182
Blocks: 1853197
 
Reported: 2020-07-02 06:58 UTC by Alexander Constantinescu
Modified: 2023-09-15 00:33 UTC
CC List: 33 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1851182
Clones: 1853197
Environment:
Last Closed: 2020-07-30 18:56:59 UTC
Target Upstream Version:
Embargoed:




Links:
GitHub openshift/sdn pull 154 (closed): Bug 1853193: [release-4.5] Backport OVS-retries (last updated 2021-02-10 03:23:34 UTC)
Red Hat Product Errata RHBA-2020:3028 (last updated 2020-07-30 18:57:33 UTC)

Description Alexander Constantinescu 2020-07-02 06:58:31 UTC
+++ This bug was initially created as a clone of Bug #1851182 +++

+++ This bug was initially created as a clone of Bug #1838007 +++

Description of problem:

This is a placeholder bug intended to track all networking-related investigations concerning the quay.io outage on 2020-05-19.

Please attach a detailed description of what happened, e.g.:

 - Pod X to Pod Y communication did not work
 - Pod X to Service Y did not work 
 - IP X to IP Y did not work 
 - Node X to node Y could not communicate

Please attach logs / must-gather / sos-reports with any networking-related information from the time of the events.

Version-Release number of selected component (if applicable):

This concerns the 4.3.18 -> 4.3.19 upgrade.

How reproducible:

To be determined

Steps to Reproduce:
1.
2.
3.

--- Additional comment from Alexander Constantinescu on 2020-06-25 17:18:27 UTC ---

The bad DNS pod cannot be reached because the SDN failed to update the VNID information in table 80 on the node hosting the bad DNS pod. Packets are subsequently dropped once they reach this table.

This is the ofproto/trace command (and its output) for traffic from the sre-pod to the bad DNS pod:


======================================================================================================

ovs-appctl ofproto/trace  br0 "in_port=1,udp,nw_dst=10.130.2.3,nw_src=10.131.2.13,tun_src=10.0.142.181,udp_dst=5353,tun_dst=10.0.139.42"
Flow: udp,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.2.13,nw_dst=10.130.2.3,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
bridge("br0")
-------------
 0. ct_state=-trk,ip, priority 300
    ct(table=0)
    drop
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.
Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,tun_id=0,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_tos=0,tun_flags=-df-csum-key,in_port=1,nw_frag=no
Datapath actions: ct,recirc(0x32ba52)
===============================================================================
recirc(0x32ba52) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
===============================================================================
Flow: recirc_id=0x32ba52,ct_state=new|trk,eth,udp,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.2.13,nw_dst=10.130.2.3,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
bridge("br0")
-------------
    thaw
        Resuming from table 0
 0. ip,in_port=1,nw_src=10.128.0.0/14, priority 200
    move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[]
     -> NXM_NX_REG0[] is now 0
    goto_table:10
10. tun_src=10.0.142.181, priority 100, cookie 0x5cd1398a
    goto_table:30
30. ip,nw_dst=10.130.2.0/23, priority 200
    goto_table:70
70. ip,nw_dst=10.130.2.3, priority 100
    load:0x790ec8->NXM_NX_REG1[]
    load:0x4->NXM_NX_REG2[]
    goto_table:80
80. priority 0
    drop
Final flow: recirc_id=0x32ba52,ct_state=new|trk,eth,udp,reg1=0x790ec8,reg2=0x4,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=10.131.2.13,nw_dst=10.130.2.3,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
Megaflow: recirc_id=0x32ba52,ct_state=-rpl+trk,eth,ip,tun_id=0,tun_src=10.0.142.181,tun_dst=10.0.139.42,tun_tos=0,tun_flags=-df-csum-key,in_port=1,nw_src=10.131.0.0/16,nw_dst=10.130.2.3,nw_frag=no
Datapath actions: drop

======================================================================================================


If we check the SDN logs for the DNS pod, we find the following:

I0619 14:37:30.867774    2207 pod.go:503] CNI_ADD openshift-dns/dns-default-ksr8f got IP 10.130.2.3, ofport 4
I0619 14:37:30.872105    2207 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: br0: failed to connect to socket (Broken pipe)
E0619 14:37:30.872127    2207 pod.go:232] Error updating OVS multicast flows for VNID 7933640: exit status 1
I0619 14:37:30.872105    2207 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: br0: failed to connect to socket (Broken pipe)
E0619 14:37:30.872198    2207 networkpolicy.go:308] Error syncing OVS flows for VNID: exit status 1

And subsequently the OVS logs:

2020-06-19T14:37:30.829Z|00053|bridge|INFO|bridge br0: added interface veth4199d3c2 on port 4
2020-06-19T14:37:30.854Z|00054|connmgr|INFO|br0<->unix#66: 5 flow_mods in the last 0 s (5 adds)
2020-06-19T14:37:30.871Z|00055|rconn|INFO|br0<->unix#67: connection timed out
2020-06-19T14:37:30.871Z|00056|rconn|INFO|br0<->unix#68: connection timed out

So it seems OVS timed out talking to its Unix socket, which is why the OVS flows could not be programmed for this particular pod by the SDN.

I will add a retry mechanism to the SDN to work around this issue.
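
For illustration, the retry could look roughly like the sketch below. This is a hypothetical Go snippet, not the code from the referenced PR; the function names and the retry/backoff values are made up:

// Illustrative sketch only: retry a transient OVS failure a few times with a
// short backoff before giving up, so a one-off "failed to connect to socket
// (Broken pipe)" does not leave the pod's flows unprogrammed.
package main

import (
	"fmt"
	"time"
)

// execOVS stands in for whatever runs the ovs-ofctl transaction against br0.
func execOVS(args ...string) error {
	// ... exec ovs-ofctl and return its error ...
	return nil
}

// withRetry runs op up to attempts times, sleeping backoff between failures.
func withRetry(attempts int, backoff time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("OVS operation failed after %d attempts: %v", attempts, err)
}

func main() {
	err := withRetry(5, 500*time.Millisecond, func() error {
		return execOVS("add-flows", "br0", "-")
	})
	if err != nil {
		fmt.Println(err)
	}
}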

I will clone this bug and track it there, in case investigations into this BZ continue.

There are still no indications that this caused the OSD problems on the 19th of May.

--- Additional comment from Clayton Coleman on 2020-06-25 21:09:54 UTC ---

Are there metrics we need to add (like iptables) that ensure we eventually converge? Everything should retry everywhere at some interval, but if we add a retry we also need to measure how often it happens.

--- Additional comment from Clayton Coleman on 2020-06-25 21:13:22 UTC ---

https://search.apps.build01.ci.devcluster.openshift.com/?search=ovs-ofctl%3A+br0%3A+failed+to+connect+to+socket&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Not sure all of these are the same (could be races during shutdown), but this is showing up in ~0.64% of failing CI runs in the last 2 days and 1.7% of failing CI runs in the last 14 days.

--- Additional comment from Alexander Constantinescu on 2020-06-26 13:10:34 UTC ---

In response to comment #1: I added a commit with a metric for failed OVS commits by openshift-sdn (see the referenced PR).
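
For illustration, such a counter could be declared with the Prometheus Go client roughly as below. The metric name and port are hypothetical, not necessarily what the PR uses:

// Illustrative sketch only: a counter bumped whenever an OVS transaction
// fails, exposed on a /metrics endpoint so flow-programming failures can be
// graphed and alerted on. Names here are made up, not the actual
// openshift-sdn metric.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var ovsOperationErrors = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "openshift_sdn",
	Name:      "ovs_operation_errors_total",
	Help:      "Number of OVS flow programming operations that failed (after retries).",
})

func init() {
	prometheus.MustRegister(ovsOperationErrors)
}

// recordOVSFailure would be called from the error path of the OVS exec code.
func recordOVSFailure() {
	ovsOperationErrors.Inc()
}

func main() {
	recordOVSFailure() // example: increment from a failed ovs-ofctl call
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9101", nil)
}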

--- Additional comment from Alexander Constantinescu on 2020-07-02 06:57:56 UTC ---

Hmm, the PR merged... the bot should have updated this to MODIFIED, so I am doing that manually.

Comment 4 zhaozhanqi 2020-07-22 06:26:16 UTC
Verified this bug on 4.5.0-0.nightly-2020-07-22-010225

Comment 6 errata-xmlrpc 2020-07-30 18:56:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3028

Comment 7 Red Hat Bugzilla 2023-09-15 00:33:49 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

