Bug 1853197 - OVS connection timeout leading to un-programmed OVS flows by the SDN
Summary: OVS connection timeout leading to un-programmed OVS flows by the SDN
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.4.z
Assignee: Alexander Constantinescu
QA Contact: huirwang
Depends On: 1853193
Blocks: 1853198
Reported: 2020-07-02 06:59 UTC by Alexander Constantinescu
Modified: 2021-04-05 17:47 UTC (History)
32 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1853193
: 1853198 (view as bug list)
Last Closed: 2020-08-04 14:16:01 UTC
Target Upstream Version:

Attachments

System ID Private Priority Status Summary Last Updated
Github openshift sdn pull 155 0 None closed Bug 1853197: [release-4.4] Backport OVS-retries 2021-02-19 10:12:15 UTC
Red Hat Product Errata RHBA-2020:3128 0 None None None 2020-08-04 14:16:34 UTC

Description Alexander Constantinescu 2020-07-02 06:59:25 UTC
+++ This bug was initially created as a clone of Bug #1853193 +++

+++ This bug was initially created as a clone of Bug #1851182 +++

+++ This bug was initially created as a clone of Bug #1838007 +++

Description of problem:

This is a placeholder bug intended to track all networking related investigations concerning the quay.io outage on 2020-05-19

Please attach a detailed description of what happened, e.g.:

 - Pod X to Pod Y communication did not work
 - Pod X to Service Y did not work 
 - IP X to IP Y did not work 
 - Node X to node Y could not communicate

Attach logs / must-gather / sos-reports with any networking-related information from the time of the events.

Version-Release number of selected component (if applicable):

This concerns the 4.3.18 to 4.3.19 upgrade.

How reproducible:

To be determined

Steps to Reproduce:

--- Additional comment from Alexander Constantinescu on 2020-06-25 17:18:27 UTC ---

The bad DNS pod cannot be reached because the SDN failed to update the VNID information in table 80 on the node hosting the bad DNS pod. Packets are subsequently dropped once they reach this table.

This is the ofproto trace command from the sre-pod to the bad DNS pod:


ovs-appctl ofproto/trace  br0 "in_port=1,udp,nw_dst=,nw_src=,tun_src=,udp_dst=5353,tun_dst="
Flow: udp,tun_src=,tun_dst=,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=,nw_dst=,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
 0. ct_state=-trk,ip, priority 300
     -> A clone of the packet is forked to recirculate. The forked pipeline will be resumed at table 0.
     -> Sets the packet to an untracked state, and clears all the conntrack fields.
Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-trk,eth,ip,tun_id=0,tun_src=,tun_dst=,tun_tos=0,tun_flags=-df-csum-key,in_port=1,nw_frag=no
Datapath actions: ct,recirc(0x32ba52)
recirc(0x32ba52) - resume conntrack with default ct_state=trk|new (use --ct-next to customize)
Flow: recirc_id=0x32ba52,ct_state=new|trk,eth,udp,tun_src=,tun_dst=,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=,nw_dst=,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
        Resuming from table 0
 0. ip,in_port=1,nw_src=, priority 200
     -> NXM_NX_REG0[] is now 0
10. tun_src=, priority 100, cookie 0x5cd1398a
30. ip,nw_dst=, priority 200
70. ip,nw_dst=, priority 100
80. priority 0
Final flow: recirc_id=0x32ba52,ct_state=new|trk,eth,udp,reg1=0x790ec8,reg2=0x4,tun_src=,tun_dst=,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=0,tun_erspan_ver=0,tun_flags=0,in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:00:00:00:00:00,nw_src=,nw_dst=,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=0,tp_dst=5353
Megaflow: recirc_id=0x32ba52,ct_state=-rpl+trk,eth,ip,tun_id=0,tun_src=,tun_dst=,tun_tos=0,tun_flags=-df-csum-key,in_port=1,nw_src=,nw_dst=,nw_frag=no
Datapath actions: drop


If we check the SDN logs for the DNS pod we find the following:

I0619 14:37:30.867774    2207 pod.go:503] CNI_ADD openshift-dns/dns-default-ksr8f got IP, ofport 4
I0619 14:37:30.872105    2207 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: br0: failed to connect to socket (Broken pipe)
E0619 14:37:30.872127    2207 pod.go:232] Error updating OVS multicast flows for VNID 7933640: exit status 1
I0619 14:37:30.872105    2207 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: br0: failed to connect to socket (Broken pipe)
E0619 14:37:30.872198    2207 networkpolicy.go:308] Error syncing OVS flows for VNID: exit status 1

And subsequently the OVS logs:

2020-06-19T14:37:30.829Z|00053|bridge|INFO|bridge br0: added interface veth4199d3c2 on port 4
2020-06-19T14:37:30.854Z|00054|connmgr|INFO|br0<->unix#66: 5 flow_mods in the last 0 s (5 adds)
2020-06-19T14:37:30.871Z|00055|rconn|INFO|br0<->unix#67: connection timed out
2020-06-19T14:37:30.871Z|00056|rconn|INFO|br0<->unix#68: connection timed out

So it seems OVS timed out talking to its unix socket, which is why the SDN could not program the OVS flows for this particular pod.

I will add a retry mechanism to the SDN to circumvent this issue.

I will clone this bug and track it there, in case investigation into this BZ continues.

There are still no indications that this caused the OSD problems on the 19th of May.

--- Additional comment from Clayton Coleman on 2020-06-25 21:09:54 UTC ---

Are there metrics we need to add (like iptables) to ensure we eventually converge? Everything should retry everywhere at some interval, but if we add a retry we also need to measure how often it happens.

--- Additional comment from Clayton Coleman on 2020-06-25 21:13:22 UTC ---


Not sure all of these are the same (could be races during shutdown), but this is showing up in ~0.64% of failing CI runs in the last 2 days, and 1.7% of failing CI runs in the last 14 days.

--- Additional comment from Alexander Constantinescu on 2020-06-26 13:10:34 UTC ---

In response to comment #1: added a commit with a metric for failed OVS commits by openshift-sdn (see the referenced PR).

--- Additional comment from Alexander Constantinescu on 2020-07-02 06:57:56 UTC ---

Hmm, the PR merged... the bot should have updated this to MODIFIED, so I am doing that manually.

Comment 6 errata-xmlrpc 2020-08-04 14:16:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.4.15 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


Comment 7 W. Trevor King 2021-04-05 17:47:53 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
