Bug 1914393

Summary: amq-streams Kafka networkpolicies cause SDN errors "Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address"
Product: OpenShift Container Platform
Reporter: Marek Schmidt <maschmid>
Component: Networking
Assignee: Ben Bennett <bbennett>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
CC: afield, astoycos, bleanhar, danw, dmoessne, igreen, lmohanty, sreber
Version: 4.6
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-01-11 14:05:32 UTC
Type: Bug
Attachments:
- One of the sdn pod logs with OPENSHIFT_SDN_LOG_LEVEL=5
- must-gather at state with broken networking
- Kafka network policies

Description Marek Schmidt 2021-01-08 17:50:15 UTC
Created attachment 1745663 [details]
One of the sdn pod logs with OPENSHIFT_SDN_LOG_LEVEL=5

Description of problem:

I have noticed odd networking issues on the cluster after installing amq-streams Kafka (TCP connections from other nodes to pods on certain nodes hang).

The sdn logs show suspicious errors when the Kafka cluster (and its NetworkPolicies) is created:

sdn-dkdnz sdn I0108 17:32:13.863539    1819 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address
sdn-dkdnz sdn E0108 17:32:13.863579    1819 networkpolicy.go:277] Error syncing OVS flows: timed out waiting for the condition

Version-Release number of selected component (if applicable):
4.6.9

How reproducible:
The errors in the logs appear every time.

The networking issues appear only sometimes, after updating the workers' machineconfig (e.g. by changing the images.config.openshift.io config), which triggers all pods on the nodes to be restarted.

Steps to Reproduce:
1. (Reproduced on a default OCP 4.6.9 installed on AWS (IPI) with 3 workers)
2. stern --namespace openshift-sdn sdn --include Error
3. Install amq-streams operator from redhat-operators
4. oc new-project kafka
5. Create a default Kafka CR in the kafka namespace (e.g. in the Console)
6. Notice the errors in SDN logs

Actual results:
Suspicious SDN errors in sdn logs
Networking issues after machineconfig update with Kafka installed.

Expected results:
No suspicious SDN errors in sdn logs
No networking issues after machineconfig update 

Additional info:

I0108 17:31:21.715913    1819 ovs.go:151] Executing: ovs-ofctl -O OpenFlow13 bundle br0 - <<
flow delete table=80, reg1=8083932
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9092,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9092,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9093,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9093,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, tcp, tp_dst=9092,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, tcp, tp_dst=9093,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.128.2.8, actions=drop
flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.129.2.17, actions=drop
flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.131.0.18, actions=drop
flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.128.2.9, actions=drop
flow add table=80, priority=100, reg1=8083932, ip, nw_dst=, actions=drop
flow add table=80, priority=50, reg1=8083932, actions=output:NXM_NX_REG2[]
I0108 17:31:21.735682    1819 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address
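The lines that break the bundle above are the ones with an empty `nw_src=` or `nw_dst=` value. A hypothetical Go sketch (this is not the actual openshift/sdn code, and `buildAllowFlow` is an invented name) of how naive string-formatting of a flow from an unresolved peer address produces exactly that shape:

```go
package main

import "fmt"

// buildAllowFlow mimics, hypothetically, how a NetworkPolicy peer might be
// rendered into an OVS flow rule. If the peer's source address resolves to
// the empty string, the rendered flow contains a bare "nw_src=", which
// ovs-ofctl rejects with "invalid IP address", aborting the whole bundle.
func buildAllowFlow(dst, src string, port int) string {
	return fmt.Sprintf(
		"flow add table=80, priority=150, ip, nw_dst=%s, ip, nw_src=%s, tcp, tp_dst=%d, actions=output:NXM_NX_REG2[]",
		dst, src, port)
}

func main() {
	// A well-formed flow with both addresses present.
	fmt.Println(buildAllowFlow("10.128.2.8", "10.128.2.7", 2181))
	// An empty source address yields a bare "nw_src=" match.
	fmt.Println(buildAllowFlow("10.128.2.8", "", 2181))
}
```

Because ovs-ofctl parses the bundle as a unit, a single malformed line like this is enough to make every flow update in the sync fail.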

Comment 1 Marek Schmidt 2021-01-08 18:33:56 UTC
Created attachment 1745668 [details]
must-gather at state with broken networking

Attaching a must-gather from a state in which the cluster has the following two alerts:


50% of the image-registry/image-registry targets in openshift-image-registry namespace are down.
50% of the router-internal-default/router-internal-default targets in openshift-ingress namespace are down.

and 50% of HTTP requests to the http://httpd-example-foobar.apps... route hang (timeout).

Comment 2 Marek Schmidt 2021-01-08 19:10:35 UTC
Created attachment 1745671 [details]
Kafka network policies

Comment 3 Andrew Stoycos 2021-01-08 20:40:31 UTC
Hi Marek,

Can you try manually restarting openvswitch with `systemctl restart openvswitch` on all nodes to see if that remedies the issues? 

thanks, Andrew

Comment 4 Marek Schmidt 2021-01-08 20:56:31 UTC
(In reply to Andrew Stoycos from comment #3)
> Hi Marek,
> 
> Can you try manually restarting openvswitch with `systemctl restart
> openvswitch` on all nodes to see if that remedies the issues? 
> 
> thanks, Andrew

Running `systemctl restart openvswitch` on all workers resolved all the alerts, and the route works again.

Comment 5 Andrew Stoycos 2021-01-09 01:50:42 UTC
After some testing I was still unable to reproduce this issue on a development cluster; however, we did learn a few things.

It seems that restarting openvswitch via systemd is a viable fix both in this instance and in the associated one seen in https://bugzilla.redhat.com/show_bug.cgi?id=1914284, which leads me to believe it is in fact a regression in SDN going into 4.6.9. To combat this problem, all upgrades to 4.6.9 have been blocked until we can pinpoint exactly what's going on.

Because the problem is associated only with OCP 4.6.9, there were no SDN changes in 4.6.7 and 4.6.8, and the issue relates to the network policy OVS rule table, we believe the problem originated in the following PR, which refactored much of the NetworkPolicy code:

https://github.com/openshift/sdn/pull/228

Specifically, I think it has to do with the syncNamespace() function in the SDN code, where updates to the OVS rules never get applied, with errors like:

`I0108 17:31:21.735682    1819 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address` 

Moving forward we will work with the rest of the team to find a fix and push a patch as soon as possible. 
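Any fix in that area would presumably need to validate peer addresses before rendering them into flow matches, so that an unresolved peer fails loudly instead of emitting a bare `nw_src=`. A hypothetical Go sketch of such a guard (this is not the actual patch, which was tracked via the duplicate bug; `renderSrcMatch` is an invented name):

```go
package main

import (
	"fmt"
	"net"
)

// renderSrcMatch returns an nw_src match clause for a peer CIDR, or an
// error if the CIDR is empty or unparseable. This is a sketch of the kind
// of guard a fix might add, not the actual openshift/sdn change.
func renderSrcMatch(cidr string) (string, error) {
	if cidr == "" {
		return "", fmt.Errorf("empty source CIDR")
	}
	if _, _, err := net.ParseCIDR(cidr); err != nil {
		return "", fmt.Errorf("bad source CIDR %q: %v", cidr, err)
	}
	return "nw_src=" + cidr, nil
}

func main() {
	m, err := renderSrcMatch("10.128.2.7/32")
	fmt.Println(m, err)
	_, err = renderSrcMatch("")
	fmt.Println(err)
}
```

Rejecting the bad peer up front would let the rest of the bundle apply, instead of one malformed line invalidating every flow in the sync.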

- Andrew

Comment 6 Dan Winship 2021-01-11 14:05:32 UTC
consolidating the two bugs

*** This bug has been marked as a duplicate of bug 1914284 ***

Comment 7 W. Trevor King 2021-04-05 17:47:36 UTC
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475