Bug 1914393 - amq-streams Kafka networkpolicies cause SDN errors "Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address"
Summary: amq-streams Kafka networkpolicies cause SDN errors "Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address"
Keywords:
Status: CLOSED DUPLICATE of bug 1914284
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-08 17:50 UTC by Marek Schmidt
Modified: 2024-03-25 17:47 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-11 14:05:32 UTC
Target Upstream Version:
Embargoed:


Attachments
One of the sdn pod logs with OPENSHIFT_SDN_LOG_LEVEL=5 (3.08 MB, text/plain)
2021-01-08 17:50 UTC, Marek Schmidt
must-gather at state with broken networking (19.21 MB, application/x-bzip)
2021-01-08 18:33 UTC, Marek Schmidt
Kafka network policies (5.46 KB, text/plain)
2021-01-08 19:10 UTC, Marek Schmidt

Description Marek Schmidt 2021-01-08 17:50:15 UTC
Created attachment 1745663 [details]
One of the sdn pod logs with OPENSHIFT_SDN_LOG_LEVEL=5

Description of problem:

I have noticed weird networking issues on the cluster after installing amq-streams Kafka (TCP connections from other nodes to pods on some nodes hang).

sdn logs show suspicious errors when Kafka (and its NetworkPolicies) is created:

sdn-dkdnz sdn I0108 17:32:13.863539    1819 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address
sdn-dkdnz sdn E0108 17:32:13.863579    1819 networkpolicy.go:277] Error syncing OVS flows: timed out waiting for the condition

Version-Release number of selected component (if applicable):
4.6.9

How reproducible:
The errors appear in the logs every time.

The networking issues appear only sometimes, after updating the workers' machineconfig (e.g. by changing the images.config.openshift.io config), which triggers a restart of all pods on the nodes.

Steps to Reproduce:
1. (Reproduced on a default OCP 4.6.9 installed on AWS (IPI) with 3 workers)
2. stern --namespace openshift-sdn sdn --include Error
3. Install amq-streams operator from redhat-operators
4. oc new-project kafka
5. Create a default Kafka CR in the kafka namespace (e.g. in the Console)
6. Notice the errors in the SDN logs (a plain-oc way to collect them is sketched below)
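
If stern is not available, here is a rough sketch for pulling the same errors out of every sdn pod with plain oc; it assumes the sdn DaemonSet pods carry the app=sdn label and that the main container is named sdn (as the log prefix above suggests):

# grep every sdn pod's log for the ovs-ofctl / flow-sync errors
for pod in $(oc -n openshift-sdn get pods -l app=sdn -o name); do
  echo "== $pod =="
  oc -n openshift-sdn logs "$pod" -c sdn | grep -E 'Error executing ovs-ofctl|Error syncing OVS flows'
done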

Actual results:
Suspicious SDN errors in sdn logs
Networking issues after machineconfig update with Kafka installed.

Expected results:
No suspicious SDN errors in sdn logs
No networking issues after machineconfig update 

Additional info:

I0108 17:31:21.715913    1819 ovs.go:151] Executing: ovs-ofctl -O OpenFlow13 bundle br0 - <<
flow delete table=80, reg1=8083932
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2181,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=3888,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9092,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9092,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9093,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9093,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=9091,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, tcp, tp_dst=9092,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, tcp, tp_dst=9093,  actions=output:NXM_NX_REG2[]
flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.128.2.8, actions=drop
flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.129.2.17, actions=drop
flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.131.0.18, actions=drop
flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.128.2.9, actions=drop
flow add table=80, priority=100, reg1=8083932, ip, nw_dst=, actions=drop
flow add table=80, priority=50, reg1=8083932, actions=output:NXM_NX_REG2[]
I0108 17:31:21.735682    1819 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address

Comment 1 Marek Schmidt 2021-01-08 18:33:56 UTC
Created attachment 1745668 [details]
must-gather at state with broken networking

Attaching a must-gather from a state in which the cluster has two alerts firing:


50% of the image-registry/image-registry targets in openshift-image-registry namespace are down.
50% of the router-internal-default/router-internal-default targets in openshift-ingress namespace are down.

and 50% of HTTP requests to the http://httpd-example-foobar.apps... route hang (time out)
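
A rough way to quantify that hang rate from outside the cluster is to loop requests with a short timeout; a sketch (the route host below is a placeholder for the real, elided host):

# hypothetical check: repeat requests against the affected route and report timeouts
ROUTE=http://httpd-example-foobar.apps.example.com   # placeholder host
for i in $(seq 1 20); do
  curl -sS -m 5 -o /dev/null -w "attempt $i: HTTP %{http_code} in %{time_total}s\n" "$ROUTE" \
    || echo "attempt $i: timed out"
done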

Comment 2 Marek Schmidt 2021-01-08 19:10:35 UTC
Created attachment 1745671 [details]
Kafka network policies
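
(The attachment was presumably collected with something along these lines; shown only as a hint for anyone who wants to compare against their own cluster:)

# list and dump the NetworkPolicies the operator created in the kafka namespace
oc -n kafka get networkpolicy
oc -n kafka get networkpolicy -o yaml > kafka-networkpolicies.yaml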

Comment 3 Andrew Stoycos 2021-01-08 20:40:31 UTC
Hi Marek,

Can you try manually restarting openvswitch with `systemctl restart openvswitch` on all nodes to see if that remedies the issues? 

thanks, Andrew

Comment 4 Marek Schmidt 2021-01-08 20:56:31 UTC
(In reply to Andrew Stoycos from comment #3)
> Hi Marek,
> 
> Can you try manually restarting openvswitch with `systemctl restart
> openvswitch` on all nodes to see if that remedies the issues? 
> 
> thanks, Andrew

Running `systemctl restart openvswitch` on all workers resolved all the alerts, and the route works again.
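
For reference, a minimal sketch of applying that workaround to every worker without SSH access, via oc debug (the label selector is the standard worker role label; adjust as needed):

# restart openvswitch on each worker node through a debug pod
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  echo "== $node =="
  oc debug "$node" -- chroot /host systemctl restart openvswitch
done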

Comment 5 Andrew Stoycos 2021-01-09 01:50:42 UTC
After some testing I was still unable to reproduce this issue on a development cluster; however, we did learn a few things.

It seems that restarting openvswitch via systemd is a viable fix both in this instance and in the associated one seen in https://bugzilla.redhat.com/show_bug.cgi?id=1914284, which leads me to believe it is in fact an SDN regression introduced in 4.6.9. To combat this problem, all upgrades to 4.6.9 have been blocked until we can pinpoint exactly what's going on.

Due to the problem's association with only OCP 4.6.9, the absence of any SDN changes in 4.6.7 and 4.6.8, and the issue's relation to the network policy OVS rule table, we believe the problem originated in the following PR, which refactored much of the network policy code:

https://github.com/openshift/sdn/pull/228

Specifically, I think it has to do with the syncNamespace() function in the SDN code, where updates to the OVS rules never get applied, producing errors like

`I0108 17:31:21.735682    1819 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address` 
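
In the bundle quoted in the description, the first flow with an empty nw_src= sits on line 3 of the generated input, which lines up with the "-:3:" position in the error, so the missing source IP appears to be what ovs-ofctl rejects. A minimal sketch that reproduces the parse failure in isolation (ovs-ofctl parse-flows only parses the text and does not touch br0; the exact error wording may differ by OVS version):

cat > /tmp/flows.txt <<'EOF'
table=80, priority=150, ip, nw_src=10.128.2.7, tcp, tp_dst=2181, actions=drop
table=80, priority=150, ip, nw_src=, tcp, tp_dst=2181, actions=drop
EOF
ovs-ofctl parse-flows /tmp/flows.txt
# the first entry parses; the second fails with an "invalid IP address" error like the one above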

Moving forward we will work with the rest of the team to find a fix and push a patch as soon as possible. 

- Andrew

Comment 6 Dan Winship 2021-01-11 14:05:32 UTC
consolidating the two bugs

*** This bug has been marked as a duplicate of bug 1914284 ***

Comment 7 W. Trevor King 2021-04-05 17:47:36 UTC
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

