Created attachment 1745663 [details] One of the sdn pod logs with OPENSHIFT_SDN_LOG_LEVEL=5 Description of problem: I have noticed weird networking issues on the cluster after installing amq-streams Kafka (hanging TCP connections to pods on some node from other nodes) sdn logs show suspicious errors when Kafka (and its NetworkPolicies) is created: sdn-dkdnz sdn I0108 17:32:13.863539 1819 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address sdn-dkdnz sdn E0108 17:32:13.863579 1819 networkpolicy.go:277] Error syncing OVS flows: timed out waiting for the condition Version-Release number of selected component (if applicable): 4.6.9 How reproducible: The errors in the logs always. The networking issues sometimes, after updating worker the workers machineconfig (e.g. by changing images.config.openshift.io config) which triggers all pods to be restarted on the nodes. Steps to Reproduce: 1. (Reproduced on a default OCP 4.6.9 installed on AWS (IPI) with 3 workers) 2. stern --namespace openshift-sdn sdn --include Error 3. Install amq-streams operator from redhat-operators 4. oc new-project kafka 5. Create a default Kafka CR in the kafka namespace (e.g. in the Console) 6. Notice the errors in SDN logs Actual results: Suspicious SDN errors in sdn logs Networking issues after machineconfig update with Kafka installed. Expected results: No suspicious SDN errors in sdn logs No networking issues after machineconfig update Additional info: I0108 17:31:21.715913 1819 ovs.go:151] Executing: ovs-ofctl -O OpenFlow13 bundle br0 - << flow delete table=80, reg1=8083932 flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=3888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=3888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=3888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=3888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=3888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.129.2.17, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=3888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=2888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.8, tcp, tp_dst=3888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=2888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.129.2.17, tcp, tp_dst=3888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=2888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.131.0.18, reg0=8083932, ip, nw_src=10.131.0.18, tcp, tp_dst=3888, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9092, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9092, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9093, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=, tcp, tp_dst=9093, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=3399858, ip, nw_src=10.128.2.7, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=8083932, ip, nw_src=, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, reg0=8083932, ip, nw_src=10.128.2.9, tcp, tp_dst=9091, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, tcp, tp_dst=9092, actions=output:NXM_NX_REG2[] flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.9, tcp, tp_dst=9093, actions=output:NXM_NX_REG2[] flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.128.2.8, actions=drop flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.129.2.17, actions=drop flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.131.0.18, actions=drop flow add table=80, priority=100, reg1=8083932, ip, nw_dst=10.128.2.9, actions=drop flow add table=80, priority=100, reg1=8083932, ip, nw_dst=, actions=drop flow add table=80, priority=50, reg1=8083932, actions=output:NXM_NX_REG2[] I0108 17:31:21.735682 1819 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address
Created attachment 1745668 [details] must-gather at state with broken networking Attaching must-gather from a state when the cluster has two alerts 50% of the image-registry/image-registry targets in openshift-image-registry namespace are down. 50% of the router-internal-default/router-internal-default targets in openshift-ingress namespace are down. and 50% of HTTP requests to http://httpd-example-foobar.apps... route hang (timeout)
Created attachment 1745671 [details] Kafka network policies
Hi Marek, Can you try manually restarting openvswitch with `systemctl restart openvswitch` on all nodes to see if that remedies the issues? thanks, Andrew
(In reply to Andrew Stoycos from comment #3) > Hi Marek, > > Can you try manually restarting openvswitch with `systemctl restart > openvswitch` on all nodes to see if that remedies the issues? > > thanks, Andrew Ruuning `systemctl restart openvswitch` on all workers resolved all the alerts and the route works again.
After some testing I was still unable to reproduce this issue on a development cluster, however we did learn a few things. It seems that restarting systemd's openvswitch is a viable fix both in this instance and the associated one seen in https://bugzilla.redhat.com/show_bug.cgi?id=1914284 which leads me to believe it is in fact a regression with SDN going into 4.6.9. To combat this problem all upgrades to 4.6.9 have been until we can pinpoint exactly what's going on. Due to the problems association with only OCP4.6.9, the absence of any SDN changes in 4.6.7 & 4.6.8, and the issue's relation to the network policy OVS rule table, we believe the problem originated within the following PR which refactored much of the Network Policy code -> https://github.com/openshift/sdn/pull/228 Specifically I think it has to do with the syncNamespace() function in the SDN code, where updates to OVS rules are never getting applied... With errors like `I0108 17:31:21.735682 1819 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address` Moving forward we will work with the rest of the team to find a fix and push a patch as soon as possible. - Andrew
consolidating the two bugs *** This bug has been marked as a duplicate of bug 1914284 ***
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add keyword again. [1]: https://github.com/openshift/enhancements/pull/475