Description of problem:

- When using the NetworkPolicy networking plugin, the customer set the NetId of a namespace to the value '0'.
- After experiencing a major incident in which pods lost network connectivity, we removed all NetworkPolicy rules from the cluster.
- This temporarily mitigated the issue, but it is not the desired state.
- The customer would like to re-enable the NetworkPolicies, but there is a suspicion that on some nodes the NetworkPolicy rules are not being written to the OVS tables (router running in OVS pods), which renders the pods inaccessible (connectivity is lost again).
- Based on the troubleshooting, it looks like the tables in the OVS pods on the affected nodes are not updated with the NetworkPolicy rules:

---
Prereqs:

$ oc get netnamespace |grep ocp-devtools-mon-deva
ocp-devtools-mon-deva 10705001 [<ip>]

NetId 10705001 is 0xa35869 in hexadecimal.

----------------------------------------------------------------------
Non-working example

$ oc get nodes |grep <node1>
<node1> Ready compute,oneagent,vmware-ocp 311d v1.11.0+d4cacc0

$ oc get pods -n ocp-devtools-mon-deva -o wide |grep dsweb-wd4lf
dsweb-wd4lf 1/1 Running 1 29d <ip> <node1> <none>

$ oc get pods -n openshift-sdn -o wide |grep <node>
ovs-4fj5p 1/1 Running 2 29d <ip> <node1> <none>

$ oc exec -n openshift-sdn ovs-4fj5p -- ovs-ofctl -O OpenFlow13 dump-flows br0 |grep 'table=80'
cookie=0x0, duration=14479.183s, table=80, n_packets=13090, n_bytes=1620191, priority=300,ip,nw_src=<ip> actions=output:NXM_NX_REG2[]
cookie=0x0, duration=14479.155s, table=80, n_packets=41373, n_bytes=8336737, priority=200,ct_state=+rpl,ip actions=output:NXM_NX_REG2[]
cookie=0x0, duration=13421.650s, table=80, n_packets=0, n_bytes=0, priority=50,reg1=0xf05199 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=484.252s, table=80, n_packets=0, n_bytes=0, priority=50,reg1=0x4bd73 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=168.378s, table=80, n_packets=342, n_bytes=74727, priority=50,reg1=0x9205bc actions=output:NXM_NX_REG2[]
cookie=0x0, duration=14479.183s, table=80, n_packets=0, n_bytes=0, priority=0 actions=drop

The following NetworkPolicies are missing in OVS on this node (compare with the working example):

$ oc get networkpolicy -n ocp-devtools-mon-deva
NAME                              POD-SELECTOR   AGE
allow-from-default-namespace      <none>         42d
allow-from-prometheus-namespace   <none>         55d
allow-from-same-namespace         <none>         55d
deny-by-default                   <none>         55d

----------------------------------------------------------------------
Working example

$ oc get nodes |grep <node2>
<node2> Ready compute,vmware-ocp 290d v1.11.0+d4cacc0

$ oc get pods -n ocp-devtools-mon-deva -o wide |grep dsweb-z2qqj
dsweb-z2qqj 1/1 Running 0 48d <ip> <node2> <none>

$ oc get pods -n openshift-sdn -o wide |grep <node2>
ovs-wkmvj 1/1 Running 0 65d <ip> <node2> <none>

NetworkPolicies are visible in the OVS tables:

$ oc get networkpolicy -n ocp-devtools-mon-deva
NAME                              POD-SELECTOR   AGE
allow-from-default-namespace      <none>         42d
allow-from-prometheus-namespace   <none>         55d
allow-from-same-namespace         <none>         55d
deny-by-default                   <none>         55d

$ oc exec -n openshift-sdn ovs-wkmvj -- ovs-ofctl -O OpenFlow13 dump-flows br0 |grep 'table=80'
cookie=0x0, duration=5662032.628s, table=80, n_packets=115125529, n_bytes=10235452960, priority=300,ip,nw_src=<ip> actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5662032.607s, table=80, n_packets=1824916152, n_bytes=1496307028960, priority=200,ct_state=+rpl,ip actions=output:NXM_NX_REG2[]
cookie=0x0, duration=3644103.874s, table=80, n_packets=2987772, n_bytes=214023819, priority=150,reg0=0,reg1=0xa35869 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=3644103.874s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x407188,reg1=0xa35869 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=3644103.874s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0xa35869,reg1=0xa35869 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=198753.658s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0,reg1=0x78ba28 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=198753.658s, table=80, n_packets=0, n_bytes=0, priority=150,reg0=0x407188,reg1=0x78ba28 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=198753.658s, table=80, n_packets=261, n_bytes=32776, priority=150,reg0=0x78ba28,reg1=0x78ba28 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5662023.066s, table=80, n_packets=24868, n_bytes=2111488, priority=50,reg1=0x4bd73 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5662022.941s, table=80, n_packets=0, n_bytes=0, priority=50,reg1=0xa604c1 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5662017.893s, table=80, n_packets=52, n_bytes=2808, priority=50,reg1=0x95a600 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5662013.039s, table=80, n_packets=3282, n_bytes=2028725, priority=50,reg1=0x4d1eec actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5661992.178s, table=80, n_packets=331716, n_bytes=28232378, priority=50,reg1=0x34212e actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5661991.084s, table=80, n_packets=4477638, n_bytes=380666476, priority=50,reg1=0x9f7b8e actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5661983.260s, table=80, n_packets=22250, n_bytes=11875374, priority=50,reg1=0x2dfe7d actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5661971.680s, table=80, n_packets=0, n_bytes=0, priority=50,reg1=0x45d534 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5661970.667s, table=80, n_packets=2924, n_bytes=204680, priority=50,reg1=0xc57314 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5661970.662s, table=80, n_packets=0, n_bytes=0, priority=50,reg1=0x25128d actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5661961.367s, table=80, n_packets=841888, n_bytes=73446979, priority=50,reg1=0xbc7f17 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=4858591.544s, table=80, n_packets=9, n_bytes=678, priority=50,reg1=0xe123cd actions=output:NXM_NX_REG2[]
cookie=0x0, duration=3818442.586s, table=80, n_packets=3412320, n_bytes=308286697, priority=50,reg1=0x5d36ac actions=output:NXM_NX_REG2[]
cookie=0x0, duration=3303336.127s, table=80, n_packets=55132, n_bytes=3857262, priority=50,reg1=0xc023a4 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=3275725.117s, table=80, n_packets=0, n_bytes=0, priority=50,reg1=0xc8f512 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=2998069.735s, table=80, n_packets=386, n_bytes=27066, priority=50,reg1=0xf7bd66 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=2948757.827s, table=80, n_packets=2189, n_bytes=332964, priority=50,reg1=0xcb8224 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=2669176.887s, table=80, n_packets=2282500, n_bytes=282648430, priority=50,reg1=0x5ad472 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=2625523.131s, table=80, n_packets=1054980, n_bytes=150708050, priority=50,reg1=0x69ea7e actions=output:NXM_NX_REG2[]
cookie=0x0, duration=2625505.557s, table=80, n_packets=2321313, n_bytes=754698238, priority=50,reg1=0xe41c0a actions=output:NXM_NX_REG2[]
cookie=0x0, duration=2592354.031s, table=80, n_packets=180, n_bytes=51946, priority=50,reg1=0x73d882 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=2524625.027s, table=80, n_packets=196743, n_bytes=13937272, priority=50,reg1=0x9205bc actions=output:NXM_NX_REG2[]
cookie=0x0, duration=2521358.977s, table=80, n_packets=163984, n_bytes=13958962, priority=50,reg1=0x895225 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=2359221.803s, table=80, n_packets=1829928, n_bytes=128295238, priority=50,reg1=0x5f9c9 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=1753009.325s, table=80, n_packets=1462433, n_bytes=217226725, priority=50,reg1=0x20e620 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=1738872.604s, table=80, n_packets=89082, n_bytes=7677888, priority=50,reg1=0x35db44 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=1381503.630s, table=80, n_packets=552468, n_bytes=78088376, priority=50,reg1=0x1cfbdd actions=output:NXM_NX_REG2[]
cookie=0x0, duration=542353.595s, table=80, n_packets=563528, n_bytes=42879484, priority=50,reg1=0x5d4faa actions=output:NXM_NX_REG2[]
cookie=0x0, duration=280447.330s, table=80, n_packets=1156, n_bytes=436050, priority=50,reg1=0x73efff actions=output:NXM_NX_REG2[]
cookie=0x0, duration=101836.399s, table=80, n_packets=111960, n_bytes=7837584, priority=50,reg1=0xdd3688 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=17378.286s, table=80, n_packets=0, n_bytes=0, priority=50,reg1=0x18ba25 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=9229.523s, table=80, n_packets=817, n_bytes=299463, priority=50,reg1=0x72b3bc actions=output:NXM_NX_REG2[]
cookie=0x0, duration=1168.964s, table=80, n_packets=0, n_bytes=0, priority=50,reg1=0xceb332 actions=output:NXM_NX_REG2[]
cookie=0x0, duration=5662032.628s, table=80, n_packets=5669, n_bytes=419674, priority=0 actions=drop

---

- This is not the expected behaviour and we consider it a bug. We need the tables to be updated with the rules again.
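The comparison above can be repeated on any node with a small helper. This is only a sketch based on what the two dumps show: the working node has priority=150 entries in table 80 matching reg1=0xa35869 (the hex form of NetId 10705001), while the affected node has no entry for that value at all. The function names `netid_to_hex` and `flows_for_netid` are made up for illustration and are not part of `oc` or `ovs-ofctl`.

```shell
# Convert a decimal NetId (as shown by `oc get netnamespace`) to the
# hex form that appears in the reg0/reg1 matches of the flow dump.
netid_to_hex() {
    printf '0x%x\n' "$1"
}

# Read an `ovs-ofctl dump-flows` listing on stdin and print the table=80
# flows whose reg1 match equals the given decimal NetId.
# Exits non-zero when no such flow is present.
flows_for_netid() {
    local hex
    hex="$(netid_to_hex "$1")"
    grep 'table=80' | grep -E "reg1=${hex}( |,|\$)"
}

# Example (pod name taken from the non-working node above):
#   oc exec -n openshift-sdn ovs-4fj5p -- \
#       ovs-ofctl -O OpenFlow13 dump-flows br0 | flows_for_netid 10705001
```

An empty result for a namespace that has pods on the node would match the symptom described here, since traffic to those pods then falls through to the priority=0 drop rule.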
> - When using the NetworkPolicy networking plugin, customer set a NetId for a namespace to the value of '0'.

You cannot do this. Users cannot manually edit openshift-sdn's internal objects and expect things to keep working.

They need to revert any changes they made to NetNamespaces (ensuring that the namespace `default` has NetId 0, and every other namespace has a unique NetId), and then reboot all of the masters and nodes. That should be enough to make things start working again.
(In reply to Dan Winship from comment #1)
> > - When using the NetworkPolicy networking plugin, customer set a NetId for a namespace to the value of '0'.
>
> You cannot do this. Users cannot manually edit openshift-sdn's internal
> objects and expect things to keep working.
>
> They need to revert any changes they made to NetNamespaces (ensuring that
> the namespace `default` has NetId 0, and every other namespace has a unique
> NetId), and then reboot all of the masters and nodes. That should be enough
> to make things start working again.

Hello Dan,

the customer is not changing internal objects. This is just a dump of the tables showing that the network policy rules are not written into OVS, which causes OpenShift to reject the traffic to that particular namespace.

What should be collected in such cases so we can have a better look into it?

-Roman
So it turns out that the customer was trying to manually create NetNamespaces with pre-assigned EgressIPs. This doesn't work (though the documentation gives no hint of that fact); you have to create the Project/Namespace, then *wait for openshift-sdn to create the NetNamespace object itself*, and then you can assign the EgressIPs to the NetNamespace once it has been created.

I've filed bug 1928851 about making openshift-sdn at least not break if you get this wrong. It's possible we could actually make this work the way the customer expected it would. If not, we'll at least document the restriction better.

(For now, the workaround is that you have to let openshift-sdn create the NetNamespace and then modify it afterward, not create it yourself.)
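The ordering in the workaround can be sketched as a script. This is an illustration under assumptions, not a tested procedure for this cluster: the function names, the 60-second default timeout, and the one-second polling interval are all hypothetical, and the patch uses the `egressIPs` field of the NetNamespace object.

```shell
# Wait until openshift-sdn has created the NetNamespace for a project.
# Arguments: project name, optional timeout in seconds (assumed default: 60).
wait_for_netnamespace() {
    local name="$1" timeout="${2:-60}" waited=0
    until oc get netnamespace "$name" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "NetNamespace $name did not appear within ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Create the project first, let openshift-sdn create the NetNamespace,
# and only then add the egress IP to the existing object.
create_project_with_egress_ip() {
    local name="$1" egress_ip="$2"
    oc new-project "$name" || return 1
    wait_for_netnamespace "$name" || return 1
    oc patch netnamespace "$name" --type=merge \
        -p "{\"egressIPs\": [\"${egress_ip}\"]}"
}

# Example (project name and address are placeholders):
#   create_project_with_egress_ip my-project 192.0.2.10
```

The point of the sketch is only the order of operations: the NetNamespace is never created by hand, it is patched after openshift-sdn has created it.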