+++ This bug was initially created as a clone of Bug #1790440 +++
+++ This bug was initially created as a clone of Bug #1790407 +++

Description of problem:
No new OVS flows are added to table 80 after restarting the SDN pod and creating an allow-all networkpolicy.

Version-Release number of selected component (if applicable):
Versions: 4.4.0-0.nightly-2020-04-07-130324

How reproducible:
Frequently using automation, difficult using manual steps

Steps to Reproduce:
Exact automation steps with timestamps (the deny-all and allow-all policy manifests used in steps 3 and 6 are sketched after the expected results).

1. Create a project

[19:13:28] INFO> Shell Commands: oc new-project gpjn8
Now using project "gpjn8" on server "https://api.example.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app ruby~https://github.com/sclorg/ruby-ex.git

to build a new example application in Python. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=gcr.io/hello-minikube-zero-install/hello-node

2. Create two pods in this project.

[19:13:29] INFO> Shell Commands: oc create -f verification-tests/features/tierN/testdata/networking/list_for_pods.json
replicationcontroller/test-rc created
service/test-service created
[19:13:30] INFO> Exit Status: 0
[19:13:30] INFO> oc get pods -o wide -l name\=test-pods -n gpjn8
NAME            READY   STATUS    RESTARTS   AGE   IP            NODE                  NOMINATED NODE   READINESS GATES
test-rc-bt7r6   1/1     Running   0          69m   10.131.0.20   worker-westus-9d9nh   <none>           <none>
test-rc-d4mjw   1/1     Running   0          69m   10.129.2.11   worker-westus-74s72   <none>           <none>

3. Create a deny-all networkpolicy in the project.

[19:13:43] INFO> Shell Commands: oc create -f verification-tests/testdata/networking/networkpolicy/defaultdeny-v1-semantic.yaml -n gpjn8
networkpolicy.networking.k8s.io/default-deny created
[19:13:43] INFO> Exit Status: 0

4. Verify pod traffic is denied.

[19:13:44] INFO> Shell Commands: oc exec test-rc-bt7r6 -n gpjn8 -- curl -s --connect-timeout 100 10.129.2.11:8080
STDERR:
command terminated with exit code 28
[19:15:25] INFO> Exit Status: 28

5. Restart the SDN pod on the node where pod test-rc-bt7r6 is located (worker-westus-9d9nh).

[19:15:26] INFO> Shell Commands: oc delete pods sdn-w2x27 --wait=false --namespace=openshift-sdn
pod "sdn-w2x27" deleted
[19:15:26] INFO> Exit Status: 0

6. After the new SDN pod is running, create an allow-all networkpolicy in the project.

[19:15:39] INFO> Shell Commands: oc create -f verification-tests/features/tierN/testdata/networking/networkpolicy/allow-all.yaml
networkpolicy.networking.k8s.io/allow-all created
[19:15:39] INFO> Exit Status: 0

7. Try to curl from the pod on the node on which the SDN pod was deleted.

8. Try to curl to the pod on the node on which the SDN pod was deleted.

[19:15:40] INFO> Shell Commands: oc exec test-rc-d4mjw -n gpjn8 -- curl -s --connect-timeout 140 10.131.0.20:8080
STDERR:
command terminated with exit code 7
[19:17:54] INFO> Exit Status: 7

Actual Result:
The curl fails when attempting to connect to the pod on the node on which the SDN pod was deleted.

# dump the openflows for the netnamespace for the project
netnamespace=$(printf "%x\n" $(oc get netnamespaces $(oc get pods -o jsonpath={.items[0].metadata.namespace}) -o jsonpath={.netid}))

# dump all the matching flows
for n in $(oc get pods -o wide -o jsonpath={.items[*].spec.nodeName}) ; do for f in $(oc get pod -n openshift-sdn -l app=ovs --field-selector=spec.nodeName=$n -o jsonpath={.items[0].metadata.name}) ; do echo $f ; oc -n openshift-sdn exec $f -- ovs-ofctl -O OpenFlow13 dump-flows br0 | grep $netnamespace | tee flows-$netnamespace-$f ; done ; done

flows-node-worker-westus-9d9nh:
cookie=0x0, duration=2831.705s, table=20, n_packets=6, n_bytes=252, priority=100,arp,in_port=21,arp_spa=10.131.0.20,arp_sha=00:00:0a:83:00:14/00:00:ff:ff:ff:ff actions=load:0xc43e71->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=2831.705s, table=20, n_packets=13, n_bytes=1002, priority=100,ip,in_port=21,nw_src=10.131.0.20 actions=load:0xc43e71->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=2831.705s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.131.0.20 actions=load:0xc43e71->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=2831.706s, table=70, n_packets=11, n_bytes=924, priority=100,ip,nw_dst=10.131.0.20 actions=load:0xc43e71->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80

flows-node-worker-westus-74s72:
cookie=0x0, duration=2832.078s, table=20, n_packets=6, n_bytes=252, priority=100,arp,in_port=12,arp_spa=10.129.2.11,arp_sha=00:00:0a:81:02:0b/00:00:ff:ff:ff:ff actions=load:0xc43e71->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=2832.078s, table=20, n_packets=11, n_bytes=924, priority=100,ip,in_port=12,nw_src=10.129.2.11 actions=load:0xc43e71->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=2832.078s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.129.2.11 actions=load:0xc43e71->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=2832.078s, table=70, n_packets=13, n_bytes=1002, priority=100,ip,nw_dst=10.129.2.11 actions=load:0xc43e71->NXM_NX_REG1[],load:0xc->NXM_NX_REG2[],goto_table:80
cookie=0x0, duration=2703.262s, table=80, n_packets=3, n_bytes=286, priority=150,reg1=0xc43e71 actions=output:NXM_NX_REG2[]

The failing node is missing the table=80 priority=150 rule with actions=output:NXM_NX_REG2[].

Expected results:
The pods in that project should be able to talk to each other after the allow-all policy is added.
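For manual reproduction without the verification-tests checkout, the two policies used in steps 3 and 6 should be roughly equivalent to the manifests below. This is a sketch based on the object names in the log (default-deny and allow-all); the exact contents of defaultdeny-v1-semantic.yaml and allow-all.yaml are not attached here, so treat these as assumptions:

# deny-all: select every pod in the namespace and allow no ingress
oc create -n gpjn8 -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF

# allow-all: select every pod and allow ingress from any source
oc create -n gpjn8 -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
spec:
  podSelector: {}
  ingress:
  - {}
EOF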
> Frequently using automation, difficult using manual steps

I suspect this may be related to the rule being created while the SDN pod is stopped. Can you add a 30-second sleep to the automated test between the SDN pod being deleted and the allow-all rule being created, launch the test a bunch of times, and see if it reproduces that way please?
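In other words, something like this in the test sequence (a sketch; the SDN pod name changes on every run):

oc delete pods sdn-w2x27 --wait=false --namespace=openshift-sdn
# proposed: give the replacement SDN pod time to come up before the policy is created
sleep 30
oc create -f verification-tests/features/tierN/testdata/networking/networkpolicy/allow-all.yaml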
I added a "wait for SDN pod ready" step to the automation and that seems to have fixed the issue. I'm not sure what the synchronization expectation is if the networkpolicy is created when not all the SDN pods are ready. Note: it seems to only take ~15 seconds for the SDN pod to be ready.

[16:32:11] INFO> Shell Commands: oc delete pods sdn-582sr --wait=false --namespace=openshift-sdn
[16:32:26] INFO> Shell Commands: oc create -f features/tierN/testdata/networking/networkpolicy/allow-all.yaml
networkpolicy.networking.k8s.io/allow-all created
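For reference, a generic way to express that wait outside the test framework would be something along these lines (a sketch, assuming the SDN pods in the openshift-sdn namespace carry the app=sdn label and belong to the sdn daemonset):

# wait until the SDN daemonset has finished rolling out on all nodes
oc -n openshift-sdn rollout status daemonset/sdn --timeout=120s

# or wait for the individual SDN pods to report Ready
oc -n openshift-sdn wait --for=condition=Ready pod -l app=sdn --timeout=120s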
Hi Ross, I see this is in the errata, but I don't think this should be published in an errata at all. Can you remove it please? This isn't really solved, although I think it's fine to have the test wait for the SDN pod to be recreated and leave this open at a lower priority, or even close it as WONTFIX, but definitely not verified.

This is an extreme corner case. It happens only under the following conditions:
1- There is a deny-all networkpolicy without any allow policy at all.
2- The SDN pod stops for whatever reason.
3- Before the SDN pod starts again, someone creates an allow-all rule. It must be an allow-all policy; a rule allowing specific ports or only a subset of pods in the namespace won't trigger this (a contrasting example is sketched below).

This is an incredibly specific corner case, and I don't think anyone will ever hit it. But nevertheless it's still not fixed and shouldn't be verified. Thanks!
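To illustrate condition 3, a narrower policy such as the following hypothetical one (allowing only TCP 8080 from pods in the same namespace) would not trigger the problem, per the conditions above:

oc create -n gpjn8 -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-8080-same-namespace
spec:
  podSelector: {}
  ingress:
  - from:
    - podSelector: {}
    ports:
    - protocol: TCP
      port: 8080
EOF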
Hi Juan, IMO this issue happens because the SDN pod is not yet fully working, so the networkpolicy is not applied as expected. I think this can be accepted, since the policy will take effect once the SDN pod has fully started.
Sorry for the verified, I'm still learning the BZ workflows. I agree this is still an issue. When creating network policies, customers shouldn't have to wait 30 seconds or wait for any particular SDN pods to be ready; once a network policy is created, it should be implemented correctly.
Zhanqi, it's not worth the effort since it's an edge case and it's almost impossible to hit in real life. If a customer actually hits this we'll fix it, but I find that extremely unlikely.