Bug 1821986

Summary: No new OVS flows are added to table 80 after restarting the SDN pod and creating an allow-all networkpolicy
Product: OpenShift Container Platform
Reporter: Ross Brattain <rbrattai>
Component: Networking
Assignee: Juan Luis de Sousa-Valadas <jdesousa>
Networking sub component: openshift-sdn
QA Contact: Ross Brattain <rbrattai>
Status: CLOSED WONTFIX
Docs Contact:
Severity: low
Priority: low
CC: anbhat, bbennett, huirwang, jdesousa, vcojot
Version: 4.4
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: SDN-QA-IMPACT
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1790440
Environment:
Last Closed: 2020-09-10 15:08:53 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Bug Depends On: 1790440    
Bug Blocks: 1790407, 1790805, 1793952    

Description Ross Brattain 2020-04-08 02:29:53 UTC
+++ This bug was initially created as a clone of Bug #1790440 +++

+++ This bug was initially created as a clone of Bug #1790407 +++

Description of problem:
No new OVS flows are added to table 80 after restarting the SDN pod and creating an allow-all networkpolicy

Version-Release number of selected component (if applicable):
Versions:

4.4.0-0.nightly-2020-04-07-130324


How reproducible:
Frequently using automation, difficult using manual steps

Steps to Reproduce:

Exact automation steps with timestamps

1. Create a project

[19:13:28] INFO> Shell Commands: oc new-project gpjn8
Now using project "gpjn8" on server "https://api.example.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app ruby~https://github.com/sclorg/ruby-ex.git

to build a new example application in Python. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=gcr.io/hello-minikube-zero-install/hello-node




2. Create two pods in this project.


[19:13:29] INFO> Shell Commands: oc create -f verification-tests/features/tierN/testdata/networking/list_for_pods.json
replicationcontroller/test-rc created
service/test-service created

[19:13:30] INFO> Exit Status: 0
[19:13:30] INFO> oc get pods -o wide -l name\=test-pods  -n gpjn8
NAME            READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
test-rc-bt7r6   1/1     Running   0          69m   10.131.0.20   worker-westus-9d9nh   <none>           <none>
test-rc-d4mjw   1/1     Running   0          69m   10.129.2.11   worker-westus-74s72   <none>           <none>



3. Create a deny-all networkpolicy in the project.

[19:13:43] INFO> Shell Commands: oc create -f verification-tests/testdata/networking/networkpolicy/defaultdeny-v1-semantic.yaml  -n gpjn8
networkpolicy.networking.k8s.io/default-deny created

[19:13:43] INFO> Exit Status: 0
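
The contents of defaultdeny-v1-semantic.yaml are not shown in the log; a minimal sketch of a deny-all policy matching the object name created above (an empty podSelector with policyTypes Ingress and no ingress rules denies all ingress in the namespace):

# Sketch only - the real test manifest may differ
oc create -n gpjn8 -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF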

4. Verify pod traffic is denied (the curl times out with exit code 28)
[19:13:44] INFO> Shell Commands: oc exec test-rc-bt7r6   -n gpjn8  -- curl -s --connect-timeout 100 10.129.2.11:8080

STDERR:
command terminated with exit code 28

[19:15:25] INFO> Exit Status: 28


5. Restart the SDN pod on the node where pod test-rc-bt7r6 is located (worker-westus-9d9nh)

[19:15:26] INFO> Shell Commands: oc delete pods sdn-w2x27 --wait=false  --namespace=openshift-sdn
pod "sdn-w2x27" deleted

[19:15:26] INFO> Exit Status: 0
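
For reference, the SDN pod that was deleted (sdn-w2x27) is the one scheduled on that node; a sketch of how it can be looked up, assuming the openshift-sdn daemonset pods carry the app=sdn label:

# Find the openshift-sdn pod running on the node that hosts test-rc-bt7r6
node=$(oc get pod test-rc-bt7r6 -n gpjn8 -o jsonpath={.spec.nodeName})
oc get pods -n openshift-sdn -l app=sdn --field-selector=spec.nodeName=$node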


6. After the new SDN pod is running, create an allow-all networkpolicy in the project.

[19:15:39] INFO> Shell Commands: oc create -f verification-tests/features/tierN/testdata/networking/networkpolicy/allow-all.yaml
networkpolicy.networking.k8s.io/allow-all created

[19:15:39] INFO> Exit Status: 0
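
The contents of allow-all.yaml are likewise not shown; a minimal sketch of an allow-all policy matching the object name created above (an empty podSelector with a single empty ingress rule allows all ingress in the namespace):

# Sketch only - the real test manifest may differ
oc create -n gpjn8 -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
spec:
  podSelector: {}
  ingress:
  - {}
EOF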



7. Try to curl to the pod on the node on which the SDN pod was deleted


[19:15:40] INFO> Shell Commands: oc exec test-rc-d4mjw   -n gpjn8  -- curl -s --connect-timeout 140 10.131.0.20:8080

STDERR:
command terminated with exit code 7

[19:17:54] INFO> Exit Status: 7



Actual results:
The curl fails (exit code 7, failed to connect) when attempting to connect to the pod on the node on which the SDN pod was deleted.



# Dump the OpenFlow entries that reference the project's netnamespace.
# The netid is converted to hex so it can be grepped in the flow dumps
# (it appears as the VNID, e.g. 0xc43e71 below).
netnamespace=$(printf "%x\n" $(oc get netnamespaces $(oc get pods -o jsonpath={.items[0].metadata.namespace}) -o jsonpath={.netid}))

# Dump all the matching flows from the ovs pod on each node that hosts a test pod
for n in $(oc get pods -o wide -o jsonpath={.items[*].spec.nodeName}); do
  for f in $(oc get pod -n openshift-sdn -l app=ovs --field-selector=spec.nodeName=$n -o jsonpath={.items[0].metadata.name}); do
    echo $f
    oc -n openshift-sdn exec $f -- ovs-ofctl -O OpenFlow13 dump-flows br0 | grep $netnamespace | tee flows-$netnamespace-$f
  done
done


flows-node-worker-westus-9d9nh:

cookie=0x0, duration=2831.705s, table=20, n_packets=6, n_bytes=252, priority=100,arp,in_port=21,arp_spa=10.131.0.20,arp_sha=00:00:0a:83:00:14/00:00:ff:ff:ff:ff actions=load:0xc43e71->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=2831.705s, table=20, n_packets=13, n_bytes=1002, priority=100,ip,in_port=21,nw_src=10.131.0.20 actions=load:0xc43e71->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=2831.705s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.131.0.20 actions=load:0xc43e71->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=2831.706s, table=70, n_packets=11, n_bytes=924, priority=100,ip,nw_dst=10.131.0.20 actions=load:0xc43e71->NXM_NX_REG1[],load:0x15->NXM_NX_REG2[],goto_table:80



flows-node-worker-westus-74s72:

cookie=0x0, duration=2832.078s, table=20, n_packets=6, n_bytes=252, priority=100,arp,in_port=12,arp_spa=10.129.2.11,arp_sha=00:00:0a:81:02:0b/00:00:ff:ff:ff:ff actions=load:0xc43e71->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=2832.078s, table=20, n_packets=11, n_bytes=924, priority=100,ip,in_port=12,nw_src=10.129.2.11 actions=load:0xc43e71->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=2832.078s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.129.2.11 actions=load:0xc43e71->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=2832.078s, table=70, n_packets=13, n_bytes=1002, priority=100,ip,nw_dst=10.129.2.11 actions=load:0xc43e71->NXM_NX_REG1[],load:0xc->NXM_NX_REG2[],goto_table:80
cookie=0x0, duration=2703.262s, table=80, n_packets=3, n_bytes=286, priority=150,reg1=0xc43e71 actions=output:NXM_NX_REG2[]


The failing node (worker-westus-9d9nh) is missing the table=80 priority=150 rule with actions=output:NXM_NX_REG2[].
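
A quicker way to confirm this on the failing node is to dump only table 80 from that node's ovs pod and filter on the project's VNID; a sketch, where $f is the ovs pod for worker-westus-9d9nh from the loop above:

# Dump only table 80 and look for the per-VNID allow rule (absent on the failing node)
oc -n openshift-sdn exec $f -- ovs-ofctl -O OpenFlow13 dump-flows br0 table=80 | grep reg1=0xc43e71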




Expected results:

The pods in the project should be able to talk to each other after the allow-all policy is added.

Comment 3 Juan Luis de Sousa-Valadas 2020-04-13 10:19:25 UTC
> Frequently using automation, difficult using manual steps 
I suspect this may be related to the rule being created while the SDN pod is stopped.

Can you add a 30 second sleep to the automated test between the sdn pod being deleted and the allow-all rule being created, and launch the test a bunch of times and see if it reproduces that way please?

Comment 4 Ross Brattain 2020-04-14 16:46:13 UTC
I added a "wait for SDN pod ready" step to the automation and that seems to have fixed the issue.  I'm not sure what the synchronization expectation is if the networkpolicy is created when not all the SDN pods are ready.


Note: it seems to only take ~15 seconds for the SDN pod to be ready.

      [16:32:11] INFO> Shell Commands: oc delete pods sdn-582sr --wait=false --namespace=openshift-sdn

      [16:32:26] INFO> Shell Commands: oc create -f features/tierN/testdata/networking/networkpolicy/allow-all.yaml
      networkpolicy.networking.k8s.io/allow-all created
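
The wait step itself is not shown in the log; one way to implement it is to block until the sdn daemonset has finished rolling out again before creating the policy (a sketch, assuming the daemonset is named sdn; the actual automation step may differ):

# Wait for all sdn pods, including the replacement, to be running and ready
oc -n openshift-sdn rollout status daemonset/sdn --timeout=120s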

Comment 6 Juan Luis de Sousa-Valadas 2020-04-15 14:39:27 UTC
Hi Ross,
I see this is in the errata, but I don't think this should be published in an errata at all. Can you remove it please?

This isn't really solved, although I think it's fine to have the test wait for the SDN pod to be recreated and leave this open at a lower priority, or even close it as WONTFIX, but it definitely shouldn't be marked verified.

This is an extreme corner case. It happens only under the following conditions:
1- There is a deny-all networkpolicy without any allow policy at all.
2- The SDN pod stops for whatever reason.
3- Before the SDN pod starts again, someone creates an allow-all rule. It must be an allow-all; a rule allowing only specific ports or only a subset of pods in the namespace won't trigger this.

This is an incredibly specific corner case... and I don't think anyone will ever hit it. But nevertheless it's still not fixed and shouldn't be verified.

Thanks!

Comment 7 zhaozhanqi 2020-04-16 02:13:46 UTC
hi, Juan

IMO, this issue happens because the SDN pod is not yet fully working, so the networkpolicy is not applied as expected. I think this can be accepted, since the policy will take effect once the SDN pod has started completely.

Comment 8 Ross Brattain 2020-04-16 13:28:22 UTC
Sorry for the verified, I'm still learning the BZ workflows.

I agree this is still an issue.  When creating network policies, customers shouldn't have to wait 30 seconds or wait for any particular SDN pods to be ready; once a network policy is created it should be implemented correctly.

Comment 12 Juan Luis de Sousa-Valadas 2020-09-10 15:08:53 UTC
Zhanqi, it's not worth the effort since it's an edge case and it's almost impossible to hit in real life.
If a customer actually hits this we'll fix it, but I find that extremely unlikely.