Bug 1272295
Summary: Should clean up the orphan openflow rules after no traffic for hard_timeout
Product: OKD
Component: Networking
Version: 3.x
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: Meng Bo <bmeng>
Assignee: Dan Winship <danw>
QA Contact: Meng Bo <bmeng>
CC: aos-bugs, danw, kzhang, rkhan, yadu
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-11-23 21:13:29 UTC
Description — Meng Bo, 2015-10-16 02:14:33 UTC
Dan Winship:
(In reply to Meng Bo from comment #0)
> As the SDN setup script, it set the hard_timeout=900 in table0.
>
> # grep hard_timeout ./Godeps/_workspace/src/github.com/openshift/openshift-sdn/pkg/ovssubnet/controller/multitenant/bin/openshift-sdn-multitenant-setup.sh
> ovs-ofctl -O OpenFlow13 add-flow br0 "table=0, actions=learn(table=8,
> priority=200, hard_timeout=900, NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],
> load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[], output:NXM_OF_IN_PORT[]),
> goto_table:1"

That hard_timeout applies only to rules added by that learn() action (e.g., rules that have table=8 and priority=200). There are no such rules in your dump-flows output, so the timeout is working.

However, the rules in table 3 and table 6 ought to have been deleted with their associated pods, so if the pods have been deleted then there's some bug here... Can you attach your "journalctl -u openshift-node.service" output? That should show the output from the relevant "ovs-ofctl del-flows" commands.

The priority=200 rules in table 4 are associated with services, not pods, so they should still be there if you still have services defined (even if there aren't any pods implementing those services). (There should probably be a test that the service rules get cleaned up properly when services go away.)

All of the rules in table 7 and the two priority=100 rules in table 8 are associated with nodes, not pods, so they should still be there as long as the nodes are still in service. (There should probably be a test that these get cleaned up correctly too.) All of the remaining rules are permanent.

Meng Bo:
http://pastebin.test.redhat.com/320760

Here is the openshift-node log when creating/deleting a pod/service. I have highlighted the lines which look strange to me. Thanks.
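As an aside: Dan's distinction can be checked mechanically. Only the learned MAC rules carry the hard_timeout from the learn() action, and in dump-flows output they show up in table=8 with a hard_timeout field. A small sketch of filtering for them (the helper name and the sample dump lines are made up for illustration, not taken from this bug's actual dump):

```shell
#!/bin/sh
# Count the learned table=8 rules in an ovs-ofctl dump-flows listing.
# Only these rules are subject to the hard_timeout=900 set by learn();
# everything else in the dump is either permanent or pod/service/node-scoped.
count_learned_rules() {
    grep -c 'table=8.*hard_timeout.*priority=200'
}

# Sample dump text (illustrative): one learned rule, one permanent table=8 rule.
sample='cookie=0x0, duration=12.3s, table=8, n_packets=4, hard_timeout=900, priority=200,dl_dst=02:42:0a:01:00:06 actions=output:3
cookie=0x0, duration=600.0s, table=8, n_packets=0, priority=100,ip,nw_dst=10.1.0.0/24 actions=output:9'

printf '%s\n' "$sample" | count_learned_rules
```

If this prints 0 after 900 seconds of no traffic, the hard_timeout is doing its job, which matches Dan's reading of the dump.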
Dan Winship:
> Oct 19 14:52:11 node1.bmeng.local openshift-node[2328]: W1019 14:52:11.174872 2328 manager.go:1748] Hairpin setup failed for pod "hello-pod_u1p1": open /sys/devices/virtual/net/veth4ede697/brport/hairpin_mode: no such file or directory

That's harmless and has been fixed in upstream kubernetes. I'll submit a PR to pull in that fix.

> Oct 19 15:03:51 node1.bmeng.local openshift-node[2328]: E1019 15:03:51.759344 2328 manager.go:1204] Failed tearing down the infra container: exit status 1

That's the error, but I guess we don't log the output of that script by default; can you add "--loglevel=5" to the openshift command line in openshift-node.service and try again?

Meng Bo:
http://pastebin.test.redhat.com/321035

Logs for deleting a pod with loglevel=5.

Meng Bo:
Found the PR in the openshift-sdn repo: https://github.com/openshift/openshift-sdn/pull/186

@Dan Is the error in line 44 of this log (http://pastebin.test.redhat.com/320760) also harmless? It seems it failed to delete an openflow rule.

Dan Winship:
Ah, you're right. (I only tested deleting pods, not deleting services.)

Meng Bo:
I have checked the issue after syncing the latest openshift-sdn into origin. The service openflow rule can be deleted now, but the pods' rules still cannot be cleaned.
Node logs as below:

Oct 22 12:35:21 node1.bmeng.local ovs-vsctl[6385]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port vethc02aba4
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: I1022 12:35:21.781796 6298 plugins.go:60] TearDownPod 'multitenant' network plugin output: + lockwrap Teardown
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + flock 200
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + Teardown
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + get_ipaddr_pid_veth
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ docker inspect --format '{{.HostConfig.NetworkMode}}' 4cf8ecc974ce0b1d224b1bcafd6d101d5695383d916430bc91b1e326bdfd0c5a
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + network_mode=bridge
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + '[' bridge == host ']'
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + [[ bridge =~ container:.* ]]
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ docker inspect --format '{{.NetworkSettings.IPAddress}}' 4cf8ecc974ce0b1d224b1bcafd6d101d5695383d916430bc91b1e326bdfd0c5a
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + ipaddr=10.1.0.6
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ docker inspect --format '{{.State.Pid}}' 4cf8ecc974ce0b1d224b1bcafd6d101d5695383d916430bc91b1e326bdfd0c5a
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + pid=6215
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ get_veth_host 6215
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ local pid=6215
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: +++ nsenter -n -t 6215 -- ethtool -S eth0
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: +++ sed -n -e 's/.*peer_ifindex: //p'
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ local veth_ifindex=21
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ ip link show
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ sed -ne 's/^21: \([^:@]*\).*/\1/p'
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + veth_host=vethc02aba4
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + del_ovs_port
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + ovs-vsctl --if-exists del-port vethc02aba4
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + del_ovs_flows
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ ovs-ofctl -O OpenFlow13 dump-ports-desc br0
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ grep vethc02aba4
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ tr -d ' '
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ++ cut -d '(' -f 1
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + ovs_port=
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: + ovs-ofctl -O OpenFlow13 del-flows br0 table=3,cookie=0x/0xffffffff
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: ovs-ofctl: invalid numeric format 0x
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: , exit status 1
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: E1022 12:35:21.781830 6298 manager.go:1336] Failed tearing down the infra container: exit status 1
Oct 22 12:35:21 node1.bmeng.local openshift-node[6298]: I1022 12:35:21.786531 6298 manager.go:1413] Killing container "4cf8ecc974ce0b1d224b1bcafd6d101d5695383d916430bc91b1e326bdfd0c5a u1p1/hello-pod" with 30 second grace period

BTW, I found that only the openflow rules of the user's namespace can be cleaned once the service/pod is deleted by the user. All the rules in the default namespace will not be cleaned. You can see there are two rules pointing to port 5000 and two to port 80, since I deleted and recreated both the docker-registry and the router once.
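The log above shows the failure mode directly: the veth port had already been removed, so the dump-ports-desc lookup left ovs_port empty, and del_ovs_flows then built the malformed match cookie=0x/0xffffffff. A minimal sketch of the kind of guard that avoids this, assuming the cookie is derived from the OVS port number as the empty-cookie output suggests (the function name is hypothetical, and the command is echoed rather than executed so this is not the actual openshift-sdn fix):

```shell
#!/bin/sh
# Hypothetical guard: only issue del-flows when a port number was resolved.
# An empty value would otherwise produce "cookie=0x/0xffffffff" and the
# "invalid numeric format 0x" error seen in the node log.
del_pod_flows() {
    bridge=$1
    ovs_port=$2
    if [ -z "$ovs_port" ]; then
        # The port is already gone; there is nothing keyed on it to delete.
        echo "skip: no OVS port found, nothing to delete" >&2
        return 0
    fi
    # Echo instead of executing so the sketch runs without Open vSwitch.
    echo "ovs-ofctl -O OpenFlow13 del-flows $bridge table=3,cookie=0x$ovs_port/0xffffffff"
}

del_pod_flows br0 3     # prints the del-flows command that would run
del_pod_flows br0 ""    # skips cleanly instead of failing with exit status 1
```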
cookie=0x0, duration=4845.973s, table=4, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=172.30.206.20,tp_dst=5000 actions=output:2
cookie=0x0, duration=5984.690s, table=4, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=172.30.132.224,tp_dst=5000 actions=output:2
cookie=0x0, duration=159.736s, table=4, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=172.30.0.1,tp_dst=443 actions=output:2
cookie=0x0, duration=4845.963s, table=4, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=172.30.183.25,tp_dst=80 actions=output:2
cookie=0x0, duration=5984.679s, table=4, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=172.30.87.133,tp_dst=80 actions=output:2

Dan Winship:
(In reply to Meng Bo from comment #9)
> BTW, I found only the openflow rules of user's namespace can be cleaned once
> the service/pod is deleted by user.
>
> All the rules in default namespace will not be cleaned.

I'm not able to reproduce this. Can you give me an exact sequence of commands?

Meng Bo:
(In reply to Dan Winship from comment #10)
> I'm not able to reproduce this. Can you give me an exact sequence of
> commands?

OK, I cannot reproduce it either... maybe some of the rules were created before I upgraded the binary.

Meng Bo:
Checked with the latest origin code and openshift-sdn version 1e4edc9abb6bb8ac7e5cd946ddec4c10cc714d67. The openflow rules can now be deleted after the pod is deleted. Will verify the bug once the changes are merged into origin.

Meng Bo:
Created attachment 1086359 [details]
openflow-dump

@danw It seems there is still some problem with removing the openflow rules of deleted services. I do not have the exact reproduce steps; I was just using the script below to create/delete resources in my env over the weekend. There are 180 rules for deleted services on one of my nodes, and 106 rules on another node.
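A quick way to audit stale service rules like the duplicates in the dump above is to extract the IP:port targets of the table=4 priority=200 rules and compare them by hand against the currently defined Services. A rough sketch (the helper name is made up; the sample lines are two entries from the dump above):

```shell
#!/bin/sh
# List the nw_dst:tp_dst targets of table=4 priority=200 service rules from
# ovs-ofctl dump-flows text; any target with no matching Service is stale.
service_rule_targets() {
    sed -n 's/.*table=4.*priority=200,tcp,nw_dst=\([0-9.]*\),tp_dst=\([0-9]*\).*/\1:\2/p'
}

sample='cookie=0x0, duration=4845.973s, table=4, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=172.30.206.20,tp_dst=5000 actions=output:2
cookie=0x0, duration=159.736s, table=4, n_packets=0, n_bytes=0, priority=200,tcp,nw_dst=172.30.0.1,tp_dst=443 actions=output:2'

printf '%s\n' "$sample" | service_rule_targets
# Compare against `oc get svc --all-namespaces` to spot orphans.
```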
Script:
===========
for i in 1 2 3 4; do oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json -n u1p$i ; done
sleep 5
oadm pod-network join-projects --to u1p1 u1p2 u1p3
sleep 5
for i in 1 2 3 4; do oc delete rc --all -n u1p$i ; done
for i in 1 2 3 4; do oc delete svc --all -n u1p$i ; done
sleep 5
===========
Openflow dump attached.

Dan Winship:
(In reply to Meng Bo from comment #13)
> oadm pod-network join-projects --to u1p1 u1p2 u1p3

Do you still run into the problem if you don't do any pod-network reorganizing?

> Openflow dump attached.

Can you attach the journal output? It logs every service rule it deletes, along with the output from the ovs-ofctl command if it fails. (Look for "Output of deleting", but note that the log message may continue on the following lines.)

Meng Bo:
Created attachment 1086790 [details]
node_log_for_legacy_service

Hi Dan, I can reproduce the problem without the `oadm pod-network join-projects` step, but I still don't know the exact steps to reproduce it. The attached log does not seem to contain everything it should; some of the output could not be logged to the journal. The service with IP 172.30.231.118 failed to be deleted from OVS, but there is no "Output of deleting" entry for that IP.

I think the issue in comment #13 is a race-condition-related problem. What do you think about focusing only on the functionality of openflow rule deletion here? I will raise the other issue as a separate bug.

Dan Winship:
(In reply to Meng Bo from comment #15)
> The attached log does not seem to contain everything it should; some of
> the output could not be logged to the journal. The service with IP
> 172.30.231.118 failed to be deleted from OVS, but there is no "Output of
> deleting" entry for that IP.

OK, so for some reason we're not even calling DelServiceOFRules() for that deletion...

> I think the issue in comment #13 is a race-condition-related problem. What
> do you think about focusing only on the functionality of openflow rule
> deletion here? I will raise the other issue as a separate bug.

Sounds good. Other than the apparent race condition with deleting services, this should now be fixed in origin master.

Meng Bo:
The issue has been fixed on openshift v1.0.6-997-gff3b522. Moving the bug to VERIFIED.