Description of problem:
After creating and then deleting some pods, /var/log/openvswitch/ovs-vswitchd.log fills up with warnings like 'could not open network device veth862f2b3 (No such device)', and 'ovs-vsctl list-ifaces br0' still lists the interfaces of the deleted pods.

Version-Release number of selected component (if applicable):
openshift v3.0.2.905
kubernetes v1.2.0-alpha.1-1107-g4c8e6f4
etcd 2.1.2

How reproducible:
Always

Steps to Reproduce:
1. Create some applications, then delete some of them.
2. Check /var/log/openvswitch/ovs-vswitchd.log:
   # cat ovs-vswitchd.log | grep 'could not open network device'
3. Check the host veth devices:
   # ip addr | grep veth
4. Check the OVS interfaces:
   # ovs-vsctl list-ifaces br0

Actual results:
1. tail /var/log/openvswitch/ovs-vswitchd.log
2015-11-03T05:12:05.952Z|158130|bridge|WARN|could not open network device veth862f2b3 (No such device)
2015-11-03T05:12:05.955Z|158131|bridge|WARN|could not open network device veth51413f4 (No such device)
2015-11-03T05:12:05.959Z|158132|bridge|WARN|could not open network device veth4a49dbf (No such device)
2015-11-03T05:12:05.962Z|158133|bridge|WARN|could not open network device veth147590f (No such device)
2015-11-03T05:12:05.968Z|158134|bridge|WARN|could not open network device vethe6f1e77 (No such device)
2015-11-03T05:12:05.973Z|158135|bridge|WARN|could not open network device veth3e22398 (No such device)

2. # cat ovs-vswitchd.log | grep 'could not open network device' | wc
  19895  179055 2049185

3. # cat ovs-vswitchd.log | grep 'could not open network device' | awk -F"|" '{print $5}' | sort | uniq | wc
    147    1323    8673

4.1 # ip addr | grep veth
2569: vethb451848@if2568: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
13: veth9e63e82@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
2577: vethfba7c86@if2576: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
2583: veth1958a87@if2582: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
2589: veth155f7b4@if2588: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
2609: veth6b7da7c@if2608: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
2637: veth839d399@if2636: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
2445: vethb662529@if2444: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
2499: vethe3258f7@if2498: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
2503: vethafefb1f@if2502: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
2517: veth06037bc@if2516: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
2523: vethcc2d72f@if2522: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP
2539: veth7de699e@if2538: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP

4.2 # ip addr | grep veth | wc
     13     143    1439

5.1 # ovs-vsctl list-ifaces br0
tun0
veth00298c6
veth02c73a4
veth0446afd
veth06037bc
veth06acd7e
veth073678e
veth076cba1
veth079520e
<---snip--->
veth0bdf81d
veth1116a3e
veth146f903
veth147590f
veth155f7b4
veth18e96f1
veth1958a87
vethe96d879
vethec3078f
vethf262638
vethf97f662
vethfba7c86
vethfe3c354
vethfec5574
vethff5fb87
vethff85ab1
vovsbr
vxlan0

5.2 # ovs-vsctl list-ifaces br0 | wc -l
163

Expected results:
1. No 'could not open network device veth862f2b3 (No such device)' warnings appear in ovs-vswitchd.log.
2. The OVS port should be deleted by OpenShift when the corresponding pod is deleted.

Additional info:
OpenShift should call something like 'ovs-vsctl del-port [BRIDGE] PORT' to remove the OVS port when a pod is deleted.
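Until such a fix lands in the node code, the stale ports can be cleaned up by hand with the same 'ovs-vsctl del-port' call. The sketch below is my own workaround, not anything OpenShift runs itself; it assumes the SDN bridge is named br0 (as on the nodes above) and removes every veth port that OVS still lists but that no longer exists in the kernel:

#!/bin/bash
# Workaround sketch (not an OpenShift component): drop OVS ports on br0 whose
# veth device no longer exists on the host. Assumes the SDN bridge is br0.
BRIDGE=br0
for iface in $(ovs-vsctl list-ifaces "$BRIDGE" | grep '^veth'); do
    # 'ip link show <dev>' exits non-zero when the kernel device is gone
    if ! ip link show "$iface" >/dev/null 2>&1; then
        echo "removing stale OVS port $iface"
        ovs-vsctl --if-exists del-port "$BRIDGE" "$iface"
    fi
done

Running it once on an affected node should also stop the 'could not open network device' warnings, since ovs-vswitchd no longer keeps retrying the missing devices.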
[root@openshift-minion-1 vagrant]# cat /var/log/openvswitch/ovs-vswitchd.log | grep 'could not open network device' | wc
[root@openshift-minion-1 vagrant]# ovs-vsctl list-ifaces br0
tun0
vovsbr
vxlan0

Can you attach the openshift journal output?
Created attachment 1090664: node journal logs
Nov 06 15:57:37 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: E1106 15:57:37.835561 5186 common.go:535] Error fetching Net ID for namespace: wewang2, skipped netNsEvent: &{ADDED wewang2 15}
Nov 06 16:01:06 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: I1106 16:01:06.797368 5186 manager.go:1451] Container "95bc279983850eaf2b2a8af99850972974f6d24587ee6584ee5e37b461875e97 ruby-helloworld-database wewang2/database-1-lj377" exited after 1.148343426s
Nov 06 16:01:06 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: E1106 16:01:06.797478 5186 manager.go:1342] Failed tearing down the infra container: Error fetching VNID for namespace: wewang2
Nov 06 16:01:06 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: I1106 16:01:06.797936 5186 manager.go:1451] Container "95bc279983850eaf2b2a8af99850972974f6d24587ee6584ee5e37b461875e97 /" exited after 1.132851637s
Nov 06 16:01:06 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: W1106 16:01:06.797959 5186 manager.go:1457] No ref for pod '"95bc279983850eaf2b2a8af99850972974f6d24587ee6584ee5e37b461875e97 /"'
Nov 06 16:01:06 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: E1106 16:01:06.797979 5186 manager.go:1342] Failed tearing down the infra container: Error fetching VNID for namespace: wewang2
...
Nov 06 16:01:05 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: E1106 16:01:05.520164 5186 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"ruby-sample-build-1-build.14140e156d7b7f73", GenerateName:"", Namespace:"wewang2", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"wewang2", Name:"ruby-sample-build-1-build", UID:"3d7510ce-845c-11e5-8113-fa163e141094", APIVersion:"v1", ResourceVersion:"7061", FieldPath:"spec.containers{docker-build}"}, Reason:"Killing", Message:"Killing with docker id b0e83f3067e3", Source:api.EventSource{Component:"kubelet", Host:"openshift-155.lab.eng.nay.redhat.com"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63582393665, nsec:486684019, loc:(*time.Location)(0x4ad42e0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63582393665, nsec:486684019, loc:(*time.Location)(0x4ad42e0)}}, Count:1}': 'Event "ruby-sample-build-1-build.14140e156d7b7f73" is forbidden: Unable to create new content in namespace wewang2 because it is being terminated.' (will not retry!)

-----------------

Perhaps we have a condition where the namespace got deleted, and so the nodes all got the VNID deletion event, and then *after* that they start deleting pods. But since the node no longer has the VNID in the cache, it fails the TearDownPod step. But we don't even need the VNID for pod teardown, since we use the pod IP address (and previously cookies). I think we should just remove the bits in TearDownPod that get the VNID and, if we do need it in the future, figure out how to get it then.
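Since teardown is keyed off the pod IP rather than the VNID, it is easy to check whether a failed TearDownPod actually left anything behind on the bridge. This is a diagnostic sketch of my own, not part of the fix; POD_IP is a placeholder for the deleted pod's address, and depending on the OpenFlow version configured on br0 the dump-flows call may need '-O OpenFlow13':

# Diagnostic sketch (placeholder pod IP, bridge assumed to be br0)
POD_IP=10.1.2.3
ovs-ofctl dump-flows br0 | grep "$POD_IP"      # leftover flows referencing the pod IP
ovs-vsctl list-ifaces br0 | grep '^veth'       # veth ports OVS still tracks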
Upstream PR: https://github.com/openshift/openshift-sdn/pull/207
Origin PR: https://github.com/openshift/origin/pull/5772
These PRs have been merged, so the next OpenShift Origin release should include the relevant fixes.
The fix works well on OpenShift Origin v1.1-315-gdc545a5-dirty. Moving the bug to MODIFIED and waiting for an OSE puddle. On Origin, the OVS ports are deleted as soon as the pods are deleted.

1) Create some pods and check the OVS ports:
node1:
# ovs-vsctl list-ifaces br0
tun0
veth2795fbc
veth9c32332
vethe58a5cb
vetheef0077
vovsbr
vxlan0
node2:
# ovs-vsctl list-ifaces br0
tun0
veth617b9ac
vethbc10972
vethe36dc1e
vovsbr
vxlan0
node3:
# ovs-vsctl list-ifaces br0
tun0
veth66c76b9
veth7f2fe81
veth943d2d4
vovsbr
vxlan0

2) Delete the pods and check the OVS ports again:
node1:
# ovs-vsctl list-ifaces br0
tun0
vovsbr
vxlan0
node2:
# ovs-vsctl list-ifaces br0
tun0
vovsbr
vxlan0
node3:
# ovs-vsctl list-ifaces br0
tun0
vovsbr
vxlan0
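For completeness, a quick pass/fail check can replace eyeballing the three node outputs. This is my own addition rather than part of the verification steps above, and it assumes every test pod on the node has been deleted, so only the fixed ports (tun0, vovsbr, vxlan0) should remain on br0:

# Pass/fail check (assumes all pods on this node have been deleted)
remaining=$(ovs-vsctl list-ifaces br0 | grep -c '^veth')
if [ "$remaining" -eq 0 ]; then
    echo "PASS: no leftover veth ports on br0"
else
    echo "FAIL: $remaining veth port(s) still attached to br0"
    ovs-vsctl list-ifaces br0 | grep '^veth'
fi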
It looks like the fix is in atomic-openshift-3.1.0.4.git.10.ec10652 (2015-11-30) and possibly earlier versions. Do you need to test again with the OSE puddle versions or is this bug good to go?
Verified and passed on atomic-openshift-3.1.0.902.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:0070