Bug 1277383 - OVS port wasn't deleted when OpenShift deleted pods
Summary: OVS port wasn't deleted when OpenShift deleted pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.0.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Dan Williams
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-11-03 08:34 UTC by Anping Li
Modified: 2016-01-26 19:16 UTC
CC List: 6 users

Fixed In Version: atomic-openshift-3.1.0.4.git.10.ec10652
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-26 19:16:42 UTC
Target Upstream Version:
Embargoed:


Attachments
node journal logs (44.72 KB, application/x-gzip)
2015-11-06 13:23 UTC, Anping Li


Links
System ID: Red Hat Product Errata RHSA-2016:0070
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat OpenShift Enterprise 3.1.1 bug fix and enhancement update
Last Updated: 2016-01-27 00:12:41 UTC

Description Anping Li 2015-11-03 08:34:47 UTC
Description of problem:
After creating some pods and deleting some of them, there are many 'could not open network device veth862f2b3 (No such device)' warnings in /var/log/openvswitch/ovs-vswitchd.log.
'ovs-vsctl list-ifaces br0' shows that the interfaces of the deleted pods still exist.
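
A quick way to quantify the leak (a sketch, assuming br0 is the SDN bridge as in the steps below) is to compare the interfaces OVS still lists against the veth devices that actually exist:

  # print OVS interfaces whose backing veth device no longer exists
  for iface in $(ovs-vsctl list-ifaces br0 | grep '^veth'); do
      ip link show "$iface" >/dev/null 2>&1 || echo "stale: $iface"
  done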

Version-Release number of selected component (if applicable):
openshift v3.0.2.905
kubernetes v1.2.0-alpha.1-1107-g4c8e6f4
etcd 2.1.2

How reproducible:
always

Steps to Reproduce:
1. Create some applications, then delete some of them.

2. Check /var/log/openvswitch/ovs-vswitchd.log:
  # cat ovs-vswitchd.log | grep 'could not open network device'

3. Check the veth interfaces on the node:
  # ip addr | grep veth

4. Check the OVS interfaces:
  # ovs-vsctl list-ifaces br0

Actual results:
1. tail /var/log/openvswitch/ovs-vswitchd.log
2015-11-03T05:12:05.952Z|158130|bridge|WARN|could not open network device veth862f2b3 (No such device)
2015-11-03T05:12:05.955Z|158131|bridge|WARN|could not open network device veth51413f4 (No such device)
2015-11-03T05:12:05.959Z|158132|bridge|WARN|could not open network device veth4a49dbf (No such device)
2015-11-03T05:12:05.962Z|158133|bridge|WARN|could not open network device veth147590f (No such device)
2015-11-03T05:12:05.968Z|158134|bridge|WARN|could not open network device vethe6f1e77 (No such device)
2015-11-03T05:12:05.973Z|158135|bridge|WARN|could not open network device veth3e22398 (No such device)

2. # cat ovs-vswitchd.log|grep 'could not open network device' |wc 
  19895  179055 2049185
3. #cat ovs-vswitchd.log|grep 'could not open network device' | awk -F"|" '{print $5}'|sort |uniq|wc
    147    1323    8673
4.1 # ip addr|grep veth
2569: vethb451848@if2568: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
13: veth9e63e82@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
2577: vethfba7c86@if2576: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
2583: veth1958a87@if2582: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
2589: veth155f7b4@if2588: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
2609: veth6b7da7c@if2608: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
2637: veth839d399@if2636: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
2445: vethb662529@if2444: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
2499: vethe3258f7@if2498: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
2503: vethafefb1f@if2502: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
2517: veth06037bc@if2516: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
2523: vethcc2d72f@if2522: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 
2539: veth7de699e@if2538: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP 

4.2 #ip addr|grep  veth|wc
     13     143    1439

5.1 ovs-vsctl list-ifaces br0
tun0
veth00298c6
veth02c73a4
veth0446afd
veth06037bc
veth06acd7e
veth073678e
veth076cba1
veth079520e
<---snip--->
<---snip--->
<---snip--->
<---snip--->
veth0bdf81d
veth1116a3e
veth146f903
veth147590f
veth155f7b4
veth18e96f1
veth1958a87
vethe96d879
vethec3078f
vethf262638
vethf97f662
vethfba7c86
vethfe3c354
vethfec5574
vethff5fb87
vethff85ab1
vovsbr
vxlan0

5.2 # ovs-vsctl list-ifaces br0 |wc -l
    163
    

Expected results:
1. There are no 'could not open network device' warnings in ovs-vswitchd.log.
2. The OVS ports are deleted by OpenShift when the pods are deleted.


Additional info:

OpenShift should call something like 'ovs-vsctl del-port [BRIDGE] PORT' to delete the OVS port when it deletes a pod.
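
In the meantime, a minimal sketch of how the stale ports could be cleaned up by hand (assuming br0 is the SDN bridge, as above):

  # remove every OVS port on br0 whose veth device no longer exists
  for port in $(ovs-vsctl list-ports br0 | grep '^veth'); do
      ip link show "$port" >/dev/null 2>&1 || ovs-vsctl --if-exists del-port br0 "$port"
  done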

Comment 2 Dan Winship 2015-11-04 14:31:56 UTC
[root@openshift-minion-1 vagrant]# cat /var/log/openvswitch/ovs-vswitchd.log|grep 'could not open network device' |wc
[root@openshift-minion-1 vagrant]# ovs-vsctl list-ifaces br0
tun0
vovsbr
vxlan0

Can you attach the openshift journal output?
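
One way to capture it (a sketch, assuming the node runs as the atomic-openshift-node systemd unit, as the logs below show):

  # dump the node service journal to a file for attaching
  journalctl -u atomic-openshift-node --no-pager > node-journal.log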

Comment 3 Anping Li 2015-11-06 13:23:44 UTC
Created attachment 1090664 [details]
node journal logs

Comment 4 Dan Williams 2015-11-06 16:56:05 UTC
Nov 06 15:57:37 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: E1106 15:57:37.835561    5186 common.go:535] Error fetching Net ID for namespace: wewang2, skipped netNsEvent: &{ADDED wewang2 15}

Nov 06 16:01:06 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: I1106 16:01:06.797368    5186 manager.go:1451] Container "95bc279983850eaf2b2a8af99850972974f6d24587ee6584ee5e37b461875e97 ruby-helloworld-database wewang2/database-1-lj377" exited after 1.148343426s

Nov 06 16:01:06 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: E1106 16:01:06.797478    5186 manager.go:1342] Failed tearing down the infra container: Error fetching VNID for namespace: wewang2

Nov 06 16:01:06 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: I1106 16:01:06.797936    5186 manager.go:1451] Container "95bc279983850eaf2b2a8af99850972974f6d24587ee6584ee5e37b461875e97 /" exited after 1.132851637s

Nov 06 16:01:06 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: W1106 16:01:06.797959    5186 manager.go:1457] No ref for pod '"95bc279983850eaf2b2a8af99850972974f6d24587ee6584ee5e37b461875e97 /"'

Nov 06 16:01:06 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: E1106 16:01:06.797979    5186 manager.go:1342] Failed tearing down the infra container: Error fetching VNID for namespace: wewang2

...

Nov 06 16:01:05 openshift-155.lab.eng.nay.redhat.com atomic-openshift-node[5186]: E1106 16:01:05.520164    5186 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"ruby-sample-build-1-build.14140e156d7b7f73", GenerateName:"", Namespace:"wewang2", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"wewang2", Name:"ruby-sample-build-1-build", UID:"3d7510ce-845c-11e5-8113-fa163e141094", APIVersion:"v1", ResourceVersion:"7061", FieldPath:"spec.containers{docker-build}"}, Reason:"Killing", Message:"Killing with docker id b0e83f3067e3", Source:api.EventSource{Component:"kubelet", Host:"openshift-155.lab.eng.nay.redhat.com"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63582393665, nsec:486684019, loc:(*time.Location)(0x4ad42e0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63582393665, nsec:486684019, loc:(*time.Location)(0x4ad42e0)}}, Count:1}': 'Event "ruby-sample-build-1-build.14140e156d7b7f73" is forbidden: Unable to create new content in namespace wewang2 because it is being terminated.' (will not retry!)

-----------------

Perhaps we have a condition where the namespace got deleted, so the nodes all got the VNID deletion event, and only *after* that did they start deleting pods. But since a node no longer has the VNID in its cache, the TearDownPod step fails.

But we don't even need the VNID for pod teardown, since we use the pod IP address (and previously cookies). I think we should just remove the bits in TearDownPod that fetch the VNID, and if we do need it in the future, figure out how to get it then.
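
A quick way to check whether a node is hitting this failure mode (a sketch based on the log messages above):

  # count pod teardown failures caused by the missing VNID
  journalctl -u atomic-openshift-node | grep -c 'Error fetching VNID for namespace'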

Comment 5 Dan Williams 2015-11-06 19:47:11 UTC
Upstream PR: https://github.com/openshift/openshift-sdn/pull/207

Comment 6 Dan Williams 2015-11-06 21:41:55 UTC
Origin PR: https://github.com/openshift/origin/pull/5772

Comment 7 Dan Williams 2015-11-30 15:57:57 UTC
These PRs got merged, so the next openshift origin release should have the relevant fixes.

Comment 8 Anping Li 2015-12-03 06:26:38 UTC
The fix works well on Origin openshift v1.1-315-gdc545a5-dirty. Moving the bug to MODIFIED status and waiting for the OSE puddle.

The OVS ports were deleted once the pods were deleted on Origin:

1) Create some pods and check the OVS ports
node1:  #ovs-vsctl list-ifaces br0
tun0
veth2795fbc
veth9c32332
vethe58a5cb
vetheef0077
vovsbr
vxlan0
node2:  #ovs-vsctl list-ifaces br0
tun0
veth617b9ac
vethbc10972
vethe36dc1e
vovsbr
vxlan0
node3:  #ovs-vsctl list-ifaces br0
tun0
veth66c76b9
veth7f2fe81
veth943d2d4
vovsbr
vxlan0
2) Delete the pods and check the OVS ports again

node1:  #ovs-vsctl list-ifaces br0
tun0
vovsbr
vxlan0
node2:  #ovs-vsctl list-ifaces br0
tun0
vovsbr
vxlan0
node3:  #ovs-vsctl list-ifaces br0
tun0
vovsbr
vxlan0

Comment 9 Dan Williams 2015-12-17 15:40:22 UTC
It looks like the fix is in atomic-openshift-3.1.0.4.git.10.ec10652 (2015-11-30) and possibly earlier versions.  Do you need to test again with the OSE puddle versions or is this bug good to go?

Comment 10 Anping Li 2015-12-18 03:25:36 UTC
Verified and passed on atomic-openshift-3.1.0.902.

Comment 12 errata-xmlrpc 2016-01-26 19:16:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:0070

