Bug 1565075 - the veth cannot be added back after running ovs mod-flows to modify sdn version in table 253
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Dan Williams
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-09 10:34 UTC by Hongan Li
Modified: 2018-04-23 06:38 UTC
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-23 06:38:48 UTC
Target Upstream Version:


Attachments (Terms of Use)
can see the sdn pod is restarted twice after mod-flows and here is logs (12.96 KB, text/plain)
2018-04-09 10:34 UTC, Hongan Li

Description Hongan Li 2018-04-09 10:34:12 UTC
Created attachment 1419181 [details]
can see the sdn pod is restarted twice after mod-flows and here is logs

Description of problem:
The pod's veth port cannot be added back to br0 after running the following ovs-ofctl mod-flows command:
ovs-ofctl mod-flows br0 "table=253, actions=note:01.ff" -O openflow13

Version-Release number of selected component (if applicable):
oc v3.10.0-0.16.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-hongli-master-etcd-1:8443
openshift v3.10.0-0.14.0
kubernetes v1.9.1+a0ce1bc657

How reproducible:
always

Steps to Reproduce:
1. Ensure the sdn and ovs pods are running, then create a project and a pod, and verify that the pod's IP is reachable.

# oc get pod -o wide
NAMESPACE               NAME                                         READY     STATUS    RESTARTS   AGE       IP            NODE
lha                     caddy-docker                                 1/1       Running   0          20m       10.129.0.2    qe-hongli-node-registry-router-1

# oc exec ovs-bwcbp -- ovs-ofctl show br0 -O openflow13
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:000042f5a8316442
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:2e:ba:5d:4c:81:93
     config:     0
     state:      LIVE
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:82:f2:36:f6:42:02
     config:     0
     state:      LIVE
     speed: 0 Mbps now, 0 Mbps max
 3(veth87b98a50): addr:2a:3d:68:86:ed:7a
     config:     0
     state:      LIVE
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:42:f5:a8:31:64:42
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=nx-match miss_send_len=0
# oc exec ovs-bwcbp -- ovs-ofctl dump-flows br0 -O openflow13 | grep output:3
 cookie=0x0, duration=1308.041s, table=40, n_packets=2, n_bytes=84, priority=100,arp,arp_tpa=10.129.0.2 actions=output:3
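As a side note, the port-to-ofport mapping above (the pod's veth is ofport 3, matching the `output:3` action in the dumped flow) can be extracted programmatically. A minimal Python sketch that parses the `ovs-ofctl show` output quoted above; the helper name is illustrative:

```python
import re

def parse_ofports(show_output: str) -> dict:
    """Map port names to OpenFlow port numbers from `ovs-ofctl show` output.

    Port lines look like ` 3(veth87b98a50): addr:2a:3d:68:86:ed:7a`.
    """
    ports = {}
    for match in re.finditer(r"^\s*(\d+|LOCAL)\((\S+)\):", show_output, re.M):
        ofport, name = match.groups()
        ports[name] = ofport
    return ports

sample = """\
 1(vxlan0): addr:2e:ba:5d:4c:81:93
 2(tun0): addr:82:f2:36:f6:42:02
 3(veth87b98a50): addr:2a:3d:68:86:ed:7a
 LOCAL(br0): addr:42:f5:a8:31:64:42
"""
print(parse_ofports(sample))
```

This makes it easy to confirm which ofport a given pod's veth holds before and after the reproduction steps.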


2. Run the ovs-ofctl mod-flows command against the OVS pod on the node where the pod landed.
# oc exec ovs-bwcbp -- ovs-ofctl mod-flows br0 "table=253, actions=note:01.ff" -O openflow13

3. The table 253 flow's action changes to "note:01.ff", then changes back to "note:01.06" within about 40 seconds.

# oc exec ovs-bwcbp -- ovs-ofctl dump-flows br0 -O openflow13 | grep 253
 cookie=0x0, duration=1967.301s, table=253, n_packets=0, n_bytes=0, actions=note:01.ff.00.00.00.00
# oc exec ovs-bwcbp -- ovs-ofctl dump-flows br0 -O openflow13 | grep 253
 cookie=0x0, duration=2000.458s, table=253, n_packets=0, n_bytes=0, actions=note:01.ff.00.00.00.00
# oc exec ovs-bwcbp -- ovs-ofctl dump-flows br0 -O openflow13 | grep 253
 cookie=0x0, duration=5.167s, table=253, n_packets=0, n_bytes=0, actions=note:01.06.00.00.00.00
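Per the bug title, the note in table 253 carries the SDN version, so overwriting it is what triggers the SDN to reinitialize the bridge. A minimal Python sketch that extracts the note bytes from the flow lines quoted above, so the modified (01.ff) and restored (01.06) values can be compared; the helper name is illustrative:

```python
def parse_note(flow_line: str) -> bytes:
    """Extract the note payload bytes from a dumped flow's actions field.

    `actions=note:01.ff.00.00.00.00` becomes b'\\x01\\xff\\x00\\x00\\x00\\x00'.
    """
    hex_part = flow_line.split("actions=note:")[1].split(",")[0]
    return bytes(int(b, 16) for b in hex_part.split("."))

modified = ("cookie=0x0, duration=1967.301s, table=253, "
            "n_packets=0, n_bytes=0, actions=note:01.ff.00.00.00.00")
restored = ("cookie=0x0, duration=5.167s, table=253, "
            "n_packets=0, n_bytes=0, actions=note:01.06.00.00.00.00")
print(parse_note(modified)[:2])  # first two bytes of the injected note
print(parse_note(restored)[:2])  # first two bytes after the SDN restores it
```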

4. Wait over 10 minutes; the veth is still not added back.
# oc exec ovs-bwcbp -- ovs-ofctl show br0 -O openflow13
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000426876519442
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:d2:7b:94:d4:e6:b2
     config:     0
     state:      LIVE
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:1a:fb:7d:4c:5c:64
     config:     0
     state:      LIVE
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br0): addr:42:68:76:51:94:42
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=nx-match miss_send_len=0

5. The pod's IP address is not restored and the pod is unreachable.

Actual results:
The veth is not added back to br0 after running the ovs-ofctl mod-flows command.

Expected results:
The veth should be added back and the pod should be reachable.

Additional info:
Workaround: systemctl restart docker
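Before falling back to the docker restart workaround, it is easy to script a check for whether the pod veth has reappeared on br0. A minimal Python sketch against the `ovs-ofctl show` outputs captured in the steps above; the helper name is illustrative:

```python
def has_veth_port(show_output: str) -> bool:
    """Return True if any port listed in `ovs-ofctl show` output is a pod veth."""
    return any(line.strip().split("(")[1].startswith("veth")
               for line in show_output.splitlines()
               if "(" in line and "): addr:" in line)

# Port listing from step 1, where the pod's veth is present.
before = " 3(veth87b98a50): addr:2a:3d:68:86:ed:7a"
# Port listing from step 4, where the veth never comes back.
after = """\
 1(vxlan0): addr:d2:7b:94:d4:e6:b2
 2(tun0): addr:1a:fb:7d:4c:5c:64
 LOCAL(br0): addr:42:68:76:51:94:42
"""
print(has_veth_port(before), has_veth_port(after))
```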

Comment 1 Ben Bennett 2018-04-09 19:17:56 UTC
Wait... why should people be able to run mod-flows on the ovs bridge we are using?

What are you really trying to do?

Comment 3 Dan Williams 2018-04-14 21:33:53 UTC
Any chance you can get the openshift-node logs too? The errors about connecting to the runtime socket are a bit concerning; that goes on for 13 seconds, which may well be a timeout. I don't see why that should happen if openshift-node itself is still running.

Also, when this happens can you get a 'ps aux' on the system?  It may be that openshift-node is waiting for the SDN to do something, but the SDN is waiting for the node to do something.  I thought they were all doing that async, but maybe not.

Comment 5 Hongan Li 2018-04-23 06:38:48 UTC
Re-tested in the latest build but cannot reproduce the issue. The veth is added back to OVS and works well after the SDN reinitializes.

Closing it for now.

oc v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-hongli-310-auto-master-etcd-1:8443
openshift v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8

