Bug 1565075

Summary: the veth cannot be added back after running ovs mod-flows to modify the sdn version note in table 253

Product: OpenShift Container Platform
Component: Networking
Version: 3.10.0
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WORKSFORME
Severity: high
Priority: high
Reporter: Hongan Li <hongli>
Assignee: Dan Williams <dcbw>
QA Contact: Meng Bo <bmeng>
CC: aos-bugs, hongli
Type: Bug
Last Closed: 2018-04-23 06:38:48 UTC

Attachments:
can see the sdn pod is restarted twice after mod-flows and here are the logs

Description Hongan Li 2018-04-09 10:34:12 UTC
Created attachment 1419181
can see the sdn pod is restarted twice after mod-flows and here are the logs

Description of problem:
The veth cannot be added back after running the following ovs mod-flows command:
ovs-ofctl mod-flows br0 "table=253, actions=note:01.ff" -O openflow13

Version-Release number of selected component (if applicable):
oc v3.10.0-0.16.0
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-hongli-master-etcd-1:8443
openshift v3.10.0-0.14.0
kubernetes v1.9.1+a0ce1bc657

How reproducible:
always

Steps to Reproduce:
1. Ensure the sdn and ovs pods are running, then create a project and a pod, and verify the pod's IP is reachable (a quick check is sketched after the output below).

# oc get pod -o wide
NAMESPACE               NAME                                         READY     STATUS    RESTARTS   AGE       IP            NODE
lha                     caddy-docker                                 1/1       Running   0          20m       10.129.0.2    qe-hongli-node-registry-router-1

# oc exec ovs-bwcbp -- ovs-ofctl show br0 -O openflow13
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:000042f5a8316442
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:2e:ba:5d:4c:81:93
     config:     0
     state:      LIVE
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:82:f2:36:f6:42:02
     config:     0
     state:      LIVE
     speed: 0 Mbps now, 0 Mbps max
 3(veth87b98a50): addr:2a:3d:68:86:ed:7a
     config:     0
     state:      LIVE
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br0): addr:42:f5:a8:31:64:42
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=nx-match miss_send_len=0
# oc exec ovs-bwcbp -- ovs-ofctl dump-flows br0 -O openflow13 | grep output:3
 cookie=0x0, duration=1308.041s, table=40, n_packets=2, n_bytes=84, priority=100,arp,arp_tpa=10.129.0.2 actions=output:3
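
A quick reachability check from the node (a sketch; the IP is the pod IP from the listing above):
# ping -c 3 10.129.0.2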


2. Run the ovs mod-flows command on the node where the pod landed:
# oc exec ovs-bwcbp -- ovs-ofctl mod-flows br0 "table=253, actions=note:01.ff" -O openflow13

3. The OVS rule in table 253 changes to "note:01.ff" and then changes back to "note:01.06" within about 40 seconds.

# oc exec ovs-bwcbp -- ovs-ofctl dump-flows br0 -O openflow13 | grep 253
 cookie=0x0, duration=1967.301s, table=253, n_packets=0, n_bytes=0, actions=note:01.ff.00.00.00.00
# oc exec ovs-bwcbp -- ovs-ofctl dump-flows br0 -O openflow13 | grep 253
 cookie=0x0, duration=2000.458s, table=253, n_packets=0, n_bytes=0, actions=note:01.ff.00.00.00.00
# oc exec ovs-bwcbp -- ovs-ofctl dump-flows br0 -O openflow13 | grep 253
 cookie=0x0, duration=5.167s, table=253, n_packets=0, n_bytes=0, actions=note:01.06.00.00.00.00
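
For context, table 253 is where openshift-sdn stores its rule-version note, and the sdn pod appears to poll it periodically: when the note no longer matches the expected value, it assumes the flows were lost and reinitializes the bridge, which would explain both the note reverting within ~40s and the sdn pod restarting twice in the attached logs. One way to watch just that table (a sketch, reusing the ovs pod from this report):
# oc exec ovs-bwcbp -- ovs-ofctl dump-flows br0 -O openflow13 "table=253"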

4. Wait more than 10 minutes; the veth is still not added back.
# oc exec ovs-bwcbp -- ovs-ofctl show br0 -O openflow13
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2): dpid:0000426876519442
n_tables:254, n_buffers:0
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC reply (OF1.3) (xid=0x3):
 1(vxlan0): addr:d2:7b:94:d4:e6:b2
     config:     0
     state:      LIVE
     speed: 0 Mbps now, 0 Mbps max
 2(tun0): addr:1a:fb:7d:4c:5c:64
     config:     0
     state:      LIVE
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br0): addr:42:68:76:51:94:42
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (OF1.3) (xid=0x5): frags=nx-match miss_send_len=0
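
To confirm the veth disappeared from the bridge rather than from the host, it may help to compare the kernel's view with OVS's (a sketch, run against the affected node; the interface name is the one from step 1):
# ip link show veth87b98a50
# oc exec ovs-bwcbp -- ovs-vsctl list-ports br0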

5. The pod's IP address is not updated and the pod is unreachable.

Actual results:
The veth is not added back to br0 after running the ovs mod-flows command.

Expected results:
The veth should be re-added and the pod should be reachable.

Additional info:
workaround: systemctl restart docker
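
Restarting docker presumably works because the kubelet then recreates the pod sandbox, which re-runs the CNI plugin and attaches a fresh veth to br0. A lighter-weight variant (an assumption, not verified in this bug) would be to delete just the affected pod so only its sandbox is rebuilt:
# oc delete pod caddy-docker -n lha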

Comment 1 Ben Bennett 2018-04-09 19:17:56 UTC
Wait... why should people be able to run mod-flows on the ovs bridge we are using?

What are you really trying to do?

Comment 3 Dan Williams 2018-04-14 21:33:53 UTC
Any chance you can get the openshift-node logs too?  The errors about connecting to the runtime socket are a bit concerning; that goes on for 13 seconds, which may well be a timeout. I don't see why that should be happening if openshift-node itself is still running.

Also, when this happens can you get a 'ps aux' on the system?  It may be that openshift-node is waiting for the SDN to do something, but the SDN is waiting for the node to do something.  I thought they were all doing that async, but maybe not.
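
For reference, those diagnostics could be collected on the affected node roughly as follows (a sketch; the service name assumes an RPM-based OCP 3.10 install):
# journalctl -u atomic-openshift-node --no-pager > node.log
# ps aux > ps.txt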

Comment 5 Hongan Li 2018-04-23 06:38:48 UTC
Re-tested with the latest build but cannot reproduce the issue. The veth is added back to OVS and works well after the SDN reinitializes.

So closing it for now.

oc v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-hongli-310-auto-master-etcd-1:8443
openshift v3.10.0-0.27.0
kubernetes v1.10.0+b81c8f8