Description of problem:
Every deployment in the horizontal scale environment shows a TeardownNetworkError in kubectl get events for the deployment pod.

Environment is on AWS: 209 nodes, including 3 master/etcd, 5 HAProxy routers, and 2 docker-registry pods, using the multitenant plugin. The test case is a script that loads the cluster with 1000 projects, each with 4 running pods, 3 running services, and some other artifacts.

Version-Release number of selected component (if applicable):
3.2.0.8

How reproducible:
Always (in this env, anyway)

Steps to Reproduce:
1. In the env described above (not sure about others)
2. Create a new project
3. Create a new app using the django-example template
4. Watch kubectl get events --all-namespaces -w as it deploys; the event below will be seen

Actual results:
mff 2016-03-29 14:00:25 -0400 EDT 2016-03-29 14:00:25 -0400 EDT 1 django-example-2-deploy Pod Warning FailedSync {kubelet ip-172-31-3-96.us-west-2.compute.internal} Error syncing pod, skipping: failed to "TeardownNetwork" for "django-example-2-deploy_mff" with TeardownNetworkError: "Failed to teardown network for pod \"041b30b7-f5d8-11e5-ad39-02243e13a1d3\" using network plugins \"redhat/openshift-ovs-multitenant\": exit status 1"

Expected results:
No errors on normal deployments.

Additional info:
openshift_master_portal_net=172.24.0.0/14
osm_cluster_network_cidr=172.20.0.0/14
osm_host_subnet_length=8
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
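The reproduction steps above can be sketched as a shell session. This is a sketch, not part of the original report: the project name is taken from the event above, and the exact oc new-app invocation is an assumption (the report only says the django-example template was used). A dry-run wrapper is included so the sketch prints the commands instead of executing them, since they require a live OpenShift 3.2 cluster.

```shell
#!/bin/sh
# Hypothetical repro sketch; names are illustrative.
# DRY_RUN=1 (the default here) only prints each command.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" != 0 ]; then
        echo "+ $*"      # dry run: show the command that would be executed
    else
        "$@"             # live run: execute against the cluster
    fi
}

run oc new-project mff                   # step 2: create a new project
run oc new-app django-example           # step 3: app from the django-example template (assumed invocation)
run kubectl get events --all-namespaces -w   # step 4: watch for the TeardownNetworkError event
```

Running it as-is prints the three commands; set DRY_RUN=0 on a test cluster to execute them for real.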
This is happening in the much smaller SVT reliability cluster (3.2.0.8) as well
Can you please run the script to gather debugging information: https://docs.openshift.com/enterprise/3.1/admin_guide/sdn_troubleshooting.html#further-help
Created attachment 1141879 [details] network debug script output
Ran the script on the master of a smaller (3-node) cluster experiencing the problem. Not much in the output; let me know what else would help.
Oops, did not notice the tgz. Attaching it now. This was a fresh 3-node install of 3.2.0.8 on AWS.
Created attachment 1141894 [details] debug tarball output
*** Bug 1322697 has been marked as a duplicate of this bug. ***
Analysis... it looks like atomic-openshift-node was restarted at some point while containers were still running. Then on restart:

- The previous OVS setup either needed to be updated, was not complete, or there is a bug in openshift-sdn startup that fails to detect when the OVS config is OK.

- The OVS bridge is removed and recreated, which obviously detaches all ports that were previously attached to the bridge:

  Mar 30 13:44:12 xxxxx.compute.internal ovs-vsctl[7405]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --if-exists del-br br0 -- add-br br0 -- set Bridge br0 fail-mode=secure protocols=OpenFlow13

- Then, when atomic-openshift-node sees that it should kill/clean up pods that were running before the restart, it attempts to tear them down:

  Mar 30 13:58:48 xxxxx.compute.internal atomic-openshift-node[7277]: I0330 13:58:48.152764 7277 kubelet.go:2245] Killing unwanted pod "router-5-deploy"

- For some reason, ovs-vsctl returns an error even though --if-exists is given, which it shouldn't. This doesn't happen when executing those commands manually:

  Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9553]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port vethd6032cc
  Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9559]: ovs|00001|vsctl|ERR|no row "vethd6032cc" in table Port

Was atomic-openshift-node restarted at some point here? Perhaps it was previously configured for the single-tenant plugin and then switched to multi-tenant while pods were still active on the node?
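The two signatures described above (the bridge being recreated, and ovs-vsctl erroring despite --if-exists) can be pulled out of a node's journal with a small filter. A sketch, using the excerpts from this comment as sample input; on a real node the input would come from something like journalctl -u atomic-openshift-node, and the file path here is illustrative:

```shell
#!/bin/sh
# Write the journal excerpts from this comment to a sample file,
# then grep for the two signatures discussed in the analysis.
cat > /tmp/node-journal.txt <<'EOF'
Mar 30 13:44:12 xxxxx.compute.internal ovs-vsctl[7405]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --if-exists del-br br0 -- add-br br0 -- set Bridge br0 fail-mode=secure protocols=OpenFlow13
Mar 30 13:58:48 xxxxx.compute.internal atomic-openshift-node[7277]: I0330 13:58:48.152764 7277 kubelet.go:2245] Killing unwanted pod "router-5-deploy"
Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9553]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port vethd6032cc
Mar 30 13:58:48 xxxxx.compute.internal ovs-vsctl[9559]: ovs|00001|vsctl|ERR|no row "vethd6032cc" in table Port
EOF

# Signature 1: the bridge teardown/recreate that detaches every attached port.
grep 'del-br br0' /tmp/node-journal.txt

# Signature 2: ovs-vsctl errors that slip through even with --if-exists.
grep 'vsctl|ERR' /tmp/node-journal.txt
```

Each grep should print exactly one matching line for this sample; on a real journal capture, any 'vsctl|ERR' hits after a node restart would point at the same problem.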
Fixed in https://github.com/openshift/openshift-sdn/pull/277
Hit this exact issue on a fresh install on our scale cluster. No previous state. 0-deployments passed.
Re: comment 8 - no node restarts that I am aware of, and definitely no reconfiguration of the network plugin; it has always been multitenant. I hit it again today on a fresh install with no node restarts. Let me know if you want additional debug.sh output from that cluster.
Disregard comment 11, I see there's an upstream fix.
Marking back to ASSIGNED; the fix is not merged into the origin repo yet.
Merged to origin now: https://github.com/openshift/origin/pull/8468
Moving it back to MODIFIED until it is built into the latest OSE.
This should be in atomic-openshift-3.2.0.16-1.git.0.738b760.el7, which has been built and is ready for QE.
Verified on openshift v3.2.0.16 with the steps in bz#1322697; the error no longer occurs. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064